Multicore Microcontrollers Drive Performance

Contributed By Digi-Key's European Editors

While the profile of multicore processing has undoubtedly been raised by its widespread adoption in personal computers, the real benefits behind this trend may not be so apparent, particularly to developers of deeply embedded applications.

In reality, many compute-intensive devices use a multi-processor architecture at some level; mobile phones have, for some time, used multicore devices, which themselves supplanted multiple discrete processors. As multicore processors based on a range of architectures have become widely available, the same consolidation has swept other application areas, such as industrial control, telecommunications, and networking. It is now becoming clear that the same dynamics that have driven the use of multicore devices are promoting the development of multicore microcontrollers for the embedded space.

The need for more processing power is not in doubt, but the ability of multicore devices to deliver it sometimes is; there is a commonly held belief that, in most applications, the benefits of multiple cores do not scale beyond four cores.
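
One common way to reason about this belief is Amdahl's law, which the article does not name but which is the usual model for such a limit. The following is a minimal sketch, assuming a workload in which a fraction p can be spread evenly across n identical cores while the remainder must run serially, giving a speed-up of 1 / ((1 - p) + p / n). The function name and the 80 percent figure used below are illustrative assumptions.

```c
#include <assert.h>

/* Estimated speed-up from Amdahl's law.
 * parallel_fraction: share of the work (0.0 to 1.0) that can run in parallel.
 * cores: number of identical cores available (must be at least 1). */
double amdahl_speedup(double parallel_fraction, unsigned cores)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / (double)cores);
}
```

With 80 percent of the work parallelizable, four cores give 1 / (0.2 + 0.2) = 2.5 times the single-core throughput, while eight cores give only about 3.3 times, which is one reason returns can appear to flatten as cores are added.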

In deeply embedded applications the benefits may be even more difficult to identify, particularly with the proliferation of low-end and low-cost 32-bit microcontrollers. However, while these devices are clearly very capable, their use is not necessarily complementary; creating a customized multi-processing platform from them could prove difficult. Integrated device manufacturers are now stepping in, creating homogeneous and heterogeneous multicore microcontrollers that have been designed as such from a low level.

Unified architectures

At a basic level, a multicore device provides the best of two worlds; most microcontroller instruction sets are very good at control tasks but not necessarily so good at data processing. Modifying the instruction set (and therefore the underlying architecture) to add data-centric instructions has become more common and has given rise to what are sometimes called unified architectures, perhaps more commonly known as Digital Signal Controllers (DSCs).

DSCs have been available for some time and represent the first step towards a full ‘multicore’ microcontroller. The applications they address typically reflect a growing need for ‘more than control’ in applications where microcontrollers are either traditionally strong or seeking to become so. Motor control is one example of this, but many DSCs are aimed at more general control-centric applications, and integrate a range of peripherals to support that.

One example is Freescale’s 56F807, which is based on Freescale’s 56800 core, combining MCU-like functionality with the processing power of a DSP. The core follows a Harvard architecture and comprises three execution units that are able to run in parallel, resulting in up to six operations per instruction cycle. It also retains a ‘C-friendly’ format, making it less necessary to program the device at the assembler level in order to achieve efficient code execution. The 56F807 features two Pulse Width Modulators, each of which delivers three complementary programmable outputs, which Freescale describes as an application-specific feature.

Microchip is also strong in the DSC product area with its dsPIC family. This also features a Harvard architecture with a modified instruction set and an instruction prefetch mechanism designed to help maintain throughput and predictable execution, aided by the majority of instructions executing in a single cycle. The DSP engine features a 17 x 17-bit multiplier, 40-bit ALU, two 40-bit saturating accumulators, and a 40-bit bidirectional barrel shifter.
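
To illustrate what a saturating accumulator contributes, the following is a minimal host-side sketch of a multiply-accumulate step that clamps at the limits of a 40-bit signed value. It is a model in portable C, not dsPIC code: the function names are hypothetical, operands are assumed to be signed 16-bit samples, and the 40-bit accumulator is held in a 64-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Limits of a 40-bit signed accumulator, held in a 64-bit host integer. */
#define ACC40_MAX ((int64_t)0x7FFFFFFFFFLL)
#define ACC40_MIN (-ACC40_MAX - 1)

/* Multiply two 16-bit samples and add the product to the accumulator,
 * clamping at the 40-bit limits instead of wrapping around. */
int64_t mac_sat40(int64_t acc, int16_t a, int16_t b)
{
    int64_t sum = acc + (int64_t)a * (int64_t)b;
    if (sum > ACC40_MAX) return ACC40_MAX;
    if (sum < ACC40_MIN) return ACC40_MIN;
    return sum;
}

/* Dot product of two sample buffers, the inner loop of an FIR filter. */
int64_t dot_sat40(const int16_t *x, const int16_t *h, unsigned n)
{
    assert(x && h);
    int64_t acc = 0;
    for (unsigned i = 0; i < n; i++)
        acc = mac_sat40(acc, x[i], h[i]);
    return acc;
}
```

Clamping rather than wrapping means that an overflow distorts a filter output only slightly instead of flipping its sign, which is why DSP engines typically offer it in hardware.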

Figure 1: Microchip’s dsPIC features a unified instruction set to combine DSP and MCU functionality in a single device.

Symmetric processing

The next evolutionary step along the multicore path for microcontrollers came with the introduction of devices that feature two instantiations of the same core, or homogeneous multicore devices. The immediate benefit of these devices is twice the processing performance for less than twice the price (or system power), due to their shared resources.

More significant is the access to symmetric multiprocessing that this affords: the ability to execute a given task in less time by running it simultaneously on two cores instead of one. Each of the identical cores represents a ‘unified’ architecture in that it is often based on a modified Harvard architecture with an instruction set specifically designed to offer both DSP and MCU functionality.

In the case of Analog Devices’ ADSP-BF561, the processing block features two Blackfin® cores. The underlying architecture, known as the Micro Signal Architecture (MSA), is described as having an orthogonal RISC-like instruction set that offers SIMD (Single Instruction Multiple Data) capabilities and multimedia features in a single instruction-set architecture. Each identical core features two 16-bit multipliers, two 40-bit accumulators, two 40-bit ALUs, four 8-bit video ALUs, and a 40-bit shifter, with independent L1 caches but a shared L2 cache and a unified memory architecture.
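
The pairing of two multipliers with two accumulators can be pictured as a dual multiply-accumulate on packed 16-bit data. The following is a minimal sketch in portable C of that idea only; it is not Blackfin code, the type and function names are hypothetical, and the accumulators are plain 64-bit integers without the 40-bit width of the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Two running sums, standing in for the pair of accumulators in each core. */
typedef struct {
    int64_t acc0;   /* sum of products of the low 16-bit halves  */
    int64_t acc1;   /* sum of products of the high 16-bit halves */
} dual_acc_t;

/* Extract the signed 16-bit halves of a packed 32-bit word
 * (assumes the usual two's complement conversion). */
static int16_t lo16(uint32_t w) { return (int16_t)(w & 0xFFFFu); }
static int16_t hi16(uint32_t w) { return (int16_t)(w >> 16); }

/* Pack two signed 16-bit samples into one 32-bit word. */
uint32_t pack16(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* One SIMD-style step: two multiplies and two accumulations
 * performed on a single pair of packed operands. */
void dual_mac(dual_acc_t *acc, uint32_t a, uint32_t b)
{
    assert(acc);
    acc->acc0 += (int32_t)lo16(a) * lo16(b);
    acc->acc1 += (int32_t)hi16(a) * hi16(b);
}
```

On a host compiler the two multiplies run one after the other; the point of the hardware arrangement is that both can complete in the same instruction.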

Figure 2: Analog Devices’ Blackfin core is at the heart of its ADSP-BF561.

Multimedia processing is a demanding application and one that is increasingly tackled at an embedded level, hence the emergence of devices like the ADSP-BF561 and, more recently, the i.MX6 family from Freescale. This family of scalable devices features a common platform that supports single-, dual-, and quad-core variants in a largely pin- and software-compatible format, based on the ARM® Cortex™-A9 core. The core(s) run at up to 1.2 GHz with ARMv7, NEON, VFPv3, and TrustZone support.

These devices are categorized by Freescale as Application Processors and are largely aimed at the automotive (infotainment) and consumer (smart device) markets. They have also been rapidly adopted by SBC (Single Board Computer) vendors as an alternative to the more established x86 architecture, thanks to their low-power and high-performance credentials and their ability to run a range of operating systems, including Linux. Their scalability from single- to quad-core suits SBC vendors well, as it allows them to offer a range of performance points using a common platform.

Figure 3: Freescale’s i.MX6 family of Application Processors scale from single- to quad-core devices.

Asymmetric processing

While raw processing power is always in demand, the need to process more, and often disparate, data is also increasing. This has given rise to a newer class of multicore microcontrollers that bring together two or more cores with specific but different functional strengths. These devices are commonly referred to as heterogeneous multicore devices and often feature two cores with very different profiles.

A conventional form of asymmetric multicore processing comes from NXP’s LPC43xx family. Described as an asymmetric DSC, it features ARM’s Cortex-M4 and Cortex-M0, which communicate using an inter-processor communication protocol. This allows the -M4 to focus on digital signal-processing tasks while the -M0 meets the control aspect of an application. The concept allows simpler tasks to be offloaded onto the smaller core, thereby maximizing the processing bandwidth of the more powerful core for compute-intensive number crunching, which is really at the heart of co-processing in general and asymmetric processing in particular.

The protocol implemented to allow the processors to communicate uses a shared SRAM as a mailbox, with one processor raising an interrupt for the other to ‘check’ the mailbox, which is acknowledged when the receiving processor raises an interrupt in response. In addition to the conventional analog and digital general-purpose I/O, the LPC43xx features a state-configurable timer sub-system, while each device produced also has a unique identifier.
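
The mailbox exchange described above can be sketched as follows. This is a host-side model under stated assumptions, not NXP's implementation: the structure layout, the function names, and the two boolean flags standing in for the inter-core interrupt lines are all hypothetical, and real hardware would also need memory barriers and the device's own interrupt registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Message area placed in SRAM that both cores can see. */
typedef struct {
    volatile uint32_t command;  /* what the sender wants done */
    volatile uint32_t payload;  /* argument for the command   */
} mailbox_t;

/* The mailbox plus stand-ins for the two interrupt lines (assumption). */
typedef struct {
    mailbox_t box;
    volatile bool irq_to_receiver;  /* "check the mailbox"      */
    volatile bool irq_to_sender;    /* "the message was read"   */
} ipc_link_t;

/* Sender: write the message, then raise the interrupt.
 * Returns false if the previous message has not been read yet. */
bool ipc_send(ipc_link_t *link, uint32_t command, uint32_t payload)
{
    assert(link);
    if (link->irq_to_receiver)
        return false;
    link->irq_to_sender = false;
    link->box.command = command;
    link->box.payload = payload;
    link->irq_to_receiver = true;
    return true;
}

/* Receiver: read the message, then acknowledge with an interrupt
 * in the other direction. Returns false if nothing is pending. */
bool ipc_receive(ipc_link_t *link, uint32_t *command, uint32_t *payload)
{
    assert(link && command && payload);
    if (!link->irq_to_receiver)
        return false;
    *command = link->box.command;
    *payload = link->box.payload;
    link->irq_to_receiver = false;
    link->irq_to_sender = true;
    return true;
}
```

Writing the message before raising the flag, and clearing the flag only after the message has been copied out, is what keeps either core from acting on a half-written mailbox in this model.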

Massively parallel

While Freescale’s i.MX6 family extends (currently) to a quad-core device, the next step is to move further along this curve towards massively parallel architectures. In the processing world this is not uncommon; many network processors feature multiple instantiations of the same core in a symmetric processing architecture. In the embedded space, however, massively parallel microcontrollers are far less common, and in the view of many engineers may still be unnecessary.

However, there is a case for their use, particularly in distributed real-time applications such as industrial automation and robotics. This is an area where sheer processing power cannot always overcome system-level limitations such as the speed of the communications backbone or the need to react in hard real-time.

In these cases it makes sense to deploy low-cost yet powerful microcontrollers at the point of application, so they are better able to interface to sensors and actuators without the delay of passing large amounts of data back to a central processor.

This describes a key application area for an architecture recently developed by start-up XMOS®, which implements large numbers of relatively simple MCUs in a massively parallel format. The xCORE multicore architecture can run multiple real-time tasks simultaneously while maintaining completely deterministic timing, a key benefit of the XMOS approach. It achieves this by combining multiple logical cores that share resources but have separate register files. Each logical core gets a guaranteed amount of processing power, controlled by the xTIME timing and synchronizing technology; it is this technology that gives xCORE its predictability and real-time responsiveness.
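
One simple way to picture a guaranteed share of processing power is plain round-robin allocation of instruction issue slots. The sketch below assumes exactly that policy for illustration; it is not a description of how xTIME or the xCORE scheduler actually works, and the function names are hypothetical.

```c
#include <assert.h>

/* Which logical core owns a given issue slot under plain round-robin. */
unsigned slot_owner(unsigned long slot, unsigned active_cores)
{
    assert(active_cores >= 1);
    return (unsigned)(slot % active_cores);
}

/* Count the issue slots one logical core receives in a window of slots. */
unsigned long slots_for_core(unsigned core, unsigned active_cores,
                             unsigned long window)
{
    unsigned long count = 0;
    for (unsigned long s = 0; s < window; s++)
        if (slot_owner(s, active_cores) == core)
            count++;
    return count;
}
```

Under this assumption, each of eight active logical cores receives exactly one slot in every eight regardless of what the other cores are executing, so the time a task takes can be calculated in advance rather than measured.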

The logical cores are arranged in tiles, each tile having up to eight logical cores. These tiles are then arranged in multiples in a given device, meaning there are devices available with 1, 2, or 4 tiles delivering 4, 6, 8, 10, 12, 16, or 32 logical processor cores. A major feature of this approach is the low latency it can achieve; XMOS states that while other MCUs report response times of tens of milliseconds, the response time of the xCORE devices is measured at a resolution of 10 ns. Another feature of the architecture is its flexibility; instead of ‘hard wired’ peripherals, the xCORE devices implement peripherals in software. These software-defined peripherals are built using a combination of I/O ports, logical processors, and SRAM, with most peripherals fitting in just one logical processor.
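
As an illustration of what implementing a peripheral in software involves, the following is a minimal sketch of the data side of a UART transmitter in portable C. It is not XMOS library code: the function names are hypothetical, the 8N1 frame format is an assumption, and the device-specific step of driving each level onto an I/O port for one bit time is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define UART_FRAME_BITS 10  /* 1 start bit + 8 data bits + 1 stop bit */

/* Build the sequence of line levels for one 8N1 UART frame:
 * start bit (0), eight data bits least significant first, stop bit (1). */
void uart_frame_bits(uint8_t byte, uint8_t levels[UART_FRAME_BITS])
{
    assert(levels);
    levels[0] = 0;
    for (int i = 0; i < 8; i++)
        levels[1 + i] = (uint8_t)((byte >> i) & 1u);
    levels[9] = 1;
}

/* Whole nanoseconds each level is held at a given baud rate. */
uint32_t uart_bit_time_ns(uint32_t baud)
{
    assert(baud > 0);
    return 1000000000u / baud;
}
```

A logical core dedicated to this task would output each level in turn and wait one bit time between them, about 8.68 µs at 115200 baud, which is where deterministic timing becomes essential.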


Multicore MCUs are already numerous, and their availability is set to increase; ARM’s ‘big.LITTLE’ technology pairs a high-performance Cortex-A class core with a smaller, more power-efficient one, which allows the application software to run on either core depending on the prevailing performance demands. This extends the battery life of portable equipment that spends much of its time in a sleep mode, without sacrificing performance when needed.
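
The idea of running software on whichever core suits the prevailing demand can be sketched as a simple selection policy. This is an illustration only, not ARM's switching mechanism: the names, the use of a load percentage as the input, and the two threshold values are all assumptions.

```c
#include <assert.h>

typedef enum { CORE_LITTLE, CORE_BIG } core_t;

/* Hypothetical thresholds; the gap between them provides hysteresis
 * so that the workload does not bounce between cores. */
#define LOAD_UP_PERCENT   80u
#define LOAD_DOWN_PERCENT 30u

/* Choose the core for the next interval from the current core and
 * the measured load (0 to 100 percent). */
core_t select_core(core_t current, unsigned load_percent)
{
    assert(load_percent <= 100u);
    if (current == CORE_LITTLE && load_percent > LOAD_UP_PERCENT)
        return CORE_BIG;
    if (current == CORE_BIG && load_percent < LOAD_DOWN_PERCENT)
        return CORE_LITTLE;
    return current;
}
```

With these values, a workload at 50 percent load stays on whichever core it is already using, and only sustained peaks or lulls trigger a move.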

Another reason multiple cores could be useful in a single device is to overcome the speed limitations of embedded Flash memory. By partitioning tasks in an asymmetric way across two small, low-cost cores, such as the Cortex-M0, the full performance of the cores can be accessed while still utilizing low-cost embedded memory. It is often the cost of implementing embedded Flash that dictates the cost of an MCU today, so by using two cores working in a complementary way, the bottleneck can be mitigated. One example of this approach came from Toshiba, which recently announced its first multicore MCU based on the Cortex-M0.

Figure 4: Combining high- and low-end processors can deliver performance with low power.

With 32-bit MCUs now becoming available at sub-$1 in volume, it is clear that demand for this level of performance already exists and it follows that, as with most application areas, more performance is better, particularly if it comes with little or no increase in system power. While ‘free’ performance may be too much to hope for, embedded cores are now physically smaller and require less power than many of the other resources normally found in an MCU. It would seem reasonable to expect more multicore MCUs to appear in the near future.
