Memory Considerations for Faster MCUs

By Tom Starnes

Contributed By Convergence Promotions LLC

Flash memory completely dominates microcontrollers (MCUs) now, but memory considerations have become more complex as the processors have advanced to 32-bit architectures and peripherals have become much more capable. It is easy to forget about memory among the exotic peripherals that make the MCU more of a system-on-a-chip (SoC) encompassing advanced motor control, graphical user interfaces (GUIs), and networking.

Still, the details of Flash and other memory types deserve attention to ensure that the selected MCU has memory that matches the needs of the system. Rapidly increasing use of the ARM® Cortex™-M processor architecture at higher speeds in MCUs also invites closer inspection of memory support. Vendors take different approaches to their on-chip memory options, which may tip the balance of which MCU to choose.

As much as a megabyte of Flash memory is available on larger MCUs today. The ratio of program store, data tables, and scratchpad RAM has changed as MCUs have found new applications and taken on new functionality. Higher-end MCUs are probably programmed in a high-level language (HLL), running a real-time operating system (RTOS), and utilizing off-the-shelf stacks and software packages. Each of these factors affects memory needs and usage.

The Flash memory typically used on MCUs has access times that let it keep up with 25 to 50 MHz processor clocks. When high-performance processor cores are clocked over 200 MHz, there could be a huge gap to fill with time- and power-consuming wait-states. Multiple on-chip buses and special routing mechanisms can be quite beneficial in alleviating the traffic problems in some busy microcontrollers.
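To put rough numbers on that gap, the following Python sketch estimates how many wait-states a single unbuffered Flash read would cost. It assumes a read must complete within a whole number of core clocks and that the Flash's fastest zero-wait clock is known. The function name `wait_states` is hypothetical, and real devices specify wait-states per voltage and frequency range in their datasheets.

```python
def wait_states(cpu_hz: int, flash_max_hz: int) -> int:
    """Extra core cycles needed per Flash read (simplified model)."""
    # Ceiling division: core clocks needed to cover one Flash access time.
    cycles_per_read = -(-cpu_hz // flash_max_hz)
    return max(cycles_per_read - 1, 0)


if __name__ == "__main__":
    for flash_mhz in (25, 50):
        ws = wait_states(200_000_000, flash_mhz * 1_000_000)
        print(f"200 MHz core, {flash_mhz} MHz Flash: {ws} wait-states")
```

Under this model, a 200 MHz core would stall for three to seven cycles on every Flash read, which is the gap that wide Flash lines, buffers, and caches are meant to hide.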

The ARM processor core architecture has very good code density due to its compact Thumb2® instruction set, in which many instructions need only 16 bits of storage rather than the expected 32 bits. The ARM Cortex-M0, Cortex-M3, and Cortex-M4 cores are popular in microcontrollers from numerous vendors, and some memory choices will be investigated here.

NXP Semiconductors LPC4000 – Real-time aids data processing

NXP Semiconductors has been successful with ARM-based microcontrollers since adapting ARM7TDMI® cores for MCU use long ago. NXP quickly added the Cortex-M cores to its portfolio once they became available, and has been one of the leaders integrating the Cortex-M3, Cortex-M0, and Cortex-M4 into MCUs with high speed at one end while pushing prices lower at the other. NXP was one of the first to go to an extra-wide Flash memory organization to buffer subsequent memory locations to ensure availability without delay.

NXP's latest MCU family, the LPC4000 (see Figure 1), is interesting for a couple of reasons, such as the inclusion of a Cortex-M4 – which has digital signal processing (DSP), single instruction multiple data (SIMD), and optional floating-point instructions – as well as a separate Cortex-M0 processor core on each. To keep the Cortex-M4 fed fast enough, and to keep power consumption low, NXP expanded the on-chip Flash to 256 bits wide, the widest in the industry.

Figure 1: NXP LPC4000 architecture (Courtesy NXP).

NXP uses a fairly straightforward buffer system to hold 32 lines of recent Flash memory accesses, giving immediate availability to recently fetched instructions. This provides more consistent execution performance than might be experienced using more exotic schemes. Some cache replacement algorithms can work against compiler-generated code and can be more difficult to simulate and debug. NXP finds execution from its Flash can run within five percent of the performance from RAM, running up to 150 MHz in the current 90 nm process.
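The effect of such a buffer on a tight loop can be illustrated with a toy model. The sketch below is not NXP's implementation. It assumes 32 buffered lines of 256 bits (32 bytes) each and a least-recently-used replacement policy, neither of which the text confirms in that detail, and the class name `FlashLineBuffer` is hypothetical.

```python
from collections import OrderedDict


class FlashLineBuffer:
    """Toy model of a buffer holding recently fetched Flash lines."""

    def __init__(self, lines: int = 32, line_bytes: int = 32):
        self.lines = lines            # number of buffered lines (32 in the text)
        self.line_bytes = line_bytes  # assumed 256-bit line = 32 bytes
        self.buffer = OrderedDict()   # line number -> True, oldest first
        self.hits = 0
        self.misses = 0

    def fetch(self, address: int) -> bool:
        """Return True if the address is served from the buffer."""
        line = address // self.line_bytes
        if line in self.buffer:
            self.buffer.move_to_end(line)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.buffer[line] = True
        if len(self.buffer) > self.lines:
            # Evict the least recently used line (assumed policy).
            self.buffer.popitem(last=False)
        return False


if __name__ == "__main__":
    buf = FlashLineBuffer()
    # A 512-byte loop of 16-bit instructions, executed 10 times.
    for _ in range(10):
        for addr in range(0, 512, 2):
            buf.fetch(addr)
    print(f"hits={buf.hits} misses={buf.misses}")
```

In this model a loop that fits within the buffered lines only pays the Flash access penalty on its first pass, which is consistent with the predictable behavior the simple scheme is credited with.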

Two banks of Flash give isolation and partitioning that also provide absolute confidence when re-Flashing one bank while the application continues to run from the other.

LPC4000s have up to 1 MB of Flash with a whopping 264 KB of SRAM on-chip – a 4:1 ratio of program to data memory. If desired, instructions can be executed directly from much of the RAM with zero wait-states – ideal for the fastest deterministic real-time processing without concern for code bouncing around at a fine-grain level. Assortments of SRAM blocks are available, so different routines and input/output (I/O) do not fight for bus time.

It is easy to use inexpensive external Flash memory with the LPC4000 for expansion program space, code that will be copied into SRAM first for fastest execution, or even large graphics images destined for display screens. Readily available Flash with a serial peripheral interface (SPI) port, including quad-SPI Flash, can actually be direct-mapped into the normal memory space of the processor, and the programmer does not have to think whether it is on-chip or connected serially off-chip. The SPI Flash interface (SPIFI) offers four lanes to external Flash, and allows images in Flash to be DMA'd directly to the LCD controller at up to 40 MBps.
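As a rough check on what 40 MBps means for graphics, the sketch below computes how long one frame would take to stream at that rate. The panel size (480 x 272 pixels at 16 bits per pixel) is a hypothetical example not taken from the article, "MBps" is assumed to mean 10^6 bytes per second, and the function name `frame_transfer_ms` is invented for illustration.

```python
def frame_transfer_ms(width: int, height: int, bytes_per_pixel: int,
                      rate_bytes_per_s: float = 40e6) -> float:
    """Time in milliseconds to stream one frame at a sustained rate."""
    frame_bytes = width * height * bytes_per_pixel
    return frame_bytes / rate_bytes_per_s * 1000.0


if __name__ == "__main__":
    # Hypothetical 480 x 272 panel at 16 bits per pixel.
    ms = frame_transfer_ms(480, 272, 2)
    print(f"{ms:.2f} ms per frame, about {1000.0 / ms:.0f} frames/s")
```

For a panel of that assumed size, the transfer takes well under 10 ms per frame, so the serial link would not be the limit on a typical display refresh.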

The Cortex-M0 has its own 8 KB program memory and passes messages to the bigger brother Cortex-M4, via shared memory.

This series of MCUs also includes 32 KB of ROM containing software drivers, boot code, and other handy bits of code, relieving the system designer from having to write this code and leaving more precious Flash for application-specific routines. The speed and power efficiency of ROM execution is naturally better than Flash as well. A library to perform dependable fixed-time DIVIDE operations is offered with some MCUs.

NXP's earlier versions of ARM Cortex MCUs may be on larger 180 nm or 140 nm process nodes, and most utilize a 128-bit wide Flash rather than the 256-bit architecture just described. All the Flash was developed by NXP specifically for MCUs, and it has built-in single-error correction/double-error detection with logging for better Flash integrity and monitoring. NXP has a broad spectrum of ARM-based MCUs that incorporate the Cortex-M0, Cortex-M3, and Cortex-M4, with the smallest fitting into just 16-pin packages and selling at prices one would expect for 8-bit MCUs.

STMicroelectronics STM32 – Quick, artful memory

STMicroelectronics is another company that soon embraced the ARM Cortex-M3 in microcontrollers with its STM32 product line after working the earlier ARM7™ and ARM9™ cores into 32-bit MCUs. STMicroelectronics' latest STM32F4 series (see Figure 2) can push the Cortex-M4 to 168 MHz in a 90 nm process while offering up to 1 MB of Flash and 192 KB of RAM on chip.

Figure 2: STMicroelectronics STM32F4 architecture (Courtesy STMicroelectronics).

To get that kind of performance, STMicroelectronics developed its adaptive real-time memory accelerator (ART Accelerator™). This is a microprocessor-system-like cache controller tailored to the needs of programs executing from Flash. Flash is organized in 128-bit lines, so a single read contains four 32-bit instructions, or six to eight real instructions when the code consists mostly of 16-bit Thumb2 instructions.
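The "six to eight" figure follows from simple arithmetic on the instruction mix. The sketch below is an illustrative calculation only: it assumes every instruction is either 16 or 32 bits long and ignores alignment effects at line boundaries, and the function name `instructions_per_line` is hypothetical.

```python
def instructions_per_line(line_bits: int, fraction_16bit: float) -> float:
    """Average instructions in one Flash line for a 16/32-bit mix."""
    if not 0.0 <= fraction_16bit <= 1.0:
        raise ValueError("fraction_16bit must be between 0 and 1")
    # Weighted average instruction length in bits.
    avg_bits = 16 * fraction_16bit + 32 * (1.0 - fraction_16bit)
    return line_bits / avg_bits


if __name__ == "__main__":
    for fraction in (0.0, 0.75, 1.0):
        n = instructions_per_line(128, fraction)
        print(f"{fraction:.0%} 16-bit instructions: {n:.1f} per 128-bit line")
```

A 128-bit line holds four instructions when all are 32 bits wide and eight when all are 16 bits wide, so a mix that is mostly 16-bit lands in the six-to-eight range.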

The ART Accelerator uses a prefetch queue and a 64-entry branch cache to mitigate delays from a change-of-flow in the instructions due to branching, subroutine calls, and possibly even system calls or interrupts. If the redirected program counter wants a recently fetched location, the target probably still resides in the branch cache, in which case it can be loaded immediately into the prefetch queue for execution, saving cycles. More intelligent (adaptive) cache management by on-chip logic should yield more positive results (a higher hit rate) than simpler methods.

To alleviate Flash stalls on data accesses, such as data lookup tables or image data, the ART Accelerator has eight 128-bit buffers. Locality-of-reference is pretty poor for data, but it can be improved by cleverly arranging data based on detailed understanding of its use in the program. This is akin to hand-coding in assembly.

STMicroelectronics is seeing Flash execution up to the 168 MHz speed within 2.5 percent of execution from zero-wait-state memory. It touts CoreMark™ benchmarks as proof of its efficiency and speed, although compiler effectiveness and settings also influence those results. First, a 168 MHz STM32F4 MCU executes the routines much quicker than any other MCU in this class and shows linearity over frequency. Second, the "CoreMark/MHz" score (effective work done per clock cycle) is one of the highest.
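A simple stall model shows what a 2.5 percent gap implies about how often the accelerator must supply an instruction without waiting. This is a back-of-the-envelope sketch, not ST's methodology: it assumes one fetch per cycle, a fixed stall on every miss, and a hypothetical figure of five wait-states at full speed. Both function names are invented for illustration.

```python
def slowdown_percent(hit_rate: float, wait_states: int) -> float:
    """Percent slowdown versus zero-wait-state memory."""
    miss_rate = 1.0 - hit_rate
    # Each miss stalls for `wait_states` extra cycles; hits add nothing.
    return miss_rate * wait_states * 100.0


def required_hit_rate(target_percent: float, wait_states: int) -> float:
    """Hit rate needed to stay within `target_percent` slowdown."""
    return 1.0 - target_percent / (100.0 * wait_states)


if __name__ == "__main__":
    ws = 5  # assumed wait-states at full speed (hypothetical)
    rate = required_hit_rate(2.5, ws)
    print(f"needed hit rate: {rate:.3f}")
    print(f"slowdown at 95% hits: {slowdown_percent(0.95, ws):.1f}%")
```

Under these assumptions, the accelerator would need to satisfy all but about one fetch in 200 to stay within the quoted margin, which suggests why both a prefetch queue and a branch cache are used.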

A real-time clock module on the STM32F4 includes a 4 KB battery-backed SRAM for holding variable and state information during extremely low power conditions. More distinctively, 528 bytes of one-time programmable ROM is available for serial numbers, MAC addresses, cryptography keys, calibration settings, and storage of other data unique to each device shipped.

STMicroelectronics also utilizes a 7-level ARM Advanced High-performance Bus (AHB) matrix that allows simultaneous data transfers between masters like the ARM processor, general-purpose DMAs, DMAs associated with USB or network controllers, and slaves like the multitude of peripherals and memories.
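The benefit of a bus matrix is that transfers only collide when two masters want the same slave in the same cycle. The sketch below is a toy arbiter for a single cycle, assuming a fixed priority given by list order. Real matrices typically use more elaborate arbitration, and the master and slave names in the example are hypothetical.

```python
def grant_transfers(requests):
    """Grant at most one master per slave in a cycle; others must wait.

    `requests` is a list of (master, slave) pairs in priority order.
    Returns a (granted, stalled) pair of lists.
    """
    busy_slaves = set()
    granted, stalled = [], []
    for master, slave in requests:
        if slave in busy_slaves:
            stalled.append((master, slave))  # slave already claimed this cycle
        else:
            busy_slaves.add(slave)
            granted.append((master, slave))
    return granted, stalled


if __name__ == "__main__":
    requests = [
        ("cpu", "flash"),
        ("dma1", "sram1"),
        ("usb_dma", "sram2"),
        ("eth_dma", "sram1"),  # conflicts with dma1
    ]
    granted, stalled = grant_transfers(requests)
    print("granted:", granted)
    print("stalled:", stalled)
```

With separate SRAM blocks, three of the four transfers proceed in parallel. On a single shared bus, only one could proceed per cycle.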

STMicroelectronics has numerous MCU configurations of ARM Cortex-M0 and the original Cortex-M3 ranging from lower cost, lightly loaded controllers to fast clocked devices with sophisticated peripherals. It also has a low-power line. STMicroelectronics claims a 45 percent market share in cumulative units shipped of Cortex-M-based MCUs, so many of these products are already in use.

Freescale Semiconductor Kinetis – Flexible memory

Freescale Semiconductor's primary microcontrollers based on ARM processors took a while to get started, although it has sold 32-bit MCUs based on the Power Architecture™ and its proprietary ColdFire® architecture for decades. Jumping quickly on the ARM Cortex-M4 core with its enhanced capabilities, Freescale filled out its new Kinetis™ product families fairly well (see Figure 3).

Figure 3: Freescale Kinetis architecture (Courtesy Freescale).

Ranging from the smallish K10 to today's full-bore K70, on-chip Flash is available from 32 KB to 1 MB, organized from 32 bits to 128 bits wide depending on the chip. Manufactured on a 90 nm process node, the Flash responds in around 30 ns depending on voltage, but the Kinetis MCUs run up to 100 MHz with promises of double the speed. Freescale's thin film storage (TFS) Flash can read, erase, and write at voltages down to 1.71 volts, which is nice because it is within the limits of two almost-spent 1.5 volt AA batteries (which degrade rapidly once they hit 0.9 volts each).
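The two figures in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes a Flash read must fit within one core clock to avoid wait-states and treats the battery pack as two cells in series. The function names `zero_wait_max_mhz` and `pack_voltage` are hypothetical.

```python
def zero_wait_max_mhz(access_ns: float) -> float:
    """Fastest core clock (MHz) at which one read fits in one cycle."""
    return 1000.0 / access_ns


def pack_voltage(cells: int, volts_per_cell: float) -> float:
    """Voltage of identical cells connected in series."""
    return cells * volts_per_cell


if __name__ == "__main__":
    print(f"zero-wait limit for 30 ns Flash: {zero_wait_max_mhz(30.0):.1f} MHz")
    nearly_spent = pack_voltage(2, 0.9)
    print(f"two cells at 0.9 V: {nearly_spent:.2f} V, "
          f"Flash usable: {nearly_spent >= 1.71}")
```

A 30 ns Flash on its own keeps pace only to roughly 33 MHz, so caching is needed at 100 MHz, while two cells at 0.9 volts each still supply 1.8 volts, just above the 1.71 volt floor.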

Kinetis MCUs have their own instruction and data caches to help overcome Flash read delays, and they address off-chip memory as well. This is effective enough that Kinetis MCUs look as efficient as the others up to Kinetis' rated speed. A memory protection unit helps the operating system keep one task's program from getting into another task's memory space.

The primary Flash is supplemented with something Freescale calls FlexMemory, a special variety of Flash that can also operate as E2PROM. The programmer decides how much to use as program Flash with the balance being used as E2 – up to 16 KB. The portion that operates as E2 automatically engages special logic that performs wear-leveling and writing algorithms to get one million endurance cycles, and possibly up to 10 million as more FlexMemory is dedicated to backing the E2.
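Why endurance rises as more Flash is dedicated can be seen from an idealized wear-leveling model. The sketch below is a generic illustration, not Freescale's algorithm: it assumes writes are spread perfectly evenly across the backing Flash and ignores record overhead. The raw endurance and the sizes used in the example are hypothetical numbers chosen only to show the scaling.

```python
def effective_endurance(raw_cycles: int, backup_bytes: int,
                        eeprom_bytes: int) -> int:
    """Writes per emulated location under ideal wear-leveling.

    Assumes writes rotate evenly over `backup_bytes` of Flash that
    back `eeprom_bytes` of emulated E2PROM (an idealized model).
    """
    if eeprom_bytes <= 0 or backup_bytes < eeprom_bytes:
        raise ValueError("backing Flash must be at least the E2PROM size")
    # Each emulated byte can rotate through this many physical copies.
    copies = backup_bytes // eeprom_bytes
    return raw_cycles * copies


if __name__ == "__main__":
    raw = 10_000           # hypothetical raw Flash endurance in cycles
    backing = 32 * 1024    # hypothetical backing Flash size in bytes
    for eeprom in (256, 32):
        cycles = effective_endurance(raw, backing, eeprom)
        print(f"{eeprom} bytes of E2PROM: {cycles:,} effective cycles")
```

The point of the model is the ratio: the more backing Flash there is per byte of emulated E2PROM, the more physical locations share the wear and the longer each logical location lasts.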

As is the case with some other vendors, Freescale utilizes a crossbar switch to let the main Flash, the FlexMemory, the SRAM, and various peripherals be accessed simultaneously by bus masters in order to keep data moving optimally.

Texas Instruments Stellaris – Firmware included

The Stellaris® microcontrollers were the first products to use the new ARM Cortex-M3 architecture when they were developed by lead partner Luminary Micro, now owned by Texas Instruments. Stellaris has a rich collection of MCUs serving applications from motor control to networking and user interfaces.

The Texas Instruments MCUs run at modest 80 MHz speeds, have up to 512 KB of error-checking Flash memory, up to 96 KB of data RAM, and some have their own 2 KB of traditional E2PROM on-chip. Stellaris' Flash memory can perform single-cycle reads up to 50 MHz, above which the effect of a prefetch buffer minimizes delays by fetching 64 bits per read and engaging speculative branching.

While ROM seems to have disappeared on most MCUs these days, many Stellaris LM3S and Cortex-M4-based LM4F MCUs (see Figure 4) make a special use of compact ROM to store some fundamental and often-accessed code that is likely to be used by all applications. These drivers and routines are called StellarisWare® and consist of peripheral driver libraries, boot loaders and vector tables, the pre-emptive real-time scheduler SafeRTOS™, cyclic redundancy check (CRC) error detection operations, and cryptography tables used for the Advanced Encryption Standard (AES) functions. Putting these useful functions and data in fast, cheap ROM (where appropriate) frees up a significant quantity of Flash that is better used for custom code that enhances the end equipment.

Figure 4: Texas Instruments Stellaris LM4F architecture (Courtesy Texas Instruments).

Remember your application – Memory may make it zing

The needs of every application are different and there are many factors to consider in choosing a microcontroller. A number of Flash, SRAM, ROM, and specialty memory features pertinent to higher-end MCUs from a variety of vendors have been reviewed here. While no one part might have precisely the ideal features to accommodate your application, many of the memory options should now be clearer.

Disclaimer: The opinions, beliefs, and viewpoints expressed by the various authors and/or forum participants on this website do not necessarily reflect the opinions, beliefs, and viewpoints of Digi-Key Electronics or official policies of Digi-Key Electronics.