Flash Memory Architecture Gates System-level MCU Performance

By Maury Wright

Contributed By Hearst Electronic Products


Integrated memory has been part of the microcontroller (MCU) landscape since the inception of the first system-on-chip products. Today, however, memory architecture – especially Flash – plays an increasingly important role in how well an MCU performs in an application. Design teams need to look beyond basic memory-size parameters and instead evaluate how to map an application into the memory space to meet project performance requirements. MCU clock rates have escalated considerably, but that only equates to better system performance if designers can efficiently feed instructions and data to the processor core.

Historically, MCUs have had relatively slow clock speeds – especially compared with general-purpose microprocessors. As a result, memory access time and bandwidth haven't been a major concern with MCUs. While microprocessor vendors developed multi-level cache schemes to efficiently feed the core, MCU speeds remained roughly in sync with the performance characteristics of the integrated memory. Today, however, even low-cost 32-bit MCUs are hitting clock rates in the 100-MHz-or-faster range, and such speeds demand that the design team take a closer look at the impact of memory on performance.

ARM Cortex-M3-based MCUs

The trend toward companies basing 32-bit MCUs on the ARM Cortex-M3 core provides a good landscape to initiate the Flash-memory discussion. ARM supplies the core, but each MCU vendor must surround that core with its own peripheral set and memory. The Flash implementation and performance are a result of the Flash IP each vendor owns and the fab capability of either the MCU vendor or its manufacturing partner.

Cortex-M3 licensees include Texas Instruments (TI), NXP Semiconductors, Freescale, and STMicroelectronics, among others. Each manufacturer has developed a unique approach to the Flash issue, and most have migrated to relatively advanced fab processes using 90-nm lithography.

Even at 90 nm, most Flash implementations don't offer the access time and bandwidth that match the appetite of a 100-MHz or faster processor. While Flash access time is improving constantly, many MCU implementations only support clock speeds in the 25-MHz range with zero wait states. If the Flash runs significantly slower than the processor, the wait states can severely hamper performance.
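To put rough numbers on that penalty, the effective fetch rate can be approximated as the core clock divided by one plus the number of wait states. The short C sketch below works through that arithmetic; the 40-ns access time is an illustrative assumption, not a figure from any particular datasheet.

    #include <stdio.h>

    /* Rough, hypothetical illustration of how Flash wait states throttle
     * instruction fetches. Effective fetch rate = f_core / (1 + wait states),
     * assuming every fetch goes to Flash with no prefetch or cache. */
    static double effective_fetch_mhz(double core_mhz, double flash_access_ns)
    {
        double core_period_ns = 1000.0 / core_mhz;
        /* Wait states needed so the access fits in a whole number of cycles. */
        unsigned wait_states = 0;
        while ((wait_states + 1) * core_period_ns < flash_access_ns)
            wait_states++;
        return core_mhz / (wait_states + 1);
    }

    int main(void)
    {
        /* A 100-MHz core with 40-ns Flash needs 3 wait states: ~25 MHz fetch rate. */
        printf("%.1f MHz\n", effective_fetch_mhz(100.0, 40.0));
        return 0;
    }

In other words, a core clocked four times faster than its Flash spends three of every four cycles waiting unless something hides that latency.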

To combat this problem, MCU vendors are employing other novel techniques to feed the processor. For example, the NXP Semiconductors LPC1800 MCU (Figure 1), based on the Cortex-M3 core, is offered at clock speeds as fast as 150 MHz. NXP utilizes dual Flash banks to mitigate the performance penalty associated with wait states. The LPC1800 family includes MCUs with as much as 1 Mbyte of Flash, many utilizing the dual-bank strategy.

NXP's LPC1800

Figure 1: The LPC1800 family of MCUs from NXP Semiconductors includes models with dual 256-bit wide Flash banks to enable no-wait-state operation. (Source: NXP Semiconductors).

Each of the dual banks is 256 bits wide. A read operation to one of the banks can fetch as many as eight instruction words, and sequential read operations ping-pong between the two banks, effectively doubling the memory's bandwidth (or, equivalently, halving its average access latency).
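The arithmetic behind the wide, interleaved banks is straightforward, as the sketch below illustrates. The address-to-bank mapping shown is an assumption for illustration only, not NXP's documented interleave scheme.

    #include <stdio.h>

    /* Illustrative sketch of bank interleaving. With two 256-bit (32-byte)
     * banks, consecutive code lines alternate between banks, so one bank's
     * access can overlap consumption of the other bank's line. */
    int main(void)
    {
        const unsigned line_bytes     = 256 / 8;   /* 32-byte Flash line       */
        const unsigned words_per_line = 256 / 32;  /* 8 x 32-bit instruction words */

        for (unsigned addr = 0; addr < 4 * line_bytes; addr += line_bytes) {
            unsigned bank = (addr / line_bytes) % 2;   /* assumed mapping */
            printf("line at 0x%02X -> bank %u (%u instruction words)\n",
                   addr, bank, words_per_line);
        }
        return 0;
    }

For straight-line code, the processor consumes one bank's eight words while the other bank's read is already in flight, which is how the part sustains no-wait-state fetches at clock rates well above the raw Flash speed.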

The downside to this approach is cost in terms of silicon real estate and manufacturing complexity, in addition to an increase in power consumption. Flash is typically one of the major power sinks on an MCU. However, because NXP offers the LPC1800 in versions with single and dual Flash banks, design teams can decide whether the performance benefit is needed for the project at hand and make a cost/benefit evaluation in the selection process.

Memory accelerators cache instructions

Other MCU vendors take different approaches to the Flash performance problem. The STMicroelectronics STM32F2 MCU family, also based on the Cortex-M3, offers a 128-bit wide, single Flash bank. To complement the MCU, the company developed an Adaptive Real-Time (ART) memory accelerator to cache instructions and data and to speed Flash access (Figure 2).

STMicro's ART Memory Accelerator

Figure 2: STMicroelectronics uses the Adaptive Real-Time (ART) memory accelerator to cache instructions and data, and to speed Flash access time. (Source: STMicroelectronics).

STMicroelectronics offers the STM32F2 family at speeds as fast as 120 MHz. Because a single Flash read can fetch four instruction words, the Flash access speed could support zero-wait-state operation – assuming linear access to code. Branches, of course, make up a large portion of typical programs. Nonetheless, STMicroelectronics asserts that the ART accelerator enables zero-wait-state operation once the target of a branch has been stored in the ART registers.
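In practice, firmware has to enable these features and set the Flash latency before raising the core clock. The fragment below is a minimal sketch assuming ST's CMSIS device header (stm32f2xx.h) and a 120-MHz core running from a 3.3-V supply; the register and bit names follow the CMSIS definitions, and the three-wait-state setting should be checked against the reference manual for the actual supply voltage.

    #include "stm32f2xx.h"   /* ST's CMSIS device header (assumed available) */

    /* Minimal sketch: enable prefetch and the ART caches and set the Flash
     * latency before switching the core clock to 120 MHz. Three wait states
     * is the value ST documents for 120 MHz in the 2.7 V to 3.6 V range. */
    static void flash_art_enable(void)
    {
        FLASH->ACR = FLASH_ACR_PRFTEN         /* prefetch on 128-bit lines */
                   | FLASH_ACR_ICEN           /* instruction cache         */
                   | FLASH_ACR_DCEN           /* data cache                */
                   | FLASH_ACR_LATENCY_3WS;   /* wait states for 120 MHz   */
    }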

The challenge with caches is that, inevitably, programs suffer wait states from cache misses. One way design teams can mitigate the cache-miss problem is to hand-code key portions of an application and make sure those key routines fit in the cache. Unfortunately, this can also add considerable time to any programming task.
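A related mitigation, independent of any one vendor's cache scheme, is to link the most time-critical routines into on-chip SRAM, which typically runs at zero wait states. The fragment below is a hypothetical GCC-style sketch; it assumes a linker script that defines a .ramfunc section and startup code that copies that section from Flash to RAM before main() runs.

    #include <stdint.h>

    /* Hypothetical example: place a hot routine in zero-wait-state SRAM so
     * its instruction fetches never touch Flash. The ".ramfunc" section name
     * and the startup copy loop are toolchain/linker-script assumptions. */
    __attribute__((section(".ramfunc")))
    void fir_filter_step(const int16_t *coeff, const int16_t *samples,
                         int32_t *result, unsigned taps)
    {
        int32_t sum = 0;
        for (unsigned i = 0; i < taps; i++)
            sum += (int32_t)coeff[i] * (int32_t)samples[i];
        *result = sum;
    }

The trade-off is that SRAM used for code is no longer available for data, so the technique is best reserved for a handful of genuinely critical loops.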

STMicroelectronics points out that the cache approach makes more efficient use of silicon real estate than larger Flash arrays or multiple banks. It's important that the embedded-system designer evaluate the cost requirements and potential performance shortfalls relative to the project at hand.

Advanced fab processes speed access

Another approach to the Flash performance limitation is to improve the cell design and fabrication process. Both NXP Semiconductors and STMicroelectronics have moved to a 90-nm process and, in doing so, improved Flash access time. Unfortunately, Flash performance still trails the escalation in processor clock rates.

Renesas, on the other hand, has developed a Flash architecture that offers 10-ns read access. Since a 100-MHz clock corresponds to a 10-ns cycle time, a 100-MHz processor can access code or data with no wait states.

Renesas is not a Cortex-M3 licensee, but it offers a broad range of proprietary cores with a mixed heritage originating from the merger of Mitsubishi Semiconductor and Hitachi Semiconductor and, more recently, NEC Electronics.

The company is focusing on the relatively new RX600 family of MCUs in the mainstream 32-bit market. To date, the company has introduced 100-MHz products with up to 1 Mbyte of Flash, although it has indicated that it will offer as much as 4 Mbytes in the future.

While it might seem that a faster Flash process would add to the fabrication cost and result in more expensive MCUs, that does not appear to be the case with the RX family. The family is relatively new and just beginning to win sockets in high-volume projects. Design teams should evaluate the costs associated with every MCU considered for a project, but the primary considerations when moving to a faster MCU are the system-level performance and Flash-memory requirements of the application.


Disclaimer: The opinions, beliefs, and viewpoints expressed by the various authors and/or forum participants on this website do not necessarily reflect the opinions, beliefs, and viewpoints of Digi-Key Corporation or official policies of Digi-Key Corporation.