Using Your MCU’s Memory Architecture to Boost Application Efficiency

By Warren Miller

Contributed By Hearst Electronic Products


Modern MCUs have a variety of memory elements, and understanding their organization, performance limitations, and power implications can be critical to implementing an application efficiently. In particular, the characteristics of on-chip Flash memory used for code storage, the organization of on-chip SRAM used for data storage, and the access characteristics of off-chip memory will have a dramatic impact on overall processing efficiency. Let’s examine each of these key memory elements to better understand how to use them to maximize performance, reduce power dissipation, and optimize system cost.

On-chip Flash memory

On-chip Flash memory is perhaps the most critical memory element in any application since it is most often the source for all the instructions for the processor. If instructions are not fetched efficiently, your overall MCU performance will suffer. There are two different approaches to delivering instructions to the CPU. In one approach, the memory operates as fast as needed to match the instruction cycle of the CPU. The Renesas RX600 group, for example, uses an advanced Flash technology that provides high-performance zero wait-state access to instruction memory. This approach results in a simplified CPU architecture and deterministic timing.

Access to Flash memory typically uses a two-port approach: CPU access via a high-speed bus for read operations, and slower access via a Flash memory controller for write operations. The Flash memory interface for the RX600 is illustrated in Figure 1. Note that the Flash memory is further segmented into a data Flash section, for storing non-volatile information that is frequently modified, and the instruction section, which is typically treated as Read-Only Memory (ROM), even though it uses Flash technology and can be reprogrammed many times by the user during manufacturing or via system updates. The Flash Control Unit (FCU) is a standalone specialized processor that manages Flash writes and has its own RAM and firmware memory blocks. The CPU can initiate FCU operations using the peripheral bus shown at the top of Figure 1.

Figure 1: Flash memory interface for the Renesas RX600 MCU (Courtesy of Renesas).
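
To make the write path concrete, the sketch below shows the general shape of a CPU-initiated Flash write through a Flash controller like the FCU. All register names and addresses here are hypothetical placeholders, not the actual RX600 register map; consult the hardware manual for the real command sequence.

```c
#include <stdint.h>

/* Hypothetical memory-mapped Flash controller registers. */
#define FLASH_ADDR    (*(volatile uint32_t *)0x40000000u)  /* target address  */
#define FLASH_DATA    (*(volatile uint32_t *)0x40000004u)  /* word to program */
#define FLASH_CMD     (*(volatile uint32_t *)0x40000008u)  /* command port    */
#define FLASH_STATUS  (*(volatile uint32_t *)0x4000000Cu)  /* busy/error bits */
#define CMD_PROGRAM   0x01u
#define STATUS_BUSY   (1u << 0)

/* Program one word of data Flash via the slow peripheral-bus write
 * path; instruction reads continue over the high-speed bus. */
static void flash_write_word(uint32_t address, uint32_t data)
{
    FLASH_ADDR = address;
    FLASH_DATA = data;
    FLASH_CMD  = CMD_PROGRAM;   /* hand the operation to the controller */

    while (FLASH_STATUS & STATUS_BUSY) {
        /* Flash writes are slow; real firmware would sleep or do
         * other work here rather than busy-wait. */
    }
}
```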

An alternative architecture uses an instruction memory that is slower than the CPU clock and may require the insertion of wait states. This can significantly reduce processing performance, so often a cache memory is inserted between the CPU and the slower instruction and data memory blocks. The cache stores recent memory accesses and if the same instruction or data element is needed again, it is available without having to access the slower main memory blocks. The organization of the data and instruction cache memories for the Atmel SAM9G MCU is illustrated in Figure 2. The 16 KB memories provide fast local storage, reducing the number of times the CPU needs to access the large Flash ROM or SRAM blocks via the multi-layer AHB matrix. Note that the ability to use local cache memory also reduces bus matrix traffic so that DMA or peripheral accesses will have additional bus bandwidth available.

Figure 2: Cache memory interface for Atmel’s SAM9G MCU (Courtesy of Atmel).

If the cache is used efficiently, an entire “inner loop” can fit within it, and this can result in virtually zero wait-state performance for the most critical portions of the application. Note that execution timing can be more difficult to estimate in this approach, since cache “misses” result in unexpected processing slow-downs. Additionally, if a small inner loop is not available, or data is organized in such a way that the “locality” that cache algorithms count on is violated, processing may become very inefficient. In general, however, cache algorithms have proven to deliver efficiency improvements due to the locality characteristics of most algorithms.
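
Locality is largely under the programmer’s control. The generic example below (not specific to any of the MCUs discussed here) sums the same 2-D array two ways: the row-major version touches consecutive addresses, so each cache line fetched from slow memory is fully used, while the column-major version strides across lines and may miss on nearly every access.

```c
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* Cache-friendly: the inner loop walks consecutive addresses,
 * so every word of each fetched cache line gets used. */
long sum_row_major(const int a[ROWS][COLS])
{
    long sum = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            sum += a[r][c];
    return sum;
}

/* Cache-hostile: the inner loop strides COLS * sizeof(int) bytes,
 * so each access may fetch a line it uses only once. */
long sum_column_major(const int a[ROWS][COLS])
{
    long sum = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            sum += a[r][c];
    return sum;
}
```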

More complex cache architectures

With high-bandwidth, compute-intensive MCUs, like the Texas Instruments DSP-oriented TMS320DM814x video processor, the cache memory system can have additional levels of complexity. The processor-to-memory interface for the TMS320DM814x (Figure 3) has three different levels of memory hierarchy. Closest to the processor are two Level 1 (L1) cache memories, one for instructions and one for data. When required data is not in the L1 cache, a request is made to the Level 2 (L2) memory. L2 memory is multi-ported and has multiple banks to further organize data. Bandwidth management is used for each cache controller to manage the priority of memory accesses and keep data flowing smoothly to and from the processor. Up to nine levels of priority are available, and if a low-priority access is blocked for too long (more than Max_Wait cycles), it can be allowed to take priority.
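
The Max_Wait mechanism is essentially priority arbitration with an aging escape hatch. The sketch below models the idea in software; it is an illustration of the concept only, not TI’s actual arbiter logic, and the constants are arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_REQUESTERS 4
#define MAX_WAIT       16   /* cycles a loser may be blocked */

typedef struct {
    bool     requesting;
    uint8_t  priority;   /* 0 = highest of the nine levels */
    uint16_t waited;     /* cycles spent blocked so far    */
} requester_t;

/* Pick this cycle's winner: normally the highest-priority requester,
 * but anyone blocked longer than MAX_WAIT is escalated so that
 * low-priority traffic cannot starve. Returns -1 if idle. */
int arbitrate(requester_t req[NUM_REQUESTERS])
{
    int winner = -1;
    for (int i = 0; i < NUM_REQUESTERS; i++) {
        if (!req[i].requesting)
            continue;
        if (winner < 0) { winner = i; continue; }
        bool i_esc = req[i].waited > MAX_WAIT;
        bool w_esc = req[winner].waited > MAX_WAIT;
        if ((i_esc && !w_esc) ||
            (i_esc == w_esc && req[i].priority < req[winner].priority))
            winner = i;
    }
    for (int i = 0; i < NUM_REQUESTERS; i++)
        if (req[i].requesting && i != winner)
            req[i].waited++;
    if (winner >= 0)
        req[winner].waited = 0;
    return winner;
}
```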

This multi-level memory architecture is not uncommon when very high bandwidth is required, and the inclusion of priority levels and other high-level management functions can be critical to easing the burden of optimizing bandwidth. Try to identify MCUs that include efficient caching, intelligent bandwidth management functions and multiple memory ports to automatically optimize your memory bandwidth.

Figure 3: Texas Instruments’ TMS320DM814x DSP memory interface architecture (Courtesy of Texas Instruments).

On-chip SRAM

The organization of on-chip SRAM needs to be understood in order to organize the data elements in your application for the best efficiency. In many cases, the MCU organizes SRAM into separate blocks that can be accessed independently by bus masters to overlap accesses and improve data transfer efficiency. The NXP Semiconductors LPC15xx MCU separates SRAM into three different blocks, and each is available to the processor, USB, or DMA masters via the multilevel AHB matrix, as illustrated in the top of Figure 4. The bottom of the illustration shows the characteristics of the SRAM blocks, such as size, address range, and whether the block can be disabled to save power, for each of the LPC15xx family members.

Allocating SRAM blocks with different sizes is not uncommon and can help partition your design in the most efficient manner, either from a processing standpoint or from a power standpoint. Let’s look in more detail at how intelligently matching your algorithm requirements to SRAM block organization can improve both processing and power efficiency.

Figure 4: NXP LPC15xx SRAM connections via AHB matrix to bus masters and SRAM block characteristics (Courtesy of NXP).

Improving processing efficiency

One of the most common efficiency improvements in MCU-based designs is the use of the DMA capability to offload simple data transfer functions from the CPU. If the CPU can be put into a sleep mode or process in parallel with the data transfer, overall efficiency is improved. The existence of multiple SRAM blocks can be an important element in supporting conflict-free parallel operations. Furthermore, advanced MCUs that also feature multi-level bus interfaces, like the NXP LPC15xx, can provide prioritized access to shared resources to automatically improve processing efficiency.
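
The basic offload pattern is to start the transfer and then put the CPU to sleep until the DMA-complete interrupt arrives. Below is a minimal sketch for an ARM Cortex-M class part; dma_start_transfer() and DMA_IRQHandler are hypothetical stand-ins for your MCU’s DMA driver and vector name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical driver call that configures and starts a DMA channel. */
extern void dma_start_transfer(const void *src, void *dst, uint32_t len);

static volatile bool dma_done;

void DMA_IRQHandler(void)   /* hypothetical vector name */
{
    /* Real code would also clear the channel's interrupt flag here. */
    dma_done = true;
}

void copy_while_sleeping(const void *src, void *dst, uint32_t len)
{
    dma_done = false;
    dma_start_transfer(src, dst, len);
    while (!dma_done) {
        __asm volatile ("wfi");  /* sleep until an interrupt (CMSIS: __WFI()) */
    }
}
```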

As an example, if an algorithm must receive data over the USB interface, process the data, store the data, and, when enough data is available, send the result over another interface, the location of the various data buffers can be critical to overall performance. It might be best to separate input and output buffers into different SRAM blocks so master requests from the CPU, DMA, and the USB port do not all try to access the same block at the same time. Establishing the correct priority settings for master accesses will help eliminate algorithm stalls. Making sure received data is captured with a higher priority than data processing could be critical in eliminating data reception errors and lengthy retry cycles. Understanding the data flow requirements of your algorithm is a key requirement for efficient memory block utilization.
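
With GCC-based toolchains, one common way to pin each buffer to a particular SRAM block is to place it in a named linker section that the linker script maps to that block. The section names .sram1 and .sram2 below are examples and must match entries in your own linker script.

```c
#include <stdint.h>

/* USB receive buffer in one SRAM block... */
static uint8_t usb_rx_buf[512] __attribute__((section(".sram1")));

/* ...and the processed-output buffer in a different block, so the
 * CPU, DMA, and USB masters are less likely to contend for the
 * same SRAM port in the same cycle. */
static uint8_t out_buf[512] __attribute__((section(".sram2")));
```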

As shown in the lower part of Figure 4 above, some of the NXP LPC15xx SRAM blocks can be enabled or disabled to reduce power. Organizing data to take advantage of this can be useful in hitting aggressive power targets. For example, many algorithms use data buffers to store large blocks of data during CPU calculations. Once the calculation is completed, that data need not be saved, and the associated memory block can be disabled to save power. If the SRAM block needs some extra time to “wake up” prior to being used, a smaller buffer in an always-enabled SRAM block can store data until the newly enabled block is ready. In some cases, detailed calculations will need to be done to determine the amount of power savings, if any, these power management techniques can generate; but having multiple SRAM blocks with power-saving options will usually provide increased power efficiency.
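
The staging technique might look like the sketch below. The SRAM_PWR register and its bit assignment are hypothetical placeholders; the real enable/disable bits live in the MCU’s clock and power control registers, and disabling a block typically loses its contents.

```c
#include <stdint.h>
#include <string.h>

#define SRAM_PWR      (*(volatile uint32_t *)0x40074000u)  /* hypothetical */
#define SRAM2_ENABLE  (1u << 2)                            /* hypothetical */

static uint8_t staging[64];          /* small buffer in always-on SRAM   */
extern uint8_t big_workspace[4096];  /* linked into the switchable block */

void workspace_power_down(void)
{
    SRAM_PWR &= ~SRAM2_ENABLE;       /* contents are lost once disabled */
}

void workspace_power_up(const uint8_t *incoming, uint32_t len)
{
    /* Capture early data in the always-on staging buffer while the
     * big block wakes up... */
    if (len > sizeof staging)
        len = sizeof staging;
    memcpy(staging, incoming, len);

    SRAM_PWR |= SRAM2_ENABLE;
    /* ...honor any wake-up delay the datasheet requires... */
    memcpy(big_workspace, staging, len);  /* ...then move it across */
}
```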

External memory interface

Accessing off-chip memory resources can add significant latency, so finding opportunities to buffer data on-chip and pre-fetch data from off-chip can significantly improve overall bandwidth. Matching on-chip memory buffers to the proper on-chip SRAM block is an important consideration and can be considered an extension of the techniques previously described. Often, however, external memory interfaces combine multiple types of memory, and understanding how to avoid conflicts when accessing them can be just as important. For example, the external memory interface on the Atmel SAM9G, shown in Figure 5, supports a combined DDR, LPDDR, and SDRAM controller, a static memory controller, and a NAND Flash controller. Dual slave interfaces connect to the multi-level bus matrix so that transfers can be overlapped when initiated by different bus masters. Note that the static memory controller and the NAND controller share a common slave port, so it may be less efficient to overlap NAND and static memory accesses than to overlap DDR2 and NAND accesses, which arrive through separate ports. Just as much care should be taken in allocating data to external memory blocks as to internal memory blocks to avoid affecting efficiency.

Figure 5: External memory interface on Atmel’s SAM9G MCU (Courtesy of Atmel).
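
One widely used way to hide off-chip latency is double-buffered (ping-pong) prefetching: while the CPU processes one on-chip buffer, the DMA fills the other from external memory. In the sketch below, dma_start_read() and dma_wait() are hypothetical stand-ins for your MCU’s DMA driver, and total is assumed to be a multiple of the chunk size.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK 256

extern void dma_start_read(void *dst, const void *ext_src, size_t len);
extern void dma_wait(void);                    /* block until DMA finishes */
extern void process(const uint8_t *buf, size_t len);

void stream_from_external(const uint8_t *ext, size_t total)
{
    static uint8_t buf[2][CHUNK];   /* on-chip ping-pong buffers */
    int cur = 0;

    dma_start_read(buf[cur], ext, CHUNK);      /* prime the first chunk */
    for (size_t off = 0; off < total; off += CHUNK) {
        dma_wait();                            /* chunk at 'off' is ready */
        if (off + CHUNK < total)               /* prefetch the next chunk */
            dma_start_read(buf[cur ^ 1], ext + off + CHUNK, CHUNK);
        process(buf[cur], CHUNK);              /* overlaps with the DMA */
        cur ^= 1;
    }
}
```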

Many memory interface subsystems also provide cache or local memory buffers to reduce access latency. Some advanced DDR controllers can also automatically prioritize accesses and combine operations to take advantage of the block nature of DDR memory architectures. If external memory traffic is an important part of your algorithm, it will be important to examine the details of the memory controller features included on your MCU to better estimate the type of transfer efficiency you can expect.

Summary

On-chip Flash, on-chip SRAM, and external memory interfaces are critical components when implementing any MCU-based application. Understanding the key memory characteristics described in this article (latency, wait states, caching algorithms, and block organization) will help you better design and implement your application to achieve challenging performance, power, and cost requirements.

For more information on the MCUs and the memory features discussed here, use the links provided to access product information pages on the Digi-Key website.


Disclaimer: The opinions, beliefs, and viewpoints expressed by the various authors and/or forum participants on this website do not necessarily reflect the opinions, beliefs, and viewpoints of Digi-Key Corporation or official policies of Digi-Key Corporation.