Evolutionary design of the memory subsystem☆
Introduction
The memory hierarchy has a significant impact on the performance and energy consumption of a system: it is estimated to account for about 50% of the total energy consumption in the chip [1]. This makes the memory subsystem one of the most important targets for improving both performance and energy consumption. Concerns such as thermal issues or high energy consumption can cause significant performance degradation, as well as irreversible damage to the devices, thereby increasing the energy cost. Previous works have shown that saving energy in the memory subsystem can effectively control transistor aging effects and significantly extend the lifetime of the internal structures [2].
Technological changes combined with the development of communications have led to the great expansion of mobile devices such as smartphones, tablets, etc. Mobile devices have evolved rapidly to adapt to new requirements, supporting multimedia applications. These devices rely on embedded systems, which are mainly battery-powered and usually have fewer computing resources than desktop systems.
Additionally, multimedia applications are usually memory intensive, so they have high performance requirements, which imply high energy consumption. These features increase the pressure on the whole memory subsystem.
Processor registers, smaller in size, work at the same speed as the processor and consume less energy than the other levels of the memory subsystem. However, energy consumption and access time rise as the register file grows, due to the higher number of registers and ports.
The cache memory has generally been identified as a cold area in the chip, although its peripheral circuits and its size are the factors that most influence temperature increases under application-specific access patterns [3]. However, the cache memory affects both performance and energy consumption: the on-chip cache is considered responsible for 20–30% of the total energy consumption in the chip [4]. A suitable cache configuration will therefore improve both metrics.
In terms of performance, the main memory is the slowest component compared with the cache memory and processor registers. Running programs request the allocation and deallocation of memory blocks, and the dynamic memory manager (DMM) is in charge of this task. Current multimedia applications have highly dynamic memory requirements, so optimizing the memory allocator is a crucial task. Solving a memory allocation request is complex, and the allocation algorithm must minimize both internal and external fragmentation. Therefore, efficient tools must be provided to DMM designers for evaluating the cost and efficiency of DMMs, facilitating the design of customized DMMs.
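To make the two fragmentation notions concrete, the following is a minimal sketch of how a DMM evaluation tool might quantify them. The size classes, request sizes and free-block lists are illustrative assumptions, not the paper's actual model.

```java
import java.util.List;

public class FragmentationMetrics {

    // Internal fragmentation: memory wasted inside allocated blocks
    // when request sizes are rounded up to a fixed size class.
    static double internalFragmentation(List<Integer> requested, int sizeClass) {
        long wasted = 0, allocated = 0;
        for (int r : requested) {
            int rounded = ((r + sizeClass - 1) / sizeClass) * sizeClass;
            wasted += rounded - r;
            allocated += rounded;
        }
        return allocated == 0 ? 0.0 : (double) wasted / allocated;
    }

    // External fragmentation: free memory that cannot satisfy a large
    // request because it is scattered across small free blocks.
    static double externalFragmentation(List<Integer> freeBlocks) {
        long totalFree = 0;
        int largest = 0;
        for (int b : freeBlocks) {
            totalFree += b;
            largest = Math.max(largest, b);
        }
        return totalFree == 0 ? 0.0 : 1.0 - (double) largest / totalFree;
    }
}
```

An allocator that scores low on both metrics for a given request trace is, all else being equal, a better fit for that application; this is the kind of cost a DMM evaluation tool can report to the designer.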
In this paper we present a methodology based on evolutionary algorithms (EA), divided into three layers that tackle different components of the memory hierarchy, performing the optimization process of each layer according to the running applications. The first layer is the register file, the second is the cache memory, and the last one is the DMM, which operates on the main memory. Fig. 1 shows the three optimization layers surrounded with different dashed lines, and the tools involved within each optimization process, which will be explained in detail in the rest of the paper.
In a previous work [5], we presented an approach based on grammatical evolution (GE) with a wide design space, where the complete set of parameters was considered and a specific cache memory configuration was chosen as the baseline. The GE approach obtained good results, although no other results were available for comparison. The problem is clearly multi-objective, and thus the GE approach considered a weighted objective function. Hence, the optimization problem was later addressed through a multi-objective approach with NSGA-II [6]. On the one hand, this approach was customized with a fixed cache size for both the instruction and data caches. On the other hand, a different cache memory configuration was used as the baseline. Thus, the GE and NSGA-II approaches used different sets of parameters, and as a consequence their results could not be directly compared to support a decision.
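The two fitness schemes mentioned above differ in how they rank candidate configurations. A minimal sketch of both, assuming minimization of the paper's two objectives (execution time and energy consumption); the weight values and method names are illustrative:

```java
public class FitnessSchemes {

    // Weighted scalarization, as in the GE approach: both objectives
    // collapse into a single value, so candidates are totally ordered.
    static double weightedFitness(double execTime, double energy,
                                  double wTime, double wEnergy) {
        return wTime * execTime + wEnergy * energy;
    }

    // Pareto dominance, as used by NSGA-II: a dominates b if it is no
    // worse in every objective and strictly better in at least one.
    static boolean dominates(double[] a, double[] b) {
        boolean strictlyBetter = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strictlyBetter = true;
        }
        return strictlyBetter;
    }
}
```

With scalarization, the chosen weights bake a trade-off into the search; with Pareto dominance, NSGA-II instead returns a front of non-dominated configurations and defers the trade-off to the designer.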
In this paper we provide several new contributions regarding the cache design. Firstly, we run the NSGA-II algorithm under the same conditions as the GE proposal, using the same design space and the same baseline; this configuration allows a direct comparison between the two algorithms. Additionally, two baseline caches included in general-purpose devices are added to the analysis, since the original baseline belonged to a special-purpose device. Finally, we have added a statistical test to verify the relevance of the results. Therefore, this work completes the set of tests previously made and provides enough information to decide which algorithm to apply in the cache design optimization.
In addition to the cache design, we propose in this paper to apply evolutionary techniques to the register file configuration and the DMM which, considered in conjunction with the cache, comprise the whole memory subsystem of a computer. For both the register file and the DMM we propose the algorithms, perform the experiments and analyze the results on both objectives of our fitness function: execution time and energy consumption. Besides, we have incorporated statistical tests to verify the relevance of our results in both the register file and DMM optimizations. To the best of our knowledge, a complete 3-layer approach such as the one we propose has not been previously reported in the literature.
We have also focused our experiments on the ARM architecture, which is present in many current embedded multimedia systems. The selected applications have been adapted to better fit each of the memory layers that we optimize. As we will show later in this work, the cache memory policies and the DMMs offer the greatest room for improvement.
All the algorithms are coded in Java using the JECO library [7]. The experiments were conducted on a computer with an Intel i5 660 processor running at 3.3 GHz, 8 GB of RAM and the Ubuntu Desktop 14.04 operating system.
The rest of the paper is organized as follows. The next section summarizes the related work on the topic. Section 3 describes the thermal, performance and energy models applied. Section 4 addresses the thermal impact on the processor registers. Section 5 presents the optimization process aimed at the automatic design of cache configurations that improve performance and reduce energy consumption. Section 6 describes the optimization process to automatically evaluate and design customized DMMs, which improve performance and reduce the memory fragmentation problem. In Section 7, we present our conclusions and describe future work.
Related work
Many works can be found in the literature regarding memory optimization. Next, we review the literature closest to our work, separating the papers into the three memory layers we have studied.
Concerns about thermal problems, performance degradation and high energy consumption are neither new nor insignificant in the memory subsystem. The register file is identified as a component that consumes high energy, between 15% and 36% in embedded processors [8]. Multimedia applications increase
Thermal, energy and performance model
The proposed framework is based on the simulation of performance and energy consumption models. These models are used as the input for the optimization algorithms, which find an optimized design. To support this, we apply thermal, energy and performance models, which are described next.
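As a minimal sketch of how such models can feed an optimizer, the evaluation of a candidate configuration might reduce to closed-form estimates of execution time and energy, which then become the candidate's two fitness values. The linear model and all constants below are illustrative assumptions, not the paper's actual models.

```java
public class ModelEvaluator {

    // Execution time: hits pay the hit latency, misses additionally
    // pay the miss penalty (all values in arbitrary time units).
    static double executionTime(long hits, long misses,
                                double hitLatency, double missPenalty) {
        return hits * hitLatency + misses * (hitLatency + missPenalty);
    }

    // Energy: dynamic energy per access plus static (leakage) power
    // integrated over the whole execution time.
    static double energy(long accesses, double dynamicPerAccess,
                         double staticPower, double execTime) {
        return accesses * dynamicPerAccess + staticPower * execTime;
    }
}
```

An optimization loop would call such an evaluator once per candidate, so the accuracy of the underlying thermal, energy and performance models directly bounds the quality of the designs the algorithm can find.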
Register file optimization
The first layer of our methodology is the register file optimization. We present a methodology that takes into account the temperature increase caused by register accesses while a multimedia application is running. It then evaluates the thermal impact of different spatial distributions of the logical registers, applying a multi-objective evolutionary algorithm (MOEA) to obtain optimized solutions and finally proposing the spatial distributions that best reduce the thermal impact. This
Cache memory optimization
The second layer of our methodology is the cache memory, as previously shown in Fig. 1. We propose an optimization approach that determines cache configurations for multimedia embedded systems requiring less execution time and energy consumption.
As seen in Fig. 5, this layer is divided into two off-line phases (labeled 1 and 2) and a third phase devoted to optimization (labeled 3). The off-line phases are executed just once, before the optimization. Next, the
Dynamic memory management optimization
The third layer of our methodology consists of an optimization framework based on GE and static profiling of applications to improve the dynamic memory manager (DMM) for multimedia applications, which depend heavily on dynamic memory. This is a non-intrusive method that allows complex DMM implementations to be evaluated automatically.
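The core mechanism of GE can be sketched as follows: a list of integer codons selects production rules from a grammar, composing a DMM description that is then evaluated by the profiling framework. The tiny grammar below (allocation policy, block splitting and coalescing) is a made-up illustration, not the grammar used in the paper.

```java
import java.util.List;

public class DmmGrammar {
    // Illustrative production rules for three design decisions of a DMM.
    static final String[] FIT      = {"first-fit", "best-fit", "worst-fit"};
    static final String[] SPLIT    = {"split-blocks", "no-split"};
    static final String[] COALESCE = {"coalesce", "no-coalesce"};

    // Each codon picks one option through the usual GE modulo rule,
    // so any integer genotype maps to a valid DMM description.
    static String decode(List<Integer> codons) {
        return FIT[codons.get(0) % FIT.length] + " "
             + SPLIT[codons.get(1) % SPLIT.length] + " "
             + COALESCE[codons.get(2) % COALESCE.length];
    }
}
```

Because the modulo mapping always yields a syntactically valid DMM, the evolutionary search can explore the design space freely while the grammar guarantees that every candidate is a well-formed allocator configuration.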
In order to evaluate our proposal, we have selected six memory intensive applications: hmmer, dealII, soplex, calculix, gcc and perl. In addition, we have
Conclusions and future work
We have presented a method to optimize the memory subsystem of a computer addressing three different levels: the register file, the cache memory and the dynamic memory management of the main memory. At all these levels we propose an evolutionary algorithm as the optimization engine, assisted by other tools either in a closed loop or in off-line phases.
The optimization of the register file is based on a first step where a static profiling of the target applications is performed. Then, a
References (41)
- et al., Simulation of high-performance memory allocators, Microprocess. Microsyst. (2011)
- et al., Runtime data center temperature prediction using grammatical evolution techniques, Appl. Soft Comput. (2016)
- Reducing energy consumption of multiprocessor SoC architectures by exploiting memory bank locality, ACM Trans. Des. Autom. Electron. Syst. (2006)
- et al., Cache aging reduction with improved performance using dynamically re-sizable cache, Proceedings of the Conference on Design, Automation & Test in Europe (2014)
- et al., Analysis of SRAM and eDRAM cache memories under spatial temperature variations, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. (2010)
- et al., Instruction-level power dissipation in the Intel XScale embedded microprocessor, SPIE's 17th Annual Symposium on Electronic Imaging Science & Technology (2005)
- et al., Optimizing L1 cache for embedded systems through grammatical evolution, Soft Comput. (2015)
- et al., Multi-objective optimization of energy consumption and execution time in a single level cache memory for embedded systems, J. Syst. Softw. (2016)
- JECO (Java Evolutionary COmputation) Library (2017)
- et al., Application-guided power gating reducing register file static power, IEEE Trans. Very Large Scale Integr. Syst. (2014)
- On reducing register pressure and energy in multiple-banked register files, Proceedings of the 21st International Conference on Computer Design (2003)
- Compiler-driven leakage energy reduction in banked register files
- Temperature-aware register reallocation for register file power-density minimization, ACM Trans. Des. Autom. Electron. Syst.
- Thermal-aware compilation for system-on-chip processing architectures, Proceedings of the 20th Symposium on Great Lakes Symposium on VLSI, GLSVLSI'10
- Futility scaling: high-associativity cache partitioning, Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47
- Energy-efficient phase-based cache tuning for multimedia applications in embedded systems, 2014 IEEE 11th Consumer Communications and Networking Conference (CCNC)
- Dynamic cache reconfiguration for soft real-time systems, ACM Trans. Embed. Comput. Syst.
- A survey on cache tuning from a power/energy perspective, ACM Comput. Surv.
- Technical Reference Manual
- Dynamic access distance driven cache replacement, ACM Trans. Archit. Code Optim.
☆ This research has been partially supported by the Ministerio de Economía y Competitividad of Spain (Grant Refs. TIN2015-65460-C2, TIN2014-54806-R and TIN2017-85727-C4-4-P), by the European Regional Development Fund FEDER (Grant Refs. EphemeCH TIN2014-56494-C4-1,2,3-P and IB16035), and by Junta de Extremadura FEDER, project GR15068.