2012

Adaptive Energy-Efficient Task Partitioning for Heterogeneous Multi-Core Multiprocessor Real-Time Systems

Shivashis Saha
University of Nebraska-Lincoln, ssaha@cse.unl.edu

Jitender S. Deogun
University of Nebraska-Lincoln, jdeogun1@unl.edu

Ying Lu
University of Nebraska-Lincoln, ying@unl.edu

Part of the Computer Sciences Commons

http://digitalcommons.unl.edu/cseconfwork/198

This Article is brought to you for free and open access by the Computer Science and Engineering, Department of at DigitalCommons@University of Nebraska - Lincoln. It has been accepted for inclusion in CSE Conference and Workshop Papers by an authorized administrator of DigitalCommons@University of Nebraska - Lincoln.

Shivashis Saha, Jitender S. Deogun, and Ying Lu
Department of Computer Science and Engineering,
University of Nebraska-Lincoln, Lincoln, NE 68588-0115, U.S.A.
Email: {ssaha,deogun,ylu}@cse.unl.edu

Abstract—The designs of heterogeneous multi-core multiprocessor real-time systems are evolving for higher energy efficiency at the cost of increased heat density. This adversely affects the reliability and performance of the real-time systems. Moreover, the partitioning of periodic real-time tasks based on their worst-case execution time can lead to significant energy wastage.

In this paper, we investigate adaptive energy-efficient task partitioning for heterogeneous multi-core multiprocessor real-time systems. We use a power model which incorporates the impact of temperature and voltage of a processor on its static power consumption. Two different thermal models are used to estimate the peak temperature of a processor. We develop two feedback-based optimization and control approaches for adaptively partitioning real-time tasks according to their actual utilizations. Simulation results show that the proposed approaches are effective in minimizing the energy consumption and reducing the number of task migrations.

Keywords—Adaptive Task Partitioning; Thermal-Constrained Task Partitioning; Energy Minimization; Heterogeneous Multi-Core Multiprocessor Real-Time Systems

I. INTRODUCTION

Energy-efficient designs of computer systems have received significant interest in recent years due to an increased need for energy conservation [1]. Heterogeneous multiprocessor real-time systems are known to be energy efficient and have better performance as compared to homogeneous systems [2]. The energy efficiency of recent multiprocessor systems is achieved by increasing the power density. This in turn results in high heat density which can significantly impact the reliability and performance of heterogeneous real-time systems [3].

The power consumption of a processor is divided into static and dynamic power consumption. The power consumed by the processor to maintain its activeness is called the static power consumption [1]. Similarly, the power needed by the processor while executing a task is called the dynamic power consumption. The static power is generated by the leakage current, while the dynamic power is a function of the speed of the processor [4]. This function is known to be strictly convex and monotonically increasing, and is represented by a polynomial of at least second degree [5]. This convex relationship is exploited by Dynamic Voltage Scaling (DVS) techniques for minimizing the total energy consumption of a processor [6]. There has been significant research in energy-aware scheduling for homogeneous multiprocessor systems which have negligible static power consumption [7]. However, it has been shown that static power consumption is significant and is comparable to dynamic power consumption [8]. Leakage-aware scheduling strategies for heterogeneous systems have recently been investigated [1], [9]. There has also been recent interest in temperature-aware multiprocessor scheduling strategies for minimizing the temperature of a processor and thus improving its reliability [10], [11], [12]. In an ongoing project, we have been investigating thermal-constrained energy-efficient partitioning for heterogeneous multi-core multiprocessor real-time systems. The worst-case execution times (WCET) of tasks are generally known to be pessimistic estimates. Thus, WCET-based task partitioning may result in significant energy wastage, as the processors may run at unnecessarily high speeds. Hence, there is a need for an adaptive approach to task partitioning in order to minimize the energy consumption in real-time systems.

In this paper, we investigate adaptive thermal-constrained energy-efficient partitioning of periodic tasks in heterogeneous multi-core multiprocessor real-time systems. We consider a system which is heterogeneous across multiprocessors, but homogeneous within a multiprocessor. Thus, our objective is to find an optimal set of active cores and partitioning of the tasks based on their actual utilization that results in minimum energy consumption while satisfying processor constraints and meeting task deadlines. We use a power model that considers the impact of temperature [12] and voltage [11] of a processor on its leakage power consumption. Two thermal models, heat-independent thermal (HIT) model and heat-dependent thermal (HDT) model, are used for estimating the peak temperature of a processor. We consider negligible heat transfer among the cores in HIT model [11], and non-negligible heat transfer in HDT model [10], [12]. We present Distributed Utilization Control (DUC) and Greedy Utilization Control (GUC) heuristics for adaptive task partitioning which are feedback-based optimization and control approaches. Simulation results show the effectiveness of the proposed heuristics in minimizing energy consumption and reducing the number of task migrations.

The rest of the paper is organized as follows. Related work is discussed in Section II. The system models and the problem statement are given in Section III. The adaptive task partitioning algorithms are presented in Section IV. In Section V, we present and discuss the simulation results. Finally, conclusions and future work are given in Section VI.

II. RELATED WORK

Real-time scheduling of periodic tasks in multiprocessor environment is well investigated [13]. The existing techniques can be broadly classified into global and partitioning-based scheduling techniques. In global scheduling techniques, the global scheduler selects tasks for execution primarily from a queue of tasks [14]. However, in partitioning-based scheduling techniques the processors are independently scheduled and each task is assigned to a single processor for execution [15]. Partitioning-based scheduling techniques are more widely used than global scheduling techniques due to their simplicity in design and implementation [16]. Thus, our heuristics are partitioning-based scheduling techniques aimed to minimize energy consumption while satisfying the thermal constraints.

There has been a significant amount of research in energy-aware scheduling strategies for homogeneous multiprocessor systems [7], [14]. However, research on energy-aware scheduling strategies for heterogeneous multiprocessor systems is still in its early stages [1]. Most of this existing work is based on power models with negligible static power consumption [17]. However, a power model with non-negligible static power consumption was recently investigated [1]. Recent research has shown that the leakage current of a processor changes super-linearly with its temperature [18]. Thermal-aware scheduling strategies for homogeneous multiprocessor systems are also still in their early stages [7], [14].

Real-time systems need to adapt to environments with changing and non-stationary data conditions, under which the leakage current is assumed to be dynamic [10], [11], [12], [19], [20]. Moreover, heat transfer between different cores in a multi-core system has recently been used for developing adaptive centralized scheduling techniques in real-time systems [20]. Probabilistic distributions of the execution times of tasks have also been used for developing adaptive scheduling strategies [21].

III. SYSTEM MODELS AND PROBLEM DEFINITION

In this section, we describe our models and the problem.

A. Multiprocessor Model

Let \( \Omega = \{ M_1, M_2, \ldots, M_m \} \) denote a set of interconnected heterogeneous multiprocessor units. Each multiprocessor unit has \( k \) identical cores (or processors), i.e., \( M_i = \{ M_{i1}, M_{i2}, \ldots, M_{ik} \} \) (\( i = 1, \ldots, m \)). A core \( M_{ij} \) (\( j = 1, \ldots, k \)) supports dynamic voltage scaling (DVS) and varies its voltage/speed/frequency \( f_{ij} \) to one of the discrete levels in the range \([ f_{ij}^{\min}, f_{ij}^{\max} ]\), where \( f_{ij}^{\max} \) (\( f_{ij}^{\min} \)) is the maximum (minimum) operating frequency of the multiprocessor unit \( M_i \). Thus, our system is heterogeneous across multiprocessors, but homogeneous within a multiprocessor. For simplicity, the frequency of a core is normalized with respect to (wrt) \( f_{\text{core}}^{\max} \), i.e., we assume \( f_{ij}^{\max} = 1 \). The throughput (or capacity) of a core is proportional to its operating frequency [22]. The capacity of a core \( M_{ij} \) is denoted by \( \mu_{ij} \) and is equal to \( \alpha_i f_{ij} \), where \( \alpha_i \) is the performance coefficient of \( M_i \). In a heterogeneous multiprocessor system, higher values of \( \alpha_i \) correspond to more powerful multiprocessor units.

B. Task Model

Let \( \Gamma = \{ \tau_1, \tau_2, \ldots, \tau_n \} \) denote a set of independent periodic real-time tasks. A periodic task \( \tau_i \) (\( i = 1, \ldots, n \)) is an infinite sequence of task instances (jobs) released with periodicity \( P_i \) [23]. Thus, the relative deadline of the current instance (job) of \( \tau_i \) is given by its period \( P_i \). \( W_i \) denotes the worst-case execution time of \( \tau_i \) under the maximum frequency on a core of a standard multiprocessor \( \varphi \) with performance coefficient \( \alpha_{\varphi} = 1 \); it becomes \( \frac{W_i}{f_\varphi} \) if the core \( \varphi \) runs at a constant frequency \( f_\varphi \). The worst-case utilization of \( \tau_i \) under the maximum frequency of a standard core is denoted by \( u_i \) and is equal to \( \frac{W_i}{P_i} \). Thus, the total worst-case utilization of task set \( \Gamma \) under the maximum frequency of a standard core is \( U_{\text{tot}} = \sum_{i=1}^{n} u_i = \sum_{i=1}^{n} \frac{W_i}{P_i} \). Each task is allocated to exactly one core. \( P \) denotes the hyper-period of task set \( \Gamma \), i.e., the minimum positive number such that the pattern of job releases repeats every \( P \) time units. If \( P_1, \ldots, P_n \) are integers, then \( P \) is the least common multiple (LCM) of all task periods.

In addition to the task period \( P_i \) and worst-case execution time \( W_i \) of a task \( \tau_i \) (\( i = 1, \ldots, n \)), let the actual execution time of the \( q^{\text{th}} \) job of \( \tau_i \) under the maximum frequency on a standard core \( \varphi \) be \( c_{i,q} \). Thus, under a constant frequency \( f_\varphi \), the actual execution time of the \( q^{\text{th}} \) job of \( \tau_i \) is \( \frac{c_{i,q}}{f_\varphi} \). Therefore, the actual utilization of \( \tau_i \) under the maximum frequency (\( f_{\max} = 1 \)) on a standard core \( \varphi \) is \( \bar{u}_i = \frac{c_{i,q}}{P_i} \). After a partitioning of tasks into cores, the actual utilization of \( M_{jr} \) (\( j = 1, \ldots, m \); \( r = 1, \ldots, k \)) under its maximum frequency is \( \bar{U}_{jr} = \frac{1}{\alpha_j} \sum_{\tau_i \in \Gamma_{jr}} \bar{u}_i \), where \( \Gamma_{jr} \) denotes the set of tasks allocated to \( M_{jr} \). We aim to minimize the energy consumption in the hyper-period \( P \) while satisfying all the constraints.
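The quantities of the task model can be illustrated with a short Python sketch. The task values and the helper names (`utilization`, `hyper_period`) are hypothetical and chosen only for illustration; exact fractions are used so that the utilizations can be compared without rounding.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def utilization(exec_time, period):
    """Utilization of a task on a standard core at maximum frequency."""
    return Fraction(exec_time, period)


def hyper_period(periods):
    """Least common multiple of integer task periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


# Hypothetical task set: (worst-case execution time W_i, period P_i).
tasks = [(1, 4), (2, 6), (3, 12)]

worst_case = [utilization(w, p) for w, p in tasks]
U_tot = sum(worst_case)                      # total worst-case utilization
P = hyper_period([p for _, p in tasks])      # hyper-period of the task set
```

For this made-up task set, the total worst-case utilization is 5/6 and the hyper-period is 12 time units.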

C. Power Model

The power consumed by a core \( M_{ij} \) (\( i=1, \ldots, m \); \( j=1, \ldots, k \)) is denoted by \( \Phi_{ij} \). It is composed of two parts: \( \Phi^s_{ij} \) and \( \Phi^d_{ij} \) (Eq. 1a). \( \Phi^s_{ij} \) is the static (or leakage) power consumption of a core, which is generated by the leakage current for maintaining the activeness of the core [1], [8]. \( \Phi^d_{ij} \) is the dynamic power consumption of a core required for executing a task [4]. Eq. 1b gives the static power consumption of a core, which depends on both the frequency [10], [12] (proportional to the supply voltage [24]) and the temperature \( T_{ij} \) [11], [24] of the core; \( \gamma_i \), \( \delta_i \), and \( \chi_i \) are non-negative constants dependent on the architecture of \( M_i \). The dynamic power consumption of a core is a function of its frequency. Using current DVS technologies, \( \Phi^d_{ij}(f_{ij}) \) is assumed to be a strictly convex and monotonically increasing function which is represented by a polynomial of at least second degree. We assume $\Phi^d_{ij}$ is a cubic polynomial in frequency (i.e. $f_{ij}^3$) [10], [11], [12] and is given by Eq. 1c.

\[
\Phi_{ij}(f_{ij}) = \Phi^s_{ij}(f_{ij}) + \Phi^d_{ij}(f_{ij})
\]

(1a)

\[
\Phi^s_{ij}(f_{ij}) = \gamma_i f_{ij} + \delta_i f_{ij} T_{ij}
\]

(1b)

\[
\Phi^d_{ij}(f_{ij}) = \chi_i f_{ij}^3
\]

(1c)
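As a minimal sketch of Eq. 1, the following Python functions evaluate the static, dynamic, and total power of a core. The function names are hypothetical. The example values for $\gamma_i$, $\delta_i$, and $\chi_i$ are those listed for $M_1$ in Table I, while the frequency and temperature are arbitrary choices made for illustration.

```python
def static_power(gamma, delta, f, temp):
    """Leakage power: linear in frequency and in frequency times temperature."""
    return gamma * f + delta * f * temp


def dynamic_power(chi, f):
    """Dynamic power: cubic in the (normalized) frequency."""
    return chi * f ** 3


def total_power(gamma, delta, chi, f, temp):
    """Total power of a core as the sum of static and dynamic parts."""
    return static_power(gamma, delta, f, temp) + dynamic_power(chi, f)


# Constants of M_1 from Table I; f and temp are illustrative assumptions.
p_full = total_power(20.5060, 0.1666, 3.656, f=1.0, temp=50.0)
p_half = total_power(20.5060, 0.1666, 3.656, f=0.5, temp=50.0)
```

Because the dynamic part is cubic, halving the frequency reduces it by a factor of eight, whereas the static part only halves at a fixed temperature.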

D. Temperature Model

In this section, we present two different thermal models, heat-independent thermal model [11] and heat-dependent thermal model [10], [12].

1) Heat-Independent Thermal (HIT) Model: In this model, we assume there is negligible or no heat transfer among the cores of a multiprocessor unit or among different multiprocessors [11], [24]. Using the RC thermal model [1], [10], [11], [12], [24], the temperature of a core $M_{ij}$ ($i = 1, \ldots, m; j = 1, \ldots, k$) at time $t$ is denoted by $T_{ij}(t)$ and is given by Eq. 2, where $T_{amb}$ is the ambient temperature (in °C), $R_i$ is the thermal resistance of $M_i$ (in °C/Watt), $C_i$ is the thermal capacitance of $M_i$ (in J/°C), $\Phi_{ij}(t)$ is the power consumption of $M_{ij}$ at time $t$ (in Watt), and $\frac{dT_{ij}(t)}{dt}$ is the time derivative of the temperature of $M_{ij}$ at time $t$.

\[
R_i C_i \frac{dT_{ij}(t)}{dt} + T_{ij}(t) - R_i \Phi_{ij}(t) = T_{amb}
\]

(2)

If the initial temperature of $M_{ij}$ at time $t_0$ is $T_{ij}^{0}$ and $M_{ij}$ is running at a constant frequency $f_{ij}$, then the temperature of $M_{ij}$ at time $t$ is denoted by $T_{ij}^t$. It is obtained by substituting Eq. 1 into Eq. 2 and separating variables (Eq. 3), where $T^*_{ij}$ is the steady-state temperature given by Eq. 4.

\[
\int_{T_{ij}^{0}}^{T_{ij}^{t}} \frac{dT_{ij}}{T_{amb} + \gamma_i R_i f_{ij} + \delta_i R_i f_{ij} T_{ij} + \chi_i R_i f_{ij}^3 - T_{ij}} = \int_{t_0}^{t} \frac{dt}{R_i C_i}
\]

(3a)

\[
T_{ij}^t = T^*_{ij} + \left( T_{ij}^{0} - T^*_{ij} \right) e^{-\frac{(1 - \delta_i R_i f_{ij})(t - t_0)}{R_i C_i}}
\]

(3b)

If $M_{ij}$ is running at a constant frequency, then its temperature is a non-decreasing function (Eq. 3) [10], [11], [12], [24]. The temperature of $M_{ij}$ becomes steady when the system reaches the steady-state condition. The peak temperature of $M_{ij}$ running at constant frequency $f_{ij}$ is denoted by $T^*_{ij}$ (\(\frac{dT_{ij}(t)}{dt}|_{T_{ij}(t)=T^*_{ij}} = 0\)) and is given by Eq. 4.

\[
T^*_{ij} = \frac{T_{amb} + \gamma_i R_i f_{ij} + \chi_i R_i f_{ij}^3}{1 - \delta_i R_i f_{ij}}
\]

(4)
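Setting the time derivative in Eq. 2 to zero and substituting the power model of Eq. 1 yields a closed form for the steady-state temperature, and solving the resulting linear differential equation yields the transient. The Python sketch below evaluates both. The function names are hypothetical, the check for a positive denominator is an added assumption (the closed form is only meaningful when $1 - \delta_i R_i f_{ij} > 0$), and the example uses the $M_1$ constants from Table I.

```python
import math


def hit_peak_temperature(t_amb, gamma, delta, chi, r, f):
    """Steady-state core temperature under the heat-independent model."""
    denom = 1.0 - delta * r * f
    if denom <= 0.0:
        raise ValueError("no finite steady state: 1 - delta*R*f must be positive")
    return (t_amb + r * (gamma * f + chi * f ** 3)) / denom


def hit_temperature(t_init, t_amb, gamma, delta, chi, r, c, f, elapsed):
    """Core temperature after `elapsed` time units at constant frequency f."""
    peak = hit_peak_temperature(t_amb, gamma, delta, chi, r, f)
    rate = (1.0 - delta * r * f) / (r * c)
    return peak + (t_init - peak) * math.exp(-rate * elapsed)


# M_1 constants from Table I, ambient temperature 0 as in Section V.
params = dict(t_amb=0.0, gamma=20.5060, delta=0.1666, chi=3.656, r=0.282, f=1.0)
peak = hit_peak_temperature(**params)
```

A thermal constraint check then reduces to comparing `peak` with the maximum feasible operating temperature of the core.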

2) Heat-Dependent Thermal (HDT) Model: In this model, we assume a non-negligible amount of heat transfer among the cores of a multiprocessor unit and negligible or no heat transfer among different multiprocessor units [10], [12]. We also assume that there is a set of heat sinks $\Xi_i = \{\Xi_{i1}, \Xi_{i2}, \ldots, \Xi_{ih}\}$ for each multiprocessor unit $M_i$ ($i = 1, \ldots, m$). These heat sinks are only used for heat dissipation and are placed on top of the cores; they do not generate any power. Fourier's law can be used to model the dynamic heat transfer between the cores and heat sinks of a multiprocessor unit, where each core acts as a discrete thermal element [10], [12]. Using the RC thermal model [10], [12], let $\omega_{j,j'}$ denote the horizontal thermal conductance between two cores $M_{ij}$ and $M_{ij'}$ ($\forall j, j' \in 1, \ldots, k$), and let $\zeta_{j,q}$ denote the vertical thermal conductance between core $M_{ij}$ and sink $\Xi_{iq}$ ($q=1, \ldots, h$). The horizontal thermal conductance between the heat sinks $\Xi_{iq}$ and $\Xi_{iq'}$ ($\forall q, q' \in 1, \ldots, h$) is denoted by $\omega_{q,q'}$. We assume that $\omega_{j,j'} = \omega_{j',j}$, $\omega_{j,j} = 0$, and $\omega_{q,q} = 0$. $\omega_{amb}$ denotes the thermal conductance between a heat sink and the environment. The thermal capacitances of $M_i$ and $\Xi_{iq}$ are denoted by $C_i$ and $C^\Xi_{iq}$ respectively. Eq. 5 gives the temperature of $M_{ij}$ at time $t$, which is denoted by $T_{ij}(t)$, where $\frac{dT_{ij}(t)}{dt}$ and $\Phi_{ij}(t)$ are respectively the time derivative of the temperature of $M_{ij}$ and the power consumption of $M_{ij}$ at time $t$. $T^{sink}_{iq}(t)$ denotes the temperature of $\Xi_{iq}$ at time $t$ and is given by Eq. 6, where $\frac{dT^{sink}_{iq}(t)}{dt}$ and $T_{amb}$ are respectively the time derivative of the temperature of $\Xi_{iq}$ and the ambient temperature of the system.

\[
C_i \frac{dT_{ij}(t)}{dt} = \Phi_{ij}(t) - \sum_{j'=1}^{k} \omega_{j,j'} \left( T_{ij}(t) - T_{ij'}(t) \right) - \sum_{q=1}^{h} \zeta_{j,q} \left( T_{ij}(t) - T^{sink}_{iq}(t) \right)
\]

(5)

\[
C^{\Xi}_{iq} \frac{dT^{sink}_{iq}(t)}{dt} = -\omega_{amb} \left( T^{sink}_{iq}(t) - T_{amb} \right) - \sum_{q'=1}^{h} \omega_{q,q'} \left( T^{sink}_{iq}(t) - T^{sink}_{iq'}(t) \right) - \sum_{j=1}^{k} \zeta_{j,q} \left( T^{sink}_{iq}(t) - T_{ij}(t) \right)
\]

(6)

As long as $M_{ij}$ is running at a constant frequency, its temperature is a non-decreasing function [10], [11], [12], [24]. When the system reaches a steady-state condition, the temperature of $M_{ij}$ becomes steady. $T^*_{ij}$ denotes the peak (or maximum) temperature of $M_{ij}$ running at constant frequency $f_{ij}$ (\(\frac{dT_{ij}(t)}{dt}|_{T_{ij}(t)=T^*_{ij}} = 0\)) and $T^{sink}_{iq}$ denotes the peak temperature of $\Xi_{iq}$ (\(\frac{dT^{sink}_{iq}(t)}{dt} = 0\)). Eq. 7 approximates the values of $T^*_{ij}$ and $T^{sink}_{iq}$.

\[
0 = (\gamma_i f_{ij} + \chi_i f_{ij}^3) + T^*_{ij} \left(\delta_i f_{ij} - \sum_{j'=1}^{k} \omega_{j,j'} - \sum_{q=1}^{h} \zeta_{j,q} \right) + \sum_{j'=1}^{k} \omega_{j,j'} T^*_{ij'} + \sum_{q=1}^{h} \zeta_{j,q} T^{sink}_{iq}
\]

(7a)

\[
0 = \omega_{amb} T_{amb} + T^{sink}_{iq} \left(-\omega_{amb} - \sum_{j=1}^{k} \zeta_{j,q} - \sum_{q'=1}^{h} \omega_{q,q'} \right) + \sum_{j=1}^{k} \zeta_{j,q} T^*_{ij} + \sum_{q'=1}^{h} \omega_{q,q'} T^{sink}_{iq'}
\]

(7b)
We get Eq. 8 from simplifying Eq. 7 [12], where

\[ A_{j,j} = \delta_i f_{ij} - \sum_{j'=1}^{k} \omega_{j,j'} - \sum_{q=1}^{h} \zeta_{j,q} \]

\[ A_{j,j'} = \omega_{j,j'} \quad (j \neq j') \]

\[ A_{j,k+q} = A_{k+q,j} = \zeta_{j,q} \]

\[ A_{k+q,k+q} = -\omega_{amb} - \sum_{j=1}^{k} \zeta_{j,q} - \sum_{q'=1}^{h} \omega_{q,q'} \]

\[ A_{k+q,k+q'} = \omega_{q,q'} \quad (q \neq q') \]

\[
\begin{pmatrix}
A_{1,1} & \ldots & A_{1,k+h} \\
\vdots & \ddots & \vdots \\
A_{k,1} & \ldots & A_{k,k+h} \\
A_{k+1,1} & \ldots & A_{k+1,k+h} \\
\vdots & \ddots & \vdots \\
A_{k+h,1} & \ldots & A_{k+h,k+h}
\end{pmatrix}
\begin{pmatrix}
T^{*}_{i1} \\
\vdots \\
T^{*}_{ik} \\
T^{sink}_{i1} \\
\vdots \\
T^{sink}_{ih}
\end{pmatrix}
= -
\begin{pmatrix}
\gamma_i f_{i1} + \chi_i f_{i1}^3 \\
\vdots \\
\gamma_i f_{ik} + \chi_i f_{ik}^3 \\
\omega_{amb} T_{amb} \\
\vdots \\
\omega_{amb} T_{amb}
\end{pmatrix}
\]

(8a)

Here $[\lambda]$ denotes the column vector in parentheses on the right-hand side of Eq. 8a and $[T]$ denotes the column vector of core and sink temperatures. Therefore,

\[ [T]_{(k+h) \times 1} = - [A]^{-1}_{(k+h) \times (k+h)} \times [\lambda]_{(k+h) \times 1} \]

(8b)
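The steady-state temperatures of one multiprocessor unit under the HDT model follow from a linear system. The Python sketch below assembles the matrix from the steady-state heat balance of each core and each sink and solves it with a small Gauss-Jordan elimination. The function names, the argument layout (`omega` for core-to-core, `zeta` for core-to-sink, `omega_sink` for sink-to-sink conductances), and the numeric values are assumptions made for illustration; in practice a linear algebra library would typically replace the hand-written solver.

```python
def build_system(freqs, gamma, delta, chi, omega, zeta, omega_sink, omega_amb, t_amb):
    """Assemble A and lambda for k cores and h sinks (zero diagonals assumed)."""
    k, h = len(freqs), len(omega_sink)
    n = k + h
    A = [[0.0] * n for _ in range(n)]
    lam = [0.0] * n
    for j in range(k):
        A[j][j] = delta * freqs[j] - sum(omega[j]) - sum(zeta[j])
        for j2 in range(k):
            if j2 != j:
                A[j][j2] = omega[j][j2]
        for q in range(h):
            A[j][k + q] = A[k + q][j] = zeta[j][q]
        lam[j] = gamma * freqs[j] + chi * freqs[j] ** 3
    for q in range(h):
        A[k + q][k + q] = (-omega_amb - sum(zeta[j][q] for j in range(k))
                           - sum(omega_sink[q]))
        for q2 in range(h):
            if q2 != q:
                A[k + q][k + q2] = omega_sink[q][q2]
        lam[k + q] = omega_amb * t_amb
    return A, lam


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= factor * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def steady_state_temperatures(*args):
    """Return core temperatures followed by sink temperatures."""
    A, lam = build_system(*args)
    return solve(A, [-x for x in lam])


# Hypothetical unit: two cores, one sink, made-up conductances.
args = ([1.0, 0.5], 5.0, 0.1, 2.0, [[0.0, 1.0], [1.0, 0.0]],
        [[2.0], [2.0]], [[0.0]], 4.0, 30.0)
temps = steady_state_temperatures(*args)
```

In this made-up configuration the faster core settles at a higher temperature than the slower one, and both stay above the sink temperature.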

E. Problem Definition

**Proposition 1:** If the frequency of \( M_{i,j} \) is at least \( \hat{U}_{i,j} \), then any periodic hard real-time scheduling policy which can fully utilize the core (e.g., Earliest Deadline First (EDF), Least Laxity First) can be used to obtain a feasible schedule [25].

According to Proposition 1, we assume that the frequency of \( M_{i,j} \) is tightly related to the actual utilization \( \hat{U}_{i,j} \) instead of the worst-case utilization \( U_{i,j} \). The actual frequency of \( M_{i,j} \), denoted by \( \hat{f}_{i,j} \), is thus the lowest discrete frequency that is greater than or equal to \( \hat{U}_{i,j} \). The energy consumption of \( M_{i,j} \) during the time interval \( [0, P] \) is thus estimated by \( P \times \Phi_{i,j}(\hat{f}_{i,j}) \). We also need to make sure that, when \( M_{i,j} \) runs at frequency \( \hat{f}_{i,j} \), its peak temperature \( T^{*}_{i,j} \) is no more than \( T^{\max} \), which is the maximum feasible operating temperature of \( M_{i,j} \).

1) **Problem Statement:**

Given a set of periodic real-time tasks \( (\Gamma) \) and a set of interconnected multi-core multiprocessor units \( (\Omega) \), the problem is to identify a subset of cores to be activated \( (\Psi) \) and partition tasks to the active cores such that the overall energy consumption is minimized, peak temperature constraints are satisfied, and task deadlines are not violated. This problem is known to be \( NP \)-Hard in the strong sense [4], [26]. Thus, the problem is formally defined as follows:

**Minimize:**

\[ P \sum_{M_{i,j} \in \Psi} \Phi_{i,j}(\hat{f}_{i,j}) \]

**Subject to:**

\[ 0 \leq \hat{U}_{i,j} \leq \beta_{i,j} \]

\[ \sum_{M_{i,j} \in \Psi} \alpha_{i} \hat{U}_{i,j} = \gamma U_{\text{tot}} \]

\[ \hat{f}_{i,j} : \text{lowest discrete frequency where } \hat{f}_{i,j} \geq \hat{U}_{i,j} \]

\[ T^{*}_{i,j} \leq T^{\max} \]

where \( \beta_{i,j} \) is the schedulable utilization bound (e.g., 1.0 if \( M_{i,j} \) is scheduled under EDF) and \( \gamma \) is the ratio between the actual and worst-case total utilization of the tasks (not to be confused with the power constant \( \gamma_i \) of Eq. 1b).
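The frequency selection rule and the objective can be sketched in a few lines of Python. This is an illustration under simplifying assumptions rather than the formulation itself: the helper names are hypothetical, the power function is passed in by the caller, and the thermal and utilization-bound constraints are not checked here.

```python
def select_frequency(levels, actual_utilization):
    """Lowest discrete level that is >= the utilization, or None if none fits."""
    feasible = [f for f in levels if f >= actual_utilization]
    return min(feasible) if feasible else None


def objective(hyper_period, active_cores):
    """Energy over one hyper-period.

    active_cores is a list of (actual_utilization, levels, power_fn) tuples,
    one per active core; power_fn maps a frequency to a power value.
    """
    total_power = 0.0
    for utilization, levels, power_fn in active_cores:
        f = select_frequency(levels, utilization)
        if f is None:
            raise ValueError("utilization exceeds the maximum frequency")
        total_power += power_fn(f)
    return hyper_period * total_power


# Hypothetical core with three normalized levels and a purely cubic power model.
levels = [0.5, 0.75, 1.0]
energy = objective(10, [(0.6, levels, lambda f: 2.0 * f ** 3)])
```

With an actual utilization of 0.6, the core in this example runs at level 0.75 instead of 1.0, which is where the energy saving over a worst-case setting comes from.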

**IV. ADAPTIVE PARTITIONING ALGORITHMS**

In this section, we present adaptive partitioning algorithms which are based on feedback control and real-time scheduling theories and use actual utilization of tasks to minimize the overall energy consumption.

During the initial partitioning of tasks, the actual utilizations of tasks are hard to estimate. Thus, for the initial partitioning, it is assumed that the actual and worst-case utilizations of tasks are the same, i.e., \( \hat{u}_{i} = u_i = \frac{W_i}{P_i} \) (\( i = 1, \ldots, n \)) and \( \gamma = 1 \). This information is used to compute the initial partitioning using the Min-core Worst-fit (MW) heuristic, which is described below. Thus, we achieve the initial partitioning of tasks and also identify the set of active cores \( (\Psi) \) needed to execute all the tasks without violating any constraints.

In MW heuristic, we use worst-case execution time of tasks to partition the tasks into cores according to the worst-fit scheduling strategy. Therefore, the core with maximum available capacity is selected to execute the next task in the queue [4], [9]. But, worst-fit scheduling just solves the partitioning part of the problem. It does not find the least number of active cores \( (\Psi) \) needed to execute all the tasks. In order to find \( \Psi \), the cores are sorted in decreasing order of their respective maximum capacity and the number of active cores is sequentially decreased when using worst-fit scheduling. Each partitioning scheme ensures that peak temperature constraints of cores and deadlines of tasks are not violated. Finally, the partitioning scheme resulting in the least energy consumption is identified as the initial partitioning strategy for the tasks. Thus, MW heuristic also identifies the set of active cores \( (\Psi) \) needed to execute all the tasks.
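The following Python sketch illustrates one possible reading of the MW heuristic. It is a simplified assumption-laden version, not the authors' implementation: capacities and task utilizations are expressed in standard-core units, the energy model is supplied by the caller as a hypothetical `energy_fn(core, load)`, and the thermal and deadline checks are abstracted into an optional `feasible_fn(active, loads)` predicate.

```python
def worst_fit(task_utils, capacities):
    """Assign each task, in queue order, to the core with most spare capacity.

    Returns one task-index list per core, or None if some task does not fit.
    """
    remaining = list(capacities)
    assignment = [[] for _ in capacities]
    for task, u in enumerate(task_utils):
        core = max(range(len(remaining)), key=lambda c: remaining[c])
        if remaining[core] < u:
            return None
        remaining[core] -= u
        assignment[core].append(task)
    return assignment


def min_core_worst_fit(task_utils, capacities, energy_fn, feasible_fn=None):
    """Try decreasing numbers of active cores and keep the cheapest partition.

    Returns (energy, active_core_ids, assignment) or None if nothing fits.
    """
    order = sorted(range(len(capacities)), key=lambda c: capacities[c], reverse=True)
    best = None
    for count in range(len(order), 0, -1):
        active = order[:count]
        assignment = worst_fit(task_utils, [capacities[c] for c in active])
        if assignment is None:
            continue
        loads = [sum(task_utils[t] for t in tasks) for tasks in assignment]
        if feasible_fn is not None and not feasible_fn(active, loads):
            continue
        energy = sum(energy_fn(c, load) for c, load in zip(active, loads))
        if best is None or energy < best[0]:
            best = (energy, active, assignment)
    return best


# Hypothetical example: four tasks, three identical cores, and a made-up
# energy model with a constant static part plus a cubic dynamic part.
result = min_core_worst_fit([0.5, 0.4, 0.3, 0.2], [1.0, 1.0, 1.0],
                            lambda core, load: 1.0 + load ** 3)
```

In this example two active cores are cheaper than three because the static cost of the third core outweighs the dynamic savings, while a single core cannot hold all four tasks.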

The initial partitioning of tasks and the identification of active cores are based on the WCET of tasks, which is generally known to be a pessimistic estimate and can be significantly greater than the actual execution time. Thus, we use a dynamic feedback-based approach to adapt the initial partitioning using the actual utilization of tasks. We use a feedback predictor which monitors the current jobs and estimates the actual utilization of future jobs [20]. The information of \( \hat{u}_{i} \) is periodically updated and used to recalculate the partitioning of tasks for minimizing the energy consumption. However, unlike the initial partitioning, this new solution is not implemented exactly, since such a dramatic change of task partitioning may require a large number of task migrations. Therefore, the new solution of the task partitioning is only used to determine the total utilization target for a core's tasks, denoted by \( B_{j,r} \) (\( j = 1, \ldots, m \); \( r = 1, \ldots, k \)), i.e., the target value of \( \hat{U}_{j,r} \). We set \( B_{j,r} = \sum_{\tau_i \in \Gamma_{j,r}} \hat{u}_{i} \), where \( \Gamma_{j,r} \) denotes the tasks that would be allocated to \( M_{j,r} \) if the new partitioning scheme were followed. Since only the subset of active cores \( (\Psi) \) needs to be activated, we have \( B_{j,r} = 0 \) for all \( M_{j,r} \notin \Psi \). Thus, we develop utilization control methods to adapt the initial task partitioning and achieve the target utilization of the cores.

We use an approach similar to the controlled task migration approach [13] for changing the utilization of cores. Each job of a task is executed only on a single core, while different jobs of the task can execute on different cores. Thus, the runtime context of a job is maintained in only one core, but the task-level context may be migrated [13]. The task migration decisions of a core are made by the utilization controller of the core. The control period $t_s$ is selected such that multiple jobs of a task can be released during the period. Therefore, the goal is for $\bar{U}_{j,r}(s+1)$ to be close to $B_{j,r}$ after the $s$th sampling point (time $st_s$). Below we present two different task migration heuristics for accomplishing our objective.

**Distributed Utilization Control (DUC) Heuristic:** In the DUC heuristic, a task is migrated from an over-utilized core to an under-utilized core ($\bar{U}_{j,r} < B_{j,r}$) based on the migration probability. Let $P_{j,r,j',r'}$ denote the probability of core $M_{j,r}$ migrating a task to core $M_{j',r'}$ ($j' = 1, \ldots, m$; $r' = 1, \ldots, k$), and let $\delta_{j,r}(s)$ denote the total actual utilization of tasks migrated out of $M_{j,r}$ during the $(s+1)^{th}$ control period. Therefore, the load dynamics of a core is given by Eq. 11.

$$\bar{U}_{j,r}(s+1) = \bar{U}_{j,r}(s) - \delta_{j,r}(s) + \sum_{j'=1}^{m} \sum_{r'=1}^{k} \frac{P_{j,r,j',r'}}{\alpha_{j,r}} \delta_{j',r'}(s)$$

(11)

The current utilization of all under-utilized cores is broadcast to all active cores. Based on this information, a core $M_{j,r}$ computes the values of $P_{j,r,j',r'}$ using Eq. 12, where $\Theta$ denotes the set of under-utilized cores.

$$P_{j,r,j',r'} = \begin{cases} 0 & \text{if } M_{j',r'} \notin \Theta \\ \frac{B_{j',r'} - \bar{U}_{j',r'}(s)}{\sum_{M_{a,b} \in \Theta} \left( B_{a,b} - \bar{U}_{a,b}(s) \right) \alpha_{a,b}} & \text{otherwise} \end{cases}$$

(12)

The control rules are given in Eq. 13. The value of the positive control gain $K_{j,r}$ is selected to ensure overall system stability [27]. The admission controller in a core accepts a task migration if and only if the utilization and the task deadline miss ratio of the core are bounded (e.g. $\bar{U}_{j,r}(s+1) \leq \beta_{j,r}$ and $MR_{j,r}(s+1) \leq 1\%$, where $MR_{j,r}$ denotes the task deadline miss ratio of a core $M_{j,r}$). A small bound on the miss ratio is selected to ensure soft real-time properties [28]. The admission controller also ensures that the peak temperature constraint of the core is not violated by accepting any new task. An idle core is turned off or put into a sleep state for saving energy.

$$y_{j,r}(s) = \bar{U}_{j,r}(s) - B_{j,r}$$

$$\delta_{j,r}(s) = \begin{cases} K_{j,r}y_{j,r}(s) & \text{if } y_{j,r}(s) \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

(13)
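One control period of DUC can be sketched as follows. This is a simplified illustration under stated assumptions: all cores are treated as having the same performance coefficient, so that the migration probabilities reduce to each under-utilized core's share of the total slack, a single gain is used for all cores, and admission control is left out. The function name `duc_step` and the example values are hypothetical.

```python
def duc_step(util, target, gain):
    """Compute outgoing load and migration probabilities for one period.

    util and target map a core id to its current and target utilization.
    Returns (delta, probs): delta[c] is the utilization core c should shed,
    probs[c] is the probability of choosing under-utilized core c as the
    destination of a migration.
    """
    # Control rule: shed a fraction `gain` of the positive utilization error.
    delta = {c: gain * max(util[c] - target[c], 0.0) for c in util}

    # Under-utilized cores share incoming load in proportion to their slack.
    slack = {c: target[c] - util[c] for c in util if util[c] < target[c]}
    total_slack = sum(slack.values())
    probs = {c: s / total_slack for c, s in slack.items()}
    return delta, probs


# Hypothetical snapshot of three cores.
util = {"a": 0.9, "b": 0.3, "c": 0.5}
target = {"a": 0.6, "b": 0.5, "c": 0.7}
delta, probs = duc_step(util, target, gain=0.5)
```

Here core `a` is over-utilized by 0.3 and sheds half of that error in this period, while cores `b` and `c` have equal slack and are therefore equally likely destinations.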

**Greedy Utilization Control (GUC) Heuristic:** In the GUC heuristic, we follow a greedy approach for migrating the tasks. The current utilization of all under-utilized cores is broadcast to all active cores. Based on this information, the most over-utilized core selects the task with the least actual utilization and migrates the task to the most under-utilized core. The admission controller ensures that the utilization and task deadline miss ratio of the core are bounded (e.g. $\bar{U}_{j,r}(s+1) \leq \beta_{j,r}$ and $MR_{j,r}(s+1) \leq 1\%$) and that the peak temperature constraint of a core is not violated by accepting a new task. An idle core is turned off or put into a sleep state for saving energy.
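A single greedy migration step of GUC might look like the Python sketch below. It assumes identical cores and an admission controller that checks only the utilization bound; the deadline miss ratio and peak temperature checks described above are omitted. The function name `guc_step` and the data layout are hypothetical.

```python
def guc_step(tasks_on_core, target, bound=1.0):
    """Perform at most one greedy migration and report it.

    tasks_on_core maps a core id to a dict {task id: actual utilization} and
    is updated in place. Returns (task, source, destination) or None.
    """
    util = {c: sum(tasks.values()) for c, tasks in tasks_on_core.items()}
    over = [c for c in util if util[c] > target[c] and tasks_on_core[c]]
    under = [c for c in util if util[c] < target[c]]
    if not over or not under:
        return None

    src = max(over, key=lambda c: util[c] - target[c])    # most over-utilized
    dst = max(under, key=lambda c: target[c] - util[c])   # most under-utilized
    task = min(tasks_on_core[src], key=tasks_on_core[src].get)  # lightest task

    # Simplified admission control: utilization bound of the destination only.
    if util[dst] + tasks_on_core[src][task] > bound:
        return None
    tasks_on_core[dst][task] = tasks_on_core[src].pop(task)
    return task, src, dst


# Hypothetical snapshot of three cores.
cores = {"c0": {"t1": 0.4, "t2": 0.1, "t3": 0.3},
         "c1": {"t4": 0.2},
         "c2": {"t5": 0.3}}
targets = {"c0": 0.5, "c1": 0.4, "c2": 0.4}
move = guc_step(cores, targets)
```

Calling `guc_step` repeatedly until it returns `None` moves one light task at a time, which keeps each individual migration cheap.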

**V. RESULTS AND DISCUSSION**

In this section, we describe the simulation and analyze the results. Our results represent an average of 20 runs, and all results have a 95% confidence level. The maximum feasible operating temperature of a core and the ambient temperature of the system are assumed to be 75°C and 0°C respectively. We used 6 different sets of periodic tasks, where the total number of tasks $n$ is 80, 90, 110, 140, 200, and 300. The total worst-case utilizations of the task sets ($U_{tot}$) are assumed to be similar, and the average task utilizations ($U_{tot}/n$) of the task sets are different. The average task utilization decreases across the task sets, with averages of 0.205, 0.175, 0.145, 0.115, 0.085, and 0.055 respectively. The actual task utilization is considered to be 40% of the worst-case utilization in our experiments. The task hyper-period $P$ is 1000 seconds.

**A. Results using Heat-Independent Thermal Model**

We simulated a set of 8 interconnected multiprocessor units ($m = 8$), and each multiprocessor unit has 4 identical cores ($k = 4$). The parameters used in the simulation are given in Table I [11], [22], [24]. It is assumed that $f_i^{\min} = 0.5 f_i^{\max}$.

<table>
<thead>
<tr>
<th>$M_i$</th>
<th>$f_i^{\max}$</th>
<th>$\gamma_i$</th>
<th>$\delta_i$</th>
<th>$\chi_i$</th>
<th>$R_i$</th>
<th>$C_i$</th>
<th>$\alpha_i$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$M_1$</td>
<td>3.3</td>
<td>20.5060</td>
<td>0.1666</td>
<td>3.656</td>
<td>0.282</td>
<td>340</td>
<td>2.152</td>
</tr>
<tr>
<td>$M_2$</td>
<td>3.4</td>
<td>5.0187</td>
<td>0.1942</td>
<td>2.316</td>
<td>0.487</td>
<td>295</td>
<td>1.666</td>
</tr>
<tr>
<td>$M_3$</td>
<td>3.8</td>
<td>12.7880</td>
<td>0.2043</td>
<td>3.645</td>
<td>0.288</td>
<td>320</td>
<td>1.148</td>
</tr>
<tr>
<td>$M_4$</td>
<td>3.0</td>
<td>15.6262</td>
<td>0.1942</td>
<td>4.556</td>
<td>0.238</td>
<td>320</td>
<td>1.044</td>
</tr>
<tr>
<td>$M_5$</td>
<td>3.2</td>
<td>20.6393</td>
<td>0.1574</td>
<td>4.204</td>
<td>0.278</td>
<td>295</td>
<td>0.869</td>
</tr>
<tr>
<td>$M_6$</td>
<td>3.1</td>
<td>11.9759</td>
<td>0.1586</td>
<td>2.719</td>
<td>0.480</td>
<td>255</td>
<td>0.540</td>
</tr>
<tr>
<td>$M_7$</td>
<td>3.0</td>
<td>10.3490</td>
<td>0.1124</td>
<td>2.074</td>
<td>0.661</td>
<td>335</td>
<td>0.348</td>
</tr>
<tr>
<td>$M_8$</td>
<td>2.6</td>
<td>13.1568</td>
<td>0.1754</td>
<td>2.332</td>
<td>0.680</td>
<td>380</td>
<td>0.300</td>
</tr>
</tbody>
</table>

Fig. 1(a) gives the number of active cores ($|\Psi|$) needed to execute the respective task sets. Using WCET-based partitioning approach, 16 to 20 active cores are needed to execute all the task sets. However, using the actual utilization (AET) based DVS scaling, 7 or 8 active cores are sufficient.

For comparative analysis, we implemented MW heuristic using the actual task utilizations to get the baseline centralized solution. The tasks are migrated in the centralized solution for implementing the new task partitioning scheme, while tasks are migrated in the feedback approaches for achieving the utilization target (i.e. $B_{j,r}$, where $j=1, \ldots, m;r = 1, \ldots, k$).

Task migrations are classified into two categories: indispensable and normal. Indispensable task migrations are the unavoidable migrations that result from turning off cores. Normal task migrations are the migrations among the cores that remain active. Therefore, the number of normal task migrations is used to evaluate the migration overhead of the heuristics.
The number of required normal task migrations is compared in Fig. 1(b). In our experiments, the GUC and DUC heuristics reduce the number of normal task migrations by an average of 61.5% and 59.4%, respectively, compared to the centralized solution. The numbers of indispensable task migrations in our experiments are 24, 32, 34, 39, 68, and 88 for task set sizes of 80, 90, 110, 140, 200, and 300, respectively.
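The two migration categories can be illustrated with a minimal sketch. It assumes that a partition is a mapping from a task to a (unit, core) pair, and that a migration counts as indispensable exactly when its source core is turned off in the new partition. The function name `classify_migrations`, the data layout, and the example task names are hypothetical and are not taken from the simulator used in the experiments.

```python
from typing import Dict, Set, Tuple

Core = Tuple[int, int]  # hypothetical encoding: (multiprocessor unit j, core r)


def classify_migrations(
    old: Dict[str, Core],
    new: Dict[str, Core],
    active_after: Set[Core],
) -> Tuple[int, int]:
    """Return (indispensable, normal) migration counts.

    Assumption: a migration is indispensable when the task's source core
    is no longer active after repartitioning; otherwise it is normal.
    """
    indispensable = 0
    normal = 0
    for task, src in old.items():
        dst = new[task]
        if dst == src:
            continue  # the task stays on its core, so there is no migration
        if src not in active_after:
            indispensable += 1  # forced by turning off the source core
        else:
            normal += 1  # moved between cores that remain active
    return indispensable, normal


# Illustrative data only: core (2, 1) is turned off after repartitioning.
old_partition = {"t1": (1, 1), "t2": (1, 2), "t3": (2, 1), "t4": (2, 1)}
new_partition = {"t1": (1, 1), "t2": (1, 1), "t3": (1, 1), "t4": (1, 2)}
still_active = {(1, 1), (1, 2)}

print(classify_migrations(old_partition, new_partition, still_active))
```

In this toy example, tasks `t3` and `t4` must leave the core that is switched off, while `t2` moves between two cores that stay active, so only the latter contributes to the migration overhead compared in the figures.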

The total energy consumed by the different partitioning approaches is compared in Fig. 1(c). The energy consumption using DVS scaling based on the actual utilization (i.e., $\tilde{U}_i$, $i = 1, \ldots, n$) is significantly lower than that using DVS scaling based on the worst-case utilization (i.e., $U_i$). The energy consumption is further reduced by adaptive or centralized strategies, which use the actual execution times of tasks to obtain the task partitioning. In our experiments, the centralized solution reduces the energy consumption by 51.86% to 62.17% compared to the WCET-based partitioning. The DUC heuristic achieves similar reductions in energy consumption but requires 51.4% to 67.8% fewer normal task migrations than the centralized solution. The GUC heuristic also achieves similar energy savings but needs 58.5% to 65.8% fewer normal task migrations than the centralized solution. Therefore, the proposed adaptive heuristics are effective both in minimizing energy consumption according to the actual utilization of tasks and in minimizing the number of normal task migrations.
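The percentage figures in this section can be read as relative reductions with respect to a baseline. The formula below is a sketch of that reading; the symbols $X_{\text{base}}$ and $X_{\text{new}}$ are introduced here for illustration only and stand for the baseline and the compared quantity (energy consumed, or number of normal task migrations).

$$
\text{reduction} = \frac{X_{\text{base}} - X_{\text{new}}}{X_{\text{base}}} \times 100\%
$$

For example, with purely hypothetical values $X_{\text{base}} = 100$ and $X_{\text{new}} = 40$, the reduction is $(100 - 40)/100 \times 100\% = 60\%$.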

### B. Results using Heat-Dependent Thermal Model

We simulated a set of 4 interconnected multiprocessor units ($m = 4$), where each multiprocessor unit has a $2 \times 2$ layout with 2 sinks, i.e., each unit has 4 identical cores ($k = 4$) [10], [12]. The simulation parameters are given in Table II [12], and we assume that $\phi_i^{\min} = 0.5\,\phi_i^{\max}$. The values of $A_{i,i}$ ($i = 1, \ldots, 4$) in matrix $A$ can only be computed after the frequencies of the cores have been determined (Eq. (8)). Therefore, these entries are left blank in Table II.

Fig. 2(a) compares the number of active cores ($|\Psi|$) needed to execute the respective task sets. With the WCET-based partitioning approach, 15 active cores are needed to execute the task sets. However, 6 active cores are sufficient when DVS scaling based on the actual utilization (AET) is used.

The number of required normal task migrations is compared in Fig. 2(b). In our experiments, the GUC and DUC heuristics reduce the number of normal task migrations by an average of 47.5% and 43.0%, respectively, compared to the centralized solution. Due to heat transfer among cores, the HDT model requires a higher number of normal task migrations than the HIT model. The numbers of indispensable task migrations in our experiments are 23, 27, 31, 41, 62, and 89 for task set sizes of 80, 90, 110, 140, 200, and 300, respectively.

The total energy consumed by the partitioning schemes is compared in Fig. 2(c). The energy consumption using DVS scaling based on the actual utilization (i.e., $\tilde{U}_i$) is significantly lower than that using DVS scaling based on the worst-case utilization (i.e., $U_i$). There are significant energy savings from adaptive or centralized solutions, which use the actual execution times of tasks to obtain the task partitioning. In our experiments, the centralized solution reduces the energy consumption by 58.5% to 62.2% compared to the WCET-based solution.

### Table II

**Simulation Parameters using HDT model**

<table>
<thead>
<tr>
<th colspan="5">(a) Matrix A for $M_1$</th>
</tr>
</thead>
<tbody>
<tr>
<td>0.0097</td>
<td>0.004</td>
<td>0.000</td>
<td>0.200</td>
<td>0.050</td>
</tr>
<tr>
<td>0.0004</td>
<td>0.000</td>
<td>0.004</td>
<td>0.050</td>
<td>0.060</td>
</tr>
<tr>
<td>0.0004</td>
<td>0.000</td>
<td>0.009</td>
<td>0.200</td>
<td>0.050</td>
</tr>
<tr>
<td>0.200</td>
<td>0.050</td>
<td>0.200</td>
<td>0.050</td>
<td>-1.725</td>
</tr>
<tr>
<td>0.050</td>
<td>0.060</td>
<td>0.050</td>
<td>0.060</td>
<td>0.300</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>(b) Matrix A for $M_2$</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>(c) Matrix A for $M_3$</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>(d) Matrix A for $M_4$</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>(e) Other Simulation Parameters</th>
</tr>
</thead>
<tbody>
<tr>
<td>$M_i$</td>
</tr>
<tr>
<td>$M_1$</td>
</tr>
<tr>
<td>$M_2$</td>
</tr>
<tr>
<td>$M_3$</td>
</tr>
<tr>
<td>$M_4$</td>
</tr>
</tbody>
</table>
The DUC heuristic achieves similar energy savings but requires 34.8% to 50% fewer normal task migrations than the centralized solution. The GUC heuristic also achieves similar energy savings and needs 39.5% to 50% fewer normal task migrations than the centralized solution. Therefore, the proposed adaptive heuristics are effective both in minimizing the number of normal task migrations and in minimizing energy consumption according to the actual utilization of tasks.

VI. Conclusion

We present adaptive energy-efficient task partitioning for heterogeneous multi-core multiprocessor real-time systems. We use a power model that incorporates the impact of the temperature and voltage of a core on its static power consumption. Two different thermal models, namely the HIT and HDT models, are used to estimate the peak temperature of a core. We present the DUC and GUC heuristics, which are feedback-based optimization and control approaches, for adaptive thermal-constrained energy-efficient partitioning of tasks. In our simulations with the HIT model, the DUC and GUC heuristics reduce the energy consumption by an average of 55% compared to a WCET-based task partitioning scheme and require an average of 60% fewer normal task migrations than a centralized solution to obtain similar energy savings. Similarly, in our simulations with the HDT model, the DUC and GUC heuristics reduce the energy consumption by an average of 60% compared to a WCET-based task partitioning scheme and require an average of 45% fewer normal task migrations than a centralized solution to obtain similar energy savings.

In the future, we plan to investigate strategies for dealing with modeling inaccuracies in power and thermal parameters and in task execution times. We also want to evaluate our solutions on several multiprocessor systems, e.g., multi-core computers, smart phones, and vehicle computing platforms.

REFERENCES