

# Tikrit University Electrical Engineering Department

EE-317
Computer Engineering 2024-2025

# Introduction: Computer Performance, Power & Multi-core CPU

Jalal Nazar Abdulbaqi, Ph.D.

jalal.abdulbaqi@tu.edu.iq

### Outline

- Performance
- Power Wall
- Multi-processor
- Benchmarking
- Fallacies and Pitfalls

# **Defining Performance**

# Which airplane has the best performance?









# Response Time and Throughput

- Response time
  - How long it takes to do a task
- Throughput
  - Total work done per unit time
    - e.g., tasks/transactions/... per hour
- How are response time and throughput affected by
  - Replacing the processor with a faster version?
  - Adding more processors?
- We'll focus on response time for now...

#### Relative Performance

- Define Performance = 1/Execution Time
- "X is n times faster than Y"

$$\frac{\text{Performance}_{X}}{\text{Performance}_{Y}} = \frac{\text{Execution time}_{Y}}{\text{Execution time}_{X}} = n$$

- Example: time taken to run a program
  - 10s on A, 15s on B
  - Execution Time<sub>B</sub> / Execution Time<sub>A</sub>
     = 15s / 10s = 1.5
  - So A is 1.5 times faster than B
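
The ratio above can be checked numerically. This is a minimal sketch; the function name `relative_performance` is an illustrative choice, not something defined in the slides.

```python
def relative_performance(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that X is n times faster than Y."""
    # Performance = 1 / execution time, so the performance ratio
    # is the inverse of the execution-time ratio.
    return exec_time_y / exec_time_x


# Values from the example: 10 s on A, 15 s on B.
n = relative_performance(exec_time_x=10.0, exec_time_y=15.0)
print(f"A is {n:.1f} times faster than B")
```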

# Measuring Execution Time

#### Elapsed time

- Total response time, including all aspects
  - Processing, I/O, OS overhead, idle time
- Determines system performance

#### CPU time

- Time spent processing a given job
  - Discounts I/O time, other jobs' shares
- Comprises user CPU time and system CPU time
- Different programs are affected differently by CPU and system performance

# **CPU Clocking**

Operation of digital hardware governed by a constant-rate clock



- Clock period: duration of a clock cycle
  - e.g.,  $250ps = 0.25ns = 250 \times 10^{-12}s$
- Clock frequency (rate): cycles per second
  - e.g., 4.0GHz = 4000MHz =  $4.0 \times 10^9$ Hz
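
Clock period and clock rate are reciprocals of each other. The short sketch below converts the 250 ps period from the example into a rate; the helper name `clock_rate_hz` is hypothetical.

```python
def clock_rate_hz(clock_period_s: float) -> float:
    """Clock rate (cycles per second) is the reciprocal of the clock period."""
    return 1.0 / clock_period_s


period_s = 250e-12                 # 250 ps
rate_hz = clock_rate_hz(period_s)  # cycles per second
print(f"{rate_hz / 1e9:.1f} GHz")
```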

#### **CPU Time**

$$\begin{aligned} \text{CPU Time} &= \text{CPU Clock Cycles} \times \text{Clock Cycle Time} \\ &= \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}} \end{aligned}$$

#### Performance improved by

- Reducing number of clock cycles
- Increasing clock rate
- Hardware designer must often trade off clock rate against cycle count

# **CPU Time Example**

- Computer A: 2GHz clock, 10s CPU time
- Designing Computer B
  - Aim for 6s CPU time
  - Can use a faster clock, but this causes 1.2 × as many clock cycles
- How fast must Computer B's clock be?

$$\begin{aligned} \text{Clock Rate}_{\text{B}} &= \frac{\text{Clock Cycles}_{\text{B}}}{\text{CPU Time}_{\text{B}}} = \frac{1.2 \times \text{Clock Cycles}_{\text{A}}}{6\text{s}} \\ \text{Clock Cycles}_{\text{A}} &= \text{CPU Time}_{\text{A}} \times \text{Clock Rate}_{\text{A}} \\ &= 10\text{s} \times 2\text{GHz} = 20 \times 10^9 \\ \text{Clock Rate}_{\text{B}} &= \frac{1.2 \times 20 \times 10^9}{6\text{s}} = \frac{24 \times 10^9}{6\text{s}} = 4\text{GHz} \end{aligned}$$
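
The same derivation can be reproduced step by step in code. This is a sketch using the numbers from the example; the variable names are illustrative assumptions.

```python
# Computer A: known clock rate and CPU time.
clock_rate_a_hz = 2e9    # 2 GHz
cpu_time_a_s = 10.0

# Clock cycles = CPU time x clock rate.
clock_cycles_a = cpu_time_a_s * clock_rate_a_hz

# Computer B: needs 1.2x the cycles and targets 6 s of CPU time.
clock_cycles_b = 1.2 * clock_cycles_a
cpu_time_b_s = 6.0

# Clock rate = clock cycles / CPU time.
clock_rate_b_hz = clock_cycles_b / cpu_time_b_s
print(f"Computer B needs a {clock_rate_b_hz / 1e9:.1f} GHz clock")
```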

### Instruction Count and CPI

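$$\begin{aligned} \text{Clock Cycles} &= \text{Instruction Count} \times \text{Cycles per Instruction} \\ \text{CPU Time} &= \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} \\ &= \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}} \end{aligned}$$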

- Instruction Count for a program
  - Determined by program, Instruction Set Architecture (ISA) and compiler
- Average cycles per instruction
  - Determined by CPU hardware
  - If different instructions have different CPI
    - Average CPI affected by instruction mix

# **CPI Example**

- Computer A: Cycle Time = 250ps, CPI = 2.0
- Computer B: Cycle Time = 500ps, CPI = 1.2
- Same program and ISA, so both execute the same instruction count I
- Which is faster, and by how much?

$$\begin{aligned} \text{CPU Time}_{A} &= \text{Instruction Count} \times \text{CPI}_{A} \times \text{Cycle Time}_{A} \\ &= \text{I} \times 2.0 \times 250\text{ps} = \text{I} \times 500\text{ps} \\ \text{CPU Time}_{B} &= \text{Instruction Count} \times \text{CPI}_{B} \times \text{Cycle Time}_{B} \\ &= \text{I} \times 1.2 \times 500\text{ps} = \text{I} \times 600\text{ps} \\ \frac{\text{CPU Time}_{B}}{\text{CPU Time}_{A}} &= \frac{\text{I} \times 600\text{ps}}{\text{I} \times 500\text{ps}} = 1.2 \end{aligned}$$

A is faster, by a factor of 1.2.
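
Because both computers execute the same instruction count, it cancels in the ratio, and only CPI × cycle time needs to be compared. A minimal sketch of that comparison (the function name is an assumption made for illustration):

```python
def time_per_instruction_s(cpi: float, cycle_time_s: float) -> float:
    """Average time per instruction = CPI x clock cycle time."""
    return cpi * cycle_time_s


time_a = time_per_instruction_s(cpi=2.0, cycle_time_s=250e-12)  # I x 500 ps
time_b = time_per_instruction_s(cpi=1.2, cycle_time_s=500e-12)  # I x 600 ps

# The instruction count I is identical for A and B, so it cancels here.
ratio = time_b / time_a
faster = "A" if time_a < time_b else "B"
print(f"{faster} is faster by a factor of {ratio:.1f}")
```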

#### **CPI in More Detail**

 If different instruction classes take different numbers of cycles

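$$\text{Clock Cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{Instruction Count}_i)$$

Weighted average CPI

$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$$

The fraction Instruction Count<sub>i</sub> / Instruction Count is the relative frequency of class i.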

# **CPI Example**

Alternative compiled code sequences using instructions in classes A, B, C

| Class            | A | B | C |
|------------------|---|---|---|
| CPI for class    | 1 | 2 | 3 |
| IC in sequence 1 | 2 | 1 | 2 |
| IC in sequence 2 | 4 | 1 | 1 |
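
The table can be evaluated with the clock-cycle sum and weighted-average CPI formulas given earlier. The sketch below does so for both sequences; the dictionary layout and the helper name `summarize` are illustrative assumptions.

```python
# CPI per instruction class and instruction counts, copied from the table.
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}
SEQUENCE_1 = {"A": 2, "B": 1, "C": 2}
SEQUENCE_2 = {"A": 4, "B": 1, "C": 1}


def summarize(ic_by_class):
    """Return (instruction count, clock cycles, average CPI) for a sequence."""
    instruction_count = sum(ic_by_class.values())
    # Clock cycles = sum over classes of CPI_i x IC_i.
    clock_cycles = sum(CPI_BY_CLASS[c] * n for c, n in ic_by_class.items())
    return instruction_count, clock_cycles, clock_cycles / instruction_count


for name, seq in (("sequence 1", SEQUENCE_1), ("sequence 2", SEQUENCE_2)):
    ic, cycles, avg_cpi = summarize(seq)
    print(f"{name}: IC = {ic}, clock cycles = {cycles}, average CPI = {avg_cpi:.1f}")
```

Sequence 2 executes more instructions (6 versus 5) but needs fewer clock cycles (9 versus 10), so its average CPI is lower (1.5 versus 2.0).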

# Performance Summary

$$CPU Time = \frac{Instructions}{Program} \times \frac{Clock \ cycles}{Instruction} \times \frac{Seconds}{Clock \ cycle}$$

#### Performance depends on

- Algorithm: affects IC, possibly CPI
- Programming language: affects IC, CPI
- Compiler: affects IC, CPI
- Instruction Set Architecture (ISA): affects IC, CPI, T<sub>c</sub>

## **Power Trends**



# Reducing Power

- Suppose a new CPU has
  - 85% of capacitive load of old CPU
  - 15% voltage and 15% frequency reduction

$$\frac{P_{\text{new}}}{P_{\text{old}}} = \frac{C_{\text{old}} \times 0.85 \times (V_{\text{old}} \times 0.85)^2 \times F_{\text{old}} \times 0.85}{C_{\text{old}} \times V_{\text{old}}^2 \times F_{\text{old}}} = 0.85^4 = 0.52$$
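
The ratio above uses the fact that power scales with capacitive load × voltage² × frequency. A small sketch under that assumption, with each argument given as a new/old scale factor (the function name is hypothetical):

```python
def power_ratio(c_scale: float, v_scale: float, f_scale: float) -> float:
    """P_new / P_old when power is proportional to C x V^2 x F."""
    return c_scale * v_scale ** 2 * f_scale


# 85% capacitive load, 15% lower voltage, 15% lower frequency.
ratio = power_ratio(c_scale=0.85, v_scale=0.85, f_scale=0.85)
print(f"P_new / P_old = {ratio:.2f}")
```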

- The power wall
  - We can't reduce voltage further
  - We can't remove more heat
- How else can we improve performance?

# **Uni-processor Performance**



# Multiprocessors

- Multi-core microprocessors
  - More than one processor per chip
- Requires explicitly parallel programming
  - Compare with instruction level parallelism
    - Hardware executes multiple instructions at once
    - Hidden from the programmer
  - Hard to do
    - Programming for performance
    - Load balancing
    - Optimizing communication and synchronization

## SPEC CPU Benchmark

- Programs used to measure performance
  - Supposedly typical of actual workload
- Standard Performance Evaluation Corp (SPEC)
  - Develops benchmarks for CPU, I/O, Web, ...
- SPEC CPU2017
  - SPEC speed 2017 Integer #10 (integer)
  - CFP2006 #13 (floating-point)
  - Elapsed time to execute a selection of programs
    - Negligible I/O, so focuses on CPU performance
  - Normalize relative to reference machine

## SPEC CPU Benchmark

SPEC ratio

SPEC<sub>ratio</sub> = Execution time (CPU<sub>Reference</sub>) / Execution time (CPU<sub>Evaluated</sub>)

The SPECspeed 2017 score is summarized as the geometric mean of the SPEC ratios:

$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_{i}}$$

# SPEC speed 2017 Integer benchmarks on a 1.8 GHz Intel Xeon E5-2650L

| Description                                                          | Name      | Instruction<br>Count × 10^9 | CPI  | Clock cycle time<br>(seconds × 10^-9) | Execution<br>Time<br>(seconds) | Reference<br>Time<br>(seconds) | SPECratio |
|----------------------------------------------------------------------|-----------|-----------------------------|------|------------------------------------|--------------------------------|--------------------------------|-----------|
| Perl interpreter                                                     | perlbench | 2684                        | 0.42 | 0.556                              | 627                            | 1774                           | 2.83      |
| GNU C compiler                                                       | gcc       | 2322                        | 0.67 | 0.556                              | 863                            | 3976                           | 4.61      |
| Route planning                                                       | mcf       | 1786                        | 1.22 | 0.556                              | 1215                           | 4721                           | 3.89      |
| Discrete Event<br>simulation -<br>computer network                   | omnetpp   | 1107                        | 0.82 | 0.556                              | 507                            | 1630                           | 3.21      |
| XML to HTML conversion via XSLT                                      | xalancbmk | 1314                        | 0.75 | 0.556                              | 549                            | 1417                           | 2.58      |
| Video compression                                                    | x264      | 4488                        | 0.32 | 0.556                              | 813                            | 1763                           | 2.17      |
| Artificial Intelligence:<br>alpha-beta tree<br>search (Chess)        | deepsjeng | 2216                        | 0.57 | 0.556                              | 698                            | 1432                           | 2.05      |
| Artificial Intelligence:<br>Monte Carlo tree<br>search (Go)          | leela     | 2236                        | 0.79 | 0.556                              | 987                            | 1703                           | 1.73      |
| Artificial Intelligence:<br>recursive solution<br>generator (Sudoku) | exchange2 | 6683                        | 0.46 | 0.556                              | 1718                           | 2939                           | 1.71      |
| General data compression                                             | xz        | 8533                        | 1.32 | 0.556                              | 6290                           | 6182                           | 0.98      |
| Geometric mean                                                       |           |                             |      |                                    |                                |                                | 2.36      |
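
The last row can be reproduced from the execution and reference times in the table. The sketch below recomputes each SPEC ratio and takes the geometric mean; the list layout and function names are assumptions made for illustration.

```python
import math

# (name, execution time in s, reference time in s), copied from the table.
RESULTS = [
    ("perlbench", 627, 1774),
    ("gcc", 863, 3976),
    ("mcf", 1215, 4721),
    ("omnetpp", 507, 1630),
    ("xalancbmk", 549, 1417),
    ("x264", 813, 1763),
    ("deepsjeng", 698, 1432),
    ("leela", 987, 1703),
    ("exchange2", 1718, 2939),
    ("xz", 6290, 6182),
]


def spec_ratio(execution_time_s: float, reference_time_s: float) -> float:
    """SPEC ratio = reference machine time / evaluated machine time."""
    return reference_time_s / execution_time_s


def geometric_mean(values) -> float:
    """n-th root of the product of n values."""
    values = list(values)
    return math.prod(values) ** (1.0 / len(values))


ratios = [spec_ratio(e, r) for _, e, r in RESULTS]
score = geometric_mean(ratios)
print(f"Geometric mean of SPEC ratios: {score:.2f}")
```

The result is about 2.36, in agreement with the table.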

#### **SPEC Power Benchmark**

- Power consumption of server at different workload levels
  - Performance: ssj\_ops/sec
  - Power: Watts (Joules/sec)

$$\text{Overall ssj\_ops per Watt} = \left(\sum_{i=0}^{10} \text{ssj\_ops}_i\right) \Big/ \left(\sum_{i=0}^{10} \text{power}_i\right)$$

## SPEC power\_ssj 2008 for Xeon E5-2650L

| Target<br>Load % | Performance<br>(ssj_ops) | Average<br>Power<br>(watts) |
|------------------|--------------------------|-----------------------------|
| 100%             | 4,864,136                | 347                         |
| 90%              | 4,389,196                | 312                         |
| 80%              | 3,905,724                | 278                         |
| 70%              | 3,418,737                | 241                         |
| 60%              | 2,925,811                | 212                         |
| 50%              | 2,439,017                | 183                         |
| 40%              | 1,951,394                | 160                         |
| 30%              | 1,461,411                | 141                         |
| 20%              | 974,045                  | 128                         |
| 10%              | 485,973                  | 115                         |
| 0%               | 0                        | 48                          |
| Overall Sum      | 26,815,444               | 2,165                       |
| ∑ssj_ops / ∑power | 12,385                  |                             |
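
The overall figure is the sum of ssj_ops over all eleven load levels divided by the sum of the average power readings. A minimal sketch using the table's values (the variable names are illustrative):

```python
# Target loads 100% down to 0%, copied from the table.
SSJ_OPS = [4_864_136, 4_389_196, 3_905_724, 3_418_737, 2_925_811,
           2_439_017, 1_951_394, 1_461_411, 974_045, 485_973, 0]
AVG_POWER_W = [347, 312, 278, 241, 212, 183, 160, 141, 128, 115, 48]

total_ops = sum(SSJ_OPS)
total_power_w = sum(AVG_POWER_W)

# Overall ssj_ops per watt = sum of ssj_ops / sum of power.
overall = total_ops / total_power_w
print(f"sum ssj_ops = {total_ops:,}, sum power = {total_power_w:,} W")
print(f"overall ssj_ops per watt = {overall:,.1f}")
```

The quotient is roughly 12,385.9; the table reports it truncated to 12,385.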

## Pitfall: Amdahl's Law

 Improving an aspect of a computer and expecting a proportional improvement in overall performance

$$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$

- Example: multiply accounts for 80s/100s
  - How much improvement in multiply performance to get 5× overall?

$$20 = \frac{80}{n} + 20$$

- Can't be done! The 20s of unaffected time already equals the 20s target, so 80/n would have to be 0

Corollary: make the common case fast
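
The multiply example can be explored numerically: however large the improvement factor, the improved time stays above the 20s of unaffected time, so the overall speedup never reaches 5×. This is a sketch; `improved_time` is a hypothetical helper that implements the formula above.

```python
def improved_time(t_affected: float, t_unaffected: float, factor: float) -> float:
    """Amdahl's Law: T_improved = T_affected / improvement factor + T_unaffected."""
    return t_affected / factor + t_unaffected


T_TOTAL = 100.0       # seconds
T_MULTIPLY = 80.0     # seconds spent in multiply (affected part)
T_OTHER = T_TOTAL - T_MULTIPLY

for n in (2, 10, 100, 1_000_000):
    t_new = improved_time(T_MULTIPLY, T_OTHER, n)
    print(f"multiply {n:>9,}x faster: time = {t_new:8.4f} s, "
          f"overall speedup = {T_TOTAL / t_new:.4f}x")
```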

# Fallacy: Low Power at Idle

- Look back at the Xeon E5-2650L power benchmark
  - At 100% load: 347W
  - At 50% load: 183W (53%)
  - At 10% load: 115W (33%)
- Google data centre
  - Mostly operates at 10% to 50% load
  - At 100% load less than 1% of the time
- Consider designing processors to make power proportional to load

#### Pitfall: MIPS as a Performance Metric

- MIPS: Million Instructions Per Second
  - Doesn't account for
    - Differences in ISAs between computers
    - Differences in complexity between instructions

$$\begin{aligned} \text{MIPS} &= \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} \\ &= \frac{\text{Instruction count}}{\frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6} \end{aligned}$$

CPI varies between programs on a given CPU

#### Pitfall: MIPS as a Performance Metric

Example:

| Measurement       | Computer A | Computer B |
|-------------------|------------|------------|
| Instruction Count | 10 billion | 8 billion  |
| Clock rate        | 4 GHz      | 4 GHz      |
| CPI               | 1.0        | 1.1        |

- Which computer has the higher MIPS rating?
- Which computer is faster?

#### Pitfall: MIPS as a Performance Metric

- MIPS = clock rate / (CPI × 10<sup>6</sup>)
  - MIPS<sub>A</sub> = $4 \times 10^9 / (1.0 \times 10^6) = 4000$
  - MIPS<sub>B</sub> = $4 \times 10^9 / (1.1 \times 10^6) = 3636$
- CPU time = IC × CPI / clock rate
  - CPU time<sub>A</sub> = $10 \times 10^9 \times 1.0 / (4 \times 10^9) = 2.5$ s
  - CPU time<sub>B</sub> = $8 \times 10^9 \times 1.1 / (4 \times 10^9) = 2.2$ s (faster)
- Computer B is faster than Computer A, although it has the lower MIPS rating.
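
The comparison can be reproduced in a few lines. The sketch below computes both metrics for the two computers from the example table; the function names `mips` and `cpu_time_s` are illustrative assumptions.

```python
def mips(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def cpu_time_s(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# (instruction count, clock rate in Hz, CPI) from the example table.
COMPUTERS = {
    "A": (10e9, 4e9, 1.0),
    "B": (8e9, 4e9, 1.1),
}

results = {}
for name, (ic, rate, cpi) in COMPUTERS.items():
    results[name] = (mips(rate, cpi), cpu_time_s(ic, cpi, rate))
    print(f"Computer {name}: MIPS = {results[name][0]:.0f}, "
          f"CPU time = {results[name][1]:.1f} s")
```

Computer A has the higher MIPS rating, yet Computer B finishes the program sooner because it executes fewer instructions.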

# **Concluding Remarks**

- Cost/performance is improving
  - Due to underlying technology development
- Hierarchical layers of abstraction
  - In both hardware and software
- Instruction set architecture (ISA)
  - The hardware/software interface
- Execution time: the best performance measure
- Power is a limiting factor
  - Use parallelism to improve performance