Classwork 3

Question 1:
In this question, we will assume that moving the logic that evaluates the branch condition and target address from the ALU stage of the 5-stage pipeline to the ID stage increases the clock cycle time by 5%. We will also assume that 20% of the instructions executing on the pipeline are branch instructions and that 60% of the branches are taken.

(a) Assume that the CPI = 2.12 when the branch condition and target are evaluated in the ALU stage. Will resolving the branch condition/target in the ID stage result in a more efficient execution?
(b) In addition to increasing the cycle time, this change creates a new type of data hazard that cannot be resolved with the forwarding paths discussed so far. For example, the following sequence of instructions will cause a hazard:

```
add $6, $2, $2
beq $5, $6, L
```

Explain why the above sequence of instructions will cause a hazard, and describe the forwarding path(s) needed to resolve that hazard.
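
For part (a), the comparison comes down to average time per instruction, which is CPI multiplied by the clock cycle time. The sketch below is one way to set up that arithmetic. It assumes a predict-not-taken pipeline in which only taken branches pay a penalty, and it assumes that resolving in ID saves exactly one flushed instruction per taken branch; neither assumption is stated in the question, and the extra stalls from the hazard in part (b) are ignored. The function names are illustrative only.

```python
def exec_time_per_instr(cpi, cycle_time):
    """Average time per instruction = CPI x clock cycle time."""
    return cpi * cycle_time


def cpi_after_early_resolution(cpi_alu, branch_frac, taken_frac, cycles_saved=1):
    """CPI once branches resolve in ID.

    Assumption (not from the question): only taken branches pay a penalty,
    and each taken branch now flushes `cycles_saved` fewer instructions.
    """
    return cpi_alu - branch_frac * taken_frac * cycles_saved


# Numbers from the question; the ALU-stage cycle time is normalized to 1.0.
cpi_alu = 2.12
cpi_id = cpi_after_early_resolution(cpi_alu, branch_frac=0.20, taken_frac=0.60)

time_alu = exec_time_per_instr(cpi_alu, cycle_time=1.00)
time_id = exec_time_per_instr(cpi_id, cycle_time=1.05)  # 5% slower clock

print(f"ALU-stage resolution: {time_alu:.3f} time units per instruction")
print(f"ID-stage resolution:  {time_id:.3f} time units per instruction")
```

Whichever design yields the smaller product is the more efficient one under these assumptions; a different penalty model changes `cycles_saved` but not the structure of the comparison.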

Question 2: Consider a five-stage pipelined processor in which the branch target address is determined in the ID stage (second stage) and the branch condition is determined in the MEM stage (fourth stage). Assume that the instruction mix executing on the processor contains 20% branch instructions, 70% of which are taken. Assume also that the processor’s CPI is 1.9 without accounting for the overhead of dealing with control hazards (but accounting for other hazards). Compute the CPI when each of the following techniques is used to avoid control hazards.

(a) Always predict “branch not taken” and take a corrective action if the prediction is wrong.
(b) Always predict “branch taken” and take a corrective action if the prediction is wrong.
(c) Assume a branch predictor with an 80% accuracy. That is, the predictor makes the right prediction 80% of the time and the wrong prediction 20% of the time. For this part, assume that a predictor always makes a prediction and ignore the cases where the predictor cannot make a prediction because of a BTB collision or invalid entry (branch seen for the first time).
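
All three parts share one structure: the CPI is the base CPI plus the branch fraction multiplied by the expected penalty per branch. The sketch below sets that up under an assumed penalty model: a redirect costs one cycle once the target is known (ID is one stage after IF), a misprediction costs three cycles (MEM is three stages after IF), and a correct prediction from the predictor in part (c) costs nothing. These penalty values and all names in the code are assumptions made for illustration; a different flush convention would change the constants but not the formula.

```python
def cpi_with_control_hazards(base_cpi, branch_frac, expected_penalty):
    """CPI = base CPI + (fraction of branches) x (expected stall cycles per branch)."""
    return base_cpi + branch_frac * expected_penalty


# Values given in the question.
BASE_CPI = 1.9
BRANCH_FRAC = 0.20
TAKEN_FRAC = 0.70
PREDICTOR_ACCURACY = 0.80

# Assumed penalty model (not stated in the question).
TARGET_DELAY = 1     # cycles lost waiting for the target, known in ID
CONDITION_DELAY = 3  # cycles lost on a misprediction, detected in MEM

# (a) Predict not taken: only taken branches are mispredicted.
cpi_not_taken = cpi_with_control_hazards(
    BASE_CPI, BRANCH_FRAC, TAKEN_FRAC * CONDITION_DELAY)

# (b) Predict taken: taken branches wait for the target,
#     not-taken branches are mispredicted.
cpi_taken = cpi_with_control_hazards(
    BASE_CPI, BRANCH_FRAC,
    TAKEN_FRAC * TARGET_DELAY + (1 - TAKEN_FRAC) * CONDITION_DELAY)

# (c) Predictor: only wrong predictions cost cycles.
cpi_predictor = cpi_with_control_hazards(
    BASE_CPI, BRANCH_FRAC, (1 - PREDICTOR_ACCURACY) * CONDITION_DELAY)

for name, cpi in [("predict not taken", cpi_not_taken),
                  ("predict taken", cpi_taken),
                  ("80% predictor", cpi_predictor)]:
    print(f"{name}: CPI = {cpi:.2f}")
```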

Question 3: Consider the following loop, which computes $a[i] = b[i] + a[i]$ for $i = 1, 2, \dots$

```
L:  lw   $t0, 1000($s4)    # load a[i]
    lw   $t1, 4000($s4)    # load b[i]
    addi $s4, $s4, 4       # advance to the next element
    add  $t0, $t0, $t1     # temp = a[i] + b[i]
    sw   $t0, 996($s4)     # store temp in a[i] (offset 996 compensates for the addi)
    bne  $s4, $s6, L       # repeat until i reaches its maximum value
```

(a) Using a table similar to the one below, show the scheduling of one iteration of the loop on a superscalar (static) 2-issue architecture with two pipelines, one for ALU/branch instructions and the other for lw/sw instructions (see Figure 4.69 in the textbook). Assume that the architecture has forwarding and data hazard hardware.

<table>
<thead>
<tr>
<th>Cycle</th>
<th>ALU/branch pipeline</th>
<th>Load/store pipeline</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cycle 1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Cycle 2</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Cycle 3</td>
<td></td>
<td></td>
</tr>
<tr>
<td>....</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

(b) How many cycles will it take to execute the above loop 1000 times on the 2-issue architecture assuming perfect branch prediction (always correct prediction)?

(c) If the branch is resolved when the instruction is in the ID stage (second stage), how many cycles will it take to execute the above loop 1000 times on the 2-issue architecture assuming no branch prediction (always predict “branch is not taken”)?

(d) How many cycles will it take to execute the same loop 1000 times on a single issue architecture (the usual 5-stage pipeline) assuming perfect branch prediction?

(e) If the branch is resolved when the instruction is in the ID stage (second stage), how many cycles will it take to execute the above loop 1000 times on the single issue architecture assuming no branch prediction?
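
Parts (b) through (e) all reduce to the same count: the cycles one iteration occupies in steady state, multiplied by the number of iterations, plus the cycles needed to drain the pipeline after the last instruction issues. The helper below is a minimal sketch of that bookkeeping. The function name and the example numbers are placeholders rather than the schedule for this loop, and whether the drain cycles are counted (and whether the final iteration pays the branch stall) is a convention that should be stated with the answer.

```python
def total_cycles(iterations, issue_cycles_per_iter,
                 stall_cycles_per_iter=0, pipeline_depth=5):
    """Total cycles for a loop on an in-order pipeline.

    issue_cycles_per_iter: cycles (schedule rows) one iteration occupies.
    stall_cycles_per_iter: extra bubbles per iteration, e.g. a branch flush.
    pipeline_depth - 1:    cycles for the last instruction to leave the pipeline.
    """
    steady_state = iterations * (issue_cycles_per_iter + stall_cycles_per_iter)
    drain = pipeline_depth - 1
    return steady_state + drain


# Placeholder schedule: a hypothetical iteration that occupies 4 issue cycles.
print(total_cycles(1000, issue_cycles_per_iter=4))
print(total_cycles(1000, issue_cycles_per_iter=4, stall_cycles_per_iter=1))
```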

Question 4:

(a) Assuming that the loop in Question 3 executes an even number of times, show the loop after unrolling it once (reducing the number of iterations by a factor of 2).

(b) Show the scheduling of one iteration of the unrolled loop on a superscalar (static) 2-issue architecture with two pipelines, one for ALU/branch instructions and the other for lw/sw instructions. Assume that the architecture has forwarding and data hazard hardware.

(c) How many cycles will it take to execute the unrolled loop 1000 times on the 2-issue architecture assuming perfect branch prediction (always correct prediction)?