Question 1: Consider a five-stage pipelined processor in which the branch target address is determined in the ID stage (second stage) and the branch condition is determined in the MEM stage (fourth stage). Assume that the instruction mix executing on the processor contains 20% branch instructions, 70% of which are taken. Assume also that the processor’s CPI is 1.9 without accounting for the overhead of control hazards (but accounting for other hazards). Compute the CPI when each of the following techniques is used to deal with control hazards.

(a) Always predict “branch not taken” and take a corrective action if the prediction is wrong
(b) Always predict “branch taken” and take a corrective action if the prediction is wrong (see slide 70)
(c) Assume a branch predictor with an 80% accuracy. That is, the predictor makes the right prediction 80% of the time and the wrong prediction 20% of the time. For this part, assume that a predictor always makes a prediction and ignore the cases where the predictor cannot make a prediction because of a BTB collision or invalid entry (branch seen for the first time).
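
The bookkeeping for all three parts has the same shape: the base CPI plus the branch fraction times the expected penalty per branch. The following Python sketch shows it. The function name and the `(probability, penalty)` event list are illustrative, not from the course material. The 3-cycle penalty in the example is an assumption: it supposes that a branch whose condition resolves in MEM has three younger instructions behind it that must be flushed. Check the penalty values against the course slides before relying on the result.

```python
def cpi_with_branch_penalties(base_cpi, branch_fraction, events):
    """Return CPI including control-hazard overhead.

    events: list of (probability_among_branches, penalty_cycles) pairs,
            one pair per outcome that costs extra cycles.
    """
    expected_penalty_per_branch = sum(p * cycles for p, cycles in events)
    return base_cpi + branch_fraction * expected_penalty_per_branch


# Part (a) under the ASSUMED 3-cycle flush penalty: only taken branches
# (70% of branches) are mispredicted by "predict not taken".
cpi_a = cpi_with_branch_penalties(1.9, 0.20, [(0.70, 3)])
print(round(cpi_a, 2))
```

Parts (b) and (c) fit the same function by listing each costly outcome as its own event, for example a correctly predicted taken branch that still waits for the target address.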

Question 2: Consider the following loop which computes a[i]=b[i]+a[i], i=1,2,…:

L: lw $t0, 1000($s4)  /* load a[i] */
    lw $t1, 4000($s4)  /* load b[i] */
    addi $s4, $s4, 4
    add $t0, $t0, $t1  /* compute temp = a[i] + b[i] */
    sw $t0, 996($s4)  /* store temp in a[i] */
    bne $s4, $s6, L  /* repeat until i reaches a maximum value */

(a) Using a table similar to the one below, show the scheduling of one iteration of the loop on a superscalar (static) 2-issue architecture with two pipelines, one for ALU/branch instructions and the other for lw/sw instructions (see Figure 4.69 in the textbook). Assume that the architecture has forwarding and data hazard hardware.

<table>
<thead>
<tr>
<th>Cycle</th>
<th>ALU/branch pipeline</th>
<th>Load/store pipeline</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cycle 1</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Cycle 2</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Cycle 3</td>
<td></td>
<td></td>
</tr>
<tr>
<td>....</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

(b) How many cycles will it take to execute the above loop 1000 times on the 2-issue architecture assuming perfect branch prediction (always correct prediction)?
(c) If the branch is resolved when the instruction is in the ID stage (second stage), how many cycles will it take to execute the above loop 1000 times on the 2-issue architecture assuming no branch prediction (always predict “branch is not taken”)?
(d) How many cycles will it take to execute the same loop 1000 times on a single issue architecture (the usual 5-stage pipeline) assuming perfect branch prediction?
(e) If the branch is resolved when the instruction is in the ID stage (second stage), how many cycles will it take to execute the above loop 1000 times on the single issue architecture assuming no branch prediction?
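
Parts (b) through (e) all reduce to the same count once the per-iteration schedule is known. The Python sketch below shows one common accounting. The function and parameter names are hypothetical, and the numbers in the example are made up for illustration rather than taken from the loop above. It assumes the pipeline needs (stages - 1) extra cycles to drain and that each misprediction adds a fixed penalty; other conventions ignore the drain cycles.

```python
def loop_cycles(cycles_per_iteration, iterations,
                mispredictions=0, penalty=0, pipeline_stages=5):
    """Total cycles for a loop under a simple accounting model.

    cycles_per_iteration: issue cycles one iteration occupies in the schedule
    mispredictions:       number of mispredicted branches over the whole run
    penalty:              cycles lost per misprediction
    pipeline_stages:      depth of the pipeline (adds stages - 1 drain cycles)
    """
    issue = cycles_per_iteration * iterations
    control = mispredictions * penalty
    drain = pipeline_stages - 1
    return issue + control + drain


# Hypothetical example: 4 cycles per iteration, 10 iterations, and a loop
# branch mispredicted on 9 of them with a 1-cycle penalty.
print(loop_cycles(4, 10, mispredictions=9, penalty=1))  # 40 + 9 + 4 = 53
```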

Question 3:
(a) Assuming that the loop in Question 2 executes an even number of times, show the loop after unrolling it twice (reducing the number of iterations by a factor of 2).
(b) Show the scheduling of one iteration of the unrolled loop on a superscalar (static) 2-issue architecture with two pipelines, one for ALU/branch instructions and the other for lw/sw instructions. Assume that the architecture has forwarding and data hazard hardware.
(c) How many cycles will it take to execute the unrolled loop 1000 times on the 2-issue architecture assuming perfect branch prediction (always correct prediction)?

Question 4:
In this question, consider the following series of address references (given as word addresses):
20, 33, 21, 4, 21, 39, 21, 17, 28, 48, 45, 33, 9, 4, 22, 44
For each of the following cache organizations, show the content of the cache after each memory reference and indicate whether the reference is a hit or a miss. Use [tag, M(address), ...] to describe the content of each entry (see the examples on the class progress page). For example, [4,M(46)] indicates that the entry contains tag=4 and the data from memory location 46. Similarly, [4,M(46),M(47)] indicates that the entry contains a block of two words from memory locations 46 and 47. As discussed in class, avoid redrawing the cache after each reference: draw only one cache and show that an entry E1 is replaced by E2 by crossing out E1 and writing E2 next to it. Assume Least Recently Used replacement and assume that the cache is initially empty (invalid entries).
   a) a direct mapped cache with 16 one-word blocks.
   b) a direct mapped cache with two-word blocks and total size of 16 words.
   c) a 2-way set-associative cache with two-word blocks and total size of 16 words.
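
A hand trace can be checked against a small simulator. The Python sketch below is one possible model, not the course's reference solution. It assumes word addresses, a block number of address // words_per_block, a set index of block mod number-of-sets, and LRU replacement within each set. The function name and return format are illustrative.

```python
from collections import OrderedDict


def simulate_cache(word_addresses, total_words, words_per_block, ways):
    """Simulate a cache with LRU replacement on a trace of word addresses.

    Returns (results, sets): results is a list of (address, "hit"/"miss");
    sets is a list of OrderedDicts mapping tag -> word addresses in the block,
    ordered from least to most recently used.
    """
    num_blocks = total_words // words_per_block
    num_sets = num_blocks // ways          # direct mapped when ways == 1
    sets = [OrderedDict() for _ in range(num_sets)]
    results = []
    for addr in word_addresses:
        block = addr // words_per_block
        index = block % num_sets
        tag = block // num_sets
        entries = sets[index]
        if tag in entries:
            entries.move_to_end(tag)       # mark as most recently used
            results.append((addr, "hit"))
        else:
            if len(entries) == ways:
                entries.popitem(last=False)  # evict the least recently used
            first = block * words_per_block
            entries[tag] = [first + k for k in range(words_per_block)]
            results.append((addr, "miss"))
    return results, sets


trace = [20, 33, 21, 4, 21, 39, 21, 17, 28, 48, 45, 33, 9, 4, 22, 44]
results, sets = simulate_cache(trace, total_words=16, words_per_block=1, ways=1)
for addr, outcome in results:
    print(addr, outcome)
```

Changing `words_per_block` and `ways` covers organizations (b) and (c). Each entry in `sets` corresponds to the [tag, M(address), ...] notation used above.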

Question 5:
Compute the total number of bits required to implement each of the caches specified below. Assume that each memory word is 32 bits long and that the memory is byte addressable with 32-bit addresses. Note that the number of bits needed to implement the cache is the total amount of memory needed for storing all of the data, tags, and valid bits (ignore any other bits).

<table>
<thead>
<tr>
<th>Configuration</th>
<th>(a)</th>
<th>(b)</th>
<th>(c)</th>
<th>(d)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Cache size</td>
<td>512KB</td>
<td>512KB</td>
<td>512KB</td>
<td>512KB</td>
</tr>
<tr>
<td>Block size</td>
<td>one word</td>
<td>4 words</td>
<td>one word</td>
<td>8 words</td>
</tr>
<tr>
<td>Associativity</td>
<td>direct mapped</td>
<td>direct mapped</td>
<td>4-way</td>
<td>2-way</td>
</tr>
<tr>
<td># of bits for data array</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td># of bits for tag array</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td># of bits for valid bits</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Total # of bits</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
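
The rows of the table can be computed mechanically from the address breakdown. The Python sketch below shows the arithmetic under these assumptions: "cache size" means the data capacity, all sizes are powers of two, a word is 4 bytes, and each block has one tag and one valid bit. The function name and dictionary keys are illustrative. The example uses a 16 KiB cache, not one of the configurations in the table.

```python
def cache_bits(data_bytes, words_per_block, ways, address_bits=32, word_bytes=4):
    """Return the bit counts for the data array, tag array, and valid bits."""
    block_bytes = words_per_block * word_bytes
    num_blocks = data_bytes // block_bytes
    num_sets = num_blocks // ways
    # For powers of two, bit_length() - 1 equals log2.
    offset_bits = block_bytes.bit_length() - 1   # byte offset within a block
    index_bits = num_sets.bit_length() - 1       # selects the set
    tag_bits = address_bits - index_bits - offset_bits
    data = data_bytes * 8
    tags = num_blocks * tag_bits
    valid = num_blocks
    return {"data": data, "tag": tags, "valid": valid,
            "total": data + tags + valid}


# Example: 16 KiB of data, 4-word blocks, direct mapped.
# 1024 blocks, 10 index bits, 4 offset bits, 18 tag bits.
print(cache_bits(16 * 1024, words_per_block=4, ways=1))
# {'data': 131072, 'tag': 18432, 'valid': 1024, 'total': 150528}
```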

Question 6:
Consider a program in which 20% of the instructions are memory load or store instructions, and assume that the CPI for this machine is 2.5 when the data and instructions are always found in the cache.

1. How many cycles does it take to access memory if the CPU operates at 2 GHz and the memory access latency is 80 ns?
2. Assuming that the cache miss penalty is equal to the memory access time computed above, what is the average memory access time if the instruction cache miss rate is 2.5% and the data cache miss rate is 5%?
3. Assume that an on-chip L2 cache is added to the system, and that the hit time for the L2 cache is 6 cycles. What would be the effective CPI if 30% of the references to the L2 cache (the misses from L1) are L2 misses?
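
The three parts chain together, and the Python sketch below shows the standard formulas. It makes several assumptions: every instruction makes one instruction fetch plus, on average, 0.20 data references; the L1 hit time is already included in the base CPI; and with an L2 cache the L1 miss penalty is the L2 hit time plus the L2 miss rate times the memory latency. The function names are illustrative.

```python
def latency_in_cycles(latency_ns, clock_ghz):
    """Convert a latency in nanoseconds to clock cycles (GHz = cycles per ns)."""
    return latency_ns * clock_ghz


def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit time plus expected miss cost."""
    return hit_time + miss_rate * miss_penalty


def effective_cpi(base_cpi, data_refs_per_instr, i_miss_rate, d_miss_rate,
                  miss_penalty):
    """Base CPI plus memory stall cycles per instruction."""
    misses_per_instr = i_miss_rate + data_refs_per_instr * d_miss_rate
    return base_cpi + misses_per_instr * miss_penalty


memory_cycles = latency_in_cycles(80, 2.0)

# ASSUMED model for part 3: an L1 miss costs the L2 hit time, plus the
# memory latency on the 30% of L2 accesses that also miss.
l1_penalty_with_l2 = amat(hit_time=6, miss_rate=0.30, miss_penalty=memory_cycles)

print(memory_cycles)
print(round(effective_cpi(2.5, 0.20, 0.025, 0.05, memory_cycles), 2))
print(round(effective_cpi(2.5, 0.20, 0.025, 0.05, l1_penalty_with_l2), 2))
```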