The notes from last semester are available in chronological order at my backup site. Click on the category link on the right-hand side to view notes from that class only. Saves you from having to read bottom-up…
The notes from last semester are available in chronological order at my backup site. Click on the category link on the right-hand side to view notes from that class only. Saves you from having to read bottom-up…
Posted on January 10, 2006 in Computer Architecture, Computer Forensics I, Programming Languages | Permalink | Comments (2) | TrackBack (0)
Exam is on Memory, I/O, Virtual Memory, Cache, Memory Interleaving, Performance Evalutuation
Here are some questions to study for the final (sorry I came to class late so I may have missed some).
[12 4][9 1][ ][7 11]
Some hints that may or may not be helpful…
When Dr. Montagne says "n-way associativity" he means that there are n rows in the cache when the cache looks like this and the columns are m/n where m is the total amount of memory (the "size" of the cache).
You should memorize these equations:

U = Utilization
λ = Arrival Rate
μ = Throughput
S = Service Time
R = Response Time
N = Average number of IORBs in the system
W = Wait
q ⇒ Queue
Posted on December 01, 2005 in Computer Architecture | Permalink | Comments (0) | TrackBack (0)
Systolic Arrays can be classified as MISD. Flynn wrote a paper about seven years ago that was a revision of the taxonomy, and he said that MISD are systolic arrays. He considered the array to be a single piece of data floating through multiple instructions. The matrix is considered a data value and you take the matrix and flow it through the array and you do multiple operations on that single data. To do this, you must consider that the value array or matrix is the single data.
In a Data Flow machine, the instructions are executed when all of the data arrives into the instruction. The standard rule is parallelism, so you have a Multiple Instruction Single Data machine.
Why, if you have a CPU and several I/O Devices that are doing work, don't we call it a multiprocessor? The work that's getting done in the I/O devices can't be processed in the I/O device; it has to go back to the CPU for processing.
Posted on November 29, 2005 in Computer Architecture | Permalink | Comments (0) | TrackBack (0)
That's right, folks… It's that time again. It seems like these types of questions are going to be on the exam, so familiarize yourself with them and if you see any errors please let me know so that I can fix the site before too many people study incorrect material! This applies to any other page on Edulog… Don't be shy. Thanks!
Question: If we have a two-way set associative cache of size four, using the LRU replacement strategy, how many times would a block be replaced? What are the final blocks in the cache at the end of the program?

Saying that the cache is of size four means that there are four blocks, and it's two-way set associative so there are two blocks per set so there are two sets in all.
***** You need to do (reference mod 2) to get the set and then within the set, LRU is applied. *****
Reference string: 4, 5, 9, 8, 5, 4, 21, 9, 1, 13, 1, 9, 16, 4, 12
Answer:
⇒ 7 replacements, not including the first four inputs into the empty queue… (I guess by assuming that the queue is empty to begin with you don't count the first four inputs as "replacements" although their memory access time will probably be the same).
Question: How many sets are there in a two-way set associative cache with 32kb capacity and 64-byte lines (each block is 64 bytes), and how many bits of the address are used to select a set in this cache?
Answer: There are two lines per set in a two-way set associative cache. If it is at full capacity, we have 512 blocks with two blocks per set, so there are 256 sets. So, we will need eight bits to address a set in the cache.
We have two options here. First, Write Through: The information (values) is written in the cache and in main memory.

The other option is Write Back: The block or line is only written to main memory when it is replaced. All writes are done in cache until replaced.
Misses
Question: In a direct-mapped cache with capacity of 16kb and a line length of 32 bytes, how many bits are used to determine the byte (the byte within a word) that a memory operation references in a cache line, and how many bits are used to select the line in the cache?
Answer: 5 and 9, respectively. The toal size of the cache is 16kb where 1kb = 1024 bytes. So, the number of lines in memory is 16 * 1024/32 = 512… log2 512 = 9
* You can't apply LRU in direct-mapped.
Posted on November 22, 2005 in Computer Architecture | Permalink | Comments (0) | TrackBack (0)
For the final project…
|0> ≡ Vector 0 → |0> = [1]
[0]
|1> ≡ Vector 1 → |1> = [0]
[1]
Remember that the Identity Matrix looks like the following:
[1 0] [0 1]
And a nx2 matrix multiplied by the identity will return the nx2 matrix. A "not" matrix looks like the following:
[0 1] [1 0]
The "not" matrix will make NOT |0> = |1>. In the same way, NOT |1> = |0>. So, NOT acts like an inverter.
|00> |01> |10> |11>
|00> ≡ [1][1] → [1] 0
[0][0] [1] 1
[0] 2
[0] 3
0
1 1
0
↓
[ ][ ][ ] ← 0 0
[ ][ ][ ]
0 1 → [ ][ ][ ]
[ ][ ][ 0] ← 0
[ ][ ][ ]
0 → [1 ][ ][ ]
[ ][ 0][ ]0
[ ][ 0 ][ ]
0[ ][1 ][ ]
[ 0][ ][ 0]0
[ 1 ][ ][ 1 ]
0[0 ][ ][1 ]
...
We can represent the output in several ways. For example,
[1 1][1] = [1] = [1] + [0] = |0> + |1> [1 -1][0] [1] [0] [1]
Note: There are different ways to transform NOT into the Identity matrix. For instance, NOT times NOT equals the Identity matrix.
To implement the systolic array, you need the following set up:

How long does it take to do the computation. Theoretically, it will take 8 + 9 + 8 steps to execute, but depending if you have to implement a comparator it may actually be more. Your design should get the same results as the theoretical design, but if it's slower, you won't lose points.
Posted on November 17, 2005 in Computer Architecture | Permalink | Comments (0) | TrackBack (0)

Aw=y Bw=z
Assuming A and B are 2x2 matrices
|a11 a12| |x1| = |y1| |a21 a22| |x2| = |y2|
|b11 b12| |w1| = |z1| |b21 b22| |w2| = |z2|
Do a PowerPoint presentation step by step showing how, with a 2x2 matrix, how the process works. This is due a week from today, next Tuesday.
b22
b12 a22 b21
a12 b11 a21
a11
------------
w1 x2 y1→ | | | | → y1 z1 y2 z2
------------
We've talked about main memory, I/O, and virtual memory. We know that the CPU has to access memory somehow, and has to be connected to a device to initiate I/O operations, etc. When we were talking about virtual memory, we mentioned content addressable memory. The intent is to speed up access time in a virtual memory environment. We can have part of the page table in the TLB so we can find the memory reference immediately.
Based on the principle of locality, we can insert a new level in the hierarchy (between CPU and Main Memory): cache memory.
How to implement fully associative memory:
How to implement Direct Mapping:
Fully Associative Memory is very expensive, but with Direct Mapping you may have free memory that you aren't allowed to access. Set Associative is a combination of the two schemes. You can immediately locate the set and find free memory within the set.
How to implement Set Associative Cache:
Posted on November 15, 2005 in Computer Architecture | Permalink | Comments (0) | TrackBack (0)
» Project 2 is extended to next Tuesday midnight!
A systolic array is an arrangement of processors in an array (often rectangular) where data flows synchronously across the array between neighbours, usually with different data flowing in different directions.
Each processor at each step takes in data from one or more neighbours (e.g. North and West), processes it and, in the next step, outputs results in the opposite direction (South and East).
An example of a systolic algorithm might be matrix multiplication. One matrix is fed in a row at a time from the top of the array and is passed down the array, the other matrix is fed in a column at a time from the left hand side of the array and passes from left to right. Dummy values are then passed in until each processor has seen one whole row and one whole column. At this point, the result of the multiplication is stored in the array and can now be output a row or a column at a time, flowing down or across the array.
Suppose we have a matrix
a11 a12 a13 a21 a22 a23 a31 a32 a33
and a vector
x1 x2 x3
The product of the matrix and vector will be
a11*x1 + a12*x2 + a13*x3 a21*x1 + a22*x2 + a23*x3 a31*x1 + a32*x2 + a33*x3
With a systolic array, imagine that the matrix moves diagonally (top-left) through processing elements. Each processing element has two inputs and a multiplier and an output.
I wasn't wearing glasses so I couldn't take the best notes today, but there's more on systolic arrays here.
W = # of diagonals = # of processors = # of processing elements
T(1) = W + 2n-1 = 2n-1 + 2n-1 = 4n-2
The idea is to apply parallel processing to speed up the computation. Project 3 will be a choice whether to implement this in Verilog or add on to Project 2. If you choose to do the Systolic Arrays project, the matrix only needs to be 2x2. The vector should be 4 elements. You can use 2, 4, or 8 bits (preferably 2) in your implementation (must be more than 1).
Posted on November 10, 2005 in Computer Architecture | Permalink | Comments (0) | TrackBack (0)
Memory has the hardware to compare a value with all the answers in memory at once.
[ 10001110 ] [ 11 ] key
00 11 10 → f/f (match) 01
The idea is to compare one value with several entries in parallel. To implement paging, we need a content-addressable memory scheme.
If we have a program and main memory, to run the program in the memory may not be possible because the program is larger than the available memory. You should be able to run the program without a lot of memory without the user knowing anything about it. There is a technique called paging, which means that you take a program and divide it into pieces of the same size, and each piece is called a page. On the memory side, you divide the memory into pieces, called frames, equal in size to a page. Then, we can load pages into frames.
The page table is the translation mechanism, which has as many entries as pages in a program.

To execute the instruction
LOAD <P,OFFSET>
you'll have to do two memory accesses, one for the page table and one for the frame. This means that programs will run slowly. You take the frame number combined with the offset and you access the data. When you have two accesses, you can use content-addressable memory to speed up accesses.
If there is a "miss", then the entry is not in the TLB and you will have to do two memory accesses (pass control to the OS). TLB stands for Table Look-Aside Buffer. There can be a miss because the TLB won't have the entire program loaded. Depending on the size of the program, you cannot load the whole program into the TLB. Assuming that the program is in memory, if there is a miss, you go to the page table, obey the TLB.
You can load pages anywhere in memory (in any frame) because of the translation mechanism.
To implement virtual memory, you have to have a hard drive. The present bit will be set to 1 if the page is in memory.
The "dirty" bit tells us whether a page has been modified. If the page has been modified (dirty bit is 1), it has to be written back before it can be overwritten.
If there is a miss (page is not loaded in memory) there is a page fault. There can be another case… Assume that you have to additional frames in memory and you have Page 3 loaded into Frame 2 and Page 4 is loaded into Frame 3. Page 3 may not been in the TLB, but it does not necessarily mean there there will be a miss. You go back to the page table and see the location of Page 3.
We devide the program according to procedures and data structures — each procedure is a segment, and each data structure is a segment. The compiler does this, but when the program is going to be executed we create a Segment Table. The Segment Table is similar to the Page Table; it lets us know where segments will be loaded in memory. If we have four segments, we will have four entries in the segment table. Segments are different sizes, though, so we cannot use pointers to frames of memory! We need two values: the base (tells where/address the segment begins in memory) and the limit (ending address).
The gray areas are fragmented. We call this external fragmentation. Extra space after the last page, however, is called internal fragmentation. We can fix this with paging! Instructions are organized like the following:
[LOAD][S][P][D]
PPT ≡ Pointer to page table
The hardware for this looks like this:
When you have a miss in the TLB you have to go to the Segment Table.
If you have 6 pages in the page table, but only 4 memory frames, if there is a miss in the TLB then there might not be a page fault. You can't have the whole page table in the TLB, so when it misses the TLB you go to the Page Table and try to find it before you issue a page fault. If the page is in memory there is a page fault.
If there is a page fault while trying to access page #1 and it is not in memory there is a page fault. If all of the memory frames are used, you must choose the "least recently used" (LRU) page to remove from memory (Dr. Montagne calls this the "victim").
The idea is that we are going to create a stack mechanism that will help us control which pages are/are not in memory.
Reference to Memory:
7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2
Assume that you have 3 frames. This would work like… (the last few steps are skipped, but you can get the gist from what's here):
When numbers start coming out of the bottom of the stack, the dirty bit is checked and if necessary the data is written back. The total number of page faults for the memory reference list above is 10.
* * * * * * 7 0 1 2 0 3 0 4 2 3 0 ... . 7 0 1 2 0 3 0 4 2 3 ... . . 7 0 1 2 2 3 0 4 2 ... . . . 7 7 1 1 2 3 0 4 ... --------------------- ... . . . . . 7 7 1 1 1 1 ... . . . . . . . 7 7 7 7 ... . . . . . . . . . . . ... . . . . . . . . . . . ...
* Indicates Page Fault.
Posted on November 08, 2005 in Computer Architecture | Permalink | Comments (0) | TrackBack (0)
Warning: I had a hard time keeping up with all of the diagrams so beware these may not be entirely complete or well-documented.
Sometimes you can access memory through the CPU, but it is more efficient to have an I/O handler access memory.
So far we know that we have the MAR (memory address register) and the MDR (memory data register). It means that we have an address in the MAR and with that address we can access every word in memory and when we select a specific word the result goes into the MDR.
The following is rather incomplete so consult your own notes. You select memory and load it into the MDR or do the reverse.

Simple memory organization:

Memory Scheme

Dynamic Partitioning
Paging divides a program into pieces of the same size called pages. It works according to the following diagram:
We need a page table pointer in the CPU. Suppose we have something like
LOAD <ADDR> // user instruction LOAD <P,d> // supervisor instruction
I think this is to determine the page's location in memory for a user instruction…
Fragmentation
With pages, we don't have external fragmentation but with the last page we will have a little bit of internal fragmentation. Internal fragmentation occurs when you have extra space at the end of a page. It happens on fixed partitions where you have a block of memory in which you can only load one program. Internal fragmentation cannot be used by any other program. External fragmentation is the free space between programs.
Fixed partitioning is related to internal fragmentation. One example is paging; the last part of the program is covered by a frame but the program will not exactly match the size of the frame.
External fragmentation, again, is the space between programs, which cannot be used.
The Page Table
The page table is in memory in the OS area. It means that we need to access memory to get the frame value and then we need to access memory to access the word that we need (so we need two memory accesses). If you run a program on this machine, every instruction needs to accesses. To solve this problem, we are going to introduce associative memory, or content-addressable memory.

The mask is there to determine the number of bits to compare with the values of the table. When you do an "and" with the key register and the mask register everything will be zero except the two bits in the beginning so that you can compare just the first two bits.
Posted on November 03, 2005 in Computer Architecture | Permalink | Comments (0) | TrackBack (0)
Instructions take the following format:
LOAD Ri, <ADDR> STORE Ri, <ADDR> ADD Ri, Rj, Rk JMP
Fetch (1 Bus)
MAR ← (BUS) ← IP MDR ← MEM[MAR] || IP ← IP + 1 IR ← (BUS) ← MDR Decoder ← IR.OP
LOAD (1 Bus)
MAR ← (BUS) ← IR.ADDR MDR ← MEM[MAR] Ri ← (BUS) ← MDR
ADD (1 Bus)
A ← (BUS) ← Rk C ← A + (ALUB ← (BUS) ← Rj) Ri ← (BUS) ← C
Fetch (2 Buses)
MAR ← (BUS 2) ← IP MDR ← MEM[MAR] || IP ← IP + 1 IR ← (BUS 1) ← (ALU)BC (BUS 2) ← MDR Decoder ← IR.OP
ADD (2 Buses)
A ← (BUS 1) ← Rk Ri ← (BUS 1) ← A + ALUB ← (BUS 2) ← Rj
LOAD (2 Buses) *Uses the ALU as a bridge
MAR ← (BUS 2) ← IR.ADDR MDR ← MEM[MAR] RF ← (BUS 1) ← (ALU)BC ← (BUS 2) ← MDR
Professor Montagne will be giving us the equations for this exam but NOT for the final.
Posted on October 27, 2005 in Computer Architecture | Permalink | Comments (0) | TrackBack (0)
| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|---|---|---|---|---|---|---|
| 1 | 2 | |||||
| 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| 24 | 25 | 26 | 27 | 28 | 29 | 30 |
| 31 |