Bitstuff

For people interested in how computers work bottom-up.

Sunday, May 28, 2006

Epiphany

So I just realized that a multiple-issue superscalar processor is not at all efficient in an FPGA. It simply will not provide the performance I'd like, for one major reason: multiple write ports in FPGAs are prohibitively expensive.

So what I will do is create a one-issue, out-of-order processor that runs at a very high clock rate. The goal is 200MHz. We'll see. The scheduling hardware will be much cheaper this way, and it will be faster than a slower multiple-issue processor on an FPGA.

Basically, my only option for implementing multiple write ports was to time-multiplex the register files. This creates some asynchronous fun. I figured that as long as I was time-multiplexing the register files, I might as well time-multiplex the write-back bus too. Turns out, that is not such a bad idea. In fact, I should make things easy, run everything at the higher clock speed, and pipeline where I have to. This would make a 200MHz scalar out-of-order CPU similar to a 50MHz four-issue CPU. Tada! But it's even better, because many instructions will take only one cycle, while others, like addition, will take a couple.
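To convince myself the time-multiplexing idea holds up, here's a toy Python model (just an illustration, not RTL; all the names are mine) of a register file whose single physical write port, clocked at 4x the base rate, stands in for four logical write ports:

```python
# Toy model: one physical write port clocked 4x faster stands in for
# four logical write ports running at the base clock.

class TimeMuxRegFile:
    def __init__(self, nregs=8, mux_factor=4):
        self.regs = [0] * nregs
        self.mux_factor = mux_factor  # fast-clock cycles per base cycle
        self.pending = []             # writes queued this base cycle

    def request_write(self, reg, value):
        # Up to mux_factor logical writes may be queued per base cycle.
        assert len(self.pending) < self.mux_factor, "write ports oversubscribed"
        self.pending.append((reg, value))

    def base_cycle(self):
        # Each fast cycle retires one queued write through the single port.
        for reg, value in self.pending:
            self.regs[reg] = value
        self.pending.clear()

rf = TimeMuxRegFile()
for i, v in enumerate([10, 20, 30, 40]):
    rf.request_write(i, v)   # four "simultaneous" logical writes
rf.base_cycle()              # all land within one base-clock cycle
print(rf.regs[:4])           # → [10, 20, 30, 40]
```

Same write bandwidth per base cycle as four true ports, with one physical port; the cost moves from ports to clock rate, which is the whole trade.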

Neato, right?

Tuesday, May 09, 2006

CPU Project Calendar (sort of)

Looking at this whole project at once is a bit intimidating, so I split it up into a few different stages.

  1. Overall Architecture
    1. Concrete design parameters
    2. Number of functional units and their partitioning
    3. Major units
    4. Busses between units
    5. ISA (likely to be a subset of the M-word)
    6. Scheduling algorithm
    7. Do some quantitative analysis to back up design choices
    8. At this point, I should have a bunch of nifty drawings and all-around fuzzy feelings about the project
  2. Basic Processor
    1. Implement instruction decode for basic arithmetic instructions
    2. Implement 4-issue scheduling hardware
      1. Dependency FIFOs
      2. Register file
      3. Re-Order Buffer
      4. Write-back bus arbiter
    3. Implement basic ALU functional units
    4. Bypass hardware
    5. Instruction memory (trivial)
    6. Trackers, trackers, and more trackers
      1. Make sure to have good debugging information
      2. Possibly implement a transaction checker
    7. At this point, I should have a good testbench environment and be able to compute the Fibonacci sequence along with some other basic programs
  3. Branches
    1. Implement branch functional unit
      1. branch
      2. jump
      3. trap
    2. Implement branch prediction unit
      1. Branch prediction table
      2. Branch target buffer
      3. Return address stack
    3. Implement branch-misprediction rollback mechanism (should be trivial)
  4. Load/Store + Cache
    1. Implement Load/store functional unit
    2. Implement data and instruction caches
    3. Implement bus interface
    4. Implement cache system instructions
  5. TLB + interrupts/exceptions + Finish implementing all instructions
    1. TLB
    2. interrupts/exceptions
    3. finish all instructions
  6. Floating Point Units?
    1. This would be nifty. We'll see how much space is left.
  7. At this point, I should have a working processor. Pat self on back.

Thursday, May 04, 2006

x86 ISA

I kind of think it would be cool to implement an 80386-compatible CPU. It would be more relevant to my interests, even though the ISA isn't nearly as clean as a RISC ISA. The funny thing is, I've only been designing RISC CPUs, so maybe I should take a detour to the dark side of CISC anyway.

I think I will still piece together the requirements and solutions for the superscalar processor, but in the interim, I want to hack together an x86 CPU.

Fun stuff.

Tuesday, May 02, 2006

HOT16 ISA

I thought I'd post the ISA encoding of the "speed-demon" processor I created. It really did have a ridiculously high clock rate for being implemented in an FPGA, and it was quite efficient with resources. It is a dual-issue, statically scheduled processor. The two issue slots each have their own register file; registers from the other file are accessed through a few dedicated instructions. Operands from the other register file are denoted by a preceding "x.".

The major problem with this processor is that it is a super pain in the bum to program. I didn't even bother, because I knew it would vaporize my brain if I tried. It's only a 16-bit processor with 16-bit addressing. It has a carry flag and add-with-carry to facilitate adding larger numbers. It has several shift operations to implement a fast multi-cycle barrel shifter of sorts. There are one to three delay slots after each instruction, depending on how much forwarding logic is included when the CPU is instantiated. The forwarding logic costs cycle time, but I'm not sure it would be possible to efficiently schedule this processor with 3 delay slots before the result of an instruction can be used, especially with only 8 registers per issue slot.

So even though it has all these shortcomings, I did some quantitative analysis to see how fast it is compared to a 32-bit pipelined RISC processor I designed. Turns out, with perfect instruction scheduling, this processor is 5 to 10 times faster than the 32-bit processor, leaning towards the lower end when more 32-bit operations are involved. With more realistic scheduling, say, one instruction per cycle on average, it is still 2 to 5 times faster. Given that 12 of these can fit on a chip, it could actually get some pretty decent performance. Definitely not general-purpose computing, however; it is more akin to the Cell architecture with its attached processing units.

Instruction           Type  [15:14]  [13:11]     [10:8]  [7:3]   [2:0]
--------------------  ----  -------  ----------  ------  ------  -----
ld rt, imm8(rs1)      Rm    10       rs1         rt      imm8[7:0]
st rt, imm8(rs1)      Rm    11       rs1         rt      imm8[7:0]
addi rs1, imm8        I     01       rs1         000     imm8[7:0]
subi rs1, imm8        I     01       rs1         001     imm8[7:0]
cmpi rs1, imm8        I     01       rs1         010     imm8[7:0]
andi rs1, imm8        I     01       rs1         011     imm8[7:0]
lui rs1, imm8         I     01       rs1         100     imm8[7:0]
b.cc imm8             I     01       n z o       101     imm8[7:0]
jr rs1                J     01       rs1         110     00000000
jal imm11             J     01       imm11[10:8] 111     imm11[7:0]
add rt, x.rs1, rs2    R     00       rs1         rt      00000   rs2
sub rt, x.rs1, rs2    R     00       rs1         rt      00001   rs2
or rt, x.rs1, rs2     R     00       rs1         rt      00010   rs2
and rt, x.rs1, rs2    R     00       rs1         rt      00011   rs2
add rt, rs1, rs2      R     00       rs1         rt      00100   rs2
sub rt, rs1, rs2      R     00       rs1         rt      00101   rs2
or rt, rs1, rs2       R     00       rs1         rt      00110   rs2
and rt, rs1, rs2      R     00       rs1         rt      00111   rs2
addc rt, rs1, rs2     R     00       rs1         rt      01000   rs2
xor rt, rs1, rs2      R     00       rs1         rt      01001   rs2
not rt, rs1           R     00       rs1         rt      01010   rs2
mulhi rt, rs1, rs2    R     00       rs1         rt      01011   rs2
mullo rt, rs1, rs2    R     00       rs1         rt      01100   rs2
sll.m rt, rs1, imm3   R     00       rs1         rt      0111m   imm3
srl.m rt, rs1, imm3   R     00       rs1         rt      1000m   imm3
sra.m rt, rs1, imm3   R     00       rs1         rt      1001m   imm3
sllv.m rt, rs1, rs2   R     00       rs1         rt      1010m   rs2
srlv.m rt, rs1, rs2   R     00       rs1         rt      1011m   rs2
srav.m rt, rs1, rs2   R     00       rs1         rt      1100m   rs2
test rs1              R     00       rs1         000     11010   000
RESERVED              R     00       -           -       11011   -

m: [L = shift 0,1,2,3 bits | H = shift 0,4,8,12 bits]
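As a sanity check on the encoding, here is a rough Python sketch of an assembler for a few of the formats above. The field packing follows the table; the helper names are my own:

```python
def enc_r(rs1, rt, op5, rs2):
    # R-type: [15:14]=00, [13:11]=rs1, [10:8]=rt, [7:3]=op5, [2:0]=rs2
    return (0b00 << 14) | (rs1 << 11) | (rt << 8) | (op5 << 3) | rs2

def enc_i(rs1, op3, imm8):
    # I-type: [15:14]=01, [13:11]=rs1, [10:8]=op3, [7:0]=imm8
    return (0b01 << 14) | (rs1 << 11) | (op3 << 8) | (imm8 & 0xFF)

def enc_rm(opbits, rs1, rt, imm8):
    # Rm-type: [15:14]=10 (ld) or 11 (st), [13:11]=rs1, [10:8]=rt, [7:0]=imm8
    return (opbits << 14) | (rs1 << 11) | (rt << 8) | (imm8 & 0xFF)

# add r3, r1, r2   (both operands local, opcode 00100)
print(hex(enc_r(1, 3, 0b00100, 2)))    # → 0xb22
# addi r5, 0x42    (opcode 000 in [10:8])
print(hex(enc_i(5, 0b000, 0x42)))      # → 0x6842
# ld r2, 7(r4)
print(hex(enc_rm(0b10, 4, 2, 7)))      # → 0xa207
```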

Monday, May 01, 2006

A Complexity-Effective Dynamic Scheduling Algorithm

I found this paper online:

http://courses.ece.uiuc.edu/ece512/Papers/Palacharla.1997.ISCA.pdf

This method uses register renaming along with dependency checking at instruction dispatch to reduce the complexity of the wakeup-select logic (the reservation stations). A FIFO is set up for each issue slot, and instructions are steered into FIFOs so that dependency chains end up in the same FIFO. This means only the instruction at the head of each FIFO needs to be checked for whether it is ready to be dispatched to a functional unit.
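Here is my rough Python sketch of the steering idea as I read it (heavily simplified: no renaming, and a full window just stalls dispatch):

```python
# Sketch of the FIFO-steering heuristic: an instruction goes into the FIFO
# whose tail produces one of its source operands (extending the chain);
# otherwise into an empty FIFO. Only FIFO heads are candidates for issue,
# so wakeup-select never scans the whole window.

from collections import deque

def steer(fifos, instr):
    dest, srcs = instr
    # Prefer a FIFO whose tail instruction feeds this one.
    for f in fifos:
        if f and f[-1][0] in srcs:
            f.append(instr)
            return
    # Otherwise use an empty FIFO; stall if none (a simplification).
    for f in fifos:
        if not f:
            f.append(instr)
            return
    raise RuntimeError("no FIFO available: stall dispatch")

def issue_ready(fifos, ready_regs):
    # Only heads are checked; issue those whose sources are all ready.
    issued = []
    for f in fifos:
        if f and all(s in ready_regs for s in f[0][1]):
            issued.append(f.popleft())
    return issued

fifos = [deque() for _ in range(2)]
# (dest, sources): r3 = r1+r2; r4 = r3+r1 (chains behind it); r6 = r5+r5
for instr in [(3, (1, 2)), (4, (3, 1)), (6, (5, 5))]:
    steer(fifos, instr)

# The dependent instruction landed behind its producer in FIFO 0,
# and only the two heads get checked this cycle.
print(issue_ready(fifos, ready_regs={1, 2, 5}))  # → [(3, (1, 2)), (6, (5, 5))]
```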

This could greatly reduce the cycle-time and resources used by the scheduler in an FPGA CPU.

New Reservation Station Operand Design

A couple things I can do to optimize the size of reservation stations.
  • Don't store data in the reservation station itself; instead, store only the tag of the operand. This saves a ton of LUTs and flip-flops, because multiplexers are expensive in FPGAs, and without the data we significantly reduce the width of the multiplexer required. We also don't need a multiplexer from each of the write-back busses, because we don't need to capture the data. Data for outstanding tags can be stored in a special register file, since register files can be implemented as distributed RAM. This basically combines the storage and multiplexer into one relatively tiny alternative. The drawback is that the in-flight register file will need two read ports for each functional unit and one write port for each write-back bus (there are some time-multiplexing tricks to alleviate this). It should still be cheaper.
  • Something like 90% of instructions have one or none of their operands outstanding, so most reservation stations only need to compare one operand at a time. This lets us save a bit on comparators (which are relatively cheap in the Spartan-3 fabric).
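A little Python model of the tag-only scheme (my own illustration; a dict stands in for the distributed-RAM in-flight register file):

```python
# Tag-only reservation station: the RS holds operand tags, while operand
# values live in a separate "in-flight" register file (distributed RAM in
# the FPGA). A write-back broadcasts only the tag; the value is captured
# once in the in-flight file and read out at issue time.

class TagOnlyRS:
    def __init__(self):
        self.inflight = {}    # tag -> value (the distributed-RAM file)
        self.stations = []    # (op, tag_a, tag_b); no data stored here

    def dispatch(self, op, tag_a, tag_b):
        self.stations.append((op, tag_a, tag_b))

    def writeback(self, tag, value):
        # Tag comparators fire in the stations; the data itself lands
        # only in the in-flight file, so no per-station data mux.
        self.inflight[tag] = value

    def issue_ready(self):
        ready, waiting = [], []
        for op, a, b in self.stations:
            if a in self.inflight and b in self.inflight:
                # Values are fetched from the in-flight file at issue.
                ready.append((op, self.inflight[a], self.inflight[b]))
            else:
                waiting.append((op, a, b))
        self.stations = waiting
        return ready

rs = TagOnlyRS()
rs.dispatch("add", "t1", "t2")
rs.writeback("t1", 10)
print(rs.issue_ready())      # → [] (t2 still outstanding)
rs.writeback("t2", 32)
print(rs.issue_ready())      # → [('add', 10, 32)]
```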
The utilization savings are huge:
  • 7 slices vs. 82 slices
  • 6 flip-flops vs. 39 flip-flops
  • 10 LUTs vs. 200 LUTs
Something like 10 times smaller, and somewhat faster too (~200MHz). Hopefully this is correct and will work out.

Dynamic Instruction Scheduling Algorithms

For my latest project, I want to do a modern processor design. That can mean many things. To me, it means it must have the following qualities:
  • Out of Order Execution
  • Speculative Execution
  • Superscalar
  • At least L1 Cache
  • Memory Management Unit
The most expensive quality is Out of Order Execution. I had read a paper on OOO execution using the Tomasulo algorithm, which was really quite helpful. However, the reservation stations are far too expensive to implement. A dual-issue superscalar CPU with out-of-order execution would need something like a 32-entry Reorder Buffer to be efficient. That in turn requires 32 reservation stations to hold instructions waiting for their operands to become available. Each reservation station needs at least two slots to compare operand tags and hold operand data, and each operand slot requires one tag comparator for each write-back bus. This is waaaay too many resources for the FPGA I have. In fact, it uses > 100% of the resources on the chip :/
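The blow-up is easy to see as back-of-the-envelope arithmetic (the write-back bus count and the per-bit LUT cost here are my guesses, not measurements):

```python
# Rough comparator count for a classic Tomasulo window with the numbers
# from the post. Assumptions (mine): two write-back buses on a dual-issue
# machine, 5-bit tags for a 32-entry ROB, ~1 LUT per compared tag bit.
rs_entries = 32      # reservation stations (~ROB size)
operands   = 2       # tag slots per station
wb_buses   = 2       # write-back buses (assumed)
tag_bits   = 5       # tag width for a 32-entry ROB

comparators = rs_entries * operands * wb_buses
lut_estimate = comparators * tag_bits  # very rough LUT cost of wakeup alone
print(comparators, lut_estimate)       # → 128 640
```

128 tag comparators just for wakeup, before any of the data muxing, is how a modest window eats a small FPGA.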

So now I'm researching alternatives. I would really like to do a four-issue processor; however, I don't think that's possible at this point.

Here are some resources I'm currently reading:
  • http://courses.ece.uiuc.edu/ece512/Papers/sched.html
  • http://www.cs.swan.ac.uk/~csandy/cs-323/09_dynamic_scheduling.html
  • http://courses.ece.uiuc.edu/ece512/Papers/Dwyer.1992.MICRO.pdf
Pizza out.