Bitstuff

For people interested in how computers work bottom-up.

Sunday, May 28, 2006

Epiphany

So I just realized that a multiple-issue superscalar processor is not at all efficient in an FPGA. It simply will not provide the performance I'd like, for one major reason: multiple write ports in FPGAs are prohibitively expensive.

So what I will do is create a one-issue, out-of-order processor that runs at a very high clock rate. The goal is 200MHz. We'll see. The scheduling hardware will be much cheaper this way, and it will be faster than a slower multiple-issue processor on an FPGA.

Basically, my only option for implementing multiple write ports was to time-multiplex the register files. This creates some asynchronous fun. I figured that as long as I was time-multiplexing the register files, I might as well time-multiplex the write-back bus too. Turns out, that is not such a bad idea. In fact, I should make things easy, run everything at the higher clock speed, and pipeline where I have to. This would make a 200MHz scalar out-of-order CPU similar to a 50MHz four-issue CPU. Tada! But it's even better, because many instructions will take only one cycle, while others, like addition, will take a couple.
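To convince myself the time-multiplexing idea holds up, here's a toy Python model (just an illustration, not RTL; all the names are mine) of a register file whose single physical write port, clocked at 4x the base rate, stands in for four logical write ports:

```python
# Toy model: one physical write port clocked 4x faster stands in for
# four logical write ports running at the base clock.

class TimeMuxRegFile:
    def __init__(self, nregs=8, mux_factor=4):
        self.regs = [0] * nregs
        self.mux_factor = mux_factor  # fast-clock cycles per base cycle
        self.pending = []             # writes queued this base cycle

    def request_write(self, reg, value):
        # Up to mux_factor logical writes may be queued per base cycle.
        assert len(self.pending) < self.mux_factor, "write ports oversubscribed"
        self.pending.append((reg, value))

    def base_cycle(self):
        # Each fast cycle retires one queued write through the single port.
        for reg, value in self.pending:
            self.regs[reg] = value
        self.pending.clear()

rf = TimeMuxRegFile()
for i, v in enumerate([10, 20, 30, 40]):
    rf.request_write(i, v)   # four "simultaneous" logical writes
rf.base_cycle()              # all land within one base-clock cycle
print(rf.regs[:4])           # → [10, 20, 30, 40]
```

Same write bandwidth per base cycle as four true ports, with one physical port; the cost moves from ports to clock rate, which is the whole trade.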

Neato, right?

Tuesday, May 09, 2006

CPU Project Calendar (sort of)

Looking at this whole project at once is a bit intimidating, so I split it up into a few different stages.

  1. Overall Architecture
    1. Concrete design parameters
    2. Number of functional units and their partitioning
    3. Major units
    4. Busses between units
    5. ISA (likely to be a subset of the M-word)
    6. Scheduling algorithm
    7. Do some quantitative analysis to back up design choices
    8. At this point, I should have a bunch of nifty drawings and all-around fuzzy feelings about the project
  2. Basic Processor
    1. Implement instruction decode for basic arithmetic instructions
    2. Implement 4-issue scheduling hardware
      1. Dependency FIFOs
      2. Register file
      3. Re-Order Buffer
      4. Write-back bus arbiter
    3. Implement basic ALU functional units
    4. Bypass hardware
    5. Instruction memory (trivial)
    6. Trackers, trackers, and more trackers
      1. Make sure to have good debugging information
      2. Possibly implement a transaction checker
    7. At this point, I should have a good testbench environment and be able to compute the Fibonacci sequence along with some other basic programs
  3. Branches
    1. Implement branch functional unit
      1. branch
      2. jump
      3. trap
    2. Implement branch prediction unit
      1. Branch prediction table
      2. Branch target buffer
      3. Return address stack
    3. Implement branch-misprediction rollback mechanism (should be trivial)
  4. Load/Store + Cache
    1. Implement Load/store functional unit
    2. Implement data and instruction caches
    3. Implement bus interface
    4. Implement cache system instructions
  5. TLB + interrupts/exceptions + Finish implementing all instructions
    1. TLB
    2. interrupts/exceptions
    3. finish all instructions
  6. Floating Point Units?
    1. This would be nifty. We'll see how much space is left.
  7. At this point, I should have a working processor. Pat self on back.

Thursday, May 04, 2006

x86 ISA

I kind of think it would be cool to implement an 80386-compatible CPU. It would be more relevant to my interests, even though the ISA isn't nearly as clean as a RISC ISA. The funny thing is, I've only been designing RISC CPUs, so maybe I should take a detour to the dark side of CISC anyway.

I think I will still piece together the requirements and solutions for the superscalar processor, but in the interim, I want to hack together an x86 CPU.

Fun stuff.

Tuesday, May 02, 2006

HOT16 ISA

I thought I'd post the ISA encoding of the "speed-demon" processor I created. It really did have a ridiculously high clock rate for being implemented in an FPGA, and it was quite efficient with resources. It is a dual-issue, statically scheduled processor. The two issue slots each have their own register file; registers from the other file are accessed through a few dedicated instructions. Operands from the other register file are denoted by a preceding "x.".

The major problem with this processor is that it is a super pain in the bum to program. I didn't even bother, because I knew it would vaporize my brain if I tried. It's only a 16-bit processor with 16-bit addressing. It has a carry flag and add-with-carry to facilitate adding larger numbers. It has several shift operations to implement a fast multi-cycle barrel shifter of sorts. There are one to three delay slots after each instruction, depending on how much forwarding logic is included when the CPU is instantiated. The forwarding logic costs cycle time, but I'm not sure it would be possible to efficiently schedule this processor with 3 delay slots before the result of an instruction can be used, especially with only 8 registers per issue slot.

So even though it has all these shortcomings, I did some quantitative analysis to see how fast it is compared to a 32-bit pipelined RISC processor I designed. Turns out, with perfect instruction scheduling, this processor is 5 to 10 times faster than the 32-bit processor, leaning towards the lower end when more 32-bit operations are involved. With more realistic scheduling, say, one instruction per cycle on average, it is still 2 to 5 times faster. Given that 12 of these can fit on a chip, it could actually get some pretty decent performance. Definitely not general-purpose computing, however; it is more akin to the Cell architecture with its attached processing units.

Instruction           Type  [15:14]  [13:11]     [10:8]  [7:3]   [2:0]
--------------------  ----  -------  ----------  ------  ------  -----
ld rt, imm8(rs1)      Rm    10       rs1         rt      imm8[7:0]
st rt, imm8(rs1)      Rm    11       rs1         rt      imm8[7:0]
addi rs1, imm8        I     01       rs1         000     imm8[7:0]
subi rs1, imm8        I     01       rs1         001     imm8[7:0]
cmpi rs1, imm8        I     01       rs1         010     imm8[7:0]
andi rs1, imm8        I     01       rs1         011     imm8[7:0]
lui rs1, imm8         I     01       rs1         100     imm8[7:0]
b.cc imm8             I     01       n z o       101     imm8[7:0]
jr rs1                J     01       rs1         110     00000000
jal imm11             J     01       imm11[10:8] 111     imm11[7:0]
add rt, x.rs1, rs2    R     00       rs1         rt      00000   rs2
sub rt, x.rs1, rs2    R     00       rs1         rt      00001   rs2
or rt, x.rs1, rs2     R     00       rs1         rt      00010   rs2
and rt, x.rs1, rs2    R     00       rs1         rt      00011   rs2
add rt, rs1, rs2      R     00       rs1         rt      00100   rs2
sub rt, rs1, rs2      R     00       rs1         rt      00101   rs2
or rt, rs1, rs2       R     00       rs1         rt      00110   rs2
and rt, rs1, rs2      R     00       rs1         rt      00111   rs2
addc rt, rs1, rs2     R     00       rs1         rt      01000   rs2
xor rt, rs1, rs2      R     00       rs1         rt      01001   rs2
not rt, rs1           R     00       rs1         rt      01010   rs2
mulhi rt, rs1, rs2    R     00       rs1         rt      01011   rs2
mullo rt, rs1, rs2    R     00       rs1         rt      01100   rs2
sll.m rt, rs1, imm3   R     00       rs1         rt      0111m   imm3
srl.m rt, rs1, imm3   R     00       rs1         rt      1000m   imm3
sra.m rt, rs1, imm3   R     00       rs1         rt      1001m   imm3
sllv.m rt, rs1, rs2   R     00       rs1         rt      1010m   rs2
srlv.m rt, rs1, rs2   R     00       rs1         rt      1011m   rs2
srav.m rt, rs1, rs2   R     00       rs1         rt      1100m   rs2
test rs1              R     00       rs1         000     11010   000
RESERVED              R     00       -           -       11011   -

m: [L = shift 0,1,2,3 bits | H = shift 0,4,8,12 bits]
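As a sanity check on the encoding, here is a rough Python sketch of an assembler for a few of the formats above. The field packing follows the table; the helper names are my own:

```python
def enc_r(rs1, rt, op5, rs2):
    # R-type: [15:14]=00, [13:11]=rs1, [10:8]=rt, [7:3]=op5, [2:0]=rs2
    return (0b00 << 14) | (rs1 << 11) | (rt << 8) | (op5 << 3) | rs2

def enc_i(rs1, op3, imm8):
    # I-type: [15:14]=01, [13:11]=rs1, [10:8]=op3, [7:0]=imm8
    return (0b01 << 14) | (rs1 << 11) | (op3 << 8) | (imm8 & 0xFF)

def enc_rm(opbits, rs1, rt, imm8):
    # Rm-type: [15:14]=10 (ld) or 11 (st), [13:11]=rs1, [10:8]=rt, [7:0]=imm8
    return (opbits << 14) | (rs1 << 11) | (rt << 8) | (imm8 & 0xFF)

# add r3, r1, r2   (both operands local, opcode 00100)
print(hex(enc_r(1, 3, 0b00100, 2)))    # → 0xb22
# addi r5, 0x42    (opcode 000 in [10:8])
print(hex(enc_i(5, 0b000, 0x42)))      # → 0x6842
# ld r2, 7(r4)
print(hex(enc_rm(0b10, 4, 2, 7)))      # → 0xa207
```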

Monday, May 01, 2006

A Complexity-Effective Dynamic Scheduling Algorithm

I found this paper online:

http://courses.ece.uiuc.edu/ece512/Papers/Palacharla.1997.ISCA.pdf

This method uses register renaming along with dependency checking at instruction dispatch to reduce the complexity of the wakeup-select logic (the reservation stations). A FIFO is set up for each issue slot, and instructions are steered into FIFOs so that dependency chains end up in the same FIFO. This means only the instruction at the head of each FIFO needs to be checked for whether it is ready to be dispatched to a functional unit.
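Here is my rough Python sketch of the steering idea as I read it (heavily simplified: no renaming, and a full window just stalls dispatch):

```python
# Sketch of the FIFO-steering heuristic: an instruction goes into the FIFO
# whose tail produces one of its source operands (extending the chain);
# otherwise into an empty FIFO. Only FIFO heads are candidates for issue,
# so wakeup-select never scans the whole window.

from collections import deque

def steer(fifos, instr):
    dest, srcs = instr
    # Prefer a FIFO whose tail instruction feeds this one.
    for f in fifos:
        if f and f[-1][0] in srcs:
            f.append(instr)
            return
    # Otherwise use an empty FIFO; stall if none (a simplification).
    for f in fifos:
        if not f:
            f.append(instr)
            return
    raise RuntimeError("no FIFO available: stall dispatch")

def issue_ready(fifos, ready_regs):
    # Only heads are checked; issue those whose sources are all ready.
    issued = []
    for f in fifos:
        if f and all(s in ready_regs for s in f[0][1]):
            issued.append(f.popleft())
    return issued

fifos = [deque() for _ in range(2)]
# (dest, sources): r3 = r1+r2; r4 = r3+r1 (chains behind it); r6 = r5+r5
for instr in [(3, (1, 2)), (4, (3, 1)), (6, (5, 5))]:
    steer(fifos, instr)

# The dependent instruction landed behind its producer in FIFO 0,
# and only the two heads get checked this cycle.
print(issue_ready(fifos, ready_regs={1, 2, 5}))  # → [(3, (1, 2)), (6, (5, 5))]
```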

This could greatly reduce the cycle-time and resources used by the scheduler in an FPGA CPU.

New Reservation Station Operand Design

A couple things I can do to optimize the size of reservation stations.
  • Don't store data in the reservation station itself; instead, store only the tag of the operand. This saves a ton of LUTs and flip-flops, because multiplexers are expensive in FPGAs, and without the data we significantly reduce the width of the multiplexer required. We also don't need a multiplexer from each of the write-back busses, because we don't need to capture the data. Data for outstanding tags can be stored in a special register file, since register files can be implemented as distributed RAM. This basically combines the storage and multiplexer into one relatively tiny alternative. The drawback is that the in-flight register file will need two read ports for each functional unit and one write port for each write-back bus (there are some time-multiplexing tricks to alleviate this). It should still be cheaper.
  • Something like 90% of instructions have one or none of their operands outstanding, so most reservation stations only need to compare one operand at a time. This lets us save a bit on comparators (which are relatively cheap in the Spartan-3 fabric).
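A little Python model of the tag-only scheme (my own illustration; a dict stands in for the distributed-RAM in-flight register file):

```python
# Tag-only reservation station: the RS holds operand tags, while operand
# values live in a separate "in-flight" register file (distributed RAM in
# the FPGA). A write-back broadcasts only the tag; the value is captured
# once in the in-flight file and read out at issue time.

class TagOnlyRS:
    def __init__(self):
        self.inflight = {}    # tag -> value (the distributed-RAM file)
        self.stations = []    # (op, tag_a, tag_b); no data stored here

    def dispatch(self, op, tag_a, tag_b):
        self.stations.append((op, tag_a, tag_b))

    def writeback(self, tag, value):
        # Tag comparators fire in the stations; the data itself lands
        # only in the in-flight file, so no per-station data mux.
        self.inflight[tag] = value

    def issue_ready(self):
        ready, waiting = [], []
        for op, a, b in self.stations:
            if a in self.inflight and b in self.inflight:
                # Values are fetched from the in-flight file at issue.
                ready.append((op, self.inflight[a], self.inflight[b]))
            else:
                waiting.append((op, a, b))
        self.stations = waiting
        return ready

rs = TagOnlyRS()
rs.dispatch("add", "t1", "t2")
rs.writeback("t1", 10)
print(rs.issue_ready())      # → [] (t2 still outstanding)
rs.writeback("t2", 32)
print(rs.issue_ready())      # → [('add', 10, 32)]
```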
The utilization savings are huge:
  • 7 slices vs. 82 slices
  • 6 flip-flops vs. 39 flip-flops
  • 10 LUTs vs. 200 LUTs
Something like 10 times smaller, and somewhat faster too (~200MHz). Hopefully this is correct and will work out.

Dynamic Instruction Scheduling Algorithms

For my latest project, I want to do a modern processor design. That can mean many things. To me, it means it must have the following qualities:
  • Out of Order Execution
  • Speculative Execution
  • Superscalar
  • At least L1 Cache
  • Memory Management Unit
The most expensive quality is Out of Order Execution. I had read a paper on OOO execution using the Tomasulo algorithm, which was really quite helpful. However, the reservation stations are far too expensive to implement. A dual-issue superscalar CPU with out-of-order execution would need something like a 32-entry Reorder Buffer to be efficient. That in turn requires 32 reservation stations to hold instructions waiting for their operands to become available. Each reservation station needs at least two slots to compare operand tags and hold operand data, and each operand slot requires one tag comparator for each write-back bus. This is waaaay too many resources for the FPGA I have. In fact, it uses > 100% of the resources on the chip :/
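The blow-up is easy to see as back-of-the-envelope arithmetic (the write-back bus count and the per-bit LUT cost here are my guesses, not measurements):

```python
# Rough comparator count for a classic Tomasulo window with the numbers
# from the post. Assumptions (mine): two write-back buses on a dual-issue
# machine, 5-bit tags for a 32-entry ROB, ~1 LUT per compared tag bit.
rs_entries = 32      # reservation stations (~ROB size)
operands   = 2       # tag slots per station
wb_buses   = 2       # write-back buses (assumed)
tag_bits   = 5       # tag width for a 32-entry ROB

comparators = rs_entries * operands * wb_buses
lut_estimate = comparators * tag_bits  # very rough LUT cost of wakeup alone
print(comparators, lut_estimate)       # → 128 640
```

128 tag comparators just for wakeup, before any of the data muxing, is how a modest window eats a small FPGA.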

So now I'm researching alternatives. I would really like to do a four-issue processor; however, I don't think that's possible at this point.

Here are some resources I'm currently reading:
  • http://courses.ece.uiuc.edu/ece512/Papers/sched.html
  • http://www.cs.swan.ac.uk/~csandy/cs-323/09_dynamic_scheduling.html
  • http://courses.ece.uiuc.edu/ece512/Papers/Dwyer.1992.MICRO.pdf
Pizza out.