I thought I'd post the ISA encoding of the "speed-demon" processor I created. It really did have a ridiculously high clock rate for something implemented in an FPGA, and it was also quite efficient with resources. It is a dual-issue, statically scheduled processor. Each of the two issue slots has its own register file; registers from the other file are accessed through a few instructions. Operands from the other register file are denoted by a preceding "x.".
The major problem with this processor is that it is a super pain in the bum to program. I didn't even bother, because I knew it would vaporize my brain if I tried. It's only a 16-bit processor with 16-bit addressing. It has a carry flag and an add-with-carry instruction to facilitate adding larger numbers. It has several shift operations that together implement a fast multi-cycle barrel shifter of sorts. There are one to three delay slots after each instruction, depending on how much forwarding logic is included when the CPU is instantiated. The forwarding logic costs cycle time, but I'm not sure it would be possible to schedule this processor efficiently with 3 delay slots before the result of an instruction can be used, especially with only 8 registers per issue slot.
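To make the carry and shift ideas a bit more concrete, here is a rough Python model of what they buy you. This is only a sketch: the function names (`add16`, `add32`, `split_shift`, `sll16`) are made up for illustration, and it assumes that a full 0 to 15 bit shift is done as one "L" step (0 to 3 bits) followed by one "H" step (0, 4, 8 or 12 bits), as the note under the encoding table suggests. It says nothing about delay slots or how the hardware actually does it.

```python
MASK16 = 0xFFFF  # the data path is 16 bits wide


def add16(a, b, carry_in=0):
    """Model of a 16-bit add: returns (result, carry_out)."""
    total = (a & MASK16) + (b & MASK16) + carry_in
    return total & MASK16, total >> 16


def add32(a, b):
    """32-bit add built from an 'add' on the low halves and an
    'addc' on the high halves (carry out of bit 31 is dropped)."""
    lo, carry = add16(a & MASK16, b & MASK16)        # add
    hi, _ = add16(a >> 16, b >> 16, carry)           # addc
    return (hi << 16) | lo


def split_shift(amount):
    """Split a 0..15 shift into (low part 0..3, high part 0/4/8/12).
    Assumption: this is how the L and H shift modes combine."""
    if not 0 <= amount <= 15:
        raise ValueError("shift amount must be 0..15")
    return amount & 0b0011, amount & 0b1100


def sll16(value, amount):
    """Shift left by 0..15 bits in two steps, like sll.L then sll.H."""
    lo, hi = split_shift(amount)
    value = (value << lo) & MASK16   # L step: 0..3 bits
    value = (value << hi) & MASK16   # H step: 0, 4, 8 or 12 bits
    return value
```

For example, `add32(0x0001FFFF, 1)` carries out of the low half into the high half, and `sll16(1, 13)` is done as a 1-bit step followed by a 12-bit step.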
So even though it has all these shortcomings, I did some quantitative analysis to see how fast it is compared to another 32-bit pipelined RISC processor I designed. Turns out, with perfect instruction scheduling, the processor is 5 to 10 times faster than the 32-bit processor, leaning towards the lower end for code with more 32-bit operations. With more realistic scheduling, say, one instruction per cycle on average, it is still 2 to 5 times faster. Along with the fact that 12 of these can fit on a chip, it could actually get some pretty decent performance. Definitely not general-purpose computing, however. It is more akin to the Cell architecture with attached processing units.
Instruction | Type | [15:14] | [13:11] | [10:8] | [7:3] | [2:0] |
ld rt, imm8(rs1) | Rm | 10 | rs1 | rt | imm8 |
st rt, imm8(rs1) | Rm | 11 | rs1 | rt | imm8 |
addi rs1, imm8 | I | 01 | rs1 | 000 | imm8 |
subi rs1, imm8 | I | 01 | rs1 | 001 | imm8 |
cmpi rs1, imm8 | I | 01 | rs1 | 010 | imm8 |
andi rs1, imm8 | I | 01 | rs1 | 011 | imm8 |
lui rs1, imm8 | I | 01 | rs1 | 100 | imm8 |
b.cc imm8 | I | 01 | n z o | 101 | imm8 |
jr rs1 | J | 01 | rs1 | 110 | 00000000 |
jal imm11 | J | 01 | imm11 | 111 | imm11 |
add rt, x.rs1, rs2 | R | 00 | rs1 | rt | 00000 | rs2 |
sub rt, x.rs1, rs2 | R | 00 | rs1 | rt | 00001 | rs2 |
or rt, x.rs1, rs2 | R | 00 | rs1 | rt | 00010 | rs2 |
and rt, x.rs1, rs2 | R | 00 | rs1 | rt | 00011 | rs2 |
add rt, rs1, rs2 | R | 00 | rs1 | rt | 00100 | rs2 |
sub rt, rs1, rs2 | R | 00 | rs1 | rt | 00101 | rs2 |
or rt, rs1, rs2 | R | 00 | rs1 | rt | 00110 | rs2 |
and rt, rs1, rs2 | R | 00 | rs1 | rt | 00111 | rs2 |
addc rt, rs1, rs2 | R | 00 | rs1 | rt | 01000 | rs2 |
xor rt, rs1, rs2 | R | 00 | rs1 | rt | 01001 | rs2 |
not rt, rs1 | R | 00 | rs1 | rt | 01010 | rs2 |
mulhi rt, rs1, rs2 | R | 00 | rs1 | rt | 01011 | rs2 |
mullo rt, rs1, rs2 | R | 00 | rs1 | rt | 01100 | rs2 |
sll.m rt, rs1, imm3 | R | 00 | rs1 | rt | 0111m | imm3 |
srl.m rt, rs1, imm3 | R | 00 | rs1 | rt | 1000m | imm3 |
sra.m rt, rs1, imm3 | R | 00 | rs1 | rt | 1001m | imm3 |
sllv.m rt, rs1, rs2 | R | 00 | rs1 | rt | 1010m | rs2 |
srlv.m rt, rs1, rs2 | R | 00 | rs1 | rt | 1011m | rs2 |
srav.m rt, rs1, rs2 | R | 00 | rs1 | rt | 1100m | rs2 |
test rs1 | R | 00 | rs1 | 000 | 11010 | 000 |
RESERVED | R | 00 | | | 11011 | |
m: [L = shift by 0, 1, 2 or 3 bits | H = shift by 0, 4, 8 or 12 bits]
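For anyone who wants to play with the encoding, the field layout in the table can be packed with a few shifts and ORs. The following is a minimal sketch in Python, not an assembler: the helper names (`field`, `encode_r`, `encode_i`, `encode_mem`) are hypothetical, the immediate is treated as a raw 8-bit field because the table does not say whether it is signed, and the branch and jump formats are left out.

```python
def field(value, width, name):
    """Check that a value fits in an unsigned field of the given width."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{name}={value} does not fit in {width} bits")
    return value


def encode_r(funct5, rt, rs1, rs2):
    """R type: [15:14]=00 | [13:11]=rs1 | [10:8]=rt | [7:3]=funct | [2:0]=rs2."""
    return ((0b00 << 14)
            | (field(rs1, 3, "rs1") << 11)
            | (field(rt, 3, "rt") << 8)
            | (field(funct5, 5, "funct5") << 3)
            | field(rs2, 3, "rs2"))


def encode_i(op3, rs1, imm8):
    """I type: [15:14]=01 | [13:11]=rs1 | [10:8]=op | [7:0]=imm8."""
    return ((0b01 << 14)
            | (field(rs1, 3, "rs1") << 11)
            | (field(op3, 3, "op3") << 8)
            | field(imm8, 8, "imm8"))


def encode_mem(is_store, rt, rs1, imm8):
    """Rm type: [15:14]=10 (ld) or 11 (st) | rs1 | rt | imm8."""
    top = 0b11 if is_store else 0b10
    return ((top << 14)
            | (field(rs1, 3, "rs1") << 11)
            | (field(rt, 3, "rt") << 8)
            | field(imm8, 8, "imm8"))


# A few encodings taken straight from the table rows:
ADD = 0b00100    # add rt, rs1, rs2 (same register file)
ADDI = 0b000     # addi rs1, imm8

print(hex(encode_r(ADD, rt=1, rs1=2, rs2=3)))    # add r1, r2, r3
print(hex(encode_i(ADDI, rs1=5, imm8=0x7F)))     # addi r5, 0x7F
print(hex(encode_mem(False, rt=2, rs1=1, imm8=4)))  # ld r2, 4(r1)
```

The range check in `field` is there because with only 3-bit register fields and an 8-bit immediate it is very easy to overflow a field silently.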