I've been busy designing some FPGA hardware recently, and part of it uses a lovely little 32-bit microprocessor I designed a while back. I've been meaning to publish the source for it for some time, but since the contract for this design means I have to provide the source to the customer and even [shudder] document the thing to some extent I may as well take the opportunity to publish it now.
I've called it the R3220, those of a technical nature can assume this name derives from the fact it's a 32-bit RISC with about 20 instructions (while being completely oblivious of the "saturday night special" connection, I suspect).
The performance is fairly respectable, I'm using it to replace an existing NIOS II based design because the NIOS based design wasn't anywhere near fast enough.
As a guide the R3220 CPU uses between 600 and 1700 LE's in a Cyclone II array, so a complete system with one CPU with several peripherals only takes up ~12% of a Cyclone II EP2C20.
For fun I put four of these processors in the array (quad-core, anyone?) and it took up about 30% of the available logic. Since they can share memory without losing any performance it's actually not a bad idea to think in terms of more than one CPU for some applications, though I don't think I need to do it for this one.
(To facilitate the use of dual-port memory, which is common in gate arrays, the reset vector is configurable, so you can have two processors using the same memory block but not executing the same part of it. This incurs no extra delay so both will run at full speed.)
The verilog source, an example SOC design including various peripherals, a simple SDRAM controller and the r32 assembler can be found here -http://www.desdes.com/products/r3220/index.htm - note that this is the first verilog I've written so the style across the sources is inconsistent, I'm still finding my feet with the language. I'm more used to using VHDL, though I have to say I'm finding it a lot easier to write in verilog than VHDL, there's something about the 'feel' of verilog that I prefer.
This isn't the first processor I've designed and hidden away inside a gate array, but most of them have been too dedicated to a specific task to be worth discussing. This one is a nice general-purpose device to programme (in assembler, that is. If you're stuck using C then pick one of the MIPS implementations) and given how little time it took to develop (about ten days) I have to say I'm very pleased with it - it's a shame that so many embedded designers are stuck in the whole turgid morass of the linux/gcc/c bloatware mindset and can't play about like this, they really don't know how much fun they're missing...
For those who care it's a fairly classic load/store risc with a couple of unusual features. I designed it to have lots of registers (128) so I can dedicate them to interrupt handlers, and it has a nice touch that allows for 7 or 14-bit immediate values in the single-cycle instructions but also a full 32-bit immediate value can be used for all operations with only a single cycle penalty - it achieves this by taking one of the 7-bit immediate values, (-64), and using it as a flag to request that the next instruction word be used as a 32-bit immediate instead.
It has a reasonable addressing mode set with all the usual pre and post inc/dec indexed operations. The address is formed by taking a pointer register, which can be any of the 128 registers (there is no distinction between them) and optionally adding a 7-bit signed offset value. This address can be written back to the index register if required (with no extra delay) which allows auto inc/dec modes. The fact that the memory address used can be the register address or the register+offset means that there is full support for auto inc and dec with pre inc/dec and post inc/dec addresses. So any sort of stack you like, basically.
Every instruction is conditional with an option to update the flags. Sounds like an ARM, I suppose, though I first designed a CPU that did this many years before the ARM appeared so saying I got the idea from ARM might just earn you a thick ear [grin]. In fact, the bit field layout in this instruction set is remarkably similar to one of my designs from the 70's, though I was trying to outdo the PDP-11 back then so the that design had a particularly gothic and slow addressing mode set. [shudder]
The condition codes operate by having a 4-bit field that selects individual flags in the status register and then determining if the instruction should be executed using the state of the selected flag.
Doing that uses 14 of the 16 possible condition codes to handle 7 status flag bits. Of the two remaining codes one is ALWAYS which means always execute, of course, and the last is interesting - I've called it 'LOCK' and it always executes the instruction, the same as ALWAYS, but it also prevents the instruction from being interrupted. Once the processor has been put into LOCK mode it remains in LOCK mode until it executes an instruction with the ALWAYS condition code. This allows sections of code to be made 'atomic' without any overhead, and those sections can include conditional code.
Lock mode also forces locked RMW bus cycles where the bus architecture supports them, which makes it beautiful for implementing semaphores, etc. It's all much nicer than having to either flip interrupt enable/disable flags around critical bits of code or use peculiar instructions.
Another use of LOCK is in function entry/exit code, to make the stacking/unstacking of link registers etc safe. This can be handled transparently by the development tools.
I designed it to be agile, so interrupts have to be entered and exited quickly, and to that end it supports multiple interrupt levels with priority encoding on each of those for multiple interrupt sources at each level. The CPU outputs acknowledge signals when taking an interrupt vector so there are no delays while the handler has to manage the interrupt controller.
For speed it uses dedicated link registers instead of stacking the return address/state during interrupt cycles, and cleverly uses the particular link register invoked by the 'return' instruction to signal the end of that interrupt level to the interrupt controller. This means interrupt handlers are exactly as fast as calls and yet don't require any special entry/exit instructions.
Something else that's nice, though obvious, is that because you can load/store values directly to the status register and the condition codes use single bit testing peripherals can to organise their status flags so it's meaningful for the device handler software to load the peripheral's status register directly into the CPU status register and then use conditional instruction execution to determine how the flow should proceed. ie:
call s0t RxByteHandler ; The s0t means execute this if status register bit 0 is true.
call s1t TxByteHandler ; The s1t means execute this if status register bit 1 is true.
The s0t, s1t (etc) are aliases for the conditions C, Z, etc.
This process is helped by the fact that the handers can preserve the status flags (they're saved in the link register and can be loaded back or not as required when returning).
The assembler is pretty powerful, as well as all the usual stuff you'd expect it handles allocating link registers automatically so the user doesn't have to know or care what they are. Thanks are due to a certain Ste for making me do this with his damned "It can't be that difficult" and "I can't see the problem" attitude.
In summary it's a nice processor, amazingly nice considering how little time it took to design. I absolutely love being in a position where I can design stuff like this and get paid for doing it...