Low-latency I/O RISC-V CPU core in FPGA fabric

Hi @thevenus,

I didn’t see you generate a GSoC proposal. Would you still be interested in engaging with us on this project over the summer as a volunteer supporting the development? I very much appreciate community engagement on development early in the process.

Hi, this is Aditya Garg from NSUT, Delhi, India. I have worked with ALU design at a very basic level and am currently trying to learn more about FPGAs and CPU architecture. I would love to contribute if I can be of any help. Also, can someone point me to the best resources for learning FPGAs?

Hello @jkridner. I am interested in collaborating on this project this summer. Is there any new update? It has been a while. Thank you.

We’d be looking for a new idea this summer. That is, specific ideas for how to improve upon what has been done or to do something different, not just more of the same. Do you have thoughts on what specific areas of improvement might be possible?

Hello
I’m excited about contributing to this project. I believe we could improve the existing design by optimizing key components, implement a small library of C macros for GPIO access and delays (which was mentioned in last year’s Future Work), or focus on verification by developing a UVM-based framework and a golden reference to compare outputs against. Would love to hear your thoughts on these suggestions.
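As a rough idea of what such a macro library could look like, here is a minimal sketch in C. Every name in it (`GPIO_OUT`, `GPIO_SET`, `delay_loops`, and so on) is hypothetical and not taken from the existing project. Two plain variables stand in for the output and input registers so the sketch also runs on a host; on the real core they would be replaced by whatever register access the design ends up exposing.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the output and input registers.
 * On real hardware these would map to the core's I/O registers. */
static volatile uint32_t fake_gpio_out;
static volatile uint32_t fake_gpio_in;

#define GPIO_OUT (fake_gpio_out)
#define GPIO_IN  (fake_gpio_in)

/* Single-pin helpers; each one touches only the selected bit. */
#define GPIO_SET(pin)    (GPIO_OUT |= (UINT32_C(1) << (pin)))
#define GPIO_CLR(pin)    (GPIO_OUT &= ~(UINT32_C(1) << (pin)))
#define GPIO_TOGGLE(pin) (GPIO_OUT ^= (UINT32_C(1) << (pin)))
#define GPIO_READ(pin)   ((GPIO_IN >> (pin)) & UINT32_C(1))

/* Crude busy-wait. The number of CPU cycles per iteration depends on
 * the compiler and the core, so this is not a calibrated delay. */
static inline void delay_loops(uint32_t n)
{
    for (volatile uint32_t i = 0; i < n; i++) {
        /* intentionally empty */
    }
}
```

A calibrated delay would need a known cycles-per-iteration figure for the target core, which this sketch does not assume.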

Interested. Now I get the idea: you are going to use a RISC-V CPU in the FPGA fabric. But initially you would need a soft-core RISC-V SoC, which should be written in Verilog.

But I think designing the whole interface, connecting the core to peripherals over a bus interface, is quite difficult for a single contributor.

Hi, I'm Abhay. I have worked on single-cycle and pipelined RISC-V cores, and have experience with Zynq FPGAs and computer architecture. I want to contribute to this project. Please guide me on how I can start contributing. @jkridner

Good to see so much interest in improving and expanding on the great work already done by @Roger18.

The way I see it, what is sorely missing to make this CPU more than a footnote, is:

  • Proper compiler support, including the selection of a performant and small libc.
  • Debugging capability, perhaps something that can interface with OpenOCD.
  • A proper mechanism for loading the abovementioned code into memory.

because as fun as it may be to tinker at the gate level,
this whole thing becomes academic real fast if you can’t program it properly.

I’d like to see more focus on the MIT benchmark. Being passed up by the PIO in the RP2040 per ring oscillator timing tests makes me sad.

BeagleV-Fire should be put into this page with a simple FPGA gateware demo. That should compare well with iCE40UP5K.

The softcore generated here on BeagleV-Fire should be made somewhat competitive. Obviously, the CPU core would not be as fast as if it were synthesized in an ASIC, but it should be shown to provide a MHz(software-in-the-loop oscillator)/MHz(CPU frequency) rating that exceeds the others.
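To make the rating concrete, here is a minimal sketch of the arithmetic in C. It assumes an idealized software-in-the-loop ring in which every output toggle costs a fixed number of CPU cycles and pin or synchronizer delays are ignored. The function names are hypothetical, and the numbers in the usage note are illustrative, not measurements.

```c
/* Idealized ring frequency in MHz: one full period needs two toggles,
 * and each toggle costs cycles_per_toggle CPU cycles. */
static double ring_mhz(double f_cpu_mhz, unsigned cycles_per_toggle)
{
    if (cycles_per_toggle == 0) {
        return 0.0; /* no meaningful result for a zero-cycle loop */
    }
    return f_cpu_mhz / (2.0 * (double)cycles_per_toggle);
}

/* The rating discussed above: ring MHz per CPU MHz (higher is better). */
static double ring_rating(double f_ring_mhz, double f_cpu_mhz)
{
    return f_ring_mhz / f_cpu_mhz;
}
```

For example, under these assumptions a 100 MHz core that needs 5 cycles per toggle would give `ring_mhz(100.0, 5)` of 10 MHz, for a rating of 0.1. Because the rating is normalized to the CPU clock, it compares cores by cycles per toggle rather than by raw clock speed.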

As far as the compiler goes, introducing a very small number of custom instructions seems to be necessary to get the performance, as all of the registers are currently consumed by the active programming models. I’d be happy to see a suitable work-around, but, as of now, it seems some modifications are necessary. Integrating this into GCC and LLVM seems to be a requirement for completion of this project.

As for point 1, I don’t think it’s fair to expect a general-purpose CPU to compete with a very fast, purpose-built state machine. That is comparing apples to oranges, in my opinion.

As for point 2, I must admit that I did not consider the C programming ABI when I suggested that R30/R31 should mimic what a PRU does with those.

That being said, I still think that the biggest obstacle to general uptake with programmers,
is that they have no way to program the thing, no matter how fast we could theoretically make it.

Roger got to the PRU-like r30/r31 on picoRV, which is FPGA-optimized but a multi-cycle core.
I think the next step would be to go into custom instructions (on a pipelined core?) and maybe more peripherals.

As for compiler support, apart from instruction support (or that x30/x31, or x15-x31 of alt1), it would be desirable to get avr8-gcc-like remapping of memory-mapped peripherals to I/O-mapped ones.
(I have a dedicated chapter, 2.2, in the XTightlyCoupledIO doc for this.)

PIO is a very special architecture capable of doing in, negate, out, and looping in a single instruction (and cycle). It is more comparable to FPGA logic than to a general-purpose architecture.
The best comparison baseline would be the BBB PRU or other general-purpose architectures.

BTW, those samples with the highest FCPU/FRING ratio could be biased because the input register is sampled a cycle or two too early, which wastes cycles looping with no output change. (Adding nops before sampling could, counterintuitively, improve it.)

I think that one can increase ring frequency there, by something like this (PRU sample):

while(1) {
	while(__R31 & (1<<2)); // should use the wbs/wbc instructions
	__R30 |= (1<<3); // SET
	while(!(__R31 & (1<<2)));
	__R30 &= ~(1<<3); // CLR
}

Alternatively, an extract, invert, insert method avoids internal branching for a better duty cycle (PRU sample again; it might not be the most efficient, especially with a high FCPU/FRING ratio):

// assuming that there are no interrupts accessing
// R30, but don't assume or ruin GPIO state

__R30 &= ~(1<<3); // initial state

uint32_t tmp;
uint32_t R30_cached = __R30;

while(1) {
	tmp = __R31 & (1<<2); // extract one bit
	tmp <<= 1; // move to output position
	tmp = ~tmp;
	__R30 = R30_cached | tmp; // fusable to orn insn (risc-v zbb)
	
	// should compile to 4 insn + branch
	// didn't find a better way to do a single bit insertion
	//
	//| R30 | R31 | out |
	//|-----|-----|-----|
	//|  0  |  0  |  1  | // input changed, invert output
	//|  0  |  1  |  0  | // input not changed yet, output stays the same
	//|  1  |  1  |  0  | // input changed, invert output
	//|  1  |  0  |  1  | // input not changed yet, output stays the same
	//
	// in order to get the desired output without R30 caching we would
	// need ~(R30 ^ (tmp ^ R30)), which destroys R30 state
}

Also, the PRU has a hardware loop (limited to 16 bits), which is not used in this benchmark (because of the infinite loop).

Meanwhile, with XTightlyCoupledIO (as of v3.2), extract, invert, insert is a 3 (+loop) instruction sequence (it could be 2 if it had inversion fused with bit extract):

1:	tio.bexti a0, io12, 2
	not a0, a0
	tio.bfinserti io13, a0, 3
	b 1b

// alternatively without tio.bfinserti insn

	li a2, (1<<2)
	tio.bclri io13, 3
	tio.mv a1, io13
	
2:	tio.slli a0, io12, 1 // move to output position
	andn a0, a2, a0 // invert the input and mask
	tio.or io13, a1, a0
	b 2b

The polling method for a high FCPU/FRING ratio should be similar to the PRU one:

1:	tio.bsbseti io12, 2, 1b
	tio.bseti io13, 3
2: 	tio.bsbclri io12, 2, 2b
	tio.bclri io13, 3
	b 1b
tmp = __R31 & (1<<2); // extract one bit
tmp <<= 1; // move to output position
tmp = ~tmp;
__R30 = R30_cached | tmp; // fusable to orn insn (risc-v zbb)

That still ruins the R30 state; the correct code should be:

__R30 &= ~(1<<3); // initial state

uint32_t tmp;
uint32_t R30_cached = __R30;

while(1) {
	tmp = __R31 << 1; // move to output position
	tmp = ~tmp;
	tmp &= (1<<3); // extract one bit (at output position)
	__R30 = R30_cached | tmp; // fusable to orn insn (risc-v zbb)
}
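A host-side sketch of the difference between the two variants, assuming input bit 2 and output bit 3 as in the snippets above. The function names are made up for illustration, and plain `uint32_t` arguments stand in for `__R30` and `__R31` so it can run off-target. The masked version assumes the intent is to keep only the output bit of the inverted value.

```c
#include <stdint.h>

#define IN_BIT  2u
#define OUT_BIT 3u

/* Unmasked variant: inverting tmp sets every bit except the output
 * bit, so the OR overwrites the other R30 bits with ones. */
static uint32_t next_r30_unmasked(uint32_t r30_cached, uint32_t r31)
{
    uint32_t tmp = r31 & (UINT32_C(1) << IN_BIT); /* extract one bit */
    tmp <<= (OUT_BIT - IN_BIT);                   /* move to output position */
    tmp = ~tmp;                                   /* invert (all 32 bits) */
    return r30_cached | tmp;
}

/* Masked variant: shift, invert, then keep only the output bit, so
 * the remaining bits of r30_cached pass through untouched.
 * r30_cached is expected to have the output bit cleared. */
static uint32_t next_r30_masked(uint32_t r30_cached, uint32_t r31)
{
    uint32_t tmp = r31 << (OUT_BIT - IN_BIT);     /* move to output position */
    tmp = ~tmp;                                   /* invert */
    tmp &= (UINT32_C(1) << OUT_BIT);              /* keep only the output bit */
    return r30_cached | tmp;
}
```

With `r30_cached = 0xA0` and the input bit low, the masked variant returns `0xA8` (only bit 3 added), while the unmasked variant returns `0xFFFFFFFF`.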

I think this ended up being a great project. Anyone interested in evaluating it and extending it?