I didn’t see you generate a GSoC proposal. Would you still be interested in engaging with us on this project over the summer as a volunteer supporting the development? I very much appreciate community engagement on development early in the process.
Hi, this is Aditya Garg from NSUT, Delhi, India. I have worked with ALU design at a very basic level and am currently trying to learn more about FPGAs and CPU architecture. I would love to contribute if I can be of any help. Also, can someone point me to the best resources for learning about FPGAs?
We’d be looking for a new idea this summer. That is, specific ideas on how to improve upon what has been done or how to do something different, not just more of the same. Do you have thoughts on what specific areas of improvement might be possible?
Hello
I’m excited about contributing to this project. I believe we could either improve the existing design, by optimizing key components or by implementing a small library of C macros for GPIO access and delays (which was mentioned in last year’s Future Work), or focus on verification, developing a UVM-based framework and a golden reference to compare outputs against. I would love to hear your thoughts on these suggestions.
I'm interested. Now I get the idea: you are going to use a RISC-V CPU on the FPGA fabric. But first you need a soft-core RISC-V SoC, which should be written in Verilog.
But I think designing the whole interface, connecting the core to peripherals over a bus interface, is quite difficult for a single contributor.
Hi, I'm Abhay. I have worked on single-cycle and pipelined RISC-V cores, and I have experience with Zynq FPGAs and computer architecture. I want to contribute to this project. Please guide me on how I can start contributing. @jkridner
I’d like to see more focus on the MIT benchmark. Being passed up by the PIO in the RP2040 per ring oscillator timing tests makes me sad.
BeagleV-Fire should be put into this page with a simple FPGA gateware demo. That should compare well with iCE40UP5K.
The softcore generated here on BeagleV-Fire should be made somewhat competitive. Obviously, the CPU core would not be as fast as if it were synthesized in an ASIC, but it should be shown to provide a MHz(software-in-the-loop oscillator)/MHz(CPU frequency) rating that exceeds the others.
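To make that rating concrete, here is a minimal sketch in C of how the ratio could be computed. The function names are hypothetical, and the best-case bound assumes the simplest possible loop, one that toggles the output exactly once per iteration. It is an illustration, not the benchmark's actual scoring code.

```c
#include <assert.h>

/* Hypothetical helper: measured ring frequency divided by CPU frequency.
   Both arguments use the same unit (Hz here). Higher is better. */
double ring_to_cpu_ratio(double f_ring_hz, double f_cpu_hz)
{
    assert(f_cpu_hz > 0.0);
    return f_ring_hz / f_cpu_hz;
}

/* Assumed best case: every loop iteration takes loop_cycles CPU cycles
   and toggles the output once, so a full ring period needs two
   iterations, i.e. 2 * loop_cycles CPU cycles. */
double best_case_ratio(unsigned loop_cycles)
{
    assert(loop_cycles > 0);
    return 1.0 / (2.0 * (double)loop_cycles);
}
```

As an illustrative figure only: under this assumption a 4-cycle loop gives at best a ratio of 1/8, so a 200 MHz core would top out at a 25 MHz ring.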
As far as the compiler goes, introducing a very small number of custom instructions seems to be necessary to get the performance, as all of the registers are currently consumed by the active programming models. I’d be happy to see a suitable work-around, but, as of now, it seems some modifications are necessary. Integrating this into GCC and LLVM seems to be a requirement for completion of this project.
As for point 1, I don’t think it’s fair to expect a general-purpose CPU to compete with a very fast, purpose-built state machine. That is comparing apples to oranges, in my opinion.
As for point 2, I must admit that I did not consider the C programming ABI when I suggested that R30/R31 should mimic what a PRU does with those.
That being said, I still think that the biggest obstacle to general uptake with programmers is that they have no way to program the thing, no matter how fast we could theoretically make it.
Roger got as far as PRU-like r30/r31 on picoRV, which is an FPGA-optimized but multi-cycle core.
I think the next step would be to go into custom instructions (on a pipelined core?) and maybe more peripherals.
As for compiler support, apart from instruction support (or the x30/x31, or x15-x31 of alt1), it would be desirable to get avr8-gcc-like remapping of memory-mapped peripherals to I/O-mapped ones.
(Chapter 2.2 of the XTightlyCoupledIO doc is dedicated to this.)
PIO is a very special architecture, capable of doing in, negate, out, and looping in a single instruction (and cycle). It is more comparable to FPGA logic than to a general-purpose architecture.
The best comparison baseline would be the BBB PRU or other general-purpose archs.
BTW, the samples with the highest FCPU/FRING ratio could be biased because the input register is sampled a cycle or two too early, so cycles are wasted looping with no output change. (Adding nops before sampling could, counterintuitively, improve it.)
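The nop effect can be illustrated with a toy timing model. This is only a sketch under assumptions that are not from the thread: the loop takes a fixed `loop_cycles` per iteration, and `latency_cycles` is the time from one input sample until the output change it causes is visible on the input again (write delay plus pin and synchronizer delay). Both names are hypothetical.

```c
#include <assert.h>

/* Toy model: CPU cycles per ring half-period.
   An output change is only seen by the first sample taken at least
   latency_cycles after the sample that caused it, so the loop runs
   ceil(latency_cycles / loop_cycles) iterations per toggle. */
unsigned half_period_cycles(unsigned loop_cycles, unsigned latency_cycles)
{
    assert(loop_cycles > 0);
    unsigned iterations = (latency_cycles + loop_cycles - 1) / loop_cycles;
    if (iterations == 0)
        iterations = 1; /* at least one iteration per toggle */
    return iterations * loop_cycles;
}
```

In this model, with a latency of 5 cycles, a 4-cycle loop samples one cycle too early and needs two iterations per toggle (8 cycles), while padding the loop to 5 cycles with one nop gives one toggle per iteration (5 cycles).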
I think that one can increase ring frequency there, by something like this (PRU sample):
    while(1) {
        while(__R31 & (1<<2)); // should use the wbs/wbc instructions
        __R30 |= (1<<3); // SET
        while(!(__R31 & (1<<2)));
        __R30 &= ~(1<<3); // CLR
    }
Alternatively, an extract, invert, insert method avoids internal branching for a better duty cycle (PRU sample again; it might not be the most efficient, especially in the case of a high FCPU/FRING ratio):
    // assuming that there are no interrupts accessing
    // R30, but don't assume or ruin GPIO state
    __R30 &= ~(1<<3); // initial state
    uint32_t tmp;
    uint32_t R30_cached = __R30;
    while(1) {
        tmp = __R31 << 1;  // move input bit 2 to output position 3
        tmp = ~tmp;        // invert
        tmp &= (1<<3);     // extract one bit, so the other R30 bits stay untouched
                           // (not + and is fusable to andn insn, risc-v zbb)
        __R30 = R30_cached | tmp; // insert
        // should compile to 4 insn + branch
        // didn't find a better way to do a single bit insertion
        //
        //| R30 | R31 | out |
        //|-----|-----|-----|
        //|  0  |  0  |  1  | // input changed, invert output
        //|  0  |  1  |  0  | // input not changed yet, output stays the same
        //|  1  |  1  |  0  | // input changed, invert output
        //|  1  |  0  |  1  | // input not changed yet, output stays the same
        //
        // in order to get the desired output without R30 caching we would
        // need ~(R30 ^ (tmp ^ R30)), which destroys R30 state
    }
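The extract, invert, insert step can also be checked on a host by writing it as a pure function. This is a minimal sketch, not PRU code: `next_r30`, `IN_BIT`, and `OUT_BIT` are hypothetical names, and the bit positions (input bit 2, output bit 3) are taken from the sample above. The mask keeps the inversion from touching any R30 bit other than the output bit.

```c
#include <assert.h>
#include <stdint.h>

#define IN_BIT  2u /* input pin position in R31 (from the sample above) */
#define OUT_BIT 3u /* output pin position in R30 (from the sample above) */

/* Hypothetical host-side helper: next R30 value, given the cached R30
   (with the output bit already cleared) and a sampled R31 value. */
uint32_t next_r30(uint32_t r30_cached, uint32_t r31)
{
    /* the caller is expected to clear the output bit before caching */
    assert((r30_cached & (UINT32_C(1) << OUT_BIT)) == 0);

    uint32_t tmp = r31 << (OUT_BIT - IN_BIT); /* move to output position */
    tmp = ~tmp & (UINT32_C(1) << OUT_BIT);    /* invert, keep only the output bit */
    return r30_cached | tmp;                  /* insert */
}
```

The output bit always ends up as the inverse of the sampled input bit, and every other bit of the cached R30 value passes through unchanged.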
Also, the PRU has a hardware loop (limited to 16 bits), which is not used in this benchmark (due to the infinite loop).
Meanwhile, with XTightlyCoupledIO (as of v3.2), extract, invert, insert is a 3-instruction (+loop) sequence (it could be 2 if it had inversion fused with bit extract):
    1:  tio.bexti a0, io12, 2
        not a0, a0
        tio.bfinserti io13, a0, 3
        j 1b
    // alternatively without tio.bfinserti insn
        li a2, (1<<3)           // mask for the output bit (bit 3 after the shift)
        tio.bclri io13, 3
        tio.mv a1, io13
    2:  tio.slli a0, io12, 1    // move to output position
        andn a0, a2, a0         // invert the input and mask
        tio.or io13, a1, a0
        j 2b
The polling method for a high FCPU/FRING ratio should be similar to the PRU one.