Well, it has been a while.

A few months ago, I challenged myself to make a drone from scratch and wrote a blog laying out the plan for this goal. What makes the project unique is that the flight controller of the drone runs not on an Arduino or Raspberry Pi, as is done in similar projects. Instead, I wanted to build a RISC-V-based processor on an FPGA and use this processor to run the flight controller program. I hereby want to update you on my progress.

A detailed description of the project can be found in the first blog. At a high level, this project involves 1) making a soft RISC-V processor (implementing the RTL), 2) making a custom FPGA PCB and successfully loading the RISC-V processor into the FPGA, 3) developing a flight controller program in C and running it on the RISC-V processor inside the FPGA, and finally 4) constructing and testing the drone. So far, I have completed the first task; that is, I have implemented and verified (to a certain extent) a 32-bit RISC-V processor.

In this blog, I discuss my design of the processor (the design is mostly inspired by the content of one of David Patterson’s books). I will also share a link to a detailed design diagram and the source code itself. My goal is to inspire and equip my readers to take on the same or similar projects and succeed.

What is a Processor?

A processor is a digital machine with two types of input: data and instructions. The input instructions tell the processor what to do with the input data. After manipulating the input data, the processor can modify its internal state and/or generate output data.

Embedded system processors (which we will focus on in this blog series) typically work with two distinct memories. The first is the Program Memory (PMEM), which carries the sequence of instructions the processor needs to execute. The other is the Data Memory (DMEM), which carries the input and output data of the processor. Based on the sequence of instructions in PMEM, the processor reads and modifies the contents of DMEM. This is illustrated in Figure 1.

Figure 1. Processor Input and Outputs

Composition of a Processor

The processor block is typically composed of the following elements. We will discuss each subsequently.

  1. Registers
  2. Arithmetic Logic Unit (ALU)
  3. Control Unit (CU)
  4. Immediate
  5. Program Counter (PC)

Registers

Registers are the internal memory of the processor. The processor can operate only on data that is stored in registers; it does not manipulate data directly in DMEM. Instead, data is first copied from DMEM to registers through load instructions. The direct output of a processor operation is written to one of the registers before it is copied to DMEM through store instructions.

The RISC-V instruction set manual specifies that the processor has 32 registers. In the 32-bit version of the standard, each of these registers is 32 bits wide. The first register is specified to always contain 0. As shown in Figure 2, an instruction can read one or two registers (source_0_idx and source_1_idx) and can write to one register (dest_idx).

Figure 2. RISC-V Registers

Before we move forward, it may be good to see what a RISC-V instruction looks like. A RISC-V processor is a processor that follows the RISC-V instruction set manual. The manual outlines and defines the set of instructions a RISC-V processor must support. In other words, a specification-compliant RISC-V processor can correctly run a program that is generated by a RISC-V compiler.

There are multiple versions of RISC-V processors. The basic 32-bit RISC-V instruction set involves more than 40 instructions (see the full list). As shown in Figure 2, each instruction is represented by 32 bits. Within these 32 bits, the instruction packs the opcode (opcode and func3), which determines the function of the instruction; the register operand indices (rs1 and rs2), which determine the operands of the instruction; the destination register index (rd), where the output is written; and a constant (imm), which serves as an additional operand.

An instruction format defines how the different components of the instruction are arranged within the available 32 bits. As shown in Figure 3, 32-bit RISC-V instructions use one of six formats.

Figure 3. RISC-V instruction formats

As an example, consider the RISC-V instruction ADDI rd, rs1, imm. This instruction adds the content of register rs1 to imm and stores the sum in register rd. If the actual instruction is ADDI 5,0,10, then the processor is instructed to add 10 to the contents of register 0 (which is always 0). After the processor finishes executing this instruction, we should find 10 in register 5.

ADDI follows the “I” instruction format (the second row in Figure 3; see here). The opcode of ADDI is 0x13, and the func3 value is zero. Accordingly, the 32-bit binary representation of ADDI 5,0,10 is constructed in Table 1.

Table 1. Binary Representation of ADDI 5,0,10

Instruction Element Bit index Value (Hex) Value (Binary)
Opcode [6:0] 0x13 001 0011
rd [11:7] 0x5 0 0101
func3 [14:12] 0x0 000
rs1 [19:15] 0x0 0 0000
imm [31:20] 0x00A 0000 0000 1010
Full Instruction (Binary) 0000 0000 1010 0000 0000 0010 1001 0011
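To double-check an encoding like the one in Table 1, I find it handy to reproduce it in a few lines of Python (the helper below is a sketch of mine, not part of the processor source):

```python
def encode_addi(rd, rs1, imm):
    """Encode ADDI rd, rs1, imm using the "I" format:
    imm[11:0] | rs1 | func3 | rd | opcode."""
    assert -2048 <= imm <= 2047, "I-format immediates are 12-bit signed values"
    opcode = 0x13  # ADDI opcode
    func3 = 0x0    # ADDI func3
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (func3 << 12) | (rd << 7) | opcode

# Matches the full instruction assembled in Table 1
assert encode_addi(5, 0, 10) == 0b0000_0000_1010_0000_0000_0010_1001_0011
```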
Immediate Block

The immediate block extracts the constant operand from an instruction (except for “R”-format instructions, which do not contain imm). The immediate block first reads the opcode and func3 (which are at fixed bit locations regardless of instruction format; see Figure 3), and then, based on the value of the opcode, it infers the format and the bit locations of imm.
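As a behavioral sketch of what the immediate block does for the “I” format (the helper name is mine, not from the RTL):

```python
def extract_imm_i(instr):
    # "I"-format immediate occupies bits [31:20]; it is sign-extended
    # from bit 31 to the full 32-bit operand width.
    imm = (instr >> 20) & 0xFFF
    return imm - 0x1000 if imm & 0x800 else imm

assert extract_imm_i(0x00A00293) == 10   # ADDI 5,0,10
assert extract_imm_i(0xFFF00293) == -1   # a negative immediate, sign-extended
```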

Arithmetic Logic Unit (ALU)

As discussed above, a RISC-V instruction contains an opcode, the list of operands, and a destination register address (if the instruction involves a register write). One of the operands is always a register and the other one can be a register or an immediate constant.

The ALU carries out arithmetic or logical operations on the operands. The opcode determines which operation the ALU should perform: addition, subtraction, AND, OR, XOR, logical shifts, and arithmetic shifts (see a complete list of instructions here).

Besides the main output, the ALU generates a set of flags that indicate aspects of the operation outcome: whether the output is zero, negative, or positive, and whether there was an overflow.

Figure 4. Contents of an Instruction and Interconnection between Registers, Immediate Block and ALU
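A behavioral model of such an ALU might look as follows in Python (the operation names and the exact flag set here are illustrative; the RTL decodes the operation from opcode and func3):

```python
MASK = 0xFFFFFFFF  # all values are 32-bit two's-complement words

def alu(op, a, b):
    shamt = b & 0x1F  # shifts use only the low 5 bits of the second operand
    if   op == "add": out = (a + b) & MASK
    elif op == "sub": out = (a - b) & MASK
    elif op == "and": out = a & b
    elif op == "or":  out = a | b
    elif op == "xor": out = a ^ b
    elif op == "sll": out = (a << shamt) & MASK           # logical shift left
    elif op == "srl": out = (a & MASK) >> shamt           # logical shift right
    elif op == "sra":                                     # arithmetic shift right
        signed_a = a - (1 << 32) if a & 0x80000000 else a
        out = (signed_a >> shamt) & MASK
    else:
        raise ValueError(f"unknown ALU op: {op}")
    # Flags summarizing the outcome, used e.g. by branch instructions
    flags = {"zero": out == 0, "negative": bool(out & 0x80000000)}
    return out, flags
```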

Control Unit (CU)

The CU can be considered the master block of the processor. Its first task is to extract the opcode and func3 (if any), which are always located at the same bit positions regardless of instruction format (see Figure 2). This opcode is an input to the ALU and the immediate block. More importantly, the CU generates control signals that configure the processor according to the function of the input instruction.

For example, one RISC-V instruction is the load instruction. A load instruction copies the content of a DMEM address into one of the registers. Accordingly, when the CU sees a load instruction, it raises register_wr_en (the enable for writing into one of the registers). Store instructions do the opposite: they copy the content of a register into a given DMEM location. Hence, when the CU sees a store instruction, it lowers register_wr_en and raises data_mem_wr_en (the enable for writing into DMEM). Figure 5 shows some of the control signals generated by the CU.

Figure 5. Control Unit Input and Output (partially)

Program Counter (PC)

As discussed above, instructions are stored sequentially in PMEM. Obviously, a running program contains more than one instruction, so the processor needs to know where in PMEM to read the next instruction from. The Program Counter (PC) stores the next read address of the PMEM.

In 32-bit RISC-V, each instruction spans 32 bits (4 bytes). Therefore, the processor increments the PC by 4 at every instruction. However, if the program encounters a taken branch, the PC is overwritten with the target address of the branch instruction.

Figure 6. Updating the Program Counter
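The PC update rule boils down to a few lines (a behavioral sketch; the argument names are mine, not the RTL signal names):

```python
def next_pc(pc, branch_taken=False, branch_target=None):
    # Default: advance to the next 4-byte instruction word.
    # On a taken branch, jump to the branch target instead.
    return branch_target if branch_taken else (pc + 4) & 0xFFFFFFFF

assert next_pc(0x100) == 0x104
assert next_pc(0x100, branch_taken=True, branch_target=0x1C0) == 0x1C0
```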

Putting it Together and Pipelining

We are now ready to put all the elements together. Figure 7 shows a simplified diagram of the processor we are building in this project. First, the instruction is read from PMEM. Then, this instruction is decoded inside the CU and the immediate block; in parallel, register operands are read out from the register block. After the register read and instruction decode, the instruction is executed in the ALU. The next stage involves the DMEM read or write, and the final stage involves the register write.

Figure 7. Simplified sketch of a Full RISC-V Processor

As you can infer from Figure 7, the processor elements are intentionally organized into five stages; this is done to allow pipelining. Pipelining allows the processor to start processing the next instruction before it finishes processing the current one. For example, when the current instruction is in stage 2 (the decode stage), the processor can go ahead and read PMEM for the next instruction without interfering with the execution of the current instruction. Otherwise, the PMEM read stage (stage 1) would unnecessarily remain idle until the current instruction goes through all five stages – which leads to inefficiency and slowness.

Figure 8 illustrates the benefits of pipelining using two instructions. It shows that pipelining can provide a 40% reduction in cycle count. The performance difference between the pipelined and unpipelined designs grows with the total number of instructions. For example, for 1000 instructions, the unpipelined design would take 5000 cycles to finish, while the pipelined design would take as few as 1004 cycles – a whopping 5x difference. Pipelining is a must in a processor design.

Figure 8. Benefits of Pipelining
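The cycle counts above follow from a simple formula: an unpipelined design needs stages × instructions cycles, while an ideal pipeline needs stages + (instructions − 1). A quick sanity check in Python:

```python
def total_cycles(n_instr, n_stages=5, pipelined=True):
    # One-cycle-per-stage idealization, as in the Figure 8 illustration.
    return n_stages + (n_instr - 1) if pipelined else n_stages * n_instr

assert total_cycles(2, pipelined=False) == 10
assert total_cycles(2) == 6              # 40% fewer cycles for two instructions
assert total_cycles(1000, pipelined=False) == 5000
assert total_cycles(1000) == 1004        # roughly a 5x difference
```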

Actual Design

Before discussing the intricacies of pipelining, I want to show the actual (fully detailed) design. You can find the Lucidchart block diagrams here (a free Lucidchart login is required to view them). One can understand the detailed (seemingly complicated) design easily after understanding the simplified sketch shown in Figure 7.

Below are the highlights of the actual design. I will not attempt to describe it in full detail here – one can get the complete detail by looking at the block diagrams and the source code.

Two Cycles per Stage

PMEM and DMEM are implemented using the Block RAM IP from Xilinx. Reads and writes to this memory cost two clock cycles – this means the instruction fetch and DMEM RD/WR stages of the pipeline last two cycles (not one cycle, as shown in the illustration in Figure 8). To keep the pipeline consistent, all the other stages (Decode, ALU, and Reg WR) are also designed to span two cycles.

Every Signal is Pipelined

Pipelining applies to all signals – a signal crossing from one stage to another should do so with a two-cycle delay. To illustrate this point, consider the JAL rd, imm instruction. This instruction causes an unconditional jump to address PC + imm and stores PC + 4 into register rd. The PC originates in stage 1, but the write to register rd happens at stage 5. Hence, the PC signal must be propagated through the stages until it reaches stage 5. As shown in Figure 9, the value that is written to rd is the PC from 6 cycles earlier, plus 4. Similarly, the rd address and the register write enable, which are both obtained at stage 2, need to be propagated through the pipeline.

Figure 9. Every signal must be pipelined

First Kind of Data Hazard

Pipelining becomes difficult if two consecutive instructions have dependencies on each other. For example, consider the following consecutive instructions:

  1. ADD 5,0,1 (Add the contents of registers 0 and 1 and store the result in register 5)
  2. ADD 6,0,5 (Add the contents of registers 0 and 5 and store the result in register 6)

The actual addition (in the ALU) of the second instruction assumes register 5 has already been updated by the first instruction. As shown in Figure 10, however, the ALU stage of the second instruction occurs before the register update from the first instruction completes. Unaddressed, this would result in incorrect program behavior.

To detect a data hazard, the processor needs to check whether the operand registers of the current instruction match the destination register of the previous instruction. Accordingly, data hazards can be detected as soon as the current instruction is fetched. As shown in Figure 10, the solution to this kind of pipelining hazard is called forwarding (see here for further reading). Forwarding leverages the fact that the value to be written to register 5 is already available at the end of the ALU stage of the first instruction; it merely needs to propagate through the pipeline before it is written to the register. Accordingly, if the ALU operation of the second instruction takes the ALU output of the first instruction instead of the current register readout, it will use the correct content of register 5. When this kind of hazard is detected, the ALU takes its input from its own output (the ALU output of the previous instruction).

Figure 10. First kind of data hazard, and addressing it using forwarding
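The detection logic for this hazard can be sketched as follows (a simplified Python model with names of my choosing; the actual RTL works on pipelined copies of these fields):

```python
def forward_from_alu(curr_rs1, curr_rs2, prev_rd, prev_writes_reg):
    """Decide, per operand, whether the ALU should take the previous
    instruction's ALU output instead of the register readout."""
    # Register 0 is hard-wired to zero, so a "write" to it never
    # creates a dependency.
    if not prev_writes_reg or prev_rd == 0:
        return (False, False)
    return (prev_rd == curr_rs1, prev_rd == curr_rs2)

# ADD 5,0,1 followed by ADD 6,0,5: forward the ALU output into operand 2
assert forward_from_alu(curr_rs1=0, curr_rs2=5, prev_rd=5,
                        prev_writes_reg=True) == (False, True)
```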

Second Kind of Data Hazard

The first kind of hazard is a (relatively) benign one in that it can be addressed without stalling the pipeline. The second kind of data hazard, however, requires a more elaborate mechanism. It occurs when the current ALU-using instruction depends on a memory read in the previous instruction. For example, consider the following consecutive instructions:

  1. LW 5,0,10 (Read DMEM at address 10 and store the outcome in register 5)
  2. ADD 6,0,5 (Add contents of register 5 to 0 and store the result in register 6).

As shown in Figure 11, the value going to register 5 is not even available by the time the second instruction is ready to do the addition inside the ALU. Accordingly, forwarding is not possible. In this design, this hazard is handled by injecting a no-operation (NoP) instruction followed by a repeat of the second instruction. Accordingly, this hazard slows down the pipeline by 2 stages (4 clock cycles).

Figure 11. Second kind of data hazard, and addressing it by instruction repeat
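The load-use case differs from the first hazard only in that the previous instruction is a load; a sketch of the detection (again a simplified model, not the RTL):

```python
def load_use_hazard(prev_is_load, prev_rd, curr_rs1, curr_rs2):
    """Detect the load-use hazard: the previous instruction is a load whose
    destination register is read by the current instruction. Forwarding from
    the ALU is impossible because the loaded value does not exist yet, so the
    pipeline must inject a NoP and repeat the current instruction."""
    return prev_is_load and prev_rd != 0 and prev_rd in (curr_rs1, curr_rs2)

# LW 5,0,10 followed by ADD 6,0,5: must stall
assert load_use_hazard(prev_is_load=True, prev_rd=5, curr_rs1=0, curr_rs2=5)
```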

Control Hazard (Caused by Branching)

The other type of pipeline hazard is caused by conditional branch instructions. In such instructions, the processor does not know whether to take the branch until the ALU computes the branch condition and updates the flags. By then, however, two other instructions will already be in the pipeline. If the branch is not taken, execution can carry on without interrupting the pipeline. However, if the branch is to be taken, the processor must somehow stall the execution of the two following instructions that are already in the pipeline.

Figure 12. Control hazards and how to address it

For example, consider the following instructions.

  1. BEQ 5,4,20 (If the contents of registers 5 and 4 are equal, branch to the instruction that is 20 bytes away – i.e., 5 instructions ahead)
  2. ADDI 6,0,5 (Put 5 in register 6).
  3. ADDI 7,0,5 (Put 5 in register 7).
  4. ADDI 8,0,5
  5. ADDI 9,0,5
  6. ADDI 6,0,8 (Put 8 in register 6).

While executing the first instruction, if the processor finds that the contents of registers 5 and 4 are the same, then the processor should in theory execute the 6th instruction next, not the second and the third. The hazard arises from the fact that the second and third instructions will already be in the pipeline. In this design, this hazard is handled by allowing the second and third instructions to carry on in the pipeline, but if the branch needs to be taken, the DMEM and register writes in the last two stages are disabled for those two instructions. The 6th instruction is fetched as soon as the ALU updates the flags in the branch instruction. Because of this hazard, the processor wastes 2 pipeline stages, or 4 cycles – and only if the branch is taken.
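The squashing mechanism described above can be sketched as follows (a simplified Python model; in the actual design, the pipelined write-enable signals are de-asserted):

```python
def resolve_branch(branch_taken, in_flight):
    """When the branch resolves in the ALU stage, the two following
    instructions are already in the pipeline. If the branch is taken, let
    them flow through but squash their side effects by de-asserting their
    write enables."""
    if branch_taken:
        for instr in in_flight:
            instr["register_wr_en"] = False
            instr["data_mem_wr_en"] = False
    return in_flight

# The two instructions fetched after BEQ, with their original write enables
pipe = [{"register_wr_en": True, "data_mem_wr_en": False},
        {"register_wr_en": True, "data_mem_wr_en": True}]
resolve_branch(True, pipe)
assert all(not i["register_wr_en"] and not i["data_mem_wr_en"] for i in pipe)
```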

Verification

The source code of this implementation can be found in our GitLab repository. After the implementation, the code was tested through a verification test bench, which you may find under the “tb” folder. All RISC-V instructions (except sync, system, and counter instructions) were tested individually with randomized test tasks. In particular, the register operands and the immediate values used in the tests were randomized to improve test quality and coverage. In addition, 8 other tests were added to exercise the data and control hazard handling logic. All these tests (45 in total) are now passing.

Next Steps

The next step in this project involves testing the processor with realistic code generated by actual compilers. In addition, we will be making a PCB for the FPGA that will host the RISC-V processor. I will come back with an update once these steps are completed. It will all be exciting!!!

From the Engineer’s Garage