Key Concepts from Labs and Projects

Lab01 - Hello World in the RISC-V Dev Env

Console based editing: micro or vim
git and GitHub usage
Makefiles
Hello World in C
The Autograder

Project01 - RISC-V VM Access and Number Conversion in C

Dev Environment Setup
- Local RISC-V VM using qemu-system-riscv64
- Remote RISC-V VM on euryale
- ssh keys, ssh config, ssh authorized_keys
- Shell configuration
  - bash: ~/.profile, ~/.bash_profile, ~/.bashrc, ~/.bash_aliases
  - zsh: ~/.zprofile, ~/.zshrc
C Basics
- Functions
- Data (global, stack, heap)
- Statements and expressions
- Data types
  - Primitives - int, char, float, uint8_t, uint32_t, int32_t, uint64_t, int64_t`
  - Composit - arrays and structs
  - Pointers - int *, char *, uint32_t *, uint64_t *
- Pointer operations
  - & address of, e.g., int x; int *p; p = &x;
  - - dereference, e.g., x = *p;
- Strings in C
  - Array of chars (bytes)
  - Null \0 (0) terminated
  - ASCII character codes
- The C stack
  - Local variables
  - Allocated on function entry, deallocated on function exit (return)
- Signed (int, int32_t) vs unsigned (unsigned int, uint32_t) values
Command line arguments with int argc and char *argv
- Layout of argv array in memory
- Array of pointers first, argv[0], argv[1], argv[2], NULL (0)
- String arrays second, ./foo + \0, -p + \0, etc.
Parsing command line arguments
Number systems
- Decimal (base 10, digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
- Binary (base 2, digits: 0, 1, C literal prefix: 0b)
- Hexadecimal, “hex” (base 16: digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, C literal prefix: 0x)
  - A = 10, B = 11, C = 12, D = 13, E = 14, F = 15
base conversions
- C string as decimal, binay, hexadecimal to int
  - Place value: 2 * (10^2) + 5 * (10^1) + 4 * (10^0)
- int to C string as decimal, binary, hexadecimal
  - Modulus to find digit
  - Divide to get value for next digit
  - Repeat until divide gives 0

Lab02 - Introduction to RISC-V Assembly Programming

Instructions, registers, labels, directives
- See RISC-V Assembly Guide
Assembling vs compiling
- Assembling (as) assembly source (.s) to object code (.o)
- Compiling (gcc) c source (.c) to object code (.o)
- The object files must be linked, we can use gcc for this
  - as -o foo.o foo.s
  - gcc -o main main.c foo.o
- The C compiler can also assemble
  - gcc -o main main.c main.s
Instructions have a name (mnemonic), like add or mul
Assembly programming model
- A RISC-V processor has 32 registers plus the PC registers
  - x0, x1, … x31
  - ABI names: sp, ra, a0, a1, …, t0, t1, …, s0, s1, …
- The PC points to next instruction in memory to execute
- Load an instruction from memory into processor
- Execute the instruction, usually updates one or more register values, but can also update memory
- Update PC, often just the next instruction PC + 4, but can also be a diffenent address in the case of a branch or jump (control instruction)
Most instructions have three operands
- One target (destination)
- Two source
- add t0, t1, t1 means t0 = t1 + t2
On RISC-V 64 bit registers are 64 bits (8 bytes)
Data processing instructions
Immediate values
Control instructions
- conditional branchs: beq, bne, blt, bge
- jumps: j
- return: ret
Assembly arguments passed in a0, a1, a2, a3, etc.
Return value is put into a0

Basic assembly function

.global func_name
func_name:
    add a0, a0, a1
    ret

if/then/else in assembly

int x, y, r;
r = 0;
if (x > y) {
 r = r + 1;
} else {
 r = r + 99;
}

Asm: Assume a0 - int x, a1 - int y, t0 - int r

  li t0, 0
  bgt a0, a1, else
  addi t0, t0, 1
  j endif
else:
  addi t0, t0, 99
endif:

for loops in assembly

int r, i, n;
r = 0;
n = 10
for (i = 0; i < n; i++) {
  r = r + i;
}

Assume t0 - int r, t1 - int i, t2 - int n

  li r, 0
  li n, 10
  li i, 0
loop:
  bge t1, t2, loopend
  addi t0, t0, t1
  addi t1, t1, 1
  j loop
loopend:

Array access in assembly
- lw (load word) lw t0, (a0) means t0 = *a0
- Load word at address a0 into register t0
- Pointer based array access
  - To get to next element in array of words (ints): addi a0, a0, 4

Project02 - RISC-V Assembly Language

RISC-V ABI Function Calling Conventions
- Only one set of registers (32), so we need a way to coordinate usage between functions
- Caller-saved registers (a0, a1, …, a7, t0, t, … t6)
  - These must be preserved on the stack by the caller of a function
  - You only need to preserve the registers you will be using after the function call
- Callee-saved registers (sp, ra, s0, s1, … s11)
  - These must be preserved on the stack by the callee, on entry to the function.
  - They should be restored on exit.
  - The stack pointer is preserved by subtracting (stack allocation) on function entry and adding (stack deallocation) on function exit the name number of bytes.
  - The sp should be a multiple of 16 (for performance compatibility with some instructions)
Functions that don’t call other functions do not need to allocate stack space.
- Only need stack space if the function uses caller-saved registers

Function call template when using stack:

.global func_name
func_name:
    addi sp, sp, -N    # Allocate N bytes on stack, N must be a multiple of 16
    sd ra, (sp)         # Save ra onto stack
                        # Save any callee-saved registers if needed
    ...
    ld ra, (sp)
    addi sp, sp, +N
    ret

Indexed-based array access
- Assume t0 is int i, a0 is base address of array int arr[]
```
li t1, 4
mul t1, t1, t0
add t1, a0, t1
lw t2, (t1)
```
- Altneratively use a shift instead of multiply
```
slli t1, t0, 2
add t1, a0, t1
lw t2, (t1)
```
Recursive functions
- Nothing special, just that the func calls itself
- Same rules apply
- Presere any caller-saved registers on stack before the recursive call
- Restore any caller-saved registers from stack after the recursive call
- If a recursive function simply returns after the recursive call, then it is called tail recursive and the recursive call can be turned into a jump.

Project03 - RISC-V Assembly Language Part 2

Working with C strings in assembly
Use lb (load byte) and sb (store byte) to access characters in a string
Calling C functions from Assembly
- Just add a .global func_name for the function you want to use
- E.g., .global strlen
- Then set arg registers and use call instruction as normal
- Make sure to preserve any caller-saved registers before call
Memory
- Byte addressable
- Smallest addressable unit is a byte (8 bits)
- A int (word) is 4 bytes, need to decide how to layout a word in memory
- Two byte orderings
  - Big endian (put the most significant byte first)
  - Little endian (put the least significant byte first)
    int x = 0xFFAA1122 4 | | 4 | | 3 | 22 | 3 | FF | 2 | 11 | 2 | AA | 1 | AA | 1 | 11 | 0 | FF | 0 | 22 | Big Little Endian Endian
Two’s Complement (2’s Complement)
- Need a way to represent negative integer values
- To get the two’s complement negative representation of a positive value:
  - twos = invert_bits(val) + 1
- 3-bit two’s complement values
```
Bin Dec
000   0
001   1
010   2
011   3
100  -4
101  -3
110  -2
111  -1
```
- Two’s complement properties
  - The msb (most significant bit) is the sign bit (0 is pos, 1 is neg)
  - Only one representation of 0
  - Because of this, there is one more neg value than pos values
  - Simple grade school arithmetic work on two’s complement numbers
Bit manipulation and bitwise operators
- Zero (0) / One (1)
  - Other names: unset/set, low/higg, off/on, false/true
- A byte is 8 bits
```
msb | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | lsb
```
  - msb - most significant bit
  - lsb - least significant bit
- A word is 4 bytes or 32 bits
```
msb | 31      24|23      16|15       8|7        0| lsb
    |   byte 3  |  byte 2  |  byte 1  |  byte 0  |
```
- Bitwise Operaters (op, C, ASM)
  - AND , & , and/andi
  - OR , | , or/ori
  - NOT , ~
  - XOR , ^ ,
  - Shift Left Logical
    - Shift x to the left by n bits
    - Fill lower bits with 0
    - uint32_t x
    - x << n
    - sll/slli
  - Shift Right Logical
    - Shift x to the right by n bits
    - Fill upper bits with 0
    - uint32_t x
    - x >> n
    - srl/srli
  - Shift Right Arithmetic
    - Shift x to the right by n bits
    - Fill upper bits with sign bit value
    - int32_t x
    - x >> n
    - sra/srai
- Masking
  - It is often need to pull out a sequence of bits from a larger value
    - To get a sequence of bits:
    - Shift the bits to the far right
    - Mask the bits to zero out any upper bits we don’t want
    - Example: assume we want the middle 4 bits of an 8 bit value (bits 2-5)
      uint8_t x = 0b11011010; uint32_t v = x >> 2 // Shift the 4 bits to the beginning uint32_t v = v & 0b1111 // Mask off upper bits remaining in v
- Combine individual bit sequences with right shift and OR (|)
- Sign extension
  - Assume you have a 4-bit two’s complement number you want to sign extend to be a 32-bit two’s complement number.
    int32_t v = 0b1110 // -2 in 4 bits v = (v << 28) >> 28;
  - Shift all the way to the left and then do shift right arithmetic all the way back

Lab03 - RISC-V Machine Code Emulation

RISC-V Machine Code and Emulation
- Instructions are 32 bits (1 word), there is also a 16 bit compressed format
- Each instruction encodes source register(s), a destination registers, and target address or offset
- We decode the iw (instruction word) in steps
  - Look at opcode to determine instruction format type
  - Decode the iw based on opcode
  - Further decode iw by looking at other fields such as funct3 and funct7
- The base RISC-V emulator includes the emulated state as a struct:
```
struct rv_state {
  uint64_t regs[NREGS];
  uint64_t pc;
  uint8_t stack[STACK_SIZE];
};
```
  - NREGS is 32 (because RISC-V has 32 registers)
  - The pc is the program counter
  - STACK_SIZE is 8192, but for prorgrams with lots of function calls and/or local vars this may need to be bigger.
- Basic emulation
  - We will emulate assembly programs that have been linked with the emulator code
  - Initialize the rv_state with
    - A pointer to the function we want to emulate
    - The first 4 arguments as uint64_t values to send to the function
  - Get iw: iw = *pc
  - Get opcode from iw
  - Based on opcode call emulation function, like:
    - emu_r_type(rsp, iw)
    - rsp is struct rv_state *rsp
  - The emu functions will further decode the iw
    - Use shift and mask (get_bits()) to get each field
    - Update the state (registers and memory based on opcode, funct3, and funct7)
    - Update the pc, usually pc += 4 to go to the next instruction
      - For control instructions, we will compute either an offset and add to pc
      - Or, for jalr use a register value to update the pc
RISC-V Instruction Formats
- R-type |funct7[31:25]|rs2[24:20]|rs1[19:15]|funct3[14:12]|rd[11:7]|opcode[6:0]|
  - For register instructions like add, sub, sll
- I-type |imm_11_0[31:20]|rs1[19:15]|funct3[14:12]|rd[11:7]|opcode[6:0]|
  - For immediate instruction like addi, slli, lw, lb
  - Immediate must be constructed from parts and sign extended
  - int64_t imm = sign_extend(imm_11_0, 11)
- B-type |imm_12[31]|imm_10_5[30:25]|rs2[24:20]|rs1[19:15]|funct3[14:12]|imm_4_1[11:8]|imm_11[7]|opcode[6:0]|
  - For branches, like beq, bne, blt, bge
  - int64_t imm = sign_extend(imm_12 << 12 | imm_11 << 11 | imm_10_5 << 5 | imm_4_1 << 1)
  - If branch taken pc += imm, if not taken pc += 4
- J-type |imm_20[31]|imm_10_1[30:21]|imm_11[20]|imm_19_12[19:12]|rd[11:7]|opcode|
  - For ‘jal’ jump and link, also ‘j’
  - int64_t imm = sign_extend(imm_20 << 20 | imm_19_12 << 12 | imm_11 < 11 | imm_10_1 << 1)
  - pc += imm
RISC-V Pseudo Instructions
- The assembler provide many convenience instructions that do not exist as real instructions
- Examples
  - li a0, 99 is addi a0, zero, 99
  - j offset is jal zero, offset
  - call offset is jal ra, offset
  - bgt r1, r2, offset is blt r2, r1, offset

Project04 - RISC-V Emulation - Analysis - Cache Simulation

Additional RISC-V Instruction Formats
- S-type |imm_11_5[31:25]|rs2[24:20]|rs1[19:15]|funct3[14:12]|imm_4_0[11:7]|opcode[6:0]|
  - For store instuctions, like sb, sw, sd
  - Immediate must be constructed from parts and sign extended
  - int64_t imm = sign_extend(imm_11_5 << 5 | imm_4_0, 11)
- U-type |imm_31_12[31:12]|rd[11:7]|opcode[6:0]|
  - For lui (load upper immediate) and auipc add upper immediate to pc
  - Our code dosn’t use these
Note that load instructions are an I-type form
Dynamic Analysis
- Because we emulate each instruction we can do the following
  - Count the total number of instructions executed
  - Count the different types of instructions executed
    - I/R count, Loads, Stores, Jumps, Branches Taken, Branches Not Taken
  - Counts added to struct rv_state as a new struct struct rv_analysis
  - Need to update counts accurately in the emulation functions
Cache Memory and Simulation
- A Cache is a small and fast memory that holds recently accessed data from main memory
- Main memory is slow relative to CPU speed
- A cache can work because program exeuction adheres to two principles of locality
  - Principle of spatial locality - If code accesses a value, there is good chance it will access nearby values
  - Principle of temporal locality - If code accesses a value now there is a good chance it will do so again soon
- We send an address to a cache an request a value back
  - Hit - value is in the cache for the requested address
  - Miss - value is not in the cache for the requested address
  - Hit Rate = #hits / #refs
  - Miss Rate = #misses / #refs
  - In modern processes caches can achieve very high hit rates, > 95%
- Main cache ideas
  - For a given memory address, how to locate value in the cache?
  - How to replace a value?
  - How many bytes do we load at at time into the cache (block size)?
- Three main cache designs
  - Direct Mapped
  - Fully Associative
  - Set Associative
- Direct Mapped
  - Each address maps to one slot in the cache
  - On miss, just repace slot value with new value from memory
- Fully Associative
  - Each address can map to any slot in the cache
  - Use a form of LRU (least recently used) to choose which slot to evict on miss
- Set Associative
  - An address maps to a set of slots, and can map to any slot in the set
  - Use LRU within the slots in a set to choose a slot to evict
- Block Size
  - To support spatial locatlity a cache will read in more than the word requested
  - We can think of memory as an array of blocks
  - We need to find the block for a give address
  - Find the base of the block
  - The load the entire block into the cache slot