At demonstrations, whenever I am asked how BeagleLogic works, it takes time to explain how a low-cost SBC can actually sample digital signals at 100 MHz, what makes the BeagleBone Black so special, and why this can’t be done with something like a Pi without adding extra hardware.
This blog post is an attempt to start from (almost) scratch and explain the nuts and bolts of the BeagleLogic assembly and document the design decisions made last year as a reference for future application scenarios. This should be the first in a series of posts.
The PRUs, or “Programmable Real-Time Units”, on the AM3358 SoC of the BeagleBone Black are two 200 MHz microcontrollers that run alongside the 1 GHz ARM CPU. They can be started, stopped, reset and programmed from the CPU, and they share the same bus, so the PRUs can access core peripherals like GPIO, memory, ADC, DMA, … independently of the CPU. They also have a GPI/GPO subsystem of their own (the “Enhanced GPIO subsystem”).
These PRUs can be programmed in C using the TI PRU C Compiler (recommended), in assembly using PASM (now considered obsolete but still works), or with a GCC compiler port (in progress). Hand-tuned assembly code can of course be written for the PRU as well.
At the heart of a logic analyzer…
… is a register that samples the input signal(s) at regular intervals (decided by the sample rate in case of a free running sampler) or at the edge of an external clock signal. These samples are then recorded into the sample buffer which can be then used to extract and analyze the captured digital signal.
Note that this is the simplest case. In real-life scenarios there are often trigger conditions such as “start recording on a falling edge on pin B when pin A is high”, or “after 5 rising edges on pin C when pin D is high and pin E is low”, and so on.
The core of BeagleLogic is the simplest possible implementation. It simply samples and records the inputs into a buffer at a sample rate that can be configured in integer divisions of 100 MHz i.e. 100 MHz (100 / 1), 50 MHz (100 / 2), 33.33 MHz (100 / 3) and so on.
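The divisor arithmetic above can be sketched in a few lines of C. This is only an illustration that runs on any host; the helper name `sample_rate_hz` is made up for this example and is not part of the BeagleLogic sources.

```c
#include <stdint.h>

/* Base rate quoted in the post: 100 MHz. */
#define BASE_RATE_HZ 100000000u

/* Hypothetical helper: sample rate for an integer divisor (1, 2, 3, ...).
 * Integer division truncates, so divisor 3 gives 33333333 Hz.
 * Returns 0 for the invalid divisor 0. */
static uint32_t sample_rate_hz(uint32_t divisor)
{
    if (divisor == 0u) {
        return 0u;
    }
    return BASE_RATE_HZ / divisor;
}
```

For instance, `sample_rate_hz(2)` evaluates to 50000000, which matches the 50 MHz entry in the list above.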
Now let us look at the building blocks available in the PRUs of the BeagleBone Black that enable us to achieve this.
MOV, SBCO and LBCO Instructions
A quick primer on the 3 most important instructions for data transfer in PRU assembly. For a detailed overview refer to the PRU instruction set manual.
MOV is used for moving data between registers:
MOV R1, R2 // R1 gets entire contents of R2
MOV R1.b0, R2.b0 // least significant byte of R2 copied to LSbyte of R1
MOV R2.w2, R3.w0 // least significant halfword of R3 copied to the most significant halfword of R2 (halfword = 16 bits; .w1 would select bits 8-23)
SBCO stores data from registers to a physical address. This address can be within the PRU (e.g. PRU data RAM, shared RAM, power and control registers) or in outside peripherals like the GPIO subsystem, the ADC subsystem, the EMIF subsystem (the DDR SDRAM controller) or the GPMC. General usage:
SBCO &src, destination, offset, n
Moves n bytes, starting at the src register, to the address (*destination)+offset; this is indirect addressing. If n > 4, the subsequent registers are stored as well. Offset can be a register or an immediate value. For SBCO, destination is a constant table entry (C0 to C31); the sibling instruction SBBO takes the same operands but reads the base address from a register.
When using SBCO to write to the DDR memory, note that we must provide only physical memory addresses, as there is no MMU involved when the PRU accesses it. This is what most PRU examples that demonstrate shared memory access do.
Using R0 = 0x40000000 and R1 = 0x100, try and guess what SBBO &R2, R0, R1, 32 does (SBBO is the form of the store that takes its base address from a register).
LBCO loads data from an external address into a destination register. If n > 4, data gets loaded into the subsequent registers as well.
LBCO &dest, src, offset, n
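The burst semantics described above (n bytes moved, spilling into subsequent registers when n > 4) can be modelled on a host machine. The sketch below is a toy model and not PRU code: the type `regfile_t` and the functions `burst_store` and `burst_load` are hypothetical names, memory is a plain byte array, and the addresses used with it would be scaled down to fit that array.

```c
#include <stdint.h>
#include <string.h>

#define NUM_REGS 32

/* Toy model: the register file as 32 registers of 4 bytes each,
 * viewed as a flat byte array (R0 = bytes 0..3, R1 = bytes 4..7, ...). */
typedef struct {
    uint8_t bytes[NUM_REGS * 4];
} regfile_t;

/* Store n bytes, starting at register first_reg, to mem[base + offset].
 * Returns 0 on success, -1 if the request falls outside the model. */
static int burst_store(const regfile_t *rf, unsigned first_reg,
                       uint8_t *mem, size_t mem_size,
                       size_t base, size_t offset, size_t n)
{
    size_t src = (size_t)first_reg * 4;
    size_t dst = base + offset;
    if (src + n > sizeof rf->bytes || dst + n > mem_size) {
        return -1;
    }
    memcpy(mem + dst, rf->bytes + src, n);
    return 0;
}

/* Load n bytes from mem[base + offset] into registers, starting at first_reg. */
static int burst_load(regfile_t *rf, unsigned first_reg,
                      const uint8_t *mem, size_t mem_size,
                      size_t base, size_t offset, size_t n)
{
    size_t dst = (size_t)first_reg * 4;
    size_t src = base + offset;
    if (dst + n > sizeof rf->bytes || src + n > mem_size) {
        return -1;
    }
    memcpy(rf->bytes + dst, mem + src, n);
    return 0;
}
```

In this model a 32-byte store starting at R2 copies the bytes of R2 through R9, which is the "subsequent registers" behaviour the text describes.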
Clock cycle counting
I’ve used the Cortex M3/4 microcontrollers, and these cores have a nice register called DWT_CYCCNT which provides the number of clock cycles the processor has executed. So one can take the difference of the DWT_CYCCNT register before and after a code block and get the number of cycles this code takes to execute. This allows cycle-accurate code profiling.
Lucky for us, the PRU has this neat feature as well, which I used to determine the number of processor cycles each instruction takes to execute. It is known as the CYCLE register. But before we can use it, we have to enable it by setting bit 3 in the CTRL register using this code snippet:
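MOV R1, CTPPR_0 // address of the constant table pointer register
MOV R2, 0x00000220 // C28 = 00_0220_00h = PRU0 control registers
SBBO &R2, R1, 0, 4 // point C28 at the control registers
LBCO &R1, C28, 0, 4 // read the CTRL register
SET R1, 3 // set bit 3 to enable the CYCLE counter
SBCO &R1, C28, 0, 4 // write the CTRL register back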
Notice that we first modify the CTPPR_0 register, which lets us use the C28 constant table entry to refer to the PRU control registers instead of using up another register to hold their address.
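The comment in the snippet above gives the mapping as C28 = 00_nnnn_00h, i.e. the 16-bit value written to the pointer field is shifted left by 8 bits. A small host-side sketch of that arithmetic, with the hypothetical helper name `c28_address`:

```c
#include <stdint.h>

/* Hypothetical helper: address that C28 refers to for a given 16-bit
 * pointer value, following the 00_nnnn_00h pattern from the snippet. */
static uint32_t c28_address(uint16_t pointer_field)
{
    return (uint32_t)pointer_field << 8;
}
```

With the value 0x0220 used above, `c28_address(0x0220)` evaluates to 0x00022000, so an offset of 0xC from C28 corresponds to 0x0002200C.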
So, whenever we need to time a section of the code, we can do something like:
LBCO &R1, C28, 0xC, 4 // Load "before" cycle count into R1
// your assembly code here
LBCO &R2, C28, 0xC, 4 // Load "after" cycle count into R2
Now we can examine the contents of R1 and R2 to determine how many cycles it takes. You would also have to account for the “extra” clock cycles of the LBCO instruction.
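The bookkeeping for such a measurement can be written out as a short C sketch. The function name `elapsed_cycles` and the `read_overhead` parameter are assumptions made for illustration; the overhead value is not specified here and would have to be measured, for example by timing an empty block.

```c
#include <stdint.h>

/* Hypothetical helper: cycles spent in the timed block.
 * before / after  : the two CYCLE readings (R1 and R2 in the snippet).
 * read_overhead   : cycles attributed to the counter read itself
 *                   (an assumption; measure it with an empty block).
 * Returns 0 if the readings are inconsistent. */
static uint32_t elapsed_cycles(uint32_t before, uint32_t after,
                               uint32_t read_overhead)
{
    if (after < before) {
        return 0u;
    }
    uint32_t raw = after - before;
    return (raw < read_overhead) ? 0u : raw - read_overhead;
}
```

For example, readings of 100 and 110 with an assumed overhead of 2 cycles give `elapsed_cycles(100, 110, 2)`, which evaluates to 8.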
I ran a lot of tests using this initially; here’s what I found:
- Every MOV operation is one cycle. In fact, any operation that does not access external memory or peripherals completes in a single cycle, i.e. 5 ns. This is by design.
- LBCO and SBCO instructions with byte count 4 take 2 cycles. My hypothesis is that 1 cycle is spent generating the address by adding the offset to the base, and the actual data transfer then takes 1 cycle per 32 bits transferred, thus O(n) time. Therefore the 32-byte store example in the previous section should take 9 cycles to complete (1 + 32/4), assuming there is no bus stall while writing the data to memory.
We will use this information to help us with the timings needed for sampling.
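The hypothesized cost model (one cycle for address generation plus one cycle per 32 bits transferred) can be written as a small C function. This is a sketch of the hypothesis only: it ignores bus stalls, and rounding up for byte counts that are not a multiple of 4 is an assumption added here. The name `transfer_cycles` is hypothetical.

```c
/* Hypothesized cycle cost of a burst load/store of n_bytes:
 * 1 cycle for address generation + 1 cycle per 32-bit word.
 * Partial words are rounded up (an assumption of this sketch). */
static unsigned transfer_cycles(unsigned n_bytes)
{
    return 1u + (n_bytes + 3u) / 4u;
}

/* One PRU cycle at 200 MHz lasts 5 ns. */
static unsigned transfer_time_ns(unsigned n_bytes)
{
    return transfer_cycles(n_bytes) * 5u;
}
```

Under this model `transfer_cycles(4)` is 2 and `transfer_cycles(32)` is 9, matching the two figures quoted above.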
Enhanced GPI/GPO feature
The PRU has an enhanced GPIO that operates at 200 MHz and implements a “Direct Input” GPI mode. This means that whatever value is present on the PRU input pins at the sampling instant is captured whenever register R31 of the PRU is read. (Edited for more clarity, referring to issue #9 on GitHub.) The first line of the snippet below shows how one can sample the lowest 8 bits of R31, which are connected to the PRU input pins, into the register file. The entire snippet shows how to take 5 samples of the input pins and store them into successive registers.
MOV R10.b0, R31.b0
MOV R10.b1, R31.b0
MOV R10.b2, R31.b0
MOV R10.b3, R31.b0
MOV R11.b0, R31.b0
Observe:
- We can pack up to 4 samples into a single 32-bit register using the PRU assembly instruction (Rn.bx refers to the x’th byte in the register; each register is 4 bytes).
- You might ask: why sample into the registers, and why not store this data in the 8 KB SRAM, or the 12 KB shared RAM, or even the DDR RAM?
  * From the previous sections, we see that every SBCO would take 2 cycles but register access takes just 1 cycle, so we can achieve higher sampling rates.
  * While accessing the DDR RAM there is a very low but finite probability of bus conflicts while data is being written, and having such an instruction within the real-time sampling loop has the potential to compromise the sampling operation. So we would like to separate the two.
- Right now, we’ve only stored data in the registers, and there isn’t enough space to store all samples there. So we need to somehow get the data out.
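The byte-lane packing done by the MOV sequence above can be mirrored in C for anyone decoding captured data on the host side. This is a sketch; `pack_samples` and `unpack_sample` are hypothetical names, with sample 0 landing in byte lane b0, sample 1 in b1, and so on.

```c
#include <stdint.h>

/* Pack four 8-bit samples into one 32-bit word:
 * s[0] goes to byte lane b0 (bits 0-7), s[3] to b3 (bits 24-31). */
static uint32_t pack_samples(const uint8_t s[4])
{
    return (uint32_t)s[0]
         | ((uint32_t)s[1] << 8)
         | ((uint32_t)s[2] << 16)
         | ((uint32_t)s[3] << 24);
}

/* Extract the sample stored in byte lane idx (0..3) of a packed word. */
static uint8_t unpack_sample(uint32_t word, unsigned idx)
{
    return (uint8_t)(word >> (8u * (idx & 3u)));
}
```

Packing the samples 0x11, 0x22, 0x33, 0x44 in that order gives the word 0x44332211.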
The PRU0/1 Scratchpad and the XIN/XOUT instructions
I remember initially reading this section in the PRU reference manual with skepticism and was disappointed to find scarce resources and/or example applications in the early phase of my GSoC period. But this turned out to be one of the important links in the puzzle.
Apart from the 30 general-purpose registers in each of the two PRUs (R30 and R31 are connected to the GPO and GPI respectively), there are also 3 independent banks of 30 registers each available as a scratchpad, connected to the registers of both PRUs through a “broadside interface”. Broadside means that all 30 registers are connected, and all of them can be moved in parallel. This means that in a single clock cycle one can copy or swap all 30 registers of a PRU with one of these 3 banks.
Here’s an example [syntax: XOUT bank, &Rn, count]:
// On PRU1
XOUT 10, &R16, 32 // Copies R16-R23 into Bank0 (Bank0 = 10)
// On PRU0 - after XOUT has executed on PRU1
XIN 10, &R16, 32 // Reads R16-R23 from Bank0 into PRU0
This ability of moving bytes across the PRU barrier is crucial for BeagleLogic as it means that:
- We can cleanly separate pin sampling (handled by PRU1) and data transfer (handled by PRU0).
- Since pin sampling operates only on registers, it is effectively shielded from bus latencies.
- Because register manipulation is cycle accurate we can design delay loops in PRU1 to give us a programmable sample rate, independent of PRU0 operation.
- PRU0 can now push data directly into the 512 MB(!) of DDR memory, giving BeagleLogic a buffer capacity that is remarkably large at this price point. Note that because the samples are packed into 32-bit words, the number of write transactions is cut to one quarter (8-bit samples) or one half (16-bit samples) of the sample rate, giving us cycles to spare. [The actual PRU firmware of BeagleLogic writes 32 bytes at a time into the DDR memory.]
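To put rough numbers on "cycles to spare", the write rate and the cycle budget per write can be estimated as follows. This is back-of-the-envelope arithmetic, not a measurement; the function names are hypothetical and the 200 MHz PRU clock is the figure quoted earlier in this post.

```c
#include <stdint.h>

#define PRU_CLOCK_HZ 200000000u

/* Write transactions per second for a given sample rate, sample width
 * and number of bytes moved per write. Returns 0 for a zero write size. */
static uint32_t writes_per_second(uint32_t sample_rate_hz,
                                  uint32_t bytes_per_sample,
                                  uint32_t bytes_per_write)
{
    if (bytes_per_write == 0u) {
        return 0u;
    }
    return (uint32_t)(((uint64_t)sample_rate_hz * bytes_per_sample)
                      / bytes_per_write);
}

/* PRU0 cycles available between two consecutive writes. */
static uint32_t cycles_per_write(uint32_t writes_per_sec)
{
    if (writes_per_sec == 0u) {
        return 0u;
    }
    return PRU_CLOCK_HZ / writes_per_sec;
}
```

At 100 MHz with 8-bit samples, 4-byte writes come to 25 million writes per second (8 cycles each), while 32-byte writes come to 3,125,000 per second, leaving 64 cycles between writes.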
Inter-PRU signaling
The final piece in the puzzle for a basic implementation is a way for PRU1 to signal PRU0 that it has pushed data into Bank0 using XOUT, so that PRU0 can take the data in using XIN and write it to the DDR memory. The answer is interrupts. By configuring the mapping appropriately in the PRU interrupt controller (PINTC), one PRU can signal the other by writing a value into the R31 register, which triggers an interrupt (note that we generally read from R31, not write to it). Similarly, by waiting for bit 30 of R31 to be set, a PRU can know when the other PRU has raised an interrupt. The interrupt configuration and mapping is in general handled by the library (libprussdrv) or the pru_remoteproc kernel driver; within the PRU we just assume that everything is already configured for us and is working.
Putting it all together
Let us try writing a sketch of a very simple firmware which implements a basic form of an 8-bit logic analyzer. This should be helpful when developing for a similar application scenario and for understanding the under-the-hood working of BeagleLogic.
PRU1 code:
MOV R16.b0, R31.b0 // take an 8bit sample
NOP
MOV R16.b1, R31.b0
NOP
MOV R16.b2, R31.b0
NOP
L1: MOV R16.b3, R31.b0
XOUT 10, &R16, 4 // Move 4 samples to Bank0
MOV R16.b0, R31.b0
MOV R31, PRU0_INTR // Signal PRU0
MOV R16.b1, R31.b0
NOP
MOV R16.b2, R31.b0
JMP L1
Observe how writing an infinite loop this way allows us to sample 4 bytes, move it into Bank0 and signal the PRU0 that data is ready. Also, the time gap between two samples is 2 clock cycles, so this infinite loop samples the pins at 100 MHz. See? How could one sample at, say, 50 MHz?
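One way to reason about other sample rates is to count how many PRU cycles separate two sampling MOVs and how many of those must be filled by other single-cycle instructions (NOPs, or useful work such as XOUT, the signalling MOV and the JMP in the loop above). The helpers below are a hypothetical sketch of that arithmetic, assuming the 200 MHz PRU clock and the integer divisors of 100 MHz mentioned earlier.

```c
#include <stdint.h>

#define PRU_CLOCK_HZ 200000000u
#define BASE_RATE_HZ 100000000u

/* PRU cycles between two consecutive sampling MOVs when the sample
 * rate is BASE_RATE_HZ / divisor. */
static uint32_t cycles_per_sample(uint32_t divisor)
{
    return (PRU_CLOCK_HZ / BASE_RATE_HZ) * divisor;
}

/* Single-cycle filler slots needed after each sampling MOV.
 * Returns 0 for the invalid divisor 0. */
static uint32_t filler_cycles(uint32_t divisor)
{
    if (divisor == 0u) {
        return 0u;
    }
    return cycles_per_sample(divisor) - 1u;
}
```

For the 100 MHz loop above, `filler_cycles(1)` is 1, which is the single slot after each MOV. For 50 MHz, `filler_cycles(2)` is 3, so under these assumptions each sample would be followed by three single-cycle slots.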
Next, let’s write the segment of code in PRU0 to receive this and write it into memory:
// Assume DDR buffer start address is in R0, offset is in R1
loop:
WBS R31, 30 // Wait till interrupt (status bit 30 of R31)
// an SBCO to clear interrupt flag (omitted here for clarity)
XIN 10, &R16, 4 // Receive data
SBBO &R16, R0, R1, 4 // Store it into the DDR mem
ADD R1, R1, 4 // Increment dest offset
JMP loop
Now, looking at how BeagleLogic does it (the PRU1 sampling code and the PRU0 memory writing code) should give an idea of the basic processes that make BeagleLogic possible. Of course, there is some overhead, but it’s the same basic principle.
A similar post explaining the BeagleLogic kernel module is on its way soon.