Google Summer of Code – The Embedded Kitchen

BeagleLogic: Building a logic analyzer with the PRUs: Part 1

December 5, 2022May 20, 2015 by abhishek

At demonstrations whenever I am asked a question about how BeagleLogic works, it takes time to be able to explain how a low cost SBC can actually sample at digital signals at 100 MHz, what makes the BeagleBone Black so special, why can’t this be done with something like a Pi without adding any extra hardware.

This blog post is an attempt to start from (almost) scratch and explain the nuts and bolts of the BeagleLogic assembly and document the design decisions made last year as a reference for future application scenarios. This should be the first in a series of posts.

The PRUs, or the “Programming Real-Time Units” on the AM3358 SoC on the BeagleBone Black are two 200 MHz microcontrollers that run side-by-side the 1 GHz ARM CPU. They can be started, stopped, reset, programmed via the CPU and also share the same bus so the PRUs can independent of the CPU access the core peripherals like GPIO, memory, ADC, DMA, … and also have a GPI/GPO subsystem of their own (“Enhanced GPIO subsytem”).

These PRUs can be programmed in C using the TI PRU C Compiler (recommended) or PASM (now considered obsolete but still works) or a GCC Compiler port (in progress) and of course, hand tuned assembly code one can write for the PRU as well.

At the heart of a logic analyzer…

… is a register that samples the input signal(s) at regular intervals (decided by the sample rate in case of a free running sampler) or at the edge of an external clock signal. These samples are then recorded into the sample buffer which can be then used to extract and analyze the captured digital signal.

Note that this is the simplest case, in real life scenarios there are often trigger conditions like ‘Hey, start recording when there is, say, “a falling edge on pin B when pin A is high”, or “after 5 rising edges on pin C when pin D is high and pin E is low”‘ and so on.

The core of BeagleLogic is the simplest possible implementation. It simply samples and records the inputs into a buffer at a sample rate that can be configured in integer divisions of 100 MHz i.e. 100 MHz (100 / 1), 50 MHz (100 / 2), 33.33 MHz (100 / 3) and so on.

Now let us look at the building blocks available in the PRUs of the BeagleBone Black that enable us to achieve this.

MOV, SBCO and LBCO Instructions

A quick primer on the 3 most important instructions for data transfer in PRU assembly. For a detailed overview refer to the PRU instruction set manual.

MOV is used for moving data between registers

MOV R1, R2 // R1 gets entire contents of R2
MOV R1.b0, R2.b0 // least significant byte of R2 copied to LSbyte of R1
MOV R2.w1, R3.w0 // least significant halfword of R3 copied to MSHWord [HWord=16bits]

SBCO is used for moving data between a physical address and a register. This physical address can be within the PRU (e.g. PRU data RAM, shared RAM, power and control registers) or outside peripherals like the GPIO subsystem, the ADC subsystem, the EMIF subsystem (the DDR SDRAM controller), the GPMC. General usage:

SBCO &src, destination, offset, n

Moves n bytes from the src register into (*destination)+offset, this is indirect addressing. If n > 4, then the subsequent registers are accessed as well. Offset can be a register or an immediate value.

When using SBCO to write to the DDR memory, note that we must provide only physical memory addresses as there is no MMU involved when PRU accesses it. This is what most examples of the PRU that demonstrate shared memory access do.

Using R0 = 0x40000000, R1 = 0x100, try and guess what SBCO &R2, R0, R1, 32 does

LBCO loads data from an external address into a destination register. If n > 4, data gets loaded into the subsequent registers as well.

LBCO &dest, src, offset, n

Clock cycle counting

I’ve used the Cortex M3/4 microcontrollers and these cores have this nice register called DWT_CYCCNT which provides number of clock cycles the processor has executed code. So one can take the difference of DWT_CYCCNT register before and after a code block and get the number of cycles this code takes to execute. This allows cycle-accurate code profiling.

Lucky for us, the PRU has this neat feature as well, that I used for determining the number of processor cycles each instruction takes to execute. This is known as the CYCLE register¹. But before we can use it, we’ll have to enable it by setting bit 3 in the CTRL register using this code snippet:

MOV R1, CTPPR_0
MOV R2, 0x00000220 // C28 = 00_0220_00h = PRU0 CFG Registers
SBBO &R2, R1, 0, 4

LBCO &R1, C28, 0, 4 // Enable CYCLE counter
SET R1, 3
SBCO &R1, C28, 0, 4

Notice that we first modify the CTPPR_0 register which allows us to use the C28 register to refer to the PRU control registers instead of using up another register to hold the PRUCFG register address.

So, whenever we need to time a section of the code, we can do something like:

LBCO &R1, C28, 0xC, 4 // Load "before" cycle count into R1
// your assembly code here
LBCO &R2, C28, 0xC, 4 // Load "after" cycle count into R2

Now we can examine the contents of R1 and R2 to determine how many cycles it takes. You would also have to account for the “extra” clock cycles of the LBCO instruction.

I ran a lot of tests using this initially, here’s what I found:

every MOV operation is one cycle. In fact, any operation that does not access external memory or peripherals completes in a single cycle i.e. 5ns. This is by design.
LBCO and SBCO instructions with byte count 4 take 2 cycles. The way I hypothesize is that 1 cycle is spent to generate the address by adding the offset to it and then the actual data transfer operation takes 1 cycle per 32 bits transferred, thus O(n) time. Therefore the SBCO example in the previous section should take 9 cycles to complete (1+32/4), assuming there is no bus stall while writing the data to the memory.

We will use this information to help us with the timings needed for sampling.

Enhanced GPI/GPO feature

The PRU has an enhanced GPIO that operates at 200 MHz² and it implements a “Direct Input” GPI mode. What it means that whatever be the pin value at the PRU inputs at the sampling instant will be captured whenever register R31 of the PRU is read. ~~So, to sample first 8 bits of R31 into the register file, we can use the following PRU assembly code:~~ (edited for more clarity, referring issue #9 on GitHub) The first line of the snippet below shows how one can sample the lowest 8 bits of R31 which is connected to the PRU input pins. The entire snippet shows how to make 5 samples of the input pins and store them into successive registers.

MOV R10.b0, R31.b0
MOV R10.b1, R31.b0
MOV R10.b2, R31.b0
MOV R10.b3, R31.b0
MOV R11.b0, R31.b0

Observe:

We can and pack up to 4 samples into a single 32-bit register using the PRU assembly instruction (Rn.bx refers to the x’th byte in the register – each register is 4 bytes).
You might ask, why sample to the registers, and why not store this data in the 8 KB SRAM, or the 12 KB shared RAM, or even the DDR RAM?

* From the previous sections, we see that every SBCO would take 2 cycles but register access is just 1 cycle, so we can achieve higher sampling rates.
* While accessing the DDR RAM there is a very low but finite probability of bus conflicts while data is being written, and having such an instruction within the real-time sampling loop has the potential to compromise the sampling operation. So we would like to separate both of them.
3. Right now, we’ve only stored data in the registers, and there isn’t enough space to store all samples in the registers. So we need to somehow get the data out of there.

The PRU0/1 Scratchpad and the XIN/XOUT instructions

I remember initially reading this section in the PRU reference manual with skepticism and was disappointed to find scarce resources and/or example applications in the early phase of my GSoC period. But this turned out to be one of the important links in the puzzle.

Apart from the 30 registers in the two PRUs (R30 and R31 are connected to the GPO / GPI respectively), there’s also 30×3 independent register banks available as a scratchpad; and this is connected to the registers on both the PRUs using a “broadside interface”. Broadside means that all 30 registers are connected, and all of them can be moved in parallel. This means that in a single clock cycle one can copy or swap all 30 registers of a PRU with one of these 3 banks.

Here’s an example [syntax:: XOUT , &Rn, count]:

// On PRU1
XOUT 10, &R16, 32 // Copies R16-R23 into Bank0 (Bank0 = 10)

// On PRU0 - after XOUT has executed on PRU1
XIN 10, &R16, 32 // Reads R16-R23 from Bank0 into PRU0

This ability of moving bytes across the PRU barrier is crucial for BeagleLogic as it means that:

We can cleanly separate pin sampling (handled by PRU1) and data transfer (handled by PRU0).
Since pin sampling operates only on registers, it is effectively shielded from bus latencies.
Because register manipulation is cycle accurate we can design delay loops in PRU1 to give us a programmable sample rate, independent of PRU0 operation.
PRU0 can now directly push data into the 512 MB(!) of DDR memory directly giving BeagleLogic a buffer capacity so large at this price point. Note that due to packing the samples into 32 bit words the number of write transactions is cut down by a factor of 1/4th (8 bit samples) or half (16-bit samples) as compared to the sample rate, giving us cycles to spare. [The actual PRU firmware of BeagleLogic writes 32 bytes at a time into the DDR memory.]

Inter-PRU signaling

The final piece in the puzzle for a basic implementation is to have a way so that PRU1 can signal PRU0 that it has pushed data into Bank0 using XOUT, and that it can take data in using XIN and write it to the DDR memory. Interrupts. By configuring the mapping appropriately in the PRU interrupt controller (PINTC) one can send a signal to the other PRU by writing a value into the R31 register which triggers an interrupt (note we generally read from R31, not write). Similarly by waiting on bit 30 of R30 to be set, one can know from the other PRU when an interrupt has occurred. The interrupt configuration and mapping is in general handled by the library (libprussdrv) or the pru_remoteproc kernel driver, we just assume within the PRU that everything is already configured for us and is working.

Putting it all together

Let us try writing a sketch of a very simple firmware which implements a basic form of a 8-bit logic analyzer. This should be helpful when developing for a similar application scenario and understanding the behind-the-hood working of BeagleLogic.

PRU1 code:

MOV R16.b0, R31.b0 // take an 8bit sample
NOP
 MOV R16.b1, R31.b0
NOP
MOV R16.b2, R31.b0
NOP
L1: MOV R16.b3, R31.b0
XOUT 10, R16, 4 // Move 4 samples to Bank0
MOV R16.b0, R31.b0
MOV R31, PRU0_INTR // Signal PRU0
 MOV R16.b1, R31.b0
NOP
MOV R16.b2, R31.b0
JMP L1

Observe how writing an infinite loop this way allows us to sample 4 bytes, move it into Bank0 and signal the PRU0 that data is ready. Also, the time gap between two samples is 2 clock cycles, so this infinite loop samples the pins at 100 MHz. See? How could one sample at, say, 50 MHz?

Next, let’s write the segment of code in PRU0 to receive this and write it into memory

// Assume DDR buffer start address is in R0, offset is in R1
loop:
WBS R30, 30 // Wait till interrupt
// an SBCO to clear interrupt flag (omitted here for clarity)
XIN 10, R16, 4 // Receive data
SBBO &R16, R0, R1, 4 // Store it into the DDR mem
ADD R1, 4 // Increment dest address
JMP loop

Now observing how BeagleLogic does it – PRU1 Sampling code and PRU0 memory writing code should give an idea of the basic processes happening that make BeagleLogic possible. Of course, there is the overhead but it’s the same basic principle.

A similar post explaining the BeagleLogic kernel module is on it’s way soon.

section 4.5.1.4 of the AM335x reference manual ↩
section 4.4.1.2.3 of the AM335x reference manual ↩

The BeagleLogic cape: Assembly and Testing

April 23, 2015January 7, 2015 by abhishek

A day before Christmas, I was greeted by a yellow DHL packet from Hong Kong. The panels had finally arrived after 8 days – I had chosen rush (48hrs turnaround) and express shipping, and it did work out well as I received them before Christmas. The boards turned out to be nice. Not dirty at all and at a price which was reasonable to me.

Assembly

I reflowed the board with a hot air gun. Applied solder paste with a toothpick and placed the parts with the naked eye using a pair of good quality tweezers. The passives are all 0603 size (1.5mm x 0.76mm). Since I was yet to receive the batch of BSS138 FETs I ordered, I used a BC847 NPN transistor for the build instead. Reflowing left a couple of solder bridges on the 74LVCH16T245 (now onwards referred to as ‘the 245’) buffer chip (smallest pitched part on board – 0.5mm) which I removed using a solder wick. Then soldered the pin headers manually.

Cleaned up after soldering using nail polish remover. It’s terribly inefficient, and I plan to get a bottle of concentrated isopropyl alcohol (IPA) at some point of time. Here’s the finished result:

Assembled PCB - Back — BeagleLogic – Components side

“Smoke Test”

A colloquial reference to the first power-on of the assembled circuit. I plugged the cape into the BeagleBone Black and powered on the assembly. The LEDs lit up and the board booted. No smoke or burning smell. Woohoo, test passed, or …

Errata

I then probed the BeagleBone P8 header pins which serve as inputs to BeagleLogic using an LED, and did the same with the input pin headers. The LED were brighter on the inputs to the ‘245 than the P8 pins, which was contrary to expectations. A quick check on the schematics and turns out that I wired the 245 to translate from rail B to rail A while connecting the inputs to rail A and the Bone pins to rail B. Whoops! However I quickly resolved the problem by lifting pads 1 and 24 (DIR1 and DIR2) off the PCB (which was simple as those were on the ends) and soldered a bit of magnet wire through and to one of the pads of C1 going to +3.3V. Neatly done, and a lesson not to design and send off PCBs to a fab while pulling an all-nighter 😛 .

Testing

Once this was done all was back on track and the circuit worked as expected. The cape as designed did not interfere with the boot-up process due to the action of the transistor pull-down that ensured that the buffer did not drive the P8 pins until SYS_RESETn signal was high.

I tested it with SPI signals upto 24 MHz using the BeagleBoard itself and results are good at a 100Msps sample rate. I could also subject the input pins to 5V freely, without fear that I would burn the board. That’s exactly what the cape is meant for, and I’m happy with the results.

I am planning to give away the surplus boards from the first batch so anyone interested in testing it out can get in touch and depending upon where you are, I may be able to send you one of the surplus unpopulated boards (no parts included) which can be assembled yourself (1 IC, 4 passives, 1 transistor and pin headers). The Cape EEPROM section can be left unpopulated without affecting the functionality.

The design files are now available here. I made some changes to the design after fixing the errata, so the cape version is bumped to 1.1 .

Suggestions and feedback on the cape are welcome. The design is Open Hardware, so you have the freedom to use it and improve it as you like. Let me know if there’s anything that could be added in the cape as there is plenty of board real estate.

Introducing: The BeagleLogic Cape

February 21, 2015December 19, 2014 by abhishek

After coding up the BeagleLogic project, I thought that it would be great to have an add-on cape for the project that provides buffering and also makes the inputs of the BeagleBone Black tolerant to TTL logic voltage levels (up to 5.5V) allowing BeagleLogic to debug external projects with ease. Hence introducing the BeagleLogic cape, the 3D render of which you can see above. The design is done in KiCad.

The design source and gerbers will be made available on the BeagleLogic GitHub repository after I physically assemble and verify the design.

Design & Layout

The cape design is simple enough to just have a single layer layout, as you can see in the render above the top layer is entirely a ground plane but for a single trace. Since the top isn’t much populated I added useful information on the top silkscreen including indexing the pin headers on the Bone on both sides.

The logic channels are accessed via 2×14 right angled pin headers. The upper row of headers are the actual logic channels while the bottom row is all GND pins. The pin headers are arranged in a MSB-to-LSB fashion. This means that the rightmost pin when viewed from the top is raw bit 0 of the captured logic samples. Note that sigrok will use the names of the actual Bone pins so bit 0 (Channel 1) is to be identified as P8_45, bit 1 (Ch2) is P8_46 and so on. The numbering is a little non-obvious but it’s because that’s the way the pins are arranged on the BeagleBone GPIO header. But don’t worry as the cape lists the pin ID of each logic channel so you don’t have to look it up in the pin diagrams.

One important point here. Only the first 12 channels can be used by default. To use the last two channels, you must disable eMMC first and solder 0R resistors or bridge the two resistors R8 and R9 on the bottom side to enable them. Otherwise the buffer will drive those two pins and you will damage the eMMC of the board and also void the warranty.

Here’s a shot of the schematic (click to enlarge). This is for reference only with respect to the current board and the released schematic may or may not be the same

The active buffer is a TI 74LVCH16T245 or equivalent. The buffer is powered from the VDD_3V3B power rail. The OE pin initially pulled is driven using an arrangment of a BSS138 N-MOSFET whose gate is connected to SYS_RESETn of the Bone. This should ensure that the logic input pins, which are also the system boot pins, are not driven by the buffer until the startup has completed.

This version of the design has a 0R resistor through which the VDD_3V3B powers the VDDA side rail of the 16245. If you remove the short and connect it to a 1.8V supply it should become compatible with 1.8V logic levels. I am however thinking of a better solution to the problem and should address this in the next released design.

There’s the officially required cape EEPROM on the bottom side as well, I presume this could be rendered redundant as the community moves towards the Universal Cape concept. But the footprints are there, just in case.

Manufacturing

The first prototype cape has been manufactured by DirtyPCBs.com as a 2-layer Black 10x10cm protopack. It has been shipped as of the time of writing and should reach me next week. I ordered the boards as a Rush order (48h turnaround time) and got it shipped via DHL so that I could have the boards in hand before Christmas rush. I would be using their services further if the boards work out well, looking forward to receive them!

Since I had left space on the panel, and there’s free panelizing so I managed to squeeze some more of my designs into the panel and make the best use of the available real estate. I would write more about those in the coming posts.

So that’s pretty much it. Design suggestions are welcome, and I’ll see if they can be accomodated in the subsequent hardware revisions. Once I test and it all works, the design files will be made available as I have written above.

BeagleLogic: 14 weeks and counting; The journey so far

February 21, 2015September 2, 2014 by abhishek

BeagleLogic was born out of a project idea using the BeagleBone Black, and I am thankful to Jason Kridner and the BeagleBoard.org community for accepting it as a mentored project under the Google Summer of Code 2014. As the GSoC program officially ended last week, here is a video report highlighting what’s been accomplished so far, over a period of 3 months from 19th May to 18th August.

It was an awesome experience this summer, interacting with the community and developing the codebase for a project that showcases the capability of the two little yet powerful integrated Programmable Real-Time Units (PRUs) on the AM335x SoC that powers the BeagleBone Black to realize a 14-channel, 100Msps, 360 MB buffer Logic Analyzer which by far, offers the best value for a 55$ board which you can use not only to develop your embedded projects but also to debug them on-site without any extra hardware requirement. All the magic happens inside the firmware loaded into the PRUs and the kernel driver which manages sharing the system memory with them.

This project also saw me doing many things for the first time. It was the first time I compiled a Linux kernel and wrote myself a kernel module that is now a part of the BeagleBone community kernel, and I look forward to getting it merged upstream. The Web Client for BeagleLogic (link) is my first web application using HTML5 and Bootstrap for a lightweight web client for BeagleLogic. The backend I’ve written for the web client is my first Node.JS based application. I found these frameworks quite interesting and hope to work on projects in these areas in the future.

Google Summer of Code has been the best thing to happen to me so far, and I’ll be looking forward to it in the years to come.

Have fun with BeagleLogic. Also, do write back to me with how BeagleLogic helped you debug helped you debug your circuit and learn more about logic protocols!

BeagleLogic goes kernel-mode with PRU remoteproc[Week 2-3]

February 21, 2015June 11, 2014 by abhishek

The first week was spent well in prototyping the PRU firmware. Problems with the memory management [Maximum of 8 MB of memory] and dismal memory copy speeds [10-20 MB/s] with UIO and libprussdrv prompted me to escalate the back-end to the remoteproc kernel driver. It took me a while before I started off based on the WS28xx lighting firmware example to build up the necessary kernel driver infrastructure for BeagleLogic, and it’s now close to the first set of stress tests and actual data capture for the PRU.

Going into the kernel driver, here is a summary of what’s coming up in the next update [the experiment is with dummy data, the PRU’s aren’t engaged] :

BeagleLogic Week 1: Building the PRU Firmware

February 21, 2015May 27, 2014 by abhishek

The first week of the Summer of code began with the build of the core PRU firmware for data capture. I coded up firmware for both the Programmable Real-Time units operating in tandem to sample the pins and transfer it to the DDR memory directly. The code can be viewed at these links:
PRU0 assembly code
PRU1 assembly code

It makes good use of the XIN/XOUT broadside interface available for inter-PRU communication allowing movement a chunk of sampled data from PRU1 to PRU0 [currently 8 registers = 16 samples = 32 bytes] in one clock cycle. PRU0 then writes the data into the DDR memory in bursts of 32 bytes. Inter-PRU signalling is achieved through interrupts.

For buffer overflow/underflow detection, there is a global byte counter running in PRU0, which is moved alongwith the logic data “for free” via XOUT. PRU0 compares the received value of the counter with the value received from the previous interrupt, and if they differ by more than 32, then there has been an overflow. Also, when the ARM core is signalled for data, an interrupt counter is incremented. The counter is compared to its previous value, and the delta here again enables us to determine if an underflow has occurred.

This approach works fine for one-shot sampling, and I have been able to achieve all the way up to 4 MSamples of 12 pins running at 100 MHz [40ms], although the limitation on the maximum sample rate is likely to be the hardware of the BeagleBone Black in this case, remember there’s a 47pF capacitive load on the HDMI shared pins, and it is yet to be tested with actual hardware.

Currently there’s only 8 MB of memory shared with the PRU for storing samples, there’s an issue with the UIO kernel driver that prevents reserving more memory. The UIO driver will not be fixed, rather the issue will be addressed in the remoteproc interface of the PRU with the kernel. Until then, there is a workaround by adding mem=448m to the boot command line in uEnv.txt, to reserve the upper 64 MB of the memory for the PRU.

Reducing the sample rate is just inserting more NOPs into the sample loop to adjust the cycles. However, availability of more room between two sampling instructions turn out to be potential cycles for performing RLE (which will be implemented very soon), which seems achievable at 50 MHz.

Overwriting previous chunks of data works fine as well, stress tested upto overwrite 500x. See here for an example run.

The amount of samples collected is still limited by only the amount of memory available. The PRUs are quite fast and capable of sample rates like 100 MHz. The only bottleneck is to hold the data and process it before it runs out, and this is the target for the coming week.

The current test code will be adapted to the sigrok bindings once it is in good shape. The current stable code is available at the repo here, and the latest development is in the “prutest” branch here

Next update due on June 4th.

BeagleLogic Update 0 – 18th May 2014

February 21, 2015May 18, 2014 by abhishek

The introduction video to BeagleLogic is now released! Check it out here:

The slides are available here.

This week I spent time understanding the remoteproc implementation and UIO implementations of the kernel drivers of the PRU in the kernel source tree.

One of the important decisions this week was the confirmation of UIO implementation for the PRU drivers. Although a remoteproc implementation is a better approach, the initial implementation of the core would be in UIO, and as the remoteproc infrastructure improves, BeagleLogic will eventually migrate. Since this would be a core change, it would have minimal impact on the functionality.

All set for May 19th when the coding period commences.

Next update coming up on 21st May.