Raspberry Pi Pico PIO - VGA: Homebrew Video Interface Program Ep.12

  • Published 28 Nov 2024

COMMENTS • 43

  • @Nāmarūpa1
    @Nāmarūpa1 1 year ago +1

    Amazing stuff David. Thank you so much for sharing your wisdom.

    • @LifewithDavid1
      @LifewithDavid1  1 year ago

      Glad you enjoyed it! I wish I was wise; I'm just curious and stubborn.

  • @kippie80
    @kippie80 2 years ago +2

    By default, all memory is used in a striped configuration. If you segment the memory and dedicate banks to the CPU, DMA, etc., you can eliminate blocking. The trade-off is that your code now has to be aware of memory banks. Great video! Thanks so much! (Yes, I've been reading the almost 1000-page datasheet.)

    • @LifewithDavid1
      @LifewithDavid1  2 years ago +2

      Thanks for watching! That's what I have been struggling with; to get to where I can manhandle the memory banks, I have to get a whole lot smarter. I have gone through the Raspberry Pi Foundation's code for their scan video, and I now understand about 70% of what they are doing, but it's that 30% that really baffles me. And that's the 30% I need to get rid of the tearing. :-)

    • @TheFerdi265
      @TheFerdi265 2 years ago +1

      usually when filling the PIO FIFO from DMA, the difference between using striped memory vs banked memory is not that large. since the 25MHz pixel clock means that the PIO will only consume a pixel every 5 clock cycles (assuming the default CPU frequency), there's plenty of time even if two bus masters compete for bus access, and that's assuming they always access the same memory bank. (also, depending on the pixel format used, you can usually get more than one pixel into a DMA word; with 16bpp, you get 2 pixels per DMA copy, so that's roughly only one transfer every 10 cycles)
      striped memory is actually quite great at mitigating this, since parallel linear readouts of memory will basically automatically synchronize in a way that, after an initial bus collision, they will always read from different memory banks if they copy at the same speed
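
The cycle budget in this comment can be checked with a little arithmetic. The sketch below assumes a 125 MHz system clock (the RP2040 default), the 25 MHz pixel clock quoted above, and 32-bit DMA transfers; the function name is made up for illustration.

```c
#include <stdint.h>

/*
 * System-clock cycles between consecutive DMA word transfers.
 * Assumes one DMA transfer moves a 32-bit word holding 32 / bits_per_pixel pixels.
 */
static uint32_t cycles_per_dma_word(uint32_t sys_clk_hz, uint32_t pixel_clk_hz,
                                    uint32_t bits_per_pixel)
{
    uint32_t cycles_per_pixel = sys_clk_hz / pixel_clk_hz; /* 125 MHz / 25 MHz = 5 */
    uint32_t pixels_per_word  = 32u / bits_per_pixel;      /* 16 bpp -> 2 pixels   */
    return cycles_per_pixel * pixels_per_word;
}
```

With these assumptions, `cycles_per_dma_word(125000000u, 25000000u, 16)` returns 10, which matches the "roughly every 10 cycles" estimate; at 8 bpp the same call returns 20.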

  • @TheFerdi265
    @TheFerdi265 2 years ago +2

    When I first got VGA running on my pico, I also had some tearing problems, but not as regular as yours, but I got it pretty smooth in the end without doing lots of trickery.
    There were 2 situations in which I saw tearing in my projects:
    1) having large amounts of code or data in flash that run frequently sometimes caused cache misses which stalled the DMA transfer. having the DMA copy from a buffer in RAM instead made it much more stable. it can also help to put important IRQ handlers into RAM to remove the possibility of having a cache miss in those
    2) IRQ handlers being somehow too slow to restart the DMA. setting up the DMA to run continuously and retriggering itself removes the need for CPU needing to restart the DMA in time after a scanline, which makes the timing less likely to be too late. You can even set it up in a way that ping-pongs between a set of buffers to be able to do double or triple buffering, so you can edit a buffer while the other(s) are copied to the PIO.
    I also used a slightly different PIO setup. you can get away with using much less code on the PIO for sync pulses if you stream in a list of commands for building the sync pulse from the CPU (e.g.: , , ). that way you can do the whole sync pulse (v and h) on a single PIO SM, which is also what the pico_scanvideo library does.
    the pico_scanvideo library is quite a treasure trove of interesting ideas, i ended up stealing quite a bit of it, but I found all that buffer management it does to be overkill and far more complex than necessary, so I did my own. My code is not published anywhere yet, but I do intend to at some point.
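
The idea of streaming sync commands from the CPU can be sketched on the host side as follows. The command layout here (two pin levels plus a duration in pixel clocks) is an assumption for illustration only; it is not the commenter's format and not the encoding pico_scanvideo uses. The timing constants are the standard 640x480 at 60 Hz values, with both sync pulses active low.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard 640x480@60 timing: pixel clocks horizontally, lines vertically. */
enum {
    H_VISIBLE = 640, H_FRONT = 16, H_SYNC = 96, H_BACK = 48, /* 800 clocks per line */
    V_VISIBLE = 480, V_FRONT = 10, V_SYNC = 2,  V_BACK = 33  /* 525 lines per frame */
};

/* Hypothetical command: hold both sync pins at these levels for `clocks` pixel clocks. */
typedef struct {
    uint8_t  hsync;  /* pin level; 0 means the (active-low) pulse is asserted */
    uint8_t  vsync;
    uint16_t clocks;
} sync_cmd;

/* Fill `out` with the four commands for scanline `line` (0..524); returns the count. */
static size_t build_scanline_cmds(unsigned line, sync_cmd out[4])
{
    unsigned vs_start = V_VISIBLE + V_FRONT;
    uint8_t v = (line >= vs_start && line < vs_start + V_SYNC) ? 0 : 1;

    out[0] = (sync_cmd){ 1, v, H_VISIBLE }; /* active video      */
    out[1] = (sync_cmd){ 1, v, H_FRONT };   /* front porch       */
    out[2] = (sync_cmd){ 0, v, H_SYNC };    /* horizontal pulse  */
    out[3] = (sync_cmd){ 1, v, H_BACK };    /* back porch        */
    return 4;
}
```

A single state machine could then pull one such command at a time, set both pins, and count down `clocks`, so the PIO program itself needs no knowledge of the video mode.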

    • @LifewithDavid1
      @LifewithDavid1  2 years ago +1

      Thank you so much for the comments. It gives me quite a bit to consider. I used more "ping-ponging" of DMA with my recent arbitrary waveform generator; I'll have to incorporate more of that into my video display. Thanks again!

  • @kpharck
    @kpharck 1 year ago

    No way I'd give you a "thumbs down", sir! Thank you for your videos!

  • @TheDarkelvenangel
    @TheDarkelvenangel 2 years ago +1

    This is very impressive and very well explained, thank you for all this effort.

    • @LifewithDavid1
      @LifewithDavid1  2 years ago

      Thank you for your kind words and thanks for watching.

  • @Ololoshize
    @Ololoshize 1 year ago +1

    Impressive, thank you.

  • @PeranMe
    @PeranMe 2 years ago +2

    Thank you, this is great stuff!!

  • @MakunaRGBIC
    @MakunaRGBIC 2 years ago +1

    The pin interrupt handlers, are they called on the same core as the main? Are they called on the same core from hsync/vsync? Is the system blocking cores to handle these?
    I know it's a minor issue, but maybe use two separate IRQ handlers and get rid of the extra instructions that compare who it's being called for (under the model of keeping IRQs as short as possible).

    • @LifewithDavid1
      @LifewithDavid1  2 years ago

      Thanks so much for the comment! Speeding up the interrupts would help out. I was under the impression that there could be only one gpio_callback routine per program, and that I had to do comparisons to find out what was triggering the interrupt. Getting rid of comparisons would certainly speed things up. If I remember correctly, I tried a few things to see how I could make the interrupts faster.
      I've since learned some things with my AWG projects. When I get back to this project, I'll have to do some soul searching to see how I can apply what I learned. Thanks for watching!

  • @Marc_Wolfe
    @Marc_Wolfe 1 year ago

    I need a better understanding of coding these counter loops used in the V sync PIO program. Also, I assume interrupts can be used between PIOs without being acknowledged by CPU cores etc? Looks like with a better understanding it would be simple to implement a pulse counter (count crank sensor teeth), reset/synchronized by cam sensor, PIO per cylinder (or pair with waste spark) each with a different count, and CPU dictates adding/subtracting tooth count for spark advance. Could do one for all cylinders, except I vaguely remember doing math that suggests overlapping spark outputs for high RPM engines (or at least overlapping dwell time or something). Fuel can just be basic duty cycle adjustment... of course fully sequential would be cooler.

    • @LifewithDavid1
      @LifewithDavid1  1 year ago

      That's what I was thinking. My next video will dig into interrupts in more detail.

  • @eli3963
    @eli3963 2 years ago

    Thanks for the super detailed videos! I'd personally be interested in seeing a QSPI controller. The Pico has two SPI peripherals, but as far as I can tell there's no way for those to operate in QSPI mode. Chips which offer QSPI usually initialize in normal SPI mode, so a controller would need to be capable of both.

    • @LifewithDavid1
      @LifewithDavid1  2 years ago

      Although the Pico doesn't have QSPI, the RP2040 does and uses it to directly connect to the flash memory. According to the RasPi datasheets, warnings are given about keeping lead lengths short and direct since the bus speed is extremely fast and crosstalk is an issue. It would be interesting to try to implement a low speed QSPI in PIO; but I don't have the instrumentation or knowledge needed to troubleshoot it (yet?). Thanks for the suggestion, it gives me some ideas for the future.

  • @slimhazard
    @slimhazard 2 years ago +1

    14:24 tight_loop_contents is a function call, that should have the open and close parentheses: tight_loop_contents(). I'm surprised there was no compiler error. No idea if that has anything to do with the problem.

    • @LifewithDavid1
      @LifewithDavid1  2 years ago

      Thanks, I'll have to try that. I copied that function from one of Raspberry Pi's example programs, which didn't have parentheses either. However, I commented out "TightLoopContents" and I still saw the tearing.

    • @slimhazard
      @slimhazard 2 years ago

      @LifewithDavid1 come to think of it, you can do better than tight_loop_contents(), if you want the main program to enter a state of "stop and don't do anything" while PIO and DMA are off doing their thing. Call __wfi() instead, for "wait for interrupt", which puts the processor in a sleep state until an interrupt is detected. If there is an interrupt, the while loop loops back around and puts the processor back to sleep again. tight_loop_contents() does exactly nothing, so you end up with a busy-wait loop, wastefully spinning around and around. No idea why all of the examples do it that way, it seems terribly wasteful. (But still no idea if __wfi() will do anything to fix the problem.)

    • @davidminderman3179
      @davidminderman3179 2 years ago +1

      @slimhazard Thanks for the comment! Sorry for the late response. I just saw your reply (YouTube doesn't flag comments to comments as needing replies, so I only stumbled on it by chance). I'll look at __wfi() next time I'm working with that breadboard. It makes sense that there would be less tearing if I could turn off the processor. However, eventually the processor will be needed full time when it is calculating each scan line instead of just displaying it, so it might only be a short-term fix.

  • @Marc_Wolfe
    @Marc_Wolfe 1 year ago

    I wonder if part of the tearing could be clock stretching. Probably not, I'd expect that to only ever cause the tiniest shift. Probably just DMA.

    • @LifewithDavid1
      @LifewithDavid1  1 year ago +1

      I don't think so. I examine clock stretching in my PIO episodes 15 and 16 ( ua-cam.com/video/Yui6NjrU23c/v-deo.html & ua-cam.com/video/8ByDgh5-O2U/v-deo.html ). The video stream runs at full system clock speed, so there aren't any fractional divisors to "stutter step" the clock. I think the master bus is getting in the way. I've learned a lot more about interrupts recently and I might take another look.

  • @danman32
    @danman32 11 months ago

    Great stuff! I only came across your site last night and learned a TON! So much more to absorb.
    The Pico may be unconventional and may not have the biggest set of resources, but it has a lot of features like the PIO that make up for it.
    I hope and wish you'll figure out what is causing the horizontal line tearing, since whatever the cause and remedy is could help us out with other time-sensitive applications.
    As I pointed out in comments on your other videos, I am trying to develop a 16x300 Neopixel graphics array with the 16 rows driven by separate channels, demuxed by use of a 74HC595 serial-to-parallel shift register. The tricky part is to grab 1 bit from each of the 16 rows and send them to the GPIO driving the 74HC595.
    Even if I use 16 GPIOs to handle 16 channels for the Neopixel array, I still have to get the single bits that are many bytes apart.

    • @LifewithDavid1
      @LifewithDavid1  11 months ago

      I'm trying to think how to use PIO to grab 16 independent bits and then shift them out as a single 16 bit word. I think there are a lot of options, especially if you aren't afraid of solder.

    • @danman32
      @danman32 11 months ago

      @LifewithDavid1 Do you mean have the PIO grab a 16 bit word and put each of the bits on different GPIO pins as a parallel output?
      I believe that's what an instruction out pin, 16 does if you have the PIO assigned to 16 consecutive GPIO pins. I probably don't have the instruction written right, but hopefully you'll get what I am trying to convey. The out instruction would be pulling 16 bits from the OSR, placing them into 16 consecutive GPIOs as if the GPIOs were a 16 bit register.
      You could use a 74HC589, which is an 8-bit parallel-in, serial-out IC, complementary to the 74HC595 (8-bit serial-in) that I plan to use.
      In my case, I want to drive 16 rather long strands of Neopixels as a 16x300 matrix using only one GPIO pin. The problem I have is that sequential bytes in the pixel map are sequential pixels in a row/strand. Somehow I need to take 1 bit at a time from each row, put the bits together sequentially (multiplex them), and send them out the GPIO serially so that the 74HC595 can demux them to each strand. One method is to transpose the bits of the 300-column by 16-row pixel matrix into 16 columns by 300 rows before sending them out via the PIO during a frame. There is an Adafruit article where someone was able to do that in CircuitPython for 8 rows, using the SDK transpose function for speed. Unfortunately that function only supports 8 bits; I need 16.
      Hopefully a function written in C will be fast enough. Better yet, if the bit-level transpose can be done for each set of bits sent out the GPIO to the 74HC595, that would be even better.
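
The bit-level transpose described here can be sketched in plain C. This is a minimal, unoptimized illustration, not the Adafruit or SDK routine mentioned above; the function name and the choice to map row r to output bit r are assumptions. It takes one byte from the same position in each of the 16 row buffers and produces 8 words, one per bit position, most significant bit first.

```c
#include <stdint.h>

#define ROWS 16

/*
 * in[r]  : the current byte from row/strand r (r = 0..15).
 * out[k] : 16-bit word holding bit (7 - k) of every row's byte,
 *          with row r stored at bit position r of the word.
 * out[0] therefore carries the MSBs, which Neopixels expect first.
 */
static void transpose_16x8(const uint8_t in[ROWS], uint16_t out[8])
{
    for (int k = 0; k < 8; k++) {
        uint16_t word = 0;
        for (int r = 0; r < ROWS; r++) {
            uint16_t bit = (uint16_t)((in[r] >> (7 - k)) & 1u);
            word |= (uint16_t)(bit << r);
        }
        out[k] = word;
    }
}
```

Calling it three times per pixel column (once per colour byte) yields the 24 words for that column, which keeps the intermediate buffer small. A faster version would likely replace the inner loop with shift-and-mask tricks, but the data layout would stay the same.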

    • @LifewithDavid1
      @LifewithDavid1  11 months ago

      @danman32 I haven't fully thought this through, but could you divide it into 8 x 2, where you use all 8 OSRs but alternate bits from each row? For instance, take rows 1 and 2 (of 16). Create a word in the main core that intermixes the bits like (row 1-bit 1; row 2-bit 1; row 1-bit 2; row 2-bit 2; row 1-bit 3; row 2-bit 3; row 1-bit 4; row 2-bit 4; etc). Then you do the same for rows 3 and 4, 5 and 6, 7 and 8, etc. Now you have 8 interspersed words. Then feed them into the 8 separate PIO state machines. Synchronize them using interrupts and each PIO state machine could output to two GPIOs for a total of 16 pins. I think intermixing two rows would be quicker than intermixing 16. That's probably not the solution, but the concept of ganging together all 8 PIO state machines might make the problem more manageable. Interesting problem.
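
The two-row intermixing suggested here could look like the following on the CPU side. It is a straightforward bit-by-bit sketch with a made-up function name; it assumes "bit 1" means the most significant bit, so the output reads row 1-bit 1, row 2-bit 1, row 1-bit 2, row 2-bit 2, and so on from the top of the 32-bit word.

```c
#include <stdint.h>

/*
 * Interleave 16 bits from each of two rows into one 32-bit word.
 * Output bit 31 = row1 bit 15, bit 30 = row2 bit 15, bit 29 = row1 bit 14, ...
 */
static uint32_t interleave_rows(uint16_t row1, uint16_t row2)
{
    uint32_t out = 0;
    for (int i = 15; i >= 0; i--) {
        out = (out << 1) | (uint32_t)((row1 >> i) & 1u);
        out = (out << 1) | (uint32_t)((row2 >> i) & 1u);
    }
    return out;
}
```

For example, `interleave_rows(0xFFFF, 0x0000)` returns `0xAAAAAAAA`. A state machine that shifts two bits at a time from the most significant end onto two adjacent pins would then present one bit per row on each step.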

    • @danman32
      @danman32 11 months ago

      @LifewithDavid1 I thought of using 16 GPIOs rather than 1 GPIO and demuxing with the 74HC595, but placing the 74HC595 near the Neopixel array reduces the number of wires I need between the matrix and the Pico.
      The stream of 24 bits to a neopixel, with a "start" bit and "Stop" bit surrounding the data bit cannot be paused by more than 9uS, otherwise the strand will think it has been fully populated and will display what it has been sent. Each start, data and stop bit package has to be about 1.25uS. There is some tolerance to the 1.25uS specification though.
      Even in your idea, you still have the problem of pulling bits from 2 separate streams of bits and putting them together.
      R1b1R2b1R1b2R2b2R1b3R2b3. etc where you have in each row 24x300 bits (24 bits per pixel, 300 pixels).
      I believe the Adafruit library provides a means to address the entire pixel matrix as a 2 dimensional array, so that might be an approach, and only transpose 3 bytes (24 pixel bits) at a time per row. If that can be accomplished in significantly less than 9uS from when i last transmitted the bits to the matrix, that could work. Not clean, as ideally you want the pixel bits steadily streamed, but probably would work. I'd be using a much smaller intermediate buffer too. I would just need to be sure I kept track of where I was in the main buffer.

    • @LifewithDavid1
      @LifewithDavid1  11 months ago

      @danman32 I haven't done any experimentation with Neopixels (I probably should), so I can't wrap my head around it very quickly. However, you can do a lot in 9 uS (1125 clock cycles) in assembly language. There are eight 32-bit registers that have full arithmetic, logic, shifting, and indexed memory ops functionality. By taking the words 8 at a time you should be able to slice and dice them however you want (because they are in the registers, each op only takes 1 clock cycle). Plus you have a whole other core with the same capability that can run in parallel. Jam the transposed words into one or more PIO state machines using DMA (no core time needed) and it might work. You could even have one core just feeding a graphics memory, and the other just pulling from the memory and popping it into PIO. However, until I actually work with Neopixels, I can't give you great advice; I'm sorry. However, it sounds cool; good luck!
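
The 1125-cycle figure follows from the 9 uS limit and a 125 MHz system clock, which is assumed here as the RP2040 default. The helper below is a hypothetical convenience function for converting a time window in nanoseconds into whole clock cycles, rounding down.

```c
#include <stdint.h>

/* Whole system-clock cycles that fit in `ns` nanoseconds (rounded down). */
static uint32_t cycles_in_ns(uint32_t sys_clk_hz, uint32_t ns)
{
    return (uint32_t)(((uint64_t)sys_clk_hz * ns) / 1000000000ull);
}
```

At 125 MHz, `cycles_in_ns(125000000u, 9000u)` gives 1125 for the 9 uS gap, and `cycles_in_ns(125000000u, 1250u)` gives 156 for one 1.25 uS Neopixel bit period.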

  • @Malcrom1967
    @Malcrom1967 1 year ago

    As a retired Telecom worker I haven't seen that cable colour code in a long time lol

  • @danman32
    @danman32 11 months ago

    By the way, if you want to see "crappy" VGA, check out Ben Eater's "Lets build the world's worst VGA card" built with discrete TTL logic.
    The signal is 800x600 @ 60Hz using a 10 MHz crystal for simplicity. The actual resolution is 200x600, but because of addressing limitations in the EEPROM, and to make the aspect ratio more reasonable, the image resolution goes down to 100x75.
    Now THAT's crappy. Very interesting and educational though. I believe I was able to bring the image up to 200x128 with addressing changes to the EEPROM.
    The EEPROM response time wasn't adequate, so there were jail-bars.

    • @LifewithDavid1
      @LifewithDavid1  11 months ago

      I think Ben's is better than mine; at least he doesn't have delayed horizontal scans. However, since I've done bare metal stuff recently, I think I could do better now. Thanks for the comment!

  • @Marc_Wolfe
    @Marc_Wolfe 1 year ago

    Could probably waste a bunch of PWM channels on a Teensy 4.0 (600MHz) to make it happen. I might try on something.

    • @LifewithDavid1
      @LifewithDavid1  1 year ago

      It sounds like it has the speed. It would be a cool project. You might have to get down to machine language programming to output the data in a controlled manner. That's where the PIO shines. Good luck!

    • @Marc_Wolfe
      @Marc_Wolfe 1 year ago

      @LifewithDavid1 Yeah, it'd be clock dividers and top values to set the right pulse timings, then write a duty cycle register high or low for pixels... something like that. I'll bother making it less half-baked when I actually get the part.

    • @LifewithDavid1
      @LifewithDavid1  1 year ago

      @Marc_Wolfe Sounds good! Should be fun!