Using CPU-DMAC to transfer via USB cart to PC-side

mrkotfw

I'm using CPU-DMAC with a 1-byte stride, a fixed destination, and cycle-steal mode (is burst mode not supported?).

I'm seeing terrible performance. If I transfer 4096 bytes in chunks of 64 bytes, the PC doesn't receive all 4096 bytes. Before starting each 64-byte CPU-DMAC transfer, I poll the TXE bit.

I noticed that all 4096 bytes are sent when I add a busy-wait loop of 100 iterations.

Transferring a single byte per DMA call (4096 DMA calls) works flawlessly, as does writing byte per byte.
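For reference, the byte-per-byte path that works can be sketched roughly as below. This is only a sketch: `fifo_ops`, `tx_ready`, and `write_byte` are hypothetical hooks standing in for the real flag and FIFO register accesses (the TXE polarity is hidden behind `tx_ready`), and `fake_fifo` is a made-up stand-in so the loop can be exercised off-target.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware hooks; on the Saturn these would wrap the
 * cart's flag and FIFO registers. */
typedef struct {
    int  (*tx_ready)(void *ctx);              /* non-zero when a byte may be written */
    void (*write_byte)(void *ctx, uint8_t b); /* push one byte into the FIFO */
    void *ctx;
} fifo_ops;

/* Write len bytes, checking for space before every single byte. */
static void fifo_write_polled(const fifo_ops *ops, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        while (!ops->tx_ready(ops->ctx)) {
            /* spin until the FIFO reports space */
        }
        ops->write_byte(ops->ctx, buf[i]);
    }
}

/* Made-up FIFO model for off-target testing (values are illustrative). */
struct fake_fifo {
    size_t level;     /* bytes currently queued */
    size_t capacity;  /* maximum bytes the FIFO holds */
    size_t accepted;  /* bytes stored successfully */
    size_t dropped;   /* bytes written while full */
};

static int fake_ready(void *ctx)
{
    struct fake_fifo *f = (struct fake_fifo *)ctx;
    if (f->level >= f->capacity) {
        f->level -= 62; /* pretend the host drained one packet */
        return 0;
    }
    return 1;
}

static void fake_write(void *ctx, uint8_t b)
{
    struct fake_fifo *f = (struct fake_fifo *)ctx;
    (void)b;
    if (f->level < f->capacity) {
        f->level++;
        f->accepted++;
    } else {
        f->dropped++;
    }
}
```

Because space is checked before every byte, this loop cannot write into a full FIFO, which is the property a multi-byte DMA chunk gives up.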



I don't have much of an understanding of what to look for.

Is this just not possible?
Is there a known amount of time to wait per transfer?
Does this have to do with the SCU bus registers?
Should I just stick to writing byte per byte?
 
Setting SCU(ASR0) to 0x00000000 was the culprit.

Can anyone explain, or point to material that explains the concepts behind what the ASR0 generally does? It's a mystery to me, unfortunately.

Takes about 66.0ms to transfer 64KiB from the Saturn to host via CPU-DMAC.
 
Setting SCU(ASR0) to (9<<20) (9-cycle wait) seems to work. Anything lower than that prevents the Saturn from sending the full 64KiB.
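A small helper for building that value might look like the sketch below. The shift of 20 comes from the `(9<<20)` value above; the 4-bit field width and the names `ASR0_WAIT_SHIFT`, `ASR0_WAIT_MASK`, and `asr0_with_wait` are my assumptions, not taken from any library.

```c
#include <stdint.h>

/* Assumption: the wait count is a 4-bit field starting at bit 20,
 * matching the (9<<20) value that worked in testing. */
#define ASR0_WAIT_SHIFT 20u
#define ASR0_WAIT_MASK  (0xFu << ASR0_WAIT_SHIFT)

/* Return asr0 with only the wait-count field replaced. */
static uint32_t asr0_with_wait(uint32_t asr0, unsigned wait_cycles)
{
    uint32_t field = ((uint32_t)wait_cycles & 0xFu) << ASR0_WAIT_SHIFT;
    return (asr0 & ~ASR0_WAIT_MASK) | field;
}
```

`asr0_with_wait(0, 9)` gives `0x00900000`, the same as `(9<<20)`. The register write itself is left out on purpose, since the helper only computes the value.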
 
The register controls some access timings. My understanding is as follows:

- If the "previous read" bit is set, then after a read access the SCU will pre-emptively drive the address bits with the next address. If the next access is to that address, some time is saved since the address lines are already set up. Note that SCU erratum 9 forbids the use of this setting.
- The write/read pre-charge bits insert a one-cycle idle state after each write/read.
- If external wait is set, bus slaves can insert wait states by pulling the /AWAIT signal.
- Wait states should be self-explanatory. It's not explicitly documented in the SCU manual, at least, but I assume "burst access" means SCU DMA.
- Bus width should also be obvious, though I'm not sure how it is affected by erratum 32. Four read accesses may be generated if the bus width is set to 8 bits, but this should be tested on real hardware.
 
Thanks for the information. I'll continue testing, including the other bits.

So far, I've upped the transfer from 64KiB to 860KiB, and that takes about 700ms.
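To put the two measurements side by side, here is a quick throughput calculation (the function name is just something I made up for this post). It works out to roughly 970 KiB/s for the 64KiB in 66.0ms run and roughly 1229 KiB/s for the 860KiB in 700ms run.

```c
#include <stdint.h>

/* Average throughput in KiB/s for a transfer of `bytes` bytes
 * that took `milliseconds` ms. */
static double throughput_kib_per_s(uint32_t bytes, double milliseconds)
{
    double kib = (double)bytes / 1024.0;
    double seconds = milliseconds / 1000.0;
    return kib / seconds;
}
```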

Is there a 2-byte aligned address to write to the USB FIFO?
 
One thing to note about the FT245R is that even though it uses 64-byte USB packets in both directions, two bytes of each IN (device to PC) packet are reserved for status information. Try writing to the FIFO in multiples of 62 instead, to avoid overruns. It may also improve bus utilization, as writing 64 bytes to the FIFO will result in two packets being generated, with 62 and 2 bytes of payload, respectively.
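The packet arithmetic can be sketched as follows, assuming only the 62-byte payload figure given above. The helper names are mine and are not part of any FTDI API.

```c
#include <stddef.h>

/* 64-byte IN packet minus the 2 status bytes. */
#define FT245R_IN_PAYLOAD 62u

/* Number of IN packets needed to carry `bytes` bytes of payload. */
static size_t in_packets_needed(size_t bytes)
{
    return (bytes + FT245R_IN_PAYLOAD - 1u) / FT245R_IN_PAYLOAD;
}

/* Payload size of the final packet (0 if there is nothing to send). */
static size_t last_packet_payload(size_t bytes)
{
    if (bytes == 0u) {
        return 0u;
    }
    size_t rem = bytes % FT245R_IN_PAYLOAD;
    return rem != 0u ? rem : FT245R_IN_PAYLOAD;
}
```

For a 64-byte write this gives 2 packets with a 2-byte tail, while a 62-byte write fits in exactly one.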

Edit: OTOH, the device has a 256-byte transmit buffer. If you don't write in nice powers of two, it may be impossible to determine whether there is sufficient space in the FIFO.

Edit2: I'm starting to remember why I never implemented DMA to host transfers. The basic issue is that you have no control over when the chip sends data to the PC, so you have no actual knowledge of the state of the transmit FIFO. You can be in the middle of writing your 64-byte chunk when the host generates an IN transfer, so it's impossible to try to calculate how much space there is left.

When transferring from the host to the Saturn, you know that the device driver will send the data in as few USB packets as possible, so you know that if there's data in the receive FIFO it's either a full 64-byte packet, or the final remaining bytes. You just have to make sure the command and data are sent separately.

Edit3: The FT245R does have a latency timer to make it hold off sending short packets to the host, but that still leaves the issue of the FIFO draining in 62-byte chunks.
 
Very interesting.

So the consensus is that data could be lost? Or is this the reason for the hang, with the host reporting that not all of the data has been sent?

Aside from what you describe in the last edit of your previous post, I've tested chunk sizes of 62, 64, and 256. I haven't verified whether all the data has been transferred correctly.

All this work is mainly for implementing an output console for ssload. I guess that I could go the other way: send a single byte to notify the host that the text output buffer is full, pass the address and size, and have the buffer transfer be initiated by the host instead?
 
If you write 64 bytes to the FIFO every time TXE goes low, then it will overflow and data will be lost.

I implemented an output console in ftx. It's just a basic polling loop, but it works. On the Saturn side, you just write to the FIFO the same way as always. The default value of the latency timer is 16ms, which is plenty good for logging, but if need be it can be changed via an API call.

I don't quite understand your last question. If you really want to use DMA I suppose you could have the host read in 62-byte chunks, and then send an acknowledgement for each received packet. Whenever the Saturn reads a reply from the FIFO, it then knows there's room for another 62 bytes. With lots of data to send you could start by priming the transmit FIFO with 62*4 bytes and then try to keep it at that level (for performance you always want to have data ready to transmit every USB frame). However, the extra handshake would about halve the maximum achievable speed.
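The acknowledgement scheme above could be sketched like this. It is an untested-on-hardware sketch: `link_ops`, `send_chunk`, and `ack_available` are hypothetical hooks (on the Saturn, `send_chunk` would be the DMA or byte write and `ack_available` a read of the receive FIFO), and `fake_host` only exists so the logic can be run on a PC.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK  62u /* payload bytes per IN packet */
#define WINDOW 4u  /* prime the FIFO with 62*4 bytes */

/* Hypothetical link hooks. */
typedef struct {
    void (*send_chunk)(void *ctx, const uint8_t *data, size_t len);
    int  (*ack_available)(void *ctx); /* non-zero if one ack was consumed */
    void *ctx;
} link_ops;

/* Send buf while keeping at most WINDOW unacknowledged chunks queued.
 * Returns the number of chunks sent. */
static size_t send_with_acks(const link_ops *ops, const uint8_t *buf, size_t len)
{
    size_t sent = 0, in_flight = 0, chunks = 0;

    while (sent < len || in_flight > 0) {
        /* Top up the window while there is data and room. */
        while (in_flight < WINDOW && sent < len) {
            size_t n = (len - sent < CHUNK) ? (len - sent) : CHUNK;
            ops->send_chunk(ops->ctx, buf + sent, n);
            sent += n;
            in_flight++;
            chunks++;
        }
        /* Each ack from the host frees room for one more chunk. */
        if (ops->ack_available(ops->ctx)) {
            in_flight--;
        }
    }
    return chunks;
}

/* Made-up host model: acknowledges one outstanding chunk per poll. */
struct fake_host {
    size_t outstanding;     /* chunks sent but not yet acknowledged */
    size_t max_outstanding; /* high-water mark of outstanding */
    size_t bytes;           /* total bytes received */
};

static void fake_send(void *ctx, const uint8_t *data, size_t len)
{
    struct fake_host *h = (struct fake_host *)ctx;
    (void)data;
    h->bytes += len;
    h->outstanding++;
    if (h->outstanding > h->max_outstanding) {
        h->max_outstanding = h->outstanding;
    }
}

static int fake_ack(void *ctx)
{
    struct fake_host *h = (struct fake_host *)ctx;
    if (h->outstanding == 0) {
        return 0;
    }
    h->outstanding--;
    return 1;
}
```

The Saturn never has more than 62*4 bytes unacknowledged, so it never needs to guess the transmit FIFO level. The price is one host-to-Saturn reply per packet, which is the handshake overhead mentioned above.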
 
Thanks antime, I am using ftx in combination with my own transfer tool.

Disregard the last comment, I had the idea backwards.
 