Is the execution time of the 'transferIn()' function about 0.16s? - webusb

I receive 2048 bytes from a single transferIn() call as a bulk transfer.
Each call takes about 0.16 s to complete.
That means it takes more than 80 s to receive 1 MB.
What should I do to speed this up in my JavaScript code using WebUSB? Or is there no way?

In addition to the time it actually takes for the data to be transferred, a single call to transferIn() has to do a lot of work to set up the host to receive data from the device. Even assuming zero delay introduced by the web browser and operating system, the USB bus only provides transfer opportunities every 1 ms (for full-speed devices) or 125 µs (for high-speed devices). The tricks for increasing your data transfer rate are:
* Submit transferIn() calls with buffers much larger than the endpoint's packet size. This trades latency for throughput: the transfer won't complete until the buffer is full or a short packet is received, but the host controller won't waste time waiting for the operating system to ask it to ask the device for more data.
* Submit multiple transferIn() calls in parallel. This adds more overhead but solves the latency problem by reporting transfer completion at a finer granularity. This technique is commonly used for endpoints that deliver events. Keeping at least two transfers in flight at once guarantees that the next event is delivered immediately, rather than having to wait until a new transfer request is set up after the first event is processed.
This advice also applies to transferOut().
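For a concrete picture of what "several large transfers in flight at once" looks like, here is a minimal sketch of the same strategy expressed with libusb's asynchronous C API. WebUSB itself is JavaScript-only, so treat this purely as an illustration of the pattern rather than code you can drop in; the vendor/product IDs, the IN endpoint address 0x81, and the 16 KiB buffer size are all assumptions.

#include <libusb-1.0/libusb.h>
#include <stdio.h>

#define NUM_TRANSFERS 4           /* keep several IN transfers queued at once */
#define TRANSFER_SIZE (16 * 1024) /* much larger than the 2048-byte reads in the question */

static void transfer_done(struct libusb_transfer *xfer)
{
        if (xfer->status == LIBUSB_TRANSFER_COMPLETED) {
                /* consume xfer->buffer[0 .. xfer->actual_length) here */
                libusb_submit_transfer(xfer);  /* resubmit immediately to keep the pipe full */
        }
        /* a real program would also handle errors and cancellation here */
}

int main(void)
{
        libusb_context *ctx;
        libusb_init(&ctx);

        /* hypothetical vendor/product IDs - replace with your device's */
        libusb_device_handle *dev = libusb_open_device_with_vid_pid(ctx, 0x1234, 0x5678);
        if (!dev) { fprintf(stderr, "device not found\n"); return 1; }
        libusb_claim_interface(dev, 0);

        static unsigned char bufs[NUM_TRANSFERS][TRANSFER_SIZE];
        for (int i = 0; i < NUM_TRANSFERS; i++) {
                struct libusb_transfer *xfer = libusb_alloc_transfer(0);
                /* 0x81 = bulk IN endpoint 1 (an assumption) */
                libusb_fill_bulk_transfer(xfer, dev, 0x81, bufs[i], TRANSFER_SIZE,
                                          transfer_done, NULL, 0);
                libusb_submit_transfer(xfer);
        }

        for (;;)
                libusb_handle_events(ctx);  /* completion callbacks fire from here */
}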

Related

What is the main difference between RSS, RPS and RFS?

As described in https://www.kernel.org/doc/Documentation/networking/scaling.txt, there are:
RSS: Receive Side Scaling
RPS: Receive Packet Steering
RFS: Receive Flow Steering
Does it mean that:
RSS - allows us to use many CPU cores to process soft-irqs from the Ethernet device (one CPU core per Ethernet queue);
RPS - allows us to process the soft-irqs for all packets of one and the same connection on one and the same CPU core;
RFS - allows us to process the soft-irqs for all packets of one and the same connection on the CPU core on which our application's thread processes that connection?
Is that correct?
Quotes are from https://www.kernel.org/doc/Documentation/networking/scaling.txt.
RSS: Receive Side Scaling - is implemented in hardware and hashes some bytes of each packet (a "hash function over the network and/or transport layer headers-- for example, a 4-tuple hash over IP addresses and TCP ports of a packet"). Implementations differ; some may not hash the most useful bytes or may be limited in other ways. This filtering and queue distribution is fast (only a few additional cycles are needed in hardware to classify a packet), but it is not portable between some network cards, can't be used with tunneled packets or some rare protocols, and sometimes your hardware doesn't support enough queues to give you one queue per logical CPU core.
RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length.
Receive Packet Steering (RPS) "is logically a software implementation of
RSS. Being in software, it is necessarily called later in the datapath." So this is a software alternative to hardware RSS (it still parses some bytes to hash them into a queue id), for when you have hardware without RSS, want to classify on more complex rules than the hardware can, or have a protocol that the hardware RSS classifier can't parse. The downside is that RPS uses more CPU resources and adds inter-CPU traffic.
RPS has some advantages over RSS: 1) it can be used with any NIC,
2) software filters can easily be added to hash over new protocols,
3) it does not increase hardware device interrupt rate (although it does
introduce inter-processor interrupts (IPIs)).
RFS: Receive Flow Steering is like RPS (a software mechanism with more CPU overhead), but instead of hashing into a pseudo-random queue id it takes "into account application locality" (so packet processing will probably be faster due to better cache locality). Queues are chosen to be local to the thread that will consume the received data, and packets are delivered to the correct CPU core.
The goal of RFS is to increase datacache hitrate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU. ... In RFS, packets are not forwarded directly by the value of their hash,
but the hash is used as index into a flow lookup table. This table maps
flows to the CPUs where those flows are being processed.
Accelerated RFS - RFS with hardware support. (Check whether your network driver implements ndo_rx_flow_steer.) "Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load balancing mechanism that uses soft state to steer flows based on where the application thread consuming the packets of each flow is running."
There is a similar method for packet transmission (the packet is already generated and ready to be sent; the question is just which queue to send it on - which also makes post-processing such as freeing the skb easier):
XPS: Transmit Packet Steering: "a mapping from CPU to hardware queue(s) is
recorded. The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set"
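For completeness, RSS is typically configured through the NIC driver (e.g. via ethtool), while RPS and RFS are enabled through sysfs and procfs. Below is a small hedged sketch in C that turns RPS and RFS on for one receive queue; the interface name eth0, the CPU mask "f" (CPUs 0-3) and the flow-table sizes are assumptions to adjust for your system, and the writes need root. The same thing is usually done with a couple of echo commands from a shell.

#include <stdio.h>

/* Write a string to a sysfs/procfs file; returns 0 on success. */
static int write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%s\n", val);
        fclose(f);
        return 0;
}

int main(void)
{
        /* RPS: allow CPUs 0-3 (hex mask 0xf) to process packets from eth0's rx-0 queue. */
        write_str("/sys/class/net/eth0/queues/rx-0/rps_cpus", "f");

        /* RFS: size of the global socket flow table, plus this queue's share of it. */
        write_str("/proc/sys/net/core/rps_sock_flow_entries", "32768");
        write_str("/sys/class/net/eth0/queues/rx-0/rps_flow_cnt", "2048");

        return 0;
}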
osgx's answer covers the main differences, but it is important to point out that it is also possible to use RSS and RPS in unison.
RSS controls the selected HW queue for receiving a stream of packets. Once certain conditions are met, an interrupt is issued to the SW. The interrupt handler, which is defined by the NIC's driver, is the SW starting point for processing received packets. The code there polls the packets from the relevant receive queue, may perform initial processing, and then passes the packets up for higher-level protocol processing.
At this point the RPS mechanism may be used, if configured. The driver calls netif_receive_skb(), which (eventually) checks for an RPS configuration. If one exists, it enqueues the SKB for continued processing on the selected CPU:
int netif_receive_skb(struct sk_buff *skb)
{
        ...
        return netif_receive_skb_internal(skb);
}

static int netif_receive_skb_internal(struct sk_buff *skb)
{
        ...
        int cpu = get_rps_cpu(skb->dev, skb, &rflow);

        if (cpu >= 0) {
                ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
                rcu_read_unlock();
                return ret;
        }
        ...
}
In some scenarios, it would be smart to use RSS and RPS together in order to avoid CPU utilization bottlenecks on the receiving side. A good example is IPoIB (IP over Infiniband). Without diving into too many details, IPoIB has a mode which can only open a single channel. This means all the incoming traffic would be handled by a single core. By properly configuring RPS, some of the processing load can be shared by multiple cores, which dramatically improves performance for this scenario.
Since transmitting was mentioned, it is worth noting that packet transmission that results from the receiving process (ACKs, forwarding) will be processed on the same core selected by netif_receive_skb().
Hope this helps.

Latency in ioread

Suppose you have a PCIe device presenting a single BAR and one DMA area declared with pci_alloc_consistent(..). The BAR's flags indicate a non-prefetchable, non-cacheable memory region.
What are the principle causes for latency in reading the DMA area, and similarly, what are the causes of latency reading the BAR?
Thank you for answering this simple question :D!
This smells a bit like homework but I suspect the concepts are not well understood by many so I'll add an answer.
The best way to think through this is to consider what needs to happen in order for a read to complete. The CPU and the device are on separate sides of the PCIe link. It's helpful to view PCI-Express as a mini network. Each link is point-to-point (like your PC connected to another PC). There may also be intermediate switches (aka bridges in PCI). In that case, it's like your PC is connected to a switch that is in turn connected to the other PC.
So, if the CPU wants to read its own memory (the "DMA" region you allocated), it's relatively fast. It has a high speed bus that is designed to make that happen fast. Also, there are multiple layers of caching built in to keep frequently (or recently) used data "close" to the CPU.
But if the CPU wants to read from the BAR in the device, the CPU (actually the PCIe root complex integrated with the CPU) must compose a PCIe read request, send the request, and wait while the device decodes the request, accesses the BAR location and sends back the requested data. Tick tock. Your CPU is doing nothing else while it waits for this to complete.
This is pretty much analogous to asking for a web page from another computer. You formulate an HTTP request, send it and wait while the web server accesses the content, formulates a return packet and sends it to you.
If the device wishes to access memory residing "in" the CPU, it's pretty much the exact same thing in reverse. ("Direct memory access" just means that it doesn't need to interrupt the CPU to handle it, but something [the root complex here] is still responsible for decoding the request, fulfilling the read and sending back the resulting data.)
Also, if there are intermediate PCIe switches between CPU and device, those may add additional buffering/queuing delays (exactly as a switch or router might in a network). And any such delays are doubled since they're incurred in both directions.
Of course PCIe is very fast, so all of that happens in mere nanoseconds, but that's still orders of magnitude slower than a "local" read.
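If you want to put rough numbers on that difference yourself, one hedged way is to mmap() the device's memory BAR through its sysfs resource file and time uncached reads against ordinary RAM. This is only a sketch: the sysfs path is a placeholder for your device's bus address, it assumes the BAR is at least one page and that offset 0 is a side-effect-free register, and it needs root.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

static uint64_t now_ns(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
        /* Placeholder path: substitute your device's bus address. */
        int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        /* Map the first page of the BAR; assumes a memory (not I/O) BAR. */
        volatile uint32_t *bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);
        if (bar == MAP_FAILED) { perror("mmap"); return 1; }

        volatile uint32_t ram = 0;      /* ordinary cacheable memory for comparison */
        enum { N = 100000 };
        uint32_t sink = 0;

        uint64_t t0 = now_ns();
        for (int i = 0; i < N; i++) sink += ram;     /* local reads */
        uint64_t t1 = now_ns();
        for (int i = 0; i < N; i++) sink += bar[0];  /* uncached MMIO reads across PCIe */
        uint64_t t2 = now_ns();

        printf("RAM read: ~%lu ns each\nBAR read: ~%lu ns each (sink=%u)\n",
               (unsigned long)((t1 - t0) / N), (unsigned long)((t2 - t1) / N), sink);
        munmap((void *)bar, 4096);
        close(fd);
        return 0;
}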

What is the smallest (manageable) number of samples I can give to a PCM buffer?

Some APIs, like this one, can create a PCM buffer from an array of samples (represented by a number).
Say I want to generate and play some audio in (near) real time. I could generate a PCM buffer with 100 samples and send them off to the sound card, using my magic API functions. As those 100 samples are playing, 100 more samples are generated and then the buffers are switched. Finally, I can repeat the writing / playing / switching process to create a constant stream of audio.
Now, for my question. What is the smallest buffer size I can use with the write / play / switch approach without a perceivable pause in the audio stream occurring? I understand the answer here will depend on sample rate, processor speed, and transfer time to the sound card - so please provide a "rule of thumb" style answer if that's more appropriate!
(I'm a bit new to audio stuff, so please feel free to point out any misconceptions I might have!)
TL;DR: 1 ms buffers are easily achievable on desktop operating systems if care is taken, although they might not be desirable from a performance and energy-usage perspective.
The lower limit on buffer size (and thus output latency) is set by the worst-case scheduling latency of your operating system.
The sequence of events is:
1. The audio hardware progressively outputs samples from its buffer.
2. At some point, it reaches a low-water mark and generates an interrupt, signalling that the buffer needs replenishing with more samples.
3. The operating system services the interrupt and marks the thread as ready to run.
4. The operating system schedules the thread to run on a CPU.
5. The thread computes, or otherwise obtains, samples and writes them into the output buffer.
The scheduling latency is the time between steps 2 and 4 above, and it is dictated largely by the design of the host operating system. If you use a hard RTOS such as VxWorks or eCos with pre-emptive priority scheduling, the worst case can be on the order of fractions of a microsecond.
General-purpose desktop operating systems are generally less slick. MacOSX supports real-time user-space scheduling and is easily capable of servicing 1 ms buffers. The Linux kernel can be configured for pre-emptive real-time threads and for bottom-half interrupt handlers run in kernel threads. You ought to be able to achieve 1 ms buffer sizes there too. I can't comment on the capabilities of recent versions of the NT kernel.
It's also possible to take a (usually bad) latency hit in step 5 - when your process fills the buffer, if it takes a page fault. The usual practice is to obtain all of the heap and stack memory you require up front and mlock() it, along with your program code and data, into physical memory.
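A minimal sketch of that practice on Linux follows; the working-set size is an arbitrary assumption and error handling is reduced to the bare minimum.

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define AUDIO_BUF_BYTES (1 << 20)   /* arbitrary working-set size for this sketch */

int main(void)
{
        /* Lock everything mapped now and in the future into physical RAM,
           so the real-time path can never take a page fault. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
                return 1;   /* needs CAP_IPC_LOCK or a suitable RLIMIT_MEMLOCK */

        /* Allocate the audio working set up front and touch every page once
           so it is actually faulted in before real-time processing starts. */
        unsigned char *buf = malloc(AUDIO_BUF_BYTES);
        if (!buf)
                return 1;
        memset(buf, 0, AUDIO_BUF_BYTES);

        /* ... the real-time audio loop would run here, reusing buf ... */
        free(buf);
        return 0;
}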
Absolutely forget about achieving low latency in an interpreted or JITed language run-time. You have far too little control over what the language run-time is doing, and no realistic prospect of preventing page faults (e.g. from memory allocation). I suspect 10 ms is pushing your luck in these cases.
It's worth noting that rendering short buffers has a significant impact on system performance (and energy consumption) due to the high rate of interrupts and context switches. These destroy L1 cache locality in a way that's disproportionate to the work they actually do.
While 1ms audio buffers are possible they are not necessarily desirable. The tick rate of modern Windows, for example, is between 10ms and 30ms. That means that, usually at the audio driver end, you need to keep a ring buffer of a bunch of those 1 ms packets to deal with buffer starvation conditions, in case the CPU gets pulled out from under you by some other thread.
All modern audio engines use powers of 2 for their audio buffer sizes. Start with 256 samples per frame and see how that works for you. Every piece of hardware is different, and you can't depend on how a Mac or PC gives you time slices. The user might be calculating pi on some other process while you are running your audio program. It's safer to leave the buffer size large, like 2048 samples per frame, and let the user turn it down if the latency bothers them.
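To put numbers on those sizes: the latency contributed by one buffer is just the frame count divided by the sample rate. A trivial sketch of the arithmetic:

#include <stdio.h>

/* Duration of one audio buffer in milliseconds. */
static double buffer_ms(unsigned frames, unsigned sample_rate)
{
        return 1000.0 * frames / sample_rate;
}

int main(void)
{
        printf("256 frames  @ 44.1 kHz = %.1f ms\n", buffer_ms(256, 44100));   /* ~5.8 ms  */
        printf("2048 frames @ 44.1 kHz = %.1f ms\n", buffer_ms(2048, 44100));  /* ~46.4 ms */
        printf("48 frames   @ 48 kHz   = %.1f ms\n", buffer_ms(48, 48000));    /* 1 ms     */
        return 0;
}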

linux: smart fsync()?

I'm recording audio and writing it to an SD card; the data rate is around 1.5 MB/s. I'm using a class 4 SD card with an ext4 file system.
After a certain interval, the kernel automatically syncs the files. The downside of this is that my application's buffers pile up waiting to be written to disk.
I think that if the kernel synced more frequently than it does now, it might solve the issue.
I used fsync() in the application to sync at certain intervals. But this does not solve the problem, because sometimes the kernel has synced just before the application called fsync(), so the fsync() call from the application was a waste of time.
I need a syncing mechanism (say, smart_fsync()) such that when the application calls smart_fsync(), the kernel syncs only if it has not synced in a while, and otherwise just returns.
Since there is no such function as smart_fsync(), what would be a possible workaround?
The first question to ask is, what exactly is the problem you're experiencing? The kernel will flush dirty (unwritten cached) buffers periodically - this is because doing so tends to be faster than flushing synchronously (less latency hit for applications). The downside is that this means a larger latency hit if you reach the kernel's limit on dirty data (and potentially more data loss after an unclean shutdown).
If you want to ensure that the data hits disk ASAP, then you should simply open the file with the O_SYNC option. This will flush the data to disk immediately upon write(). Of course, this implies a significant performance penalty, but on the other hand you have complete control over when the data is flushed.
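A hedged sketch of what that looks like; the path and the data are placeholders:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        /* With O_SYNC, each write() does not return until the data has
           reached the storage device. */
        int fd = open("/mnt/sdcard/recording.raw", O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0)
                return 1;

        char buf[64 * 1024] = {0};              /* placeholder audio data */
        if (write(fd, buf, sizeof buf) < 0)     /* blocks until the block device has it */
                return 1;

        close(fd);
        return 0;
}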
If you are experiencing drops in throughput while the syncing is going on, most likely you are attempting to write faster than the disk can support, and reaching the dirty page memory limit. Unfortunately, this would mean the hardware is simply not up to the write rate you are attempting to push at it - you'll need to write slower, or buffer the data up on faster media (or add more RAM!).
Note also that your 'smart fsync' is exactly what the kernel implements - it will flush pages when one of the following is true:
* There is too much dirty data in memory. Triggers asynchronously (without blocking writes) when the total amount of dirty data exceeds /proc/sys/vm/dirty_background_bytes, or when the percentage of total memory exceeds /proc/sys/vm/dirty_background_ratio. Triggers synchronously (blocking your application's write() for an extended time) when the total amount of data exceeds /proc/sys/vm/dirty_bytes, or the percentage of total memory exceeds /proc/sys/vm/dirty_ratio.
* Dirty data has been pending in memory for too long. The pdflush daemon checks for old dirty blocks every /proc/sys/vm/dirty_writeback_centisecs centiseconds (1/100 seconds), and will expire blocks if they have been in memory for longer than /proc/sys/vm/dirty_expire_centisecs.
It's possible that tuning these parameters might help a bit, but you're probably better off figuring out why the defaults aren't keeping up as is.
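That said, if you still want something shaped like the smart_fsync() from the question, about the best a userspace process can do is rate-limit its own fsync() calls; it has no cheap way of knowing when the kernel last wrote its pages back. A minimal sketch, with the interval as an assumption you would tune:

#include <time.h>
#include <unistd.h>

#define SYNC_INTERVAL_SEC 2   /* assumed minimum gap between explicit syncs */

/* Call as often as you like; only issues a real fsync() if the last one was
 * more than SYNC_INTERVAL_SEC ago.  Returns fsync()'s result, or 0 when the
 * call was skipped.  (One timestamp for the whole process, not per fd, and
 * not thread-safe - a real version would fix both.) */
static int smart_fsync(int fd)
{
        static time_t last_sync;        /* 0 on first use */
        time_t now = time(NULL);

        if (last_sync != 0 && now - last_sync < SYNC_INTERVAL_SEC)
                return 0;               /* synced recently enough, do nothing */

        last_sync = now;
        return fsync(fd);
}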

udp send from driver

I have a driver that needs to:
receive data from an FPGA
DMA data to another device (DSP) for encoding
send the encoded data via UDP to an external host
The original plan was to have the application handle step 3, but the application doesn't get the processor in time to process the data before the next set of data arrives from the FPGA.
Is there a way to force the scheduler (from the driver) to run my application?
If not, I think work queues are likely the solution I need to use, but I'm not sure how/where to call into the network stack/driver to accomplish the UDP transfers from the work queues.
Any ideas?
You should try to discover why the application "can't get the data fast enough".
Your memory bandwidth is probably vastly superior to typical Ethernet bandwidth, so copying the data from the driver to the application should not be the bottleneck.
If the UDP link is not fast enough from userspace, it won't be any faster from kernel space.
What you need to do is:
understand why your application is not fast enough, maybe by stracing it;
implement queuing in userspace.
You can probably split your application into two threads sharing a buffer list (see the sketch after this list):
thread A waits for the driver to have data available and puts it at the tail of the list;
thread B reads data from the head of the list and sends it over UDP. If for some reason thread B is busy waiting for a particular buffer to be sent, the FIFO fills up a bit, but as long as the UDP link bandwidth is larger than the rate of data coming from the DSP you should be fine.
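Here is a hedged sketch of that two-thread arrangement; the device node, destination address and buffer sizes are all placeholders, and error handling is minimal.

#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define BUF_SIZE  8192   /* assumed size of one encoded chunk */
#define QUEUE_LEN 64     /* FIFO depth between the two threads */

struct chunk { unsigned char data[BUF_SIZE]; size_t len; };

static struct chunk queue[QUEUE_LEN];
static int q_head, q_tail, q_count;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_notempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  q_notfull  = PTHREAD_COND_INITIALIZER;

/* Thread A: wait for the driver to have data and append it to the FIFO. */
static void *reader_thread(void *arg)
{
        (void)arg;
        int dev = open("/dev/mydsp", O_RDONLY);       /* placeholder device node */
        if (dev < 0) { perror("open"); exit(1); }
        for (;;) {
                struct chunk c;
                ssize_t n = read(dev, c.data, BUF_SIZE);  /* blocks until encoded data is ready */
                if (n <= 0) break;
                c.len = (size_t)n;

                pthread_mutex_lock(&q_lock);
                while (q_count == QUEUE_LEN)              /* FIFO full: wait for the sender */
                        pthread_cond_wait(&q_notfull, &q_lock);
                queue[q_tail] = c;                        /* copied by value for simplicity */
                q_tail = (q_tail + 1) % QUEUE_LEN;
                q_count++;
                pthread_cond_signal(&q_notempty);
                pthread_mutex_unlock(&q_lock);
        }
        return NULL;
}

/* Thread B: pop chunks from the FIFO and send them over UDP. */
static void *sender_thread(void *arg)
{
        (void)arg;
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(5000) };
        inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);   /* placeholder destination host */

        for (;;) {
                pthread_mutex_lock(&q_lock);
                while (q_count == 0)
                        pthread_cond_wait(&q_notempty, &q_lock);
                struct chunk c = queue[q_head];
                q_head = (q_head + 1) % QUEUE_LEN;
                q_count--;
                pthread_cond_signal(&q_notfull);
                pthread_mutex_unlock(&q_lock);

                sendto(sock, c.data, c.len, 0, (struct sockaddr *)&dst, sizeof dst);
        }
        return NULL;
}

int main(void)
{
        pthread_t a, b;
        pthread_create(&a, NULL, reader_thread, NULL);
        pthread_create(&b, NULL, sender_thread, NULL);
        pthread_join(a, NULL);
        return 0;
}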
Moving things into the kernel does not make them magically faster; it is just MUCH harder to code, debug, and trace.
