How is PCM audio data processed on a PC?

I know that in a DSP, for example, samples are processed one by one. On a PC, as far as I know, the data is processed in blocks of samples. So do you feed in one block at a time and discard the old one, or is it processed through a FIFO queue or circular buffer? Does it depend on the hardware?

I've seen audio data edited on either a per-frame basis or a batch basis. I'm guessing that by "sample" you mean the individual PCM values that make up the signal. The method you choose depends on your trade-offs, for example whether organizing operations on a per-frame basis takes priority over maximizing throughput.
I've seen two ways in which audio is routed to the system for playback: either as a series of arrays copied into predefined locations (e.g., as done with OpenAL), or by writing arrays to a line that employs a blocking queue (e.g., Java's SourceDataLine.write() method).
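To make the OpenAL-style model concrete, here is a minimal C sketch of streaming PCM through a queue of buffers. The two-buffer count, the 100 ms mono 16-bit chunks at 44.1 kHz, and the fill_pcm() helper are illustrative assumptions, not requirements of the API:

#include <AL/al.h>
#include <AL/alc.h>

#define CHUNK_SAMPLES 4410              /* 100 ms of mono 16-bit PCM at 44.1 kHz */

static short pcm[CHUNK_SAMPLES];
/* fill_pcm() is a hypothetical helper that writes the next chunk of samples into dst */
extern void fill_pcm(short *dst, int n);

int main(void) {
    ALCdevice  *dev = alcOpenDevice(NULL);
    ALCcontext *ctx = alcCreateContext(dev, NULL);
    alcMakeContextCurrent(ctx);

    ALuint src, bufs[2];
    alGenSources(1, &src);
    alGenBuffers(2, bufs);

    /* Prime both buffers, hand them to the source, and start playback. */
    for (int i = 0; i < 2; i++) {
        fill_pcm(pcm, CHUNK_SAMPLES);
        alBufferData(bufs[i], AL_FORMAT_MONO16, pcm, sizeof pcm, 44100);
    }
    alSourceQueueBuffers(src, 2, bufs);
    alSourcePlay(src);

    for (;;) {
        ALint done = 0;
        alGetSourcei(src, AL_BUFFERS_PROCESSED, &done);
        while (done-- > 0) {            /* recycle each finished buffer */
            ALuint b;
            alSourceUnqueueBuffers(src, 1, &b);
            fill_pcm(pcm, CHUNK_SAMPLES);
            alBufferData(b, AL_FORMAT_MONO16, pcm, sizeof pcm, 44100);
            alSourceQueueBuffers(src, 1, &b);
        }
        /* in a real program, sleep or do other work here instead of spinning */
    }
}

The blocking-line model (e.g. SourceDataLine.write()) replaces this queue/unqueue bookkeeping with a single write call that blocks until the device has room for more samples.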

Writing data on memory at different clocks

I want to write data to a common memory from different clock domains. How can I do that?
I have one common memory block, and that memory block runs on a clock with frequency clk. Now I want to write data to that memory coming from different clock domains, i.e. clk1, clk2, clk3, clk4, etc. How do I do that?
I was thinking of using a FIFO for each clock domain, i.e. the 1st FIFO has input clock clk1 and output clock clk (same as the memory), the 2nd FIFO has input clock clk2 and output clock clk (same as the memory), and so on. But it seems that my design will grow too large if I use a large number of FIFOs. Kindly tell me the correct approach.
To pass data units (bytes, words, etc.) between clock domains, an asynchronous FIFO is the only safe solution. Note that the FIFO does not have to be deep, but in that case you may need flow control.
You may need flow control anyway, as you have many sources all accessing the same memory.
But it seems that my design will grow too large if I use a large number of FIFOs.
Then you have a design problem: your FPGA is not large enough to implement the solution you have chosen. So either move to a bigger FPGA or find a fundamentally different solution to your problem.
A RAM can be written by clock domain A and read by clock domain B on different clocks (dual-ported RAM):
http://www.asic-world.com/examples/verilog/ram_dp_ar_aw.html
Such a RAM should be wrapped by a controller so that it behaves as an asynchronous FIFO.
Many FPGAs have dedicated RAM components, for example:
Altera UFM, Xilinx BRAM, the Cypress Delta39K cluster memory block, etc.
Change your device if you have problems fitting a large FIFO.

Is the execution speed of the 'transferIn()' function about 0.16s?

It receives 2048 bytes from one transferIn() call as a bulk transfer.
Each call takes 0.16 s to execute.
This means it takes more than 80 s to get 1 MB.
What should I do to speed this up in my JavaScript code using WebUSB? Or is there no way?
In addition to the time it actually takes for the data to be transferred, a single call to transferIn() has to do a lot of work to set up the host to receive data from the device. Even assuming there were zero delay introduced by the web browser and operating system, the USB bus only provides transfer opportunities every 1 ms (for full-speed devices) or 125 µs (for high-speed devices). The tricks for increasing your data transfer rate are:
Submit transferIn() calls with buffers much larger than the endpoint's packet size. This trades latency for throughput. The transfer won't complete until the buffer is full or a short packet is received, but the host controller won't waste time waiting around for the operating system to ask it to ask the device for more data.
Submit multiple transferIn() calls in parallel. This adds some overhead but solves the latency problem by reporting transfer completions at a finer granularity. This technique is commonly used for endpoints that deliver events. Keeping at least two transfers in flight at once guarantees that the next event will be delivered immediately, rather than having to wait until a new transfer request is set up after the first event is processed.
This advice also applies to transferOut().
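For a sense of the numbers involved: 2048 bytes every 0.16 s is only about 12.8 kB/s, far below what even a full-speed bulk endpoint can sustain (on the order of 1 MB/s), so the per-call setup overhead, not the bus itself, is the bottleneck here. Larger buffers and multiple in-flight transfers attack exactly that overhead.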

What is the smallest number of (manageable) samples I can give to a PCM buffer?

Some APIs, like this one, can create a PCM buffer from an array of samples (represented by a number).
Say I want to generate and play some audio in (near) real time. I could generate a PCM buffer with 100 samples and send it off to the sound card, using my magic API functions. As those 100 samples are playing, 100 more samples are generated and then the buffers are switched. Finally, I can repeat the writing / playing / switching process to create a constant stream of audio.
Now, for my question. What is the smallest buffer size (in samples) I can use with the write / play / switch approach, without a perceivable pause in the audio stream occurring? I understand the answer here will depend on sample rate, processor speed, and transfer time to the sound card, so please provide a "rule of thumb" answer if that's more appropriate!
(I'm a bit new to audio stuff, so please feel free to point out any misconceptions I might have!)
TL;DR: 1 ms buffers are easily achievable on desktop operating systems if care is taken, but they might not be desirable from a performance and energy-usage perspective.
The lower limit on buffer size (and thus output latency) is set by the worst-case scheduling latency of your operating system.
The sequence of events is:
The audio hardware progressively outputs samples from its buffer
At some point, it reaches a low-water-mark and generates an interrupt, signalling that the buffer needs replenishing with more samples
The operating system services the interrupt and marks the thread as ready to run
The operating system schedules the thread to run on a CPU
The thread computes, or otherwise obtains samples, and writes them into the output buffer.
The scheduling latency is the time between steps 2 and 4 above, and is dictated largely by the design of the host operating system. If you use a hard RTOS such as VxWorks or eCos with pre-emptive priority scheduling, the worst case can be on the order of fractions of a microsecond.
General-purpose desktop operating systems are generally less slick. macOS supports real-time user-space scheduling and is easily capable of servicing 1 ms buffers. The Linux kernel can be configured for pre-emptive real-time threads, with bottom-half interrupt handlers handled by kernel threads. You ought to be able to achieve 1 ms buffer sizes there too. I can't comment on the capabilities of recent versions of the NT kernel.
It's also possible to take a (usually bad) latency hit in step 5, when your process fills the buffer, if it takes a page fault. The usual practice is to obtain all of the heap and stack memory you require up front and mlock() it, along with your program code and data, into physical memory.
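As a rough C sketch of that practice on a POSIX system (the SCHED_FIFO policy and the priority value 80 are illustrative assumptions, and real code would also pre-allocate and touch all of its buffers):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* Pin all current and future pages of the process into RAM so the
       audio thread never takes a page fault while filling a buffer. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");

    /* Request a real-time scheduling class for this thread (needs
       suitable privileges or rtprio limits on most systems). */
    struct sched_param sp = { .sched_priority = 80 };   /* illustrative value */
    int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (err != 0)
        fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));

    /* ... allocate and touch all heap/stack the render loop will need,
       then run the audio loop ... */
    return 0;
}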
Absolutely forget about achieving low latency in an interpreted or JITed language run-time. You have far too little control over what the language run-time is doing and no realistic prospect of preventing page faults (e.g. from memory allocation). I suspect 10 ms is pushing your luck in these cases.
It's worth noting that rendering short buffers has a significant impact on system performance (and energy consumption) due to the high rate of interrupts and context switches. These destroy L1 cache locality in a way that's disproportionate to the work they actually do.
While 1ms audio buffers are possible they are not necessarily desirable. The tick rate of modern Windows, for example, is between 10ms and 30ms. That means that, usually at the audio driver end, you need to keep a ring buffer of a bunch of those 1 ms packets to deal with buffer starvation conditions, in case the CPU gets pulled out from under you by some other thread.
All modern audio engines use powers of 2 for their audio buffer sizes. Start with 256 samples per frame and see how that works for you. Every piece of hardware is different, and you can't depend on how a Mac or PC gives you time slices. The user might be calculating pi on some other process while you are running your audio program. It's safer to leave the buffer size large, like 2048 samples per frame, and let the user turn it down if the latency bothers them.
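For a sense of scale: at a 44.1 kHz sample rate, a 1 ms buffer is only about 44 samples, a 256-sample buffer holds roughly 5.8 ms of audio, and a 2048-sample buffer roughly 46 ms.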

Queueing writes to file system on Linux?

On a very large SMP machine with many CPUs, scripts are run with tens of simultaneous jobs (fewer than the number of CPUs), like this:
some_program -in FIFO1 >OUTPUT1 2>s_p1.log </dev/null &
some_program -in FIFO2 >OUTPUT2 2>s_p2.log </dev/null &
...
some_program -in FIFO40 >OUTPUT40 2>s_p40.log </dev/null &
splitter_program -in real_input.dat -out FIFO1,FIFO2...FIFO40
The splitter reads the input data flat out and distributes it to the FIFOs in order. (Records 1,41,81... to FIFO1; 2,42,82 to FIFO2, etc.) The splitter has low overhead and can pretty much process data as fast as the file system can supply it.
Each some_program processes its stream and writes it to its output file. However, nothing controls the order in which the file system sees these writes. The writes are also very small, on the order of 10 bytes. The script "knows" that there are 40 streams here and that they could be buffered in 20M (or whatever) chunks, and then each chunk written to the file system sequentially. That is, queued writes should be used to maximize write speed to the disks. The OS, however, just sees a bunch of writes at about the same rate on each of the 40 streams.
What happens in practice during a run is that the subprocesses get a lot of CPU time (in top, >80%), then a flush process appears (~10% CPU), and all the others drop to low CPU (~1%), then it goes back to the higher rate. These pauses go on for several seconds at a time. The flush means that the writes are overwhelming the file caching. Also, I think the OS and/or the underlying RAID controller is probably bouncing the physical disk heads around erratically, which is reducing the ultimate write speed to the physical disks. That is just a guess though, since it is hard to say exactly what is happening, as there is a file cache (in a system with over 500 GB of RAM) and a RAID controller between the writes and the disks.
Is there a program or method around for controlling this sort of IO, forcing the file system writes to queue nicely to maximize write speed?
The "buffer" program is not going to help much here because, while it would accumulate an output stream into a big chunk, there wouldn't be an orderly queuing of the writes, so several could go out at the same time. If the data rates in the output streams were uncorrelated this would be less of a problem, but in some cases the data rate is exactly the same in all streams, which means the buffers would all fill at the same time. This would stall the entire tree until the last one was written, because any process that cannot write its output will not read its next input, and that would stall the splitter, as all I/O is synchronous. The buffers need to be emptied in a cyclical manner, preferably before any of them completely fill up, although that may not be avoidable when data output rate exceeds the file system write rate.
There are dozens of parameters for tuning the file system, some of those might help. The scheduler was changed from cfq to deadline because the system was locking up for minutes at a time with the former.
If the problem is sheer I/O bandwidth then buffering won't solve anything. In that case, you need to shrink the data or send it to a higher-bandwidth sink to improve and level your performance. One way to do that would be to reduce the number of parallel jobs, as #thatotherguy said.
If in fact the problem is with the number of distinct I/O operations rather than with the overall volume of data, however, then buffering might be a viable solution. I am unfamiliar with the buffer program you mentioned, but I suppose it does what its name suggests. I don't completely agree with your buffering comments, however:
The "buffer" program is not going to help much here because, while it would accumulate an output stream into a big chunk, there wouldn't be an orderly queuing of the writes, so several could go out at the same time.
You don't necessarily need big chunks. It would probably be ideal to chunk at the native block size of the file system, or a small integer multiple thereof. That might be, say, 4096- or 8192-byte chunks.
Moreover, I don't see why you think you have an "orderly queueing of writes" now, or why you're confident that such a thing is needed.
If the data rates in the output streams were uncorrelated this would be less of a problem, but in some cases the data rate is exactly the same in all streams, which means the buffers would all fill at the same time. This would stall the entire tree until the last one was written, because any process that cannot write its output will not read its next input, and that would stall the splitter, as all I/O is synchronous.
Your splitter is writing to FIFOs. Though it may do so serially, that is not "synchronous" in the sense that the data needs to be drained out the other end before the splitter can proceed -- at least, not if the writes do not exceed the size of the FIFOs' buffers. FIFO buffer capacity varies from system to system, adapts dynamically on some systems, and is configurable (e.g. via fcntl()) on some systems. The default buffer size on modern Linux is 64kB.
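On Linux specifically, a pipe or FIFO's capacity can be inspected and enlarged with fcntl(). A small sketch (the 1 MiB request is an arbitrary illustration and is capped by the fs.pipe-max-size sysctl):

#define _GNU_SOURCE            /* for F_GETPIPE_SZ / F_SETPIPE_SZ */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int p[2];
    if (pipe(p) == -1) { perror("pipe"); return 1; }

    /* Report the current kernel buffer size of the pipe. */
    printf("default capacity: %ld bytes\n", (long)fcntl(p[1], F_GETPIPE_SZ));

    /* Ask the kernel for a larger buffer (here ~1 MiB). */
    if (fcntl(p[1], F_SETPIPE_SZ, 1 << 20) == -1)
        perror("F_SETPIPE_SZ");
    else
        printf("new capacity: %ld bytes\n", (long)fcntl(p[1], F_GETPIPE_SZ));
    return 0;
}

The same calls work on a named FIFO opened with open().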
The buffers need to be emptied in a cyclical manner, preferably before any of them completely fill up, although that may not be avoidable when data output rate exceeds the file system write rate.
I think this is a problem that pretty much solves itself. If one of the buffers backs up enough to block the splitter, then that ensures that the competing processes will, before too long, give the blocked buffer the opportunity to write. But this is also why you don't want enormous buffers -- you want to interleave disk I/O from different processes relatively finely to try to keep everything going.
The alternative to an external buffer program is to modify your processes to perform internal buffering. That might be an advantage because it removes a whole set of pipes (to an external buffering program) from the mix, and it lightens the process load on the machines. It does mean modifying your working processing program, though, so perhaps it would be better to start with external buffering to see how well that does.
If the problem is that your 40 streams each have a high data rate, and your RAID controller cannot write to physical disk fast enough, then you need to redesign your disk system. Basically, divide it into 40 RAID-1 mirrors and write one file to each mirror set. That makes the writes sequential for each stream, but requires 80 disks.
If the data rate isn't the problem then you need to add more buffering. You might need a pair of threads: one thread to collect the data into memory buffers and another thread to write it into data files and fsync() it. To make the disk writes sequential, it should fsync each output file one at a time. That should result in writing large sequential chunks of whatever your buffer size is. 8 MB maybe?
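A minimal sketch of that two-thread arrangement for a single output stream, using POSIX threads and a double buffer (the 8 MB chunk size and the OUTPUT1 file name are placeholders; partial writes and most error handling are omitted):

#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (8 * 1024 * 1024)        /* placeholder: 8 MB per write */

static char buf[2][CHUNK];             /* double buffer: fill one, write the other */
static size_t fill;                    /* bytes in the buffer being filled */
static size_t full_len;                /* bytes in the buffer handed to the writer */
static int cur;                        /* index of the buffer being filled */
static int full = -1;                  /* index ready to write, or -1 if none */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

/* Collector side: append one small record; hand the chunk over when it fills. */
void put_record(const void *rec, size_t len) {
    pthread_mutex_lock(&m);
    if (fill + len > CHUNK) {
        while (full != -1)             /* back-pressure: wait for the writer */
            pthread_cond_wait(&cv, &m);
        full = cur;
        full_len = fill;
        cur ^= 1;
        fill = 0;
        pthread_cond_signal(&cv);
    }
    memcpy(buf[cur] + fill, rec, len);
    fill += len;
    pthread_mutex_unlock(&m);
}

/* Writer side: write each full chunk as one large sequential write, then fsync(). */
void *writer(void *arg) {
    int fd = open((const char *)arg, O_WRONLY | O_CREAT | O_APPEND, 0644);
    for (;;) {
        pthread_mutex_lock(&m);
        while (full == -1)
            pthread_cond_wait(&cv, &m);
        int idx = full;
        size_t n = full_len;
        pthread_mutex_unlock(&m);

        write(fd, buf[idx], n);        /* one big sequential write */
        fsync(fd);                     /* force it to disk before taking the next chunk */

        pthread_mutex_lock(&m);
        full = -1;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, writer, "OUTPUT1");   /* placeholder file name */
    /* ... read records from the input FIFO and call put_record() for each ... */
    pthread_join(t, NULL);
    return 0;
}

With one such writer per output file (or one shared writer serving the 40 streams round-robin), the disk sees a sequence of large sequential writes instead of thousands of ~10-byte ones.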

FPGA large input data

I am trying to send a 4-kilobyte string to an FPGA. What is the easiest way this can be done?
This is the link for the FPGA that I am using. I am using Verilog and Quartus.
The answer to your question depends a lot on what is feeding this data into the FPGA. Even if there isn't a specific protocol you need to adhere to (SPI, Ethernet, USB, etc.), there is the question of how fast you need to accept the data, and how far the data has to travel. If it's very slow, you can create a simple interface using regular IO pins with a parallel data bus and a clock. If it's much faster, you may need to explore using high speed serial interfaces and the special hard logic available on your chip to handle those speeds. Even if it's slower, but the data needs to travel over some distance, a serial interface may be a good idea to minimize cable costs.
One thing I would add to #gbuzogany's answer: you probably want to configure that block of memory in the FPGA as a FIFO so you can handle the data input clock running at a different rate than the internal clock of your FPGA.
You can use your FPGA blocks to create a memory inside the FPGA chip (you can do that from Quartus). The creation assistant allows you to initialise this memory with anything you want (e.g. a 4 KB string). The problem is that in-FPGA memory uses many of your FPGA blocks, but for a board like this that shouldn't be a problem.
Here is a video explaining how to do that on Quartus:
https://www.youtube.com/watch?v=1nhTDOpY5gU
You can use a string for memory initialization. It's easy to do in Verilog in an 'initial begin ... end' block.
There are 2 ways:
1. You can create a memory block using the Xilinx Core Generator, load the initial data into the memory, and then use that data in your code. Of course, you have to convert the string to binary data.
2. You can write code that has a memory to store the string; it can be a first-in, first-out (FIFO) memory. Then you write a testbench that reads the string from a text file and writes the data into the FIFO. Your FPGA can read the string from the FIFO.
