Writing data on memory at different clocks - verilog

I want to write data coming from different clock domains into a common memory. How can I do that?
I have one common memory block, and that memory block runs on a clock clk. Now I want to write data into that memory coming from different clock domains, i.e. clk1, clk2, clk3, clk4, etc. How do I do that?
I was thinking of using a FIFO for each clock domain, i.e. the 1st FIFO has input clock clk1 and output clock clk (same as the memory), the 2nd FIFO has input clock clk2 and output clock clk (same as the memory), and so on. But it seems that my design will grow too large if I use a large number of FIFOs. Kindly tell me the correct approach.

To pass data units (bytes, words, etc.) safely between clock domains, the asynchronous FIFO is the only safe solution. Note that the FIFO does not have to be deep, but in that case you may need flow control.
You may need flow control anyway as you have many sources all accessing the same memory.
But it seems that my design will grow too large if I use a large number of FIFOs.
Then you have a design problem: your FPGA is not large enough to implement the solution you have chosen. So either move to a bigger FPGA or find a fundamentally different solution to your problem.

A RAM can be written by clock domain A and read by clock domain B using different clocks (dual-ported RAM):
http://www.asic-world.com/examples/verilog/ram_dp_ar_aw.html
This RAM should be wrapped by some controller so that it behaves as an asynchronous FIFO.
Many FPGAs have dedicated RAM components, for example:
Altera UFM, Xilinx BRAM, Cypress Delta39K Cluster Memory Blocks, etc.
Change your device if you have problems fitting large FIFOs.

Related

What is the least amount of (manageable) samples I can give to a PCM buffer?

Some APIs, like this one, can create a PCM buffer from an array of samples (represented by a number).
Say I want to generate and play some audio in (near) real time. I could generate a PCM buffer with 100 samples and send them off to the sound card using my magic API functions. As those 100 samples are playing, 100 more samples are generated, and then the buffers are switched. Finally, I can repeat the writing / playing / switching process to create a constant stream of audio.
Now, for my question. What is the smallest buffer size (in samples) I can use with the write / play / switch approach without a perceivable pause occurring in the audio stream? I understand the answer here will depend on sample rate, processor speed, and transfer time to the sound card - so please provide a "rule of thumb" style answer if that's more appropriate!
(I'm a bit new to audio stuff, so please feel free to point out any misconceptions I might have!)
TL;DR: 1 ms buffers are easily achievable on desktop operating systems if care is taken, though it might not be desirable from a performance and energy-usage perspective.
The lower limit on buffer size (and thus output latency) is set by the worst-case scheduling latency of your operating system.
The sequence of events is:
1. The audio hardware progressively outputs samples from its buffer.
2. At some point it reaches a low-water mark and generates an interrupt, signalling that the buffer needs replenishing with more samples.
3. The operating system services the interrupt and marks the thread as ready to run.
4. The operating system schedules the thread to run on a CPU.
5. The thread computes, or otherwise obtains, samples and writes them into the output buffer.
The scheduling latency is the time between steps 2 and 4 above, and it is dictated largely by the design of the host operating system. If you are using a hard RTOS such as VxWorks or eCos with pre-emptive priority scheduling, the worst case can be on the order of fractions of a microsecond.
General-purpose desktop operating systems are generally less slick. Mac OS X supports real-time user-space scheduling and is easily capable of servicing 1 ms buffers. The Linux kernel can be configured for pre-emptive real-time threads, with bottom-half interrupt handlers run in kernel threads. You ought to be able to achieve 1 ms buffer sizes there too. I can't comment on the capabilities of recent versions of the NT kernel.
It's also possible to take a (usually bad) latency hit in step 5, when your process fills the buffer, if it takes a page fault. Usual practice is to obtain all of the heap and stack memory you require up front and mlock() it, along with your program code and data, into physical memory.
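A minimal sketch of that setup on Linux/POSIX, combining the real-time scheduling and memory-locking points above (the priority value is illustrative, error handling is omitted, and both calls need appropriate privileges or rlimits to succeed):

```c
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>

/* Call once from the audio thread before entering the render loop. */
static void rt_audio_setup(void)
{
    /* Pin current and future pages so filling the output buffer never
       takes a page fault (step 5 above). */
    mlockall(MCL_CURRENT | MCL_FUTURE);

    /* Ask for a real-time (SCHED_FIFO) priority so the scheduler runs this
       thread promptly when the low-water-mark interrupt wakes it. */
    struct sched_param sp = { .sched_priority = 80 };
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}
```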
Absolutely forget about achieving low latency in an interpreted or JIT-compiled language run-time. You have far too little control over what the language run-time is doing, and no realistic prospect of preventing page faults (e.g. from memory allocation). I suspect 10 ms is pushing your luck in these cases.
It's worth noting that rendering short buffers has a significant impact on system performance (and energy consumption) due to the high rate of interrupts and context switches. These destroy L1 cache locality in a way that's disproportionate to the work they actually do.
While 1 ms audio buffers are possible, they are not necessarily desirable. The scheduler tick interval of modern Windows, for example, is between 10 ms and 30 ms. That means that, usually at the audio driver end, you need to keep a ring buffer of a bunch of those 1 ms packets to deal with buffer-starvation conditions, in case the CPU gets pulled out from under you by some other thread.
All modern audio engines use powers of 2 for their audio buffer sizes. Start with 256 samples per frame and see how that works for you. Every piece of hardware is different, and you can't depend on how a Mac or PC gives you time slices. The user might be calculating pi on some other process while you are running your audio program. It's safer to leave the buffer size large, like 2048 samples per frame, and let the user turn it down if the latency bothers them.
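To make the write / play / switch loop from the question concrete, here is a small sketch in C against a hypothetical blocking audio_write() call (not a real API; substitute whatever your platform provides). The frame count is the knob the answers above suggest tuning:

```c
#include <stdint.h>

#define FRAMES 256          /* samples per buffer; increase if the audio stutters */

extern void audio_write(const int16_t *buf, int n);   /* hypothetical blocking call */
extern void generate_samples(int16_t *buf, int n);    /* your synthesis code        */

void stream_audio(void)
{
    int16_t buf[2][FRAMES];
    int cur = 0;

    for (;;) {
        generate_samples(buf[cur], FRAMES);  /* fill the idle buffer              */
        audio_write(buf[cur], FRAMES);       /* blocks until the device accepts it */
        cur ^= 1;                            /* switch buffers                     */
    }
}
```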

How to decide when to use memory barrier

As part of writing driver code, I have come across code which uses memory barriers (fencing). After reading and searching on Google, I learnt why they are used and how they help on SMP systems. Thinking this through: in multi-threaded programming there are many places with potential memory races, and putting barriers everywhere would cost CPU time. My questions are:
I know the specific code paths which access shared memory; do I need a memory barrier in all of these places?
Is there any specific technique or tip which will help me identify these pitfalls?
These are very generic questions, but I wanted to get insight from others' experiences and any tips which would help to identify such pitfalls.
Often device hardware is sensitive to the order in which device registers are written. Modern systems are weakly ordered and typically have write-combining hardware between the CPU and memory.
Suppose you write a single byte of a 32-bit object. What is in the write-combining hardware is now A _ _ _. Instead of immediately initiating a read/modify/write cycle to update the A byte, the hardware sets a timer. The hope is that the CPU will send the B, C and D bytes before the timer expires. When the timer expires, the data in the write-combining register gets dumped into memory.
Issuing a barrier causes the write-combining hardware to flush what it has. If only slot A is occupied then only slot A gets written.
Now suppose the hardware expected the bytes to be written in the strict order A, C, B, D. Without precautions the hardware registers get written in the wrong order. The result is what you'd expect: Ready! Fire! Aim!
Barriers should be placed judiciously because their incorrect use can seriously impede performance. Not every device write needs a barrier; judgement is called for.
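As a concrete illustration, here is a minimal sketch of the classic case in a Linux driver: a descriptor is filled in coherent DMA memory and the device is then told about it through a doorbell register. The structure, register offset and names are invented for the example:

```c
#include <linux/io.h>
#include <linux/types.h>

struct tx_desc {
    u64 addr;
    u32 len;
    u32 flags;                      /* DESC_VALID marks the entry as ready */
};

#define DESC_VALID   0x1
#define REG_DOORBELL 0x10

static void post_descriptor(struct tx_desc *desc, dma_addr_t buf, u32 len,
                            void __iomem *iobase)
{
    desc->addr  = buf;
    desc->len   = len;
    desc->flags = DESC_VALID;

    /* Force the descriptor writes out of any write-combining/store buffers
       before the doorbell write, or the device may see a half-written
       descriptor when it reacts to the doorbell. */
    wmb();
    writel_relaxed(1, iobase + REG_DOORBELL);

    /* Note: a plain writel() already implies this ordering on Linux, so the
       explicit wmb() is only needed with the _relaxed accessor. */
}
```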

FPGA large input data

I am trying to send a 4 kilobyte string to an FPGA; what is the easiest way to do this?
This is the link for the FPGA that I am using. I am using Verilog and Quartus.
The answer to your question depends a lot on what is feeding this data into the FPGA. Even if there isn't a specific protocol you need to adhere to (SPI, Ethernet, USB, etc.), there is the question of how fast you need to accept the data, and how far the data has to travel. If it's very slow, you can create a simple interface using regular IO pins with a parallel data bus and a clock. If it's much faster, you may need to explore using high speed serial interfaces and the special hard logic available on your chip to handle those speeds. Even if it's slower, but the data needs to travel over some distance, a serial interface may be a good idea to minimize cable costs.
One thing I would add to gbuzogany's answer: you probably want to configure that block of memory in the FPGA as a FIFO, so that you can handle the data-input clock running at a different rate than the internal clock of your FPGA.
You can use your FPGA blocks to create a memory inside the FPGA chip (you can do that from Quartus). The creation assistant allows you to initialise this memory with anything you want (e.g. a 4 KB string). The drawback is that in-FPGA memory uses many of your FPGA blocks, but for a board like this it should not be a problem.
Here is a video explaining how to do that on Quartus:
https://www.youtube.com/watch?v=1nhTDOpY5gU
You can use a string for memory initialization. It's easy to do in Verilog inside an 'initial begin ... end' block.
There are 2 ways:
1. You can create a memory block using the Xilinx Core Generator, load the initial data into the memory, and then use that data in your code. Of course, you have to convert the string to binary data first.
2. You can write code which has a memory to store the string; it can be a First-In-First-Out (FIFO) memory. Then you write a testbench that reads the string from a text file and writes the data into the FIFO. Your FPGA logic can then read the string from the FIFO.

Is the maximum interrupt frequency for the linux kernel in Hz, kHz, MHz, or GHz?

Before I start: yes, I'm aware that the answer is architecture dependent - I'm just interested in a ballpark figure, in terms of orders of magnitude.
Is there an upper limit imposed by the linux kernel on interrupt frequency?
Background: I want to interface with a camera module from within Linux. The module has a clocked parallel data output (8 bits, at ~650 kHz), which I want to read data from and store in a buffer for access through, e.g., /dev/camera.
I have a basic driver written, and it is monitoring the appropriate interrupt line. If I leave a wire hanging off the interrupt pin, I get interrupts from white noise. However, if I hook up a higher-frequency signal (at the moment ~250 kHz from a 555 timer) then no interrupts are triggered. (I've confirmed this with /proc/interrupts.)
My thinking is that this can either be from the GPIO module on the processor not being able to deal with such high frequencies (which would be silly - that's not particularly high), or it could be a kernel issue. What do people think?
Look at it this way. Modern CPUs execute around 10^9 instructions per second.
In order to handle an interrupt you need to execute some 100-1000 instructions (save the context, do I/O, signal end of interrupt handling, restore the context). That gives you some 10^6 - 10^7 interrupts per second max.
If you spend all the time in handling interrupts, then nothing is left for the rest of the system and programs.
So, think of some 10^5 interrupts/second (100 kHz) as being the maximum practical interrupt rate.
There may be other limitations imposed by the circuitry and I'm not too familiar with this aspect. But it's unlikely for the kernel to somehow explicitly limit the interrupt rate. I see no good reason for it and I don't think it's something that can be easily done either.
Now, there are things like DMA that let you take an interrupt not on every byte of input/output data, but on a buffer of several kilobytes or even megabytes. E.g. you prepare your data for output in a memory buffer and tell the DMA controller that it can now send it out from the buffer. When done, it will trigger an interrupt signalling the completion of the transfer, and you'll be able to initiate another one. It works the same way in the other transfer direction: you get an interrupt when the entire buffer is filled with input data.
I think you may be facing a hardware limitation if you can receive interrupts at lower rates only.

How to generate a square wave by Linux kernel

I need to develop a Linux driver that generates a square wave with a period of about 1 ms, on a MIPS platform (this is not i386).
I tried some methods, but without success:
Using a timer/hrtimer --> the period was 12 ms and unstable
I cannot use real-time add-on packages such as RTLinux/RTAI, because these do not support MIPS
Using a kernel thread with a forever loop and udelay() --> it takes too much CPU --> the performance is not acceptable
Can anyone help me with this? (Please help!)
Thank you.
The Unix way would be to not do this at all. Maybe in olden times on single-task machines you would have done it like this, but now, if you don't have a hardware circuit that produces the proper frequency, you may never succeed, because hardware timers don't have the necessary resolution, and it can always happen that a task of higher importance grabs your CPU time.
As FrankH said, the best solution involves relying on hardware. You should check your processor's reference manual to see if it has a timer.
I'll add this: if it happens to have an Output Compare or PWM subsystem (I'm not familiar with MIPS, but it's not at all uncommon in embedded devices) you can just write a few registers to set it all up, and then you don't need any more processor time.
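Purely as an illustration of "write a few registers and let the hardware do the rest", here is a sketch of programming a hypothetical memory-mapped PWM/output-compare block from a Linux driver. The base address, register offsets and bit layout are invented, so substitute the ones from your SoC's reference manual:

```c
#include <linux/io.h>
#include <linux/errno.h>

#define PWM_BASE_PHYS  0x1f000400UL   /* invented base address            */
#define PWM_PERIOD     0x00           /* period in timer ticks (invented) */
#define PWM_DUTY       0x04           /* high time in ticks    (invented) */
#define PWM_CTRL       0x08           /* control register      (invented) */
#define PWM_CTRL_EN    0x1

static int setup_square_wave(void)
{
    void __iomem *pwm = ioremap(PWM_BASE_PHYS, 0x10);

    if (!pwm)
        return -ENOMEM;

    /* Assuming a 1 MHz timer clock: 1000 ticks period, 500 ticks high
       gives a 1 kHz (1 ms) square wave with 50% duty cycle. */
    writel(1000, pwm + PWM_PERIOD);
    writel(500,  pwm + PWM_DUTY);
    writel(PWM_CTRL_EN, pwm + PWM_CTRL);

    /* From here on the waveform is generated entirely in hardware; the CPU
       is not involved at all. Keep the mapping around if you need to stop
       or reprogram the unit later. */
    return 0;
}
```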
It might be possible, but to get this from within Linux, the hardware must have certain characteristics:
you need a programmable timer device that can create an interrupt at sufficiently high priority that other activity by the Linux kernel (such as scheduling or other interrupts, even) won't preempt / block the interrupt handler, and at sufficient granularity/frequency to meet your signal-stability constraints
the "square wave" electrical line must also be programmable, and the operation (register write? memory-mapped register write? special CPU instruction? ...) which switches its state must be guaranteed faster than the shortest cycle time used with the timer above (or else you could get "frequency moiré")
If that's the case then your special timer device driver can toggle the line from within its high prio interrupt handler and create the square wave. Since it's both interrupt driven and separate from the normal timer interrupt sources / consumers (i.e. not shared - no latency from possibly dispatching multiple timer events per interrupt), you've got a much better chance of sufficient precision.
Since all this (the existence of a separately-programmable timer device, to start with) is hardware-specific, you need to start with the specs of your CPU/SoC/board and find out if there are multiple independent timers available.
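For completeness, here is a minimal sketch of the generic interrupt-driven toggling approach, using the stock hrtimer and legacy GPIO APIs (the pin number is illustrative). As the question notes, on some platforms this jitters far too much, which is exactly why a dedicated, separately-programmable timer or PWM unit is the more robust route:

```c
#include <linux/module.h>
#include <linux/hrtimer.h>
#include <linux/ktime.h>
#include <linux/gpio.h>

#define SQW_GPIO       42                    /* illustrative pin number      */
#define HALF_PERIOD_NS (500 * 1000)          /* 0.5 ms -> 1 ms square wave   */

static struct hrtimer sqw_timer;
static int level;

static enum hrtimer_restart sqw_cb(struct hrtimer *t)
{
    level ^= 1;
    gpio_set_value(SQW_GPIO, level);         /* toggle the output line       */
    hrtimer_forward_now(t, ns_to_ktime(HALF_PERIOD_NS));
    return HRTIMER_RESTART;                  /* re-arm for the next edge     */
}

static int __init sqw_init(void)
{
    gpio_request_one(SQW_GPIO, GPIOF_OUT_INIT_LOW, "sqw");
    hrtimer_init(&sqw_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    sqw_timer.function = sqw_cb;
    hrtimer_start(&sqw_timer, ns_to_ktime(HALF_PERIOD_NS), HRTIMER_MODE_REL);
    return 0;
}

static void __exit sqw_exit(void)
{
    hrtimer_cancel(&sqw_timer);
    gpio_free(SQW_GPIO);
}

module_init(sqw_init);
module_exit(sqw_exit);
MODULE_LICENSE("GPL");
```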

Resources