When is the Fourier transform of a signal periodic?

Also, is the inverse Fourier transform of a periodic signal also periodic?

I'm going to assume that we start with a signal for which a Fourier transform exists (such as an absolutely integrable function). If we construct another signal by sampling this original signal at regular time intervals, then the Fourier transform of that newly constructed signal corresponds to the Discrete-Time Fourier Transform (DTFT), which is periodic. Note that if the original signal's bandwidth is less than the Nyquist frequency, the original signal can also be recovered from this frequency-domain representation (there is no aliasing).
Conversely, if we sample a continuous frequency-domain function at regular intervals, the corresponding inverse transform is a periodic signal in the time domain. Correspondingly, a signal that is both evenly time-sampled and periodic has a transform that is both periodic and evenly sampled in frequency (i.e. the inverse Fourier transform of that periodic spectrum is also periodic).
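To sketch why sampling in one domain produces periodicity in the other (my notation, not from the question): if x(t) has transform X(f) and is sampled with period T = 1/f_s, the transform of the sampled signal is

X_s(f) = sum_n x(nT) e^(-j 2 pi f n T) = (1/T) sum_k X(f - k f_s),

which satisfies X_s(f + f_s) = X_s(f), i.e. it repeats with period f_s. By the same argument with the roles of time and frequency swapped, sampling the spectrum at a spacing of delta_f gives a time-domain signal that repeats with period 1/delta_f.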

Related

Load balancing by matching process peaks and minima

Consider a machine running two independent multi-threaded processes which have a periodic load with an amplitude greater than 50% of the total CPU power. Say a load like 0.6 sin(t)^2 (t is time) for both processes. In general, of course, there could be fluctuations and differing periods.
If the processes are locked in phase with a phase difference of pi/2, the computation is not slowed down. However, for certain phase differences the CPU would throttle at maximum load and slow the computation down.
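To make the pi/2 case concrete (my arithmetic, using the load model from the question): 0.6 sin(t)^2 + 0.6 sin(t + pi/2)^2 = 0.6 sin(t)^2 + 0.6 cos(t)^2 = 0.6, so the combined load is constant and stays below full capacity. With zero phase difference the combined load peaks at 1.2, i.e. 120% of the CPU, which is where the slowdown comes from.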
Are there mechanisms to avoid these overlaps? On which level are they implemented?

DirectCompute multithreading performance (threads and thread groups) for multidimensional array processing

I understand that Dispatch(x, y, z) defines how many groups of threads are instantiated and numthreads(n, m, p) gives the size of each group.
Together, Dispatch and numthreads give the total number of threads. I also understand that the dispatch arguments are used to pass parameters to each thread.
Questions:
1) Is there a performance difference between I groups of J threads and J groups of I threads? Both options give the same total number of threads.
2) Assuming I have to process a two-dimensional matrix whose size is only known at runtime, it is convenient to use Dispatch(DimX, DimY, 1) and numthreads(1, 1, 1) so that I have exactly one thread per matrix element, whose position is given by DTid.xy. Since the numthreads() arguments are determined at compile time, how can I get the exact number of threads required to process a matrix whose dimensions are not a multiple of the thread group size and are not known at compile time?
1) Yes, there is (or can be) a performance difference, depending on the actual numbers and on the hardware in use!
GPUs (usually) execute threads in multiple so-called "waves". These waves work in a SIMD-like fashion (all threads in a wave always execute the same operations at the same time). The exact number of threads per wave is vendor-specific, but is usually 32 (all NVIDIA GPUs I know of) or 64 (most AMD GPUs).
A single group of threads can be distributed across multiple waves. However, a single wave can only execute threads of the same group. Therefore, if your number of threads per group is not a multiple of the hardware's wave size, some threads in a wave are "idling" (they actually do the same things as the other ones, but aren't allowed to write into memory), so you are "losing" performance that you would get with a better number of threads.
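As a concrete example (the numbers are mine, just for illustration): a group declared with numthreads(10, 10, 1) contains 100 threads; on hardware with 32-wide waves that occupies 4 waves, i.e. 128 lanes, so 28 lanes per group sit permanently idle, roughly 22% of the execution slots wasted. A group size of 64 or 128 maps exactly onto both 32- and 64-wide hardware.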
2) You would most likely select a thread count that suits your hardware (64 would be a good default value, as it is also a multiple of 32), and use branching to mark the threads that fall outside your matrix as "inactive" (you can pass the size of the matrix/data to the shader using a constant buffer). Since these inactive threads aren't doing anything at all, the hardware can simply mask them as "read-only" (similar to how they are handled when the number of threads per group is smaller than the wave size), which is quite cheap. If all threads in a wave are marked inactive, the hardware can even choose to skip the work of this wave completely, which would be optimal.
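A minimal sketch of the host-side arithmetic this implies, in plain C (the 8x8 group size, the matrix dimensions, and div_round_up are my own example values and helper, not from the question; the in-shader bounds test is only indicated in a comment):

#include <stdio.h>

/* Assumed to match a shader compiled with numthreads(8, 8, 1). */
#define GROUP_SIZE_X 8
#define GROUP_SIZE_Y 8

/* Round up so the grid of groups covers the whole matrix,
   even when the dimensions are not multiples of the group size. */
static unsigned int div_round_up(unsigned int n, unsigned int d)
{
    return (n + d - 1) / d;
}

int main(void)
{
    unsigned int dimX = 1000, dimY = 750;  /* runtime matrix size (example) */

    unsigned int groupsX = div_round_up(dimX, GROUP_SIZE_X);  /* 125 */
    unsigned int groupsY = div_round_up(dimY, GROUP_SIZE_Y);  /*  94 */

    /* Dispatch(groupsX, groupsY, 1) then launches enough groups;
       inside the shader each thread checks its global id against the
       real size passed in a constant buffer, e.g.
           if (DTid.x >= dimX || DTid.y >= dimY) return;            */
    printf("Dispatch(%u, %u, 1) covers a %ux%u matrix\n",
           groupsX, groupsY, dimX, dimY);
    return 0;
}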
You could also use padding to make sure that your matrix/data is always a multiple of the number of threads per group, e.g. with zeroes or the identity matrix or whatever. However, whether this can be done depends on the application, and I would assume that branching would be as fast, if not faster, in most cases.

same source, different clk frequency (multi-clock design)

How do I handle multi-clock design signals when the clocks are generated from the same source?
For example:
one clock domain is 25 MHz,
the other one is 100 MHz.
How can I handle a data bus going from 25 MHz to 100 MHz,
and also from 100 MHz to 25 MHz?
I don't want to use an async FIFO, though.
Is there any other easy CDC way to handle it?
Case 1: If the source ensures that the edges of the clocks are aligned, there is no need to do anything in the design. There is no difference between single-bit and multi-bit data in this case.
Case 2: If the edges are not aligned, but the phase relationship is known, the clocks are still synchronous. The synthesis/STA/P&R tool can calculate the worst cases for the timing (e.g. setup/hold) checks. If there is no violation, again nothing needs to be done. The most important part here is defining the timing constraints correctly.
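As an illustrative example with the frequencies from the question (my numbers, assuming the rising edges of the 25 MHz and 100 MHz clocks line up): a path launched by the 25 MHz clock is checked for setup against the nearest following 100 MHz capture edge, i.e. a 10 ns window, so the crossing effectively has to meet 100 MHz timing unless you relax it with a multicycle-path constraint.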
Case 3: If the clocks are asynchronous, one solution is carrying an enable signal along with the bus. The enable signal is synchronized by a pair of flip-flops. Then the data bits are masked or passed according to the value of the synchronized enable signal. This solution is explained here, along with many other solutions and cases.
It depends on whether the two clocks are synchronous or asynchronous with respect to each other. You can use a 2-stage (or n-stage) flip-flop synchronizer to mitigate the metastability issue in CDC. Other approaches are a mux-based handshake mechanism or a Gray-code counter.
If you are sending data from the slower clock domain to the faster clock domain, the fast clock should be at least 1.5 times the slow clock for a plain synchronizer to capture every value.
For the faster-to-slower direction, the data from the fast clock domain should be held stable for at least 1.5 periods of the slower clock (or use a handshake) so that the slow domain is guaranteed to sample it.
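Applied to the numbers in the question (my arithmetic, using the 1.5x rule of thumb above): 100 MHz is 4 times 25 MHz, so the 25 MHz -> 100 MHz direction easily meets the ratio. In the 100 MHz -> 25 MHz direction, a signal generated in the 100 MHz domain should stay stable for at least 1.5 x 40 ns = 60 ns, i.e. 6 cycles of the 100 MHz clock, before it can be assumed to be safely captured.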

get time until next output

Some kinds of I/O operate at exact frequencies. Under extreme latency requirements, it would be useful to know how much time there is until the next piece of data is due to arrive on a certain fd.
For example, consider a stream processor which must output data to hardware at some predetermined points in time. Suppose the stream's content depends on some input. In order to reduce latency from input to output, the stream processor should wait for input for as long as possible before rendering the next piece of data. In order to do that, though, the processor needs to know how much time is left before the data is required.
Are there extensions to the standard unix I/O library (unistd.h, read(), write(), file descriptors, etc.) that allow data streams to operate in a mode where you can determine the time until the next I/O operation? Is there a word for this kind of I/O extension?
You need to be more precise about your question. Technically speaking, you would start a timer when one output is produced and stop it when the next output is produced; the difference between the two gives you the output period.
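A minimal sketch of that measurement with standard POSIX calls (CLOCK_MONOTONIC is available on Linux and most Unix systems; write_output() is a hypothetical stand-in for whatever emits one piece of output):

#include <stdio.h>
#include <time.h>

/* Hypothetical placeholder for the code that emits one output. */
static void write_output(void) { /* ... */ }

static double seconds_between(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    struct timespec t0, t1;

    write_output();                       /* first output */
    clock_gettime(CLOCK_MONOTONIC, &t0);  /* start the "timer" */

    write_output();                       /* second output */
    clock_gettime(CLOCK_MONOTONIC, &t1);  /* stop the "timer" */

    /* The measured interval estimates the output period, i.e. how much
       time remains until the next output is due after any given one. */
    printf("measured output period: %.6f s\n", seconds_between(t0, t1));
    return 0;
}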

Video encoding pipeline - threads design

I work on a system which does video capture and encoding of multiple channels. Each stage takes time. The capture/encoding is done in hardware, but it can still take its time to finish.
capture frames -> encode -> file-save (or stream to network)
I have a dilemma about which would be the better approach/design:
One thread per channel, which calls the pipeline's non-blocking APIs one after the other, such as:
while (1)
{
    frame = get_next_capture_frame();                                          // non-blocking API - a new frame every 1/60 s
    prev_bitstream = send_to_encode_and_get_any_already_encoded_frame(frame);  // non-blocking API
    send_to_save_bitstream(prev_bitstream);                                    // non-blocking API
    delay(1/60);                                                               // wait 1/60 s
}
Or is it better to use several threads, each doing its own job:
one thread for capture, another for encoding, and another for file-management.
This problem gets more complex when more than one channel is involved (about 6 channels, which might result in 6 threads in the first approach and 18 threads in the second approach).
Another dilemma in this problem domain: should a thread wake up periodically and do the work waiting in its queue (say wake up every 1/60 s for 60 fps), or should a thread wake up in response to a new event (a new buffer for capture, a new buffer for the encoder, etc.)?
It kind of depends on the requirements. If you know that you'll always have 6 channels at 60 FPS, and that the Capture/Encode/Save process will take less than 1/60 second, then one thread per channel is the easiest to code. But be aware that if encoding or saving sometimes takes too long, you won't get the next frame on schedule.
You could use a pipelined approach (similar to your second option), but not on a per-channel basis. That is, you could have a single thread that does nothing but read and store a frame from each channel 60 times per second. Those frames go into the Captured queue. You have a separate thread (or perhaps multiple threads) reading from the Captured queue, encoding the data, and placing the results on the Output queue. Finally, one or more output threads read the Output queue and save the encoded data.
The queues are shared so that all of the Encoding threads, for example, read from the same Captured queue and write to the same Output queue. Most programming languages these days have efficient thread-safe queues. Many have such structures that don't require busy waiting. That is, the Encoding threads can do a non-busy wait on the Captured queue if it's empty. The threads will be notified when something is placed on the queue. See .NET's BlockingCollection or Java's ConcurrentLinkedQueue, for example.
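For illustration, here is a minimal sketch of such a blocking queue in C with POSIX threads (a fixed-size ring buffer; the names, capacity, and omission of error handling and shutdown are my own simplifications). BlockingCollection and similar classes give you the same behaviour out of the box:

#include <pthread.h>

#define QUEUE_CAPACITY 64

typedef struct {
    void           *items[QUEUE_CAPACITY];
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
    pthread_cond_t  not_full;
} blocking_queue;

void queue_init(blocking_queue *q)
{
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Producer side: the capture thread pushes a frame, the encoder pushes a bitstream. */
void queue_push(blocking_queue *q, void *item)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAPACITY)           /* non-busy wait for free space */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QUEUE_CAPACITY;
    q->count++;
    pthread_cond_signal(&q->not_empty);          /* wake one waiting consumer */
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side: an encoder or output thread sleeps until work arrives. */
void *queue_pop(blocking_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                        /* non-busy wait for work */
        pthread_cond_wait(&q->not_empty, &q->lock);
    void *item = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}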
That model scales well. If, for example, you need more encoding threads to keep up with the throughput, you can just add more of those. You might end up with, for example, two capture threads, 8 encoders, and a single output thread. You can balance it based on your workload.
As for the scheduling, I suspect you'd want your capture thread(s) to operate on a periodic basis (i.e. once every 1/60 second, or whatever your frame rate is). The encoding and output threads should be configured to wait on their respective queues. There's no reason for the output thread, for example, to continually poll the Output queue for data. Instead, it can be idle (waiting) and get notified when a packet is placed in the Output queue.
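A small sketch of what the periodic capture loop could look like (POSIX clock_nanosleep against an absolute deadline avoids accumulating drift; the 60 fps period and capture_all_channels() are placeholder assumptions, not part of the question):

#include <time.h>

#define FRAME_NS (1000000000L / 60)   /* one frame period (1/60 s) in nanoseconds */

/* Hypothetical stand-in for grabbing one frame from every channel
   and pushing the frames onto the Captured queue. */
void capture_all_channels(void);

void capture_loop(void)
{
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (;;) {
        capture_all_channels();

        /* Advance the absolute deadline by exactly one frame period. */
        next.tv_nsec += FRAME_NS;
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec += 1;
        }
        /* Sleep until the absolute deadline so timing errors don't accumulate. */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}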
The details of how to do this for a video encoder might make the approach unnecessarily messy. I really don't know. If the encoder requires channel-specific state information, it becomes more difficult. It's especially difficult when you consider that the model would allow two encoders to be working on frames from the same channel. If that happens then you need some way to sequence the output.
Your first approach is the simplest, and it's what I'd do for the first cut of my program. If that can't maintain the throughput you need, then I'd consider more complex approaches.
