ALSA lib mmap direct write - Linux

I am just messing around with the ALSA library and can't really figure out how to do playback with a direct write.
I am using SND_PCM_ACCESS_MMAP_INTERLEAVED.
I am trying to write a square wave.
I created a buffer of shorts to hold the square wave. I have tested it with snd_pcm_writei() and it works.
I then call snd_pcm_mmap_begin() and use the pointers returned in areas to write to the device:
while (1)
{
    int msg;
    frames_available = snd_pcm_avail_update(handle);
    snd_pcm_mmap_begin(handle, &areas, &offset, &limit_frames);
    frames_to_write = frames; // frames is the size of the buffer in frames
    if (frames_to_write > limit_frames)
        frames_to_write = 0;
    int offset_frames = (areas[0].first + offset * areas[0].step) / 16;
    short* write_ptr = (short*)areas[0].addr + offset_frames;
    // fill the buffer with stuff
    for (int i = 0; i < frames_to_write; i++)
    {
        write_ptr[i] = buffer[i];
    }
    msg = snd_pcm_mmap_commit(handle, offset, frames_to_write);
}
The sound produced is choppy and gets cut off soon after. It gets cut off because limit_frames reaches 0, and I notice that limit_frames stays at 0 even when frames_available is positive.
EDIT:
I used memcpy() instead of a for loop and that solved the choppiness. It still gets cut off, though. Now I'm curious why memcpy() solves the choppiness; shouldn't the for loop and memcpy() both copy the memory contiguously?

Using mmap does not make sense if all you're doing is copying the samples from another buffer; that's exactly what snd_pcm_writei() does.
Anyway, before calling snd_pcm_mmap_begin(), you must set its last parameter to the number of frames you intend to write, and when it returns a smaller number, you should write that number instead of 0.
When you have more than one channel, a frame is larger than one sample.
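Putting the answer together, here is a minimal sketch of a corrected write, assuming 16-bit interleaved samples; the function name write_period_mmap, the channels parameter and the error handling are illustrative and not part of the question's code:

#include <alsa/asoundlib.h>
#include <string.h>

/* Minimal sketch: write up to 'frames' interleaved 16-bit frames from 'buffer'
 * through the mmap'd area. 'channels' is assumed to match the hw params. */
static int write_period_mmap(snd_pcm_t *handle, const short *buffer,
                             snd_pcm_uframes_t frames, unsigned channels)
{
    snd_pcm_sframes_t avail = snd_pcm_avail_update(handle);
    if (avail < 0)
        return snd_pcm_recover(handle, (int)avail, 0);

    const snd_pcm_channel_area_t *areas;
    snd_pcm_uframes_t offset;
    snd_pcm_uframes_t want = frames;                 /* may be reduced by ALSA */
    int err = snd_pcm_mmap_begin(handle, &areas, &offset, &want);
    if (err < 0)
        return err;

    /* 'first' and 'step' are in bits, so divide by 8 to get a byte offset. */
    short *dst = (short *)((char *)areas[0].addr
                           + (areas[0].first + offset * areas[0].step) / 8);
    memcpy(dst, buffer, want * channels * sizeof(short));

    snd_pcm_sframes_t done = snd_pcm_mmap_commit(handle, offset, want);
    if (done < 0 || (snd_pcm_uframes_t)done != want)
        return snd_pcm_recover(handle, done < 0 ? (int)done : -EPIPE, 0);
    return (int)done;                                /* frames actually written */
}

The key differences from the question's loop are that the number of frames requested is set before snd_pcm_mmap_begin(), and the (possibly smaller) number it returns is the amount actually written and committed.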

Related

Configure SAI peripheral on STM32H7

I'm trying to play a sound on a single speaker (mono) from a .wav file on an SD card, using an STM32H7 controller and a FreeRTOS environment.
I currently manage to generate sound, but it is very dirty and jerky.
I'd like to show the parsed header content of my wav file, but my reputation score is below 10.
The most important fields are:
Format: PCM
Channels: 1
Sample rate: 44100 Hz
Bits per sample: 16
I initialize the SAI2 block A this way :
void MX_SAI2_Init(void)
{
  /* USER CODE BEGIN SAI2_Init 0 */
  /* USER CODE END SAI2_Init 0 */
  /* USER CODE BEGIN SAI2_Init 1 */
  /* USER CODE END SAI2_Init 1 */
  hsai_BlockA2.Instance = SAI2_Block_A;
  hsai_BlockA2.Init.AudioMode = SAI_MODEMASTER_TX;
  hsai_BlockA2.Init.Synchro = SAI_ASYNCHRONOUS;
  hsai_BlockA2.Init.OutputDrive = SAI_OUTPUTDRIVE_DISABLE;
  hsai_BlockA2.Init.NoDivider = SAI_MASTERDIVIDER_ENABLE;
  hsai_BlockA2.Init.FIFOThreshold = SAI_FIFOTHRESHOLD_EMPTY;
  hsai_BlockA2.Init.AudioFrequency = SAI_AUDIO_FREQUENCY_44K;
  hsai_BlockA2.Init.SynchroExt = SAI_SYNCEXT_DISABLE;
  hsai_BlockA2.Init.MonoStereoMode = SAI_MONOMODE;
  hsai_BlockA2.Init.CompandingMode = SAI_NOCOMPANDING;
  hsai_BlockA2.Init.TriState = SAI_OUTPUT_NOTRELEASED;
  if (HAL_SAI_InitProtocol(&hsai_BlockA2, SAI_I2S_STANDARD, SAI_PROTOCOL_DATASIZE_16BIT, 2) != HAL_OK)
  {
    Error_Handler();
  }
  /* USER CODE BEGIN SAI2_Init 2 */
  /* USER CODE END SAI2_Init 2 */
}
I think I set the clock frequency correctly, as I measure a frame sync clock of 43 kHz (the closest I can get to 44.1 kHz).
The file indicates it uses the PCM format. My init function uses SAI_I2S_STANDARD, but only because I was curious about the result with this parameter value; I get bad results in both cases.
And here is the part where I read the file and send the data to the SAI DMA:
// Before the infinite loop I extract the overall file size in bytes.
// Infinite loop
for (;;)
{
    if (drv_sdcard_getDmaTransferComplete() == true)
    {
//      BufferRead[0] = 0xAA;
//      BufferRead[1] = 0xAA;
//
//      ret = HAL_SAI_Transmit_DMA(&hsai_BlockA2, (uint8_t*)BufferRead, 2);
//      drv_sdcard_resetDmaTransferComplete();
        if ((firstBytesDiscarded == true) && (remainingBytes > 0))
        {
            // read the next BufferAudio-sized chunk of audio samples
            if (remainingBytes < sizeof(BufferAudio))
            {
                remainingBytes -= drv_sdcard_readDataNoRewind(file_audio1_index, BufferAudio, remainingBytes);
            }
            else
            {
                remainingBytes -= drv_sdcard_readDataNoRewind(file_audio1_index, BufferAudio, sizeof(BufferAudio));
            }
            // send them through the SAI by DMA
            ret = HAL_SAI_Transmit_DMA(&hsai_BlockA2, (uint8_t*)BufferAudio, sizeof(BufferAudio));
            // reset the transmit flag to forbid the next transmit
            drv_sdcard_resetDmaTransferComplete();
        }
        else
        {
            // discard the first header-size bytes
            // (removed here because it works properly on my side)
            firstBytesDiscarded = true;
        }
    }
}
I have one avenue for sound quality improvement: filtering the speaker input. Yesterday I tried cutting at about 20 kHz and 44 kHz, but that cut the signal too much, so I want to try different cutoff frequencies until the sound quality is good. It is a simple RC filter.
But I don't know what to do to fix the jerky part. To give you an idea of how the sound comes out, I would describe it like this:
we can hear a bit of melody
then scratchy sound [krrrrrrr]
then short silence
and this loops until the end of the file.
Buffer Audio size is 16*1024 bytes.
Thank you for your help
Problems
No double-buffering. You are reading data from the SD card into the same buffer that you are playing from, so you get some samples from the previous read and some samples from the new read.
Not checking when the DMA is complete. HAL_SAI_Transmit_DMA() returns immediately, and you cannot call it again until the previous DMA has completed.
Not checking return values of HAL functions. You assign ret = HAL_SAI_Transmit_DMA(...) but then never check what ret is. You should check for errors and take appropriate action.
You seem to be driving things by how fast the SD card can DMA the data. It needs to be based on how fast the SAI is consuming it, otherwise you will get glitches.
Possible solution
The STM32's DMA controller can be configured to run in circular-buffer mode. In this mode, it will DMA all the data given to it, and then start again from the beginning.
It also provides interrupts for when the DMA is half complete, and when it is fully complete.
These two things together, used with the SAI DMA, can provide a smooth data transfer with no gaps or glitches. You'd read data into the entire buffer to start with, and kick off the DMA. When you get the half-complete interrupt, read half a buffer's worth of data into the first half of the buffer. When you get the fully-complete interrupt, read half a buffer's worth of data into the second half of the buffer.
This is pseudo-code-ish, but hopefully it shows what I mean:
const size_t buff_len = 16u * 1024u;
uint16_t buff[buff_len];

void start_playback(void)
{
    read_from_file(buff, buff_len);
    if (HAL_SAI_Transmit_DMA(&hsai_BlockA2, (uint8_t *)buff, buff_len) != HAL_OK)
    {
        // Handle error
    }
}

void sai_dma_tx_half_complete_interrupt(void)
{
    read_from_file(buff, buff_len / 2u);
}

void sai_dma_tx_full_complete_interrupt(void)
{
    read_from_file(buff + buff_len / 2u, buff_len / 2u);
}
You'd need to detect when you have consumed the entire file, and then stop the DMA (with something like HAL_SAI_DMAStop()).
You might want to read this similar question where I gave a similar answer. They were recording to an SD card rather than playing back, but the same principles apply. They also supplied their actual code for the solution they employed.
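For reference, here is a rough sketch of how the two interrupt handlers above could map onto the STM32 HAL's weak transmit callbacks, assuming the SAI's DMA stream is configured in circular mode (e.g. via CubeMX); buff, buff_len and read_from_file are the same hypothetical names as in the pseudo-code:

/* Called by the HAL from the DMA half-transfer interrupt. */
void HAL_SAI_TxHalfCpltCallback(SAI_HandleTypeDef *hsai)
{
    if (hsai == &hsai_BlockA2)
    {
        read_from_file(buff, buff_len / 2u);                  /* refill first half  */
    }
}

/* Called by the HAL from the DMA transfer-complete interrupt. */
void HAL_SAI_TxCpltCallback(SAI_HandleTypeDef *hsai)
{
    if (hsai == &hsai_BlockA2)
    {
        read_from_file(buff + buff_len / 2u, buff_len / 2u);  /* refill second half */
    }
}

In a real system you would probably just set a flag in these callbacks and do the actual SD-card read from a task, since reading a file inside an interrupt context is usually too slow.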

Computing on variable length arrays in OpenCL

I am using OpenCL (Xcode, Intel GPU), and I am trying to implement a kernel that calculates moving averages and deviations. I want to pass several double arrays of varying lengths to the kernel. Is this possible to implement, or do I need to pad smaller arrays with zeroes so all the arrays are the same size?
I am new to OpenCL and GPGPU, so please forgive my ignorance of any nomenclature.
You can pass any buffer to the kernel; the kernel does not need to use all of it.
For example, if your kernel reduces a buffer, it can query the number of work items (items to reduce) at run time using get_global_size(0); you then call the kernel with the proper parameters.
An example (unoptimized):
__kernel void reduce_step(__global float* data)
{
    int id = get_global_id(0);
    int size = get_global_size(0);
    int size2 = size/2;
    int size2p = (size+1)/2;
    if (id < size2) // Only reduce up to size2; the odd element stays in place
        data[id] += data[id+size2p];
}
Then you can call it like this:
void reduce_me(std::vector<cl_float>& data){
    size_t size = data.size();
    // Copy to a buffer that was already created, of equal or bigger size than data
    // ... TODO: check the buffer size, or change the buffer set in the kernel args.
    queue.enqueueWriteBuffer(buffer, CL_FALSE, 0, sizeof(cl_float)*size, data.data());
    // Reduce until 1024 elements remain
    while(size > 1024){
        queue.enqueueNDRangeKernel(reduce_kernel, cl::NullRange, cl::NDRange(size), cl::NullRange);
        size = (size + 1) / 2; // matches the kernel's handling of an odd element
    }
    // Read out and trim
    queue.enqueueReadBuffer(buffer, CL_TRUE, 0, sizeof(cl_float)*size, data.data());
    data.resize(size);
}

Why, when printing a string with interrupt-driven I/O, does only the first character need to be copied?

Almost all materials I found online reference the code below from Tanenbaum's OS book. However, I don't really understand why this would print the whole string instead of only the first character.
Is it because the interrupts are generated recursively? But wouldn't that cost a lot of resources? Or did I miss something?
I'm really confused. Any help would be appreciated.
Code executed when print system call is made:
copy_from_user (buffer, p, count);
enable_interrupts ();
while (*printer_status_reg !=READY);
*printer_data_register = p[0];
scheduler ();
Interrupt handler:
if (count == 0) {
    unblock_user ();
} else {
    *printer_data_register = p[i];
    count = count - 1;
    i++;
}
acknowledge_interrupt ();
return_from_interrupt ();
You write the first character of the buffer to the printer data register and start the transmission.
After the transmission completes, a Tx_Complete interrupt is generated.
Now your interrupt handler checks whether there are any more bytes to transfer (the else part). If there are, it writes the next byte to the transmit register, decrements the number of bytes to transmit and increments the buffer index.
This process goes on... When the number of bytes to transmit reaches zero, you don't initiate the next transfer and the interrupts stop.
By transferring the first byte you initiate the process, and the remaining bytes are transferred by the interrupt handler. You have to make sure that count is correct.
You can guess what happens if count is too small or too large!
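To make the pattern concrete, here is a hedged sketch in C of the same idea for a hypothetical memory-mapped printer; the register addresses and helper names are invented for illustration and do not correspond to any particular hardware:

#include <stddef.h>

/* Hypothetical memory-mapped registers (addresses invented for illustration). */
#define PRINTER_STATUS  (*(volatile unsigned char *)0x4000)
#define PRINTER_DATA    (*(volatile unsigned char *)0x4004)
#define STATUS_READY    0x01

static const char     *tx_buf;   /* remaining characters to send    */
static volatile size_t tx_left;  /* how many are still outstanding  */

/* Kernel side of the print system call: prime the device with one byte. */
void print_string(const char *p, size_t count)
{
    tx_buf  = p;
    tx_left = count;
    while (!(PRINTER_STATUS & STATUS_READY))
        ;                        /* busy-wait only for the very first byte */
    PRINTER_DATA = *tx_buf++;    /* start the chain of completion interrupts */
    tx_left--;
}

/* Printer interrupt handler: each completion interrupt sends the next byte. */
void printer_isr(void)
{
    if (tx_left > 0) {
        PRINTER_DATA = *tx_buf++;   /* keep the chain going */
        tx_left--;
    } else {
        /* nothing written: no further completion interrupt, the chain ends */
    }
    /* acknowledge_interrupt(); return_from_interrupt(); -- platform specific */
}

Each byte written to the data register eventually produces another completion interrupt, so the handler re-triggers itself once per character rather than recursing; only the initial byte is written outside interrupt context.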

Accessing GPU memory in OpenCL/C++Amp

I need to find information about how the Unified Shader Array accesses GPU memory, to get an idea of how to use it effectively. The image of my graphics card's architecture doesn't show it clearly.
I need to load a big image into GPU memory using C++ AMP and divide it into small pieces (like 4x4 pixels). Every piece should be computed by a different thread. I don't know how the threads share access to the image.
Is there any way of doing it such that the threads don't block each other while accessing the image? Maybe they have their own memory that can be accessed exclusively?
Or maybe access to the unified memory is so fast that I shouldn't care about it (though I don't believe that)? This is really important, because I need to compute about 10k subsets for every image.
For C++ AMP you want to load the data that each thread within a tile uses into tile_static memory before starting your convolution calculation. Because each thread accesses pixels which are also read by other threads, this allows you to do a single read for each pixel from (slow) global memory and cache it in (fast) tile_static memory, so that all of the subsequent reads are faster.
You can see an example of tiling for convolution here. The DetectEdgeTiled method loads all the data that it requires and then calls idx.barrier.wait() to ensure all the threads have finished writing data into tile_static memory. Then it executes the edge detection code, taking advantage of tile_static memory. There are many other examples of this pattern in the samples. Note that the loading code in DetectEdgeTiled is complex only because it must account for the additional pixels around the edge of the pixels being written in the current tile, and it is essentially an unrolled loop, hence its length.
I'm not sure you are thinking about the problem in quite the right way. There are two levels of partitioning here. To calculate the new value for each pixel, the thread doing this work reads a block of surrounding pixels. In addition, blocks (tiles) of threads load larger blocks of pixel data into tile_static memory. Each thread in the tile then calculates the result for one pixel within the block.
void ApplyEdgeDetectionTiledHelper(const array<ArgbPackedPixel, 2>& srcFrame,
                                   array<ArgbPackedPixel, 2>& destFrame)
{
    tiled_extent<tileSize, tileSize> computeDomain = GetTiledExtent(srcFrame.extent);
    parallel_for_each(computeDomain.tile<tileSize, tileSize>(),
        [=, &srcFrame, &destFrame](tiled_index<tileSize, tileSize> idx) restrict(amp)
    {
        DetectEdgeTiled(idx, srcFrame, destFrame);
    });
}

void DetectEdgeTiled(
    tiled_index<tileSize, tileSize> idx,
    const array<ArgbPackedPixel, 2>& srcFrame,
    array<ArgbPackedPixel, 2>& destFrame) restrict(amp)
{
    const UINT shift = imageBorderWidth / 2;
    const UINT startHeight = 0;
    const UINT startWidth = 0;
    const UINT endHeight = srcFrame.extent[0];
    const UINT endWidth = srcFrame.extent[1];

    tile_static RgbPixel localSrc[tileSize + imageBorderWidth]
                                 [tileSize + imageBorderWidth];

    const UINT global_idxY = idx.global[0];
    const UINT global_idxX = idx.global[1];
    const UINT local_idxY = idx.local[0];
    const UINT local_idxX = idx.local[1];
    const UINT local_idx_tsY = local_idxY + shift;
    const UINT local_idx_tsX = local_idxX + shift;

    // Copy image data to tile_static memory. The if clauses are required to deal with
    // threads that own a pixel close to the edge of the tile and need to copy
    // additional halo data.

    // This pixel
    index<2> gNew = index<2>(global_idxY, global_idxX);
    localSrc[local_idx_tsY][local_idx_tsX] = UnpackPixel(srcFrame[gNew]);

    // Left edge
    if (local_idxX < shift)
    {
        index<2> gNew = index<2>(global_idxY, global_idxX - shift);
        localSrc[local_idx_tsY][local_idx_tsX - shift] = UnpackPixel(srcFrame[gNew]);
    }
    // Right edge
    // Top edge
    // Bottom edge
    // Top Left corner
    // Bottom Left corner
    // Bottom Right corner
    // Top Right corner

    // Synchronize all threads so that none of them start calculation before
    // all data is copied onto the current tile.
    idx.barrier.wait();

    // Make sure that the thread is not referring to a border pixel
    // for which the filter cannot be applied.
    if ((global_idxY >= startHeight + 1 && global_idxY <= endHeight - 1) &&
        (global_idxX >= startWidth + 1 && global_idxX <= endWidth - 1))
    {
        RgbPixel result = Convolution(localSrc, index<2>(local_idx_tsY, local_idx_tsX));
        destFrame[index<2>(global_idxY, global_idxX)] = result;
    }
}
This code was taken from CodePlex, and I stripped out a lot of the real implementation to make it clearer.
Regarding sharpneli's answer, you can use texture<> in C++ AMP to achieve the same result as OpenCL images. There is also an example of this on CodePlex.
In this particular case you do not have to worry. Just use OpenCL images. GPUs are extremely good at simply reading images (due to texturing). However, this method requires writing the result into a separate image, because you cannot read and write the same image in a single kernel. You should use this if you can perform the computation as a single pass (no need to iterate).
Another way is to access it as a normal memory buffer, load the parts needed by a wavefront (a group of threads running in sync) into local memory (this memory is blazingly fast), perform the computation, and write the complete end result back into unified memory afterwards. You should use this approach if you need to read and write values in the same image while computing. If you are not memory bound, you can still read the original values from a texture, then iterate in local memory and write the end results into a separate image.
Reads from unified memory are slow only if the pointer is not const * restrict and multiple threads read the same location. In general, if consecutive thread ids read consecutive locations, it is rather fast. However, if your threads both write and read unified memory, then it is going to be slow.
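To illustrate the local-memory approach in OpenCL terms, here is a minimal kernel sketch; the names, the 16x16 tile size and the placeholder computation are made up, it assumes a matching 16x16 work-group size, and it does no halo handling, unlike the C++ AMP example above:

#define TILE 16

__kernel void process_tiled(__global const float* src,
                            __global float* dst,
                            int width, int height)
{
    __local float tile[TILE][TILE];        // fast, per-work-group memory

    int gx = get_global_id(0);
    int gy = get_global_id(1);
    int lx = get_local_id(0);
    int ly = get_local_id(1);

    // Each work item stages exactly one pixel of its tile into local memory.
    if (gx < width && gy < height)
        tile[ly][lx] = src[gy * width + gx];

    // Wait until the whole work group has finished staging the tile.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Compute using the staged copy; here just an identity pass as a placeholder.
    if (gx < width && gy < height)
        dst[gy * width + gx] = tile[ly][lx];
}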

What is the OpenCL equivalent of this CUDA cudaMallocPitch code?

My PC has an AMD processor with an ATI 3200 GPU, which doesn't support OpenCL. The rest of the code runs by falling back to the CPU.
I am converting some code from CUDA to OpenCL but am stuck on one particular part for which there is no exact equivalent in OpenCL. Since I have little experience with OpenCL I can't work this out; please suggest a solution if you think one will work.
The CUDA code is:
size_t pitch = 0;
cudaError error = cudaMallocPitch((void**)&gpu_data, (size_t*)&pitch,
                                  instances->cols * sizeof(float), instances->rows);

for (int i = 0; i < instances->rows; i++) {
    error = cudaMemcpy((void*)(gpu_data + (pitch/sizeof(float))*i),
                       (void*)(instances->data + (instances->cols*i)),
                       instances->cols * sizeof(float), cudaMemcpyHostToDevice);
}
If I remove the pitch value from the above, I end up with a problem where nothing is written to the device memory gpu_data.
Somebody please convert this code to OpenCL and reply. I have converted it to OpenCL, but it's not working and the data is not written to gpu_data. My converted OpenCL code is:
gpu_data = clCreateBuffer(context, CL_MEM_READ_WRITE,
                          ((instances->cols)*(instances->rows))*sizeof(float), NULL, &ret);

for (int i = 0; i < instances->rows; i++) {
    ret = clEnqueueWriteBuffer(command_queue, gpu_data, CL_TRUE, 0,
                               ((instances->cols)*(instances->rows))*sizeof(float),
                               (void*)(instances->data + (instances->cols*i)), 0, NULL, NULL);
}
Sometimes it runs fine with this code and then gets stuck in the reading part, i.e.
ret = clEnqueueReadBuffer(command_queue, gpu_data, CL_TRUE, 0, sizeof(float) * instances->cols * 1, instances->data, 0, NULL, NULL);
over here, and it gives an error like:
Unhandled exception at 0x10001098 in CL_kmeans.exe: 0xC000001D: Illegal Instruction.
When break is pressed, it gives:
No symbols are loaded for any call stack frame. The source code cannot be displayed.
while debugging. The call stack displays:
OCL8CA9.tmp.dll!10001098()
[Frames below may be incorrect and/or missing, no symbols loaded for OCL8CA9.tmp.dll]
amdocl.dll!5c39de16()
I really don't know what this means. Could someone please help me get rid of this problem?
First of all, in the CUDA code you're copying the data in a horribly inefficient way. The CUDA runtime has the function cudaMemcpy2D, which does exactly what you are trying to do by looping over the rows.
What cudaMallocPitch does is compute an optimal pitch (= distance in bytes between rows in a 2D array) such that each new row begins at an address that is optimal for coalescing, and then allocate a memory area as large as pitch times the number of rows you specify. You can emulate the same thing in OpenCL by first computing the optimal pitch and then doing the allocation of the correct size.
The optimal pitch is computed by (1) getting the base address alignment preference for your card (the CL_DEVICE_MEM_BASE_ADDR_ALIGN property with clGetDeviceInfo; note that the returned value is in bits, so you have to divide by 8 to get it in bytes), let's call this base; and (2) finding the smallest multiple of base that is no less than your natural data pitch (sizeof(type) times the number of columns); this will be your pitch.
You then allocate pitch times the number of rows bytes, and pass the pitch information to your kernels.
Also, when copying data between the host and the device, you want to use clEnqueueWriteBufferRect/clEnqueueReadBufferRect, which are specifically designed to copy 2D data (they are the counterparts of cudaMemcpy2D).
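As a rough host-side sketch of that recipe, assuming context, queue and device already exist, a row-major float matrix of rows x cols elements sits in host_data, and error checking is omitted (all names are illustrative):

#include <CL/cl.h>

/* 1. Get the alignment preference (in bits) and convert it to bytes. */
cl_uint align_bits = 0;
clGetDeviceInfo(device, CL_DEVICE_MEM_BASE_ADDR_ALIGN,
                sizeof(align_bits), &align_bits, NULL);
size_t base = align_bits / 8;

/* 2. Round the natural pitch up to the next multiple of 'base'. */
size_t natural_pitch = cols * sizeof(float);
size_t pitch = ((natural_pitch + base - 1) / base) * base;

/* 3. Allocate pitch * rows bytes on the device. */
cl_int err = 0;
cl_mem gpu_data = clCreateBuffer(context, CL_MEM_READ_WRITE, pitch * rows, NULL, &err);

/* 4. Copy the whole 2D region in one call; region is {width_in_bytes, height, depth}. */
size_t buffer_origin[3] = {0, 0, 0};
size_t host_origin[3]   = {0, 0, 0};
size_t region[3]        = {natural_pitch, rows, 1};
err = clEnqueueWriteBufferRect(queue, gpu_data, CL_TRUE,
                               buffer_origin, host_origin, region,
                               pitch,          /* buffer_row_pitch   */
                               0,              /* buffer_slice_pitch */
                               natural_pitch,  /* host_row_pitch     */
                               0,              /* host_slice_pitch   */
                               host_data, 0, NULL, NULL);

/* Kernels then index with the pitch, e.g. data[row * (pitch / sizeof(float)) + col]. */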
