Linux direct IO latency difference between reading 4GB and 32MB file under pread in NVME

Linux direct IO latency difference between reading 4GB and 32MB file under pread in NVME - linux

My test will randomly read 4K pages (random 4k page at time in a tight loop) in both 4GB and 32MB files using direct IO (O_DIRECT) and pread on NVME disk. The latency on 4GB file is about 41 microsecond per page and the latency is 79 microsecond for small 32MB file. Is there a rational explanation for such difference?
b_fd = open("large.txt", O_RDONLY | O_DIRECT);
s_fd = open("small.txt", O_RDONLY | O_DIRECT);
std::srand(std::time(nullptr));
void *buf;
int ps=getpagesize();
posix_memalign(&buf, ps, page_size);
long long nano_seconds = 0;
// number of random pages to read
int iter = 256 * 100;
for (int i = 0; i < iter; i++) {
int page_index = std::rand() % big_file_pages;
auto start = std::chrono::steady_clock::now();
pread (b_fd, buf, page_size, page_index * page_size);
auto end = std::chrono::steady_clock::now();
nano_seconds += std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}
std::cout<<"large file average random 1 page direct IO read in nanoseconds is:"<<nano_seconds/iter<<"\n";```

Related

Write data to an SD card through a buffer without a race condition

I am writing firmware for a data logging device. It reads data from sensors at 20 Hz and writes data to an SD card. However, the time to write data to SD card is not consistent (about 200-300 ms). Thus one solution is writing data to a buffer at a consistent rate (using a timer interrupt), and have a second thread that writes data to the SD card when the buffer is full.
Here is my naive implementation:
#define N 64
char buffer[N];
int count;
ISR() {
if (count < N) {
char a = analogRead(A0);
buffer[count] = a;
count = count + 1;
}
}
void loop() {
if (count == N) {
myFile.open("data.csv", FILE_WRITE);
int i = 0;
for (i = 0; i < N; i++) {
myFile.print(buffer[i]);
}
myFile.close();
count = 0;
}
}
The code has the following problems:
Writing data to the SD card is blocking reading when the buffer is full
It might have a race conditions.
What is the best way to solve this problem? Using a circular buffer, or double buffering? How do I ensure that a race condition does not happen?

You have rather answered your own question; you should use either double buffering or a circular buffer. Double-buffering is probably simpler to implement and appropriate for devices such as an SD card for which block-writes are generally more efficient.
Buffer length selection may need some consideration; generally you would make the buffer the same as the SD sector buffer size (typically 512 bytes), but that may not be practical, and with a sample rate as low as 20 sps, optimising SD write performance is perhaps not an issue.
Another consideration is that you need to match your sample rate to the file-system latency by selecting an appropriate buffer size. In this case the 64 sample buffer buffer will fill in a little more than three seconds, but the block write takes only up-to 300 ms - so you could use a much smaller buffer if required - 8 samples would be sufficient - although be careful, you may have observed latency of 300 ms, but it may be larger when specific boundaries are crossed in the physical flash memory - I have seen significant latency on some cards at 1 Mbyte boundaries - moreover card performance varies significantly between sizes and manufacturers.
An adaptation of your implementation with double-buffering is below. I have reduced the buffer length to 32 samples, but with double-buffering the total is unchanged at 64, but the write lag is reduced to 1.6 seconds.
// Double buffer and its management data/constants
static volatile char buffer[2][32];
static const int BUFFLEN = sizeof(buffer[0]);
static const unsigned char EMPTY = 0xff;
static volatile unsigned char inbuffer = 0;
static volatile unsigned char outbuffer = EMPTY;
ISR()
{
static int count = 0;
// Write to current buffer
char a = analogRead(A0);
buffer[inbuffer][count] = a;
count++ ;
// If buffer full...
if( count >= BUFFLEN )
{
// Signal to loop() that data available (not EMPTY)
outbuffer = inbuffer;
// Toggle input buffer
inbuffer = inbuffer == 0 ? 1 : 0;
count = 0;
}
}
void loop()
{
// If buffer available...
if( outbuffer != EMPTY )
{
// Write buffer
myFile.open("data.csv", FILE_WRITE);
for( int i = 0; i < BUFFLEN; i++)
{
myFile.print(buffer[outbuffer][i]);
}
myFile.close();
// Set the buffer to empty
outbuffer = EMPTY;
}
}
Note the use of volatile and unsigned char for the shared data. It is important that data shared between concurrent execution contexts is accessed explicitly and atomically; access to an int on 8-bit AVR based Arduino requires multiple machine instructions and the interrupt may occur part way through a read/write in loop() and cause an incorrect value to be read.

Logical block number to address (Linux filesystem)

int block_count;
struct stat statBuf;
int block;
int fd = open("file.txt", O_RDONLY);
fstat(fd, &statBuf);
block_count = (statBuf.st_size + statBuf.st_blksize - 1) / statBuf.st_blksize;
int i,add;
for(i = 0; i < block_count; i++) {
block = i;
if (ioctl(fd, FIBMAP, &block)) {
perror("FIBMAP ioctl failed");
}
printf("%3d %10d\n", i, block);
add = block;
}
char buffer[255];
int fd2 = open("/dev/sda1", O_RDONLY);
lseek(fd2, add, SEEK_SET);
read(fd2, buffer, 20);
printf("%s%s\n","ss ",buffer);
Output:
0 5038060
1 5038061
2 5038062
3 5038063
4 5038064
5 5038065
ss
I am using the above code to get the logical block number of a file. Lets suppose I want to read the contents of the last block number, How would I do that?
Is there a way to get the address of a block from logical block number?
PS: I am using linux and filesystem is ext4

The above code actually is getting you the physical block number. The FIBMAP ioctl takes as input the logical block number, and returns the physical block number. You can then multiply that by the blocksize (which is 4k for most ext4 file systems, but you can get that by using the BLKBSZGET ioctl) to get the byte offset if you want to read that disk block using lseek and read.
Note that the more modern interface that people tend to use today is the FIEMAP interface. This doesn't require root, and returns the physical byte offset, plus a lot more data. For more information, please see:
https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt
Or you can look at the source code for the filefrag command, which is part of e2fsprogs:
https://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/misc/filefrag.c

Actual physical RAM used by process

How can I determine actual physical RAM used by some process?
I can look into /proc/PID/status on VmRSS (or into top's RES column). However, this number is incorrect for the processes which use multiple mappings backed by the same file. For example, the following piece of code maps several areas into a small physical memory window.
size_t window_size = ...; // e.g. 128 MiB
size_t total_size = ...; // e.g. 4 TiB
char path[] = "/dev/shm/window-XXXXXX";
int fd = mkstemp(path);
ftruncate(fd, (off_t)window_size)
void *data = mmap(NULL, total_size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
for(ptrdiff_t offset = 0; offset < (ptrdiff_t)total_size; offset += window_size)
{
mmap( (void *)( (uintptr_t)data + offset ), window_size, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_SHARED|MAP_NORESERVE, fd, 0);
}
Now, if I look into /proc/PID/status, the kernel reports VmRSS as a sum of all the windows above. Although this number is even higher than the total physical memory size.

What is the smallest audio buffer needed to produce Tone sound without distotions with WaveOUT API

Does the WaveOut API has some internal limitation of the size for the current piece of buffer played ? I mean if I provide a very small buffer does it affects somehow the sound played to the speakers. I am experiencing very strange noise when I am generating and playing the sinus wave with small buffer. Something like a peak, or "BUMP".
The complete Story:
I made a program that can generate Sinus sound signal in real time.
The variable parameters are Frequency and Volume. The project requirement was to have a maximum latency of 50 ms. So the program must be able to produce Sinus signals with manually adjustable frequency of audio signal in real time.
I used Windows WaveOut API, C# and P/invoke to access the API.
Everything works fine when the sound buffer is 1000 ms large. If I minimize the buffer to 50 ms as per latency requirement then for certain frequencies I am experiencing at the end of every buffer, a noise or "BUMP". I do not understand if the sound generated is malformed ( I checked and is not) or something happens with the Audio chip, or some delay in initializing and playing.
When I save the produced audio to .wav file everything is perfect.
This means the must be some bug in my code or the audio subsystem has a limitation to the buffer chunks sent to it.
For those who doesn't know WaveOut must be initialized at first time and then must be prepared with audio headers for each buffer that are containing the number of bytes that needs to be played and the pointer to a memory that contains the audio that needs to be player.
UPDATE
Noise happens with the following combinations 44100 SamplingRate, 16 Bits, 2 channels, 50 ms buffer and generated Sinus audio signal of 201Hz, 202Hz, 203Hz, 204Hz, 205Hz ... 219Hz,
220Hz, 240 Hz, is ok
Why is this difference of 20, I do not know.

There are a few things to keep in mind when you need to output audio smoothly:
waveOutXxxx API is a legacy/compatibility layer on top of lower level API and as such it has greater overhead and is not recommended when you are to reach minimal latency. Note that this is unlikely to be your primary problem, but this is a piece of general knowledge helpful for understanding
because Windows is not real time OS and its audio subsystem is not realtime either you don't have control over random latency involved between you queue audio data for output and the data is really played back, the key is to keep certain level of buffer fullness which protects you from playback underflows and delivers smooth playback
with waveOutXxxx you are no limited to having single buffer, you can allocate multiple reusable buffers and recycle them
All in all, waveOutXxxx, DirectSound, DirectShow APIs work well with latencies 50 ms and up. With WASAPI exclusive mode streams you can get 5 ms latencies and even lower.
EDIT: I seem to have said too early about 20 ms latencies. To compensate for this, here is a simple tool LowLatencyWaveOutPlay (Win32, x64) to estimate the latency you can achieve. With sufficient buffering playback is smooth, otherwise you hear stuttering.
My understanding is that buffers might be returned late and the optimal design in terms of smallest latency lies along the line of having more smaller buffers so that you are given them back as early as possible. For example, 10 buffers 3 ms/buffer rather than 3 buffers 10 ms/buffer.
D:\>LowLatencyWaveOutPlay.exe 48000 10 3
Format: 48000 Hz, 1 channels, 16 bits per sample
Buffer Count: 10
Buffer Length: 3 ms (288 bytes)
Signal Frequency: 1000 Hz
^C

So I came here because I wanted to find the basic latency of waveoutwrite() as well. I got around 25-26ms of latency before I got to the smooth sine tone.
This is for:
AMD Phenom(tm) 9850 Quad-Core Processor 2.51 GHz
4.00 GB ram
64-bit operating system, x64-based processor
Windows 10 Enterprise N
The code follows. It is a modfied version of Petzold's sine wave program, refactored to run on the command line. I also changed the polling of buffers to use of a callback on buffer complete with the idea that this would make the program more efficient, but it didn't make a difference.
It also has a setup for elapsed timing, which I used to probe various timings for operations on the buffers. Using those I get:
Sine wave output program
Channels: 2
Sample rate: 44100
Bytes per second: 176400
Block align: 4
Bits per sample: 16
Time per buffer: 0.025850
Total time prepare header: 87.5000000000 usec
Total time to fill: 327.9000000000 usec
Total time for waveOutWrite: 90.8000000000 usec
Program:
/*******************************************************************************
WaveOut example program
Based on C. Petzold's sine wave example, outputs a sine wave via the waveOut
API in Win32.
*******************************************************************************/
#include <stdio.h>
#include <windows.h>
#include <math.h>
#include <limits.h>
#include <unistd.h>
#define SAMPLE_RATE 44100
#define FREQ_INIT 440
#define OUT_BUFFER_SIZE 570*4
#define PI 3.14159
#define CHANNELS 2
#define BITS 16
#define MAXTIM 1000000000
double fAngle;
LARGE_INTEGER perffreq;
PWAVEHDR pWaveHdr1, pWaveHdr2;
int iFreq = FREQ_INIT;
VOID FillBuffer (short* pBuffer, int iFreq)
{
int i;
int c;
for (i = 0 ; i < OUT_BUFFER_SIZE ; i += CHANNELS) {
for (c = 0; c < CHANNELS; c++)
pBuffer[i+c] = (short)(SHRT_MAX*sin (fAngle));
fAngle += 2*PI*iFreq/SAMPLE_RATE;
if (fAngle > 2 * PI) fAngle -= 2*PI;
}
}
double elapsed(LARGE_INTEGER t)
{
LARGE_INTEGER rt;
long tt;
QueryPerformanceCounter(&rt);
tt = rt.QuadPart-t.QuadPart;
return (tt*(1.0/(double)perffreq.QuadPart));
}
void CALLBACK waveOutProc(HWAVEOUT hwo, UINT uMsg, DWORD_PTR dwInstance, DWORD_PTR dwParam1, DWORD_PTR dwParam2)
{
if (uMsg == WOM_DONE) {
if (pWaveHdr1->dwFlags & WHDR_DONE) {
FillBuffer((short*)pWaveHdr1->lpData, iFreq);
waveOutWrite(hwo, pWaveHdr1, sizeof(WAVEHDR));
}
if (pWaveHdr2->dwFlags & WHDR_DONE) {
FillBuffer((short*)pWaveHdr2->lpData, iFreq);
waveOutWrite(hwo, pWaveHdr2, sizeof(WAVEHDR));
}
}
}
int main()
{
HWAVEOUT hWaveOut ;
short* pBuffer1;
short* pBuffer2;
short* pBuffer3;
WAVEFORMATEX waveformat;
UINT wReturn;
int bytes;
long t;
LARGE_INTEGER rt;
double timprep;
double filtim;
double waveouttim;
printf("Sine wave output program\n");
fAngle = 0; /* start sine angle */
QueryPerformanceFrequency(&perffreq);
pWaveHdr1 = malloc (sizeof (WAVEHDR));
pWaveHdr2 = malloc (sizeof (WAVEHDR));
pBuffer1 = malloc (OUT_BUFFER_SIZE*sizeof(short));
pBuffer2 = malloc (OUT_BUFFER_SIZE*sizeof(short));
pBuffer3 = malloc (OUT_BUFFER_SIZE*sizeof(short));
if (!pWaveHdr1 || !pWaveHdr2 || !pBuffer1 || !pBuffer2) {
if (!pWaveHdr1) free (pWaveHdr1) ;
if (!pWaveHdr2) free (pWaveHdr2) ;
if (!pBuffer1) free (pBuffer1) ;
if (!pBuffer2) free (pBuffer2) ;
fprintf(stderr, "*** Error: No memory\n");
exit(1);
}
// Load prime parameters to format
waveformat.wFormatTag = WAVE_FORMAT_PCM;
waveformat.nChannels = CHANNELS;
waveformat.nSamplesPerSec = SAMPLE_RATE;
waveformat.wBitsPerSample = BITS;
waveformat.cbSize = 0;
// Calculate other parameters
bytes = waveformat.wBitsPerSample/8; /* find bytes per sample */
if (waveformat.wBitsPerSample&8) bytes++; /* round up */
bytes *= waveformat.nChannels; /* find total channels size */
waveformat.nBlockAlign = bytes; /* set block align */
/* find average bytes/sec */
waveformat.nAvgBytesPerSec = bytes*waveformat.nSamplesPerSec;
printf("Channels: %d\n", waveformat.nChannels);
printf("Sample rate: %d\n", waveformat.nSamplesPerSec);
printf("Bytes per second: %d\n", waveformat.nAvgBytesPerSec);
printf("Block align: %d\n", waveformat.nBlockAlign);
printf("Bits per sample: %d\n", waveformat.wBitsPerSample);
printf("Time per buffer: %f\n",
OUT_BUFFER_SIZE*sizeof(short)/(double)waveformat.nAvgBytesPerSec);
if (waveOutOpen (&hWaveOut, WAVE_MAPPER, &waveformat, (DWORD_PTR)waveOutProc, 0, CALLBACK_FUNCTION)
!= MMSYSERR_NOERROR) {
free (pWaveHdr1) ;
free (pWaveHdr2) ;
free (pBuffer1) ;
free (pBuffer2) ;
hWaveOut = NULL ;
fprintf(stderr, "*** Error: No memory\n");
exit(1);
}
// Set up headers and prepare them
pWaveHdr1->lpData = (LPSTR)pBuffer1;
pWaveHdr1->dwBufferLength = OUT_BUFFER_SIZE*sizeof(short);
pWaveHdr1->dwBytesRecorded = 0;
pWaveHdr1->dwUser = 0;
pWaveHdr1->dwFlags = WHDR_DONE;
pWaveHdr1->dwLoops = 1;
pWaveHdr1->lpNext = NULL;
pWaveHdr1->reserved = 0;
QueryPerformanceCounter(&rt);
waveOutPrepareHeader(hWaveOut, pWaveHdr1, sizeof (WAVEHDR));
timprep = elapsed(rt);
pWaveHdr2->lpData = (LPSTR)pBuffer2;
pWaveHdr2->dwBufferLength = OUT_BUFFER_SIZE*sizeof(short);
pWaveHdr2->dwBytesRecorded = 0;
pWaveHdr2->dwUser = 0;
pWaveHdr2->dwFlags = WHDR_DONE;
pWaveHdr2->dwLoops = 1;
pWaveHdr2->lpNext = NULL;
pWaveHdr2->reserved = 0;
waveOutPrepareHeader(hWaveOut, pWaveHdr2, sizeof (WAVEHDR));
// Send two buffers to waveform output device
QueryPerformanceCounter(&rt);
FillBuffer (pBuffer1, iFreq);
filtim = elapsed(rt);
QueryPerformanceCounter(&rt);
waveOutWrite (hWaveOut, pWaveHdr1, sizeof (WAVEHDR));
waveouttim = elapsed(rt);
FillBuffer (pBuffer2, iFreq);
waveOutWrite (hWaveOut, pWaveHdr2, sizeof (WAVEHDR));
// Run waveform loop
sleep(10);
printf("Total time prepare header: %.10f usec\n", timprep*1000000);
printf("Total time to fill: %.10f usec\n", filtim*1000000);
printf("Total time for waveOutWrite: %.10f usec\n", waveouttim*1000000);
waveOutUnprepareHeader(hWaveOut, pWaveHdr1, sizeof (WAVEHDR));
waveOutUnprepareHeader(hWaveOut, pWaveHdr2, sizeof (WAVEHDR));
// Close waveform file
free (pWaveHdr1) ;
free (pWaveHdr2) ;
free (pBuffer1) ;
free (pBuffer2) ;
}

CUDA performance test

I'm writing a simple CUDA program for performance test.
This is not related to vector calculation, but just for a simple (parallel) string conversion.
#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>
#define UCHAR unsigned char
#define UINT32 unsigned long int
#define CTX_SIZE sizeof(aes_context)
#define DOCU_SIZE 4096
#define TOTAL 100000
#define BBLOCK_SIZE 500
UCHAR pH_TXT[DOCU_SIZE * TOTAL];
UCHAR pH_ENC[DOCU_SIZE * TOTAL];
UCHAR* pD_TXT;
UCHAR* pD_ENC;
__global__
void TEST_Encode( UCHAR *a_input, UCHAR *a_output )
{
UCHAR *input;
UCHAR *output;
input = &(a_input[threadIdx.x * DOCU_SIZE]);
output = &(a_output[threadIdx.x * DOCU_SIZE]);
for ( int i = 0 ; i < 30 ; i++ ) {
if ( (input[i] >= 'a') && (input[i] <= 'z') ) {
output[i] = input[i] - 'a' + 'A';
}
else {
output[i] = input[i];
}
}
}
int main(int argc, char** argv)
{
struct cudaDeviceProp xCUDEV;
cudaGetDeviceProperties(&xCUDEV, 0);
// Prepare Source
memset(pH_TXT, 0x00, DOCU_SIZE * TOTAL);
for ( int i = 0 ; i < TOTAL ; i++ ) {
strcpy((char*)pH_TXT + (i * DOCU_SIZE), "hello world, i need an apple.");
}
// Allocate vectors in device memory
cudaMalloc((void**)&pD_TXT, DOCU_SIZE * TOTAL);
cudaMalloc((void**)&pD_ENC, DOCU_SIZE * TOTAL);
// Copy vectors from host memory to device memory
cudaMemcpy(pD_TXT, pH_TXT, DOCU_SIZE * TOTAL, cudaMemcpyHostToDevice);
// Invoke kernel
int threadsPerBlock = BLOCK_SIZE;
int blocksPerGrid = (TOTAL + threadsPerBlock - 1) / threadsPerBlock;
printf("Total Task is %d\n", TOTAL);
printf("block size is %d\n", threadsPerBlock);
printf("repeat cnt is %d\n", blocksPerGrid);
TEST_Encode<<<blocksPerGrid, threadsPerBlock>>>(pD_TXT, pD_ENC);
cudaMemcpy(pH_ENC, pD_ENC, DOCU_SIZE * TOTAL, cudaMemcpyDeviceToHost);
// Free device memory
if (pD_TXT) cudaFree(pD_TXT);
if (pD_ENC) cudaFree(pD_ENC);
cudaDeviceReset();
}
And when i change BLOCK_SIZE value from 2 to 1000, I got a following duration time (from NVIDIA Visual Profiler)
TOTAL BLOCKS BLOCK_SIZE Duration(ms)
100000 50000 2 28.22
100000 10000 10 22.223
100000 2000 50 12.3
100000 1000 100 9.624
100000 500 200 10.755
100000 250 400 29.824
100000 200 500 39.67
100000 100 1000 81.268
My GPU is GeForce GT520 and max threadsPerBlock value is 1024, so I predicted that I would get best performance when BLOCK is 1000, but the above table shows different result.
I can't understand why Duration time is not linear, and how can I fix this problem. (or how can I find optimized Block value (mimimum Duration time)

It seems 2, 10, 50 threads doesn't utilize the capabilities of the gpu since its design is to start much more threads.
Your card has compute capability 2.1.
Maximum number of resident threads per multiprocessor = 1536
Maximum number of threads per block = 1024
Maximum number of resident blocks per multiprocessor = 8
Warp size = 32
There are two issues:
1.
You try to occupy so much register memory per thread that it will definetly is outsourced to slow local memory space if your block sizes increases.
2.
Perform your tests with multiple of 32 since this is the warp size of your card and many memory operations are optimized for thread sizes with multiple of the warp size.
So if you use only around 1024 (1000 in your case) threads per block 33% of your gpu is idle since only 1 block can be assigned per SM.
What happens if you use the following 100% occupancy sizes?
128 = 12 blocks -> since only 8 can be resident per sm the block execution is serialized
192 = 8 resident blocks per sm
256 = 6 resident blocks per sm
512 = 3 resident blocks per sm

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string