I'm trying to pipe extremely high-speed data from one application to another on 64-bit CentOS 6. I ran the following benchmarks with dd to establish that the pipe, and not the algorithm in my program, is what is holding me back. My goal is to achieve somewhere around 1.5 GB/s.
First, without pipes:
dd if=/dev/zero of=/dev/null bs=8M count=1000
1000+0 records in
1000+0 records out
8388608000 bytes (8.4 GB) copied, 0.41925 s, 20.0 GB/s
Next, a pipe between two dd processes:
dd if=/dev/zero bs=8M count=1000 | dd of=/dev/null bs=8M
1000+0 records in
1000+0 records out
8388608000 bytes (8.4 GB) copied, 9.39205 s, 893 MB/s
Are there any tweaks I can make to the kernel, or anything else, that will improve the throughput of data running through a pipe? I have tried named pipes as well, with similar results.
Have you tried smaller blocks?
When I try on my own workstation, I see successive improvement as I lower the block size. It is only in the realm of 10% in my test, but still an improvement, and you are looking for far more than that.
Testing further, it turns out that really small block sizes do the trick:
I tried
dd if=/dev/zero bs=32k count=256000 | dd of=/dev/null bs=32k
256000+0 records in
256000+0 records out
256000+0 records in
256000+0 records out
8388608000 bytes (8.4 GB) copied, 1.67965 s, 5.0 GB/s
8388608000 bytes (8.4 GB) copied, 1.68052 s, 5.0 GB/s
And with your original command:
dd if=/dev/zero bs=8M count=1000 | dd of=/dev/null bs=8M
1000+0 records in
1000+0 records out
1000+0 records in
1000+0 records out
8388608000 bytes (8.4 GB) copied, 6.25782 s, 1.3 GB/s
8388608000 bytes (8.4 GB) copied, 6.25203 s, 1.3 GB/s
5.0 / 1.3 ≈ 3.8, so that is a sizable factor.
It seems that Linux pipes only yield 4096 bytes at a time to the reader, regardless of how large the writer's writes were.
So trying to stuff more than 4096 bytes into an already-full pipe with a single write(2) system call just stalls the writer until the reader has issued the multiple reads needed to pull that much data out of the pipe and has done whatever processing it has in mind.
This tells me that on multi-core or multi-threaded CPUs (does anyone still make a single-core, single-thread CPU?), you can get more parallelism, and hence shorter elapsed times, by having each writer in a pipeline write only 4096 bytes at a time before going back to whatever data processing or production it can do toward the next 4096-byte block.
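If you want to map out the sweet spot on your own machine, a quick sweep like the following reproduces the comparison above across several block sizes. This is a rough sketch, assuming GNU dd on Linux; the reader's summary line is the one that matters, so the writer's output is suppressed.
#!/bin/bash
# Push the same 8 GiB through a pipe at several block sizes and
# print only the reading dd's throughput summary for each run.
total=$((8 * 1024 * 1024 * 1024))
for bs in 4096 32768 262144 8388608; do
    echo "bs=$bs"
    dd if=/dev/zero bs=$bs count=$((total / bs)) 2>/dev/null |
        dd of=/dev/null bs=$bs 2>&1 | tail -n 1
done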
Related
We're running a standard B8ms VM with a 257 GB Premium SSD. According to the docs, the throughput should be "Up to 170 MB/second (provisioned 100 MB/second)":
https://azure.microsoft.com/en-us/pricing/details/managed-disks/
However, when I test it, the throughput looks to be about 35 MB/s:
▶ dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 30.8976 s, 34.8 MB/s
Is there something else I need to account for in order to maximize the throughput?
You have different limits: there is an IOPS limit on the disk and a throughput limit on the disk. If you use bigger blocks when testing, you will hit the throughput limit, and if you use smaller blocks, you will hit the IOPS limit.
Then you have the VM limits on top of the disk/storage limits, so there are many things to take into consideration when doing these kinds of tests.
You also have to take the caching settings on the disks into consideration.
https://learn.microsoft.com/en-us/azure/virtual-machines/windows/disks-benchmarks
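As a rough illustration of the two regimes (a sketch with hypothetical file names; point the output path at a mount on the disk you actually want to measure, not tmpfs):
# Small blocks with dsync: every 4 KiB write is a separate flushed I/O,
# so this run tends to hit the IOPS cap.
dd if=/dev/zero of=/mnt/data/iops-test.img bs=4k count=20000 oflag=dsync
# Large blocks with direct I/O: a few large requests, so this run tends
# to hit the MB/s throughput cap instead.
dd if=/dev/zero of=/mnt/data/tput-test.img bs=8M count=256 oflag=direct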
I want to run dd over a SanDisk 32GB micro SD but I'm not sure how to decide on the block size.
Usually I use bs=1M, but could I go any higher than that?
Try it!
#!/bin/bash
# Each bs/count pair writes the same 1 GiB total, so the runs are comparable.
bs=( 32k 64k 128k 256k 512k 1m 2m 4m )
ct=( 32768 16384 8192 4096 2048 1024 512 256 )
for (( x=0; x<${#bs[@]}; x++ )); do
    echo "Testing bs=${bs[x]},count=${ct[x]}"
    dd if=/dev/zero bs=${bs[x]} count=${ct[x]} of=junk
done
Output
Testing bs=32k,count=32768
32768+0 records in
32768+0 records out
1073741824 bytes transferred in 3.094462 secs (346988217 bytes/sec)
Testing bs=64k,count=16384
16384+0 records in
16384+0 records out
1073741824 bytes transferred in 3.445761 secs (311612394 bytes/sec)
Testing bs=128k,count=8192
8192+0 records in
8192+0 records out
1073741824 bytes transferred in 2.937460 secs (365534116 bytes/sec)
Testing bs=256k,count=4096
4096+0 records in
4096+0 records out
1073741824 bytes transferred in 3.247829 secs (330602946 bytes/sec)
Testing bs=512k,count=2048
2048+0 records in
2048+0 records out
1073741824 bytes transferred in 3.212303 secs (334259206 bytes/sec)
Testing bs=1m,count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 3.129765 secs (343074260 bytes/sec)
Testing bs=2m,count=512
512+0 records in
512+0 records out
1073741824 bytes transferred in 2.908048 secs (369231132 bytes/sec)
Testing bs=4m,count=256
256+0 records in
256+0 records out
1073741824 bytes transferred in 2.996609 secs (358318964 bytes/sec)
You could go higher, but it probably won't make any difference. If you go too high, things might actually slow down.
Different storage devices have different performance profiles. There is no universal, ultimate answer that's right for every device in the world.
The only way to get the right answer is to experiment with various block sizes, and benchmark the performance.
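For example, to keep the page cache from inflating the numbers while you sweep block sizes (a hedged sketch; the mount point is hypothetical and oflag=direct assumes GNU dd):
# Write 1 GiB to a file on the mounted card, bypassing the page cache;
# repeat with different bs values and compare the reported rates.
dd if=/dev/zero of=/media/sdcard/ddtest bs=4M count=256 oflag=direct
rm /media/sdcard/ddtest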
I have a situation where I have to read a sparse file. The file has data at specific offsets. I want to:
1) read 3 blocks (of a custom size) starting from a given offset;
2) specify the offset to seek to in 1M units.
So I am trying the command below, but without success; I am definitely reading more content than I should:
dd if=a_Sparse_file_ofSIZe_1024M of=/dev/null ibs=1M skip=512 obs=262144 count=3
The intent is to skip 512 MB of blocks and then read from the 512M+1th offset in 256K blocks, 3 times. The skip should always be in MB, while the number of blocks to count is variable.
I am sure I am reading more data than intended. Can someone please correct me?
You can always string 2 dds together, the first one to skip and the second one to read your actual data:
dd if=a_Sparse_file_ofSIZe_1024M bs=1M skip=N | dd bs=262144 count=3
The count parameter is based on ibs, so the obs value does not matter here. As your obs value is four times smaller than your ibs, I would suggest setting bs=256K and simply multiplying the skip value by four: skip=2048.
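That makes the single-dd equivalent (the same 512 MiB offset, now expressed in 256 KiB units):
dd if=a_Sparse_file_ofSIZe_1024M of=/dev/null bs=256K skip=2048 count=3
If you keep the two-dd pipeline instead, note that with GNU dd the reading dd may need iflag=fullblock: count= counts reads, and reads from a pipe can return short blocks.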
I want to know the advantage of writing a file block by block. I can see that it will reduce the number of I/O operations, but in a Linux-like environment the data goes to the page cache anyway, and a background daemon does the physical disk writing (correct me if I'm wrong). In that kind of environment, what are the advantages of writing in blocks?
If I understand your question correctly, you are asking about the advantages of using larger blocks, rather than writing character-by-character.
You have to consider that each use of a system call (e.g. write()) has a minimum cost by itself, regardless of what is being done. In addition it may cause the calling process to be subjected to a context switch, which has a cost of its own and also allows other processes to use the CPU, causing even more significant delays.
Therefore - even if we forget about direct and synchronous I/O modes where each operation may make it to the disk immediately - it makes sense from a performance standpoint to reduce the impact of those constant costs by moving around larger blocks of data.
A simple demonstration using dd to transfer 1,000,000 bytes:
$ dd if=/dev/zero of=test.txt count=1000000 bs=1 # 1,000,000 blocks of 1 byte
1000000+0 records in
1000000+0 records out
1000000 bytes (1.0 MB) copied, 1.55779 s, 642 kB/s
$ dd if=/dev/zero of=test.txt count=100000 bs=10 # 100,000 blocks of 10 bytes
100000+0 records in
100000+0 records out
1000000 bytes (1.0 MB) copied, 0.172038 s, 5.8 MB/s
$ dd if=/dev/zero of=test.txt count=10000 bs=100 # 10,000 blocks of 100 bytes
10000+0 records in
10000+0 records out
1000000 bytes (1.0 MB) copied, 0.0262843 s, 38.0 MB/s
$ dd if=/dev/zero of=test.txt count=1000 bs=1000 # 1,000 blocks of 1,000 bytes
1000+0 records in
1000+0 records out
1000000 bytes (1.0 MB) copied, 0.0253754 s, 39.4 MB/s
$ dd if=/dev/zero of=test.txt count=100 bs=10000 # 100 blocks of 10,000 bytes
100+0 records in
100+0 records out
1000000 bytes (1.0 MB) copied, 0.00919108 s, 109 MB/s
As an additional benefit, using larger blocks of data lets both the I/O scheduler and the filesystem's allocator make more accurate estimates about your actual workload.
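You can also observe the constant per-call cost directly by counting the system calls each variant makes, for instance with strace (a sketch; strace -c prints a per-call summary on exit):
# ~100,000 read()/write() pairs to move 100 kB one byte at a time:
strace -c -e trace=read,write dd if=/dev/zero of=/dev/null bs=1 count=100000
# roughly one read()/write() pair for the data (plus a few writes
# for dd's own status output):
strace -c -e trace=read,write dd if=/dev/zero of=/dev/null bs=100000 count=1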
I wanted to measure my disk throughput using the following command:
dd if=/dev/zero of=/mydir/junkfile bs=4k count=125000
If the junkfile already exists, my measured disk throughput is about 6 times lower than if it does not exist. I have repeated this many times and the results hold. Does anybody know why?
In order to minimize disk caching, you need to copy an amount significantly larger than the amount of memory in your system. 2X the amount of RAM in your server is a useful amount.
from http://www.westnet.com/~gsmith/content/postgresql/pg-disktesting.htm
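A variant of the original test that takes the cache out of the equation (requires root for the cache drop; conv=fdatasync makes dd wait for the data to actually reach the disk before reporting a rate):
# flush dirty pages and drop the page cache first
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
# then measure, forcing the data to disk before the rate is computed
dd if=/dev/zero of=/mydir/junkfile bs=4k count=125000 conv=fdatasync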