How to achieve disk write speed as high as fio does? [closed] - linux

I bought a Highpoint HBA card with 4 Samsung 960 PRO drives in it. As the official site says, this card can perform at 7500 MB/s writing and 13000 MB/s reading.
When I test this card with fio on my Ubuntu 16.04 system, I get a write speed of about 7000 MB/s. Here are my test arguments:
sudo fio -filename=/home/xid/raid0_dir0/fio.test -direct=1 -rw=write -ioengine=sync -bs=2k -iodepth=32 -size=100G -numjobs=1 -runtime=100 -time_base=1 -group_reporting -name=test-seq-write
I have made a RAID 0 array on the card and created an XFS filesystem on it. I want to know how to achieve a disk write speed as high as fio's if I use functions such as "open(), read(), write()" or "fopen(), fread(), fwrite()" in my console applications.

I'll just note that the fio job you specified seems a little flawed:
-direct=1 -rw=write -ioengine=sync -bs=2k -iodepth=32
(for the sake of simplicity let's assume the dashes are actually double)
The above is asking a synchronous ioengine to use an iodepth greater than one. This usually doesn't make sense and the iodepth section of the fio documentation warns about this:
iodepth=int
Number of I/O units to keep in flight against the file. Note that
increasing iodepth beyond 1 will not affect synchronous ioengines
(except for small degrees when verify_async is in use). Even async
engines may impose OS restrictions causing the desired depth not to be
achieved. [emphasis added]
You didn't post the fio output so we can't tell whether you ever achieved an iodepth greater than one. 7.5 GByte/s seems high for such a job and I can't help thinking your filesystem quietly went and did buffering behind your back, but who knows? I can't say more because the output of your fio run is unavailable, I'm afraid.
Also note the data fio was writing might not have been random enough to defeat compression thus helping to achieve an artificially high I/O speed...
Anyway, to your main question:
how to achieve disk write speed as high as fio does?
Your example shows you are telling fio to use an ioengine that does regular write calls. With this in mind, theoretically you should be able to achieve a similar speed by:
Preallocating your file and only writing into the allocated portions of it (so you are not doing extending writes)
Fulfilling all the requirements of using O_DIRECT (there are strict memory alignment and size constraints that MUST be fulfilled)
Making sure your write operations work on correctly aligned buffers and submit chunks of exactly 2048 bytes (or larger, so long as the size stays a power of two)
Submitting your writes as soon as possible :-)
You may find not using O_DIRECT (and thus allowing buffered I/O to do coalescing) is better if for some reason you are unable to submit "large" well aligned buffers every time.
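To make the points above concrete, here is a minimal sketch in C of a preallocated, aligned O_DIRECT write loop. The path, the 4096-byte alignment and the 1 MiB chunk size are placeholder assumptions for illustration only; check what your device and filesystem actually require (the BLKSSZGET ioctl reports the logical block size):
/* Sketch only: preallocated, aligned, sequential O_DIRECT writes.
   The path, alignment and sizes below are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main(void)
{
    const size_t buf_size  = 1 << 20;   /* 1 MiB per write: large and a power of two */
    const off_t  file_size = 1L << 30;  /* 1 GiB total, just for the example */
    int ret;
    int fd = open("/home/xid/raid0_dir0/direct.test",
                  O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    /* Preallocate so the loop below never does extending writes. */
    ret = posix_fallocate(fd, 0, file_size);
    if (ret) { fprintf(stderr, "posix_fallocate: %s\n", strerror(ret)); return 1; }
    /* O_DIRECT requires an aligned user buffer; 4096 is assumed here. */
    void *buf;
    ret = posix_memalign(&buf, 4096, buf_size);
    if (ret) { fprintf(stderr, "posix_memalign: %s\n", strerror(ret)); return 1; }
    memset(buf, 0xab, buf_size);
    for (off_t off = 0; off < file_size; off += buf_size) {
        ssize_t n = pwrite(fd, buf, buf_size, off);
        if (n != (ssize_t)buf_size) { perror("pwrite"); return 1; }
    }
    free(buf);
    close(fd);
    return 0;
}
Timing a loop like this against your fio run should tell you whether plain pwrite() calls can keep up, or whether you need buffered I/O, larger submissions or an asynchronous engine.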

Related

Is it possible in linux to disable filesystem caching for specific files? [closed]

I have some large files and I am OK with them being read at disk I/O capacity. I wish to keep the file system cache free for other files.
Is it possible to turn off file system caching for specific files in Linux?
Your question hints that you might not be the author of the program you wish to control... If that's the case the answer is "not easily". If you are looking for something where you just mark (e.g. via extended attributes) a particular set of files "nocache", the answer is no. At best you are limited to having an LD_PRELOAD wrapper around the program, and the wrapper would have to be written carefully to avoid impacting all files the program would try to open, etc.
If you ARE the author of the program you should take a look at using fadvise (or the equivalent madvise if you're using mmap) because after you have finished reading the data you can hint to the kernel that it should discard the pieces it cached by using the FADV_DONTNEED parameter (why not use FADV_NOREUSE? Because with Linux kernels available at the time of writing it's a no-op).
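As a rough illustration, this is what that fadvise hint can look like in C (the file name is a placeholder; note that posix_fadvise returns an error number rather than setting errno):
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
int main(void)
{
    int fd = open("bigfile.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    char buf[1 << 16];
    while (read(fd, buf, sizeof buf) > 0)
        ;   /* consume the data as you normally would */
    /* Hint that the cached pages for the whole file can be discarded
       (offset 0, length 0 means "to the end of the file"). Dirty pages
       are not dropped, so flush first if you also wrote to the file. */
    int ret = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (ret) fprintf(stderr, "posix_fadvise: %s\n", strerror(ret));
    close(fd);
    return 0;
}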
Another technique if you're the author would be to open the file with the O_DIRECT flag set but I do not recommend this unless you really know what you're doing. O_DIRECT comes with a large set of usage constraints and conditions on its use (which people often don't notice or understand the impact of until it's too late):
You MUST do I/O in multiples of the disk's block size (no smaller than 512 bytes, though 4 KBytes is not unusual, and it can be some other larger multiple) and you must only use offsets that are similarly well aligned.
The buffers of your program will have to conform to an alignment rule.
Filesystems can choose not to support O_DIRECT so your program has to handle that.
Filesystems may simply choose to put your I/O through the page cache anyway (O_DIRECT is a "best effort" hint).
NB: Not allowing caching to be used at all (i.e. not even on the initial read) may lead to the file being read in at speeds far below what the disk can achieve.
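For completeness, here is a sketch in C of a read loop that tries to honour those constraints and falls back to buffered I/O when the filesystem rejects O_DIRECT. The file name, the 4096-byte alignment and the 1 MiB chunk size are assumptions for illustration only:
/* Sketch: O_DIRECT read honouring the alignment rules above,
   with a buffered fallback. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main(void)
{
    const size_t chunk = 1 << 20;   /* must stay a multiple of the block size */
    int fd = open("bigfile.dat", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        /* The filesystem may not support O_DIRECT: fall back to buffered I/O. */
        fd = open("bigfile.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
    }
    void *buf;
    int ret = posix_memalign(&buf, 4096, chunk);   /* aligned buffer for O_DIRECT */
    if (ret) { fprintf(stderr, "posix_memalign: %s\n", strerror(ret)); return 1; }
    ssize_t n;
    while ((n = read(fd, buf, chunk)) > 0)
        ;   /* process n bytes here */
    if (n < 0) perror("read");
    free(buf);
    close(fd);
    return 0;
}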
I think you can do this with the open system call and the O_DIRECT flag for the files you don't want cached in the kernel's page cache.
The meaning of the O_DIRECT flag, from the open manual, is the following:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.

Guide for working with Linux thread priorities and scheduling policies? [closed]

I'm having trouble getting the hang of thread(/process) prioritization on Linux, scheduling policy selection, what to choose when and how, and what the exact effects are.
Is there any documentation (like a guide) somewhere, preferably with concrete examples and timelines, which I could consult?
I'm having trouble getting the hang of thread(/process) prioritization on Linux, scheduling policy selection
Prioritization works by utilizing the underlying OS's thread and process priorities, and it is hard to generalize about the specifics from the standpoint of documentation, which may be why you've not found guides online.
My recommendation is (frankly) to not bother with thread priorities. I've done a large amount of threaded programming and I've never found the need to do anything but the default prioritization. About the only time thread prioritization will make a difference is if all of the threads are completely CPU bound and you want one task or another to get more cycles.
In addition, I'm pretty sure that under Linux at least, this isn't about preemption but more about run frequency. Many thread implementations use a priority scheduling queue, so higher priority threads run more frequently, with logic in there to avoid starving the lower priority threads. This means that any IO or other blocking operation is going to cause a lower priority thread to run and get its time slice.
This page is a good example of the complexities of the issue. To quote:
As can be seen, thread priorities 1-8 end up with a practically equal share of the CPU, whilst priorities 9 and 10 get a vastly greater share (though with essentially no difference between 9 and 10). The version tested was Java 6 Update 10. For what it's worth, I repeated the experiment on a dual core machine running Vista, and the shape of the resulting graph is the same. My best guess for the special behaviour of priorities 9 and 10 is that THREAD_PRIORITY_HIGHEST in a foreground window has just enough priority for certain other special treatment by the scheduler to kick in (for example, threads of internal priority 14 and above have their full quantum replenished after a wait, whereas lower priorities have them reduced by 1).
If you must use thread priorities then you may have to write some test programs to understand how your architecture utilizes them.
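As a starting point for such a test program, here is a minimal sketch in C of switching a thread to an explicit policy and priority with pthread_setschedparam. The SCHED_FIFO policy and priority 10 are arbitrary examples, and real-time policies generally need root or an appropriate RLIMIT_RTPRIO allowance; compile with -pthread:
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
static void *worker(void *arg)
{
    (void)arg;
    /* Switch the calling thread to SCHED_FIFO, priority 10. */
    struct sched_param sp = { .sched_priority = 10 };
    int ret = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (ret)
        fprintf(stderr, "pthread_setschedparam: %s\n", strerror(ret));
    int policy;
    pthread_getschedparam(pthread_self(), &policy, &sp);
    printf("policy=%s priority=%d\n",
           policy == SCHED_FIFO ? "SCHED_FIFO" :
           policy == SCHED_RR   ? "SCHED_RR"   : "SCHED_OTHER",
           sp.sched_priority);
    return NULL;
}
int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}
Running several such threads with different priorities while keeping them CPU bound is the quickest way to see how your system actually schedules them.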

Why don't hardware failures show up at the programming language level? [closed]

I am wondering if anyone can give me a good answer, or at least point me in the direction of a good reference, to the following question:
How come I have never heard of a computer breaking in a very fundamental way? How come when I declare x to be a double it stays as a double? How come there is never a short circuit that robs it of some bytes and makes it an integer? Why do we have faith that when we initialize x to 10, there will never be a power surge that will cause it to become 11, or something similar?
I think I need a better understanding of memory.
Thanks, and please don't bash me over the head for such a simple/abstract question.
AFAIK, the biggest such problem is cosmic background radiation. If a gamma ray hits your memory chip, it can randomly flip a memory bit. This does happen, even in your computer, every now and then. It usually doesn't cause any problem, since it is unlikely to hit the very bit in your Excel input field, for example, and magnetic drives are protected against such accidents. However, it is a problem for long, large calculations. That's what ECC memory was invented for. You can also find more info about this phenomenon here:
http://en.wikipedia.org/wiki/ECC_memory
"The actual error rate found was several orders of magnitude higher than previous small-scale or laboratory studies, with 25,000 to 70,000 errors per billion device hours per megabit (about 2.5–7 × 10−11 error/bit·h)(i.e. about 5 single bit errors in 8 Gigabytes of RAM per hour using the top-end error rate), and more than 8% of DIMM memory modules affected by errors per year."
How come I have never heard of a computer breaking in a very fundamental way?
Hardware is fantastically complicated and there are a huge number of engineers whose job it is to make sure that the hardware works as intended. Whenever Intel, AMD, etc. release chips, they've extensively tested the design and run all sorts of diagnostics before it leaves the plant. They have an economic incentive to do this: if there's a mistake somewhere, it can be extremely costly. Look at the Intel FDIV bug for an example.
How come when I declare x to be a double it stays as a double? How come there is never a short circuit that robs it of some bytes and makes it an integer?
Part of this has to do with how the assembly works. Typically, compiled application binaries don't have any type information in them. Instead, they just issue commands like "take the four bytes at position 0x243598F0 and load them into a register." For a variable's type to mutate somehow, a huge amount of application code would have to change. If there was an error that underallocated the space for the variable, it would mess up the stack layout and probably cause a pretty quick program crash, so chances are the result would be "it crashed" rather than "the type got mutated," especially since at a binary level the operations on doubles and integral types are so different.
Why do we have faith that when we initialize x to 10, there will never be a power surge that will cause it to become 11, or something similar?
There might be! It's extremely rare, though, because the hardware people do such a good job designing everything. One of the nifty things about being a software engineer is that you sit on top of the food chain:
Software engineers write software that runs in an operating system,
which was written by systems programmers and talks to the hardware,
which was designed by electrical engineers and is built out of hardware gates,
which were fabricated and designed by materials engineers,
who got their materials due to the efforts of mining engineers,
etc.
Lots of engineers make a good living at each link in the chain, which is why everything is so well-tested. Errors do occur, and they do take down real computer systems, but it's relatively rare unless you have thousands or millions of computers running.
Hope this helps!
Answer 1: most of us rarely or never work on a large enough system, or one where this type of consideration is needed. Large databases and file systems have error detection in place to notice exactly what you describe. In physically large or data-heavy systems, errors when writing or storing data do happen (e.g. packets get lost or corrupted mid-journey, gamma rays hit). Sectors go bad in your hard drive all the time. We have hashing, parity checking, and a host of other methods to notify us when funny issues happen.
Answer 2: our axioms and models. The models we tend to use, the imperative or functional model, don't include 'a gamma ray from the sun changed a bit' as a consideration. Just as an environmental scientist may abstract away quarks when studying environmental changes, we abstract away hardware.
Edit #X:
This is a great question. I actually heard this recently, by accident. In physics, their models are wrong. Dead wrong. And they know they are wrong but they use them anyway. When I said this to justify my disdain for the subject, the CS technician at my school verbally back-handed me. He basically said what you said: how do we know 'int x=10' isn't 11 randomly later?

In embedded design, what is the actual overhead of using a linux os vs programming directly against the cpu? [closed]

I understand that the answer to this question, like most, is "it depends", but what I am looking for is not so much an answer as much as a rationale for the different things affecting the decision.
My use case is that I have an ARM Cortex A8 (TI AM335x) running an embedded device. My options are to use some embedded linux to take advantage of some prebuilt drivers and other things to make development faster, but my biggest concern for this project is the speed of the device. Memory and disk space are not much of a concern. I think it is a safe assumption that programming directly against the mpu and not using a full OS would certainly make the application faster, but gaining a 1 or 2 percent speedup is not worth the extra development time.
I imagine that the largest slowdowns are going to come from the kernel context switching and memory mapping but I do not have the knowledge to correctly assess or gauge the extent of those slowdowns. Any guidance would be greatly appreciated!
Your concerns are reasonable. Going bare metal can/will improve performance but it may only be a few percent improvement..."it depends".
Going bare metal for something that has fully functional drivers in Linux, but no fully functional drivers bare metal, will cost you development and possibly maintenance time; is it worth that to get the performance gain?
You have to ask yourself as well am I using the right platform, and/or am I using the right approach for whatever it is you want to do on that processor that you think or know is too slow. Are you sure you know where the bottleneck is? Are you sure your optimization is in the right place?
You have not provided any info that would give us a gut feel, so you have to go on your gut feel as to what path to take: a different embedded platform (pros and cons), bare metal or operating system, Linux or an RTOS or other, one programming language vs another, one peripheral vs another, and so on and so on. You won't actually know until you try each of these paths, but that can be and likely is cost and time prohibitive...
As far as the generic title question of os vs bare metal, the answer is "it depends". The differences can swing widely, from almost the same to hundreds to thousands of times faster on bare metal. But for any particular application/task/algorithm...it depends.

Can someone suggest a high performance shared memory API that supports complex data? [closed]

I'm looking at porting an old driver that generates a large, complex set of tables of data into user space - because the tables have become large enough that memory consumption is a serious problem.
Since performance is critical and because there will be 16-32 simultaneous readers of the data, we thought we'd replace the old /dev based interface to the code with a shared-memory model that would allow clients to directly search the tables rather than querying the daemon directly.
The question is - what's the best way to do that? I could use shm_open() directly, but that would probably require me to devise my own record locking and even, possibly, an ISAM data structure for the shared memory.
Rather than writing my own code to re-visit the 1970s, is there a high-performance shared memory API that provides a hash based lookup mechanism? The data is completely numeric, the search keys are fixed-length bit fields that may be 8, 16, or 32 bytes long.
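For reference, the shm_open() route mentioned above looks roughly like this in C; the segment name, size and record layout are placeholders, and the record locking and hash layout are exactly the parts you would still have to design yourself (link with -lrt on older glibc):
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
struct record { uint64_t key; uint64_t value; };
int main(void)
{
    const size_t nrecords = 1 << 20;
    const size_t len = nrecords * sizeof(struct record);
    /* Writer/daemon side: create and size the shared segment. */
    int fd = shm_open("/table_demo", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }
    struct record *tab = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (tab == MAP_FAILED) { perror("mmap"); return 1; }
    tab[0] = (struct record){ .key = 42, .value = 4242 };  /* populate */
    /* Readers would shm_open("/table_demo", O_RDONLY, 0), mmap with
       PROT_READ, and search the table in place. */
    munmap(tab, len);
    close(fd);
    shm_unlink("/table_demo");
    return 0;
}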
This is something I've wanted to write for some time, but there's always some more pressing thing to do...
Still, for most of the use cases of a shared key-data RAM store, memcached would be the simplest answer.
In your case, it looks like it's lower-level, so memcached, fast as it is, might not be the best answer. I'd try Judy arrays on a shmem block. They're really fast, so even if you wrap the access with a simplistic lock, you'd still get high performance access.
For more complex tasks, I'd look into lock-free structures (some links: 1, 2, 3, 4). I even wrote one some time ago, with hopes of integrating it into a Lua kernel, but it proved really hard to keep it compatible with the existing implementation. Still, it might interest you.
