In this problem you are to compare reading a file using a single-threaded file server with a multi-threaded file server. It takes
16 msec to get a request for work, dispatch it, and do the rest of the necessary processing, assuming the data are in the block
cache. If a disk operation is needed (assume a spinning disk drive with 1 head), as is the case one-fourth of the time, an
additional 32 msec is required. What is the throughput (requests/sec), rounded to the nearest whole number, if a multi-threaded server is used with 4 cores and 4 threads?
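A sketch of the usual textbook calculation, assuming each thread handles one request at a time and the disk is never the bottleneck: the mean time per request is 16 + 0.25 × 32 = 24 msec, so a single-threaded server handles about 1000/24 ≈ 41.7 requests/sec, and 4 threads on 4 cores give roughly 4 × 1000/24 ≈ 167 requests/sec. A stricter reading has to account for the single disk head, which spends an average of 0.25 × 32 = 8 msec per request and therefore saturates at 1000/8 = 125 requests/sec.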
We're getting overnight lockups on our embedded (Arm) linux product but are having trouble pinning it down. It usually takes 12-16 hours from power on for the problem to manifest itself. I've installed sysstat so I can run sar logging, and I've got a bunch of data, but I'm having trouble interpreting the results.
The targets only have 512 MB of RAM (we have other models with 1 GB, but they see this issue much less often), and have no disk swap files, to avoid wearing the eMMCs.
Some kind of paging / virtual memory event is initiating the problem. In the sar logs, pgpgin/s, pgscand/s, pgsteal/s, and majflt/s all increase steadily before snowballing to crazy levels. This pushes the CPU up to correspondingly high levels (30-60 on dual-core Arm chips). At the same time, the frmpg/s values go very negative, whilst campg/s goes strongly positive. The upshot is that the system is trying to allocate a large number of cache pages all at once, and I don't understand why it would do that.
The target then essentially locks up until it's rebooted, someone kills the main GUI process, or it crashes and is restarted (we have a monolithic GUI application that runs all the time and generally does all the serious work on the product). The network shuts down, telnet blocks forever, as do /proc filesystem queries and things that rely on them, like top. The memory allocation profile of the main application in this test is dominated by reading data in from file and caching it as textures in video memory (shared with main RAM) in an LRU cache using OpenGL ES 2.0. Most of the time it'll be accessing a single file (each is about 50 MB in size), but I guess it could be triggered by having to suddenly use a new file and trying to cache all 50 MB of it in one go. I haven't done the test (putting more logging in) to correlate this event with these system effects yet.
The odd thing is that the actual free and cached RAM levels don't show an obvious lack of memory (I have seen the oom-killer swoop in and kill the main application with >100 MB free and 40 MB of cache RAM). The main application's memory usage seems reasonably well behaved, with a VmRSS value that stays pretty stable. Valgrind hasn't found any progressive leaks that would happen during operation.
The behaviour seems like that of a system frantically swapping out to disk and making everything run dog slow as a result, but I don't know if this is a known effect in a free<->cache RAM exchange system.
My problem is superficially similar to the question "linux high kernel cpu usage on memory initialization", but that issue seemed to be driven by disk swap file management. However, dirty page flushing does seem plausible for my issue.
I haven't tried playing with the various vm files under /proc/sys/vm yet. vfs_cache_pressure and possibly swappiness seem like good candidates for some tuning, but I'd like some insight into good values to try here. The documentation for vfs_cache_pressure is vague about what the quantitative difference would be between setting it to, say, 200 as opposed to 10000.
The other interesting fact is that it is a progressive problem. It might take 12 hours for the effect to happen the first time. If the main app is killed and restarted, it then seems to happen every 3 hours or so. A full cache purge might push this back out, though.
Here's a link to the log data, with two files: sar1.log, which is the complete output of sar -A, and overview.log, an extract of free/cache memory, CPU load, MainGuiApp memory stats, and the -B and -R sar outputs for the interesting period between midnight and 3:40am:
https://drive.google.com/folderview?id=0B615EGF3fosPZ2kwUDlURk1XNFE&usp=sharing
So, to sum up, what's my best plan here? Tune vm to tend to recycle pages more often to make it less bursty? Are my assumptions about what's happening even valid given the log data? Is there a cleverer way of dealing with this memory usage model?
Thanks for your help.
Update 5th June 2013:
I've tried the brute-force approach and put a script in place which echoes 3 to /proc/sys/vm/drop_caches every hour. This seems to be maintaining the steady state of the system right now, and the sar -B stats stay on the flat portion, with very few major faults and 0.0 pgscand/s. However, I don't understand why keeping the cache RAM very low mitigates a problem where the kernel is trying to add the universe to cache RAM.
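For reference, a minimal C sketch of what that hourly job boils down to (it has to run as root, and the sync() beforehand matters because drop_caches only discards clean pages):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    sync();  /* write dirty pages out first; drop_caches only frees clean pages */
    FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
    if (!f) {
        perror("drop_caches");
        return 1;
    }
    fprintf(f, "3\n");  /* 3 = free the pagecache plus dentries and inodes */
    fclose(f);
    return 0;
}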
I'm recording audio and writing it to an SD card; the data rate is around 1.5 MB/s. I'm using a class 4 SD card with an ext4 file system.
After a certain interval, the kernel automatically syncs the files. The downside of this is that my application's buffers pile up waiting to be written to disk.
I think that if the kernel synced more frequently than it does now, it might solve the issue.
I used fsync() in the application to sync after certain intervals. But this does not solve the problem, because sometimes the kernel has synced just before the application called fsync(), so the fsync() called from the application was a waste of time.
I need a syncing mechanism (say, smart_fsync()), so that when the application calls smart_fsync(), the kernel syncs only if it has not synced in a while; otherwise it just returns.
Since there is no such function as smart_fsync(), what would be a possible workaround?
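To make the idea concrete, here is a minimal sketch of the kind of wrapper I have in mind (the name, the global state and the interval are made up; it can only rate-limit my own fsync() calls, it can't see whether the kernel has just synced):

#include <time.h>
#include <unistd.h>

static time_t last_sync;   /* with more than one file you'd want per-fd state */

/* Issue a real fsync() only if at least min_interval seconds have passed
   since the last one; otherwise return immediately. */
int smart_fsync(int fd, time_t min_interval)
{
    time_t now = time(NULL);
    if (now - last_sync < min_interval)
        return 0;
    last_sync = now;
    return fsync(fd);
}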
The first question to ask is, what exactly is the problem you're experiencing? The kernel will flush dirty (unwritten cached) buffers periodically - this is because doing so tends to be faster than flushing synchronously (less latency hit for applications). The downside is that this means a larger latency hit if you reach the kernel's limit on dirty data (and potentially more data loss after an unclean shutdown).
If you want to ensure that the data hits disk ASAP, then you should simply open the file with the O_SYNC option. This will flush the data to disk immediately upon write(). Of course, this implies a significant performance penalty, but on the other hand you have complete control over when the data is flushed.
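A minimal sketch of that approach (the path and buffer are just placeholders):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* With O_SYNC, each write() returns only once the data (and the
       metadata needed to retrieve it) has reached the device. */
    int fd = open("/media/sdcard/audio.raw", O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    char buf[4096] = {0};   /* stand-in for a block of audio data */
    if (write(fd, buf, sizeof buf) < 0)
        perror("write");
    close(fd);
    return 0;
}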
If you are experiencing drops in throughput while the syncing is going on, most likely you are attempting to write faster than the disk can support, and reaching the dirty page memory limit. Unfortunately, this would mean the hardware is simply not up to the write rate you are attempting to push at it - you'll need to write slower, or buffer the data up on faster media (or add more RAM!).
Note also that your 'smart fsync' is exactly what the kernel implements - it will flush pages when one of the following is true:
* There is too much dirty data in memory. Triggers asynchronously (without blocking writes) when the total amount of dirty data exceeds /proc/sys/vm/dirty_background_bytes, or when the percentage of total memory exceeds /proc/sys/vm/dirty_background_ratio. Triggers synchronously (blocking your application's write() for an extended time) when the total amount of data exceeds /proc/sys/vm/dirty_bytes, or the percentage of total memory exceeds /proc/sys/vm/dirty_ratio.
* Dirty data has been pending in memory for too long. The pdflush daemon checks for old dirty blocks every /proc/sys/vm/dirty_writeback_centisecs centiseconds (1/100 seconds), and will expire blocks if they have been in memory for longer than /proc/sys/vm/dirty_expire_centisecs.
It's possible that tuning these parameters might help a bit, but you're probably better off figuring out why the defaults aren't keeping up as is.
I'm seeing a huge (~200++) faults/sec number in my mongostat output, though very low lock %:
My Mongo servers are running on m1.large instances on the Amazon cloud, so they each have 7.5 GB of RAM:
root:~# free -tm
             total       used       free     shared    buffers     cached
Mem:          7700       7654         45          0          0       6848
Clearly, I do not have enough memory for all the caching mongo wants to do (which, btw, results in a huge CPU usage %, due to disk IO).
I found this document that suggests that in my scenario (high fault, low lock %), I need to "scale out reads" and get "more disk IOPS."
I'm looking for advice on how to best achieve this. Namely, there are LOTS of different potential queries executed by my node.js application, and I'm not sure where the bottleneck is happening. Of course, I've tried
db.setProfilingLevel(1);
However, this doesn't help me that much, because the output just shows me the slow queries, and I'm having a hard time translating that information into which queries are causing the page faults...
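For reference, the operations the profiler records can be pulled out and sorted by execution time, e.g. (the 100 ms threshold is just an example):

PRIMARY> db.setProfilingLevel(1, 100)   // only record operations slower than 100 ms
PRIMARY> db.system.profile.find().sort({ millis : -1 }).limit(5).pretty()

But that still only surfaces slow operations; it doesn't tell me which of them are faulting.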
As you can see, this is resulting in a HUGE (nearly 100%) CPU wait time on my PRIMARY mongo server, though the 2x SECONDARY servers are unaffected...
Here's what the Mongo docs have to say about page faults:
Page faults represent the number of times that MongoDB requires data not located in physical memory, and must read from virtual memory. To check for page faults, see the extra_info.page_faults value in the serverStatus command. This data is only available on Linux systems.
Alone, page faults are minor and complete quickly; however, in aggregate, large numbers of page faults typically indicate that MongoDB is reading too much data from disk and can indicate a number of underlying causes and recommendations. In many situations, MongoDB’s read locks will “yield” after a page fault to allow other processes to read and avoid blocking while waiting for the next page to read into memory. This approach improves concurrency, and in high volume systems this also improves overall throughput.
If possible, increasing the amount of RAM accessible to MongoDB may help reduce the number of page faults. If this is not possible, you may want to consider deploying a sharded cluster and/or adding one or more shards to your deployment to distribute load among mongod instances.
So, I tried the recommended command, which is terribly unhelpful:
PRIMARY> db.serverStatus().extra_info
{
"note" : "fields vary by platform",
"heap_usage_bytes" : 36265008,
"page_faults" : 4536924
}
Of course, I could increase the server size (more RAM), but that is expensive and seems to be overkill. I should implement sharding, but I'm actually unsure which collections need sharding! Thus, I need a way to isolate where the faults are happening (i.e., which specific commands are causing the faults).
Thanks for the help.
We don't really know what your data/indexes look like.
Still, an important rule of MongoDB optimization:
Make sure your indexes fit in RAM. http://www.mongodb.org/display/DOCS/Indexing+Advice+and+FAQ#IndexingAdviceandFAQ-MakesureyourindexescanfitinRAM.
Consider that the smaller your documents are, the higher your key/document ratio will be, and the higher your RAM/Disksize ratio will need to be.
If you can adjust your schema a bit to lump some data together, and reduce the number of keys you need, that might help.
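A quick way to sanity-check the index sizes from the shell (the collection name is a placeholder):

PRIMARY> db.stats().indexSize                 // total index size, in bytes, for the current database
PRIMARY> db.mycollection.totalIndexSize()     // index size for a single collection

If those numbers get anywhere near the 7.5 GB of RAM (after allowing for the working set of documents), page faults on reads become very hard to avoid.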
My application is using O_DIRECT to flush 2 MB worth of data directly to a 3-way striped storage volume (mounted as an LVM volume).
I am getting a very pathetic write speed on this storage. iostat shows that the large requests are being broken into smaller ones.
avgrq-sz is <20... There isn't much reading on that drive.
It takes around 2 seconds to flush down 2 MB worth of contiguous memory blocks (using mlock to assure that), sector aligned (using posix_memalign), whereas tests with dd and iozone rate the storage as capable of > 20 MB/s of write speed.
I would appreciate any clues on how to investigate this issue further.
PS: If this is not the right forum for this query, I would appreciate pointers to one that would be more helpful.
Thanks.
Write IO breakups on linux?
The disk itself may have a maximum request size, there is a tradeoff between block size and latency (the bigger the request being sent to the disk, the longer it will likely take to be consumed), and there can be constraints on how much vectored I/O a driver can consume in a single request. Given all the above, the kernel is going to "break up" single requests that are too large when submitting them further down the stack.
I would appreciate any clues on how to investigate this issue further.
Unfortunately it's hard to say why the avgrq-sz is so small (if it's in sectors, that's about 10 KBytes per I/O) without seeing the code actually submitting the I/O (maybe your program is submitting 10 KByte buffers?). We also don't know whether iozone and dd were using O_DIRECT during the questioner's tests. If they weren't, then their I/O would have been going into the writeback cache and streamed out later, and the kernel can do that in a more optimal fashion.
Note: Using O_DIRECT is NOT a go-faster stripe. In the right circumstances O_DIRECT can lower overhead, BUT writing O_DIRECTly to disk increases the pressure on you to submit I/O in parallel (e.g. via AIO/io_uring or via multiple processes/threads) if you want to reach the highest possible throughput, because you have robbed the kernel of its best way of creating parallel submission to the device for you.
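As a reference point, here is a minimal sketch of a single aligned O_DIRECT write (the path, the 4096-byte alignment and the 2 MB size are illustrative; real code should query the device's logical block size):

#define _GNU_SOURCE   /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 2 * 1024 * 1024;      /* one 2 MB request */
    void *buf;
    if (posix_memalign(&buf, 4096, len)) {   /* buffer address must be aligned */
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 0, len);

    int fd = open("/mnt/stripe/testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* With O_DIRECT the buffer address, the length and the file offset all
       need suitable alignment, or the write fails with EINVAL. Even then,
       the block layer may split one 2 MB write into several smaller requests
       on the way to the device. */
    if (pwrite(fd, buf, len, 0) < 0)
        perror("pwrite");

    close(fd);
    free(buf);
    return 0;
}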
I'm writing lots and lots of data that will not be read again for weeks - as my program runs, the amount of free memory on the machine (displayed with 'free' or 'top') drops very quickly, while the amount of memory my app uses does not increase - and neither does the amount of memory used by other processes.
This leads me to believe the memory is being consumed by the filesystem's cache - since I do not intend to read this data for a long time, I'm hoping to bypass the system's buffers, such that my data is written directly to disk. I don't have dreams of improving perf or being a super ninja; my hope is to give a hint to the filesystem that I'm not going to be coming back for this data any time soon, so don't spend time optimizing for those cases.
On Windows I've faced similar problems and fixed them using FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH - the machine's memory was not consumed by my app and the machine was more usable in general. I'm hoping to duplicate the improvements I've seen, but on Linux. On Windows there is the restriction of writing in sector-sized pieces; I'm happy with this restriction for the amount of gain I've measured.
Is there a similar way to do this on Linux?
The closest equivalent I can think of to the Windows flags you mention is to open your file with the open(2) flags O_DIRECT | O_SYNC:
O_DIRECT (Since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this file. In
general this will degrade performance, but it is useful in special
situations, such as when applications do their own caching. File I/O
is done directly to/from user space buffers. The O_DIRECT flag on its
own makes an effort to transfer data synchronously, but does not
give the guarantees of the O_SYNC that data and necessary metadata are
transferred. To guarantee synchronous I/O the O_SYNC must be used in
addition to O_DIRECT. See NOTES below for further discussion.
A semantically similar (but deprecated) interface for block devices is
described in raw(8).
Granted, while trying to do research on this flag to confirm it's what you want, I found this interesting piece telling you that unbuffered I/O is a bad idea, with Linus describing it as "brain damaged". According to that, you should be using madvise() instead to tell the kernel how to cache pages. YMMV.
You can use O_DIRECT, but in that case you need to do the block IO yourself; you must write in multiples of the FS block size and on block boundaries (it is possible that this is not mandatory, but if you do not, performance will suck x1000, because every unaligned write will need a read first).
Another, much less intrusive, way of stopping your blocks from using up the OS cache without using O_DIRECT is to use posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED). Under Linux 2.6 kernels which support it, this immediately discards (clean) blocks from the cache. Of course you need to use fdatasync() or the like first, otherwise the blocks may still be dirty and hence won't be cleared from the cache.
It is probably a bad idea to call fdatasync() and posix_fadvise( ... POSIX_FADV_DONTNEED) after every write; instead, wait until you've written a reasonable amount (50 MB or 100 MB, maybe).
So, in short: after every significant chunk of writes, call fdatasync() followed by posix_fadvise( ... POSIX_FADV_DONTNEED).
This will flush the data to disc and immediately remove them from the OS cache, leaving space for more important things.
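A minimal sketch of that pattern (the 64 MB threshold and the helper name are just for illustration):

#include <fcntl.h>
#include <unistd.h>

#define DROP_THRESHOLD (64 * 1024 * 1024)   /* flush + drop after roughly 64 MB written */

/* Call after each write(); once enough new data has accumulated, flush it
   and tell the kernel the file's cached pages won't be needed again. */
void maybe_drop_cache(int fd, off_t *since_last_drop, ssize_t just_written)
{
    if (just_written > 0)
        *since_last_drop += just_written;
    if (*since_last_drop < DROP_THRESHOLD)
        return;
    fdatasync(fd);                                  /* make the pages clean... */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);   /* ...then discard them (0, 0 = whole file) */
    *since_last_drop = 0;
}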
Some users have found that things like fast-growing log files can easily blow "more useful" stuff out of the disc cache, which reduces cache hits a lot on a box which needs to have a lot of read cache, but also writes logs quickly. This is the main motivation for this feature.
However, like any optimisation
a) You're not going to need it so
b) Do not do it (yet)
as my program runs the amount of free memory on the machine drops very quickly
Why is this a problem? Free memory is memory that isn't serving any useful purpose. When it's used to cache data, at least there is a chance it will be useful.
If one of your programs requests more memory, file caches will be the first thing to go. Linux knows that it can re-read that data from disk whenever it wants, so it will just reap the memory and give it a new use.
It's true that Linux by default waits around 30 seconds (this is what the value used to be anyhow) before flushing writes to disk. You can speed this up with a call to fsync(). But once the data has been written to disk, there's practically zero cost to keeping a cache of the data in memory.
Seeing as you write to the file and don't read from it, Linux will probably guess that this data is the best to throw out, in preference to other cached data. So don't waste effort trying to optimise unless you've confirmed that it's a performance problem.