zero read requests but high average IO queue size - linux

I have a 4 GB file that is just rows of sorted key,value pairs, which I have "primed" into memory (I have 32 GB of RAM) with
cat file_name > /dev/null
Then, from a program, I access the rows by key. When I spawn 1000 threads to read from the file simultaneously and monitor the IO with iostat -xkd, I see 0 r/s (read requests) but an avgqu-sz (average queue size) of 45. The 0 r/s makes sense to me, since the pages of the file are most likely in the page cache thanks to the command above. Are IO requests still queued even when the pages are in the page cache? If only actual disk reads are queued, how do I explain this phenomenon?
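For what it's worth, here is a minimal sketch (my own illustration, not part of the question) of one way to check the page-cache assumption: mmap the file and ask mincore() how many of its pages are currently resident. The file name is a placeholder.

/* Sketch: report how many pages of a file are resident in the page cache. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "file_name";                 /* the primed 4 GB file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t pages = ((size_t)st.st_size + page - 1) / page;
    unsigned char *vec = malloc(pages);             /* one byte per page */
    if (vec && mincore(map, st.st_size, vec) == 0) {
        size_t resident = 0;
        for (size_t i = 0; i < pages; i++)
            resident += vec[i] & 1;                 /* bit 0 = resident */
        printf("%zu of %zu pages resident in page cache\n", resident, pages);
    }
    free(vec);
    munmap(map, st.st_size);
    close(fd);
    return 0;
}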

How can iostat utilization be 100% with a queue size of < 1

Consider a C program that writes blocks of between 12 and 16 KiB to a few files, sequentially within each file.
It produces the following iostat -x line [1] over a one-second interval, for a disk under a fairly heavy load of ~16 KiB writes:
Device w/s wkB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
nvme1n1 7497.00 105250.50 0.00 0.00 0.03 14.04 0.23 100.00
It shows 100% utilization and an aqu-sz of 0.23. This is a typical reading for this workload: the queue size generally hovers around 0.20 to 0.25 and the utilization is usually 100%, though an occasional measurement interval comes in somewhat lower, around 90% or 95%.
As I understand it, utilization is "the fraction of time at least 1 request was outstanding to the disk" [2]. So 100% means that at least one request was outstanding at all times. The aqu-sz, or average queue size, as I understand it, is the time-averaged size of the queue of outstanding requests, i.e., the average number of in-flight requests submitted to the block layer which haven't yet completed.
Something is wrong, though: this reading is not consistent with those definitions. If there were at least one request outstanding at all times (as implied by 100% utilization), then the average queue size would have to be at least 1.0, since the instantaneous queue size would never drop below 1.0.
So what's up with that?
[1] I've cut out the read and discard stats, since they are both ~0 for this run.
[2] Unlike single-head disks, HDDs with multiple independent heads and modern SSDs can both handle several outstanding requests, so for them a utilization of 100% does not correspond to a saturated device.
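One sanity check on the figures (my own arithmetic, not from the question): by Little's law the average number of in-flight requests should equal the arrival rate times the average time each request spends outstanding, i.e. aqu-sz ≈ w/s × w_await = 7497/s × 0.03 ms ≈ 0.22, which matches the reported 0.23. So aqu-sz is internally consistent with the other columns, and it is the 100% utilization figure that sits oddly next to a queue that is apparently empty most of the time.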

Write high bandwidth real-time data to SSD in Linux

I have a real-time process that receives 16 kB of data every 200 us for about 1 hr. I need to store this data.
I have a 240 GB SSD on a SATA III channel, and I thought I could use it as a plain storage device without any filesystem on it. I am running the 5.4.0-109-generic kernel with 8 GB of RAM.
Here is what I have done so far:
I set up a 1 GB shared memory region, shm, where I write the data, and I use a semaphore to tell a logger process when data is available.
In the logger process:
I open the SSD:
fd = open("/dev/sdb",O_WRONLY|O_LARGEFILE);
I wait for the data to be available in the shm and then I write a chunk, writing_size, of it to the SSD:
written_size = write(fd,local_buffer,writing_size);
I checked and written_size is always equal to writing_size;
I perform an fsync() after N cycles:
if(written_cycles > N)
ret = fsync(fd);
I checked and fsync never returns -1.
I set the I/O scheduler of /dev/sdb to noop and experimented with different values of writing_size and N. The values I settled on are writing_size = 64 kB and N = 16.
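Condensed, the write path described above looks roughly like the sketch below. This is my reconstruction, not the author's actual code: the shared-memory/semaphore handshake is reduced to a comment and the loop is bounded so the sketch terminates.

/* Sketch of the logger's write path: 64 kB write()s to the raw device,
   with an fsync() every N writes. Not the original code. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define WRITING_SIZE (64 * 1024)   /* chunk size chosen above */
#define N 16                       /* fsync interval chosen above */

int main(void)
{
    int fd = open("/dev/sdb", O_WRONLY | O_LARGEFILE);
    if (fd < 0) { perror("open"); return 1; }

    static char local_buffer[WRITING_SIZE];
    memset(local_buffer, 0, sizeof local_buffer);   /* stands in for shm data */

    for (int written_cycles = 1; written_cycles <= 1024; written_cycles++) {
        /* real logger: sem_wait() here until the producer signals new data */
        ssize_t written_size = write(fd, local_buffer, WRITING_SIZE);
        if (written_size != WRITING_SIZE) { perror("write"); break; }

        if (written_cycles % N == 0 && fsync(fd) == -1) {
            perror("fsync");
            break;
        }
    }
    close(fd);
    return 0;
}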
The behavior that I am seeing is this:
the whole process works very well until about 17 GB have been written. At that point the logger process is put into "uninterruptible sleep" (D state) quite often, and for quite some time, 1 or 2 seconds. This is still fine, since the shared memory buffer takes ~13 s to fill up. Once the amount written reaches 20 GB, the logger process is put to sleep for much longer, until it reaches 13 s and I start to lose data. The threshold at which the process starts being put to sleep is quite repeatable, 16-17 GB, but the maximum amount of data I can save before losing any is random.
This is the best I can achieve so far with my method and the writing_size and N tuning mentioned previously.
I tried setting the logger process's nice value to -20, with no improvement.
It also looks like the noop I/O scheduler does not support ionice, so I tried the CFQ scheduler with the highest ionice priority, but performance got even worse.
I suspect the logger process is being put to sleep waiting for I/O, but I do not understand why this only happens after a certain number of bytes have been written. iotop shows that the I/O bandwidth of the logger process is stable at around 85 MB/s.
I welcome any suggestions.
PS: I did try to mmap the SSD and use memcpy instead of write()+fsync(), but mmap was slower and the results were worse.

Profiling a Node.js application

I followed this guide, https://nodejs.org/uk/docs/guides/simple-profiling/, to profile a particular endpoint in my app (Express.js).
The endpoint downloads a PDF from S3, uses a thread pool (4 worker_threads) to fill the PDF with data (I use HummusJS for the PDF filling), then uploads the filled file to S3 and responds with a signed URL for the filled file.
The test was done with Apache Bench:
ab -p req.json -T application/json -c 20 -n 2000 http://{endpoint}
The output from profiling looked like this:
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 1.0% are not shown.
ticks parent name
287597 89.2% epoll_pwait
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 1.0% are not shown.
ticks parent name
1515166 98.5% epoll_wait
So, my question is: what do epoll_wait and epoll_pwait mean, given that they are taking almost 100% of the CPU time attributed to the program?
See google.com/search?q=epoll_wait.
In short, the thread was waiting for something (maybe the network? maybe another thread?).
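To illustrate what those profile entries mean (my own example, not from the original answer): epoll_wait/epoll_pwait are the system calls the Node/libuv event loop blocks in while waiting for file descriptors to become ready. Time spent blocked there is wall-clock waiting, not CPU work, as this small standalone C program shows.

/* Blocking in epoll_wait() consumes wall-clock time but essentially no CPU. */
#include <stdio.h>
#include <sys/epoll.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int epfd = epoll_create1(0);              /* empty interest list */
    if (epfd < 0) { perror("epoll_create1"); return 1; }

    struct epoll_event ev;
    clock_t cpu_before = clock();

    /* Block for up to 2 seconds; with no registered fds this just sleeps. */
    int n = epoll_wait(epfd, &ev, 1, 2000);

    double cpu_used = (double)(clock() - cpu_before) / CLOCKS_PER_SEC;
    printf("events: %d, CPU seconds consumed while waiting: %.3f\n", n, cpu_used);
    close(epfd);
    return 0;
}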

PostgreSQL out of memory: Linux OOM killer

I am having issues with a large query, which I suspect come down to wrong settings in my postgresql.conf. My setup is PostgreSQL 9.6 on Ubuntu 17.10 with 32 GB RAM and a 3 TB HDD. The query runs pgr_dijkstraCost to create an OD matrix of ~10,000 points in a network of 25,000 links. The resulting table is thus expected to be very big (~100,000,000 rows with columns from, to, costs). However, a simple test such as select x, 1 as c2, 2 as c3
from generate_series(1,90000000) x succeeds.
The query plan:
QUERY PLAN
--------------------------------------------------------------------------------------
Function Scan on pgr_dijkstracost (cost=393.90..403.90 rows=1000 width=24)
InitPlan 1 (returns $0)
-> Aggregate (cost=196.82..196.83 rows=1 width=32)
-> Seq Scan on building_nodes b (cost=0.00..166.85 rows=11985 width=4)
InitPlan 2 (returns $1)
-> Aggregate (cost=196.82..196.83 rows=1 width=32)
-> Seq Scan on building_nodes b_1 (cost=0.00..166.85 rows=11985 width=4)
This leads to a crash of PostgreSQL:
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
normally and possibly corrupted shared memory.
Running dmesg, I could trace it down to an out-of-memory issue:
Out of memory: Kill process 5630 (postgres) score 949 or sacrifice child
[ 5322.821084] Killed process 5630 (postgres) total-vm:36365660kB,anon-rss:32344260kB, file-rss:0kB, shmem-rss:0kB
[ 5323.615761] oom_reaper: reaped process 5630 (postgres), now anon-rss:0kB,file-rss:0kB, shmem-rss:0kB
[11741.155949] postgres invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
[11741.155953] postgres cpuset=/ mems_allowed=0
When running the query, I can also observe with top that my RAM goes down to 0 before the crash. The amount of committed memory just before the crash:
$ grep Commit /proc/meminfo
CommitLimit: 18574304 kB
Committed_AS: 42114856 kB
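For context (my reading of proc(5), not part of the original logs): CommitLimit is computed as SwapTotal + MemTotal * overcommit_ratio / 100, so with 32 GB of RAM, the default overcommit_ratio of 50 and a small swap partition, you arrive at roughly the 18 GB shown, while Committed_AS of ~42 GB means the kernel had already promised processes well over twice that limit. The limit is only strictly enforced when vm.overcommit_memory is set to 2, which is what the PostgreSQL documentation linked below recommends.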
I would expect the HDD to be used to write/buffer temporary data when RAM is not enough, but the available space on my HDD does not change during processing. So I began to dig for missing settings (expecting issues due to my relocated data directory), following different sites:
https://www.postgresql.org/docs/current/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
https://www.credativ.com/credativ-blog/2010/03/postgresql-and-linux-memory-management
My original postgresql.conf settings are the defaults, except for the changed data directory:
data_directory = '/hdd_data/postgresql/9.6/main'
shared_buffers = 128MB # min 128kB
#huge_pages = try # on, off, or try
#temp_buffers = 8MB # min 800kB
#max_prepared_transactions = 0 # zero disables the feature
#work_mem = 4MB # min 64kB
#maintenance_work_mem = 64MB # min 1MB
#replacement_sort_tuples = 150000 # limits use of replacement selection sort
#autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem
#max_stack_depth = 2MB # min 100kB
dynamic_shared_memory_type = posix # the default is the first option
I changed the config:
shared_buffers = 128MB
work_mem = 40MB # min 64kB
maintenance_work_mem = 64MB
I reloaded with sudo service postgresql reload and tested the same query, but found no change in behavior. Does this simply mean that such a large query cannot be done? Any help is appreciated.
I'm having similar trouble, but not with PostgreSQL (which is running happily): what is happening is simply that the kernel cannot allocate more RAM to the process, whichever process it is.
It would certainly help to add some swap to your configuration.
To check how much RAM and swap you have, run: free -h
On my machine, here is what it returns:
total used free shared buff/cache available
Mem: 7.7Gi 5.3Gi 928Mi 865Mi 1.5Gi 1.3Gi
Swap: 9.4Gi 7.1Gi 2.2Gi
You can clearly see that my machine is quite loaded: about 8 GB of RAM and 9 GB of swap, of which 7 GB are used.
When the RAM-hungry process got killed for being out of memory, I saw both RAM and swap being used at 100%.
So allocating more swap may help with this kind of problem.

Does Node's readFile() allocate a single Buffer the size of the whole file?

I'm quite new to Node and filesystem stream concerns. I wanted to know whether the readFile function reads the file stats, gets the size, and creates a single Buffer with the whole file size allocated. In other words: I know it loads the entire file, but does it do so by internally splitting the file across several buffers, or does it use only a single big Buffer? Depending on the approach, the memory-usage implications are different.
Found the answer here, in chapter 9.3:
http://book.mixu.net/node/ch9.html
As expected, readFile uses one full Buffer. From the link above, this is how readFile executes:
// Fully buffered access
[100 Mb file]
-> 1. [allocate 100 Mb buffer]
-> 2. [read and return 100 Mb buffer]
So if you use readFile(), your app will need memory for the entire file size at once.
To process the file in smaller chunks instead, use read() or createReadStream().
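As a language-neutral illustration (mine, not from the linked book), the difference between the two approaches looks like this in C: one allocation the size of the whole file versus a small reusable buffer.

/* (1) fully buffered: peak memory equals the file size.
   (2) chunked: constant memory regardless of file size.
   The file name is a placeholder. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *f = fopen("big.file", "rb");          /* placeholder path */
    if (!f) { perror("fopen"); return 1; }

    /* (1) fully buffered: one allocation for the whole file */
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    char *whole = malloc((size_t)size);
    if (whole) {
        fread(whole, 1, (size_t)size, f);       /* whole file in memory */
        free(whole);
    }

    /* (2) chunked: reuse a small fixed buffer */
    rewind(f);
    char chunk[64 * 1024];
    size_t n;
    while ((n = fread(chunk, 1, sizeof chunk, f)) > 0) {
        /* process n bytes here */
    }
    fclose(f);
    return 0;
}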
