Write high bandwidth real-time data to SSD in Linux

Write high bandwidth real-time data to SSD in Linux - linux

I have a real-time process that receives 16 kB of data every 200 us for about 1 hr. I need to store this data.
I have a 240 GB SSD on a SATA III channel and I thought I could use it as a plain storage device without any filesystem on it. I am running 5.4.0-109-generic kernel with 8 GB or ram.
Here is what I have done so far:
I set up a shared memory shm, dimension of 1 GB, where I write the data and I use a semaphore to tell a logger process when data is available.
In the logger process:
I open the SSD:
fd = open("/dev/sdb",O_WRONLY|O_LARGEFILE);
I wait for the data to be available in the shm and then I write a chunk, writing_size, of it to the SSD:
written_size = write(fd,local_buffer,writing_size);
I checked and written_size is always equal to writing_size;
performs an fsync() after N cycles:
if(written_cycles > N)
ret = fsync(fd);
I checked and fsync never returns -1.
I did set the I/O scheduler of /dev/sdb as noop and I did experiment with different writing_size and N. The final values I came up with are writing_size = 64 kB and N = 16.
The behavior that I am seeing is this:
the whole process works very well up until 17 GB have been written. At that point the logger process is being put to "uninterruptible sleep (D)" quite often and for quite some time, 1 or 2 seconds. This is still fine as the shared memory buffer will fill up in ~13 sec. When the data written reaches 20 GB, the logger process is being put to sleep for way longer, until it reaches 13 sec and I start to lose data. The threshold when the process is starting to being put to sleep is quite repeatable, 16 - 17 GB, but the maximum amount of data I can save before I lose it is random.
This is the best I can achieve so far with my method and the writing_size and N tuning mentioned previously.
I tried to set the logger process nice to -20 with no improvements.
It also looks like the noop I/O scheduler does not support the ionice so I tried the CFQ scheduler with maximum ionice but still got worse performances.
I bet the logger process is being put to sleep for I/O access but I do not understand why it happens after a certain number of bytes have been written. iotop shows that the I/O bandwidth of the logger process is stable around 85 MB/s.
I welcome any suggestions.
PS: I did try to mmap the the SSD and do memcpy instead of write()+fsync() but mmap is slower and results are worse.

Related

PostgreSQL out of memory: Linux OOM killer

I am having issues with a large query, that I expect to rely on wrong configs of my postgresql.config. My setup is PostgreSQL 9.6 on Ubuntu 17.10 with 32GB RAM and 3TB HDD. The query is running pgr_dijkstraCost to create an OD-Matrix of ~10.000 points in a network of 25.000 links. Resulting table is thus expected to be very big ( ~100'000'000 rows with columns from, to, costs). However, creating simple test as select x,1 as c2,2 as c3
from generate_series(1,90000000) succeeds.
The query plan:
QUERY PLAN
--------------------------------------------------------------------------------------
Function Scan on pgr_dijkstracost (cost=393.90..403.90 rows=1000 width=24)
InitPlan 1 (returns $0)
-> Aggregate (cost=196.82..196.83 rows=1 width=32)
-> Seq Scan on building_nodes b (cost=0.00..166.85 rows=11985 width=4)
InitPlan 2 (returns $1)
-> Aggregate (cost=196.82..196.83 rows=1 width=32)
-> Seq Scan on building_nodes b_1 (cost=0.00..166.85 rows=11985 width=4)
This leads to a crash of PostgreSQL:
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
normally and possibly corrupted shared memory.
Running dmesg I could trace it down to be an Out of memory issue:
Out of memory: Kill process 5630 (postgres) score 949 or sacrifice child
[ 5322.821084] Killed process 5630 (postgres) total-vm:36365660kB,anon-rss:32344260kB, file-rss:0kB, shmem-rss:0kB
[ 5323.615761] oom_reaper: reaped process 5630 (postgres), now anon-rss:0kB,file-rss:0kB, shmem-rss:0kB
[11741.155949] postgres invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
[11741.155953] postgres cpuset=/ mems_allowed=0
When running the query I also can observe with topthat my RAM is going down to 0 before the crash. The amount of committed memory just before the crash:
$grep Commit /proc/meminfo
CommitLimit: 18574304 kB
Committed_AS: 42114856 kB
I would expect the HDD is used to write/buffer temporary data, when RAM is not enough. But the available space on my hdd does not change during the processing. So I began to dig for missing configs (expecting issues due to my relocated data-directory) and following different sites:
https://www.postgresql.org/docs/current/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
https://www.credativ.com/credativ-blog/2010/03/postgresql-and-linux-memory-management
My original settings of postgresql.conf are default except for changes in the data-directory:
data_directory = '/hdd_data/postgresql/9.6/main'
shared_buffers = 128MB # min 128kB
#huge_pages = try # on, off, or try
#temp_buffers = 8MB # min 800kB
#max_prepared_transactions = 0 # zero disables the feature
#work_mem = 4MB # min 64kB
#maintenance_work_mem = 64MB # min 1MB
#replacement_sort_tuples = 150000 # limits use of replacement selection sort
#autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem
#max_stack_depth = 2MB # min 100kB
dynamic_shared_memory_type = posix # the default is the first option
I changed the config:
shared_buffers = 128MB
work_mem = 40MB # min 64kB
maintenance_work_mem = 64MB
Relaunched with sudo service postgresql reload and tested the same query, but found no change in behavior. Does this simply mean, such a large query can not be done? Any help appreciated.

I'm having similar trouble, but not with PostgreSQL (which is running happily): what is happening is simply that the kernel cannot allocate more RAM to the process, whichever process it is.
It would certainly help to add some swap to your configuration.
To check how much RAM and swap you have, run: free -h
On my machine, here is what it returns:
total used free shared buff/cache available
Mem: 7.7Gi 5.3Gi 928Mi 865Mi 1.5Gi 1.3Gi
Swap: 9.4Gi 7.1Gi 2.2Gi
You can clearly see that my machine is quite overloaded: about 8Gb of RAM, and 9Gb of swap, from which 7 are used.
When the RAM-hungry process got killed after Out of memory, I saw both RAM and swap being used at 100%.
So, allocating more swap may improve our problems.

Stress-ng: RAM testing commands

Stress-ng: Can we test RAM using stress-ng? What are the commands used to test RAM on a MIPS 32 device?

There are many memory based stressors in stress-ng:
stress-ng --class memory?
class 'memory' stressors: atomic bsearch context full heapsort hsearch
lockbus lsearch malloc matrix membarrier memcpy memfd memrate memthrash
mergesort mincore null numa oom-pipe pipe qsort radixsort remap
resources rmap stack stackmmap str stream tlb-shootdown tmpfs tsearch
vm vm-rw wcs zero zlib
Alternatively, one can also use VM based stressors too:
stress-ng --class vm?
class 'vm' stressors: bigheap brk madvise malloc mlock mmap mmapfork mmapmany
mremap msync shm shm-sysv stack stackmmap tmpfs userfaultfd vm vm-rw
vm-splice
I suggest looking at the vm stressor first as this contains a large range of stressor methods that exercise memory patterns and can possibly find broken memory:
-m N, --vm N
start N workers continuously calling mmap(2)/munmap(2) and writ‐
ing to the allocated memory. Note that this can cause systems to
trip the kernel OOM killer on Linux systems if not enough physi‐
cal memory and swap is not available.
--vm-bytes N
mmap N bytes per vm worker, the default is 256MB. One can spec‐
ify the size as % of total available memory or in units of
Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g.
--vm-ops N
stop vm workers after N bogo operations.
--vm-hang N
sleep N seconds before unmapping memory, the default is zero
seconds. Specifying 0 will do an infinite wait.
--vm-keep
do not continually unmap and map memory, just keep on re-writing
to it.
--vm-locked
Lock the pages of the mapped region into memory using mmap
MAP_LOCKED (since Linux 2.5.37). This is similar to locking
memory as described in mlock(2).
--vm-madvise advice
Specify the madvise 'advice' option used on the memory mapped
regions used in the vm stressor. Non-linux systems will only
have the 'normal' madvise advice, linux systems support 'dont‐
need', 'hugepage', 'mergeable' , 'nohugepage', 'normal', 'ran‐
dom', 'sequential', 'unmergeable' and 'willneed' advice. If this
option is not used then the default is to pick random madvise
advice for each mmap call. See madvise(2) for more details.
--vm-method m
specify a vm stress method. By default, all the stress methods
are exercised sequentially, however one can specify just one
method to be used if required. Each of the vm workers have 3
phases:
1. Initialised. The anonymously memory mapped region is set to a
known pattern.
2. Exercised. Memory is modified in a known predictable way.
Some vm workers alter memory sequentially, some use small or
large strides to step along memory.
3. Checked. The modified memory is checked to see if it matches
the expected result.
The vm methods containing 'prime' in their name have a stride of
the largest prime less than 2^64, allowing to them to thoroughly
step through memory and touch all locations just once while also
doing without touching memory cells next to each other. This
strategy exercises the cache and page non-locality.
Since the memory being exercised is virtually mapped then there
is no guarantee of touching page addresses in any particular
physical order. These workers should not be used to test that
all the system's memory is working correctly either, use tools
such as memtest86 instead.
The vm stress methods are intended to exercise memory in ways to
possibly find memory issues and to try to force thermal errors.
Available vm stress methods are described as follows:
Method Description
all iterate over all the vm stress methods
as listed below.
flip sequentially work through memory 8
times, each time just one bit in memory
flipped (inverted). This will effec‐
tively invert each byte in 8 passes.
galpat-0 galloping pattern zeros. This sets all
bits to 0 and flips just 1 in 4096 bits
to 1. It then checks to see if the 1s
are pulled down to 0 by their neighbours
or of the neighbours have been pulled up
to 1.
galpat-1 galloping pattern ones. This sets all
bits to 1 and flips just 1 in 4096 bits
to 0. It then checks to see if the 0s
are pulled up to 1 by their neighbours
or of the neighbours have been pulled
down to 0.
gray fill the memory with sequential gray
codes (these only change 1 bit at a time
between adjacent bytes) and then check
if they are set correctly.
incdec work sequentially through memory twice,
the first pass increments each byte by a
specific value and the second pass
decrements each byte back to the origi‐
nal start value. The increment/decrement
value changes on each invocation of the
stressor.
inc-nybble initialise memory to a set value (that
changes on each invocation of the stres‐
sor) and then sequentially work through
each byte incrementing the bottom 4 bits
by 1 and the top 4 bits by 15.
rand-set sequentially work through memory in 64
bit chunks setting bytes in the chunk to
the same 8 bit random value. The random
value changes on each chunk. Check that
the values have not changed.
rand-sum sequentially set all memory to random
values and then summate the number of
bits that have changed from the original
set values.
read64 sequentially read memory using 32 x 64
bit reads per bogo loop. Each loop
equates to one bogo operation. This
exercises raw memory reads.
ror fill memory with a random pattern and
then sequentially rotate 64 bits of mem‐
ory right by one bit, then check the
final load/rotate/stored values.
swap fill memory in 64 byte chunks with ran‐
dom patterns. Then swap each 64 chunk
with a randomly chosen chunk. Finally,
reverse the swap to put the chunks back
to their original place and check if the
data is correct. This exercises adjacent
and random memory load/stores.
move-inv sequentially fill memory 64 bits of mem‐
ory at a time with random values, and
then check if the memory is set cor‐
rectly. Next, sequentially invert each
64 bit pattern and again check if the
memory is set as expected.
modulo-x fill memory over 23 iterations. Each
iteration starts one byte further along
from the start of the memory and steps
along in 23 byte strides. In each
stride, the first byte is set to a ran‐
dom pattern and all other bytes are set
to the inverse. Then it checks see if
the first byte contains the expected
random pattern. This exercises cache
store/reads as well as seeing if neigh‐
bouring cells influence each other.
prime-0 iterate 8 times by stepping through mem‐
ory in very large prime strides clearing
just on bit at a time in every byte.
Then check to see if all bits are set to
zero.
prime-1 iterate 8 times by stepping through mem‐
ory in very large prime strides setting
just on bit at a time in every byte.
Then check to see if all bits are set to
one.
prime-gray-0 first step through memory in very large
prime strides clearing just on bit
(based on a gray code) in every byte.
Next, repeat this but clear the other 7
bits. Then check to see if all bits are
set to zero.
prime-gray-1 first step through memory in very large
prime strides setting just on bit (based
on a gray code) in every byte. Next,
repeat this but set the other 7 bits.
Then check to see if all bits are set to
one.
rowhammer try to force memory corruption using the
rowhammer memory stressor. This fetches
two 32 bit integers from memory and
forces a cache flush on the two
addresses multiple times. This has been
known to force bit flipping on some
hardware, especially with lower fre‐
quency memory refresh cycles.
walk-0d for each byte in memory, walk through
each data line setting them to low (and
the others are set high) and check that
the written value is as expected. This
checks if any data lines are stuck.
walk-1d for each byte in memory, walk through
each data line setting them to high (and
the others are set low) and check that
the written value is as expected. This
checks if any data lines are stuck.
walk-0a in the given memory mapping, work
through a range of specially chosen
addresses working through address lines
to see if any address lines are stuck
low. This works best with physical mem‐
ory addressing, however, exercising
these virtual addresses has some value
too.
walk-1a in the given memory mapping, work
through a range of specially chosen
addresses working through address lines
to see if any address lines are stuck
high. This works best with physical mem‐
ory addressing, however, exercising
these virtual addresses has some value
too.
write64 sequentially write memory using 32 x 64
bit writes per bogo loop. Each loop
equates to one bogo operation. This
exercises raw memory writes. Note that
memory writes are not checked at the end
of each test iteration.
zero-one set all memory bits to zero and then
check if any bits are not zero. Next,
set all the memory bits to one and check
if any bits are not one.
--vm-populate
populate (prefault) page tables for the memory mappings; this
can stress swapping. Only available on systems that support
MAP_POPULATE (since Linux 2.5.46).
So to run 1 vm stressor that uses 75% of memory using all the vm stressors with verification for 10 minutes with verbose mode enabled, use:
stress-ng --vm 1 --vm-bytes 75% --vm-method all --verify -t 10m -v

How to increase kernel poll rate for accelerometer?

I'm using the hwmon/mxc_mma8451.c module to access an accelerometer. Using /sys/devices/virtual/input/input0/poll I can change the polling rate to some degree... if I set a larger millisecond value the polling becomes slower. However, I cannot seem to get below around 30ms per poll, despite the device driver source apparently allowing as low as 1ms per poll. The accelerometer itself supports 800Hz sample rate, so that is not the bottleneck. When I write a value of 1 to the above file, I see each sample occurs either 30ms or 60ms from the previous sample, so it is not even consistent. However, even 30ms is unacceptably slow at it is only 33Hz.
The kernel source for the module clearly shows that I should be able to use a value of 1:
#define POLL_INTERVAL_MIN 1
#define POLL_INTERVAL_MAX 500
#define POLL_INTERVAL 100 /* msecs */
...
mma8451_idev->poll_interval = POLL_INTERVAL;
mma8451_idev->poll_interval_min = POLL_INTERVAL_MIN;
mma8451_idev->poll_interval_max = POLL_INTERVAL_MAX;
I'm not familiar with exactly how Linux does this kind of polling, but this system has a 10ms tick, so even if sampling with ticks, why is it taking 3 or 6 ticks per sample and nothing else? Is there some kernel parameter somewhere else that is throttling how fast polling can occur?
Linux kernel version is 3.14.28 for IMX28 (ARM) if that makes any difference. This is the version available for the device in question, so I can't just up and use a different/newer one.

zero read requests but high average IO queue size

I have a 4GB file that are just rows of sorted key,value pairs that I have "primed" into memory (I have 32GB of memory) by
cat file_name > /dev/null
Then from a program I access the rows by the key. When I spawn 1000 threads to simultaneously read from the file and monitor the IO using iostat -xkd I see 0 r/s(read requests) but 45 avgqu-sz (average queue size). The 0 r/s makes sense to me since the pages from the file are most likely in the page cache due to the command above. Are the IO requests still queued even if the pages are in the page cache? If only the actual disk reads are queued, how do I explain this phenomenon?

How to find the processor queue length in linux

Trying to determine the Processor Queue Length (the number of processes that ready to run but currently aren't) on a linux machine. There is a WMI call in Windows for this metric, but not knowing much about linux I'm trying to mine /proc and 'top' for the information. Is there a way to determine the queue length for the cpu?
Edit to add: Microsoft's words concerning their metric: "The collection of one or more threads that is ready but not able to run on the processor due to another active thread that is currently running is called the processor queue."

sar -q will report queue length, task list length and three load averages.
Example:
matli#tornado:~$ sar -q 1 0
Linux 2.6.27-9-generic (tornado) 01/13/2009 _i686_
11:38:32 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
11:38:33 PM 0 305 1.26 0.95 0.54
11:38:34 PM 4 305 1.26 0.95 0.54
11:38:35 PM 1 306 1.26 0.95 0.54
11:38:36 PM 1 306 1.26 0.95 0.54
^C

vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 256368 53764 75980 220564 2 28 60 54 774 1343 15 4 78 2
The first column (r) is the run queue - 2 on my machine right now
Edit: Surprised there isn't a way to just get the number
Quick 'n' dirty way to get the number (might vary a little on different machines):
vmstat|tail -1|cut -d" " -f2

The metrics you seek exist in /proc/schedstat.
The format of this file is described in sched-stats.txt in the kernel source. Specifically, the cpu<N> lines are what you want:
CPU statistics
--------------
cpu<N> 1 2 3 4 5 6 7 8 9
First field is a sched_yield() statistic:
1) # of times sched_yield() was called
Next three are schedule() statistics:
2) This field is a legacy array expiration count field used in the O(1)
scheduler. We kept it for ABI compatibility, but it is always set to zero.
3) # of times schedule() was called
4) # of times schedule() left the processor idle
Next two are try_to_wake_up() statistics:
5) # of times try_to_wake_up() was called
6) # of times try_to_wake_up() was called to wake up the local cpu
Next three are statistics describing scheduling latency:
7) sum of all time spent running by tasks on this processor (in jiffies)
8) sum of all time spent waiting to run by tasks on this processor (in
jiffies)
9) # of timeslices run on this cpu
In particular, field 8. To find the run queue length, you would:
Observe field 8 for each CPU and record the value.
Wait for some interval.
Observe field 8 for each CPU again, and calculate how much the value has increased.
Dividing that difference by the length of the time interval waited (the documentation says it's in jiffies, but it's actually in nanoseconds since the addition of CFS), by Little's Law, yields the mean length of the scheduler run queue over the interval.
Unfortunately, I'm not aware of any utility to automate this process which is usually installed or even packaged in a Linux distribution. I've not used it, but the kernel documentation suggests http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c, which unfortunately refers to a domain that is no longer resolvable. Fortunately, it's available on the wayback machine.
Why not sar or vmstat?
These tools report the number of currently runnable processes. Certainly if this number is greater than the number of CPUs, some of them must be waiting. However, processes can still be waiting even when the number of processes is less than the number of CPUs, for a variety of reasons:
A process may be pinned to a particular CPU.
The scheduler may decide to schedule a process on a particular CPU to make better utilization of cache, or for NUMA optimization reasons.
The scheduler may intentionally idle a CPU to allow more time to a competing, higher priority process on another CPU that shares the same execution core (a hyperthreading optimization).
Hardware interrupts may be processable only on particular CPUs for a variety of hardware and software reasons.
Moreover, the number of runnable processes is only sampled at an instant in time. In many cases this number may fluctuate rapidly, and the contention may be occurring between the times the metric is being sampled.
These things mean the number of runnable processes minus the number of CPUs is not a reliable indicator of CPU contention.

uptime will give you the recent load average, which is approximately the average number of active processes. uptime reports the load average over the last 1, 5, and 15 minutes. It's a per-system measurement, not per-CPU.
Not sure what the processor queue length in Windows is, hopefully it's close enough to this?

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string