How to measure random/sequential read/write on Linux. SSD considerations

I wish to purchase an SSD drive for my Linux Fedora 16 workstation.
It will mainly be used for web development on GNOME 3, an IDE, and a virtual server (for the web environment).
I have two candidates:
Crucial m4 128 GB (much better 4K random read)
SanDisk Extreme 3 120 GB (much faster sequential reads and writes)
I wonder which one will benefit my system most, and whether there is a tool to measure the actual random/sequential read/write performance on my system.
Thanks

You can try the commercial tool PassMark BurnInTest™; the newest version is BurnInTest 7.1.
You can also try the freeware AS SSD Benchmark 1.7.4739.38088.
Both BurnInTest (also available for Linux) and AS SSD Benchmark target the Windows platform.
Several I/O benchmark options exist under Linux (example commands are shown after this list):
Using hdparm with the -Tt switch, you can time sequential reads. This method is independent of partition alignment!
There is a graphical benchmark called gnome-disks, contained in the gnome-disk-utility package, that will give min/max/average reads along with average access time and a nice graphical display. This method is independent of partition alignment!
The dd utility can be used to measure both reads and writes. This method is dependent on partition alignment! In other words, if you failed to properly align your partitions, that fact will show up here, since you're writing to and reading from a mounted filesystem.
Bonnie++
And if you want more control over the benchmark, I recommend smartctl.
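A rough sketch of how those tools are typically invoked (the device /dev/sda and the mount point /mnt/ssd are placeholders; point dd at a scratch file only, since it overwrites its target):

sudo hdparm -Tt /dev/sda                                              # cached vs. buffered sequential read timing
dd if=/dev/zero of=/mnt/ssd/testfile bs=1M count=1024 oflag=direct   # sequential write, bypassing the page cache
dd if=/mnt/ssd/testfile of=/dev/null bs=1M iflag=direct              # sequential read of the same file, uncached
bonnie++ -d /mnt/ssd -n 0                                            # simple filesystem throughput test
sudo smartctl -a /dev/sda                                            # SMART attributes and drive health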

Actual system RAM is faster, & cheaper ;) Beyond that, go for the drive that loads the system and programs fastest; IIRC that would be the 4K speed.

You can try my benchmark, described at the link below, which also includes a link to download the program:
http://www.roylongbottom.org.uk/linux_disk_usb_lan_benchmarks.htm
The latest version includes random writing. Example results are below. A run-time parameter can select larger files.
###############################################################
Selected File Path:
/media/PAT//
Total MB 7620, Free MB 7620, Used MB 0
Linux Storage Speed Test 64-Bit Version 1.2, Tue Dec 10 16:08:20 2013
Copyright (C) Roy Longbottom 2012
8 MB File 1 2 3 4 5
Writing MB/sec 5.18 11.97 12.12 12.04 11.96
Reading MB/sec 30.88 31.57 31.60 31.45 29.22
16 MB File 1 2 3 4 5
Writing MB/sec 11.90 11.95 12.26 12.07 12.14
Reading MB/sec 30.94 31.47 31.51 31.51 30.31
32 MB File 1 2 3 4 5
Writing MB/sec 11.97 12.07 12.11 12.15 12.18
Reading MB/sec 31.26 31.49 31.52 31.50 30.90
---------------------------------------------------------------------
8 MB Cached File 1 2 3 4 5
Writing MB/sec 396.50 77.49 11.19 11.01 13.36
Reading MB/sec 2810.03 2675.62 2972.66 3139.78 3079.33
---------------------------------------------------------------------
Bus Speed Block KB 64 128 256 512 1024
Reading MB/sec 28.34 28.31 30.25 31.00 31.41
---------------------------------------------------------------------
1 KB Blocks File MB > 2 4 8 16 32 64 128
Random Read msecs 0.52 0.51 0.50 0.50 0.50 0.50 0.52
Random Write msecs 5.47 5.56 3.31 8.48 3.35 3.35 194.27
---------------------------------------------------------------------
500 Files Write Read Delete
File KB MB/sec ms/File MB/sec ms/File Seconds
2 0.09 22.35 3.07 0.67 0.014
4 0.15 26.79 6.12 0.67 0.014
8 0.35 23.43 9.94 0.82 0.014
16 0.74 22.12 16.16 1.01 0.014
32 1.59 20.61 22.12 1.48 0.015
64 3.03 21.66 28.06 2.34 0.015

Related

Java performance issue on Oracle Linux

I'm running a very "simple" test with:
@Fork(value = 1, jvmArgs = {
    "--illegal-access=permit", "-Xms10G", "-Xmx10G",
    "-XX:+UnlockDiagnosticVMOptions", "-XX:+DebugNonSafepoints",
    "-XX:ActiveProcessorCount=7", "-XX:+UseNUMA",
    "-XX:DisableIntrinsic=_currentTimeMillis,_nanoTime",
    "-XX:+UnlockExperimentalVMOptions", "-XX:ConcGCThreads=5", "-XX:ParallelGCThreads=10",
    "-XX:+UseZGC", "-XX:+UsePerfData", "-XX:MaxMetaspaceSize=10G", "-XX:MetaspaceSize=256M" })
@Benchmark
public String generateRandom() {
    return UUID.randomUUID().toString();
}
Maybe it's not that simple, because it uses random, but the same issue shows up in any other Java test.
On my home desktop
Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, 12 threads (hyper-threading enabled), 64 GB RAM, Ubuntu VERSION="20.04.2 LTS (Focal Fossa)"
Linux homepc 5.8.0-59-generic #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Performance with 7 threads:
Benchmark Mode Cnt Score Error Units
RulesBenchmark.generateRandom thrpt 5 1312295.357 ± 27853.707 ops/s
Flame graph (async-profiler) result with 7 threads, at home
I have an issue on Oracle Linux:
Linux 5.4.17-2102.201.3.el8uek.x86_64 #2 SMP Fri Apr 23 09:05:57 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Intel(R) Xeon(R) Gold 6258R CPU @ 2.70GHz with 56 threads (hyper-threading disabled; the same happens when it is enabled and there are 112 CPU threads) and 1 TB RAM, NAME="Oracle Linux Server" VERSION="8.4". There I get half the performance, even when increasing the thread count.
With 1 thread, performance is very good:
Benchmark Mode Cnt Score Error Units
RulesBenchmark.generateRandom thrpt 5 2377471.113 ± 8049.532 ops/s
Flame graph (async-profiler) result with 1 thread
But with 7 threads:
Benchmark Mode Cnt Score Error Units
RulesBenchmark.generateRandom thrpt 5 688612.296 ± 70895.058 ops/s
Flame graph (async-profiler) result with 7 threads
Maybe it's a NUMA issue, because there are 2 sockets but the system is configured with only 1 NUMA node:
numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 0 size: 1030835 MB
node 0 free: 1011029 MB
node distances:
node 0
0: 10
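For reference, how the kernel sees the sockets and NUMA nodes can be cross-checked with the commands below (the grep pattern is just a convenience filter):

lscpu | grep -E 'Socket|NUMA'    # socket count and NUMA node count as seen by the kernel
ls /sys/devices/system/node/     # one nodeN directory per exposed NUMA node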
But after disabling some CPU threads using:
for i in {12..55}
do
  # take CPU $i offline
  echo '0' | sudo tee /sys/devices/system/cpu/cpu$i/online
done
Performance improved a little, but not much.
This is just a very "simple" test. On complex tests with real code it's even worse. It spends a lot of time in .annobin___pthread_cond_signal.start.
I also deployed a Vagrant image with the same Oracle Linux and kernel versions on my home desktop and ran it with 10 CPU threads, and performance was nearly the same (~1M ops/sec) as on my desktop. So it's not about the OS or the kernel, but about some configuration.
Tested with several JDK versions and vendors (JDK 11 and above). Performance is slightly lower with the OpenJDK 11 from the YUM distribution, but not significantly.
Can you suggest some advice?
Thanks in advance.
In essence, your benchmark tests the throughput of SecureRandom. The default implementation is synchronized (more precisely, the default implementation mixes input from /dev/urandom with the underlying provider).
The paradox is, more threads result in more contention, and thus lower overall performance, as the main part of the algorithm is under a global lock anyway. Async-profiler indeed shows that the bottleneck is the synchronization on a Java monitor: __lll_unlock_wake, __pthread_cond_wait, __pthread_cond_signal - all come from that synchronization.
The contention overhead definitely depends on the hardware, the firmware, and the OS configuration. Instead of trying to reduce this overhead (which can be hard - you know, some day yet another security patch will arrive that makes syscalls 2x slower, for example), I'd suggest getting rid of the contention in the first place.
This can be achieved by installing a different, non-blocking SecureRandom provider, as shown in this answer. I won't give a recommendation on a particular SecureRandomSpi, as it depends on your specific requirements (throughput/scalability/security). I'll just mention that an implementation can be based on:
rdrand
OpenSSL
raw /dev/urandom access, etc.
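Whichever provider you pick, a quick way to check that the monitor contention actually goes away is async-profiler's lock profiling mode; a hedged sketch, assuming async-profiler 2.x is unpacked locally (the script path and the pid are placeholders):

jps                                                # find the pid of the running benchmark JVM
./profiler.sh -e lock -d 30 -f locks.html <pid>    # 30 seconds of lock-contention profiling, flame graph output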

Optimal number of threads for GNU parallel

I think I have a fairly basic question. I just discovered the GNU parallel package and I think my workflow can really benefit from it!
I am using a loop which goes through my read files and generates the desired output. The command that is executed for each read looks something like this:
STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn R1.fq R2.fq
As you can see, I specified 8 threads, which is the number of threads my virtual machine has.
My question is the following:
If I use GNU parallel with a command like this:
cat reads| parallel -j 3 STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn {}_R1.fq {}_R2.fq
Can my virtual machine handle the number of threads I specified if I execute 3 jobs in parallel?
Or do I need 24 threads (3*8 threads) to properly execute this command?
I'm sorry if this is a basic question; I am very new to the field and any help is much appreciated!
The best advice is simply: Try different values and measure.
In parallelization there are sooo many factors that can affect the results: Disk I/O, shared CPU cache, and shared RAM bandwidth just to name three.
top is your friend when measuring. If you can manage to get all CPUs to have <5% idle you are unlikely to go any faster - no matter what you do.
top - 14:49:10 up 10 days, 5:48, 123 users, load average: 2.40, 1.72, 1.67
Tasks: 751 total, 3 running, 616 sleeping, 8 stopped, 4 zombie
%Cpu(s): 17.3 us, 6.2 sy, 0.0 ni, 76.2 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
GiB Mem : 31.239 total, 1.441 free, 21.717 used, 8.081 buff/cache
GiB Swap: 117.233 total, 104.146 free, 13.088 used. 4.706 avail Mem
This machine is 76.2% idle. If your processes use loads of CPU then starting more processes in parallel here may help. If they use loads of disk I/O it may or may not help. Only way to know is to test and measure.
top - 14:51:00 up 10 days, 5:50, 124 users, load average: 3.41, 2.04, 1.78
Tasks: 759 total, 8 running, 619 sleeping, 8 stopped, 4 zombie
%Cpu(s): 92.8 us, 6.9 sy, 0.0 ni, 0.1 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
GiB Mem : 31.239 total, 1.383 free, 21.772 used, 8.083 buff/cache
GiB Swap: 117.233 total, 104.146 free, 13.087 used. 4.649 avail Mem
This machine is 0.1% idle. Starting more processes is unlikely to make things go faster.
So increase the parallelization until idle time hits a minimum or until average processing time hits a minimum (--joblog my.log can be useful to see how long a job takes).
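As a rough sketch of that try-and-measure loop (the file names come from the question; :::: reads is equivalent to cat reads | parallel, and the total thread count is roughly -j times --runThreadN):

for j in 1 2 3 4; do
  /usr/bin/time -f "jobs=$j elapsed=%E" \
    parallel -j $j --joblog star_j$j.log \
      STAR --runThreadN 8 --genomeDir star_index/ --readFilesIn {}_R1.fq {}_R2.fq \
      :::: reads
done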
And yes: GNU Parallel is likely to speed up bioinformatics (it is written by a fellow bioinformatician).
Consider reading GNU Parallel 2018 (paper: http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html, download: https://doi.org/10.5281/zenodo.1146014). Read at least chapters 1+2. It should take you less than 20 minutes. Your command line will love you for it.

NodeJS CPU spikes to 100% one CPU at a time

I have a SOCKS5 Proxy server that I wrote in NodeJS.
I am utilizing the native net and dgram libraries to open TCP and UDP sockets.
It works fine for around 2 days with all CPUs at around 30% max. After 2 days with no restarts, one CPU spikes to 100%. After that, the CPUs take turns staying at 100%, one CPU at a time.
Here is a 7 day chart of the CPU spikes:
I am using Cluster to create instances, such as:
const Cluster = require('cluster');
const Os = require('os');
for (let i = 0; i < Os.cpus().length; i++) {
  Cluster.fork();
}
This is the output of strace while the CPU is at 100%:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.76 0.294432 79 3733 epoll_pwait
0.10 0.000299 0 3724 24 futex
0.08 0.000250 0 3459 15 rt_sigreturn
0.03 0.000087 0 8699 write
0.01 0.000023 0 190 190 connect
0.01 0.000017 0 3212 38 read
0.00 0.000014 0 420 close
0.00 0.000008 0 612 180 recvmsg
0.00 0.000000 0 34 mmap
0.00 0.000000 0 16 ioctl
0.00 0.000000 0 190 socket
0.00 0.000000 0 111 sendmsg
0.00 0.000000 0 190 bind
0.00 0.000000 0 482 getsockname
0.00 0.000000 0 218 getpeername
0.00 0.000000 0 238 setsockopt
0.00 0.000000 0 432 getsockopt
0.00 0.000000 0 3259 104 epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00 0.295130 29219 551 total
And the node profiler result (bottom up, heavy):
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 1.0% are not shown.
ticks parent name
1722861 81.0% syscall
28897 1.4% UNKNOWN
Since I only use the native libraries, most of my code actually runs in C++ rather than JS, so any debugging I have to do is in the V8 engine. Here is a summary from the node profiler (by language):
[Summary]:
ticks total nonlib name
92087 4.3% 4.5% JavaScript
1937348 91.1% 94.1% C++
15594 0.7% 0.8% GC
68976 3.2% Shared libraries
28897 1.4% Unaccounted
I suspected that it might be the garbage collector running, but I have increased Node's heap size and the memory seems to be within range. I don't really know how to debug it, since each iteration takes around 2 days.
Anyone had a similar issue and had success debugging it? I can use any help I can get.
Your question doesn't contain enough information to reproduce the case. Things like the OS, the Node.js version, your code implementation, etc. can all be the reason for such behavior.
Here is a list of best practices which can resolve or avoid such an issue:
Use pm2 as a supervisor for your Node.js application (see the command sketch after this list).
Debug your Node.js application in production. For that:
check your SSH connection to the production server
bind your debug port to localhost with ssh -N -L 9229:127.0.0.1:9229 root@your-remote-host
start debugging with kill -SIGUSR1 <nodejs pid>
open chrome://inspect in Chrome or use any other debugger for Node.js
Before going to production, do:
stress testing
longevity testing
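A minimal sketch of that supervisor-plus-remote-debug workflow (app.js, the host name, and the pid are placeholders):

npm install -g pm2
pm2 start app.js -i max                                 # one worker per CPU core, supervised restarts
pm2 monit                                               # live per-worker CPU/memory view
ssh -N -L 9229:127.0.0.1:9229 user@your-remote-host     # forward the inspector port to your machine
kill -SIGUSR1 <nodejs pid>                              # tell the worker to open the inspector on 9229
# then open chrome://inspect locally and attach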
A few months ago we realized that another service running on the same box, one that was keeping track of open sockets, was causing the issue. That service was an older version, and after a while it would spike the CPU while tracking the sockets. Upgrading the service to the latest version solved the CPU issues.
Lesson learned: sometimes it is not you, it is them.

Why is kmeans so slow on high spec Ubuntu machine but not Windows?

My Ubuntu machine's performance is terrible for R kmeans {stats}, whereas Windows 7 shows no problems.
X is a 5000 x 5 matrix (numerical variables).
k = 6
My desktop machine is a Dell Precision T3500 with an Intel Xeon CPU W3530 @ 2.80GHz x 8 (i.e., 8 cores), running Ubuntu 12.04.4 LTS (GNU/Linux 3.2.0-58-generic x86_64) with 24 GB RAM.
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
> system.time(X.km <- kmeans(X, centers=k, nstart=25))
user system elapsed
49.763 52.347 103.426
Compared to a Windows 7 64-bit laptop with an Intel Core i5-2430M @ 2.40GHz, 2 cores, 8 GB RAM, R 3.0.1, and the same data:
> system.time(X.km <- kmeans(X, centers=k, nstart=25))
user system elapsed
0.36 0.00 0.37
Much, much faster. The problem still exists for nstart=1; I just wanted to amplify the execution time.
Is there something obvious I'm missing?
Try it for yourselves, see what times you achieve:
set.seed(101)
k <- 6
n <- as.integer(10)
el.time <- vector(length=n)
X <- matrix(rnorm(25000, mean=0.5, sd=1), ncol=5)
for (i in 1:n) { # sorry, not clever enough to vectorise
el.time[i] <- system.time(kmeans(X, centers=k, nstart=i))[[3]]
}
print(el.time)
plot(el.time, type="b")
My results (ubuntu machine):
> print(el.time)
[1] 0.056 0.243 0.288 0.489 0.510 0.572 0.623 0.707 0.830 0.846
Windows machine:
> print(el.time)
[1] 0.01 0.12 0.14 0.19 0.20 0.21 0.22 0.25 0.28 0.30
Are you running Ubuntu in a virtual machine? If that were the case I could see the results being much slower, depending on how much memory, how many processors, and how much disk space were allocated to the VM. If it isn't running in a VM then the results are puzzling. I would want to see performance counters for each of the runs (CPU usage, memory usage, etc.) on both systems. Otherwise, the only thing I can think of is that the code "fits" in the L1 cache of your Windows system but doesn't on the Linux system. The Xeon has 8 MB (L3?) cache where the Core i5 only has 3 MB - but I'm assuming that's L3. I don't know what the L1 and L2 cache structures look like.
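For what it's worth, one way to gather the counters mentioned above on the Linux box is perf; a small sketch, assuming the timing snippet from the question is saved as kmeans_bench.R (a placeholder name):

perf stat -d Rscript kmeans_bench.R    # cycles, instructions, cache references/misses for the whole run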
My guess is that it's a BLAS issue. R might be using its internal reference BLAS if it was compiled that way. In addition to this, different BLAS implementations can show significant performance differences (OpenBLAS vs. MKL).
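A quick way to check which BLAS the R binary is linked against, and to switch it on Ubuntu (package and alternative names vary a little between releases, so treat this as a sketch):

ldd $(R RHOME)/lib/libR.so | grep -i blas        # which BLAS library R is linked against
sudo apt-get install libopenblas-base            # install OpenBLAS
sudo update-alternatives --config libblas.so.3   # pick the system-wide BLAS (may be libblas.so.3gf on older releases)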

How to check usage of different parts of memory?

I have a computer with 2 Intel Xeon CPUs and 48 GB of RAM. The RAM is divided between the CPUs - two parts of 24 GB + 24 GB. How can I check how much of each specific part is used?
So, I need something like htop, which shows how heavily each core is used (see this example), but for memory rather than for cores. Or something that would show which parts (addresses) of memory are used and which are not.
The information is in /proc/zoneinfo, which contains information very similar to /proc/vmstat except broken down by "Node" (NUMA ID). I don't have a NUMA system here to test and provide sample output for a multi-node config; on a one-node machine it looks like this:
Node 0, zone DMA
pages free 2122
min 16
low 20
high 24
scanned 0
spanned 4096
present 3963
[ ... followed by /proc/vmstat-like nr_* values ]
Node 0, zone Normal
pages free 17899
min 932
low 1165
high 1398
scanned 0
spanned 223230
present 221486
nr_free_pages 17899
nr_inactive_anon 3028
nr_active_anon 0
nr_inactive_file 48744
nr_active_file 118142
nr_unevictable 0
nr_mlock 0
nr_anon_pages 2956
nr_mapped 96
nr_file_pages 166957
[ ... more of those ... ]
Node 0, zone HighMem
pages free 5177
min 128
low 435
high 743
scanned 0
spanned 294547
present 292245
[ ... ]
I.e. a small summary of total usage/availability, followed by the nr_* values that are also found at the system-global level in /proc/vmstat (which then allow a further breakdown of what exactly the memory is used for).
If you have more than one memory node, aka NUMA, you'll see these zones for all nodes.
Edit: I'm not aware of a frontend for this (i.e. a NUMA-aware vmstat, or an htop-like "numa-top"), but please comment if anyone knows of one!
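For a quicker per-node summary than reading /proc/zoneinfo by hand, these also work on NUMA machines (numastat ships with the numactl package):

cat /sys/devices/system/node/node0/meminfo   # per-node equivalent of /proc/meminfo
numastat -m                                  # per-node memory usage in a meminfo-like table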
The numactl --hardware command will give you a short answer like this:
node 0 cpus: 0 1 2 3 4 5
node 0 size: 49140 MB
node 0 free: 25293 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 49152 MB
node 1 free: 20758 MB
node distances:
node 0 1
0: 10 21
1: 21 10
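If you also want a per-process view, or a crude live monitor of free memory per node, something like this works (the pid is a placeholder):

numastat -p <pid>                # per-node memory usage of a single process
watch -n 1 numactl --hardware    # refresh the per-node size/free figures every second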
