For a project, I have to write a virus scanner that uses the ClamAV signature database.
To increase speed I use threads (background jobs started with & and collected with wait).
How my code works:
It reads all the files in a folder and its sub-folders:
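function recursive_files()
{
    # Launch one background job per file, then wait for all of them.
    # NUL-delimited names keep paths with spaces intact.
    while IFS= read -r -d '' f
    do
        raw_and_scan "$f" &
    done < <(find "$folder_path" -type f -print0)
    wait
}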
As you can see, for each file there is a thread.
function raw_and_scan()
{
    raw_test_file "$1"
    read_signature_db_by_line "$1"
}
read_signature_db_by_line reads each line of the signature database:
function read_signature_db_by_line()
{
    # Launch one background job per signature line, then wait for all of them.
    while IFS= read -r LINE ; do
        stringtokenizer_line_db "$LINE" "$1" "$raw_file" &
    done < "$signature_path"
    wait
}
As you can see, for each line of DB there is a thread.
I went with this double-threaded implementation because I saw a huge performance improvement (benchmarked with time).
When I scan 50 files against a database of 50 lines, it works fine.
But when I scan my home folder (800 files) it doesn't work. Worse, I get a warning (can't fork() anymore) and my computer freezes and needs a reboot.
Watching the processes in htop, it works up to about 5000 tasks.
You can find my project at https://github.com/peondusud/Bash.antivir
In the end, I would like to scan a folder against a database of 65000 lines.
Do you have any idea how to limit the number of threads, or something like that?
Thanks.
The fact that you see a huge improvement going from a single process (not thread) to two does not mean you will go super fast using 5000 processes! Actually it is the opposite: if your processes do intensive work, you should limit their number to about 2 × the number of CPU cores in your system (a generic rule of thumb).
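One way to apply such a cap to the scanner from the question is to let xargs start the per-file jobs, since its -P option bounds how many run at once. This is only a sketch under assumptions: scan_one is a hypothetical stand-in for raw_and_scan, the names scan_tree and max_jobs are made up here, and nproc and xargs -0/-r assume GNU tools.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for the question's raw_and_scan function.
scan_one() {
    printf 'scanned %s\n' "$1"
}
export -f scan_one   # make the function visible to the bash started by xargs

# Scan every regular file under a directory, running at most max_jobs scans at once.
scan_tree() {
    local dir=$1
    local max_jobs=${2:-$(( 2 * $(nproc) ))}   # default: rule of thumb, 2 x CPU cores
    find "$dir" -type f -print0 |
        xargs -0 -r -n 1 -P "$max_jobs" bash -c 'scan_one "$1"' _
}
```

For example, scan_tree "$HOME" 8 would never have more than 8 scans running, however many files there are. The per-signature-line jobs would need the same kind of bound, or could simply run sequentially inside each file's job.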
While reading the APUE (3rd edition) book, I came across the open system call and its O_APPEND mode, which makes appends atomic: multiple processes can write to the same file, and the kernel ensures that the data they write doesn't overlap and that all lines stay intact.
After experimenting with the open system call in a C/C++ program, I was able to confirm that it works just as the book describes. I launched multiple processes that wrote to a single file, and every line could be accounted for by its process's PID.
I was hoping to observe the same behavior with Perl's sysopen, as I have some tasks at work that could benefit from it. I tried it, but it did not work. When I analyzed the output file, I saw signs of a race condition (probably), as there were often interleaved lines.
Question: Is Perl's sysopen call not the same as Linux's open system call? Is it possible to achieve this kind of atomic write by multiple processes to a single file?
EDIT: adding C code, and perl code used for testing.
C/C++ code
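#include <cstdio>
#include <string>
#include <fcntl.h>
#include <unistd.h>
#include <ace/OS_NS_unistd.h>   // ACE_OS::getpid()

constexpr int MAXLINES = 1000000;   // assumed value; not given in the original post

int main(void)
{
    // O_CREAT needs a mode argument (0644 is an assumption);
    // with O_APPEND every write(2) lands at the current end of the file.
    int fd = open("outfile.txt", O_WRONLY|O_CREAT|O_APPEND, 0644);
    if (fd == -1) {
        printf("failed to create outfile! exiting!\n");
        return -1;
    }
    for (int counter{1}; counter <= MAXLINES; counter++)
    {   /* write string 'line' for MAXLINES no. of times */
        std::string line = std::to_string(ACE_OS::getpid())
                         + " This is a sample data line ";
        line += std::to_string(counter) + " \n";
        // one write(2) call per line
        if (write(fd, line.c_str(), line.size()) == -1) {
            printf("Failed to write to outfile!\n");
        }
    }
    close(fd);
    return 0;
}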
Perl code
#!/usr/bin/perl
use Fcntl;
use strict;
use warnings;
my $maxlines = 100000;
sysopen (FH, "testfile", O_CREAT|O_WRONLY|O_APPEND) or die "failed sysopen\n";
while ($maxlines != 0) {
print FH "($$) This is sample data line no. $maxlines\n";
$maxlines--;
}
close (FH);
__END__
Update (after initial troubleshooting):
Thanks to the information provided in the answer below, I was able to get it working. I did run into an issue with some missing lines, caused by each process opening the file with O_TRUNC, which I shouldn't have done but missed at first. After some careful analysis I spotted the issue and corrected it. As always, Linux never fails you :).
Here is a bash script I used to launch the processes:
#!/bin/bash
# basically we spawn "$1" instances of the same
# executable which should append to the same output file.
max=$1
[[ -z $max ]] && max=6
echo "creating $max processes for appending into same file"
# this is our output file collecting all
# the lines from all the processes.
# we truncate it before we start
>testfile
for i in $(seq 1 $max)
do
echo $i && ./perl_read_write_with_syscalls.pl 2>>_err &
done
# end.
Verification from output file:
[compuser@lenovoe470:07-multiple-processes-append-to-a-single-file]$ ls -lrth testfile
-rw-rw-r--. 1 compuser compuser 252M Jan 31 22:52 testfile
[compuser@lenovoe470:07-multiple-processes-append-to-a-single-file]$ wc -l testfile
6000000 testfile
[compuser@lenovoe470:07-multiple-processes-append-to-a-single-file]$ cat testfile |cut -f1 -d" "|sort|uniq -c
1000000 (PID: 21118)
1000000 (PID: 21123)
1000000 (PID: 21124)
1000000 (PID: 21125)
1000000 (PID: 21126)
1000000 (PID: 21127)
[compuser@lenovoe470:07-multiple-processes-append-to-a-single-file]$
Observations:
To my surprise, there was no noticeable I/O wait or load on the system at all, which I was not expecting. I believe the kernel must somehow take care of that, but I don't know how it works and would be interested to learn more.
What could be the possible applications of this?
I do a lot of file-to-file reconciliation, and at work we always need to parse huge data files (30 GB to 50 GB each). With this working, I can now run operations in parallel instead of my previous approach, which consisted of hashing file1, then hashing file2, then comparing the key/value pairs from the two files. Now I can do the hashing in parallel and bring the time down even further.
Thanks
It doesn't matter if you open or sysopen; the key is using syswrite and sysread instead of print/printf/say/etc and readline/read/eof/etc.
syswrite maps to a single write(2) call, while print/printf/say/etc can result in multiple calls to write(2) (even if autoflush is enabled).[1]
sysread maps to a single read(2) call, while readline/read/eof/etc can result in multiple calls to read(2).
So, by using syswrite and sysread, you are subject to all the assurances that POSIX gives about those calls (whatever they might be) if you're on a POSIX system.
If you use print/printf/say/etc, and limit your writes to less than the size of the buffer between (explicit or automatic) flushes, you'll get a single write(2) call. The buffer size was 4 KiB in older versions of Perl, and it's 8 KiB by default in newer versions of Perl. (The size is decided when perl is built.)
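As a rough illustration of the syswrite approach, the sketch below launches several Perl writers from the shell, each appending whole lines with one syswrite call per line. The function names append_lines and run_writers are made up for this example, the line format mirrors the one in the question, and a robust version would also check syswrite's return value for partial writes.

```shell
#!/usr/bin/env bash

# Append COUNT lines to FILE, one syswrite (one write(2) call) per line.
append_lines() {   # usage: append_lines FILE COUNT
    perl -e '
        use strict; use warnings;
        use Fcntl;
        my ($file, $count) = @ARGV;
        sysopen(my $fh, $file, O_CREAT|O_WRONLY|O_APPEND) or die "sysopen: $!\n";
        for my $n (1 .. $count) {
            my $line = "(PID: $$) This is sample data line no. $n\n";
            defined syswrite($fh, $line) or die "syswrite: $!\n";
        }
        close($fh);
    ' "$1" "$2"
}

# Start WRITERS concurrent writers that all append to the same FILE.
run_writers() {    # usage: run_writers FILE WRITERS COUNT
    local file=$1 writers=$2 count=$3 i
    : > "$file"                      # truncate once, before any writer starts
    for (( i = 0; i < writers; i++ )); do
        append_lines "$file" "$count" &
    done
    wait                             # block until every writer has exited
}
```

After run_writers testfile 6 1000000, the same wc -l and cut | sort | uniq -c checks used in the question can confirm that no lines are missing or interleaved.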
I have been doing some research and ran into this situation. If you write to STDOUT (the screen), a multithreaded script will not print the data any faster than a simple single-threaded script. But if you write to a file like this:
myPrinter.perl > myPrint
the result changes, and the multithreaded approach gets a better time. My question is: since STDOUT (the screen) and the output file are both shared resources, shouldn't the access time be similar? Why does the multithreaded approach only perform better when writing to a file?
The Perl scripts that I used in the experiments are:
Single thread
for my $i (1..100000000){
print("things\n");
}
Multithread
use threads;
use Thread::Queue 3.01 qw( );
use constant NUM_WORKERS => 4;
sub worker {
for my $i (1 .. 25000000){
print("things\n");
}
}
my $q = Thread::Queue->new(); #::any
async { while (defined( my $job = $q->dequeue() )) { worker($job); } }
for 1..NUM_WORKERS;
for my $i (1 .. 4){
$q->enqueue($i);
}
$q->end();
$_->join for threads->list;
Credits: the queue implementation was taken from one of the ikegami answers.
This could be explained if writing to STDOUT requires some form of locking internally.
When STDOUT is connected to a terminal, the output is flushed after every newline. Otherwise, STDOUT is only flushed every 4 KiB or 8 KiB (depending on your version of Perl). The latter scenario presumably requires fewer or shorter locks.
You could use |cat instead of >file to get the same effect.
If your actual worker spends a much smaller proportion of time writing to STDOUT, this problem should go away.
An example, following up on my comment. I understand from the question that you compare STDOUT prints that wind up on the terminal to those that are redirected to a file.
Timed to print to console, and to file
time perl -we'print "hello\n" for 1..1_000_000'
Time: 0.209u 1.562s 0:17.65 9.9% 0+0k 0+0io 0pf+0w (tcsh)
time perl -we'print "hello\n" for 1..1_000_000' > many_writes.out
Time: 0.104u 0.005s 0:00.11 90.9% 0+0k 0+11720io 0pf+0w
That is 17.65 seconds vs. 0.11 seconds. Printing to a terminal is very, very slow.
With multiple threads I expect the difference to be even more pronounced.
How fast you can output data is restricted by the performance of the target. If you write to a local file the performance is restricted by the underlying OS, the file system and the speed of disk. If you write to a file on a network file system it is further restricted by the speed of the network and the performance of the file server etc. Some OS level buffering helps to make this faster.
If you write to STDOUT, it depends on what the target of STDOUT is. STDOUT can be redirected to a file, piped into another process, or printed to a terminal. In all of these cases the write speed again depends on the target medium. Terminals are usually very slow to write to compared to a local file. But again, this is not a question of STDOUT vs. file but of where STDOUT ends up.
What's the best way to generate 1000K text files (with Perl on Windows 7)? I want to generate them in as little time as possible (ideally within 5 minutes). Right now I am using Perl threading with 50 threads, and it is still taking too long.
What would be the best solution? Do I need to increase the thread count? Is there any other way to write 1000K files in under five minutes? Here is my code:
$start = 0;
$end = 10000;
my $start_run = time();
my @thr = "";
for($t=0; $t < 50; $t++) {
$thr[$t] = threads->create(\&files_write, $start, $end);
#start again from 10000 to 20000 loop
.........
}
for($t=0; $t < 50; $t++) {
$thr[$t]->join();
}
my $end_run = time();
my $run_time = $end_run - $start_run;
print "Job took $run_time seconds\n";
I don't need the return values of those threads. I also tried detach(), but it didn't work for me.
Generating 500K files (each only 20 KB) took 1564 seconds (26 min). Can I get this under 5 minutes?
Edit: files_write only takes values from a predefined array structure and writes them to a file. That's it.
Any other solution?
The time needed depends on lots of factors, but heavy threading is probably not the solution:
creating files in the same directory at the same time probably needs locking in the OS, so it's better not to do too much of it in parallel
how the data gets laid out on disk depends on the amount of data and on how many writes you do in parallel. A bad layout can hurt performance a lot, especially on an HDD. But even an SSD cannot do lots of parallel writes. This all depends heavily on the disk you use, e.g. whether it is a desktop system optimized for sequential writes or a server system that can do more parallel writes, as databases require.
... lots of other factors, often depending on the system
I would suggest that you use a thread pool with a fixed number of threads and benchmark what the optimal number of threads is for your specific hardware. E.g. start with a single thread and slowly increase the number. My guess is that the optimal number is somewhere between 0.5 and 4 times the number of processor cores you have, but as I said, it depends heavily on your actual hardware.
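The benchmarking procedure itself could look roughly like the sketch below. It is written in shell for a Unix-like system rather than in Perl on Windows, so it only illustrates the method (same job, increasing worker counts), not the files_write routine from the question. The names make_files and benchmark, the file contents and the batch size of 100 are all made up for the example.

```shell
#!/usr/bin/env bash

# Create COUNT small files in DIR using at most JOBS parallel workers.
make_files() {   # usage: make_files DIR COUNT JOBS
    local dir=$1 count=$2 jobs=$3
    mkdir -p "$dir"
    # Each worker receives the directory followed by a batch of up to 100 file numbers.
    seq 1 "$count" |
        xargs -n 100 -P "$jobs" sh -c '
            dir=$1; shift
            for i in "$@"; do
                printf "sample contents\n" > "$dir/file_$i.txt"
            done
        ' _ "$dir"
}

# Time the same job with each of the given worker counts.
benchmark() {    # usage: benchmark COUNT JOBS...
    local count=$1 jobs dir
    shift
    for jobs in "$@"; do
        dir=$(mktemp -d)
        echo "workers: $jobs"
        time make_files "$dir" "$count" "$jobs"
        rm -rf "$dir"
    done
}
```

Running benchmark 100000 1 2 4 8 16 prints one timing per worker count, which makes it easy to see where adding workers stops helping on a given disk.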
The slow performance is probably due to Windows having to lock the filesystem down while creating files.
If it is only for testing - and not critical data - a RAMdisk may be ideal. Try Googling DataRam RAMdisk.
I was doing some experiments to learn more about Linux process states.
So, there's a directory (named big_dir) with over a billion files in it (spread across many nested sub-directories). I ran tar -cv big_dir | ssh anotherServer "tar -xv -C big_dir" and saw in top that the tar process stays in D state. Meanwhile, the tar command keeps printing the paths of the files.
I know the process was in D state because it was doing disk I/O, but why didn't its state keep switching between D and R? Printing the file names under the directory must have used some CPU, mustn't it? Otherwise how could the tar command know that it should print something?
If I run dd if=/dev/zero of=/dev/null, the dd process stays in R state in the top output. Why wasn't it in D state? Wasn't it doing I/O all the time?
/dev/zero and /dev/null are pseudo-devices. So there's no physical device behind them.
If I do
dd if=/dev/zero of=/tmp/zeroes
then top does show me dd in D state. However, it also spends a lot of its time in R (using CPU time). top simply samples the process table, so you may need to watch it for some time to see transient states.
I suspect that for your tar example above, the time spent writing to stdout is negligible compared to the disk time. Note also that writing to stdout involves the windowing system, and while that is happening the process will be sleeping. E.g. I'm running yes right now, and most of the work is being done by my X server. The yes process is sleeping for most of the time I watch it (via top).
I'm sure your tar process SOMETIMES goes to R, but it's probably for a very short period of time, because it doesn't do that much - particularly since you are sending the data through a network. Unless that's a 10Gb/s network card [and everything else to "anotherServer" is really working at 1GB/s], this will be the slowest part of the chain. ssh itself will take a little bit of overhead as it encrypts the data.
It probably takes tar a few microseconds to ask for some data from the disk, and a few milliseconds for the disk to move its head and read the actual data. So you have about 0.1% of the time in "R", the rest is in "D".
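Because top only shows a snapshot at each refresh, one way to see how a process's time actually divides between R and D is to sample its state many times and tally the results. The sketch below does that on Linux by reading /proc/PID/status. The function names proc_state and sample_states are made up here, and the fractional sleep interval assumes GNU sleep.

```shell
#!/usr/bin/env bash

# Print the one-letter state (R, S, D, ...) of a process, read from /proc (Linux only).
proc_state() {      # usage: proc_state PID
    awk '/^State:/ { print $2; exit }' "/proc/$1/status" 2>/dev/null
}

# Sample a process SAMPLES times, INTERVAL seconds apart, and tally the states seen.
sample_states() {   # usage: sample_states PID [SAMPLES] [INTERVAL]
    local pid=$1 samples=${2:-200} interval=${3:-0.01} i state
    for (( i = 0; i < samples; i++ )); do
        state=$(proc_state "$pid")
        [ -n "$state" ] || break      # stop once the process has exited
        printf '%s\n' "$state"
        sleep "$interval"
    done | sort | uniq -c
}
```

Calling sample_states "$(pgrep -o tar)" 1000 while the transfer runs would print a count per state, so a process that is in R only a small fraction of the time still shows up in the tally.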
I have two programs A and B. I can't change the program A - I can only run it with some parameters, but I have written the B myself, and I can modify it the way I like.
Program A runs for a long time (20-40 hours) and during that time it produces output to the file, so that its size increases constantly and can be huge at the end of run (like 100-200 GB). The program B then reads the file and calculates some stuff. The special property of the file is that its content is not correlated: I can divide the file in half and run calculations on each part independently, so that I don't need to store all the data at once: I can calculate on the first part, then throw it away, calculate on the second one, etc.
The problem is that I don't have enough space to store such big files. I wonder whether it is possible to somehow pipe the output of A to B without storing all the data at once and without creating huge files. Is something like that possible?
Thank you in advance, this is crucial for me now, Roman.
If program A supports it, simply pipe.
A | B
Otherwise, use a fifo.
mkfifo /tmp/fifo
ls -la > /tmp/fifo &
cat /tmp/fifo
EDIT: Check the pipe buffer size with ulimit -p (bash reports it in 512-byte blocks, and it may not be settable), and then:
cat /tmp/fifo | B
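Put together for the case where A can only write to a named file, the FIFO approach might look like the sketch below. Here program_a and program_b are hypothetical stand-ins for A and B (the real A would be given the FIFO path as its output file), and pipe_through_fifo is a name made up for the example.

```shell
#!/usr/bin/env bash

# Hypothetical stand-ins: program_a writes to the file named by its argument,
# program_b reads standard input and reports how many lines it received.
program_a() { seq 1 1000 > "$1"; }
program_b() { wc -l; }

pipe_through_fifo() {
    local dir fifo
    dir=$(mktemp -d) || return 1
    fifo=$dir/a_to_b
    mkfifo "$fifo" || return 1
    program_a "$fifo" &        # blocks in open() until a reader opens the FIFO
    program_b < "$fifo"        # consumes data as it is produced; nothing accumulates on disk
    wait                       # collect the writer's exit status
    rm -rf "$dir"
}
```

The writer is paused by the kernel whenever the reader falls behind, so the amount of data in flight never exceeds the pipe buffer.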
It is possible to pipeline output of one program into another.
Read here to know the syntax and know-hows of Unix pipelining.
You can use socat, which can take stdout and feed it to the network, and take data from the network and feed it to stdin.
Named and unnamed pipes have the problem of a small buffer (4k?), which means too many process context switches if you are writing multiple GB ...
Or, if you are adventurous enough, you can LD_PRELOAD a shared object (.so) into process A and trap the open/write calls to do whatever you need.