Run multiple commands in parallel in Unix/Linux

I have a few files and I need to cut a few columns from each of them to generate new files in Unix.
I tried doing it in a loop, selecting the files in the directory and generating the new files, but the directory has about 100 such files and it takes a lot of time.
Can anyone please help me process, say, 10 files in parallel to generate 10 new files, then move on to the next set of 10, since that would reduce the time?
I need a sample Unix code block for this:
cut -b 1-10,25-50,65-79 file1.txt > file_cut1.txt
cut -b 1-10,25-50,65-79 file2.txt > file_cut2.txt

You can do that quite simply with GNU Parallel like this:
parallel 'cut -b 1-10,25-50,65-79 {} > {.}_cut.txt' ::: file*txt
where:
{} represents the current filename, and
{.} represents the current filename without its extension.
Make a backup of the files in your directory before trying this, or any unfamiliar commands.
It will process your files in parallel, doing N at a time, where N is the number of cores in your CPU. If you want it to run, say, 8 jobs at a time, use:
parallel -j 8 ...
If you want to see what it would do, without actually doing anything, use:
parallel --dry-run ...
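If you would rather stick with plain bash (no GNU Parallel), a minimal sketch that runs the cut commands in batches of 10 background jobs, assuming the same file*.txt naming as above, could look like this:
#!/bin/bash
# Run the cut commands 10 at a time, waiting for each batch to finish.
n=0
for f in file*.txt; do
    cut -b 1-10,25-50,65-79 "$f" > "${f%.txt}_cut.txt" &   # run each cut in the background
    if (( ++n % 10 == 0 )); then
        wait    # let the current batch of 10 finish before starting more
    fi
done
wait    # wait for any jobs left over from the final, partial batch
This is simpler but less efficient than GNU Parallel, because one slow file holds up its whole batch of 10.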

Related

Read only the first n lines [Sublime Text]

I've got some files too big to open directly in Sublime Text. Is there any way to open only the first n lines? Something like head in bash? Thanks
If you're on Linux or Mac, or have Cygwin, Git Bash, or similar installed on a Windows machine, check out the split utility, which is part of the coreutils package. It does exactly what it says: it splits input into separate files. It is configurable via command-line options, like every Unix utility. For example, if you wanted to split your input file into separate 10,000-line files starting with notsobigfile and using numeric suffixes ending with .txt, you would run
split -d -l 10000 --additional-suffix=".txt" reallybigfile.txt notsobigfile
and it would output files named notsobigfile00.txt, notsobigfile01.txt, etc. If this would generate more than 100 files (00 through 99), just add -a x, where x is the number of digits (the default is 2).
For all the possible options, just read the man page:
man split
If you only want to output the first part of the file, check out the options for the -n/--number flag.
To figure out how many lines your input file has, run the word counting utility using the lines option:
wc -l reallybigfile.txt
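If all you want is the first part of the file rather than a full split, a quick sketch (the 10,000-line count and output names are just examples):
# Take only the first 10,000 lines, which Sublime Text should open happily:
head -n 10000 reallybigfile.txt > notsobigfile.txt
# Or, with GNU split, write only the first of 10 line-based chunks to stdout:
split -n l/1/10 reallybigfile.txt > firstchunk.txt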

Use more than one core in bash

I have a Linux tool that (greatly simplifying) cuts out the sequences specified in an Illumina adapters file. I have 32 files to process. One file is processed in about 5 hours. I have a CentOS server with 128 cores.
I've found a few solutions, but each one works in a way that only uses one core. The last one seems to fire off 32 nohups, but everything still ends up running on a single core.
My question is, does anyone have any idea how to use the server's potential? Basically every file can be processed independently; there are no relations between them.
This is the current version of the script, and I don't know why it only uses one core. I wrote it with the help of advice found here on Stack Overflow and elsewhere on the Internet:
#!/bin/bash
FILES=/home/daw/raw/*
count=0
for f in $FILES
do
  base=${f##*/}
  echo "process $f file..."
  nohup /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o "OUT$base" $f &
  (( count++ ))
  if (( count = 31 )); then
    wait
    count=0
  fi
done
To explain: FILES is the list of files from the raw folder.
The "core" line is the one executing nohup: the first path is the path to the tool, the -a path is the path to the file with the adapter patterns to cut, and -o gives the output the same file name as the input with OUT prepended. The last parameter is the input file to be processed.
Here is the tool's readme:
https://github.com/vsbuffalo/scythe
Does anybody know how to handle this?
P.S. I also tried moving nohup before count, but it still uses only one core. I have no limitations on the server.
IMHO, the most likely solution is GNU Parallel, so you can run up to, say, 64 jobs in parallel with something like this:
parallel -j 64 /home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o OUT{/} {} ::: /home/daw/raw/*
This has the benefit that jobs are not batched: it keeps 64 running at all times, starting a new one as each job finishes, which is better than waiting potentially 4.9 hours for a batch of jobs to finish before starting the last one, which then takes a further 5 hours after that. Note that I arbitrarily chose 64 jobs here; if you don't specify otherwise, GNU Parallel will run one job per CPU core you have.
Useful additional parameters are:
parallel --bar ... gives a progress bar
parallel --dry-run ... does a dry run so you can see what it would do without actually doing anything
If you have multiple servers available, you can add them in a list and GNU Parallel will distribute the jobs amongst them too:
parallel -S server1,server2,server3 ...
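If installing GNU Parallel on the CentOS box isn't possible, xargs -P gives similar (if less featureful) parallelism. Here is a rough sketch using the same paths as above; it is an alternative approach, not part of the original answer:
# Feed every raw file to xargs, running up to 64 scythe processes at once.
# "OUT${1##*/}" strips the directory so the output lands in the current directory,
# matching the OUT$base naming from the original loop.
printf '%s\0' /home/daw/raw/* |
  xargs -0 -n 1 -P 64 sh -c '/home/daw/scythe/scythe -a /home/daw/scythe/illumina_adapters.fa -o "OUT${1##*/}" "$1"' _
GNU Parallel remains the nicer option here because of --bar, --dry-run and the remote -S support mentioned above.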

Shell script Fetching data from 5 different directories

I'm trying to run a shell script to get data from multiple directories.
My target (targetDir) has 5 directories. So the program, when executed, should search data from these 5 different directories, but when I execute it, it treats all 5 folders as a single string in one path. Any advice?
targetDir="snavis_bub snavis_bub2 snavis_bub3 snavis_hdw snavis_ldw"
datadir=/opt/pkg/home/tools/zform/marnel/$targetDir/of_inspect
Upon execution:
./orsInspect.sh: line 60:
cd: /opt/pkg/home/tools/zform/marnel/snavis_bub,snavis_bub2,snavis_bub3,snavis_hdw,snavis_ldw/oref_inspect: No such file or directory
There are many things you can do. For example, you can use arrays and for loops and perform a task on each iteration of the loop:
#!/bin/bash
declare -a targetDirs=("snavis_bub" "snavis_bub2" "snavis_bub3" "snavis_hdw" "snavis_ldw")
for the_dir in "${targetDirs[@]}"; do
    datadir="/opt/pkg/home/tools/zform/marnel/${the_dir}/of_inspect"
    echo "$datadir"
    # ... do something for each datadir
done
example output (just echoing):
/opt/pkg/home/tools/zform/marnel/snavis_bub/of_inspect
/opt/pkg/home/tools/zform/marnel/snavis_bub2/of_inspect
/opt/pkg/home/tools/zform/marnel/snavis_bub3/of_inspect
/opt/pkg/home/tools/zform/marnel/snavis_hdw/of_inspect
/opt/pkg/home/tools/zform/marnel/snavis_ldw/of_inspect
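Since the error in the question came from cd-ing into a non-existent path, a sketch that checks each directory before doing the per-directory work might help (the ls is just a placeholder for whatever the script actually does):
#!/bin/bash
declare -a targetDirs=("snavis_bub" "snavis_bub2" "snavis_bub3" "snavis_hdw" "snavis_ldw")
for the_dir in "${targetDirs[@]}"; do
    datadir="/opt/pkg/home/tools/zform/marnel/${the_dir}/of_inspect"
    if [ -d "$datadir" ]; then
        ( cd "$datadir" && ls )    # placeholder: replace ls with the real per-directory work
    else
        echo "skipping missing directory: $datadir" >&2
    fi
done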

Bash script to search multiple files with a string mentioned in a different file, then copy those files into a new directory

I have multiple files recorded per date. At the end of every day, I need to run a script to grep those files which contain particular numbers mentioned in a different file, then copy all those files (CSV records) which contain matching UID records to another file location.
Working Dir = /var/output
Search file name = /var/output/UID.txt
# cat UID.txt
639867675
123466490
123334555
filenames = CSV_name_date.csv
Each filename is unique, and in a day I get roughly 5000 files.
I'm using this command:
grep -f UID.txt -e stringpattern -l CSV_*.csv | xargs cp -t /var/output2/
I need to run the search for a particular date: the script should ask which date you want to run, and then search only the files from that date.
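A minimal sketch of such a script, assuming the files live in /var/output, that the date appears in the file names as CSV_<name>_<date>.csv exactly as typed at the prompt, and that matching files should be copied to /var/output2 (all assumptions based on the question):
#!/bin/bash
# Prompt for the date, then copy every CSV from that date whose contents
# match any UID listed in UID.txt.
read -r -p "Which date do you want to search? " rundate
cd /var/output || exit 1
# -F: UIDs are fixed strings; -l/-Z: print matching file names NUL-separated.
grep -lZF -f UID.txt -- CSV_*_"$rundate".csv | xargs -0 -r cp -t /var/output2/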

How to use sed command to delete lines without backup file?

I have large file with size of 130GB.
# ls -lrth
-rw-------. 1 root root 129G Apr 20 04:25 syslog.log
So I need to reduce the file size by deleting the lines which start with "Nov 2", so I have given the following command:
sed -i '/Nov 2/d' syslog.log
I can't even edit the file using the vim editor.
When I trigger the sed command, it creates a backup file as well, but I don't have much free space in root. Please give an alternate solution to delete these lines from this file without using additional space on the server.
It does not create a real backup file. sed is a stream editor. When applied to a file with option -i, it will stream that file through the sed process, write the output to a new (temporary) file, and, when everything is done, rename the new file to the original name.
(There are options to create backup files also, but you didn't give them, so I won't mention that further.)
In your case you have a very large file and don't want to create any copy, however temporary. For this you need to open the file for reading and writing at the same time, then your sed process can overwrite the original. After this, you will have to truncate the file at the end of the writing.
To demonstrate how this can be done, we first perform a test case.
Create a test file, containing lots of lines:
seq 0 999999 > x
Now, let's say we want to remove all lines containing the digit 4:
grep -v 4 1<>x <x
This will open the file for reading and writing as STDOUT (1), and for reading as STDIN. The grep command will read all lines and will output only the lines not containing a 4 (option -v).
This will effectively overwrite the beginning of the original file.
You will not know how long the new output is, so after it ends, the rest of the original contents of the file will still appear:
…
999991
999992
999993
999995
999996
999997
999998
999999
537824
537825
537826
537827
537828
537829
…
You can use the Unix tool truncate to shorten your file manually afterwards. In a real scenario you will have trouble finding the right spot for this, so it makes sense to count the number of bytes written (using wc):
(Don't forget to recreate the original x for this test.)
(grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c
This will perform the step above and additionally print out the number of bytes written to the terminal; in this example case the output will be 3653658. Now use truncate:
truncate -s 3653658 x
Now you have the result you want.
If you want to do this in a script, i. e. without interaction, you can use this:
length=$( (grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c )
truncate -s "$length" x
I cannot guarantee that this will work for files >2GB or >4GB on your machine; depending on your operating system (32bit?) and the versions of the installed tools you might run into largefile issues. I'd perform tests with large files first (>4GB as this is typically a limit for many things) and then cross your fingers and give it a try :)
Some caveats you have to keep in mind:
Of course, nobody is supposed to append log entries to that log file while the procedure is running.
Also, any abort during the running of the process (power failure, signal caught, etc.) will leave the file in an undefined state. But re-running the command again after such a mishap will in most cases produce the correct output; some lines might be doubled, but not more than a single line should be corrupted then.
The output must be smaller than the input, of course, otherwise the writing will overtake the reading, corrupting the whole result so that lines which should be there will be missing (or truncated at the start).
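Applied to the original syslog.log question, the same pattern might look like the sketch below. The regex is taken from the question and may need adjusting (syslog usually pads single-digit days, e.g. "Nov  2" with two spaces), and, as above, make sure nothing is appending to the log while this runs:
# Stream syslog.log through sed, overwrite the file in place, count the bytes written,
# then cut the file down to that length.
length=$( (sed '/^Nov 2/d' <syslog.log | tee /dev/stderr 1<>syslog.log) |& wc -c )
truncate -s "$length" syslog.log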
