How to append the output of Parallel Grep to a file? - linux

I have a file of 500 MB, and a pattern file of 20MB. Since it was taking too much time to grep the 1.2 million patterns from the file with 5 million lines, I split the pattern file into 100 parts.
I tried to run grep in parallel with the multiple pattern files, as below:
for pat1 in vailtar_*
do
parallel --block 75M --pipe grep $pat1 infile >> outfile
done;
But I cannot get the output to append to a file. I tried without the block option and as below too -
cat infile | parallel --block 75M --pipe grep $pat1 >> outfile
< infile parallel --block 75M --pipe grep $pat1 >> outfile
Is there any way to make the parallel grep append the output to a file?
Thanks in advance.

Perhaps it will work better like this?
for pat1 in vailtar_*
do
parallel --block 75M --pipe grep -f $pat1 < infile
done > outfile
That will take all the output from everything inside the for loop, and put it in outfile.
Incidentally, I think you meant to use infile as stdin, instead of as an argument to grep, and I think you meant to have -f $pat1, not just the filename as the pattern. I've fixed both issues in my version.
However, if I were trying to solve this problem I might do it like this:
parallel 'grep -f {} infile' ::: vailtar_*
(I've not tested that.)
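Untested, but if your GNU parallel is new enough to have --pipepart, a variant along these lines should be faster still, since --pipepart splits infile by seeking instead of pushing it through a pipe (the -F flag assumes your patterns are fixed strings rather than regexes; drop it if they are not):
for pat1 in vailtar_*
do
parallel --pipepart -a infile --block 75M grep -F -f "$pat1"
done > outfile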

Print lines between line numbers from a line list and save every instance in separate file using GNU Parallel

I have a file, say "Line_File", with a list of line start and end numbers and a file ID:
F_a 1 108
F_b 109 1210
F_c 131 1190
I have another file, "Data_File", from which I need to fetch all the lines between the line numbers taken from Line_File.
The sed command:
sed -n '1,108p' Data_File > F_a.txt
does the job, but I need to do this for all the values in columns 2 & 3 of Line_File and save the output with the file name given in column 1 of Line_File.
If $1, $2 and $3 are the three columns of Line_File, then I am looking for a command something like
sed -n '$2,$3p' Data_File > $1.txt
I can run the same using a Bash loop, but that will be very slow for a very large file, say 40GB.
I specifically want to do this because I am trying to use GNU Parallel to make it faster, and line-number-based slicing will make the output non-overlapping. I am trying to execute a command like this:
cat Data_File | parallel -j24 --pipe --block 1000M --cat LC_ALL=C sed -n '$2,$3p' > $1.txt
But I am not able to use the column values $1, $2 and $3 properly.
I tried the following command:
awk '{system("sed -n \""$2","$3"p\" Data_File > $1"NR)}' Line_File
But it doesn't work. Any idea where I am going wrong?
P.S. If my question is not clear, please point out what else I should be sharing.
You may use xargs with -P (parallel) option:
xargs -P 8 -L 1 bash -c 'sed -n "$2,$3p" Data_File > $1.txt' _ < Line_File
Explanation:
This xargs command takes Line_File as input by using <
-P 8 option allows it to run up to 8 processes in parallel
-L 1 makes xargs process one line at a time
bash -c ... forks bash for each line in the input file
_ before < passes _ as $0, so the three columns of each input line become $1, $2 and $3
sed -n then runs for each input line with those positional parameters filled in
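To make the expansion concrete, with the three sample lines of Line_File shown above, each forked bash effectively runs one of:
bash -c 'sed -n "$2,$3p" Data_File > $1.txt' _ F_a 1 108
bash -c 'sed -n "$2,$3p" Data_File > $1.txt' _ F_b 109 1210
bash -c 'sed -n "$2,$3p" Data_File > $1.txt' _ F_c 131 1190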
Or you may use GNU parallel like this:
parallel --colsep '[[:blank:]]' "sed -n '{2},{3}p' Data_File > {1}.txt" :::: Line_File
Check the parallel examples in the official documentation.
awk to the rescue!
this scans the data file only once
$ awk 'NR==FNR {k=$1; s[k]=$2; e[k]=$3; next}
{for(k in s) if(FNR>=s[k] && FNR<=e[k]) print > (k".txt")}' lines data
This might work for you (GNU parallel and sed):
parallel --dry-run -a lineFile -C' ' "sed -n '{2},{3}p' dataFile > {1}"
This uses the column separator option -C ' ' set to a space, which maps the first three fields of lineFile to {1}, {2} and {3}. The --dry-run option lets you check the commands parallel generates before running them for real. Once the commands look correct, remove the --dry-run option.
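With the three sample lines of Line_File above, the --dry-run output should look roughly like this (illustrative only; the exact quoting may differ):
sed -n '1,108p' dataFile > F_a
sed -n '109,1210p' dataFile > F_b
sed -n '131,1190p' dataFile > F_c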
You are likely not to be CPU constrained. It is more likely your disks will be the limiting factor. To avoid reading DataFile over and over again, you should run as many jobs as possible in parallel. That way caching will help you:
cat Line_File |
parallel -j0 --colsep ' ' sed -n {2},{3}p Data_File \> {1}.txt

Retrieve last 100 lines logs

I need to retrieve the last 100 lines of logs from the log file.
I tried the sed command
sed -n -e '100,$p' logfilename
Please let me know how I can change this command to specifically retrieve the last 100 lines.
You can use the tail command as follows:
tail -100 <log file> > newLogfile
The last 100 lines will now be present in newLogfile.
EDIT:
As twalberg mentioned, more recent versions of tail use the command:
tail -n 100 <log file> > newLogfile
tail is a command to display the last part of a file; using the available switches helps get more specific output. The switches I use most are -n and -f.
SYNOPSIS
tail [-F | -f | -r] [-q] [-b number | -c number | -n number] [file ...]
Here
-n number : The location is number lines.
-f : The -f option causes tail to not stop when end of file is reached, but rather to wait for additional data to be appended to the input. The -f option is ignored if the standard input is a pipe, but not if it is a FIFO.
To get the last 100 lines (static):
tail -n 100 <file path>
To follow the last 100 lines in real time:
tail -f -n 100 <file path>
You can simply use the following command:
tail -NUMBER_OF_LINES FILE_NAME
e.g. tail -100 test.log
will fetch the last 100 lines from test.log.
If you want the output of the above in a separate file, you can redirect it as follows:
tail -NUMBER_OF_LINES FILE_NAME > OUTPUT_FILE_NAME
e.g. tail -100 test.log > output.log
will fetch the last 100 lines from test.log and store them in a new file, output.log.
The sed script that prints the last 100 lines can be found in the documentation for sed (https://www.gnu.org/software/sed/manual/sed.html#tail):
$ cat sed.cmd
1! {; H; g; }
1,100 !s/[^\n]*\n//
$p
h
$ sed -nf sed.cmd logfilename
For me that is way more difficult than your script, so
tail -n 100 logfilename
is much, much simpler. And it is quite efficient: it will not read the whole file if that is not necessary. See my answer with an strace report for tail ./huge-file: https://unix.stackexchange.com/questions/102905/does-tail-read-the-whole-file/102910#102910
I know this is very old, but for whoever it may help:
less +F my_log_file.log
That is just the basics; with less you can do far more powerful things. Once you are viewing the logs you can search, go to a line number, search for a pattern, and much more, and it is fast for large files.
It's like vim for logs (totally my opinion).
Original less documentation: https://linux.die.net/man/1/less
less cheat sheet: https://gist.github.com/glnds/8862214
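For reference, a few of the standard less key bindings that cover the log-viewing tasks mentioned above (not an exhaustive list):
less +F my_log_file.log
Ctrl-C   stop following so you can scroll and search
/ERROR   search forward for "ERROR" (?ERROR searches backward)
G        jump to the end of the file (g jumps to the start)
F        resume following newly appended lines
q        quit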
len=$(wc -l < filename)
l=$(( len - 99 ))
sed -n "${l},${len}p" filename
The first line takes the length (total lines) of the file.
We have to fetch 100 records, so the start line is the total length minus 99.
Then just put the two variables into the sed command to fetch the last 100 lines from the file.
I hope this will help you.
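Equivalently, assuming the file has at least 100 lines, the same idea fits on one line without the intermediate variables (an untested sketch, not part of the original answer):
sed -n "$(( $(wc -l < filename) - 99 )),\$p" filename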

Concatenation of huge number of selective files from a directory in Shell

I have more than 50000 files in a directory such as file1.txt, file2.txt, ....., file50000.txt. I would like to concatenate some of the files whose file numbers are listed in the following text file (need.txt).
need.txt
1
4
35
45
71
.
.
.
I tried the following. Though it is working, I am looking for a simpler and shorter way.
n1=1
n2=$(wc -l < need.txt)
while [ $n1 -le $n2 ]
do
f1=$(awk -v n1="$n1" 'NR==n1 {print $1}' need.txt)
cat file$f1.txt >> out.txt
(( n1++ ))
done
This might also work for you:
sed 's/.*/file&.txt/' < need.txt | xargs cat > out.txt
Something like this should work for you:
sed -e 's/.*/file&.txt/' need.txt | xargs cat > out.txt
It uses sed to translate each line into the appropriate file name and then hands the filenames to xargs to hand them to cat.
Using awk it could be done this way:
awk 'NR==FNR{ARGV[ARGC]="file"$1".txt"; ARGC++; next} {print}' need.txt > out.txt
Which adds each file to the ARGV array of files to process and then prints every line it sees.
It is possible to do it without any sed or awk command, directly using bash built-ins and cat (of course):
for i in $(cat need.txt); do cat file${i}.txt >> out.txt; done
And, as you wanted, it is quite simple.
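If you want to avoid word-splitting the $(cat need.txt) output and open out.txt only once, a small untested variant is:
while IFS= read -r i; do cat "file${i}.txt"; done < need.txt > out.txt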

How do I insert a new line before concatenating?

I have about 80000 files which I am trying to concatenate. This one:
cat files_*.raw >> All
is extremely fast whereas the following:
for f in `ls files_*.raw`; do cat $f >> All; done;
is extremely slow. Because of this reason, I am trying to stick with the first option except that I need to be able to insert a new line after each file is concatenated to All. Is there any fast way of doing this?
What about
ls files_*.raw | xargs -L1 sed -e '$s/$/\n/' >> All
That will insert an extra newline at the end of each file as you concatenate them.
And a parallel version if you don't care about the order of concatenation:
find ./ -name "*.raw" -print | xargs -n1 -P4 sed -e '$s/$/\n/' >>All
The second command might be slow because you are opening the 'All' file for append 80000 times vs. 1 time in the first command. Try a simple variant of the second command:
for f in `ls files_*.raw`; do cat $f ; echo '' ; done >> All
I don't know why it would be slow, but I don't think you have much choice:
for f in `ls files_*.raw`; do cat $f >> All; echo '' >> All; done
Each time awk opens another file to process, FNR resets to 1, so:
awk 'FNR==1 && NR>1 {print ""} {print}' files_*.raw >> All
prints a blank line before every file after the first. Note, it's all done in one awk process. Performance should be close to the cat command from the question.
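The corrected one-liner above separates the files but does not add a newline after the very last one. If GNU awk is available, its ENDFILE hook (a gawk extension, not POSIX awk) expresses the "newline after every file" requirement directly; a minimal, untested sketch:
gawk '{print} ENDFILE {print ""}' files_*.raw >> All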

Linux using grep to print the file name and first n characters

How do I use grep to perform a search which, when a match is found, will print the file name as well as the first n characters in that file? Note that n is a parameter that can be specified and it is irrelevant whether the first n characters actually contain the matching string.
grep -l pattern *.txt |
while read line; do
echo -n "$line: ";
head -c $n "$line";
echo;
done
Change -c to -n if you want to see the first n lines instead of bytes.
You need to pipe the output of grep to sed to accomplish what you want. Here is an example:
grep mypattern *.txt | sed 's/^\([^:]*:.......\).*/\1/'
The number of dots is the number of characters you want to print. Many versions of sed provide an option, such as -r (GNU/Linux) or -E (FreeBSD), that enables modern-style regular expressions; that makes it possible to specify the number of characters numerically:
N=7
grep mypattern *.txt /dev/null | sed -r "s/^([^:]*:.{$N}).*/\1/"
Note that this solution is a lot more efficient than the others proposed, which invoke multiple processes.
There are few tools that print 'n characters' rather than 'n lines'. Are you sure you really want characters and not lines? The whole thing can perhaps be best done in Perl. As specified (using grep), we can do:
pattern="$1"
shift
n="$1"
shift
grep -l "$pattern" "$@" |
while read file
do
echo "$file:" $(dd if="$file" bs=1 count="$n" 2>/dev/null)
done
The quotes around $file preserve multiple spaces in file names correctly. We can debate the command-line usage; currently it is (assuming the command name is 'ngrep'):
ngrep pattern n [file ...]
I note that #litb used 'head -c $n'; that's neater than the dd command I used. There might be some systems without head (but they'd be pretty archaic). I note that the POSIX version of head only supports -n and the number of lines; the -c option is probably a GNU extension.
Two thoughts here:
1) If efficiency was not a concern (like that would ever happen), you could check $status [csh] after running grep on each file. E.g.: (For N characters = 25.)
foreach FILE ( file1 file2 ... fileN )
grep targetToMatch ${FILE} > /dev/null
if ( $status == 0 ) then
echo -n "${FILE}: "
head -c25 ${FILE}
endif
end
2) GNU [FSF] head contains a --verbose [-v] switch, and grep and xargs offer --null, to accommodate filenames with spaces. And there's '--', to handle filenames like "-c". So you could do:
grep --null -l targetToMatch -- file1 file2 ... fileN |
xargs --null head -v -c25 --
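For what it is worth, head -v labels each chunk with a '==> file <==' header, so the combined output should look roughly like this (hypothetical file names and contents):
==> notes.txt <==
first 25 characters of no
==> todo.txt <==
first 25 characters of to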
