Awk, tail, sed or others - which one faster for big files? - linux

I have scripts for big log files. I can check all line and do something with tail and awk.
Tail:
tail -n +$startline $LOG
Awk:
awk 'NR>='"$startline"' {print}' $LOG
And checking time, tail working 6 mins 39 seconds, awk working 6 mins 42 seconds. So two commands do same thing / same time.
I don't know how to do with sed. Sed can be faster than tail and awk? Or maybe other commands.
Second question, I use $startline and every time continue remains from the last line. For example:
I use script line this:
10:00AM -> ./script -> $startline=1 and do something -> write line number to save file(for ex. 25),
10:05AM -> ./script -> $startline=26(read save file +1) and do something -> write line number save file(55),
10:10AM -> ./script -> $startline=56(read save file +1) and do something ....
But when script is running, checking all lines and when see $startline, doing something. And it's a little slow because of huge files.
Any suggestions for it be faster?
Script example:
lastline=$(tail -1 "line.save")
startline=$(($lastline + 1))
tail -n +$startline $LOG | while read -r
do
....
done
linecount=$(wc -l "$LOG" | awk '{print $1}')
echo $linecount >> line.save

tail and head are tools especially created for this purposes, so the intuitive idea is that their are quite optimized for it. On the other hand, awk and sed can perfectly do it because they are like a Swiss Army knife, but this is not supposed to be its best "skill" over the multiple others that they have.
In Efficient way to print lines from a massive file using awk, sed, or something else? there is a nice comparison on methods and head / tail is seen as the best approach.
Hence, I would go for tail + head.
Note also that if it is not only the last lines, but a set of them within the text, in awk (or in sed) you have the option to exit after the last line you wanted. This way, you avoid the script to run the file until the last line.
So this:
awk '{if (NR>=10 && NR<20) print} NR==20 {print; exit}'
is faster than
awk 'NR>=10 && NR<=20'
If your input happens to contain more than 20 lines.
Regarding your expression:
awk 'NR>='"$startline"' {print}' $LOG
note that it is more straight forward to write:
awk -v start="$startline" 'NR>=start' $LOG
there is no need to say print because it is implicit.

Related

how to show the third line of multiple files

I have a simple question. I am trying to check the 3rd line of multiple files in a folder, so I used this:
head -n 3 MiseqData/result2012/12* | tail -n 1
but this doesn't work obviously, because it only shows the third line of the last file. But I actually want to have last line of every file in the result2012 folder.
Does anyone know how to do that?
Also sorry just another questions, is it also possible to show which file the particular third line belongs to?
like before the third line is shown, is it also possible to show the filename of each of the third line extracted from?
because if I used head or tail command, the filename is also shown.
thank you
With Awk, the variable FNR is the number of the "record" (line, by default) in the current file, so you can simply compare it to 3 to print the third line of each input file:
awk 'FNR == 3' MiseqData/result2012/12*
A more optimized version for long files would skip to the next file on match, since you know there's only that one line where the condition is true:
awk 'FNR == 3 { print; nextfile }' MiseqData/result2012/12*
However, not all Awks support nextfile (but it is also not exclusive to GNU Awk).
A more portable variant using your head and tail solution would be a loop in the shell:
for f in MiseqData/result2012/12*; do head -n 3 "$f" | tail -n 1; done
Or with sed (without GNU extensions, i.e., the -s argument):
for f in MiseqData/result2012/12*; do sed '3q;d' "$f"; done
edit: As for the additional question of how to print the name of each file, you need to explicitly print it for each file yourself, e.g.,
awk 'FNR == 3 { print FILENAME ": " $0; nextfile }' MiseqData/result2012/12*
for f in MiseqData/result2012/12*; do
echo -n `basename "$f"`': '
head -n 3 "$f" | tail -n 1
done
for f in MiseqData/result2012/12*; do
echo -n "$f: "
sed '3q;d' "$f"
done
With GNU sed:
sed -s -n '3p' MiseqData/result2012/12*
or shorter
sed -s '3!d' MiseqData/result2012/12*
From man sed:
-s: consider files as separate rather than as a single continuous long stream.
You can do this:
awk 'FNR==3' MiseqData/result2012/12*
If you like the file name as well:
awk 'FNR==3 {print FILENAME,$0}' MiseqData/result2012/12*
This might work for you (GNU sed & parallel):
parallel -k sed -n '3p\;3q' {} ::: file1 file2 file3
Parallel applies the sed command to each file and returns the results in order.
N.B. All files will only be read upto the 3rd line.
Also,you may be tempted (as I was) to use:
sed -ns '3p;3q' file1 file2 file3
but this will only return the first file.
Hi bro I am answering this question as we know FNR is used to check no of lines so we can run this command to get 3rd line of every file.
awk 'FNR==3' MiseqData/result2012/12*

How To Delete First X Lines Based On Minimum Lines In File

I have a file with 10,000 lines. Using the following command, I am deleting all lines after line 10,000.
sed -i '10000,$ d' file.txt
However, now I would like to delete the first X lines so that the file has no more than 10,000 lines.
I think it would be something like this:
sed -i '1,$x d' file.txt
Where $x would be the number of lines over 10,000. I'm a little stuck on how to write the if, then part of it. Or, I was thinking I could use the original command and just cat the file in reverse?
For example, if we wanted just 3 lines from the bottom (seems simpler after a few helpful answers):
Input:
Example Line 1
Example Line 2
Example Line 3
Example Line 4
Example Line 5
Expected Output:
Example Line 3
Example Line 4
Example Line 5
Of course, if you know a more efficient way to write the command, I would be open to that too. Your positive input is highly appreciated.
tail can do exactly what you want.
tail -n 10000 file.txt
For simplicity, I would reverse the file, keep the first 10000 lines, then re-reverse the file.
It makes saving the file in-place a touch more complicated
source=file.txt
temp=$(mktemp)
tac "$source" | sed '10000 q' | tac > "$temp" && mv "$temp" "$source"
Without reversing the file, you'd count the number of lines and do some arithmetic:
sed -i "1,$(( $(wc -l < file.txt) - 10000 )) d" file.txt
$ awk -v n=3 '{a[NR%n]=$0} END{for (i=NR+1;i<=(NR+n);i++) print a[i%n]}' file
Example Line 3
Example Line 4
Example Line 5
Add -i inplace if you have GNU awk and want to do "inplace" editing.
To keep the first 10000 lines :
head -n 10000 file.txt
To keep the last 10000 lines :
tail -n 10000 file.txt
Test with your file Example
tail -n 3 file.txt
Example Line 3
Example Line 4
Example Line 5
tac file.txt | sed "$x q" | tac | sponge file.txt
The sponge command is useful here in avoiding an additional temporary file.
tail -10000 <<<"$(cat file.txt)" > file.txt
Okay, not «just» tail, but this way it`s capable of inplace truncation.

Concatenation of huge number of selective files from a directory in Shell

I have more than 50000 files in a directory such as file1.txt, file2.txt, ....., file50000.txt. I would like to concatenate of some files whose file numbers are listed in the following text file (need.txt).
need.txt
1
4
35
45
71
.
.
.
I tried with the following. Though it is working, but I look for more simpler and short way.
n1=1
n2=$(wc -l < need.txt)
while [ $n1 -le $n2 ]
do
f1=$(awk 'NR=="$n1" {print $1}' need.txt)
cat file$f1.txt >> out.txt
(( n1++ ))
done
This might also work for you:
sed 's/.*/file&.txt/' < need.txt | xargs cat > out.txt
Something like this should work for you:
sed -e 's/.*/file&.txt/' need.txt | xargs cat > out.txt
It uses sed to translate each line into the appropriate file name and then hands the filenames to xargs to hand them to cat.
Using awk it could be done this way:
awk 'NR==FNR{ARGV[ARGC]="file"$1".txt"; ARGC++; next} {print}' need.txt > out.txt
Which adds each file to the ARGV array of files to process and then prints every line it sees.
It is possible do it without any sed or awk command. Directly using bash built-in functions and cat (of course).
for i in $(cat need.txt); do cat file${i}.txt >> out.txt; done
And as you want, it is quite simple.

Filter lines by number of fields

I am filtering very long text files in Linux (usually > 1GB) to get only those lines I am interested in. I use with this command:
cat ./my/file.txt | LC_ALL=C fgrep -f ./my/patterns.txt | $decoder > ./path/to/result.txt
$decoder is the path to a program I was given to decode these files. The problem now is that it only accept lines with 7 fields, this is, 7 strings separated by spaces (e.g. "11 22 33 44 55 66 77"). Whenever a string with more or less fields is passed into this program makes it crash, and I get a broken pipe error message.
To fix it, I wrote a super simple script in Bash:
while read line ; do
if [[ $( echo $line | awk '{ print NF }') == 7 ]]; then
echo $line;
fi;
done
But the problem is that now it take ages to finish. Before it took seconds and now it takes ~30 minutes.
Does anyone know a better/faster way to do this? Thank you in advance.
Well perhaps you can insert awk between instead. No need to rely on Bash:
LC_ALL=C fgrep -f ./my/patterns.txt ./my/file.txt | awk 'NF == 7' | "$decoder" > ./path/to/result.txt
Perhaps awk can be the starter. Performance may be better that way:
awk 'NF == 7' ./my/file.txt | LC_ALL=C fgrep -f ./my/patterns.txt | "$decoder" > ./path/to/result.txt
You can merge fgrep and awk as a single awk command however I'm not sure if that would affect anything that require LC_ALL=C and that it would give better performance.

Print a file, skipping the first X lines, in Bash [duplicate]

This question already has answers here:
How can I remove the first line of a text file using bash/sed script?
(19 answers)
Closed 3 years ago.
I have a very long file which I want to print, skipping the first 1,000,000 lines, for example.
I looked into the cat man page, but I did not see any option to do this. I am looking for a command to do this or a simple Bash program.
You'll need tail. Some examples:
$ tail great-big-file.log
< Last 10 lines of great-big-file.log >
If you really need to SKIP a particular number of "first" lines, use
$ tail -n +<N+1> <filename>
< filename, excluding first N lines. >
That is, if you want to skip N lines, you start printing line N+1. Example:
$ tail -n +11 /tmp/myfile
< /tmp/myfile, starting at line 11, or skipping the first 10 lines. >
If you want to just see the last so many lines, omit the "+":
$ tail -n <N> <filename>
< last N lines of file. >
Easiest way I found to remove the first ten lines of a file:
$ sed 1,10d file.txt
In the general case where X is the number of initial lines to delete, credit to commenters and editors for this:
$ sed 1,Xd file.txt
If you have GNU tail available on your system, you can do the following:
tail -n +1000001 huge-file.log
It's the + character that does what you want. To quote from the man page:
If the first character of K (the number of bytes or lines) is a
`+', print beginning with the Kth item from the start of each file.
Thus, as noted in the comment, putting +1000001 starts printing with the first item after the first 1,000,000 lines.
If you want to skip first two line:
tail -n +3 <filename>
If you want to skip first x line:
tail -n +$((x+1)) <filename>
A less verbose version with AWK:
awk 'NR > 1e6' myfile.txt
But I would recommend using integer numbers.
Use the sed delete command with a range address. For example:
sed 1,100d file.txt # Print file.txt omitting lines 1-100.
Alternatively, if you want to only print a known range, use the print command with the -n flag:
sed -n 201,300p file.txt # Print lines 201-300 from file.txt
This solution should work reliably on all Unix systems, regardless of the presence of GNU utilities.
Use:
sed -n '1d;p'
This command will delete the first line and print the rest.
If you want to see the first 10 lines you can use sed as below:
sed -n '1,10 p' myFile.txt
Or if you want to see lines from 20 to 30 you can use:
sed -n '20,30 p' myFile.txt
Just to propose a sed alternative. :) To skip first one million lines, try |sed '1,1000000d'.
Example:
$ perl -wle 'print for (1..1_000_005)'|sed '1,1000000d'
1000001
1000002
1000003
1000004
1000005
You can do this using the head and tail commands:
head -n <num> | tail -n <lines to print>
where num is 1e6 + the number of lines you want to print.
This shell script works fine for me:
#!/bin/bash
awk -v initial_line=$1 -v end_line=$2 '{
if (NR >= initial_line && NR <= end_line)
print $0
}' $3
Used with this sample file (file.txt):
one
two
three
four
five
six
The command (it will extract from second to fourth line in the file):
edu#debian5:~$./script.sh 2 4 file.txt
Output of this command:
two
three
four
Of course, you can improve it, for example by testing that all argument values are the expected :-)
cat < File > | awk '{if(NR > 6) print $0}'
I needed to do the same and found this thread.
I tried "tail -n +, but it just printed everything.
The more +lines worked nicely on the prompt, but it turned out it behaved totally different when run in headless mode (cronjob).
I finally wrote this myself:
skip=5
FILE="/tmp/filetoprint"
tail -n$((`cat "${FILE}" | wc -l` - skip)) "${FILE}"

Resources