Print lines between line numbers from a line list and save each slice in a separate file using GNU Parallel - linux

I have a file, say "Line_File", with a list of start & end line numbers and a file ID:
F_a 1 108
F_b 109 1210
F_c 131 1190
I have another file, "Data_File", from which I need to fetch all the lines between the line numbers listed in Line_File.
The sed command
sed -n '1,108p' Data_File > F_a.txt
does the job, but I need to do this for all the values in columns 2 & 3 of Line_File and save each slice under the file name given in column 1 of Line_File.
If $1, $2 and $3 are the three columns of Line_File, then I am looking for a command something like
sed -n '$2,$3p' Data_File > $1.txt
I can do this with a Bash loop, but that will be very slow for a very large file, say 40 GB.
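For reference, the serial loop I mean is something like this sketch:
while read -r name start end; do
    sed -n "${start},${end}p" Data_File > "${name}.txt"
done < Line_File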
I specifically want to do it this way because I am trying to use GNU Parallel to make it faster, and line-number-based slicing keeps the outputs non-overlapping. I am trying to execute a command like this:
cat Data_File | parallel -j24 --pipe --block 1000M --cat LC_ALL=C sed -n '$2,$3p' > $1.txt
But I am not able to use the column values $1, $2 and $3 properly.
I tried the following command:
awk '{system("sed -n \""$2","$3"p\" Data_File > $1"NR)}' Line_File
But it doesn't work. Any idea where I am going wrong?
P.S. If my question is not clear, please point out what else I should share.

You may use xargs with -P (parallel) option:
xargs -P 8 -L 1 bash -c 'sed -n "$2,$3p" Data_File > $1.txt' _ < Line_File
Explanation:
This xargs command takes Line_File as input by using <
-P 8 option allows it to run up to 8 processes in parallel
-L 1 makes xargs process one line at a time
bash -c ... forks bash for each line in input file
_ before < is passed as $0, so the three columns of each input line become $1, $2 and $3
sed -n then runs once per input line, extracting the requested line range
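To see how the arguments line up, you can test with echo on a single sample line (the echo body here is purely illustrative):
$ printf 'F_a 1 108\n' | xargs -L 1 bash -c 'echo "\$0=$0 \$1=$1 \$2=$2 \$3=$3"' _
$0=_ $1=F_a $2=1 $3=108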
Or you may use GNU Parallel like this:
parallel --colsep '[[:blank:]]' "sed -n '{2},{3}p' Data_File > {1}.txt" :::: Line_File
Check the parallel examples in the official documentation.

awk to the rescue!
This scans the data file only once:
$ awk 'NR==FNR {k=$1; s[k]=$2; e[k]=$3; next}
{for(k in s) if(FNR>=s[k] && FNR<=e[k]) print > (k".txt")}' lines data
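Using the question's filenames, that would be:
awk 'NR==FNR {k=$1; s[k]=$2; e[k]=$3; next}
{for(k in s) if(FNR>=s[k] && FNR<=e[k]) print > (k".txt")}' Line_File Data_File
Note that the ranges may overlap (F_b and F_c do above); this handles that naturally, since every range is checked against every line.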

This might work for you (GNU parallel and sed):
parallel --dry-run -a lineFile -C' ' "sed -n '{2},{3}p' dataFile > {1}"
This uses the column separator -C ' ' set to a space, which maps the first three fields of lineFile to {1}, {2} and {3}. The --dry-run option lets you check the commands parallel generates before running them for real. Once the commands look correct, remove the --dry-run option.
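With the three sample rows from the question as lineFile, the dry run should print something like:
sed -n '1,108p' dataFile > F_a
sed -n '109,1210p' dataFile > F_b
sed -n '131,1190p' dataFile > F_c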

You are likely not to be CPU constrained. It is more likely your disks will be the limiting factor. To avoid reading DataFile over and over again, you should run as many jobs as possible in parallel. That way caching will help you:
cat Line_File |
parallel -j0 --colsep ' ' sed -n {2},{3}p Data_File \> {1}.txt
(The \> escapes the redirection so that it is performed by each job's shell rather than by the parent shell.)

Related

how to show the third line of multiple files

I have a simple question. I am trying to check the 3rd line of multiple files in a folder, so I used this:
head -n 3 MiseqData/result2012/12* | tail -n 1
but this doesn't work, obviously, because it only shows the third line of the last file. I actually want the third line of every file in the result2012 folder.
Does anyone know how to do that?
Sorry, one more question: is it also possible to show which file each third line belongs to? That is, before the third line is shown, can the filename it was extracted from be printed as well? When I use head or tail on multiple files, the filename is also shown.
Thank you.
With Awk, the variable FNR is the number of the "record" (line, by default) in the current file, so you can simply compare it to 3 to print the third line of each input file:
awk 'FNR == 3' MiseqData/result2012/12*
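To see the difference between FNR and NR, compare them across two files (a.txt and b.txt here are hypothetical two-line files):
$ awk '{print FILENAME, NR, FNR}' a.txt b.txt
a.txt 1 1
a.txt 2 2
b.txt 3 1
b.txt 4 2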
A more optimized version for long files would skip to the next file on match, since you know there's only that one line where the condition is true:
awk 'FNR == 3 { print; nextfile }' MiseqData/result2012/12*
However, not all Awks support nextfile (but it is also not exclusive to GNU Awk).
A more portable variant using your head and tail solution would be a loop in the shell:
for f in MiseqData/result2012/12*; do head -n 3 "$f" | tail -n 1; done
Or with sed (without GNU extensions, i.e., the -s argument):
for f in MiseqData/result2012/12*; do sed '3q;d' "$f"; done
Here d deletes each line before it can be printed, while 3q auto-prints line 3 and quits, so only the third line of each file appears.
Edit: As for the additional question of how to print the name of each file, you need to print it explicitly for each file yourself, e.g.,
awk 'FNR == 3 { print FILENAME ": " $0; nextfile }' MiseqData/result2012/12*
for f in MiseqData/result2012/12*; do
  echo -n "$(basename "$f"): "
  head -n 3 "$f" | tail -n 1
done
for f in MiseqData/result2012/12*; do
  echo -n "$f: "
  sed '3q;d' "$f"
done
With GNU sed:
sed -s -n '3p' MiseqData/result2012/12*
or shorter
sed -s '3!d' MiseqData/result2012/12*
From man sed:
-s: consider files as separate rather than as a single continuous long stream.
You can do this:
awk 'FNR==3' MiseqData/result2012/12*
If you like the file name as well:
awk 'FNR==3 {print FILENAME,$0}' MiseqData/result2012/12*
This might work for you (GNU sed & parallel):
parallel -k sed -n '3p\;3q' {} ::: file1 file2 file3
Parallel applies the sed command to each file and returns the results in order.
N.B. All files will only be read up to the 3rd line.
Also, you may be tempted (as I was) to use:
sed -ns '3p;3q' file1 file2 file3
but this will only return the first file, because the q command terminates sed itself; once it quits on the first file, the remaining files are never read.
As we know, FNR holds the line number within the current file, so we can run this command to get the 3rd line of every file:
awk 'FNR==3' MiseqData/result2012/12*

How to skip first line of a file and read the remaining lines as input of a C program?

How to write the shell command to skip the first line in file a.csv and redirect the remaining lines as input to myProgram, which is my C program?
I wrote
./myProgram < a.csv | tail -n + 2
But this does not work, it seems like it will skip the first line of the output from myProgram.
Erm...
tail -n +2 a.csv | ./myProgram
If you want to skip the first line, the traditional solution is sed:
sed -e 1d a.csv | ./myProgram
If your shell is Bash, it supports process substitution: a mechanism that lets you treat the output of a command just like a file. So instead of what you wrote, you can use
./myProgram < <(tail -n +2 a.csv)
What your command did instead was to use the complete file a.csv as the input to myProgram, then pipe the output to tail -n + 2 (did you really use a space between + and 2?).
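For example, with a header line on standard input:
$ printf 'header\nrow1\nrow2\n' | tail -n +2
row1
row2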

Retrieve last 100 lines logs

I need to retrieve last 100 lines of logs from the log file.
I tried the sed command
sed -n -e '100,$p' logfilename
Please let me know how can I change this command to specifically retrieve the last 100 lines.
You can use tail command as follows:
tail -100 <log file> > newLogfile
Now the last 100 lines will be present in newLogfile.
Edit:
More recent versions of tail, as mentioned by twalberg, use:
tail -n 100 <log file> > newLogfile
"tail" is command to display the last part of a file, using proper available switches helps us to get more specific output. the most used switch for me is -n and -f
SYNOPSIS
tail [-F | -f | -r] [-q] [-b number | -c number | -n number] [file ...]
Here
-n number : The location is number lines.
-f : The -f option causes tail to not stop when end of file is reached, but rather to wait for additional data to be appended to the input. The -f option is ignored if the standard input is a pipe, but not if it is a FIFO.
To get the last 100 lines (static):
tail -n 100 <file path>
To follow the last 100 lines in real time:
tail -f -n 100 <file path>
You can simply use the following command:
tail -NUMBER_OF_LINES FILE_NAME
e.g. tail -100 test.log
will fetch the last 100 lines from test.log.
If you want the output of the above in a separate file, you can use redirection as follows:
tail -NUMBER_OF_LINES FILE_NAME > OUTPUT_FILE_NAME
e.g. tail -100 test.log > output.log
will fetch the last 100 lines from test.log and store them in a new file, output.log.
The sed script that prints the last 100 lines can be found in the sed documentation (https://www.gnu.org/software/sed/manual/sed.html#tail):
$ cat sed.cmd
# keep a sliding window: append each new line to the hold space, then copy it back
1! {; H; g; }
# once past line 100, drop the oldest line from the front of the window
1,100 !s/[^\n]*\n//
# save the trimmed window for the next cycle
$!h
# at the end of input, print the window
$p
$ sed -nf sed.cmd logfilename
For me that is far more difficult than your script, so
tail -n 100 logfilename
is much, much simpler. And it is quite efficient: it will not read the whole file if that is not necessary. See my answer with an strace report for tail ./huge-file: https://unix.stackexchange.com/questions/102905/does-tail-read-the-whole-file/102910#102910
I know this is very old, but for whoever it may help:
less +F my_log_file.log
That is just the basics; with less you can do far more powerful things: once you are viewing the log you can search for a pattern, jump to a line number, and much more, and it is faster for large files.
It is like vim for logs (totally my opinion).
Original less documentation: https://linux.die.net/man/1/less
less cheatsheet: https://gist.github.com/glnds/8862214
len=$(wc -l < filename)   # total number of lines in the file
l=$(( len - 99 ))         # start 99 lines before the last one
sed -n "${l},${len}p" filename
The first line gets the total line count; subtracting 99 gives the starting line, so the sed range covers exactly the last 100 lines.
I hope this will help you.

Find line number in a text file - without opening the file

In a very large file I need to find the position (line number) of a string, then extract the 2 lines above and below that string.
To do this right now, I launch vi, find the string, note its line number, exit vi, then use sed to extract the lines surrounding that string.
Is there a way to streamline this process... ideally without having to run vi at all.
Maybe using grep like this:
grep -n -2 your_searched_for_string your_large_text_file
will give you almost what you expect:
-n : tells grep to print the line number
-2 : print 2 additional lines (and the wanted string, of course)
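For example, against a hypothetical log file (GNU grep marks the matching line with : and the context lines with -):
$ grep -n -2 ERROR app.log
12-starting worker
13-opening socket
14:ERROR connection refused
15-retrying in 5s
16-worker stopped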
You can do
grep -C 2 yourSearch yourFile
To send it in a file, do
grep -C 2 yourSearch yourFile > result.txt
Use grep -n string file to find the line number without opening the file.
You can use cat -n to prepend line numbers, grep for the word, and then use awk to extract the line number:
cat -n FILE | grep WORD | awk '{print $1;}'
although grep already does what you mention if you give -C 2 (above/below 2 lines):
grep -C 2 WORD FILE
You can do it with grep -A and -B options, like this:
grep -B 2 -A 2 "searchstring" yourFile | sed 3d
grep will find the line and show two lines of context before and after it; sed then removes the third line (the match itself), leaving only the surrounding lines.
If you want to automate this, simple you can do a Shell Script. You may try the following:
#!/bin/bash
VAL="your_search_keyword"
NUM1=$(grep -n "$VAL" file.txt | cut -f1 -d':')
echo "$NUM1"                    # show the line number of the matched keyword
MYNUMUP=$((NUM1 - 1))           # line above the keyword
MYNUMDOWN=$((NUM1 + 1))         # line below the keyword
sed -n "${MYNUMUP}p" file.txt   # display the line above the keyword
sed -n "${MYNUMDOWN}p" file.txt # display the line below the keyword
The plus point of the script is that you can change the keyword in the VAL variable as you like and rerun it to get the needed output.
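For example, with VAL="error" and a hypothetical three-line file.txt, the script (saved as script.sh; the name is arbitrary) prints:
$ cat file.txt
alpha
error here
beta
$ ./script.sh
2
alpha
beta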

Print a file, skipping the first X lines, in Bash [duplicate]

This question already has answers here:
How can I remove the first line of a text file using bash/sed script?
(19 answers)
Closed 3 years ago.
I have a very long file which I want to print, skipping the first 1,000,000 lines, for example.
I looked into the cat man page, but I did not see any option to do this. I am looking for a command to do this or a simple Bash program.
You'll need tail. Some examples:
$ tail great-big-file.log
< Last 10 lines of great-big-file.log >
If you really need to SKIP a particular number of "first" lines, use
$ tail -n +<N+1> <filename>
< filename, excluding first N lines. >
That is, if you want to skip N lines, you start printing line N+1. Example:
$ tail -n +11 /tmp/myfile
< /tmp/myfile, starting at line 11, or skipping the first 10 lines. >
If you want to just see the last so many lines, omit the "+":
$ tail -n <N> <filename>
< last N lines of file. >
Easiest way I found to remove the first ten lines of a file:
$ sed 1,10d file.txt
In the general case where X is the number of initial lines to delete, credit to commenters and editors for this:
$ sed 1,Xd file.txt
If you have GNU tail available on your system, you can do the following:
tail -n +1000001 huge-file.log
It's the + character that does what you want. To quote from the man page:
If the first character of K (the number of bytes or lines) is a
`+', print beginning with the Kth item from the start of each file.
Thus, as noted in the comment, putting +1000001 starts printing with the first item after the first 1,000,000 lines.
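For example, with a six-line file:
$ seq 6 > six.txt
$ tail -n +3 six.txt
3
4
5
6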
If you want to skip the first two lines:
tail -n +3 <filename>
If you want to skip the first x lines:
tail -n +$((x+1)) <filename>
A less verbose version with AWK:
awk 'NR > 1e6' myfile.txt
But I would recommend using integer numbers.
Use the sed delete command with a range address. For example:
sed 1,100d file.txt # Print file.txt omitting lines 1-100.
Alternatively, if you want to only print a known range, use the print command with the -n flag:
sed -n 201,300p file.txt # Print lines 201-300 from file.txt
This solution should work reliably on all Unix systems, regardless of the presence of GNU utilities.
Use:
sed -n '1d;p'
This command will delete the first line and print the rest.
If you want to see the first 10 lines you can use sed as below:
sed -n '1,10 p' myFile.txt
Or if you want to see lines from 20 to 30 you can use:
sed -n '20,30 p' myFile.txt
Just to propose a sed alternative. :) To skip the first one million lines, try | sed '1,1000000d'.
Example:
$ perl -wle 'print for (1..1_000_005)'|sed '1,1000000d'
1000001
1000002
1000003
1000004
1000005
You can do this using the head and tail commands:
head -n <num> | tail -n <lines to print>
where num is 1e6 + the number of lines you want to print.
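For example, to print the 5 lines that come after the first 1,000,000 (assuming the file is long enough):
head -n 1000005 myfile.txt | tail -n 5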
This shell script works fine for me:
#!/bin/bash
awk -v initial_line="$1" -v end_line="$2" '{
  if (NR >= initial_line && NR <= end_line)
    print $0
}' "$3"
Used with this sample file (file.txt):
one
two
three
four
five
six
The command (it will extract from second to fourth line in the file):
edu@debian5:~$ ./script.sh 2 4 file.txt
Output of this command:
two
three
four
Of course, you can improve it, for example by testing that all argument values are as expected :-)
cat <file> | awk '{if (NR > 6) print $0}'
I needed to do the same and found this thread.
I tried "tail -n +, but it just printed everything.
The more +lines worked nicely on the prompt, but it turned out it behaved totally different when run in headless mode (cronjob).
I finally wrote this myself:
skip=5
FILE="/tmp/filetoprint"
# print everything after the first $skip lines
tail -n "$(( $(wc -l < "$FILE") - skip ))" "$FILE"
