Extracting location-specific records with sed based on given criteria in a bash script - linux

I want to extract data from an ASCII file that looks like the one provided here. The block starting with 1NAME can repeat any number of times; I have files where there is only one block, and some files with as many as 744:
AVERAGE MODELNAME -- RUNNAME
0 1 11121 0. 11122 24.
-9700000 4000000 0 -241200000000 -1620000
1.00000 1000.00000 10 10 1 2 0 15. 11. 0.
1 1 500 400
NAME
11121 0.00 11121 1.00
1NAME
0.0000000E+00 0.0000000E+00 0.0000000E+00 0.0000000E+00 0.0000000E+00
0.0000000E+00 0.0000000E+00 0.0000000E+00 0.0000000E+00 0.0000000E+00
0.0000000E+00 0.0000000E+00 0.0000000E+00 0.0000000E+00 0.0000000E+00
NAME
11121 1.00 11121 2.00
1NAME
1.0000000E+00 45.0000000E+00 01.0000000E+00 115.0000000E+00 5.0000000E+00
2.0000000E+00 66.0000000E+00 09.0000000E+00 180.0000000E+00 4.0000000E+00
3.0000000E+00 80.0000000E+00 70.0000000E+00 130.0000000E+00 5.0000000E+00
I would like to (1) extract values from a given recurring location in the file, starting after "1NAME", (2) pipe the output to a text file with a header that identifies which location it was pulled from, and (3) write code that can take input for multiple locations (say records 1, 5, 8) after 1NAME and write them into separate outputs (for example: one output file for all records at location 1, one output file for location 5, ...).
As an example, say I want to grab records 1, 5 and 8 after 1NAME in the given input file. Each record should be written to a separate record-specific text file labeled GRID#.txt, as follows:
GRID 1
0.0000000E+00
00.0000000E+00
GRID 5
0.0000000E+00
5.0000000E+00
GRID 8
0.0000000E+00
09.0000000E+00
I was able to extract data one location at a time using sed. However, I need to extract data from multiple locations in the input file, so I tried to put all the information in a script. Here are the steps I took.
The input file has multiple whitespaces and inconsistent blank lines, so I first used sed to replace each run of whitespace with a newline, and then, on the piped output of that step, removed all blank lines. This left the data arranged as one value per row.
sed 's/\s\+/\n/g' input.txt | sed '/^$/d'
To extract data, I then used a sed command of the following form on the piped output from step 1.
sed -n -e 11p -e 50p
I tried to put all these commands into a bash (or csh, either option) script with custom row numbers. I naively tried using foreach and then learned that it cannot be used within bash. I will be using fellow users' recommended scripts instead.
#!/bin/bash
set FILE=$cwd/sample_or_2day
foreach GRID (23729)
foreach GRIDTIME(28 41)
sed 's/\s\+/\n/g' $FILE | sed '/^$/d' | sed '1,36d' > temp_out
sed -n -e "$GRIDTIME" temp_out | tee $cwd/out_$GRID
Thanks for your patience. I am a nervous programmer trying to master the basics. I have spent time looking at sed instruction pages and user support forums. Any recommendations are welcome, especially with explicit instructions. Thanks!

You provided your attempt at a csh script, but tagged your question as bash. I am answering with a bash script.
The core of your question is how to extract information from a formatted printout. In general, one should avoid such situations: one should use programming environments which are aware of the data structures being manipulated, in order to avoid re-parsing at each step. In the real world, however, such situations arise very often, and one has to cope with them.
Your approach of transforming all whitespace into newlines works in your case. Instead of multiple sed commands, the fastest way to achieve it is
tr -s ' ' '\n'
(the -s option squeezes multiple occurrences of the target character into one, eliminating blank lines)
Then, you are interested in the 7th and 14th lines after each occurrence of a line containing 1NAME. This is done in sed by
sed -n -e '/^1NAME$/{n;n;n;n;n;n;n;p;n;n;n;n;n;n;n;p}'
which means: when you see 1NAME, execute the next-line command (n) seven times, then execute the print command (p); doing this twice yields the 7th and the 14th line.
You could use a shell variable:
next7='n;n;n;n;n;n;n;p'
And
cat ./sample_or_2day | tr -s ' ' '\n' | sed -n -e '/^1NAME$/'"{$next7;$next7}"
would produce
0.0000000E+00
0.0000000E+00
66.0000000E+00
130.0000000E+00
Correct: the first block was also taken in. To skip it, let's add the sed instruction you had already figured out, -e1,36d.
$ cat ./sample_or_2day | tr -s ' ' '\n' | sed -n -e1,36d -e'/^1NAME$/'"{$next7;$next7}"
66.0000000E+00
130.0000000E+00
You might also want bash to construct the sed command line for you: for instance, the command
sed -n -e{7..29..7}p
would be expanded by the shell as
sed -n -e7p -e14p -e21p -e28p
which, as you know, means that sed would print those input lines only.
You might also want to learn about for loops in bash, which come in two different flavors, for example:
for var in word1 word2 word3 ...; do ... ; done
for (( i=0; i<10; i++ )); do ...; done
Now, it is not clear to me how you want to manage your output files. I provide a bash version of your script (providing a list of values for GRID, instead of just one), which shows another possible brace expansion in bash.
#!/bin/bash
FILE=./sample_or_2day
for GRID in 23729 23755 23768; do
cat "$FILE" | tr -s ' ' '\n' | sed -n -e{28,41}p >> "./out_$GRID"
done
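If, as in your example, "record N" means the N-th value after each 1NAME marker, here is a minimal sketch of one way to produce the per-location output files GRID#.txt with their headers. It assumes one value per line after the tr step and that each requested record number is smaller than the number of values in a block; the grid numbers 1, 5 and 8 are taken from your example.
#!/bin/bash
FILE=./sample_or_2day
for GRID in 1 5 8; do
echo "GRID $GRID" > "GRID${GRID}.txt"
tr -s ' \t' '\n' < "$FILE" | sed '/^$/d' |
awk -v g="$GRID" '
/^1NAME$/ { n = 0; inblock = 1; next }      # reset the value counter at each block
inblock && ++n == g { print; inblock = 0 }  # print the g-th value, then wait for the next block
' >> "GRID${GRID}.txt"
done
Each output file then starts with the GRID header, followed by one value per 1NAME block in file order.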

This is what has worked for me so far, but not as a bash script:
sed 's/\s\+/\n/g' ./sample_or_2day | sed '/^$/d' | sed '1,36d'| sed -n -e{23724..194842..97421}p > './out'
In the above script:
sed 's/\s\+/\n/g' -> replaces each run of whitespace with a newline (one value per row)
sed '/^$/d' -> deletes blank line from piped output
sed '1,36d' -> removes line 1-36 from piped output
sed -n -e{23724..194842..97421}p -> prints the record starting at line 23724 and at intervals of 97421 lines, up to line 194842
'./out' -> outputs to file labeled out
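One caveat if you turn this into a script: brace expansion happens before variable expansion, so something like -e{$start..$end..$step}p will not expand. A sketch of a workaround that builds the same argument list with seq (GNU coreutils) and a bash array:
start=23724 stride=97421 end=194842
args=()
for l in $(seq "$start" "$stride" "$end"); do
args+=(-e "${l}p")    # one -e <line>p expression per location
done
sed 's/\s\+/\n/g' ./sample_or_2day | sed '/^$/d' | sed '1,36d' | sed -n "${args[@]}" > ./out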

If you're open to adding a Python dependency, you may find this helpful:
http://stromberg.dnsalias.org/~strombrg/context-split.html
Or try awk, replacing /^BEGIN/ and /^END/ with your own regexes:
#!/bin/sh
awk '
BEGIN { show=0 }
/^END/ { show=0 }
{ if (show==1) print $0 }
/^BEGIN/ { show=1 }' "$@"
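For the file format in this question, a possible adaptation (a sketch, assuming every data block runs from a 1NAME marker to the next NAME header or end of file) would be:
#!/bin/sh
awk '
BEGIN { show=0 }
/^ *NAME/ { show=0 }
{ if (show==1) print $0 }
/^ *1NAME/ { show=1 }' "$@"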

Related

Make grep exact-match strings with and without dash "-"

The problem looks simple and common, so I've looked through many answers, but it seems that none of them provides an appropriate general solution.
I need to grep a large tab-separated 6-column file (a *.bed file, in fact) to split it by the content of the first column, using a list of string variables (items). I just need the rows starting with a given string.
I was successfully using
grep -w "$name" inputfile
$name is read from the list of strings
for that purpose, until I hit the case where the strings have the following format (example): YAL038W but also YAL038W-A, YAL038W-B, ...
So grep with the -w option considers YAL038W identical to YAL038W-A and YAL038W-B, since "-" is a word separator. It would work with "_" but not with "-".
I've found solutions based on awk which are working fine, for example:
awk -F $'\t' -vsearch=$name '$1==search' inputfile
but awk is terribly slow, over 10 times slower; see the time measurements below.
For a 2.5 GB input file and more than 5000 items to look for, the script has already been running for more than 24 hours!
Example of inputfile:
YAL038W-A 0 48 HWI-1KL176:101:CC27NACXX:3:2208:17646:92047 0 +
YAL038W-A 0 48 HWI-1KL176:101:CC27NACXX:3:2211:17326:31268 0 +
YAL038W 1 50 HWI-1KL176:101:CC27NACXX:8:1205:16311:19319 3 +
YAL038W 1 27 HWI-1KL176:101:CC27NACXX:8:2103:4951:94527 42 +
time grep -w "YAL038W" inputfile > testfile.txt
real 0m3.569s
time awk -F $'\t' -vsearch="YAL038W" '$1==search' inputfile > testfile.txt
real 0m29.521s
I am looking for a FAST solution using grep or something else, and I need to pass the variable to this command in a loop.
An alternative is to modify the input file by replacing "-" with "_", but that is a last resort, I believe...
Thanks in advance
I've found solutions based on awk which are working fine, for example:
awk -F $'\t' -vsearch=$name '$1==search' inputfile
but awk is terribly slow…
I am looking for FAST solution using grep …
If the above awk command worked for you, then this will do:
grep "^$name"$'\t' inputfile
Just search at the beginning of each line for the name followed by a TAB.
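To run this over your whole list of names (names.txt here is a hypothetical file with one identifier per line), writing one output file per name:
while IFS= read -r name; do
grep "^${name}"$'\t' inputfile > "out_${name}.bed"
done < names.txt
Note that grep still treats $name as a regular expression; that is fine for identifiers like YAL038W-A, where the dash is literal, but names containing regex metacharacters would need escaping.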

Delete lines from a file matching first 2 fields from a second file in shell script

Suppose I have setA.txt:
a|b|0.1
c|d|0.2
b|a|0.3
and I also have setB.txt:
c|d|200
a|b|100
Now I want to delete from setA.txt the lines whose first 2 fields match a line in setB.txt, so the output should be:
b|a|0.3
I tried:
comm -23 <(sort setA.txt) <(sort setB.txt)
But equality is defined over the whole line, so it won't work. How can I do this?
$ awk -F\| 'FNR==NR{seen[$1,$2]=1;next;} !seen[$1,$2]' setB.txt setA.txt
b|a|0.3
This reads through setB.txt just once, extracts the needed information from it, and then reads through setA.txt while deciding which lines to print.
How it works
-F\|
This sets the field separator to a vertical bar, |.
FNR==NR{seen[$1,$2]=1;next;}
FNR is the number of lines read so far from the current file and NR is the total number of lines read. Thus, when FNR==NR, we are reading the first file, setB.txt. If so, set the value of the associative array seen to true, 1, for the key consisting of fields one and two. Lastly, skip the rest of the commands and start over on the next line.
!seen[$1,$2]
If we get to this command, we are working on the second file, setA.txt. Since ! means negation, the condition is true if seen[$1,$2] is false which means that this combination of fields one and two was not in setB.txt. If so, then the default action is performed which is to print the line.
This should work:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p' setB.txt |sed -f- setA.txt
How this works:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p'
generates an output:
/^c|d/d
/^a|b/d
which is then used as a sed script for the next sed after the pipe and outputs:
b|a|0.3
(IFS=$'|'; cat setA.txt | while read x y z; do grep -q -P "\Q$x|$y|\E" setB.txt || echo "$x|$y|$z"; done; )
Explanation: grep -q means only test whether grep can find the regexp, without printing anything; -P means use Perl syntax, so that the | is matched literally, thanks to the \Q...\E construct.
IFS=$'|' makes bash use | instead of whitespace (SPC, TAB, etc.) as the token separator.
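If your grep lacks -P, a sketch of the same idea with -F (fixed-string matching), which needs no \Q...\E escaping:
while IFS='|' read -r x y z; do
grep -qF "$x|$y|" setB.txt || printf '%s|%s|%s\n' "$x" "$y" "$z"
done < setA.txt
Like the original, the match is not anchored to the start of the line, so this assumes the "field1|field2|" prefix does not occur elsewhere within a line.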

How to delete 5 lines before and 6 lines after pattern match using Sed?

I want to search for a pattern "xxxx" in a file and delete the 5 lines before this pattern and the 6 lines after the match. How can I do this using sed?
This might work for you (GNU sed):
sed ':a;N;s/\n/&/5;Ta;/xxxx/!{P;D};:b;N;s/\n/&/11;Tb;d' file
Keep a rolling window of the current line plus the five before it; on encountering the specified string, append six more lines (twelve in total) and delete them all.
N.B. This is a barebones solution and will most probably need tailoring to your specific needs. Questions to consider: what if the string occurs multiple times throughout the file? What if it occurs within the first five lines, or two occurrences fall within five lines of each other, and so on.
Here's one way you could do it using awk. I assume that you also want to delete the line itself and that the file is small enough to fit into memory:
awk '{a[NR]=$0}/xxxx/{f=NR}END{for(i=1;i<=NR;++i)if(i<f-5||i>f+6)print a[i]}' file
Store every line into the array a. When the pattern /xxxx/ is matched, save the line number. After the whole file has been processed, loop through the array, only printing the lines you want to keep.
Alternatively, you can use grep to obtain the line number first:
grep -n 'xxxx' file | awk -F: 'NR==FNR{f=$1}NR<f-5||NR>f+6' - file
In both cases, the lines deleted will be surrounding the last line where the pattern is matched.
A third option would be to use grep to obtain the line number then use sed to delete the lines:
line=$(grep -nm1 'xxxx' file | cut -d: -f1)
sed "$((line-5)),$((line+6))d" file
In this case I've also added the -m switch so grep exits after finding the first match.
If you know the line number (which is not difficult to obtain), you can use something like this:
filename="test"
start=`expr $curr_line - 5`
end=`expr $curr_line + 6`
sed "${start},${end}d" $filename (optionally sed -i)
Of course, you have to remember additional conditions: start shouldn't be less than 1, and end shouldn't be greater than the number of lines in the file.
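A sketch of those guards in the same style (using expr, as above; wc -l supplies the line count):
total=`wc -l < "$filename"`
start=`expr $curr_line - 5`
end=`expr $curr_line + 6`
test "$start" -lt 1 && start=1
test "$end" -gt "$total" && end="$total"
sed "${start},${end}d" "$filename"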
Another, perhaps easier to follow, solution would be to use grep to find the keyword and the corresponding line:
grep -n 'KEYWORD' <file>
then use sed to get the line number only like this:
grep -n 'KEYWORD' <file> | sed 's/:.*//'
Now that you have the line number simply use sed like this:
sed -i "$(LINE_START),$(LINE_END) d" <file>
to remove lines before and/or after! With only the -i you will override the <file> (no backup).
A script example could be:
#!/bin/bash
KEYWORD=$1
LINES_BEFORE=$2
LINES_AFTER=$3
FILE=$4
LINE_NO=$(grep -n "$KEYWORD" "$FILE" | sed 's/:.*//')
echo "Keyword found in line: $LINE_NO"
LINE_START=$(($LINE_NO-$LINES_BEFORE))
LINE_END=$(($LINE_NO+$LINES_AFTER))
echo "Deleting lines $LINE_START to $LINE_END!"
sed -i "$LINE_START,$LINE_END d" $FILE
Please note that this will work only if the keyword is found once! Adapt the script to your needs!
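If the keyword can occur more than once, a minimal adaptation is to keep only the first match, as an earlier answer did with grep's -m1 option:
LINE_NO=$(grep -nm1 "$KEYWORD" "$FILE" | cut -d: -f1)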

space/tab/newline insensitive comparison

Suppose I have these two files:
File 1:
1 2 3 4 5 6 7
File 2:
1
2
3
4
5
6
7
Is it possible to use diff to compare these two files so that they are reported as equal?
(Or if not, what other tools should I use?)
Thanks
You could collapse whitespace so file2 looks like file1, with every number on the same line:
$ cat file1
1 2 3 4 5 6 7
$ cat file2
1
2
4
3
5
6
7
$ diff <(echo $(< file1)) <(echo $(< file2))
1c1
< 1 2 3 4 5 6 7
---
> 1 2 4 3 5 6 7
Explanation:
< file # Equivalent to "cat file", but slightly faster since the shell doesn't
# have to fork a new process.
$(< file) # Capture the output of the "< file" command. Can also be written
# with backticks, as in `< file`.
echo $(< file) # Echo each word from the file. This will have the side effect of
# collapsing all of the whitespace.
<(echo $(< file)) # An advanced way of piping the output of one command to another.
# The shell opens an unused file descriptor (say fd 42) and pipes
# the echo command to it. Then it passes the filename /dev/fd/42 to
# diff. The result is that you can pipe two different echo commands
# to diff.
Alternately, you may want to make file1 look like file2, with each number on separate lines. That will produce more useful diff output.
$ diff -u <(printf '%s\n' $(< file1)) <(printf '%s\n' $(< file2))
--- /dev/fd/63 2012-09-10 23:55:30.000000000 -0400
+++ file2 2012-09-10 23:47:24.000000000 -0400
@@ -1,7 +1,7 @@
1
2
-3
4
+3
5
6
7
This is similar to the first command with echo changed to printf '%s\n' to put a newline after each word.
Note: Both of these commands will fail if the files being diffed are overly long. This is because of the limit on command-line length. If that happens then you will need to workaround this limitation, say by storing the output of echo/printf to temporary files.
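A sketch of that workaround: splitting with tr into temporary files sidesteps the command-line length limit entirely, since nothing is expanded onto a command line (the temporary file names here are arbitrary):
tr -s ' \t' '\n' < file1 > /tmp/file1.words
tr -s ' \t' '\n' < file2 > /tmp/file2.words
diff -u /tmp/file1.words /tmp/file2.words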
Some diffs have -b (ignore blanks) and -w (ignore whitespace), but as unix utilities are all line-oriented, I don't think whitespace will include \n chars.
Double-check that your version of diff doesn't have some fancy GNU options, with diff --help | less or man diff.
Is your formatting above correct, with file 1's data all on one line? You could force file2 to match that format with
awk '{printf"%s ", $0}' file2
Or as mentioned in comments, convert file 1
awk '{for (i=1;i<=NF;i++) printf("%s\n", $i)}' file1
But I'm guessing that your data isn't really that simple. Also, there are likely line-length limitations that will appear when you can least afford the time to deal with them.
Probably not what you want to hear, but diffing complicated stuff like source code is not an exact science. So, if you still need help, create a slightly more complicated test case and add it to your question.
Finally, you'll need to show us what you'd expect the output of such a diff project to look like. Right now I can't see any meaningful way to display such differences for a non-trivial case.
IHTH
If it turns out the data is indeed simple enough not to run into those limitations, and the only difference between the files is that the first one separates values by spaces and the second by newlines, you can also use process substitution (as suggested above), but with sed replacing the spaces in the first file with newlines:
diff <(sed 's/ /\n/g' file1) file2

How to remove words from a file in UNIX?

first file of information page
name/joe/salary1 50 10 2
name/don/miles2
20 4 3
name/sam/lb3 0 200 50
Can someone please tell me how I can remove all the words in the above file, so my output looks as follows:
50 10 2
20 4 3
0 200 50
Use awk instead. The following code goes through each field and checks whether it is numeric; if it is, it prints it. No complicated regex needed.
$ awk '{for(i=1;i<=NF;i++) if($i+0==$i) printf "%s ", $i; print ""}' file
50 10 2
20 4 3
0 200 50
sed -e "s/[a-zA-Z/]/ /g" file
will do it, though I like codaddict's way more if you want to preserve the numbers and whitespace. This way strips out all letters and the '/' symbol, replacing each of them with a space.
If you want to modify the file in place, pass the -i switch. As written, the command prints what the file would look like.
Looks like you want to preserve only the digits and spaces. If yes, you can do:
sed 's/[^0-9 ]//g' inputFile
EDIT: The requirements changed: if a digit is found attached to a letter, it should be treated as part of the word.
This Perl script does it:
perl -ne 's/(?:\d*[a-z\/]+\d*)*//g;print' input
If your file has this structure, I suggest first filtering out the first line, then removing all characters from the beginning of each line up to the first space:
sed -ni '2,$s/^[^ ]*//p' file
Remove everything on each line up to the first space character (this also removes leading spaces):
sed 's/\S*\s*//' file
