Shell: capture a substring on the line below a searched string

I am searching through a text file for a particular string and want to capture a number that appears on the line below that string. Here is an example to make it clearer.
This is the content of the text file:
2017-08-14 14:04:53,836 INFO - XML File FILE1 is created in /path/to/file
2017-08-14 14:10:04,696 INFO - #Instances Extracted: 32960
2017-08-14 14:17:52,248 INFO - XML File FILE2 is created in /path/to/file
2017-08-14 14:41:33,720 INFO - #Instances Extracted: 119534
In the text file I want to search for the string FILE1 and capture the number on the line below it (32960).
What is the best method for this? I was considering searching for FILE1 and then searching for the first instance of "Instances Extracted" after it and capturing the number that follows. Is this the best solution?
Many thanks for any help you can provide.

You can use awk without getline():
awk 'p==1 {p=0; print $NF } /FILE1/ {p=1}' inputfile
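Here p acts as a one-shot flag: the /FILE1/ rule arms it, and on the following line the first rule prints the last field ($NF) and disarms it, so the sample log yields 32960. A sketch of a parameterized variant, where the tag variable is my own addition:
awk -v tag="FILE1" 'p {p=0; print $NF} $0 ~ tag {p=1}' inputfile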

A quick and dirty solution:
grep -A1 FILE1 file.txt | sed "s/.*Instances Extracted: \([0-9]*\).*/\1/;tx;d;:x"
The grep -A1 pulls the matching line and the next one into stdout, and the sed then extracts the number (the ;tx;d;:x at the end deletes any lines that don't match).
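The same can also be done in sed alone; a minimal sketch, assuming the count is always the text after the last ": " on the line that follows the match:
sed -n '/FILE1/{n;s/.*: //p;}' file.txt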

Related

Filtering large data file by date using command line

I have a csv file that contains a bunch of data, with one of the columns being a date. I am trying to extract all lines that have dates in a specific year and save them into a new file.
The format of file is like this with the date and time in the second column:
000000000,10/04/2021 02:10:15 AM,.....
So far I tried:
grep -E ^2020 data.csv >> temp.csv
But it just produced an empty temp.csv. Any ideas on how I can do this?
One potential solution is with awk:
awk -F"," '$2 ~ /\/2020 /' data.csv > temp.csv
Another potential option is with grep:
grep "\/2020 " data.csv > temp.csv
However, the grep solution may detect "/2020 " elsewhere in the file, rather than in column 2.
An awk solution is best here, e.g.
awk -F, 'index($2, "/2021 ")' file
grep can also be used here:
grep '^[^,]*,[^,]*/2021 ' file
Notes:
awk -F, 'index($2, "/2021 ")' splits the lines (records) into fields with a comma (see -F,), and if there is a /2021 + space in the second field ($2) the line is printed
the ^[^,]*,[^,]*/2021 pattern in the grep command matches
^ - start of string
[^,]* - zero or more non-comma chars
,[^,]* - a , and zero or more non-comma chars
/2021 - a literal substring.
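If the year needs to be a parameter, for example inside a script, the awk version can take it via -v; a sketch, where year is a hypothetical shell variable:
year=2021
awk -F, -v y="/$year " 'index($2, y)' data.csv > temp.csv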

Read in file line by line and search another file for a line with a partial match

I have a file of partial matches to lines in another file. I was looking to write a while loop with read that substitutes each line of partial matches into a grep command searching a database file, but for some reason I am not getting any output (outputfile.txt is empty).
Here is my current script
while read -r line; do
grep $line /path/to/databasefile >> /path/to/folder/outputfile.txt
done < "/partial_matches.txt"
The database has multiple entries, each a sequence name followed by a DNA sequence:
>transcript_ab
AGTCAGTCATGTC
>transcript_ac
AGTCAGTCATGTC
>transcript_ad
AGTCAGTCATGTC
and the partial matching search file has lines of text:
ab
ac
and I'm looking for a return of:
>transcript_ab
>transcript_ac
Any help would be appreciated. Thanks.
If you are using GNU grep, then its -f option is what you are looking for:
grep -f /partial_matches.txt /path/to/databasefile
(if partial_matches.txt contains only fixed strings rather than regex patterns, add the -F flag as well)
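Since the expected output contains only header lines, you can also restrict the search to lines starting with > before applying the patterns; a hedged variant of the same idea:
grep '^>' /path/to/databasefile | grep -f /partial_matches.txt > /path/to/folder/outputfile.txt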
You can use a for loop instead:
for i in $(cat partial_matches.txt); do
grep "$i" /path/to/databasefile >> /path/to/folder/outputfile.txt
done
Also, check if you have a typo:
"/partial_matches.txt" -> "./partial_matches.txt"

How to Grep the complete sequences containing a specific motif in a fasta file?

How do I grep the complete sequences containing a specific motif in a fasta or txt file with one Linux command and write them into another file? I also want to include the header lines beginning with ">" that precede the target sequences.
Example: I have a fasta file of 10000 sequences.
$ cat file.fa
>name1
AEDIA
>name2
ALKME
>name3
AAIII
I want to grep sequences containing KME, so I should get:
>name2
ALKME
Below is the approach I currently use, based on the answers I got. Others may find it helpful too. Thanks to Pierre Lindenbaum, Philipp Bayer, cpad0112 and batMan.
First, preprocess the fasta file so that each sequence is on a single line (this is very important):
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < file.fa > file1.fa
Then get rid of the first empty line:
tail -n +2 file1.fa > file2.fa
Finally, extract the target sequences containing the substring, together with their names, and save them to another file:
LC_ALL=C grep -B 1 KME file2.fa > result.txt
Note: KME is used here as an example target substring.
If you have multiline fasta files, first linearize them with awk, then use another awk to filter the sequences containing the motif. Using grep would be dangerous if a sequence name happens to contain a short motif.
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '\t' '{if(index($2,"KME")!=0) printf("%s\n%s\n",$1,$2);}'
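Running this two-awk pipeline on the example file prints the expected pair:
>name2
ALKME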
grep -B1 KME file > output_file
-B1: prints 1 line before the match as well
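For completeness, the preprocessing steps above can be chained into a single pipeline so no intermediate files are needed (a sketch reusing the same commands):
awk '/^>/ {printf("\n%s\n",$0);next;} {printf("%s",$0);} END {printf("\n");}' file.fa | tail -n +2 | grep -B1 KME > result.txt
Note that with multiple matches GNU grep inserts -- separator lines between groups; its --no-group-separator option suppresses them.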

Generate record of files which have been removed by grep as a secondary function of primary command

I asked a question here about removing unwanted lines containing strings that matched a particular pattern:
Remove lines containing string followed by x number of numbers
anubhava provided a good line of code which met my needs perfectly. This code removes any line which contains the string vol followed by a space and three or more consecutive numbers:
grep -Ev '\bvol([[:blank:]]+[[:digit:]]+){2}' file > newfile
The command will be used on a fairly large csv file and be initiated by crontab. For this reason, I would like to keep a record of the lines this command removes, just so I can go back and check that the correct data is being removed. I guess it will be some sort of log containing the lines that did not make the final cut. How can I add this functionality?
Drop grep and use awk instead:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print >> "deleted"; next} 1' file
The above uses GNU awk for word delimiters (\<) and will append every deleted line to a file named "deleted". Consider adding a timestamp too:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print systime(), $0 >> "deleted"; next} 1' file
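As written, the kept lines go to stdout, so as with the original grep command you would redirect them to the new file, e.g.:
awk '/\<vol([[:blank:]]+[[:digit:]]+){2}/{print >> "deleted"; next} 1' file > newfile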

find and replace line strings contained in one file to a second file in shell script

I'm trying to find a solution to the following problem:
I have two files: i.e. file1 and file2.
In file1 there are some lines with key words, and I want to find those lines in file2 using the key words. Once a key word is found in file2, I would like to update that line with the content of the corresponding line in file1. This operation should be done for every line contained in file1.
Here is an example of what I have in mind, though I don't know exactly how to turn it into a shell command.
file1:
key1=new_value1
key2=new_value2
key3=new_value3
etc....
file2:
key1=value1
key2=value2
key3=value3
key4=value4
key5=value5
key6=value6
etc....
Result:
key1=new_value1
key2=new_value2
key3=new_value3
key4=value4
key5=value5
key6=value6
etc....
I don't know how I can use sed or something else in a shell script to accomplish this task.
Any help is welcome.
Thank you
awk would be my first choice:
awk -F= -v OFS== '
NR==FNR {new[$1]=$2; next}
$1 in new {$2=new[$1]}
{print}
' file1 file2
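Here NR==FNR is true only while awk reads the first file, so new[] is filled with the replacement values from file1; while reading file2, any line whose key is in new has its value swapped before printing. If you prefer sed, one hedged alternative is to generate a sed script from file1 (this assumes keys and values contain no characters special to sed):
sed -E 's/^([^=]+)=(.*)$/s|^\1=.*|\1=\2|/' file1 > patch.sed
sed -f patch.sed file2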
