How to grep multiple strings within N lines - linux

I was wondering if there is any way that I could use grep (or any other command) to search for multiple strings within N lines.
Example
Search for "orange", "lime", "banana" all within 3 lines
If the input file is
xxx
a lime
b orange
c banana
yyy
d lime
foo
e orange
f banana
I want to print the three lines starting with a, b, c.
The lines with the searched strings can appear in any order.
I do not want to print the lines d, e, f, as there is a line in between, and so the three strings are not grouped together.

Your question is rather unclear. Here is a simple Awk script which collects consecutive matching lines and prints them if it has collected at least three.
awk '/orange|lime|banana/ { a[++n] = $0; next }
{ if (n>=3) for (i=1; i<=n; i++) print a[i]; delete a; n=0 }
END { if (n>=3) for (i=1; i<=n; i++) print a[i] }' file
It's not clear whether you require all of your expressions to match; this one doesn't attempt to. If you see three successive lines with orange, that's a match, and will be printed.
The logic should be straightforward. The array a collects matches, with n indexing into it. When we see a non-match, we check its length, and print if it's 3 or more, then start over with an empty array and index. This is (clumsily) repeated at end of file as well, in case the file ends with a match.
If you want to permit gaps (so that, say, three successive lines where one matches "orange" and "banana", one matches nothing, and one matches "lime" should still print those three lines; your question is unclear on this), you could change the script to always keep an array of the last three lines, though then you also need to specify how to deal with e.g. a sequence of five lines which matches by these rules.
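A rough sketch of that sliding-window variant (untested; it prints every matching three-line window, so overlapping matches will print some lines more than once) could look like this:
awk '{ buf[NR%3] = $0 }
NR >= 3 {
    w = buf[(NR-2)%3] "\n" buf[(NR-1)%3] "\n" buf[NR%3]
    if (w ~ /orange/ && w ~ /lime/ && w ~ /banana/) print w
}' file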

Similar to tripleee's answer, I would also use awk for this purpose.
The main idea is to implement a simple state machine.
Simple example
As a simple example, first try to find three consecutive lines of banana.
Consider the pattern-action statement
/banana/ { bananas++ }
For every line matching the regex banana, it increases the variable bananas (in awk, all variables are initialised with 0).
Of course, you want bananas to be reset to 0 when there is a non-matching line, so your search starts from the beginning:
/banana/ { bananas++; next }
{ bananas = 0 }
You can also test the values of variables in the pattern part of a pattern-action statement.
For example, if you want to print "Found" after three lines containing banana, extend the rule:
/banana/ {
    bananas++
    if (bananas >= 3) {
        print "Found"
        bananas = 0
    }
    next
}
Once three matching lines in a row have been seen, this prints the string "Found" and resets the variable bananas to 0.
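Putting these pieces together into a complete, runnable command (here file stands for your input file):
awk '
/banana/ {
    bananas++
    if (bananas >= 3) { print "Found"; bananas = 0 }
    next
}
{ bananas = 0 }
' file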
How to proceed further
Using this basic idea, you should be able to write your own awk script that handles all the cases.
First, you should familiarise yourself with awk (pattern, actions, program execution).
Then, extend and adapt my example to fit your needs.
In particular, you probably need an associative array matched, with indices "banana", "orange", "lime".
You set matched["banana"] = $0 when the current line matches /banana/. This saves the current line for later output.
You clear that whole array when the current line does not match any of your expressions.
When all strings are found (matched[s] is not empty for every string s), you can print the contents of matched[s].
I leave the actual implementation to you.
As others have said, your description leaves many corner-cases unclear.
You should figure them out for yourself and adapt your implementation accordingly.
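For orientation only, a bare-bones sketch of this idea (untested; it prints the saved lines in a fixed fruit order, which happens to match the sample, and it deliberately ignores the corner cases just mentioned):
awk '
/orange|lime|banana/ {
    if ($0 ~ /lime/)   matched["lime"]   = $0
    if ($0 ~ /orange/) matched["orange"] = $0
    if ($0 ~ /banana/) matched["banana"] = $0
    if (("lime" in matched) && ("orange" in matched) && ("banana" in matched)) {
        print matched["lime"] "\n" matched["orange"] "\n" matched["banana"]
        delete matched
    }
    next
}
{ delete matched }    # a non-matching line resets the search
' file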

I think you want this:
awk '
/banana/ {banana=3}
/lime/ {lime=3}
/orange/ {orange=3}
(orange>0)&&(lime>0)&&(banana>0){print l2,l1,$0}
{orange--;lime--;banana--;l2=l1;l1=$0}' OFS='\n' yourFile
So, if you see the word banana you set banana=3 so that it remains valid for the next 3 lines. Likewise, if you see lime, give it a three-line window in which to form a group, and similarly for orange.
Now, if all of orange, lime and banana have been seen in the previous three lines, print the second to last line (l2), the last line (l1) and the current line $0.
Finally, decrement the count for each fruit before moving on to the next line, save the current line, and shuffle the previously saved lines back one position in time (l2=l1; l1=$0).
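Running it on the sample input from the question prints exactly the first group and nothing else:
a lime
b orange
c banana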

Related

Get the most frequently occurring phrase (not word) in a file in bash

My file is
cat a.txt
a
b
aa
a
a a
I am trying to get the most frequently occurring phrase (not word).
My code is
tr -c '[:alnum:]' '[\n*]' < a.txt | sort | uniq -c | sort -nr
4 a
1 b
1 aa
1
I need
2 a
1 b
1 aa
1 a a
sort a.txt | uniq -c | sort -rn
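On the sample a.txt this already produces the requested output; only the relative order of the phrases that occur once may differ, since ties are broken by the sort implementation and locale.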
When you say “in Bash”, I’m going to assume that no external programs are allowed in this exercise. (Also, what is a phrase? I’m going to assume that there is one phrase per line and that no extra preprocessing (such as whitespace trimming) is needed.)
frequent_phrases() {
    local -Ai phrases
    local -ai {dense_,}counts
    local phrase
    local -i count i

    while IFS= read -r phrase; do                       # Step 0
        ((++phrases["${phrase}"]))
    done

    for phrase in "${!phrases[@]}"; do                  # Step 1
        ((count = phrases["${phrase}"]))
        ((++counts[count]))
        local -a "phrases_$((count))"
        local -n phrases_ref="phrases_$((count))"
        phrases_ref+=("${phrase}")
    done

    dense_counts=("${!counts[@]}")                      # Step 2

    for ((i = ${#counts[@]} - 1; i >= 0; --i)); do      # Step 3
        ((count = dense_counts[i]))
        local -n phrases_ref="phrases_$((count))"
        for phrase in "${phrases_ref[@]}"; do
            printf '%d %s\n' "$((count))" "${phrase}"
        done
    done
}
frequent_phrases < a.txt
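With the a.txt from the question, this should print something like the following (the order among phrases with equal counts is unspecified, as it depends on Bash's associative-array iteration order):
2 a
1 b
1 aa
1 a a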
Steps taken by the frequent_phrases function (marked in code comments):
Read lines (phrases) into an associative array while counting their occurrences. This yields a mapping from phrases to their counts (the phrases array).
Create a reverse mapping from counts back to phrases. Obviously, this will be a “multimap”, because multiple different phrases can occur the same number of times. To avoid assumptions around separator characters disallowed in a phrase, we store lists of phrases for each count using dynamically named arrays (instead of a single array). For example, all phrases that occur 11 times will be stored in an array called phrases_11.
Besides the map inversion (from (phrase → count) to (count → phrases)), we also gather all known counts in an array called counts. The values of this array (representing how many different phrases occur a particular number of times) are somewhat useless for this task, but its keys (the counts themselves) are a useful representation of a sparse set of counts that can (later) be iterated in sorted order.
We compact our sparse array of counts into a dense array of dense_counts for easy backward iteration. (This would be unnecessary if we were to just iterate through the counts in increasing order. A reverse order of iteration is not that easy in Bash, as long as we want to implement it efficiently, without trying all possible counts between the maximum and 1.)
We iterate through all known counts backwards (from highest to lowest) and for each count we print out all phrases that occur that number of times. Again, for example, phrases that occur 11 times will be stored in an array called phrases_11.
Just for completeness, to print out (also) the extra bits of statistics we gathered, one could extend the printf command like this:
printf 'count: %d, phrases with this count: %d, phrase: "%s"\n' \
"$((count))" "$((counts[count]))" "${phrase}"

Edit values in one column in 4,000,000 row CSV file

I have a CSV file I am trying to edit to add a numeric ID-type column with unique integers from 1 to approximately 4,000,000. Some of the fields already have an ID value, so I was hoping I could just sort those and then fill in the missing ones starting at the largest value + 1. However, I cannot open this file to edit in Excel because of its size (I can only see the maximum of roughly 1,048,000 rows). Is there an easy way to do this? I am not familiar with coding, so I was hoping there was a way to do it manually, similar to Excel's fill-series feature.
Thanks!
Also: I know there are threads on how to edit a large CSV file, but I was hoping for help with this specific task. Thanks!
I basically want to sort the rows based on idnumber and then add unique IDs to the rows that are missing an ID value.
Screenshot of file
One way, using Notepad++ and a plugin named SQL:
Load the CSV in Notepad++
SELECT a+1,b,c FROM data
Hit 'start'
When starting with a file like this:
a,b,c
1,2,3
4,5,6
7,8,9
The results after look like this:
SQL Plugin 1.0.1025
Query : select a+1,b,c from data
Sourcefile : abc.csv
Delimiter : ,
Number of hits: 3
===================================================================================
Query result:
2,2,3
5,5,6
8,8,9
Or, in words, the first column is incremented by 1.
A second solution, using gawk, downloaded from https://www.klabaster.com/freeware.htm#mawk:
D:\TEMP>type abc.csv
a,b,c
1,2,3
4,5,6
7,8,9
D:\TEMP>gawk "BEGIN{ FS=OFS=\",\"; getline; print $0 }{ print $1+1,$2,$3 }" abc.csv
a,b,c
2,2,3
5,5,6
8,8,9
(g)awk is a tool which reads a file line by line. Each line is accessible via $0, and its fields via $1, $2, $3, ..., split on a separator.
This separator is set in my example (FS=OFS=\",\";) in the BEGIN section, which is executed once, before any input is read. Do not get confused by the \": the script is enclosed in double quotes, and a variable (like OFS) is set using double quotes too, so they need to be escaped as \".
The getline; print $0 takes care of the first line of the CSV, which typically holds the column names: it reads the header line and prints it unchanged.
Then, for every line, this piece of code print $1+1,$2,$3 will increment the first column, and print the second and third column.
To extend this second example:
gawk "BEGIN{ FS=OFS=\",\"; getline; print $0 }{ print ($1<5?$1+1:$1),$2,$3 }" abc.csv
The ($1<5?$1+1:$1) checks whether the value of $1 is less than 5 ($1<5); if true, it returns $1+1, otherwise $1. Or, in words, it only adds 1 if the current value is less than 5.
With your data you end up with something like this (untested!):
gawk "BEGIN{ FS=OFS=\",\"; getline; a=42; print $0 }{ if($4+0==0){ a++ }; print ($4<=0?$a:$1),$2,$3 }" input.csv
a=42 sets the starting value for the IDs that need to be filled in (you need to change this to the correct value, i.e. the largest ID already present).
The if($4+0==0){ a++ } will increment the value of a whenever the fourth column is empty or 0 (the $4+0 converts empty values like "" to the numeric value 0).
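For completeness, here is a two-pass variant (untested, written in the same Windows-style quoting as above) that first determines the largest existing ID and then fills every empty ID with the next free number. Like the sketch above, it assumes the ID sits in the fourth column and that the CSV contains no quoted commas; input.csv and output.csv are placeholder names:
gawk "BEGIN{ FS=OFS=\",\" } NR==FNR{ if(FNR>1 && $4+0>max) max=$4+0; next } FNR==1{ print; next } { if($4+0==0) $4=++max; print }" input.csv input.csv > output.csv
The input file is passed twice on purpose: the first pass only computes max, the second pass rewrites the rows and writes them out.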

AWK compare two columns in two separate files

I would like to compare two files and do something like this: if the 5th column in the first file is equal to the 5th column in the second file, I would like to print the whole line from the first file. Is that possible? I searched for the issue but was unable to find a solution :(
The files are tab-separated and I tried something like this:
zcat file1.txt.gz file2.txt.gz | awk -F'\t' 'NR==FNR{a[$5];next}$5 in a {print $0}'
Did anybody try to do a similar thing? :)
Thanks in advance for help!
Your script is fine, but you need to provide each file individually to awk and in reverse order.
$ cat file1.txt
a b c d 100
x y z w 200
p q r s 300
1 2 3 4 400
$ cat file2.txt
. . . . 200
. . . . 400
$ awk 'NR==FNR{a[$5];next} $5 in a {print $0}' file2.txt file1.txt
x y z w 200
1 2 3 4 400
EDIT:
As pointed out in the comments, the generic solution above can be improved and tailored to OP's situation of starting with compressed tab-separated files:
$ awk -F'\t' 'NR==FNR{a[$5];next} $5 in a' <(zcat file2.txt) <(zcat file1.txt)
x y z w 200
1 2 3 4 400
Explanation:
NR is the number of the current record being processed and FNR is the number of the current record within its file. Thus NR == FNR is only true while awk is processing the first file given to it (which in our case is file2.txt).
a[$5] adds the value of the 5th column as an index to the array a. Arrays in awk are associative arrays, but often you don't care about associating a value and just want to build a collection of things. This is a pithy way to make a collection of all the values we've seen in the 5th column of the first file. The next statement, which follows, says to immediately move on to the next record without evaluating any further statements in the awk program.
Summarizing the above, this line says "If you're reading the first file (file2.txt), save the value of column 5 in the array called a and move on to the next record without continuing with the rest of the awk program."
NR == FNR { a[$5]; next }
Hopefully it's clear from the above that the only way we can get past that first line of the awk program is if we are reading the second file (file1.txt in our case).
$5 in a evaluates to true if the value of the 5th column occurs as an index in
the a array. In other words, it is true for every record in file1.txt whose 5th
column we saw as a value in the 5th column of file2.txt.
In awk, when the pattern portion evaluates to true, the accompanying action is
invoked. When there's no action given, as below, the default action is triggered
instead, which is to simply print the current record. Thus, by just saying
$5 in a, we are telling awk to print all the records in file1.txt whose 5th
column also occurs in file2.txt, which of course was the given requirement.
$5 in a
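In other words, because printing the current record is the default action, these two commands behave identically:
awk 'NR==FNR{a[$5];next} $5 in a' file2.txt file1.txt
awk 'NR==FNR{a[$5];next} $5 in a {print $0}' file2.txt file1.txt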

Multilevel parsing using shell command

I have a file in the following format
/////
name 1
start_occurrence:
occurrence 1
occurrence 2
///
name 2
start_occurance:
occurrence 1
occurrence 2
///
name 3
start_occurrence:
occurrence 1
occurrence 2
occurrence 3
All I need is to count the number of occurrences for each name and save the counts in a CSV file. Can I do it using any combination of shell commands? Yes, I can do it programmatically, but I am looking for a bunch of shell commands in a pipelined fashion.
"Names" can be anything; they do not follow a pattern. The only catch is that the line after /// is the name. Also, occurrence does not have a number attached to it; any line that starts with or contains occurrence is of interest.
awk 'c=="THISISNAME"{b=$0;c="";}$1=="///"{c="THISISNAME"}$0~/\<occurrence\>/{a[b]+=1;}END{for (i in a){print i" "a[i]}}' YOUR_FILE_HERE
Explanation:
If the line matches the name-start condition ($1=="///"), set c to THISISNAME.
If this is the name line (c=="THISISNAME"), save the line in b and clear c to mark the end of the name part (c="").
If the line matches the occurrence condition ($0~/\<occurrence\>/), do a[b] += 1.
The array a keeps track of how many occurrences each name has.
awk uses EREs; $0~/ERE/ means $0 matches the regex. The \< and \> word-boundary operators (a GNU extension) correspond to \b in Perl-style regexes.
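Since the question asks for the counts in a CSV file, a minimal variation (untested) is to print a comma instead of a space and redirect the output to a file; counts.csv is just a placeholder name, and if a name can itself contain a comma you would need proper CSV quoting:
awk 'c=="THISISNAME"{b=$0;c="";}$1=="///"{c="THISISNAME"}$0~/\<occurrence\>/{a[b]+=1;}END{for (i in a){print i","a[i]}}' YOUR_FILE_HERE > counts.csv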

Extract rows and substrings from one file conditional on information of another file

I have a file 1.blast with coordinate information like this
1 gnl|BL_ORD_ID|0 100.00 33 0 0 1 3
27620 gnl|BL_ORD_ID|0 95.65 46 2 0 1 46
35296 gnl|BL_ORD_ID|0 90.91 44 4 0 3 46
35973 gnl|BL_ORD_ID|0 100.00 45 0 0 1 45
41219 gnl|BL_ORD_ID|0 100.00 27 0 0 1 27
46914 gnl|BL_ORD_ID|0 100.00 45 0 0 1 45
and a file 1.fasta with sequence information like this
>1
TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>2
GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC
...
>100000
TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTG
I am now searching for a script that takes the first column from 1.blast, extracts the corresponding sequence IDs (= first column $1) plus their sequences from the 1.fasta file, and then removes from each sequence the positions between $7 and $8, meaning for the first two matches the output would be
>1
ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>27620
GTAGATAGAGATAGAGAGAGAGAGGGGGGAGA
...
(please notice that the first three positions from >1 are not in this sequence)
The IDs are consecutive, meaning I can extract the required information like this:
awk '{print 2*$1-1, 2*$1, $7, $8}' 1.blast
This gives me a matrix that contains, in the first column, the row number of the sequence identifier, in the second column the row number of the sequence itself (= the row after the ID row), and then the two coordinates that should be excluded. So it is basically a matrix with all the information needed to decide which elements of 1.fasta shall be extracted.
Unfortunately I do not have much experience with scripting, hence I am now a bit lost: how do I feed these values into, e.g., a suitable sed command?
I can get specific rows like this:
sed -n 3,4p 1.fasta
and the string that I want to remove e.g. via
sed -n 5p 1.fasta | awk '{print substr($0,2,5)}'
But my problem now is: how can I pipe the information from the first awk call into the other commands so that they extract the right rows and then remove the given coordinate range from the sequence rows? So substr isn't the right command; I would need something like remstr(string,start,stop) that removes everything between these two positions from a given string, but I think I could write that in a script of my own. It is mainly the correct piping that is a problem for me.
If you do bioinformatics and work with DNA sequences (or even more complicated things like sequence annotations), I would recommend having a look at Bioperl. This obviously requires knowledge of Perl, but has quite a lot of functionality.
In your case you would want to generate Bio::Seq objects from your fasta-file using the Bio::SeqIO module.
Then, you would need to read the wanted fasta entry numbers and positions into a hash, with the fasta name as the key and, as the value, an array of the two positions for each subsequence you want to extract. If there can be more than one such subsequence per fasta entry, you would have to use an array of arrays as the value for each key.
With this data structure, you could then go ahead and extract the sequences using the subseq method from Bio::Seq.
I hope this is a way to go for you, although I'm sure that this is also feasible with pure bash.
This isn't an answer, it is an attempt to clarify your problem; please let me know if I have gotten the nature of your task correct.
foreach row in blast:
    get the proper (blast[$1]) sequence from fasta
    drop bases (blast[$7..$8]) from sequence
    print blast[$1], shortened_sequence
If I've got your task correct, you are being hobbled by your programming language (bash) and the peculiar format of your data (a record split across rows). Perl or Python would be far more suitable to the task; indeed Perl was written in part because multiple file access in awk of the time was really difficult if not impossible.
You've come pretty far with the tools you know, but it looks like you are hitting the limits of their convenient expressibility.
As both thunk and msw have pointed out, more suitable tools are available for this kind of task, but here is a script that can teach you something about how to handle it with awk:
Content of script.awk:
## Process first file from arguments.
FNR == NR {
    ## Save ID and the range of characters to remove from sequence.
    blast[ $1 ] = $(NF-1) " " $NF
    next
}

## Process second file. For each FASTA id...
$1 ~ /^>/ {
    ## Get number.
    id = substr( $1, 2 )

    ## Read next line (the sequence).
    getline sequence

    ## If the ID is one found in the other file, get ranges and
    ## extract those characters from sequence.
    if ( id in blast ) {
        split( blast[id], ranges )
        sequence = substr( sequence, 1, ranges[1] - 1 ) substr( sequence, ranges[2] + 1 )

        ## Print both lines with the shortened sequence.
        printf "%s\n%s\n", $0, sequence
    }
}
Assuming your 1.blast from the question and a customized 1.fasta to test it:
>1
TCGACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>2
GCATCTGGGCTACGGGATCAGCTAGGCGATGCGAC
>27620
TTTGCGAGCGCGAAGCGACGACGAGCAGCAGCGACTCTAGCTACTGTTTGCGA
Run the script like:
awk -f script.awk 1.blast 1.fasta
That yields:
>1
ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>27620
TTTGCGA
Of course I'm assuming some things, most importantly that the fasta sequences are not longer than one line.
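If your real FASTA file does wrap sequences over several lines, one common workaround (a sketch, untested) is to join each sequence onto a single line first and feed the flattened file to the script instead of 1.fasta; 1.oneline.fasta below is just a placeholder name:
awk '/^>/ { if (seq != "") print seq; print; seq = ""; next } { seq = seq $0 } END { if (seq != "") print seq }' 1.fasta > 1.oneline.fasta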
Updated the answer:
awk '
## First file (1.fasta): remember each sequence, keyed by its ID.
NR==FNR && NF {
    id = substr($1, 2)
    getline seq
    a[id] = seq
    next
}
## Second file (1.blast): cut the range $7..$8 out of the stored sequence.
($1 in a) && NF {
    ## substr(s, start, length), so the length of the range is $8 - $7 + 1.
    x = substr(a[$1], $7, $8 - $7 + 1)
    sub(x, "", a[$1])
    print ">" $1 "\n" a[$1]
}' 1.fasta 1.blast
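With the same 1.blast and the customized 1.fasta shown above, this should produce the same output as the first script:
>1
ACTAGCTACGACTCGGACTGACGAGCTACGACTACGG
>27620
TTTGCGA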
