Extract substring with sed/awk/grep from .gff file - linux

I have a file containing multiple lines like this:
NODE_1_length Prodigal:2.6 CDS 11 274 . + 0 ID=PROKKA_00001;inference=ab initio prediction:Prodigal:2.6;locus_tag=PROKKA_00001;product=hypothetical protein
And I want to extract the ID=PROKKA_[whatever number] and everything that comes after 'product=' to obtain an output like this:
ID=PROKKA_00001 product=hypothetical protein
I am not very skilled in using sed, so I tried to adapt some solutions I found here and around, but didn't manage to get through. It is also fine if the solution comes in two steps (one for the ID, one for the product); then I can merge the two results into a single file.
I would be grateful if you could include an explanation of the regex used.
So far I tried to split the problem in two (starting from the ID) and tried:
grep -o 'ID=PROKKA_[0-9]{1,5}*'
sed 's/^ID=PROKKA[0-9]*;//g/
grep -Po 'ID="K[^"]*'
but of course none of them worked.
Thanks for helping!

You may use grep -oE:
grep -oE 'ID=PROKKA_[0-9]+|product=[^;:]+' file
ID=PROKKA_00001
product=hypothetical protein
If you want the results on the same line, then use grep + paste:
grep -oE 'ID=PROKKA_[0-9]+|product=[^;:]+' file | paste -s
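About the regex: ID=PROKKA_[0-9]+ matches the literal text ID=PROKKA_ followed by one or more digits, product=[^;:]+ matches the literal product= followed by one or more characters that are neither ; nor :, and the | between them means "match either one". -o makes grep print only the matched parts, and -E enables extended regular expressions so + and | work unescaped.
If you prefer a single pass that already joins the two fields, here is a hedged sed sketch; it assumes every line has the ID=...;...;product=... layout shown in your example:
sed -nE 's/.*(ID=PROKKA_[0-9]+).*(product=[^;]*).*/\1 \2/p' file
The parentheses capture the two fields, \1 \2 in the replacement prints them separated by a space, and -n together with the trailing p prints only the lines where the substitution succeeded.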

Related

Make grep exact-match strings with and without dash "-"

The problem looks simple and common, so I've looked through many answers, but it seems that none of them provides an appropriate general solution.
I need to grep a large tab-separated 6-column file (a *.bed file, in fact) to split it by the content of the first column, using a list of string variables (items). I just need the rows starting with a given string.
I was successfully using
grep -w "$name" inputfile
$name is read from the list of strings
for that purpose, until I hit the case where the strings have the following format (example): YAL038W, but also YAL038W-A, YAL038W-B, ...
So grep with the -w option considers YAL038W identical to YAL038W-A and YAL038W-B, since "-" is a word separator. It would work with "_", but not with "-".
I've found solutions based on awk which are working fine, for example:
awk -F $'\t' -vsearch=$name '$1==search' inputfile
but awk is terribly slow, over 10 times slower; see the time measurements below.
For a 2.5 GB input file and >5000 items to look for, the script has already been running for >24 hours!
Example of inputfile:
YAL038W-A 0 48 HWI-1KL176:101:CC27NACXX:3:2208:17646:92047 0 +
YAL038W-A 0 48 HWI-1KL176:101:CC27NACXX:3:2211:17326:31268 0 +
YAL038W 1 50 HWI-1KL176:101:CC27NACXX:8:1205:16311:19319 3 +
YAL038W 1 27 HWI-1KL176:101:CC27NACXX:8:2103:4951:94527 42 +
time grep -w "YAL038W" inputfile > testfile.txt
real 0m3.569s
time awk -F $'\t' -vsearch="YAL038W" '$1==search' inputfile > testfile.txt
real 0m29.521s
I am looking for a FAST solution using grep or something else, and I need to pass the variable to this command in a loop.
An alternative is to modify the input file by replacing "-" with "_", but that is the last resort, I believe...
Thanks in advance
I've found solutions based on awk which are working fine, for example:
awk -F $'\t' -vsearch=$name '$1==search' inputfile
but awk is terribly slow…
I am looking for FAST solution using grep …
If the above awk command worked for you, then this will do:
grep ^$name$'\t' inputfile
Just search at the beginning of each line for the name followed by a TAB.
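Since you need to pass the variable in a loop, here is a hedged sketch of what that could look like; names.txt holding one name per line and the per-name output files are my assumptions, not from the question:
while IFS= read -r name; do
    # anchor at the start of the line and require a literal TAB right after the name
    grep "^${name}"$'\t' inputfile > "${name}.bed"
done < names.txt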

Search Text with Linebreaks recursively in a directory?

I have many large logfiles which look like this:
DATETIME ["2015-03-03 21:52"]
SERVER [{json_with_$_SERVER-Output}]
GET ["GET_JSON","AAA"]
POST ["POST_JSON","BBB","TEST1"]
DATETIME ["2015-03-03 21:53"]
SERVER [{json_with_$_SERVER-Output}]
GET ["GET_JSON","CCC"]
POST ["POST_JSON","DDD","TEST2"]
DATETIME ["2015-03-03 21:54"]
SERVER [{json_with_$_SERVER-Output}]
GET ["GET_JSON","AAA"]
POST ["POST_JSON","BBB","TEST3"]
DATETIME ["2015-03-03 21:55"]
SERVER [{json_with_$_SERVER-Output}]
GET ["GET_JSON","AAA"]
POST ["POST_JSON","EEE","TEST4"]
I want to search for 2 keywords (with linebreaks between them): one specific word in the GET line and one specific word in the POST line.
I need something like:
grep "GET(.*)AAA(.*)POST(.*)BBB"
What I'm searching for: AAA (in the GET line) && BBB (in the POST line)
The expected result:
POST ["POST_JSON","BBB","TEST1"]
POST ["POST_JSON","BBB","TEST3"]
With which simple methods is this doable?
Using GNU awk for the 3rd arg to match():
$ find . -type f |
xargs gawk -v RS= 'match($0,/\nGET.*AAA.*\n(POST.*BBB.*)/,a){print a[1]}'
POST ["POST_JSON","BBB","TEST1"]
POST ["POST_JSON","BBB","TEST3"]
Add -v ORS='\n\n' if you really want a blank line between output lines.
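For example, the same command with that separator added (assuming the same record layout as above):
find . -type f |
xargs gawk -v RS= -v ORS='\n\n' 'match($0,/\nGET.*AAA.*\n(POST.*BBB.*)/,a){print a[1]}'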
grep is the command you are searching for
grep -rHn "GET.*KEYWORD_A" -A1 /path/to/files | grep "POST.*KEYWORD_B"
I would first grep for lines containing KEYWORD_A and print one extra line after each match, since the POST comes after the GET in your logfiles. Then search for KEYWORD_B.
-r greps recursively in a directory
-H prints the file name
-n prints the line number
I solved this with grep -P for regular expressions as I know them from PHP, and particularly with -A to get the next n lines. Then I filtered the result with "|" (a pipe) and grep -P again.
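For anyone landing here later, a hedged sketch of what that approach can look like, using the AAA/BBB keywords from the example above (the exact pattern depends on your real log format):
# find GET lines containing AAA, keep the line following each match,
# then keep only the POST lines containing BBB
grep -rP -A1 'GET.*AAA' /path/to/logs | grep -P 'POST.*BBB'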

Using grep to find difference between two big wordlists

I have one 78k-line .txt file with British words and a 5k-line .txt file with the most common British words. I want to filter the most common words out of the big list so that I have a new list with the less common words.
I managed to solve my problem another way, but I would really like to know what I am doing wrong, since this does not work.
I have tried the following:
//To make sure they are trimmed
cut -d" " -f1 78kfile.txt | tac | tac > 78kfile.txt
cut -d" " -f1 5kfile.txt | tac | tac > 5kfile.txt
grep -xivf 5kfile.txt 78kfile.txt > cleansed
//But this procedure apparently gives me two empty files.
If I run just the grep without cut first, I get words that I know are in both files.
I have also tried this:
sort 78kfile.txt > 78kfile-sorted.txt
sort 5kfile.txt > 5kfile-sorted.txt
comm -3 78kfile-sorted.txt 5kfile-sorted.txt
//No luck either
The two text files, in case anyone wants to try for themselves:
https://www.dropbox.com/s/dw3k8ragnvjcfgc/5k-most-common-sorted.txt
https://www.dropbox.com/s/1cvut5z2zp9qnmk/brit-a-z-sorted.txt
After downloading your files, I noticed that (a) brit-a-z-sorted.txt has Microsoft line endings while 5k-most-common-sorted.txt has Unix line endings and (b) you are trying to do whole-line compare (grep -x). So, first we need to convert to a common line ending:
dos2unix <brit-a-z-sorted.txt >brit-a-z-sorted-fixed.txt
Now, we can use grep to remove the common words:
grep -xivFf 5k-most-common-sorted.txt brit-a-z-sorted-fixed.txt >less-common.txt
I also added the -F flag to ensure that the words are interpreted as fixed strings rather than as regular expressions. This also speeds things up.
I note that there are several words in the 5k-most-common-sorted.txt file that are not in brit-a-z-sorted.txt. For example, "British" is in the common file but not the larger file. Also, the common file has "aluminum" while the larger file has only "aluminium".
What do the grep options mean? For those who are curious:
-f means read the patterns from a file.
-F means treat them as fixed strings, not regular expressions.
-i means ignore case.
-x means do whole-line matches.
-v means invert the match. In other words, print those lines that do not match any of the patterns.
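As a hedged aside: once the line endings are fixed, the comm approach from the question should work too, assuming both files are sorted under the same locale and you do not need the case-insensitive matching that grep -i gives (comm is case-sensitive):
# lines that appear only in the big list
comm -23 brit-a-z-sorted-fixed.txt 5k-most-common-sorted.txt > less-common.txt
Here -23 suppresses column 2 (lines only in the small file) and column 3 (lines common to both), leaving just the words unique to the large list.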

How to find the particular text stored in the file "data.txt" that occurs only once

The line I seek is stored in the file data.txt and is the only line of text that occurs only once.
How do I go about finding that particular line using Linux?
This is a little bit old, but I think you are looking for this...
cat data.txt | sort | uniq -u
This will show the values that occur only once in the file. I assume you are familiar with OverTheWire if you are asking? If so, this is what you are looking for.
To provide some context (I need more rep to comment): this is a question that features in an online "wargame" called Bandit, which involves using the command line to discover passwords on a remote Linux server in order to advance through the levels.
For those who would like to see data.txt in full, I've Pastebin'd it here, but it looks like this:
NN4e37KW2tkIb3dC9ZHyOPdq1FqZwq9h
jpEYciZvDIs6MLPhYoOGWQHNIoQZzE5q
3rpovhi1CyT7RUTunW30goGek5Q5Fu66
JOaWd4uAPii4Jc19AP2McmBNRzBYDAkO
JOaWd4uAPii4Jc19AP2McmBNRzBYDAkO
9WV67QT4uZZK7JHwmOH0jnhurJMwoGZU
a2GjmWtTe3tTM0ARl7TQwraPGXgfkH4f
7yJ8imXc7NNiovDuAl1ZC6xb0O0mMBx1
UsvVyFSfZZWbi6wgC7dAFyFuR6jQQUhR
FcOJhZkHlnwqcD8QbvjRyn886rCrnWZ7
E3ugYDa6Wh2y8C8xQev7vOS8O3OgG1Hw
E3ugYDa6Wh2y8C8xQev7vOS8O3OgG1Hw
ME7nnzbId4W3dajsl6Xtviyl5uhmMenv
J5lN3Qe4s7ktiwvcCj9ZHWrAJcUWEhUq
aouHvjzagN8QT2BCMB6e9rlN4ffqZ0Qq
ZRF5dlSuwuVV9TLhHKvPvRDrQ2L5ODfD
9ZjR3NTHue4YR6n4DgG5e0qMQcJjTaiM
QT8Bw9ofH4x3MeRvYAVbYvV1e1zq3Xim
i6A6TL6nqvjCAPvOdXZWjlYgyvqxmB7k
tx7tQ6kgeJnC446CHbiJY7fyRwrwuhrs
One way to do it is to use:
sort data.txt | uniq -u
The sort command is like cat in that it displays the contents of the file; however, it sorts the file lexicographically by line (it reorders the lines alphabetically so that matching ones end up next to each other).
The | is a pipe that redirects the output from one command into another.
The uniq command reports or omits repeated lines, and by passing it the -u argument we tell it to report only unique lines.
Used together like this, the command will sort data.txt lexicographically by line, find the unique line, and print it back to the terminal for you.
sort -u data.txt | while read -r line; do if [ "$(grep -c "$line" data.txt)" -eq 1 ]; then echo "$line"; fi; done
was my solution, until I saw the easy one here:
sort data.txt | uniq -u
Add more information to your post.
What does data.txt look like?
Like this:
11111111
11111111
pass1111
11111111
Or like this:
afawfdgd
password
somethin
gelse...
And do you know the password is in the file, or are you searching for the non-repeated string?
If you know the password, use something like this:
cat data.txt | grep 'password'
If you don't know the password and the password is the only unique line in the file, you must create a script.
For example, in Python:
with open("data.txt", "r") as f:
    for line in f:              # iterate over the lines of the file
        if 'pass' in line:
            print(line.strip())
Of course, replace 'pass' with something else.
For example, some slice of the line.
And one with only one tool in use, awk:
awk '{a[$1]++}END{for(i in a){if(a[i] == 1){print i} }}' data.txt
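In case the one-liner is hard to read, here is the same program spread out with comments (identical logic, only the variable names differ):
awk '
  { count[$1]++ }                        # count how often each value occurs
  END {
    for (item in count)                  # after the whole file has been read,
      if (count[item] == 1) print item   # print the values seen exactly once
  }
' data.txt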
sort data.txt | uniq -c | grep "1 "
and it will print the only text that occurs one time.
Do not forget to put a space after the 1.
sort data.txt | uniq -c | grep 1
will also find the one that occurs one time.
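A hedged aside: a bare 1 also matches larger counts such as 10 or 11, which may or may not occur in your data.txt; anchoring the pattern avoids that:
sort data.txt | uniq -c | grep -E '^ *1 '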

How to compare two text files for the same exact text using BASH?

Let's say I have two text files that I need to extract data out of. The text of the two files is as follows:
File 1:
1name - randomemail#email.com
2Name - superrandomemail#email.com
3Name - 123random#email.com
4Name - random123#email.com
File 2:
email.com
email.com
email.com
anotherwebsite.com
File 2 is File 1's list of domain names, extracted from the email addresses.
These are not the same domain names by any means, and are quite random.
How can I get the lines from File 1 whose domain names match File 2?
Thank you in advance!
Assuming that order does not matter,
grep -F -f FILE2 FILE1
should do the trick. (This works because of a little-known fact: the -F option to grep doesn't just mean "match this fixed string," it means "match any of these newline-separated fixed strings.")
The recipe:
join <(sed 's/^.*#//' file1|sort -u) <(sort -u file2)
It will output the intersection of the domain names in file1 and file2.
See BashFAQ/036 for the list of usual solutions to this type of problem.
Use the vimdiff command; this gives a nice presentation of the differences.
If I got you right, you want to filter for all addresses with the host mentioned in File 2.
You could then just loop over File 2 and grep for #<line>, accumulating the result in a new file or something similar.
Example:
cat file2 | sort -u | while read host; do grep "#$host" file1; done > filtered
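If some domains can be substrings of others (an assumption on my part; the sample does not say), anchoring the match at the end of the line is a little safer, since the domain is the last thing on each line of File 1:
sort -u file2 | while read -r host; do grep "#${host}$" file1; done > filtered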
