Executing zgrep recursively in Linux - linux

This zgrep command is outputting a particular field of a line containing the word yellow when given a giant input log file for all 24 hours of 26th Feb 1989.
zgrep 'yellow' /color_logs/1989/02/26/*/1989-02-26-00_* | cut -f3 -d'+'
1) I prefer using a perl script. Are there advantages of using a bash script?
Also, when writing this script, I would like it to create a file after processing the data for each DAY (so it will look at all the hours in a day).
zgrep 'yellow' /color_logs/1989/02/*/*/1989-02-26-00_* | cut -f3 -d'+'
2) How do I determine the value of the first star (in Perl) after processing a day's worth of data, so that I can output the file with the YYMMDD in its name? I'm interested in the value of the first star in the command directly above this question.

Grep writes out the name of the file each line came from, but your cut command is throwing that away. You could do something like:
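open(PROCESS, "zgrep 'yellow' /color_logs/1989/02/*/*/1989-02-26_* |") or die "Cannot run zgrep: $!";
while (<PROCESS>) {
    # With several input files, zgrep prefixes each match with "path:"
    if (m!/color_logs/(\d\d\d\d)/(\d\d)/(\d\d)/[^:]+:(.+)$!) {
        my ($year, $month, $day, $line) = ($1, $2, $3, $4);
        # Do the cut -f3 -d'+' on the line from the log ('+' must be escaped in the split pattern)
        my $data = (split(/\+/, $line))[2];
        next unless defined $data;
        open(OUTFILE, ">>", "${year}${month}${day}.log") or die "Cannot open output file: $!";
        print OUTFILE $data, "\n";
        close(OUTFILE);
    }
}
close(PROCESS);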
That's inefficient in that you're opening and closing the file for each line. You could use an IO::File object instead and only open a new file when the date changes, but you get the idea.
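For comparison, the same per-day split can be sketched in shell with awk, which keeps an output file open until close() is called, so a file is only reopened when the date changes. This is a minimal sketch, not a tested replacement for the Perl above: split_by_day is a hypothetical helper name, and it assumes the input arrives on stdin as "path:line" records whose path contains /YYYY/MM/DD/ and no colon.

```shell
# Hypothetical helper: split_by_day OUTDIR
# Reads "path:line" records on stdin (the form zgrep prints for several files)
# and appends the third '+'-separated field to OUTDIR/YYYYMMDD.log.
split_by_day() {
    awk -v outdir="$1" '
        match($0, /\/[0-9][0-9][0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]\//) {
            day = substr($0, RSTART + 1, 10)     # YYYY/MM/DD
            gsub("/", "", day)                   # -> YYYYMMDD
            rest = substr($0, RSTART + RLENGTH)  # remainder of the path, colon, log line
            sub(/^[^:]*:/, "", rest)             # drop everything up to the first colon
            n = split(rest, f, "[+]")            # same fields as cut -d+
            if (n >= 3) {
                out = outdir "/" day ".log"
                if (out != prev) {               # date changed: close the old file
                    if (prev != "") close(prev)
                    prev = out
                }
                print f[3] >> out
            }
        }'
}
```

It could be fed with something like zgrep 'yellow' /color_logs/1989/02/*/*/* | split_by_day . where the glob is only an illustration.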

Related

awk, sed, grep specific strings from a file in Linux

Here is part of the complete file that I am trying to filter:
Hashmode: 13761 - VeraCrypt PBKDF2-HMAC-SHA256 + XTS 512 bit + boot-mode (Iterations: 200000)
Speed.#2.........: 2038 H/s (56.41ms) # Accel:128 Loops:32 Thr:256 Vec:1
Speed.#3.........: 2149 H/s (53.51ms) # Accel:128 Loops:32 Thr:256 Vec:1
Speed.#*.........: 4187 H/s
The aim is to print the following:
13761 VeraCrypt PBKDF2-HMAC-SHA256 4187 H/s
Here is what I tried.
The complete file is called complete.txt
cat complete.txt | grep Hashmode | awk '{print $2,$4,$5}' > mode.txt
Output:
13761 VeraCrypt PBKDF2-HMAC-SHA256
Then:
cat complete.txt | grep Speed.# | awk '{print $2,$3}' > speed.txt
Output:
4187 H/s
Then:
paste mode.txt speed.txt
The issue is that the lines do not match. There are approx 200 types of modes to filter within the file 'complete.txt'
I also have a feeling that this can be done using a much simpler command with sed or awk.
I am guessing you are looking for something like the following.
awk '/Hashmode:/ { if(label) print label, speed; label = $2 " " $4 " " $5 }
/Speed\.#/ { speed = $2 " " $3 }
END { if (label) print label, speed }' complete.txt
We match up the Hashmode line with the last Speed.# line which follows, then print when we see a new Hashmode, or reach end of file. (Failing to print the last one is a common beginner bug.)
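One possible refinement, sketched here as an assumption rather than something the answer above requires: clear the remembered speed whenever a new Hashmode starts, so a mode with no Speed line does not inherit the previous mode's value. The function name summarize_modes is hypothetical.

```shell
# Hypothetical wrapper around the awk above; reads the named files, or stdin if none.
summarize_modes() {
    awk '
        function flush() {
            # Print the pending mode, with its speed if one was seen
            if (label != "") print label (speed != "" ? " " speed : "")
        }
        /Hashmode:/ { flush(); label = $2 " " $4 " " $5; speed = "" }
        /Speed\.#/  { speed = $2 " " $3 }   # the last Speed line of a mode wins
        END         { flush() }
    ' "$@"
}
```

Called as summarize_modes complete.txt, it prints one line per mode in the same format as the aim stated in the question.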
This might work for you (GNU sed):
sed -E '/Hashmode:/{:a;x;s/^[^:]*: (\S+) -( \S+ \S+ ).*\nSpeed.*:\s*(\S+ \S+).*/\1\2\3/p;x;h;d};H;$!d;ba' file
If a line contains Hashmode, swap to the hold space and using pattern matching, manipulate its contents to the desired format and print, swap back to the pattern space, copy the current line to the hold space and delete the current line.
Otherwise, append the current line to the hold space and delete the current line, unless the current line is the last line in the file, in which case process the line as if it contained Hashmode.
N.B. The first time Hashmode is encountered, nothing is output. Subsequent matches and the end-of-file condition will be the only times printing occurs.

extract sequences from multifasta file by ID in file using awk

I would like to extract sequences from the multifasta file that match the IDs given by separate list of IDs.
FASTA file seq.fasta:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT
IDs file id.txt:
7P58X:01332:11636
7P58X:01334:11613
I want to get the fasta file with only those sequences matching the IDs in the id.txt file:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
I really like the awk approach I found in answers here and here, but the code given there is still not working perfectly for the example I gave. Here is why:
(1)
awk -v seq="7P58X:01332:11636" -v RS='>' '$1 == seq {print RS $0}' seq.fasta
this code works well for the multiline sequences but IDs have to be inserted separately to the code.
(2)
awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' id.txt seq.fasta
this code can take the IDs from the id.txt file but returns only the first line of the multiline sequences.
I guess that the good thing would be to modify the RS variable in the code (2) but all of my attempts failed so far. Can, please, anybody help me with that?
$ awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
The following awk may also help:
awk 'FNR==NR{a[$0];next} /^>/{val=$0;sub(/^>/,"",val);flag=val in a?1:0} flag' ids.txt fasta_file
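On the question of changing RS in approach (2): one way that might work is to set RS only for the second file by placing the assignment between the two file names, so id.txt is still read line by line. This is a sketch under assumptions: extract_fasta is a hypothetical name, the ID is taken to be the first whitespace-delimited word of each header, and no header may contain a second '>' character.

```shell
# Hypothetical helper: extract_fasta ID_FILE FASTA_FILE
extract_fasta() {
    # First file: one ID per line (default RS).
    # Second file: RS='>' turns each FASTA entry into a single record,
    # so multi-line sequences stay attached to their header.
    awk 'NR == FNR { ids[$1]; next }
         NF && ($1 in ids) { printf ">%s", $0 }' "$1" RS='>' "$2"
}
```

For the files in the question this would be run as extract_fasta id.txt seq.fasta. The NF test skips the empty record that precedes the first '>'.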
I'm facing a similar problem. The size of my multi-fasta file is ~ 25G.
I use sed instead of awk, though my solution is an ugly hack.
First, I extracted the line number of the title of each sequence to a data file.
grep -n ">" multi-fasta.fa > multi-fasta.idx
What I got is something like this:
1:>DM_0000000004
5:>DM_0000000005
11:>DM_0000000007
19:>DM_0000000008
23:>DM_0000000009
Then, I extracted the wanted sequence by its title, e.g. DM_0000000004, using the script below.
seqnm=$1
idx0_idx1=`grep -n $seqnm multi-fasta.idx`
idx0=`echo $idx0_idx1 | cut -d ":" -f 1`
idx0plus1=`expr $idx0 + 1`
idx1=`echo $idx0_idx1 | cut -d ":" -f 2`
idx2=`head -n $idx0plus1 multi-fasta.idx | tail -1 | cut -d ":" -f 1`
idx2minus1=`expr $idx2 - 1`
sed ''"$idx1"','"$idx2minus1"'!d' multi-fasta.fa > ${seqnm}.fasta
For example, I want to extract the sequence of DM_0000016115. The idx0_idx1 variable gives me:
7507:42520:>DM_0000016115
7507 (idx0) is the line number of line 42520:>DM_0000016115 in multi-fasta.idx.
42520 (idx1) is the line number of line >DM_0000016115 in multi-fasta.fa.
idx2 is the line number of the sequence title right beneath the wanted one (>DM_0000016115).
Finally, sed extracts the lines from idx1 to idx2 minus 1, which are the title and its sequence.
The advantage of this ugly hack is that it does not require a fixed number of lines for each sequence in the multi-fasta file (if every sequence did span the same number of lines, you could simply use grep -A).
What bothers me is that this process is slow. For my 25G multi-fasta file, such an extraction takes tens of seconds. However, it's much faster than using samtools faidx.
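A different technique from the index-plus-sed approach above, offered only as a sketch: a single awk pass that starts printing at the wanted header and exits at the next one. It needs no index file and also handles the last record in the file, but it is still a linear scan, and no speed comparison is claimed here. The name extract_one is hypothetical, and the header line is assumed to be exactly ">" followed by the ID.

```shell
# Hypothetical helper: extract_one ID FASTA_FILE
extract_one() {
    awk -v want=">$1" '
        $0 == want { found = 1; print; next }  # header of the wanted record
        /^>/       { if (found) exit }         # next header: stop reading the file
        found                                  # sequence lines of the wanted record
    ' "$2"
}
```

For example, extract_one DM_0000016115 multi-fasta.fa > DM_0000016115.fasta.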

how to add numbers in C-shell

I have a question about C-shell. In my script, I want to automatically add up all the numbers and get the total. How can I implement this in C-shell?
My script is shown below:
#!/bin/csh -f
set log_list = $1
echo "Search begins now"
foreach subdir(`cat $log_list`)
grep "feature identified" "$subdir" -A1 | grep "ne=" | awk '{print $7}'
echo "done"
end
This script greps each log file listed in "log_list" for the keyword "feature identified" and then for the following line containing the keyword "ne=". I care about the number after "ne=", for example ne=140.
Then the grep output will be like this:
ne=100
ne=115
ne=120
...
There are more than 1K lines of such numbers. Of course I can redirect the grep output to a new file (in Linux) and then copy all the data into an Excel spreadsheet to add them up. But I want to do this in the script, which would make things easier.
The final result should be like this:
total_ne=335
Do you know how to do this in the C-shell? Thanks!
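One way the total might be computed without a spreadsheet is to let awk do the addition. The sketch below is written for POSIX sh rather than csh (the awk command itself can be invoked from either shell), sum_ne is a hypothetical name, and it assumes every value arrives on its own line in the form ne=NUMBER.

```shell
# Hypothetical helper: reads lines such as "ne=100" on stdin and prints the total.
sum_ne() {
    awk -F= '
        $1 == "ne" { total += $2 }               # add the number after "ne="
        END        { print "total_ne=" (total + 0) }  # + 0 prints 0 for empty input
    '
}
```

With the three sample values from the question, printf 'ne=100\nne=115\nne=120\n' | sum_ne prints total_ne=335.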

Bash: How to keep lines in a file that have fields that match lines in another file?

I have two big files with a lot of text, and what I have to do is keep all lines in file A that have a field that matches a field in file B.
file A is something like:
Name (tab) # (tab) # (tab) KEYFIELD (tab) Other fields
For file B, I managed to use cut, sed and other things to get it down to a single field, one value per line.
So the goal is to keep every line in file A whose 4th field (the one marked KEYFIELD) matches one of the lines in file B. (It does NOT have to be an exact match, so if file B had Blah and file A said Blah_blah, it'd be ok.)
I tried to do:
grep -f fileBcutdown fileA > outputfile
EDIT: Ok I give up. I just force killed it.
Is there a better way to do this? File A is 13.7MB and file B after cutting it down is 32.6MB for anyone that cares.
EDIT: This is an example line in file A:
chr21 33025905 33031813 ENST00000449339.1 0 - 33031813 33031813 0 3 1835,294,104, 0,4341,5804,
example line from file B cut down:
ENST00000111111
Here's one way using GNU awk. Run like:
awk -f script.awk fileB.txt fileA.txt
Contents of script.awk:
FNR==NR {
array[$0]++
next
}
{
line = $4
sub(/\.[0-9]+$/, "", line)
if (line in array) {
print
}
}
Alternatively, here's the one-liner:
awk 'FNR==NR { array[$0]++; next } { line = $4; sub(/\.[0-9]+$/, "", line); if (line in array) print }' fileB.txt fileA.txt
GNU awk can also perform the pre-processing of fileB.txt that you described using cut and sed. If you would like me to build this into the above script, you will need to provide an example of what this line looks like.
UPDATE using files HumanGenCodeV12 and GenBasicV12:
Run like:
awk -f script.awk HumanGenCodeV12 GenBasicV12 > output.txt
Contents of script.awk:
FNR==NR {
gsub(/[^[:alnum:]]/,"",$12)
array[$12]++
next
}
{
line = $4
sub(/\.[0-9]+$/, "", line)
if (line in array) {
print
}
}
This successfully prints lines in GenBasicV12 that can be found in HumanGenCodeV12. The output file (output.txt) contains 65340 lines. The script takes less than 10 seconds to complete.
You're hitting the limit of the basic shell tools. Assuming about 40 characters per line, File A has 400,000 lines in it and File B has about 1,200,000 lines in it. You're basically running grep for each line in File A and having grep plow through 1,200,000 lines with each execution. That's 480 BILLION lines you're parsing through. Unix tools are surprisingly quick, but even something fast done 480 billion times will add up.
You would be better off using a full scripting language like Perl or Python. You put all lines in File B in a hash. Then you take each line in File A and check whether its fourth field matches something in the hash.
Reading in a few hundred thousand lines? Creating a 10,000,000 entry hash? Perl can parse both of those in a matter of minutes.
Something like this, off the top of my head. You didn't give us much in the way of specs, so I didn't do any testing:
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);
# Create your index
open my $file_b, "<", "file_b.txt";
my %index;
while (my $line = <$file_b>) {
chomp $line;
$index{$line} = $line; #Or however you do it...
}
close $file_b;
#
# Now check against file_a.txt
#
open my $file_a, "<", "file_a.txt";
while (my $line = <$file_a>) {
chomp $line;
my @fields = split /\s+/, $line;
if (exists $index{$fields[3]}) {
say "Line: $line";
}
}
close $file_a;
The hash means you only have to read through file_b once instead of 400,000 times. Start the program, go grab a cup of coffee from the office kitchen. (Yum! non-dairy creamer!) By the time you get back to your desk, it'll be done.
grep -f seems to be very slow even for medium sized pattern files (< 1MB). I guess it tries every pattern for each line in the input stream.
A solution, which was faster for me, was to use a while loop. This assumes that fileA is reasonably small (it is the smaller one in your example), so iterating multiple times over the smaller file is preferable over iterating the larger file multiple times.
while IFS= read -r line; do
grep -F "$line" fileA
done < fileBcutdown > outputfile
Note that this loop will output a line several times if it matches multiple patterns. To work around this limitation use sort -u, but this might be slower by quite a bit. You have to try.
while IFS= read -r line; do
grep -F "$line" fileA
done < fileBcutdown | sort -u > outputfile
If you depend on the order of the lines, then I don't think you have any other option than using grep -f. But basically it boils down to trying m*n pattern matches.
Use the command below:
awk 'FNR==NR{a[$0];next}($4 in a)' <your filtered fileB with single field> fileA

Getting n-th line of text output

I have a script that generates two lines as output each time. I'm really just interested in the second line. Moreover I'm only interested in the text that appears between a pair of #'s on the second line. Additionally, between the hashes, another delimiter is used: ^A. It would be great if I can also break apart each part of text that is ^A-delimited (Note that ^A is SOH special character and can be typed by using Ctrl-A)
output | sed -n '1p' #prints the 1st line of output
output | sed -n '1,3p' #prints the 1st, 2nd and 3rd line of output
your.program | tail -n +2 | cut -d# -f2
should get you 2/3 of the way.
Improving Grumdrig's answer:
your.program | head -n 2 | tail -n 1 | cut -d# -f2
I'd probably use awk for that.
your_script | awk -F# 'NR == 2 && NF == 3 {
num_tokens=split($2, tokens, "\001") # "\001" is the SOH (Ctrl-A) character
for (i = 1; i <= num_tokens; ++i) {
print tokens[i]
}
}'
This says
1. Set the field separator to #
2. On lines that are the 2nd line, and also have 3 fields (text#text#text)
3. Split the middle (2nd) field using "^A" as the delimiter into the array named tokens
4. Print each token
Obviously this makes a lot of assumptions. You might need to tweak it if, for example, # or ^A can appear legitimately in the data, without being separators. But something like that should get you started. You might need to use nawk or gawk or something, I'm not entirely sure if plain awk can handle splitting on a control character.
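To make the control-character question concrete, here is a small self-contained sketch of the steps listed above, using the octal escape "\001" for SOH instead of a typed Ctrl-A. The name second_line_tokens is hypothetical, and the sketch assumes the second line has the form text#tokens#text.

```shell
# Hypothetical helper: print each SOH-separated token found between the
# first pair of '#' characters on the second line of stdin.
second_line_tokens() {
    awk -F'#' 'NR == 2 && NF >= 3 {
        n = split($2, tokens, "\001")   # "\001" is SOH (Ctrl-A)
        for (i = 1; i <= n; i++) print tokens[i]
    }'
}
```

A quick check with generated input: printf 'first line\nx#a\001b\001c#y\n' | second_line_tokens prints a, b and c on separate lines.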
bash:
read
read line
result="${line#*#}"
result="${result%#*}"
IFS=$'\001' read -r -a result <<< "$result"
result is now an array that contains the elements you're interested in. Just pipe the output of the script to this one.
Here's a possible awk solution:
awk -F"#" 'NR==2{
for(i=2;i<=NF;i+=2){
n=split($i,a,"\001") # split on SOH
for(o=1;o<=n;o++) print a[o] # print each piece in order
}
}' file
