I am preparing a presentation and I want a file to be printed on the screen s l o w l y while I comment on it. Typing cat file.txt | less seems like the obvious solution, but is there another one that is more elegant and pleasing to the eye?
perl -ne '$|=1; for (split //) { print; select(undef,undef,undef, 0.15) }' file.txt
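$|=1 disables output buffering, so each character shows up as soon as it is printed.
for (split //) { print; ... } splits the current line into single characters and prints them one at a time.
select(undef,undef,undef, 0.15) sleeps for 0.15 seconds (you can change this value according to your taste and needs).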
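For comparison, the same three steps (unbuffered output, one character at a time, a short pause) can be sketched in Python. This is only an illustration of the idea, not part of the answer above; the function name slow_print and its parameters are made up for this sketch.

```python
import sys
import time

def slow_print(text, delay=0.15, out=sys.stdout):
    """Write text one character at a time, pausing after each character."""
    for ch in text:
        out.write(ch)
        out.flush()        # counterpart of $|=1: push the character out immediately
        time.sleep(delay)  # counterpart of select(undef, undef, undef, 0.15)
```

Calling slow_print(open("file.txt").read()) would then type the file out at roughly the same pace as the perl one-liner.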
I have the following simple script that tries to count
the tag encoded with "CB:Z" in SAM/BAM file:
samtools view -h small.bam | grep "CB:Z:" |
sed 's/.*CB:Z:\([ACGT]*\).*/\1/' |
sort |
uniq -c |
awk '{print $2 " " $1}'
Typically it needs to process 40 million lines. That code takes around 1 hour to finish.
This line sed 's/.*CB:Z:\([ACGT]*\).*/\1/' is very time consuming.
How can I speed it up?
The reason I used the Regex is that the "CB" tag column-wise position
is not fixed. Sometimes it's at column 20 and sometimes column 21.
Example BAM file can be found HERE.
Update
Speed comparison on complete 40 million lines file:
My initial code:
real 21m47.088s
user 26m51.148s
sys 1m27.912s
James Brown's with AWK:
real 1m28.898s
user 2m41.336s
sys 0m6.864s
James Brown's with MAWK:
real 1m10.642s
user 1m41.196s
sys 0m6.484s
Another awk, pretty much like @tripleee's, I'd assume:
$ samtools view -h small.bam | awk '
match($0,/CB:Z:[ACGT]*/) { # use match for the regex match
a[substr($0,RSTART+5,RLENGTH-5)]++ # length("CB:Z:")==5, hence +5 and -5
}
END {
for(i in a)
print i,a[i] # sample output,tweak it to your liking
}'
Sample output:
...
TCTTAATCGTCC 175
GGGAAGGCCTAA 190
TCGGCCGATCGG 32
GACTTCCAAGCC 76
CCGCGGCATCGG 36
TAGCGATCGTGG 125
...
Note: Your sed 's/.*CB:Z:... matches the last instance, whereas my awk 'match($0,/CB:Z:[ACGT]*/)... matches the first.
Note 2: Quoting @Sundeep in the comments: using LC_ALL=C mawk '..' will give even better speed.
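The first-versus-last difference only matters when a record carries more than one CB:Z: tag. A small Python sketch makes it visible; the sample record below is invented for illustration and is not taken from small.bam, and count_tags is a hypothetical helper name.

```python
import re
from collections import Counter

# Made-up record with two CB:Z: tags, only to show the difference.
line = "read1\tCB:Z:AAAC\tXX:Z:foo\tCB:Z:GGGT"

# The greedy ".*" of the sed command swallows as much as it can,
# so the captured group is the LAST tag on the line.
last = re.sub(r".*CB:Z:([ACGT]*).*", r"\1", line)

# A plain leftmost search, like awk's match(), stops at the FIRST tag.
first = re.search(r"CB:Z:([ACGT]*)", line).group(1)

def count_tags(lines):
    """Count the first CB:Z: barcode of every line, as the awk script does."""
    counts = Counter()
    for text in lines:
        m = re.search(r"CB:Z:([ACGT]*)", text)
        if m:
            counts[m.group(1)] += 1
    return counts
```

If every record has at most one CB:Z: tag, both approaches give the same counts.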
With perl
perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}'
CB:Z:\K[ACGT]++ will match any sequence of ACGT characters preceded by CB:Z:. \K is used here to keep CB:Z: out of the matched portion, which is available via the $& variable.
Sample timings with the small.bam input file. mawk is fastest for this input, but that might change for a larger input file.
# script.awk is the one mentioned in James Brown's answer
# result here shown with GNU awk
$ time LC_ALL=C awk -f script.awk small.bam > f1
real 0m0.092s
# mawk is faster compared to GNU awk for this use case
$ time LC_ALL=C mawk -f script.awk small.bam > f2
real 0m0.054s
$ time perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}' small.bam > f3
real 0m0.064s
$ diff -sq <(sort f1) <(sort f2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -sq <(sort f1) <(sort f3)
Files /dev/fd/63 and /dev/fd/62 are identical
Better to avoid parsing the output of samtools view in the first place. Here's one way to get what you need just using python and the pysam library:
import pysam
from collections import defaultdict
counts = defaultdict(int)  # barcode -> number of alignments carrying it
tag = 'CB'
with pysam.AlignmentFile('small.bam') as sam:
    for aln in sam:
        if aln.has_tag(tag):
            counts[aln.get_tag(tag)] += 1
for k, v in counts.items():
    print(k, v)
Following your original pipeline approach:
pcre2grep -o 'CB:Z:\K[^\t]*' small.bam |
awk '{++c[$0]} END {for (i in c) print i,c[i]}'
In case you're interested in trying to speed up sed (although it's not likely to be the fastest):
sed 't a;s/CB:Z:/\n/;D;:a;s/\t/\n/;P;d' small.bam |
awk '{++c[$0]} END {for (i in c) print i,c[i]}'
The above syntax is compatible with GNU sed.
Regarding the AWK-based solutions, I've noticed that few of them take advantage of FS.
I'm not too familiar with the BAM format. If CB only shows up once per line, then
mawk/mawk2/gawk -b 'BEGIN { FS = "CB:Z:";
} $2 ~ /^[ACGT]/ { # if FS never matches, $2 would be beyond
# end of line, then this would just match
# against null string, & eval to false
seen[substr($2, 1, -1 + match($2, /[^ACGT]|$/))]++
} END { for (x in seen) { print seen[x] " " x } }'
If it shows up more than once, change that to a loop over every field after the first. This version evaluates as lazily as possible to speed things up, and it also does the counting that uniq -c did in the original pipeline.
While this is rather similar to the best answer above, having FS pre-split the fields means match() and substr() do a lot less work. I'm simply matching the first single character after the genetic sequence and directly using match()'s return value, minus 1, as the substring length, skipping RSTART and RLENGTH altogether.
Regarding:
$ diff -sq <(sort f1) <(sort f2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -sq <(sort f1) <(sort f3)
Files /dev/fd/63 and /dev/fd/62 are identical
There's absolutely no need to write them to disk and run a diff. Simply pipe the output of each command into a very fast hashing algorithm that adds close to no time (when the output is gigantic enough, you might even end up saving time versus going to disk).
My personal favorite is xxhash in 128-bit mode, available via Python's pip. It's NOT a cryptographic hash, but it's much faster than even something like MD5. This method also allows for a hassle-free comparison, since the benchmark timing run also performs the accuracy check.
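As a rough sketch of that idea in Python: hash each program's output lines and compare the digests instead of diffing files. xxhash is a third-party package, so this sketch swaps in hashlib.blake2b from the standard library with a 128-bit digest; it is slower than xxhash but shows the same workflow. The function name output_digest is made up here. The lines are sorted first because awk's for (i in a) and perl's keys do not guarantee any order.

```python
import hashlib

def output_digest(lines):
    """Order-independent 128-bit digest of an iterable of text lines."""
    h = hashlib.blake2b(digest_size=16)  # stand-in for xxhash in 128-bit mode
    for line in sorted(lines):           # sort so the output order does not matter
        h.update(line.encode())
        h.update(b"\n")
    return h.hexdigest()
```

Two outputs agree when their digests are equal, for example output_digest(awk_lines) == output_digest(perl_lines), where both arguments are lists of output lines without the trailing newline.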
I have a bash script which takes a text file as input plus two parameters (the numbers of two lines), then swaps those two lines in the text. Here is the code:
#!/bin/bash
awk -v var="$1" -v var1="$2" 'NR==var {
s=$0
for(i=var+1; i < var1 ; i++) {
getline; s1=s1?s1 "\n" $0:$0
}
getline; print; print s1 s
next
}1' Ham > newHam_changed.txt
It works fine for any two lines that are not consecutive. For lines that follow each other (for example lines 5 and 6) it also works, but it creates a blank line between them. How can I fix that?
I think your actual script is not what you posted in the question. I think the line with all the prints contains:
print s1 "\n" s
The problem is that when the lines are consecutive, s1 will be empty (the for loop is skipped), but it will still print a newline before s, producing a blank line.
So you need to make that newline conditional.
awk -v var="4" -v var1="6" 'NR==var {
s=$0
for(i=var+1; i < var1 ; i++) {
getline; s1=s1?s1 "\n" $0:$0
}
getline; print; print (s1 ? s1 "\n" : "") s
next
}1' Ham > newHam_changed.txt
Using getline always makes awk scripts a bit complicated. It is better to avoid getline and just use awk's pattern { action } syntax, which gives perfectly readable scripts. In any other language you would just write a loop and fetch the next line, but in awk I think it is best to make good use of this feature.
awk -v var="$1" -v var1="$2" '
NR==var {s=$0; collect=1; next;}
NR==var1 {collect=0; print; printf "%s", inbetween; print s}
collect {inbetween=inbetween""$0"\n"; next;}
1' Ham
Here I capture the first line in s when I find it and set the collect flag. This triggers the collect block on the following lines, which gathers all lines in between. When the second line is found, it sets collect back to zero and prints first the current line, then the inbetween lines and then s. If the lines are consecutive, inbetween is empty and printf will then do nothing.
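The same flag-based logic can be written as a small Python generator, which may make the flow easier to follow. This is a sketch under the same assumptions as the awk script (1-based line numbers, first smaller than second, both inside the file); the name swap_lines is hypothetical.

```python
def swap_lines(lines, first, second):
    """Yield lines with the 1-based lines `first` and `second` exchanged."""
    held = None         # the line found at position `first` (s in the awk script)
    between = []        # lines strictly between the two positions (inbetween)
    collecting = False  # the collect flag
    for number, line in enumerate(lines, start=1):
        if number == first:
            held, collecting = line, True
        elif number == second:
            collecting = False
            yield line
            yield from between  # empty for consecutive lines, so no blank line appears
            yield held
        elif collecting:
            between.append(line)
        else:
            yield line
```

Because the in-between lines are emitted only if there are any, consecutive line numbers need no special case.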
Too complex for my taste, here is something quite simple that achieves the same task:
#!/bin/bash
ORIGFILE='original.txt' # original text file
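PROCFILE='processed.txt' # copy of the original file to be processed
CHGL1=$(sed "$1q;d" "$ORIGFILE") # get original line $1
CHGL2=$(sed "$2q;d" "$ORIGFILE") # get original line $2
cat "$ORIGFILE" > "$PROCFILE" # work on a copy, leave the original untouched
sed -i "$2s/^.*/$CHGL1/" "$PROCFILE" # replace line $2 with the old line $1
sed -i "$1s/^.*/$CHGL2/" "$PROCFILE" # replace line $1 with the old line $2
More code doesn't mean more useful; keep it simple. This code does not use a for loop and instead goes directly to the specific lines. Note that it breaks if the swapped lines contain characters that are special in a sed replacement, such as /, & or \.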
EDIT:
A simple way on one line to do this task:
printf '%s\n' 14m26 26-m14- w q | ed -s file
Found in this answer.
I'm writing a Linux command that pastes corresponding characters from multiple lines together. For example: I want to change these lines
A---
-B--
---C
--D-
to this:
A----B-----D--C-
So far, I've made this:
cat sanger.a sanger.c sanger.g sanger.t | cut -c 1
This does the trick for only the first column, but it has to work for all the columns.
Is there anyone who can help?
EDIT: This is a better example. I want this:
SUGAR
HONEY
CANDY
to become
SHC UOA GNN AED RYY (without spaces)
Awk way for updated spec
awk -vFS= '{for(i=1;i<=NF;i++)a[i]=a[i]$i}
END{for(i=1;i<=NF;i++)printf "%s",a[i];print ""}' file
Output
A----B-----D--C-
SHCUOAGNNAEDRYY
P.S. For a large file this will use lots of memory.
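What the awk script does is a transposition: a[i] collects the i-th character of every line, and the END block prints the columns one after another. In Python the same thing can be sketched with zip. This assumes that all lines have the same length (zip stops at the shortest line), and the function name interleave_columns is made up for this example.

```python
def interleave_columns(lines):
    """Join the 1st characters of all lines, then the 2nd characters, and so on."""
    # zip(*lines) yields one tuple per column: ('S', 'H', 'C'), ('U', 'O', 'A'), ...
    return "".join("".join(column) for column in zip(*lines))

print(interleave_columns(["SUGAR", "HONEY", "CANDY"]))  # SHCUOAGNNAEDRYY
```

Like the awk version, this keeps the whole input in memory.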
A terrible way that does not use awk; you also need to know the number of fields beforehand.
for i in {1..4};do cut -c $i test | tr -d "\n" ; done;echo
Here's a solution without awk or sed, assuming the file is named f:
paste -s -d "" <(for i in $(seq 1 $(wc -L < f)); do cut -c $i f; done)
wc -L is a GNUism which returns the length of the longest line in the input file, which might not work depending on your version/locale. You could instead find the longest line by doing something like:
awk '{if (length > x) {x = length}} END {print x}' f
Then use this value in the seq command instead of the above command substitution.
All right, time for some sed insanity! :D
Disclaimer: If this is for something serious, use something less brittle than this. awk comes to mind. Unless you feel confident enough in your sed abilities to maintain this lunacy.
cat file1 file2 etc | sed -n '1h; 1!H; $ { :loop; g; s/$/\n/; s/\([^\n]\)[^\n]*\n/\1/g; p; g; s/^.//; s/\n./\n/g; h; /[^\n]/ b loop }' | tr -d '\n'; echo
This comes in three parts: Say you have a file foo.txt
12345
67890
abcde
fghij
then
cat foo.txt | sed -n '1h; 1!H; $ { :loop; g; s/$/\n/; s/\([^\n]\)[^\n]*\n/\1/g; p; g; s/^.//; s/\n./\n/g; h; /[^\n]/ b loop }'
produces
16af
27bg
38ch
49di
50ej
After that, tr -d '\n' deletes the newlines, and ;echo adds one at the end.
The heart of this madness is the sed code, which is
1h
1!H
$ {
:loop
g
s/$/\n/
s/\([^\n]\)[^\n]*\n/\1/g
p
g
s/^.//
s/\n./\n/g
h
/[^\n]/ b loop
}
This first follows the basic pattern
1h # if this is the first line, put it in the hold buffer
1!H # if it is not the first line, append it to the hold buffer
$ { # if this is the last line,
do stuff # do stuff. The whole input is in the hold buffer here.
}
which assembles all input in the hold buffer before working on it. Once the whole input is in the hold buffer, this happens:
:loop
g # copy the hold buffer to the pattern space
s/$/\n/ # put a newline at the end
s/\([^\n]\)[^\n]*\n/\1/g # replace every line with only its first character
p # print that
g # get the hold buffer again
s/^.// # remove the first character from the first line
s/\n./\n/g # remove the first character from all other lines
h # put that back in the hold buffer
/[^\n]/ b loop # if there's something left other than newlines, loop
And there you have it. I might just have summoned Cthulhu.
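For readers who do not speak sed, the loop above can be mimicked in a few lines of Python. This is only a sketch of the same idea (peel the first character off every line, emit those characters as one row, repeat until nothing is left), not a translation of every sed command; the name sed_style_transpose is invented here. One small difference: the sed loop always runs at least once, while this version checks before the first pass.

```python
def sed_style_transpose(lines):
    """Repeatedly take the first character of every line, like the sed loop."""
    hold = list(lines)  # plays the role of the hold buffer: the whole input
    rows = []
    while any(hold):    # like /[^\n]/ b loop: continue while any character is left
        rows.append("".join(text[:1] for text in hold))  # first character of each line
        hold = [text[1:] for text in hold]                # drop that character everywhere
    return rows

print(sed_style_transpose(["12345", "67890", "abcde", "fghij"]))
# ['16af', '27bg', '38ch', '49di', '50ej']
```

Joining the returned rows without a separator corresponds to the final tr -d '\n' step.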
I have two big files with a lot of text, and what I have to do is keep all lines in file A that have a field that matches a field in file B.
file A is something like:
Name (tab) # (tab) # (tab) KEYFIELD (tab) Other fields
For file B, I managed to use cut and sed and other things to get it down to a single field, so it is basically a list with one value per line.
So the goal is to keep every line in file A whose 4th field (the one marked KEYFIELD) matches one of the lines in file B. (It does NOT have to be an exact match, so if file B had Blah and file A said Blah_blah, that would be OK.)
I tried to do:
grep -f fileBcutdown fileA > outputfile
EDIT: Ok I give up. I just force killed it.
Is there a better way to do this? File A is 13.7MB and file B after cutting it down is 32.6MB for anyone that cares.
EDIT: This is an example line in file A:
chr21 33025905 33031813 ENST00000449339.1 0 - 33031813 33031813 0 3 1835,294,104, 0,4341,5804,
example line from file B cut down:
ENST00000111111
Here's one way using GNU awk. Run like:
awk -f script.awk fileB.txt fileA.txt
Contents of script.awk:
FNR==NR {
array[$0]++
next
}
{
line = $4
sub(/\.[0-9]+$/, "", line)
if (line in array) {
print
}
}
Alternatively, here's the one-liner:
awk 'FNR==NR { array[$0]++; next } { line = $4; sub(/\.[0-9]+$/, "", line); if (line in array) print }' fileB.txt fileA.txt
GNU awk can also perform the pre-processing of fileB.txt that you described using cut and sed. If you would like me to build this into the above script, you will need to provide an example of what this line looks like.
UPDATE using files HumanGenCodeV12 and GenBasicV12:
Run like:
awk -f script.awk HumanGenCodeV12 GenBasicV12 > output.txt
Contents of script.awk:
FNR==NR {
gsub(/[^[:alnum:]]/,"",$12)
array[$12]++
next
}
{
line = $4
sub(/\.[0-9]+$/, "", line)
if (line in array) {
print
}
}
This successfully prints lines in GenBasicV12 that can be found in HumanGenCodeV12. The output file (output.txt) contains 65340 lines. The script takes less than 10 seconds to complete.
You're hitting the limit of the basic shell tools. Assuming about 40 characters per line, File A has about 400,000 lines in it and File B has about 1,200,000 lines in it. You're basically running grep for each line in File A and having grep plow through 1,200,000 lines with each execution. That's 480 BILLION lines you're parsing through. Unix tools are surprisingly quick, but even something fast done 480 billion times will add up.
You would be better off using a full scripting language like Perl or Python. You put all lines in File B in a hash. Then you take each line in File A and check whether its fourth field matches something in the hash.
Reading in a few hundred thousand lines? Creating a 10,000,000 entry hash? Perl can parse both of those in a matter of minutes.
Something like this, off the top of my head. You didn't give us much in the way of specs, so I didn't do any testing:
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);
# Create your index
open my $file_b, "<", "file_b.txt";
my %index;
while (my $line = <$file_b>) {
chomp $line;
$index{$line} = $line; #Or however you do it...
}
close $file_b;
#
# Now check against file_a.txt
#
open my $file_a, "<", "file_a.txt";
while (my $line = <$file_a>) {
chomp $line;
my @fields = split /\s+/, $line;
if (exists $index{$fields[3]}) {
say "Line: $line";
}
}
close $file_a;
The hash means you only have to read through file_b once instead of 400,000 times. Start the program, go grab a cup of coffee from the office kitchen. (Yum! non-dairy creamer!) By the time you get back to your desk, it'll be done.
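Since Python was mentioned as an alternative, here is a minimal sketch of the same hash-lookup idea using a set. It borrows one detail from the awk answer above, namely stripping a trailing version suffix such as .1 from the key before the lookup; whether that is wanted depends on your data. The function name filter_by_key and its parameters are assumptions made for this example.

```python
import re

def filter_by_key(a_lines, b_keys, field=3):
    """Keep the lines of file A whose key field (0-based index) occurs in file B."""
    wanted = set(b_keys)  # one pass over file B, then constant-time lookups
    kept = []
    for line in a_lines:
        fields = line.split()
        if len(fields) <= field:
            continue      # line too short to have the key field
        key = re.sub(r"\.[0-9]+$", "", fields[field])  # drop a version suffix like ".1"
        if key in wanted:
            kept.append(line)
    return kept
```

With real files, a_lines and b_keys would be the stripped lines of file A and file B. Each line of file A then costs one set lookup instead of a scan over all of file B.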
grep -f seems to be very slow even for medium sized pattern files (< 1MB). I guess it tries every pattern for each line in the input stream.
A solution, which was faster for me, was to use a while loop. This assumes that fileA is reasonably small (it is the smaller one in your example), so iterating multiple times over the smaller file is preferable over iterating the larger file multiple times.
while read line; do
grep -F "$line" fileA
done < fileBcutdown > outputfile
Note that this loop will output a line several times if it matches multiple patterns. To work around this limitation use sort -u, but this might be slower by quite a bit. You have to try.
while read line; do
grep -F "$line" fileA
done < fileBcutdown | sort -u > outputfile
If you depend on the order of the lines, then I don't think you have any other option than using grep -f. But basically it boils down to trying m*n pattern matches.
Use the command below:
awk 'FNR==NR{a[$0];next}($4 in a)' <your filtered fileB with single field> fileA
I want to strip a chunk of lines from a big text file. I know the start and end line number. What is the most elegant way to get the content (lines between the A and B) out to some file?
I know the head and tail commands. Is there an even quicker (one-step) way?
The file is over 5GB and it contains over 81 million lines.
UPDATED: The results
time sed -n 79224100,79898190p BIGFILE.log > out4.log
real 1m9.988s
time tail -n +79224100 BIGFILE.log | head -n +`expr 79898190 - 79224100` > out1.log
real 1m11.623s
time perl fileslice.pl BIGFILE.log 79224100 79898190 > out2.log
real 1m13.302s
time python fileslice.py 79224100 79898190 < BIGFILE.log > out3.log
real 1m13.277s
The winner is sed. The fastest, the shortest. I think Chuck Norris would use it.
sed -n '<A>,<B>p' input.txt
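This works for me in GNU sed, where I and J stand for the first and last line numbers:
sed -n 'I,$p; Jq'
The q command quits as soon as line J has been processed, so the rest of the file is never read.
For example, these large numbers work: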
$ yes | sed -n '200000000,${=;p};200000005q'
200000000
y
200000001
y
200000002
y
200000003
y
200000004
y
200000005
y
I guess big files need a bigger solution...
fileslice.py:
import sys
import itertools
for line in itertools.islice(sys.stdin, int(sys.argv[1]) - 1, int(sys.argv[2])):
    sys.stdout.write(line)
invocation:
python fileslice.py 79224100 79898190 < input.txt > output.txt
Here's a perl solution :)
fileslice.pl:
#!/usr/bin/perl
use strict;
use warnings;
use IO::File;
my $first = $ARGV[1];
my $last = $ARGV[2];
my $fd = IO::File->new($ARGV[0], 'r') or die "Unable to open file $ARGV[0]: $!\n";
my $i = 0;
while (<$fd>) {
$i++;
next if ($i < $first);
last if ($i > $last);
print $_;
}
Run it with
perl fileslice.pl file 79224100 79898190