How to make while read faster (how to use grep instead) - linux

I have a file named "compare" and a file named "final_contigs_c10K.fa"
I want to eliminate lines AND THE NEXT LINE from "final_contigs_c10K.fa" containing specific strings listed in "compare".
compare looks like this:
k119_1
k119_3
...
and compare has 26364 lines.
final_contigs_c10K.fa looks like:
>k119_1
AAAACCCCC
>k119_2
CCCCC
>k119_3
AAAAAAAA
...
I want to turn final_contigs_c10K.fa into this format:
>k119_1
AAAACCCCC
>k119_3
AAAAAAAA
...
I tried this code, but it takes too much time, though it seems to work fine. I think it is slow because compare has 26364 lines, which is far more than in the other files I had tested the code on.
while read line; do sed -i -e "/$line/ { N; d; }" final_contigs_c10K.fa; done < compare
Is there a way to make this command faster?

Using awk
$ awk 'NR==FNR{a[">" $1];next}$1 in a{p=3} --p>0' compare final_contigs_c10K.fa
>k119_1
AAAACCCCC
>k119_3
AAAAAAAA
This will print the output to stdout, i.e. it won't make any changes to the original files.
Explained:
$ awk '
NR==FNR {                  # process the first file
    a[">" $1]              # hash to a, adding > while at it
    next                   # process the next record
}                          # process the second file after this point
$1 in a { p=3 }            # if the current record is in the compare file, set p
--p > 0                    # print the matching record and the next record
' compare final_contigs_c10K.fa   # mind the file order
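Since the command only writes to stdout, the usual pattern is to redirect it to a new file and, once the result has been checked, replace the original. A minimal sketch reusing the command above (the temporary name filtered.fa is just an example):
awk 'NR==FNR{a[">" $1];next}$1 in a{p=3} --p>0' compare final_contigs_c10K.fa > filtered.fa
mv filtered.fa final_contigs_c10K.fa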

Related

How to save in two columns of the same file from different output in bash

I am working on a project that requires me to take some .bed files as input, extract one column from each file, keep only certain values and count how many of them there are for each file. I am extremely inexperienced with bash, so I don't know most of the commands, but this line of code should do the trick.
for FILE in *; do cat $FILE | awk '$9>1.3'| wc -l ; done>/home/parallels/Desktop/EP_Cell_Type.xls
I saved those values in a .xls since I need to do some graphs with them.
Now I would like to take the filenames with -ls and save them in the first column of my .xls while my parameters should be in the 2nd column of my excel file.
I managed to save everything in one column with the command:
ls>/home/parallels/Desktop/EP_Cell_Type.xls | for FILE in *; do cat $FILE | awk '$9>1.3'-x| wc -l ; done >>/home/parallels/Desktop/EP_Cell_Type.xls
My sample files are: A549.bed, GM12878.bed, H1.bed, HeLa-S3.bed, HepG2.bed, Ishikawa.bed, K562.bed, MCF-7.bed, SK-N-SH.bed and are contained in a folder with those files only.
The output is the list of all filenames and the values in the same column, like this:
Column 1
A549.bed
GM12878.bed
H1.bed
HeLa-S3.bed
HepG2.bed
Ishikawa.bed
K562.bed
MCF-7.bed
SK-N-SH.bed
4536
8846
6754
14880
25440
14905
22721
8760
28286
but what I need should be something like this:
Filenames     #BS
A549.bed      4536
GM12878.bed   8846
H1.bed        6754
HeLa-S3.bed   14880
HepG2.bed     25440
Ishikawa.bed  14905
K562.bed      22721
MCF-7.bed     8760
SK-N-SH.bed   28286
Assuming OP's awk program (correctly) finds all of the desired rows, an easier (and faster) solution can be written completely in awk.
One awk solution that keeps track of the number of matching rows and then prints the filename and line count:
awk '
FNR==1 { if ( count >= 1 )                     # first line of new file? if line counter > 0
             printf "%s\t%d\n", prevFN, count  # then print previous FILENAME + tab + line count
         count=0                               # then reset our line counter
         prevFN=FILENAME                       # and save the current FILENAME for later printing
       }
$9>1.3 { count++ }                             # if field #9 > 1.3 then increment line counter
END    { if ( count >= 1 )                     # flush last FILENAME/line counter to stdout
             printf "%s\t%d\n", prevFN, count
       }
' *                                            # * ==> pass all files as input to awk
For testing purposes I replaced $9>1.3 with /do/ (match any line containing the string 'do') and ran against a directory containing an assortment of scripts and data files. This generated the following tab-delimited output:
bigfile.txt 7
blocker_tree.sql 4
git.bash 2
hist.bash 4
host.bash 2
lines.awk 2
local.sh 3
multi_file.awk 2
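If the header row from the desired layout (Filenames / #BS) is also wanted, a BEGIN block can print it before any file is read. A minimal sketch of that variation; restricting the glob to *.bed and writing a tab-separated file to the .xls path from the question are assumptions:
awk '
BEGIN  { printf "%s\t%s\n", "Filenames", "#BS" }   # header row, printed once
FNR==1 { if ( count >= 1 )
             printf "%s\t%d\n", prevFN, count
         count=0
         prevFN=FILENAME
       }
$9>1.3 { count++ }
END    { if ( count >= 1 )
             printf "%s\t%d\n", prevFN, count
       }
' *.bed > /home/parallels/Desktop/EP_Cell_Type.xls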

How can I split a large file with tab-separated values into smaller files while keeping lines together in a single file based on the first value?

I have a file (currently about 1 GB, 40M lines), and I need to split it into smaller files based on a target file size (target is ~1 MB per file).
The file contains multiple lines of tab-separated values. The first column has an integer value. The file is sorted by the first column. There are about 1M values in the first column, so each value has on average 40 lines, but some may have 2 and others may have 100 or more lines.
12\t...
12\t...
13\t...
14\t...
15\t...
15\t...
15\t...
16\t...
...
2584765\t...
2586225\t...
2586225\t...
After splitting the file, any distinct first value must only appear in a single file. E.g. when I read a smaller file and find a line starting with 15, it is guaranteed that no other files contain lines starting with 15.
This does not mean map all lines that start with a specific value to a single file.
Is this possible with the commandline tools available on a Unix/Linux system?
The following will try to split every 40,000 records, but postpone the split if the next record has the same key as the previous.
awk -F '\t' 'BEGIN { i=1; s=0; f=sprintf("file%05i", i) }
NR % 40000 == 0 { s=1 }
s==1 && $1!=k { close(f); f=sprintf("file%05i", ++i); s=0 }
{ k=$1; print >>f }' input
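The 40,000-record threshold approximates the ~1 MB target from the question (40M lines split into roughly 1,000 pieces). If splitting on accumulated byte size is preferred, the same "postpone until the key changes" idea can be kept with a byte counter instead of NR; this is a sketch adapting the script above, not part of the original answer:
awk -F '\t' 'BEGIN { i=1; f=sprintf("file%05i", i) }
b >= 1000000 && $1 != k { close(f); f=sprintf("file%05i", ++i); b=0 }   # new file once ~1 MB written and the key changes
{ k=$1; b += length($0) + 1; print >>f }' input                         # +1 accounts for the newline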
List all the keys by looking at only the first column (awk) and making them unique (sort -u). Then, for each of these keys, select only the lines that start with the key (grep) and redirect them into a file named after the key.
Oneliner:
for key in `awk '{print $1;}' file_to_split | sort -u` ; do grep -e "^$key\\s" file_to_split > splitted_file_$key ; done
Or multiple lines for a script file and better readability:
for key in `awk '{print $1;}' file_to_split | sort -u`
do
grep -e "^$key\\s" file_to_split > splitted_file_$key
done
Not especially efficient, as it parses the file many times.
Also, I'm not sure the for command can handle such a large input from the `` subcommand.
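If the word list produced by the backtick substitution turns out to be too large for the shell, the keys can be streamed into a while read loop instead; a sketch of that variation, with the same caveat about scanning the file once per key:
awk '{print $1;}' file_to_split | sort -u | while read -r key
do
    grep -e "^$key\\s" file_to_split > splitted_file_$key
done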
On Unix systems you can also usually use Perl. So here is a Perl solution:
#!/usr/local/bin/perl
use strict;
my $last_key = '';
my $store = '';
my $c = 0;
my $max_size = 1000000;
while (<>) {
    my @fields = split(/\t/);                      # split the record on tabs
    my $key = $fields[0];                          # first column is the key
    if ($last_key ne $key) {
        store() if (length($store) > $max_size);   # flush only at a key boundary
    }
    $store .= $_;
    $last_key = $key;
}
store();                                           # flush whatever is left at EOF
sub store {
    $c++;
    open (O, ">", "${c}.part");                    # next numbered output file
    print O $store;
    close O;
    $store = '';
}
Save it as x.pl.
Use it like:
perl x.pl bigfile.txt
It splits your entries into
1.part
2.part
...
files and tries to keep them around $max_size.
HTH

Bash script to list files periodically

I have a huge set of files, 64,000, and I want to create a Bash script that lists the names of the files using
ls -1 > file.txt
for every 4,000 files and store the resulting file.txt in a separate folder. So every 4,000 files have their names listed in a text file that is stored in its own folder. The result is
folder01 contains file.txt that lists files #0-#4000
folder02 contains file.txt that lists files #4001-#8000
folder03 contains file.txt that lists files #8001-#12000
.
.
.
folder16 contains file.txt that lists files #60000-#64000
Thank you very much in advance
You can try
ls -1 | awk '
{
if (! ((NR-1)%4000)) {
if (j) close(fnn)
fn=sprintf("folder%02d",++j)
system("mkdir "fn)
fnn=fn"/file.txt"
}
print >> fnn
}'
Explanation:
NR is the current record number in awk, that is: the current line number.
NR starts at 1, on the first line, so we subtract 1 such that the if statement is true for the first line
system calls an operating system function from within awk
print in itself prints the current line to standard output, we can redirect (and append) the output to the file using >>
All uninitialized variables in awk will have a zero value, so we do not need to say j=0 in the beginning of the program
This will get you pretty close:
ls -1 | split -l 4000 -d - folder
Run the result of ls through split, breaking every 4000 lines (-l 4000), using numeric suffixes (-d), from standard input (-) and start the naming of the files with folder.
Results in folder00, folder01, ...
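split produces flat files rather than the folderNN/file.txt layout asked for; if that exact layout matters, the pieces can be moved into directories afterwards. A minimal sketch, assuming the folderNN names produced above:
for f in folder[0-9][0-9]; do
    mkdir "${f}.d" && mv "$f" "${f}.d/file.txt" && mv "${f}.d" "$f"   # turn each split file into folderNN/file.txt
done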
Here is an exact solution using awk:
ls -1 | awk '
(NR-1) % 4000 == 0 {
dir = sprintf("folder%02d", ++nr)
system("mkdir -p " dir);
}
{ print >> dir "/file.txt"} '
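One portability note, as an aside: some awk implementations are picky about a concatenated redirection target, so parenthesizing it is the safer spelling of the last rule (gawk accepts the form above either way):
{ print >> (dir "/file.txt") }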
There are already some good answers above, but I would also suggest you take a look at the watch command. This will re-run a command every n seconds, so you can, well, watch the output.

How to display line numbers when comparing files with linux "comm" tool

I would like to diff two very large files (multi-GB), using linux command line tools, and see the line numbers of the differences. The order of the data matters.
I am running on a Linux machine and the standard diff tool gives me the "memory exhausted" error. -H had no effect.
In my application, I only need to stream the diff results. That is, I just want to visually look at the first few differences, I don't need to inspect the entire file. If there are differences, a quick glance will tell me what is wrong.
'comm' seems well suited to this, but it does not display line numbers of the differences.
In general, my multi-GB files only have a few hundred lines that are different, the rest of the file is the same.
Is there a way to get comm to dump the line number? Or a way to make diff run without loading the entire file into memory? (like cutting the input files into 1k blocks, without actually creating a million 1k-files in my filesystem and cluttering everything up)?
I won't use comm, but since you said WHAT you need, in addition to HOW you thought you should do it, I'll focus on the "WHAT you need" instead:
An interesting way would be to use paste and awk: paste can show 2 files "side by side" using a separator. If you use \n as the separator, it displays the 2 files with line 1 of each, followed by line 2 of each, etc.
So the script you could use could simply be (once you know that there are the same number of lines in each file):
paste -d '\n' /tmp/file1 /tmp/file2 | awk '
NR%2 { linefirstfile=$0 ; }
!(NR%2) { if ( $0 != linefirstfile )
{ print "line",NR/2,": "; print linefirstfile ; print $0 ; } }'
(Interestingly, this solution can easily be extended to do a diff of N files in a single read, whatever the sizes of the N files are ... just add a check that all have the same number of lines before doing the comparison steps (otherwise "paste" will in the end show only lines from the bigger files))
Here is a (short) example, to show how it works:
$ cat > /tmp/file1
A
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
E
$ cat > /tmp/file2
A
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
E
$ paste -d '\n' /tmp/file1 /tmp/file2
A
A
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
E
E
$ paste -d '\n' /tmp/file1 /tmp/file2 | awk '
NR%2 { linefirstfile=$0 ; }
!(NR%2) { if ( $0 != linefirstfile )
{ print "line",NR/2,": "; print linefirstfile ; print $0 ; } }'
line 2 :
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
If it happens that the files don't have the same number of lines, then you can first add a check of the line counts, comparing $(wc -l < /tmp/file1) and $(wc -l < /tmp/file2), and only do the paste ... | awk if they have the same number of lines, to ensure the "paste" works correctly by always having one line of each! (But of course, in that case, there will be one (fast!) entire read of each file...)
You can easily adjust it to display exactly what you need. And you could quit after the Nth difference (either automatically, with a counter in the awk script, or by pressing CTRL-C when you have seen enough).
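A sketch of the "quit automatically" variant, using a counter and a hypothetical limit of 5 differences (the value is just an example):
paste -d '\n' /tmp/file1 /tmp/file2 | awk -v maxdiff=5 '
NR%2 { linefirstfile=$0 ; }
!(NR%2) { if ( $0 != linefirstfile )
            { print "line",NR/2,": "; print linefirstfile ; print $0 ;
              if ( ++ndiff >= maxdiff ) exit }    # stop after the Nth difference
        }'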
Which versions of diff have you tried? GNU diff has a "--speed-large-files" which may help.
The comm tool assumes the lines are sorted.

Bash: How to keep lines in a file that have fields that match lines in another file?

I have two big files with a lot of text, and what I have to do is keep all lines in file A that have a field that matches a field in file B.
file A is something like:
Name (tab) # (tab) # (tab) KEYFIELD (tab) Other fields
For file B, I managed to use cut and sed and other things to basically get it down to one field per line, i.e. a list.
So the goal is to keep all lines in file A where the 4th field (it says KEYFIELD) matches one of the lines in file B. (It does NOT have to be an exact match, so if file B had Blah and file A said Blah_blah, it'd be OK.)
I tried to do:
grep -f fileBcutdown fileA > outputfile
EDIT: Ok I give up. I just force killed it.
Is there a better way to do this? File A is 13.7MB and file B after cutting it down is 32.6MB for anyone that cares.
EDIT: This is an example line in file A:
chr21 33025905 33031813 ENST00000449339.1 0 - 33031813 33031813 0 3 1835,294,104, 0,4341,5804,
example line from file B cut down:
ENST00000111111
Here's one way using GNU awk. Run like:
awk -f script.awk fileB.txt fileA.txt
Contents of script.awk:
FNR==NR {
array[$0]++
next
}
{
line = $4
sub(/\.[0-9]+$/, "", line)
if (line in array) {
print
}
}
Alternatively, here's the one-liner:
awk 'FNR==NR { array[$0]++; next } { line = $4; sub(/\.[0-9]+$/, "", line); if (line in array) print }' fileB.txt fileA.txt
GNU awk can also perform the pre-processing of fileB.txt that you described using cut and sed. If you would like me to build this into the above script, you will need to provide an example of what this line looks like.
UPDATE using files HumanGenCodeV12 and GenBasicV12:
Run like:
awk -f script.awk HumanGenCodeV12 GenBasicV12 > output.txt
Contents of script.awk:
FNR==NR {
gsub(/[^[:alnum:]]/,"",$12)
array[$12]++
next
}
{
line = $4
sub(/\.[0-9]+$/, "", line)
if (line in array) {
print
}
}
This successfully prints lines in GenBasicV12 that can be found in HumanGenCodeV12. The output file (output.txt) contains 65340 lines. The script takes less than 10 seconds to complete.
You're hitting the limit of using the basic shell tools. Assuming about 40 characters per line, File A has 400,000 lines in it and File B has about 1,200,000 lines in it. You're basically running grep for each line in File A and having grep plow through 1,200,000 lines with each execution. That's 480 BILLION lines you're parsing through. Unix tools are surprisingly quick, but even something fast done 480 billion times will add up.
You would be better off using a full programming scripting language like Perl or Python. You put all lines in File B in a hash. You take each line in File A, check to see if that fourth field matches something in the hash.
Reading in a few hundred thousand lines? Creating a 10,000,000 entry hash? Perl can parse both of those in a matter of minutes.
Something -- off the top of my head. You didn't give us much in the way of specs, so I didn't do any testing:
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);
# Create your index
open my $file_b, "<", "file_b.txt";
my %index;
while (my $line = <$file_b>) {
    chomp $line;
    $index{$line} = $line;   # Or however you do it...
}
close $file_b;
#
# Now check against file_a.txt
#
open my $file_a, "<", "file_a.txt";
while (my $line = <$file_a>) {
    chomp $line;
    my @fields = split /\s+/, $line;
    if (exists $index{$fields[3]}) {
        say "Line: $line";
    }
}
close $file_a;
The hash means you only have to read through file_b once instead of 400,000 times. Start the program, go grab a cup of coffee from the office kitchen. (Yum! non-dairy creamer!) By the time you get back to your desk, it'll be done.
grep -f seems to be very slow even for medium sized pattern files (< 1MB). I guess it tries every pattern for each line in the input stream.
A solution which was faster for me was to use a while loop. This assumes that fileA is reasonably small (it is the smaller one in your example), so iterating multiple times over the smaller file is preferable to iterating multiple times over the larger one.
while read line; do
grep -F "$line" fileA
done < fileBcutdown > outputfile
Note that this loop will output a line several times if it matches multiple patterns. To work around this limitation use sort -u, but this might be slower by quite a bit. You have to try.
while read line; do
grep -F "$line" fileA
done < fileBcutdown | sort -u > outputfile
If you depend on the order of the lines, then I don't think you have any other option than using grep -f. But basically it boils down to trying m*n pattern matches.
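If the order does matter and grep has to do the work, two things that often make a large difference with GNU grep are treating the patterns as fixed strings (-F, since the keys are literal IDs rather than regexes) and running under the C locale; both are worth trying, though the gain depends on the grep version:
LC_ALL=C grep -F -f fileBcutdown fileA > outputfile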
Use the command below:
awk 'FNR==NR{a[$0];next}($4 in a)' <your filtered fileB with single field> fileA
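Note that this compares the whole 4th field, so it only works if the filtered fileB keeps the same form as field 4 (e.g. ENST00000449339.1, including the .1 version suffix). If fileB holds the IDs without the suffix, it can be stripped before the lookup, along the lines of the earlier answer; a sketch:
awk 'FNR==NR{a[$0];next}{id=$4; sub(/\.[0-9]+$/,"",id)} (id in a)' fileBcutdown fileA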
