I have multiple paired files named xxx_1.txt and xxx_2.txt, yyy_1.txt and yyy_2.txt, etc. They are single-column files with the following format:
xxx_1.txt:
#CHROM_POSREFALT
MSHR1153_annotated_1_9107CA
MSHR1153_annotated_1_9197CT
MSHR1153_annotated_1_9303TC
MSHR1153_annotated_1_10635GA
MSHR1153_annotated_1_10836AG
MSHR1153_annotated_1_11108AG
MSHR1153_annotated_1_11121GA
MSHR1153_annotated_1_11123CT
MSHR1153_annotated_1_11131CT
MSHR1153_annotated_1_11155AG
MSHR1153_annotated_1_11166CT
MSHR1153_annotated_1_11186TC
MSHR1153_annotated_1_11233TG
MSHR1153_annotated_1_11274GT
MSHR1153_annotated_1_11472CG
MSHR1153_annotated_1_11814GA
MSHR1153_annotated_1_11815CT
xxx_2.txt:
LocationMSHR1153_annotatedMSHR0491_Australasia
MSHR1153_annotated_1_56TC
MSHR1153_annotated_1_226AG
MSHR1153_annotated_1_670AG
MSHR1153_annotated_1_817CT
MSHR1153_annotated_1_1147TC
MSHR1153_annotated_1_1660TC
MSHR1153_annotated_1_2488AG
MSHR1153_annotated_1_2571GA
MSHR1153_annotated_1_2572TC
MSHR1153_annotated_1_2698TC
MSHR1153_annotated_1_2718TG
MSHR1153_annotated_1_3018TC
MSHR1153_annotated_1_3424TC
MSHR1153_annotated_1_3912CT
MSHR1153_annotated_1_4013GA
MSHR1153_annotated_1_4087GC
MSHR1153_annotated_1_4878CT
MSHR1153_annotated_1_5896GA
MSHR1153_annotated_1_7833TG
MSHR1153_annotated_1_7941CT
MSHR1153_annotated_1_8033GA
MSHR1153_annotated_1_8888AC
MSHR1153_annotated_1_9107CA
MSHR1153_annotated_1_9197CT
They are actually much longer than this. My goal is to compare each line and produce multiple outputs for creating a Venn diagram later on. So I need one file listing all the lines in common, which looks like this (in this case there is only one):
MSHR1153_annotated_1_9107CA
I also need one file that lists everything specific to xxx_1 and one file that lists everything specific to xxx_2.
I have so far come up with this:
awk ' FNR==NR { position[$1]=$1; next} {if ( $1 in position ) {print $1 > "foundinboth"} else {print $1 > "uniquetofile1"}} ' FILE2 FILE1
The problem is I now have over 300 paired files to run through, and with this approach I have to change the file names manually each time. It also doesn't produce all the output files at the same time. Is there a way to loop through and handle everything automatically? The files are paired so that only the suffix differs ("_1" vs "_2"). I need it to loop through each pair and produce everything I need in one go.
Would you please try the following:
for f in *_1.txt; do                       # find files such as "xxx_1.txt"
    basename=${f%_*}                       # extract the "xxx" portion
    if [[ -f ${basename}_2.txt ]]; then    # make sure "xxx_2.txt" exists
        file1="${basename}_1.txt"          # assign bash variable file1
        file2="${basename}_2.txt"          # assign bash variable file2
        both="${basename}_foundinboth.txt"
        uniq1="${basename}_uniquetofile1.txt"
        uniq2="${basename}_uniquetofile2.txt"
        awk -v both="$both" -v uniq1="$uniq1" -v uniq2="$uniq2" '
        # pass the shell variables to awk with the -v option
        FNR==NR { b[$1] = $1; next }
        {
            if ($1 in b) {
                print $1 > both
                seen[$1]++                 # mark the line as found in file1
            } else {
                print $1 > uniq1
            }
        }
        END {
            for (i in b) {
                if (!seen[i]) {            # the line was not found in file1
                    print i > uniq2        # so it is unique to file2
                }
            }
        }' "$file2" "$file1"
    fi
done
Please note that the lines in *_uniquetofile2.txt do not keep the original order.
If you need them to, please sort them yourself or let me know.
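If you do need the original order, a minimal sketch (meant to run inside the loop body right after the awk call, reusing its variables, and assuming a temporary file name) is to re-read xxx_2.txt and keep only the lines that ended up in the unique-to-file2 list:

awk 'FNR==NR { keep[$1]; next } $1 in keep' "$uniq2" "$file2" > "${uniq2}.tmp" &&
    mv "${uniq2}.tmp" "$uniq2"

Because the second pass reads "$file2" from top to bottom, the filtered output comes out in that file's original order.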
I have read all the answers to similar problems but they are not working for me, because my files are not uniform: they contain several control headers, and in such a case it is safer to create a script than a one-liner, while all the answers focus on one-liners. In theory one-liner commands should be convertible to a script, but I am struggling to achieve:
printing the control headers
printing only the records starting with 16 in <file 1> whose column 2 value does NOT exist in column 2 of <file 2>
I ended up with this:
BEGIN {
FS="\x01";
OFS="\x01";
RS="\x02\n";
ORS="\x02\n";
file1=ARGV[1];
file2=ARGV[2];
count=0;
}
/^#/ {
print;
count++;
}
# reset counters after control headers
NR=1;
FNR=1;
# Below gives syntax error
/^16/ AND NR==FNR {
a[$2];next; 'FNR==1 || !$2 in a' file1 file2
}
END {
}
Googling only gives me results for command-line processing, and the documentation is also silent in that regard. Does that mean it cannot be done?
Perhaps try:
script.awk:
BEGIN {
OFS = FS = "\x01"
ORS = RS = "\x02\n"
}
NR==FNR {
if (/^16/) a[$2]
next
}
/^16/ && !($2 in a) || /^#/
Note the parentheses: !$2 in a would be parsed as (!$2) in a
Invoke with:
awk -f script.awk FILE2 FILE1
Note order of FILE1 / FILE2 is reversed; FILE2 must be read first to pre-populate the lookup table.
First of all, the short answer to my question should be "NOT POSSIBLE"; anyone who read the question carefully and knew AWK in full would see that as the obvious answer. I wish I had known it sooner instead of wasting a few days trying to write the script.
Also, there is no such thing as a minimal reproducible example (this was always a constant pain on TeX groups) - I need a full working example. If it works on 1 row there is no guarantee it works on 2 rows, and my number of rows is ~127 million.
If you read the code carefully, then you would know what is not working - I put a comment on the section that gives the syntax error. Anyway, as #Daweo suggested, there is no way to use AND as a logical operator in the pattern section (awk uses &&). So, because we don't need any printing for the first file, the whole trick is to do the conditional in the second block:
awk -F, 'NR==FNR{a[$1]; next} !($1 in a){ if (/^16/) print $0 }' set1.txt set2.txt
assuming in the above example that the separator is a comma. I don't know where the assumption that multiple-character RS is supported only in GNU awk came from; on macOS BSD awk it works exactly the same, and in fact RS="\x02\n" is a single separator, not two separators.
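For completeness, a sketch that also prints the control headers, combining the one-liner above with the header rule from the earlier answer (set1.txt, set2.txt and the comma separator are carried over from the example):

awk -F, '
    NR==FNR { a[$1]; next }     # first file: remember the keys
    /^#/    { print; next }     # always print control headers
    /^16/ && !($1 in a)         # print 16-records whose key is missing from the first file
' set1.txt set2.txt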
I'm using Bash, and I have a directory of .tsv files containing behavioral data (RT and accuracy) for different subjects, with multiple sessions per subject. My goal is to concatenate the RT field (field 3 of each .tsv file) and the accuracy field (field 9) across all these files into a single .tsv file, adding the subject and session (defined from the directory names) as new variables each time I append a file, so that the subject-session information stays together with the RT and accuracy data.
To illustrate, each .tsv file has the following header, describing the fields in every row:
V1 V2 RT V4 V5 V6 V7 V8 ACC
I want to loop through many of these files, extract just the RT and ACC fields, and add the data in these fields to a new .tsv file called "summary.tsv", with SUB and SES as new variables:
SUB SES RT ACC
Here's the code I have so far:
subdir=~/path/to/subdir
for subs in ${subdir}/subject-*; do
    sub=$(basename ${subs})
    for sess in ${sub}/session-*; do
        ses=$(basename ${ses})
        for files in ${sess}/*.tsv; do
            if [[ -e $files ]] && [[ -e ${outdir}/summary.tsv ]]; then
                awk 'NR > 1 {print $3,$9}' ${files} >> ${outdir}/summary.tsv
            fi
            if [[ -e $files ]] && [[ ! -e ${outdir}/summary.tsv ]]; then
                awk '{print $3,$9}' ${files} > ${outdir}/summary.tsv
            fi
        done
    done
done
This works fine for concatenating the files into summary.tsv without repeating each file's header, but what I can't figure out is how to add 2 new variables, with the same length as the appended output, to the "awk 'NR > 1 {print $3,$9}' ${files} >> ${outdir}/summary.tsv" line, containing the corresponding ${sub} and ${ses} values in the 1st and 2nd fields.
Any suggestions? Thank you so much in advance.
Your script has a number of issues, but the answer to your actual question is
awk -v subj="$sub" -v ses="$ses" 'BEGIN { OFS="\t" }
NR>1 { print subj, ses, $3, $9 }'
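A sketch of how that might slot into your existing append line, keeping the $files and $outdir variables from your script:

awk -v subj="$sub" -v ses="$ses" '
    BEGIN { OFS="\t" }
    NR > 1 { print subj, ses, $3, $9 }' "$files" >> "${outdir}/summary.tsv"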
Awk can read many files so the innermost loop is unnecessary. Here is a tentative refactoring.
for subs in ~/path/to/subdir/subject-*; do
    sub=$(basename "$subs")
    for sess in "$subs"/session-*; do
        ses=$(basename "$sess")
        awk -v subj="$sub" -v ses="$ses" '
            BEGIN { OFS="\t" }
            FNR>1 { print subj, ses, $3, $9 }' \
            "$sess"/*.tsv
    done
done >> "$outdir"/summary.tsv
I would recommend against having headers in the output file at all, but if you need a header line, writing one before the main script should be easy enough.
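For instance, something like this before the loops would do (a sketch, assuming the tab-separated column order printed above):

printf 'SUB\tSES\tRT\tACC\n' > "$outdir"/summary.tsv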
If your directory structure is this simple (and you don't have hundreds of thousands of files, so that passing a single wildcard to Awk will not produce an "argument list too long" error) you could probably simplify all the loops into a single Awk script. The current file name is in the FILENAME variable; pulling out the bottom two parent directories with a simple regex or split() should be straightforward, too.
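A sketch of that single-script approach, assuming the subject-*/session-*/*.tsv layout from the question (the split() indices would need adjusting for a different nesting depth):

awk '
    BEGIN { OFS = "\t" }
    FNR > 1 {
        n = split(FILENAME, p, "/")      # e.g. .../subject-01/session-02/run1.tsv
        print p[n-2], p[n-1], $3, $9     # subject dir, session dir, RT, ACC
    }
' ~/path/to/subdir/subject-*/session-*/*.tsv > "$outdir"/summary.tsv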
My input.csv file is semicolon separated, with the first line being a header for attributes. The first column contains customer numbers. The function is being called through a script that I activate from the terminal.
I want to delete all lines containing the customer numbers that are entered as arguments for the script. EDIT: And then export the file as a different file, while keeping the original intact.
bash deleteCustomers.sh 1 3 5
Currently only the last argument is filtered from the csv file. I understand that this is happening because the output file gets overwritten each time the loop runs, which undoes all previous deletions.
How can I match all the lines to be deleted, and then delete them (or print everything BUT those lines), and then output it to one file containing ALL edits?
delete_customers () {
    echo "These customers will be deleted: "$@""
    for i in "$@";
    do
        awk -F ";" -v customerNR=$i -v input="$inputFile" '($1 != customerNR) NR > 1 { print }' "input.csv" > output.csv
    done
}
delete_customers "$@"
Here's some sample input (the first line is the header line of the csv file). In the output CSV file I want the same formatting, with the lines for the given customers completely deleted.
Klantnummer;Nationaliteit;Geslacht;Title;Voornaam;MiddleInitial;Achternaam;Adres;Stad;Provincie;Provincie-voluit;Postcode;Land;Land-voluit;email;gebruikersnaam;wachtwoord;Collectief ;label;ingangsdatum;pakket;aanvullende verzekering;status;saldo;geboortedatum
1;Dutch;female;Ms.;Josanne;S;van der Rijst;Bliek 189;Hellevoetsluis;ZH;Zuid-Holland;3225 XC;NL;Netherlands;JosannevanderRijst#dayrep.com;Sourawaspen;Lae0phaxee;Klant;CZ;11-7-2010;best;tand1;verleden;-137;30-12-1995
2;Dutch;female;Mrs.;Inci;K;du Bois;Castorweg 173;Hengelo;OV;Overijssel;7557 KL;NL;Netherlands;InciduBois#gustr.com;Hisfireeness;jee0zeiChoh;Klant;CZ;30-8-2015;goed ;geen;verleden;188;1-8-1960
3;Dutch;female;Mrs.;Lusanne;G;Hijlkema;Plutostraat 198;Den Haag;ZH;Zuid-Holland;2516 AL;NL;Netherlands;LusanneHijlkema#dayrep.com;Digum1969;eiTeThun6th;Klant;Achmea;12-2-2010;best;mix;huidig;-335;9-3-1973
4;Dutch;female;Dr.;Husna;M;Hoegee;Tiendweg 89;Ameide;ZH;Zuid-Holland;4233 VW;NL;Netherlands;HusnaHoegee#fleckens.hu;Hatimon;goe5OhS4t;Klant;VGZ;9-8-2015;goed ;gezin;huidig;144;12-8-1962
5;Dutch;male;Mr.;Sieds;D;Verspeek;Willem Albert Scholtenstraat 38;Groningen;GR;Groningen;9711 XA;NL;Netherlands;SiedsVerspeek#armyspy.com;Thade1947;Taexiet9zo;Intern;CZ;17-2-2004;beter;geen;verleden;-49;12-10-1961
6;Dutch;female;Ms.;Nazmiye;R;van Spronsen;Noorderbreedte 180;Amsterdam;NH;Noord-Holland;1034 PK;NL;Netherlands;NazmiyevanSpronsen#jourrapide.com;Whinsed;Oz9ailei;Intern;VGZ;17-6-2003;beter;mix;huidig;178;8-3-1974
7;Dutch;female;Ms.;Livia;X;Breukers;Everlaan 182;Veenendaal;UT;Utrecht;3903
Try this in a loop:
awk -v variable=$var '$1 != variable' input.csv
awk - makes decisions based on columns
-v - passes a shell variable into the awk command
variable - stores the value for awk to process
$var - the specific string to search for at run time
!= - checks that the field does not match
input.csv - your input file
This is standard awk behavior: with -v, awk works with the variable at run time and produces output that does not contain the value you passed. This way, you get all the lines that do not match your variable. Hope this is helpful. :)
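If you want to stay with a per-argument loop, a sketch that keeps all the deletions is to start from a copy of the input and filter that copy once per argument (tmp.csv here is just a hypothetical scratch file):

cp input.csv output.csv
for i in "$@"; do
    awk -F';' -v customerNR="$i" '$1 != customerNR' output.csv > tmp.csv && mv tmp.csv output.csv
done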
Thanks
This bash script should work:
#!/bin/bash
FILTER="!/(^"$(echo "$@" | sed -e "s/ /\|^/g")")/ {print}"
awk "$FILTER" input.csv > output.csv
The idea is to build a relevant awk FILTER and then use it.
Assuming the call parameters are: 1 2 3, the filter will be: !/(^1|^2|^3)/ {print}
!: to invert matching
^: Beginning of the line
The input data are in the input.csv file and output result will be in the output.csv file.
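An alternative sketch that avoids building a regex and compares the whole first field instead, so deleting customer 1 cannot also drop customers 10, 11, and so on (the function and file names mirror the question):

delete_customers () {
    # "$*" joins the script arguments into one string, e.g. "1 3 5"
    awk -F';' -v ids="$*" '
        BEGIN { n = split(ids, list, " "); for (i = 1; i <= n; i++) del[list[i]] }
        NR == 1 || !($1 in del)          # keep the header row and every customer not listed
    ' input.csv > output.csv
}
delete_customers "$@"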
I'm trying to figure out how to print, using pure awk, the lines that match the count number provided by a while count loop in bash. Here are some lines of the input.
NODE_1_posplwpl
NODE_1_owkokwo
NODE_1_kslkow
NODE_2_fbjfh
NODE_2_lsmlsm
NODE_3_Loskos
NODE_3_pospls
What I want to do is print the lines whose second field matches the count number provided by the while count loop, into a file named file_${count}_test.
So a file called "file_1_test" will contain the lines with "NODE_1..", "file_2_test" will contain the lines with "NODE_2..", and so on for all the lines of the file.
Here's my code.
#! /bin/bash
while read CNAME
do
    let count=$count+1
    grep "^${CNAME}_" > file_${count}_test
    awk -v X=$count '{ FS="_" } { if ($2 == X) print $0 }' > file_${count}_test
done <$1
exit 1
This code creates only file_1_test, which is empty, so the awk condition seems to be wrong.
Looks like you're trying to split your input into separate files named based on the number between the underscores. That'd just be:
awk -F'_' '{print > ("file_" $2 "_test")}' file
You may need to change it to:
awk -F'_' '$2!=prev{close(out); out="file_" $2 "_test"} {print > out; prev=$2}' file
if you're generating a lot of output files and not using GNU awk as that could lead to a "too many open files" error.
wrt your comments below, look:
$ cat file
NODE_1_posplwpl
NODE_1_owkokwo
NODE_1_kslkow
NODE_2_fbjfh
NODE_2_lsmlsm
NODE_3_Loskos
NODE_3_pospls
$ awk -F'_' '{print $0 " > " ("file_" $2 "_test")}' file
NODE_1_posplwpl > file_1_test
NODE_1_owkokwo > file_1_test
NODE_1_kslkow > file_1_test
NODE_2_fbjfh > file_2_test
NODE_2_lsmlsm > file_2_test
NODE_3_Loskos > file_3_test
NODE_3_pospls > file_3_test
Just change $0 " > " to > , as in the first script, to have the output actually go to the separate files instead of just showing you what would happen, as this last script does.
I have two big files with a lot of text, and what I have to do is keep all lines in file A that have a field that matches a field in file B.
file A is something like:
Name (tab) # (tab) # (tab) KEYFIELD (tab) Other fields
For file B, I managed to use cut and sed and other things to basically get it down to one field, so it is just a list.
So the goal is to keep all lines in file A whose 4th field (the KEYFIELD) matches one of the lines in file B. (It does NOT have to be an exact match, so if file B had Blah and file A said Blah_blah, that would be ok.)
I tried to do:
grep -f fileBcutdown fileA > outputfile
EDIT: Ok I give up. I just force killed it.
Is there a better way to do this? File A is 13.7MB and file B after cutting it down is 32.6MB for anyone that cares.
EDIT: This is an example line in file A:
chr21 33025905 33031813 ENST00000449339.1 0 - 33031813 33031813 0 3 1835,294,104, 0,4341,5804,
example line from file B cut down:
ENST00000111111
Here's one way using GNU awk. Run like:
awk -f script.awk fileB.txt fileA.txt
Contents of script.awk:
FNR==NR {
array[$0]++
next
}
{
line = $4
sub(/\.[0-9]+$/, "", line)
if (line in array) {
print
}
}
Alternatively, here's the one-liner:
awk 'FNR==NR { array[$0]++; next } { line = $4; sub(/\.[0-9]+$/, "", line); if (line in array) print }' fileB.txt fileA.txt
GNU awk can also perform the pre-processing of fileB.txt that you described using cut and sed. If you would like me to build this into the above script, you will need to provide an example of what this line looks like.
UPDATE using files HumanGenCodeV12 and GenBasicV12:
Run like:
awk -f script.awk HumanGenCodeV12 GenBasicV12 > output.txt
Contents of script.awk:
FNR==NR {
gsub(/[^[:alnum:]]/,"",$12)
array[$12]++
next
}
{
line = $4
sub(/\.[0-9]+$/, "", line)
if (line in array) {
print
}
}
This successfully prints lines in GenBasicV12 that can be found in HumanGenCodeV12. The output file (output.txt) contains 65340 lines. The script takes less than 10 seconds to complete.
You're hitting the limit of the basic shell tools. Assuming about 40 characters per line, File A has 400,000 lines in it and File B has about 1,200,000 lines in it. You're basically running grep for each line in File A and having grep plow through 1,200,000 lines with each execution. That's 480 BILLION lines you're parsing through. Unix tools are surprisingly quick, but even something fast done 480 billion times will add up.
You would be better off using a full programming scripting language like Perl or Python. You put all lines in File B in a hash. You take each line in File A, check to see if that fourth field matches something in the hash.
Reading in a few hundred thousand lines? Creating a 10,000,000 entry hash? Perl can parse both of those in a matter of minutes.
Something -- off the top of my head. You didn't give us much in the way of specs, so I didn't do any testing:
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);
# Create your index
open my $file_b, "<", "file_b.txt";
my %index;
while (my $line = <$file_b>) {
    chomp $line;
    $index{$line} = $line;    # Or however you do it...
}
close $file_b;

#
# Now check against file_a.txt
#
open my $file_a, "<", "file_a.txt";
while (my $line = <$file_a>) {
    chomp $line;
    my @fields = split /\s+/, $line;
    if (exists $index{$fields[3]}) {
        say "Line: $line";
    }
}
close $file_a;
The hash means you only have to read through file_b once instead of 400,000 times. Start the program, go grab a cup of coffee from the office kitchen. (Yum! non-dairy creamer!) By the time you get back to your desk, it'll be done.
grep -f seems to be very slow even for medium sized pattern files (< 1MB). I guess it tries every pattern for each line in the input stream.
A solution, which was faster for me, was to use a while loop. This assumes that fileA is reasonably small (it is the smaller one in your example), so iterating multiple times over the smaller file is preferable to iterating over the larger file multiple times.
while read line; do
grep -F "$line" fileA
done < fileBcutdown > outputfile
Note that this loop will output a line several times if it matches multiple patterns. To work around this limitation use sort -u, but this might be slower by quite a bit. You have to try.
while read line; do
grep -F "$line" fileA
done < fileBcutdown | sort -u > outputfile
If you depend on the order of the lines, then I don't think you have any other option than using grep -f. But basically it boils down to trying m*n pattern matches.
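One more thing that may be worth trying before giving up on a single pass (a guess, since it depends on your grep and on the patterns in fileBcutdown being literal strings, which they appear to be): fixed-string matching with -F is often far faster than the default regex matching, and it preserves the order of fileA:

grep -F -f fileBcutdown fileA > outputfile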
use the below command:
awk 'FNR==NR{a[$0];next}($4 in a)' <your filtered fileB with single field> fileA