Print number of lines matching conditions in files generated by count - linux

I'm trying to figure out how to print, using pure awk, the lines that satisfy the count number provided by a while count loop in bash. Here are some lines of the input.
NODE_1_posplwpl
NODE_1_owkokwo
NODE_1_kslkow
NODE_2_fbjfh
NODE_2_lsmlsm
NODE_3_Loskos
NODE_3_pospls
What I want to do is print the lines whose second field matches the count number provided by the while count loop into a file named file_${count}_test.
So a file called "file_1_test" will contain the lines with "NODE_1..", "file_2_test" will contain the lines with "NODE_2..", and so on for all the lines of the file.
Here's my code.
#! /bin/bash
while read CNAME
do
    let count=$count+1
    grep "^${CNAME}_" > file_${count}_test
    awk -v X=$count '{ FS="_" } { if ($2 == X) print $0 }' > file_${count}_test
done <$1
exit 1
This code creates only file_1_test, which is empty, so the awk condition seems to be wrong.

Looks like you're trying to split your input into separate files named based on the number between the underscores. That'd just be:
awk -F'_' '{print > ("file_" $2 "_test")}' file
You may need to change it to:
awk -F'_' '$2!=prev{close(out); out="file_" $2 "_test"} {print > out; prev=$2}' file
if you're generating a lot of output files and not using GNU awk as that could lead to a "too many open files" error.
wrt your comments below, look:
$ cat file
NODE_1_posplwpl
NODE_1_owkokwo
NODE_1_kslkow
NODE_2_fbjfh
NODE_2_lsmlsm
NODE_3_Loskos
NODE_3_pospls
$ awk -F'_' '{print $0 " > " ("file_" $2 "_test")}' file
NODE_1_posplwpl > file_1_test
NODE_1_owkokwo > file_1_test
NODE_1_kslkow > file_1_test
NODE_2_fbjfh > file_2_test
NODE_2_lsmlsm > file_2_test
NODE_3_Loskos > file_3_test
NODE_3_pospls > file_3_test
Just change $0 " > " to > like in the first script to have the output go to the separate files, instead of just showing you what would happen as this last script does.
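As a quick sanity check, running the first one-liner on the sample input and then looking at one of the generated files would give something like this (a sketch; the file names follow the example above):
$ awk -F'_' '{print > ("file_" $2 "_test")}' file
$ cat file_2_test
NODE_2_fbjfh
NODE_2_lsmlsm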

Related

How to print the value in the third column of a line which comes after a line that contains a specific string, using AWK, to a different file?

I have an output which contains something like this in the middle.
Stopping criterion = max iterations
Energy initial, next-to-last, final =
-83909.5503696 -86748.8150981 -86748.8512012
What I am trying to do is print out the last value (3rd column) of the line that comes after the line containing the string "Energy" to a different file, and I have to print out these values from 100 different files. Currently I have been trying with this line, which only looks at a single file:
awk -F: '/Energy/ { getline; print $0 }' inputfile > outputfile
but this gives output like:
-83909.5503696 -86748.8150981 -86748.8512012
Update - With the help of a suggestion below I was able to output the value to a file, but as it reads through different files it overwrites the final output file, leaving only the value from the last file it read. What I tried was this,
#SBATCH --array=1-100
num=$SLURM_ARRAY_TASK_ID..
fold=$(printf '%03d' $num)
cd $main_path/surf_$fold
awk 'f{print $3; f=0} /Energy/{f=1}' inputfile > outputfile
This would not be an appropriate job for getline, see http://awk.freeshell.org/AllAboutGetline. And idk why you're setting FS to : with -F: when your fields are space-separated, as awk assumes by default.
Here's how to do what I think you're trying to do with 1 call to awk:
awk 'f{print $3; f=0} /Energy/{f=1}' "$main_path/surf_"*"/inputfile" > outputfile
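If each SLURM array task still needs its own output file rather than one combined file, a minimal per-task variant could be (a sketch reusing the question's variables; the output name energy_$fold.txt is an assumption):
#SBATCH --array=1-100
fold=$(printf '%03d' "$SLURM_ARRAY_TASK_ID")
# one input and one output per task, so nothing gets overwritten
awk 'f{print $3; f=0} /Energy/{f=1}' "$main_path/surf_$fold/inputfile" > "energy_$fold.txt"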

Comparing columns in two different files and producing three outputs

I have multiple paired files with headings xxx_1.txt and xxx_2.txt, yyy_1.txt and yyy_2.txt, etc. They are single column files with the following format:
xxx_1.txt:
#CHROM_POSREFALT
MSHR1153_annotated_1_9107CA
MSHR1153_annotated_1_9197CT
MSHR1153_annotated_1_9303TC
MSHR1153_annotated_1_10635GA
MSHR1153_annotated_1_10836AG
MSHR1153_annotated_1_11108AG
MSHR1153_annotated_1_11121GA
MSHR1153_annotated_1_11123CT
MSHR1153_annotated_1_11131CT
MSHR1153_annotated_1_11155AG
MSHR1153_annotated_1_11166CT
MSHR1153_annotated_1_11186TC
MSHR1153_annotated_1_11233TG
MSHR1153_annotated_1_11274GT
MSHR1153_annotated_1_11472CG
MSHR1153_annotated_1_11814GA
MSHR1153_annotated_1_11815CT
xxx_2.txt:
LocationMSHR1153_annotatedMSHR0491_Australasia
MSHR1153_annotated_1_56TC
MSHR1153_annotated_1_226AG
MSHR1153_annotated_1_670AG
MSHR1153_annotated_1_817CT
MSHR1153_annotated_1_1147TC
MSHR1153_annotated_1_1660TC
MSHR1153_annotated_1_2488AG
MSHR1153_annotated_1_2571GA
MSHR1153_annotated_1_2572TC
MSHR1153_annotated_1_2698TC
MSHR1153_annotated_1_2718TG
MSHR1153_annotated_1_3018TC
MSHR1153_annotated_1_3424TC
MSHR1153_annotated_1_3912CT
MSHR1153_annotated_1_4013GA
MSHR1153_annotated_1_4087GC
MSHR1153_annotated_1_4878CT
MSHR1153_annotated_1_5896GA
MSHR1153_annotated_1_7833TG
MSHR1153_annotated_1_7941CT
MSHR1153_annotated_1_8033GA
MSHR1153_annotated_1_8888AC
MSHR1153_annotated_1_9107CA
MSHR1153_annotated_1_9197CT
They are actually much longer than this. My goal is to compare each line and produce multiple outputs for the purpose of creating a Venn diagram later on. So I need one file which lists all the lines in common, which looks like this (in this case there is only one):
MSHR1153_annotated_1_9107CA
One file that lists everything specific to xxx_1 and one file which lists everything specific to xxx_2.
I have so far come up with this:
awk ' FNR==NR { position[$1]=$1; next} {if ( $1 in position ) {print $1 > "foundinboth"} else {print $1 > "uniquetofile1"}} ' FILE2 FILE1
The problem is I now have over 300 paired files to run through, and if I use this I have to change them manually each time. It also doesn't produce all the files at the same time. Is there a way to loop through and change everything automatically? The files are paired so that the suffix at the end is different, "_1" and "_2". I need it to loop through each paired file and produce everything I need at the same time.
Would you please try the following:
for f in *_1.txt; do                          # find files such as "xxx_1.txt"
    basename=${f%_*}                          # extract "xxx" portion
    if [[ -f ${basename}_2.txt ]]; then       # make sure "xxx_2.txt" exists
        file1="${basename}_1.txt"             # assign bash variable file1
        file2="${basename}_2.txt"             # assign bash variable file2
        both="${basename}_foundinboth.txt"
        uniq1="${basename}_uniquetofile1.txt"
        uniq2="${basename}_uniquetofile2.txt"
        awk -v both="$both" -v uniq1="$uniq1" -v uniq2="$uniq2" '
        # pass the variables to AWK with -v option
        FNR==NR { b[$1]=$1; next }
        {
            if ($1 in b) {
                print $1 > both
                seen[$1]++                    # mark if the line is found in file1
            } else {
                print $1 > uniq1
            }
        }
        END {
            for (i in b) {
                if (! seen[i]) {              # the line is not found in file1
                    print i > uniq2           # then it is unique to file2
                }
            }
        }' "$file2" "$file1"
    fi
done
Please note that the lines in *_uniquetofile2.txt do not keep the original order.
If you need them to, please try to sort them for yourself or let me know.
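If the original order of the file2-only lines matters, one option (a sketch, not part of the script above) is to make a second pass instead of looping over the array in END, since awk's for (i in b) loop has no defined order:
awk 'FNR==NR { seen[$1]; next } !($1 in seen)' "$file1" "$file2" > "$uniq2"
Here file1 is read first to build a lookup table, and the lines of file2 that never appear in it are printed in their original order.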

Using awk to add a different value to a new variable at every "append" instance

I'm using Bash, and I have a directory of .tsv files containing different behavioral data (RT and accuracy) for different subjects and multiple sessions within the same subjects. My goal is to concatenate the RT field (in field 3 of each .tsv file) and the accuracy field (in field 9) across all these files into a single .tsv file, while adding the subject and session (defined based on the directory names) as new variables in this concatenated file every time I append a new file, so I can keep together the subject-session data with the RT and accuracy data.
To illustrate, each .tsv file has the following header in every row:
V1 V2 RT V4 V5 V6 V7 V8 ACC
I want to look through many of these files, extracting just the RT and ACC fields and adding the data in these fields to a new .tsv file with SUB and SES as new variables in a file called "summary.tsv":
SUB SES RT ACC
Here's the code I have so far:
subdir=~/path/to/subdir
for subs in ${subdir}/subject-*; do
    sub=$(basename ${subs})
    for sess in ${sub}/session-*; do
        ses=$(basename ${ses})
        for files in ${sess}/*.tsv; do
            if [[ -e $files ]] && [[ -e ${outdir}/summary.tsv ]] ; then
                awk 'NR > 1 {print $3,$9}' ${files} >> ${outdir}/summary.tsv
            fi
            if [[ -e $files ]] && [[ ! -e ${outdir}/summary.tsv ]] ; then
                awk '{print $3,$9}' ${files} > ${outdir}/summary.tsv
            fi
        done
    done
done
This works fine to concatenate files into the summary.tsv file without repeating each file's header, but what I can't figure out is how to add 2 new variables with the same length as the appended output in the "awk 'NR > 1 {print $3,$9}' ${files} >> ${outdir}/summary.tsv" line, containing the corresponding ${sub} and ${ses} variables in the 1st and 2nd fields.
Any suggestions? Thank you so much in advance.
Your script has a number of issues, but the answer to your actual question is
awk -v subj="$sub" -v ses="$ses" 'BEGIN { OFS="\t" }
NR>1 { print subj, ses, $3, $9 }'
Awk can read many files so the innermost loop is unnecessary. Here is a tentative refactoring.
for subs in ~/path/to/subdir/subject-*; do
    sub=$(basename "$subs")
    for sess in "$subs"/session-*; do
        ses=$(basename "$sess")
        awk -v subj="$sub" -v ses="$ses" '
            BEGIN { OFS="\t" }
            FNR>1 { print subj, ses, $3, $9 }' \
            "$sess"/*.tsv
    done
done >> "$outdir"/summary.tsv
I would recommend against having headers in the output file at all, but if you need a header line, writing one before the main script should be easy enough.
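For example, a single line before the loop would do it (a sketch; the column names follow the desired output shown in the question):
printf 'SUB\tSES\tRT\tACC\n' > "$outdir"/summary.tsv
The loop above then appends to that file with >> as before.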
If your directory structure is this simple (and you don't have hundreds of thousands of files, so that passing a single wildcard to Awk will not produce a "command line too long" error) you could probably simplify all the loops into a single Awk script. The current file name is in the FILENAME variable; pulling out the bottom two parent directories with a simple regex or split() should be straightforward, too.
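A sketch of that single-script variant, assuming the subject-*/session-*/*.tsv layout from the question:
awk 'BEGIN { OFS="\t" }
FNR > 1 {
    n = split(FILENAME, parts, "/")
    # parts[n] is the file name itself, so parts[n-2] and parts[n-1]
    # are the subject-* and session-* directories it lives in
    print parts[n-2], parts[n-1], $3, $9
}' ~/path/to/subdir/subject-*/session-*/*.tsv >> "$outdir"/summary.tsv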

Using awk to delete multiple lines using argument passed on via function

My input.csv file is semicolon separated, with the first line being a header for attributes. The first column contains customer numbers. The function is being called through a script that I activate from the terminal.
I want to delete all lines containing the customer numbers that are entered as arguments for the script. EDIT: And then export the file as a different file, while keeping the original intact.
bash deleteCustomers.sh 1 3 5
Currently only the last argument is filtered from the csv file. I understand that this is happening because the output file gets overwritten each time the loop runs, restoring all previously deleted arguments.
How can I match all the lines to be deleted, and then delete them (or print everything BUT those lines), and then output it to one file containing ALL edits?
delete_customers () {
    echo "These customers will be deleted: "$@""
    for i in "$@";
    do
        awk -F ";" -v customerNR=$i -v input="$inputFile" '($1 != customerNR) NR > 1 { print }' "input.csv" > output.csv
    done
}
delete_customers "$@"
Here's some sample input (first piece of code is the first line in the csv file). In the output CSV file I want the same formatting, with the lines for some customers completely deleted.
Klantnummer;Nationaliteit;Geslacht;Title;Voornaam;MiddleInitial;Achternaam;Adres;Stad;Provincie;Provincie-voluit;Postcode;Land;Land-voluit;email;gebruikersnaam;wachtwoord;Collectief ;label;ingangsdatum;pakket;aanvullende verzekering;status;saldo;geboortedatum
1;Dutch;female;Ms.;Josanne;S;van der Rijst;Bliek 189;Hellevoetsluis;ZH;Zuid-Holland;3225 XC;NL;Netherlands;JosannevanderRijst@dayrep.com;Sourawaspen;Lae0phaxee;Klant;CZ;11-7-2010;best;tand1;verleden;-137;30-12-1995
2;Dutch;female;Mrs.;Inci;K;du Bois;Castorweg 173;Hengelo;OV;Overijssel;7557 KL;NL;Netherlands;InciduBois@gustr.com;Hisfireeness;jee0zeiChoh;Klant;CZ;30-8-2015;goed ;geen;verleden;188;1-8-1960
3;Dutch;female;Mrs.;Lusanne;G;Hijlkema;Plutostraat 198;Den Haag;ZH;Zuid-Holland;2516 AL;NL;Netherlands;LusanneHijlkema@dayrep.com;Digum1969;eiTeThun6th;Klant;Achmea;12-2-2010;best;mix;huidig;-335;9-3-1973
4;Dutch;female;Dr.;Husna;M;Hoegee;Tiendweg 89;Ameide;ZH;Zuid-Holland;4233 VW;NL;Netherlands;HusnaHoegee@fleckens.hu;Hatimon;goe5OhS4t;Klant;VGZ;9-8-2015;goed ;gezin;huidig;144;12-8-1962
5;Dutch;male;Mr.;Sieds;D;Verspeek;Willem Albert Scholtenstraat 38;Groningen;GR;Groningen;9711 XA;NL;Netherlands;SiedsVerspeek@armyspy.com;Thade1947;Taexiet9zo;Intern;CZ;17-2-2004;beter;geen;verleden;-49;12-10-1961
6;Dutch;female;Ms.;Nazmiye;R;van Spronsen;Noorderbreedte 180;Amsterdam;NH;Noord-Holland;1034 PK;NL;Netherlands;NazmiyevanSpronsen@jourrapide.com;Whinsed;Oz9ailei;Intern;VGZ;17-6-2003;beter;mix;huidig;178;8-3-1974
7;Dutch;female;Ms.;Livia;X;Breukers;Everlaan 182;Veenendaal;UT;Utrecht;3903
Try this in a loop:
awk -v variable=$var '$1 != variable' input.csv
awk - to make decisions based on columns
-v - to pass a variable into the awk command
variable - stores the value for awk to process
$var - the specific string to search for at run time
!= - to check that it does not match
input.csv - your input file
This is awk's behavior: when you use -v, awk works with the variable at run time and produces output that doesn't contain the value you passed. This way, you get all the lines that don't match your variable. Hope this is helpful. :)
Thanks
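If you want a single output file with all the customer numbers removed at once, a minimal sketch (not the code above) is to pass the whole argument list to one awk call instead of looping:
delete_customers () {
    # build a lookup table of every customer number passed as an argument
    awk -F ";" -v nrs="$*" 'BEGIN { split(nrs, del, " "); for (i in del) skip[del[i]] = 1 }
        !($1 in skip)' input.csv > output.csv
}
delete_customers "$@"
This keeps input.csv intact, writes everything else to output.csv, and the header line survives because "Klantnummer" is never one of the numbers.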
This bash script should work:
#!/bin/bash
FILTER="!/(^"$(echo "$@" | sed -e "s/ /\|^/g")")/ {print}"
awk "$FILTER" input.csv > output.csv
The idea is to build a relevant awk FILTER and then use it.
Assuming the call parameters are: 1 2 3, the filter will be: !/(^1|^2|^3)/ {print}
!: to invert matching
^: Beginning of the line
The input data are in the input.csv file and output result will be in the output.csv file.

rearranging column based on condition

I have a *.csv file with values as below:
"ASDP02","8801942183589"
"ASDP06","8801939151023"
"CSDP04","8801963981740"
"ASDP09","8801946305047"
"ASDP12","8801941195677"
"ASDP05","8801922826186"
"CSDP08","8801983008938"
"ASDP04","8801944346555"
"CSDP11","8801910831518"
or sometimes the values are as below:
"8801989353984","KSDP05"
"8801957608165","ASDP11"
"8801991455848","CSDP10"
"8801981363116","CSDP07"
"8801921247870","KSDP07"
"8801965386240","CSDP06"
"8801956293036","KSDP10"
"8801984383904","KSDP11"
"8801944211742","ASDP09"
I just want to always put the numeric value (e.g. 8801989353984) in the 1st column. Is this possible using a BASH script?
Sed is also your friend here
Input
cat 41189347
"ASDP02","8801942183589"
"ASDP06","8801939151023"
"CSDP04","8801963981740"
"ASDP09","8801946305047"
"ASDP12","8801941195677"
"ASDP05","8801922826186"
"CSDP08","8801983008938"
"ASDP04","8801944346555"
"CSDP11","8801910831518"
Script
sed -E 's/^("[[:alpha:]]+.*"),("[[:digit:]]+")$/\2,\1/' 41189347
Output
"8801942183589","ASDP02"
"8801939151023","ASDP06"
"8801963981740","CSDP04"
"8801946305047","ASDP09"
"8801941195677","ASDP12"
"8801922826186","ASDP05"
"8801983008938","CSDP08"
"8801944346555","ASDP04"
"8801910831518","CSDP11"
awk to the rescue!
$ awk -F, -v OFS=, '$1~/[A-Z]/{t=$2;$2=$1;$1=t}1' file
If the first field has alpha chars, swap the first and second columns and print.
Bash can do the work, but awk might be a better choice for rearranging your file:
sample.csv:
"ASDP02","8801942183589"
"8801944211742","ASDP09"
command:
awk -F, 'BEGIN{OFS=","}{$1=$1;if(substr($1, 2, length($1) - 2) + 0 == substr($1, 2, length($1) - 2)){print $1,$2}else{print $2,$1}}' sample.csv
substr($1, 2, length($1) - 2) + 0 == substr($1, 2, length($1) - 2) checks whether the column is numeric or not. If it is, print the original line; otherwise switch column 1 and column 2.
Output:
"8801942183589","ASDP02"
"8801944211742","ASDP09"
You can create a pure bash script to generate another file with the structure you need:
#!/bin/bash
csv_file="/path/to/your/csvfile"
output_file="/path/to/output_file"
#Optional
rm -rf "${output_file}"
readarray -t LINES < <(cat < "${csv_file}" 2> /dev/null)
for item in "${LINES[@]}"; do
    if [[ $item =~ ^\"([0-9A-Z]+)\"\,\"([0-9]+)\" ]]; then
        echo "\"${BASH_REMATCH[2]}\",\"${BASH_REMATCH[1]}\"" >> "${output_file}"
    else
        echo "$item" >> "${output_file}"
    fi
done
This works even if your file is "mixed", that is, with some lines in the right format and other lines in the wrong format.
The following commands assume that the cells in the CSV files do not contain newlines or commas. Otherwise, you should write a more complicated script in Perl, PHP, or another programming language capable of parsing CSV files properly. Bash is definitely not appropriate for this task.
Perl
perl -F, -nle '@F = reverse @F if $F[0] !~ /^"\d+"$/;
               print join(",", @F)' file
Beware: if the cells contain newlines or commas, use Perl's Text::CSV module, for instance. Although it is a simple task in Perl, it goes beyond the scope of the current question.
The command splits the input lines on commas (-F,) and stores the result into the @F array for each line. The items in the array are reversed if the first field $F[0] does not match the regular expression (i.e. it is not a quoted number). You can also swap the items this way: ($F[0], $F[1]) = ($F[1], $F[0]).
Finally, it joins the array items with commas and prints to the standard output.
If you want to edit the file in-place, use -i option: perl -i.backup -F, ....
AWK
awk -F, -vOFS=, '/^"[0-9]+",/ {print; next}
{ t = $1; $1 = $2; $2 = t; print }' file
The input and output field separators are set to , with -F, and -vOFS=,.
If the line matches the pattern /^"[0-9]+",/ (the line begins with a "numeric" CSV column), the script prints the record and advances to the next record. Otherwise the next block is executed.
In the next block, it swaps the first two columns and prints the result to the standard output.
If you want to edit the file in-place, see answers to this question.
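For instance, with GNU awk 4.1 or later the same script can edit the file in place (a sketch; -i inplace is gawk-specific):
gawk -i inplace -F, -vOFS=, '/^"[0-9]+",/ {print; next}
             { t = $1; $1 = $2; $2 = t; print }' file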
