Using awk to delete multiple lines using arguments passed via a function - linux

My input.csv file is semicolon separated, with the first line being a header for attributes. The first column contains customer numbers. The function is being called through a script that I activate from the terminal.
I want to delete all lines containing the customer numbers that are entered as arguments for the script. EDIT: And then export the file as a different file, while keeping the original intact.
bash deleteCustomers.sh 1 3 5
Currently only the last argument is filtered from the csv file. I understand that this happens because the output file gets overwritten on each loop iteration, undoing the previous deletions.
How can I match all the lines to be deleted, and then delete them (or print everything BUT those lines), and then output it to one file containing ALL edits?
delete_customers () {
    echo "These customers will be deleted: "$@""
    for i in "$@";
    do
        awk -F ";" -v customerNR=$i -v input="$inputFile" '($1 != customerNR) NR > 1 { print }' "input.csv" > output.csv
    done
}
delete_customers "$@"
Here's some sample input (the first line is the header line of the csv file). In the output CSV file I want the same formatting, with the lines for some customers completely deleted.
Klantnummer;Nationaliteit;Geslacht;Title;Voornaam;MiddleInitial;Achternaam;Adres;Stad;Provincie;Provincie-voluit;Postcode;Land;Land-voluit;email;gebruikersnaam;wachtwoord;Collectief ;label;ingangsdatum;pakket;aanvullende verzekering;status;saldo;geboortedatum
1;Dutch;female;Ms.;Josanne;S;van der Rijst;Bliek 189;Hellevoetsluis;ZH;Zuid-Holland;3225 XC;NL;Netherlands;JosannevanderRijst#dayrep.com;Sourawaspen;Lae0phaxee;Klant;CZ;11-7-2010;best;tand1;verleden;-137;30-12-1995
2;Dutch;female;Mrs.;Inci;K;du Bois;Castorweg 173;Hengelo;OV;Overijssel;7557 KL;NL;Netherlands;InciduBois#gustr.com;Hisfireeness;jee0zeiChoh;Klant;CZ;30-8-2015;goed ;geen;verleden;188;1-8-1960
3;Dutch;female;Mrs.;Lusanne;G;Hijlkema;Plutostraat 198;Den Haag;ZH;Zuid-Holland;2516 AL;NL;Netherlands;LusanneHijlkema#dayrep.com;Digum1969;eiTeThun6th;Klant;Achmea;12-2-2010;best;mix;huidig;-335;9-3-1973
4;Dutch;female;Dr.;Husna;M;Hoegee;Tiendweg 89;Ameide;ZH;Zuid-Holland;4233 VW;NL;Netherlands;HusnaHoegee#fleckens.hu;Hatimon;goe5OhS4t;Klant;VGZ;9-8-2015;goed ;gezin;huidig;144;12-8-1962
5;Dutch;male;Mr.;Sieds;D;Verspeek;Willem Albert Scholtenstraat 38;Groningen;GR;Groningen;9711 XA;NL;Netherlands;SiedsVerspeek#armyspy.com;Thade1947;Taexiet9zo;Intern;CZ;17-2-2004;beter;geen;verleden;-49;12-10-1961
6;Dutch;female;Ms.;Nazmiye;R;van Spronsen;Noorderbreedte 180;Amsterdam;NH;Noord-Holland;1034 PK;NL;Netherlands;NazmiyevanSpronsen#jourrapide.com;Whinsed;Oz9ailei;Intern;VGZ;17-6-2003;beter;mix;huidig;178;8-3-1974
7;Dutch;female;Ms.;Livia;X;Breukers;Everlaan 182;Veenendaal;UT;Utrecht;3903

Try this in a loop:
awk -v variable="$var" '$1 != variable' input.csv
awk - makes the decision based on columns
-v - passes a shell variable into the awk command
variable - stores the value for awk to process
$var - the specific string to search for at run time
!= - checks for non-matching values
input.csv - your input file
This is awk's behavior: when you use -v, it works with the variable at run time and produces output that doesn't contain the value you passed. This way, you get all the lines that do not match your variable. Hope this is helpful. :)
Thanks
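For reference, here is how that one-liner can be chained over several IDs without each pass overwriting the last; a minimal sketch, in which the output and temp file names are assumptions:
cp input.csv output.csv
for i in "$@"; do
    # Each pass filters the result of the previous pass, so all deletions accumulate.
    awk -F ";" -v customerNR="$i" '$1 != customerNR' output.csv > tmp.csv
    mv tmp.csv output.csv
done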

This bash script should work:
#!/bin/bash
FILTER="!/(^"$(echo "$@" | sed -e "s/ /\|^/g")")/ {print}"
awk "$FILTER" input.csv > output.csv
The idea is to build an awk relevant FILTER and then use it.
Assuming the call parameters are: 1 2 3, the filter will be: !/(^1|^2|^3)/ {print}
!: to invert matching
^: Beginning of the line
The input data are in the input.csv file and output result will be in the output.csv file.
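One caveat: ^1 also matches customer numbers such as 10 or 123. If that matters, the same construction can anchor each ID to the ; field separator; a sketch of that variant:
#!/bin/bash
# As above, but each alternative becomes ^ID; so that ID 1 does not also match 10, 123, ...
FILTER="!/(^"$(echo "$@" | sed -e "s/ /;\\|^/g")";)/ {print}"
awk "$FILTER" input.csv > output.csv
For the call parameters 1 2 3, this filter becomes: !/(^1;|^2;|^3;)/ {print}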

Related

Using awk to add a different value to a new variable at every "append" instance

I'm using Bash, and I have a directory of .tsv files containing different behavioral data (RT and accuracy) for different subjects and multiple sessions within the same subjects. My goal is to concatenate the RT field (in field 3 of each .tsv file) and the accuracy field (in field 9) across all these files into a single .tsv file, while adding the subject and session (defined based on the directory names) as new variables in this concatenated file every time I append a new file, so I can keep together the subject-session data with the RT and accuracy data.
To illustrate, each .tsv file has the following header row:
V1 V2 RT V4 V5 V6 V7 V8 ACC
I want to look through many of these files, extracting just the RT and ACC fields and adding the data in these fields to a new .tsv file with SUB and SES as new variables in a file called "summary.tsv":
SUB SES RT ACC
Here's the code I have so far:
subdir=~/path/to/subdir
for subs in ${subdir}/subject-*; do
sub=$(basename ${subs})
for sess in ${sub}/session-*; do
ses=$(basename ${ses})
for files in ${sess}/*.tsv; do
if [[ -e $files ]] && [[ -e ${outdir}/summary.tsv ]] ; then
awk 'NR > 1 {print $3,$9}' ${files} >> ${outdir}/summary.tsv
fi
if [[ -e $files ]] && [[ ! -e ${outdir}/summary.tsv ]] ; then
awk '{print $3,$9}' ${files} > ${outdir}/summary.tsv
fi
done
done
done
This works fine to concatenate files into the summary.tsv file without repeating each file's header, but what I can't figure out is how to add 2 new variables with the same length as the appended output in the "awk 'NR > 1 {print $3,$9}' ${files} >> ${outdir}/summary.tsv" line, containing the corresponding ${sub} and ${ses} variables in the 1st and 2nd fields.
Any suggestions? Thank you so much in advance.
Your script has a number of issues, but the answer to your actual question is
awk -v subj="$sub" -v ses="$ses" 'BEGIN { OFS="\t" }
NR>1 { print subj, ses, $3, $9 }'
Awk can read many files so the innermost loop is unnecessary. Here is a tentative refactoring.
for subs in ~/path/to/subdir/subject-*; do
sub=$(basename "$subs")
for sess in "$subs"/session-*; do
ses=$(basename "$sess")
awk -v subj="$sub" -v ses="$ses" '
BEGIN { OFS="\t" }
FNR>1 { print subj, ses, $3, $9 }' \
"$sess"/*.tsv
done
done >> "$outdir"/summary.tsv
I would recommend against having headers in the output file at all, but if you need a header line, writing one before the main script should be easy enough.
If your directory structure is this simple (and you don't have hundreds of thousands of files, so that passing a single wildcard to Awk will not produce a "command line too long" error) you could probably simplify all the loops into a single Awk script. The current file name is in the FILENAME variable; pulling out the bottom two parent directories with a simple regex or split() should be straightforward, too.
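A sketch of that single-script idea, assuming the layout is .../subject-*/session-*/file.tsv (the header fields are taken from the question):
awk 'BEGIN { OFS="\t"; print "SUB", "SES", "RT", "ACC" }
FNR > 1 {
    # FILENAME looks like .../subject-01/session-02/data.tsv,
    # so the two directories above the file give SUB and SES.
    n = split(FILENAME, parts, "/")
    print parts[n-2], parts[n-1], $3, $9
}' ~/path/to/subdir/subject-*/session-*/*.tsv > summary.tsv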

BASH - Extract Data from String

I have a log that returns thousands of lines of data, I want to extract a few values from that.
In the log there is only one line containing the unique unit reference, so I can grep for that using:
grep "unit=Central-C152" logfile.txt
That produces a line of output similar to the following:
a3cd23e,85d58f5,53f534abef7e7,unit=Central-C152,locale=32325687-8595-9856-1236-12546975,11="School",1="Mr Green",2="Qual",3="SWE",8="report",5="channel",7="reset",6="velum"
The format of the line may change in that the order of the values won't always be in the same position.
I'm trying to work out how to get the value of 2 and 7 in to separate variables.
I had thought about cut on , or = but as the values aren't in a set order I couldn't work out the best way to do it.
I'm trying to get:
var state=value of 2 without quotes
var mode=value of 7 without quotes
Can anyone advise on the best way to do this?
Thanks
Could you please try the following to create the variables' values.
state=$(awk '/unit=Central-C152/ && match($0,/2=\"[^"]*/){print substr($0,RSTART+3,RLENGTH-3)}' Input_file)
mode=$(awk '/unit=Central-C152/ && match($0,/7=\"[^"]*/){print substr($0,RSTART+3,RLENGTH-3)}' Input_file)
You could print them too by doing following.
echo "$state"
echo "$mode"
Explanation: adding an explanation of the command now.
awk ' ##Starting awk program here.
/unit=Central-C152/ && match($0,/2=\"[^"]*/){ ##Checking if a line contains the string (unit=Central-C152), and using match with a REGEX to match from 2=" up to just before the closing quote.
print substr($0,RSTART+3,RLENGTH-3) ##Printing the substring starting at RSTART+3, RLENGTH-3 characters long.
}
' Input_file ##Mentioning Input_file name here.
You are probably better off doing all of the processing in Awk.
awk -F, '/unit=Central-C152/ {
for(i=1;i<=NF;++i)
if($i ~ /^[27]="/) {
b[++k] = $i
sub(/^[27]="/, "", b[k])
sub(/"$/, "", b[k])
gsub(/\\/, "", b[k])
}
print "state " b[1] ", mode " b[2]
}' logfile.txt
This presupposes that the fields always occur in the same order (2 before 7). You may need to change or disable the gsub that removes backslashes from the values.
If you want to do more than print the values, refactoring whatever Bash code you have into Awk is often a better approach than doing this processing in Bash.
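If you do want the two values back in Bash variables, a single Awk pass can feed them to read; a minimal sketch, assuming the two values contain no whitespace:
read -r state mode < <(awk -F, '/unit=Central-C152/ {
    for (i = 1; i <= NF; ++i) {
        if ($i ~ /^2="/) s = $i
        if ($i ~ /^7="/) m = $i
    }
    sub(/^2="/, "", s); sub(/"$/, "", s)    # 2="Qual"  -> Qual
    sub(/^7="/, "", m); sub(/"$/, "", m)    # 7="reset" -> reset
    print s, m
    exit
}' logfile.txt)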
Assuming you already have the line in a variable such as with:
line="$(grep 'unit=Central-C152' logfile.txt | head -1)"
You can then simply use the built-in parameter substitution features of bash:
f2=${line#*2=\"} ; f2=${f2%%\"*} ; echo ${f2}
f7=${line#*7=\"} ; f7=${f7%%\"*} ; echo ${f7}
The first command on each line strips off the first part of the line, up to and including the <field-number>=". The second command then strips off everything from the next quote onward. The third, of course, simply echoes the value.
When I run those commands against your input line, I see:
Qual
reset
which is, from what I can see, what you were after.

extract sequences from multifasta file by ID in file using awk

I would like to extract the sequences from the multifasta file that match the IDs given by a separate list of IDs.
FASTA file seq.fasta:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11605
TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
CCTGTTCGGGCGCCACTGCTAG
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
>7P58X:01334:11635
TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
GAGCG
>7P58X:01336:11621
ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
CTAGCAGTAGAGGAGATCTCCTCGACGCAGGACT
IDs file id.txt:
7P58X:01332:11636
7P58X:01334:11613
I want to get the fasta file with only those sequences matching the IDs in the id.txt file:
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
I really like the awk approach I found in answers here and here, but the code given there is still not working perfectly for the example I gave. Here is why:
(1)
awk -v seq="7P58X:01332:11636" -v RS='>' '$1 == seq {print RS $0}' seq.fasta
this code works well for the multiline sequences, but the IDs have to be inserted into the code one at a time.
(2)
awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' id.txt seq.fasta
this code can take the IDs from the id.txt file but returns only the first line of the multiline sequences.
I guess the good thing would be to modify the RS variable in code (2), but all of my attempts have failed so far. Can anybody please help me with that?
$ awk -F'>' 'NR==FNR{ids[$0]; next} NF>1{f=($2 in ids)} f' id.txt seq.fasta
>7P58X:01332:11636
TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
CAAGTCCCTGTTCGGGCGCC
>7P58X:01334:11613
ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
The following awk may help you with the same.
awk 'FNR==NR{a[$0];next} /^>/{val=$0;sub(/^>/,"",val);flag=val in a?1:0} flag' ids.txt fasta_file
I'm facing a similar problem. The size of my multi-fasta file is ~ 25G.
I use sed instead of awk, though my solution is an ugly hack.
First, I extracted the line number of the title of each sequence to a data file.
grep -n ">" multi-fasta.fa > multi-fasta.idx
What I got is something like this:
1:>DM_0000000004
5:>DM_0000000005
11:>DM_0000000007
19:>DM_0000000008
23:>DM_0000000009
Then, I extracted the wanted sequence by its title, e.g. DM_0000000004, using the script below.
seqnm=$1
idx0_idx1=`grep -n $seqnm multi-fasta.idx`
idx0=`echo $idx0_idx1 | cut -d ":" -f 1`
idx0plus1=`expr $idx0 + 1`
idx1=`echo $idx0_idx1 | cut -d ":" -f 2`
idx2=`head -n $idx0plus1 multi-fasta.idx | tail -1 | cut -d ":" -f 1`
idx2minus1=`expr $idx2 - 1`
sed ''"$idx1"','"$idx2minus1"'!d' multi-fasta.fa > ${seqnm}.fasta
For example, I want to extract the sequence of DM_0000016115. The idx0_idx1 variable gives me:
7507:42520:>DM_0000016115
7507 (idx0) is the line number of line 42520:>DM_0000016115 in multi-fasta.idx.
42520 (idx1) is the line number of line >DM_0000016115 in multi-fasta.fa.
idx2 is the line number of the sequence title right beneath the wanted one (>DM_0000016115).
At last, using sed, we can extract the lines from idx1 through idx2 minus 1, which are the title and the sequence. (If every sequence had a fixed number of lines, you could use grep -A instead.)
The advantage of this ugly hack is that it does not require a specific number of lines for each sequence in the multi-fasta file.
What bothers me is that this process is slow. For my 25G multi-fasta file, such an extraction takes tens of seconds. However, it's much faster than using samtools faidx.
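Since the question itself asks about modifying RS in approach (2), here is one way that can look; a sketch, assuming the IDs in id.txt match the header lines exactly:
# Read id.txt with the default RS first, then switch to ">"-separated records
# for the fasta file; there $1 is the ID line and the rest is the sequence.
awk 'NR == FNR { ids[$0]; next }
     $1 in ids { printf ">%s", $0 }' id.txt RS='>' FS='\n' seq.fasta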

Awk script to put a value in a column based on another column's value

I am trying to use the script below to replace column values.
But the script below is huge, with around 33000 such rows,
so when I run it I get the error "Argument list too long".
Please let me know another way to do it.
if($33="100000000"){$36="EA"}
if($33="100000001"){$36="EA"}
if($33="100000002"){$36="EA"}
if($33="100000003"){$36="EA"}
if($33="100000004"){$36="EA"}
if($33="100000005"){$36="EA"}
if($33="100000006"){$36="EA"}
if($33="100000007"){$36="EA"}
if($33="100000008"){$36="EA"}
if($33="100000009"){$36="EA"}
if($33="100000010"){$36="EA"}
if($33="100000011"){$36="EA"}
if($33="100000012"){$36="EA"}
if($33="100000013"){$36="EA"}
if($33="100000014"){$36="EA"}
if($33="100000015"){$36="EA"}
if($33="100000016"){$36="EA"}
if($33="100000017"){$36="EA"}
if($33="100000018"){$36="EA"}
if($33="100000019"){$36="EA"}
if($33="100000020"){$36="EA"}
sample input file
SourceIifier|SourleName|GntCode|Dision|Suvision|ProfitCe1|Profie2|Plade|Retuiod|SuppliN|DocType|Suppe|Docummber|Docte|Originer|OrigDate|CRDST|LineNumber|CustoN|UINorComposition|OriginaN|Custoame|Custoe|BillTe|Shite|POS|PortCode|ShippingBillNumber|ShippingBillDate|FOB|ExportDuty|HSNorSAC|ProductCode|ProductDescription|Categorduct|UnitOement|Quantity|Taxabue|Integratede|Integratount|Centraate|CentralTt|StaURate|StateUTTaxAmount|CessRateAdvalorem|CessAmountAdvalorem|CessRateSpecific|CessAmountSpecific|Invoalue|ReverseChargeFlag|TCSFlag|eComGSTIN|ITCFlag|ReasonForCreditDebitNote|AccountingVoucmber|Accountinate|Userdefinedfield1|Userdefinedfield2|Userdefinedfield3|Additionalfield1|Additionalfield2|Additlfield3|Additionalfield4|Additionalfield5
SAP|SAP_OSR_INV|||||||date+%m%Y|08AAACT2T|IN|EXPWT|262881626|02.02.2018||||10||||TVVAHALI|1151040011|8|8|8||||||9984|EVD0|EVDCOCOaterial|||0|8.47|0|0|9|0.76|9|0.76|||||||||||1301312397||ZEVD|1210||||||0
SAP|SAP_OSR_INV|||||||date+%m%Y|08AAACZT|IV|EXPWT|2627|02.02.2018||||10||||TVVHALI|1151040011|8|8|8||||||9984|EVD0|EVDCOAMaterial|||0|8.47|0|0|9|0.76|9|0.76|||||||||||130139||ZEVD|1210||||||0
SAP|SAP_OSR_INV|||||||date+%m%Y|08AAAZT|NV|AN|2628|02.02.2018||||20||||TVHVAISHALI|1151040011|8|8|8||||||9984|EVD0|EVDCOCOCDMAMaterial|||0|8.47|0|0|9|0.76|9|0.76|||||||||||13014||ZEVD|1210||||||0
My code:
awk -F"|" -v OFS="|" '{
if($33="100000000"){$36="EA"}
if($33="100000001"){$36="EA"}
if($33="100000002"){$36="EA"}
if($33="100000003"){$36="EA"}
if($33="100000004"){$36="EA"}
if($33="100000005"){$36="EA"}
if($33="100000006"){$36="EA"}
if($33="100000007"){$36="EA"}
if($33="100000008"){$36="EA"}
if($33="100000009"){$36="EA"}
if($33="100000010"){$36="EA"}1' inputfile > outputfile
The code above is just a sample; the actual script has around 33000 rows.
Below is a sample awk script:
BEGIN {
FS="|";
OFS="|";
}
{
if($33="100000000"){$36="EA"}
if($33="100000001"){$36="EA"}
if($33="100000002"){$36="EA"}
if($33="100000003"){$36="EA"}
if($33="100000004"){$36="EA"}
if($33="100000005"){$36="EA"}1 inputfile > outputfile
and called it like below:
awk -f script.awk
Below is the error from calling the awk script.
awk: fpostp.awk:33445: if($36=="M") {$36="MTR"}} TFinaloutputp7_6_3_d_OYMNC_w.csv > TFinaloutputt_w36.csv
awk: fpostp.awk:33445: ^ syntax error
awk: fpostp.awk:33445: if($36=="M") {$36="MTR"}} TFinaloutputp7_6_3_d_OYMNC_w.csv > TFinaloutputt_w36.csv
awk: fpostp.awk:33445: ^ syntax error
Can't I redirect output to some other file when executing with awk -f script.awk?
Programming is not like that: if writing it out gets boring, there is bound to be another way.
awk -F"|" -v OFS="|" '$33>=100 && $33<200{$36="EA";print} $33>=200 && $33<300{$36="FB";print}' inputfile > outputfile
First, awk is a pattern-matching language: on rows where the pattern outside the curly braces matches, it does what is inside the curly braces.
There is no need for the if syntax, as it is inherent.
The patterns can be compound, and awk knows what numbers are without being told (and does math arbitrarily well).
I shortened the values in $33 and made up what and where $36 becomes,
but in general you write one statement per $36 value, covering a range of $33.
If that is not your goal, the question will need some refining.
Edit:
Maybe you are setting $36 to a constant based on an arbitrary condition involving $33 which only you know, and there are lots of them ... in a file somewhere.
(I am pretending you have the list isolated in a file named filter.list)
so maybe something like
awk -F"|" -v OFS="|" 'FNR==NR{filter[$1]++; next} $33 in filter {$36="EA"} 1' filter.list inputfile > outputfile
FNR is the current file's row number and NR is the overall row number;
they are only equal for the first file,
so it is used here to treat the first file differently from the second.
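If those 33000 if-lines are the only place the values exist, the filter file itself can be recovered from them; a sketch, assuming every line looks like the samples above:
# Pull the quoted $33 value out of each if-line of the old script.
sed -n 's/.*\$33="\([0-9]*\)".*/\1/p' fpostp.awk > filter.list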

rearranging columns based on condition

I have a *.csv file. with value as below
"ASDP02","8801942183589"
"ASDP06","8801939151023"
"CSDP04","8801963981740"
"ASDP09","8801946305047"
"ASDP12","8801941195677"
"ASDP05","8801922826186"
"CSDP08","8801983008938"
"ASDP04","8801944346555"
"CSDP11","8801910831518"
or sometimes the value is as below
"8801989353984","KSDP05"
"8801957608165","ASDP11"
"8801991455848","CSDP10"
"8801981363116","CSDP07"
"8801921247870","KSDP07"
"8801965386240","CSDP06"
"8801956293036","KSDP10"
"8801984383904","KSDP11"
"8801944211742","ASDP09"
I just want to put the numeric value (e.g. 8801989353984) always in the 1st column. Is it possible using a BASH script?
Sed is also your friend here
Input
cat 41189347
"ASDP02","8801942183589"
"ASDP06","8801939151023"
"CSDP04","8801963981740"
"ASDP09","8801946305047"
"ASDP12","8801941195677"
"ASDP05","8801922826186"
"CSDP08","8801983008938"
"ASDP04","8801944346555"
"CSDP11","8801910831518"
Script
sed -E 's/^("[[:alpha:]]+.*"),("[[:digit:]]+")$/\2,\1/' 41189347
Output
"8801942183589","ASDP02"
"8801939151023","ASDP06"
"8801963981740","CSDP04"
"8801946305047","ASDP09"
"8801941195677","ASDP12"
"8801922826186","ASDP05"
"8801983008938","CSDP08"
"8801944346555","ASDP04"
"8801910831518","CSDP11"
awk to the rescue!
$ awk -F, -v OFS=, '$1~/[A-Z]/{t=$2;$2=$1;$1=t}1' file
If the first field has alpha chars, swap the first and second columns and print.
Bash can do the work, but awk might be a better choice for rearranging your file:
sample.csv:
"ASDP02","8801942183589"
"8801944211742","ASDP09"
command:
awk -F, 'BEGIN{OFS=","}{$1=$1;if(substr($1, 2, length($1) - 2) + 0 == substr($1, 2, length($1) - 2)){print $1,$2}else{print $2,$1}}' sample.csv
substr($1, 2, length($1) - 2) + 0 == substr($1, 2, length($1) - 2) checks whether the column is numeric (the substr strips the surrounding quotes). If it is, print the original line; otherwise switch column 1 and column 2.
Output:
"8801942183589","ASDP02"
"8801944211742","ASDP09"
You can create a pure bash script to generate another file that has the structure you need:
#!/bin/bash
csv_file="/path/to/your/csvfile"
output_file="/path/to/output_file"
#Optional
rm -rf "${output_file}"
readarray -t LINES < <(cat < "${csv_file}" 2> /dev/null)
for item in "${LINES[@]}"; do
if [[ $item =~ ^\"([0-9A-Z]+)\"\,\"([0-9]+)\" ]]; then
echo "\"${BASH_REMATCH[2]}\",\"${BASH_REMATCH[1]}\"" >> "${output_file}"
else
echo "$item" >> "${output_file}"
fi
done
This works even if your file is "mixed", that is, with some lines in the right format and other lines in the bad format.
The following commands assume that the cells in the CSV files do not contain newlines or commas. Otherwise, you should write a more complicated script in Perl, PHP, or another programming language capable of parsing CSV files properly. But Bash, definitely, is not appropriate for this task.
Perl
perl -F, -nle '@F = reverse @F if $F[0] =~ /^"\d+"$/;
print join(",", @F)' file
Beware: if the cells contain newlines or commas, use Perl's Text::CSV module, for instance. Although it is a simple task in Perl, it goes beyond the scope of the current question.
The command splits the input lines by commas (-F,) and stores the result into the @F array, for each line. The items in the array are reversed if the first field $F[0] matches the regular expression. You can also swap the items this way: ($F[0], $F[1]) = ($F[1], $F[0]).
Finally, it joins the array items with commas and prints to the standard output.
If you want to edit the file in-place, use -i option: perl -i.backup -F, ....
AWK
awk -F, -vOFS=, '/^"[0-9]+",/ {print; next}
{ t = $1; $1 = $2; $2 = t; print }' file
The input and output field separators are set to , with -F, and -vOFS=,.
If the line matches the pattern /^"[0-9]+",/ (the line begins with a "numeric" CSV column), the script prints the record and advances to the next record. Otherwise the next block is executed.
In the next block, it swaps the first two columns and prints the result to the standard output.
If you want to edit the file in-place, see answers to this question.
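(The usual idiom is to write to a temporary file and move it over the original; a sketch:)
tmp=$(mktemp)
awk -F, -vOFS=, '/^"[0-9]+",/ {print; next}
{ t = $1; $1 = $2; $2 = t; print }' file > "$tmp" && mv "$tmp" file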
