insert column with same row content to csv in cli - linux

I am having a csv to which I need to add a new column at the end and add a certain string to all rows of the csv in the newly added column.
Example csv:
os,num1,alpha1
Unix,10,A
Linux,30,B
Solaris,40,C
Fedora,20,D
Ubuntu,50,E
I tried using awk command and did not get expected result. I am not sure whether indexing or counting column number is right.
awk -F'[[:null:]]' '$2 && !$1{ $4="NA" }1'
Expected result is:
os,num1,alpha1,code
Unix,10,A,NA
Linux,30,B,NA
Solaris,40,C,NA
Fedora,20,D,NA
Ubuntu,50,E,NA

You can use sed:
sed 's/$/,NA/' db1.csv > db2.csv
then edit the first line containing the column titles.

I'm not quite sure how you came up w/ that awk statement of yours, why you'd think that your file has NUL-terminated lines or that [[:null:]] has become a valid character class ...
The following, however, will do your bidding:
awk 'NR==1{print $0",code"}; NR>1{print $0",NA"}' example.csv
os,num1,alpha1,code
Unix,10,A,NA
Linux,30,B,NA
Solaris,40,C,NA
Fedora,20,D,NA
Ubuntu,50,E,NA

Related

Using awk to delete multiple lines using argument passed on via function

My input.csv file is semicolon separated, with the first line being a header for attributes. The first column contains customer numbers. The function is being called through a script that I activate from the terminal.
I want to delete all lines containing the customer numbers that are entered as arguments for the script. EDIT: And then export the file as a different file, while keeping the original intact.
bash deleteCustomers.sh 1 3 5
Currently only the last argument is filtered from the csv file. I understand that this is happening because the output file gets overwritten each time the loop runs, restoring all previously deleted arguments.
How can I match all the lines to be deleted, and then delete them (or print everything BUT those lines), and then output it to one file containing ALL edits?
delete_customers () {
echo "These customers will be deleted: "$#""
for i in "$#";
do
awk -F ";" -v customerNR=$i -v input="$inputFile" '($1 != customerNR) NR > 1 { print }' "input.csv" > output.csv
done
}
delete_customers "$#"
Here's some sample input (first piece of code is the first line in the csv file). In the output CSV file I want the same formatting, with the lines for some customers completely deleted.
Klantnummer;Nationaliteit;Geslacht;Title;Voornaam;MiddleInitial;Achternaam;Adres;Stad;Provincie;Provincie-voluit;Postcode;Land;Land-voluit;email;gebruikersnaam;wachtwoord;Collectief ;label;ingangsdatum;pakket;aanvullende verzekering;status;saldo;geboortedatum
1;Dutch;female;Ms.;Josanne;S;van der Rijst;Bliek 189;Hellevoetsluis;ZH;Zuid-Holland;3225 XC;NL;Netherlands;JosannevanderRijst#dayrep.com;Sourawaspen;Lae0phaxee;Klant;CZ;11-7-2010;best;tand1;verleden;-137;30-12-1995
2;Dutch;female;Mrs.;Inci;K;du Bois;Castorweg 173;Hengelo;OV;Overijssel;7557 KL;NL;Netherlands;InciduBois#gustr.com;Hisfireeness;jee0zeiChoh;Klant;CZ;30-8-2015;goed ;geen;verleden;188;1-8-1960
3;Dutch;female;Mrs.;Lusanne;G;Hijlkema;Plutostraat 198;Den Haag;ZH;Zuid-Holland;2516 AL;NL;Netherlands;LusanneHijlkema#dayrep.com;Digum1969;eiTeThun6th;Klant;Achmea;12-2-2010;best;mix;huidig;-335;9-3-1973
4;Dutch;female;Dr.;Husna;M;Hoegee;Tiendweg 89;Ameide;ZH;Zuid-Holland;4233 VW;NL;Netherlands;HusnaHoegee#fleckens.hu;Hatimon;goe5OhS4t;Klant;VGZ;9-8-2015;goed ;gezin;huidig;144;12-8-1962
5;Dutch;male;Mr.;Sieds;D;Verspeek;Willem Albert Scholtenstraat 38;Groningen;GR;Groningen;9711 XA;NL;Netherlands;SiedsVerspeek#armyspy.com;Thade1947;Taexiet9zo;Intern;CZ;17-2-2004;beter;geen;verleden;-49;12-10-1961
6;Dutch;female;Ms.;Nazmiye;R;van Spronsen;Noorderbreedte 180;Amsterdam;NH;Noord-Holland;1034 PK;NL;Netherlands;NazmiyevanSpronsen#jourrapide.com;Whinsed;Oz9ailei;Intern;VGZ;17-6-2003;beter;mix;huidig;178;8-3-1974
7;Dutch;female;Ms.;Livia;X;Breukers;Everlaan 182;Veenendaal;UT;Utrecht;3903
Try this in loop..
awk -v variable=$var '$1 != variable' input.csv
awk - to make decision based on columns
-v - to use a variable into a awk command
variable - store the value for awk to process
$var - to search for a specific string in run-time
!= - to check if not exist
input.csv - your input file
It's awk's behavior, when you use -v it can will work with variable on run-time and provide an output that doesn't contain the value you passed. This way, you get all the values that are not matching to your variable. Hope this is helpful. :)
Thanks
This bash script should work:
!/bin/bash
FILTER="!/(^"$(echo "$#" | sed -e "s/ /\|^/g")")/ {print}"
awk "$FILTER" input.csv > output.csv
The idea is to build an awk relevant FILTER and then use it.
Assuming the call parameters are: 1 2 3, the filter will be: !/(^1|^2|^3)/ {print}
!: to invert matching
^: Beginning of the line
The input data are in the input.csv file and output result will be in the output.csv file.

Removing two columns from csv without removing the column heading

Been stuck on this for a while, managed to remove two columns completely from it but now I need to remove two columns (3 in total) within the 1 column heading. I've attached a snippit from my csv file.
timestamp;CPU;%usr;%nice;%sys;%iowait;%steal;%irq;%soft;%guest;%idle
2014-09-17 10-20-39 UTC;-1;6.53;0.00;4.02;0.00;0.00;0.00;0.00;0.00;89.45
2014-09-17 10-20-41 UTC;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
2014-09-17 10-20-43 UTC;-1;1.98;0.00;1.98;5.45;0.00;0.50;0.00;0.00;90.10
2014-09-17 10-20-45 UTC;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
2014-09-17 10-20-47 UTC;-1;0.50;0.00;1.50;0.00;0.00;0.00;0.00;0.00;98.00
2014-09-17 10-20-49 UTC;-1;0.50;0.00;1.01;3.02;0.00;0.00;0.00;0.00;95.48
What I'm wanting to do is remove yyyy-mm-dd and also UTC, leaving just 10-20-39 underneath the timestamp column heading. I've tried removing them but I can't seem to do it without taking out the headings.
Thanks to anyone who can help me with this
A perl way:
perl -pe 's/^.+? (.+?) .+?;/$1;/ if $.>1' file
Explanation
The -pe means "print each line after applying the script to it". The script itself simply substitutes identifies the 3 first non-whitespace words and replaces them with the 2nd of the three ($1 since the pattern was captured). This is only run if the current line number ($.) is greater than 1.
An awk way
awk -F';' '(NR>1){sub(/[^ ]* /,"",$1); sub(/ [^ ]*$/,"",$1)}1;' OFS=";" file
Here, we set the input field delimiter to ; and use sub() to remove the 1st and last word from the 1st field.
This following sed command works for you:
sed '1!s/^[^ ]\+ //;1!s/ UTC//'
Explanations:
1! Do not apply to the first line.
s/^[^ ]\+ // Remove the first group of non-space characters at line beginning ("2014-09-17 " in your case).
s/ UTC// Remove the string " UTC".
Assuming the csv file is stored as a.csv, then
sed '1!s/^[^ ]\+ //;1!s/ UTC//' < a.csv
prints the results to standard output, and
sed '1!s/^[^ ]\+ //;1!s/ UTC//' < a.csv > b.csv
saves the result to b.csv.
EDITED:
Added: sample results:
[pengyu#GLaDOS tmp]$ sed '1!s/^[^ ]\+ //;1!s/ UTC//' < a.csv
timestamp;CPU;%usr;%nice;%sys;%iowait;%steal;%irq;%soft;%guest;%idle
10-20-39;-1;6.53;0.00;4.02;0.00;0.00;0.00;0.00;0.00;89.45
10-20-41;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
10-20-43;-1;1.98;0.00;1.98;5.45;0.00;0.50;0.00;0.00;90.10
10-20-45;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
10-20-47;-1;0.50;0.00;1.50;0.00;0.00;0.00;0.00;0.00;98.00
10-20-49;-1;0.50;0.00;1.01;3.02;0.00;0.00;0.00;0.00;95.48

Maths on multiple fields on seperate lines using awk

I've been doing some maths on a 3 field x 2 line file like this:
3216.01 2724.81 1708.25
1762.48 617.436 1650.79
My question is how do i refer to the first field on the first line and in the same calculation, refer to the first field in the second row?
And just for completion: I'm planning on taking $1 (line1) and minusing $1 (line2), then squaring and doing the same for the other columns and finally summing this value.
this line works with your requirement:
awk 'NR==1{for(i=1;i<=NF;i++)a[i]=$i}
NR==2{for(i=1;i<=NF;i++)s+=(a[i]-$i)^2; printf "sum: %.3f",s}' file
result:
sum: 6557076.288
Note
this should work with dynamic number of columns, but exactly two lines
the output format is %.3f, you could change it if you like
The codes could be shorten, since there are two similar for-loop structures
EDIT
as suggested by EdMorton, the above codes could be written as:
awk 'NR==1{split($0,a)}
NR==2{for(i=1;i<=NF;i++)s+=(a[i]-$i)^2; printf "sum: %.3f\n",s}' file
very good suggestion, I didn't think of the split... thank Ed!
I would normally story it in some temporary variable
awk 'NR>1 {print $1-a} {a=$1}' inputfile

Bash CSV sorting and unique-ing

a Linux question: I have the CSV file data.csv with the following fields and values
KEY,LEVEL,DATA
2.456,2,aaa
2.456,1,zzz
0.867,2,bbb
9.775,4,ddd
0.867,1,ccc
2.456,0,ttt
...
The field KEY is a float value, while LEVEL is an integer. I know that the first field can have repeated values, as well as the second one, but if you take them together you have a unique couple.
What I would like to do is to sort the file according to the column KEY and then for each unique value under KEY keep only the row having the higher value under LEVEL.
Sorting is not a problem:
$> sort -t, -k1,2 data.csv # fields: KEY,LEVEL,DATA
0.867,1,ccc
0.867,2,bbb
2.456,0,ttt
2.456,1,zzz
2.456,2,aaa
9.775,4,ddd
...
but then how can I filter the rows so that I get what I want, which is:
0.867,2,bbb
2.456,2,aaa
9.775,4,ddd
...
Is there a way to do it using command line tools like sort, uniq, awk and so on? Thanks in advance
try this line:
your sort...|awk -F, 'k&&k!=$1{print p}{p=$0;k=$1}END{print p}'
output:
kent$ echo "0.867,1,bbb
0.867,2,ccc
2.456,0,ttt
2.456,1,zzz
2.456,2,aaa
9.775,4,ddd"|awk -F, 'k&&k!=$1{print p}{p=$0;k=$1}END{print p}'
0.867,2,ccc
2.456,2,aaa
9.775,4,ddd
The idea is, because your file is already sorted, just go through the file/input from top, if the first column (KEY) changed, print the last line, which is the highest value of LEVEL of last KEY
try with your real data, it should work.
also the whole logic (with your sort) could be done by awk in single process.
Use:
$> sort -r data.csv | uniq -w 5 | sort
given your floats are formatted "0.000"-"9.999"
Perl solution:
perl -aF, -ne '$h{$F[0]} = [#F[1,2]] if $F[1] > $h{$F[0]}[0]
}{
print join ",", $_, #{$h{$_}} for sort {$a<=>$b} keys %h' data.csv
Note that the result is different from the one you requested, the first line contains bbb, not ccc.

Remove lines with duplicate cells

I need to remove lines with a duplicate value. For example I need to remove line 1 and 3 in the block below because they contain "Value04" - I cannot remove all lines containing Value03 because there are lines with that data that are NOT duplicates and must be kept. I can use any editor; excel, vim, any other Linux command lines.
In the end there should be no duplicate "UserX" values. User1 should only appear 1 time. But if User1 exists twice, I need to remove the entire line containing "Value04" and keep the one with "Value03"
Value01,Value03,User1
Value02,Value04,User1
Value01,Value03,User2
Value02,Value04,User2
Value01,Value03,User3
Value01,Value03,User4
Your ideas and thoughts are greatly appreciated.
Edit: For clarity and leaving words out from the editing process.
The following Awk command removes all but the first occurrence of a value in the third column:
$ awk -F',' '{
if (!seen[$3]) {
seen[$3] = 1
print
}
}' textfile.txt
Output:
Value01,Value03,User1
Value01,Value03,User2
Value01,Value03,User3
Value01,Value03,User4
same thing in Perl:
perl -F, -nae 'print unless $c{$F[2]}++;' textfile.txt
this uses autosplit mode: "-F, -a" splits by comma and places the result into #F array

Resources