I am fairly new to Unix commands, but I have two .csv files whose first columns I would like to compare, either with diff or comm. Every line is different if I compare whole lines, which is why I want to compare only the first column of each file and then have the difference printed out as a number, where a country code should not be counted more than once. The first file also has a header that I want to skip during the comparison.
sample from file1:
iso_code,continent,location,date,total_cases
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,2.0
AFG,Oslo,Norway,2020-09-06,324.0
AZE,Hamburg,Germany,2020-03-30,29.0
sample from file2:
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,5.0
ABW,Chil Ukrain,Aruba,2020-10-06,4449.0
ALB,Upsala,Sweden,2020-08-275.0,
AFG,Afghanistan,,2020-09-06,324.0
The expected output should be "2", as there are two occurrences of the same country code in the two files. Duplicates of the country code should only be counted one time. That is why the expected output should be 2 and not 3.
I have tried multiple solutions:
awk 'NR==FNR{c[$1]++;next};c[$1] == 0' owid-covid-data-filtered.csv owid-covid-data.csv | wc -l
With the awk I get the output: 1
and
diff owid-covid-data.csv owid-covid-data-filtered.csv |cut -d' ' -f1 owid-covid-data-filtered.csv| wc -l
Overall I want the occurrences that are similar in both file1 and file2, column 1.
From the condition c[$1] == 0 in the awk script from the question I assumed you want to print lines from file2 that contain a code that is not present in file1.
As it is now clarified that you want to count the codes that are present in both files, see the end of this answer for the reverse check.
Slight modifications to your script will fix the problems:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1]++ == 0' file1 file2
Option -F, specifies the comma (,) as field separator.
The statement if(NR!=1)c[$1]++; skips the header line of file1 (NR is 1 only on the very first line read).
The post-increment operator in c[$1]++ == 0 will make the condition fail for the second or later occurrence of the same code in file2.
I omit the trailing | wc -l here to show the output lines.
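As a small aside, the post-increment de-duplication can be seen in isolation:
$ printf 'ALB\nALB\nAFG\n' | awk 'c[$1]++ == 0'
ALB
AFG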
I modified file2 to contain two lines with the same code in column 1 that is not present in file1.
With file2 shown here
AND,Europe,Andorra,2020-07-26,897.0
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
ALB,Europe,Albania,2020-08-23,8275.1
ALB,Europe,Albania,2020-08-23,8275.2
AFG,Asia,Afghanistan,2020-09-06,38324.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
and file1 from the question I get this output:
AND,Europe,Andorra,2020-07-26,897.0
ALB,Europe,Albania,2020-08-23,8275.1
(Only the first line with ALB is printed.)
You can also implement the counting in awk instead of using wc -l.
awk -F , 'NR==FNR { if(NR!=1)c[$1]++; next } c[$1]++ == 0 {count++} END {print count}' file1 file2
If you want to print the lines from file2 that contain a code that is present in file1, the script can be modified like this:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1] { c[$1]=0; print}' file1 file2
This prints
ABW,North America,Aruba,2020-03-13,2.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
(Only the first line with code ABW is printed.)
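To get the number the question asks for instead of the lines themselves, the counting from above can be combined with this check (a sketch along the same lines, with count+0 guarding against an empty result):
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next } c[$1] { c[$1]=0; count++ } END { print count+0 }' file1 file2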
Alternative solution as requested in a comment.
tail -n +2 file1|cut -f1 -d,|sort -u>code1
cut -f1 -d, file2|sort -u>code2
fgrep -vf code1 code2
rm code1 code2
Or combined in one command without using temporary files code1 and code2:
fgrep -vf <(tail -n +2 file1|cut -f1 -d,|sort -u) <(cut -f1 -d, file2|sort -u)
Add | wc -l to count the lines instead of printing them.
Explanation:
tail -n +2 print everything starting from the 2nd line
cut -f1 -d, print the first field, delimited with ,
sort -u sort lines and remove duplicates
fgrep -vf code1 code2 print all lines from code2 that do not contain any of the strings from code1 (drop the -v to get the lines that do)
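As a small illustration of the difference, if code1 contains ABW and AFG and code2 contains ABW, AFG and ALB:
$ fgrep -vf code1 code2
ALB
$ fgrep -f code1 code2
ABW
AFG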
To get the occurrences that are similar in both file1 and file2, column 1:
$ awk -F, 'NR==FNR{a[$1];next}$1 in a' file1 file2
Output:
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
I have data set like this:
1 A
1 B
1 C
2 A
2 B
2 C
3 B
3 C
And I have a script which calculates for me:
Number of occurrences in searching string
Number of rows
awk -v search="A" \
'BEGIN{count=0} $2 == search {count++} END{print count "\n" NR}' input
That works perfectly fine.
I would like to add to my awk one liner number of unique lines from the first column.
So the output should be separated by \n:
2
8
3
I can do this in separate awk code, but I am not able to integrate it to my original awk code.
awk '{a[$1]++}END{for(i in a){print i}}' input | wc -l
Any idea how to integrate it into one awk solution without piping?
Looks like you want this:
awk -v search="A" '{a[$1]++}
$2 == search {count++}
END{OFS="\n";print count+0, NR, length(a)}' file
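For example, with the data set from the question saved as file, this prints the three expected numbers on separate lines (length() on an array is a common extension, supported by GNU awk):
$ awk -v search="A" '{a[$1]++} $2 == search {count++} END{OFS="\n"; print count+0, NR, length(a)}' file
2
8
3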
How do I use awk on a file that looks like this:
abcd Z
efdg Z
aqbs F
edf F
aasd A
I want to extract the number of times each letter of the alphabet occurs in the second column, so output should be:
Z 2
F 2
A 1
Try: if you want the output in the same order as the Input_file, then the following may help you.
awk 'FNR==NR{A[$2]++;next} A[$2]{print $2,A[$2];delete A[$2]}' Input_file Input_file
If you don't care about the order of $2, then the following may help you.
awk '{A[$2]++} END{for(i in A){print i,A[i]}}' Input_file
In the first solution the Input_file is read twice, building an array A whose index is $2 and whose value is incremented on each occurrence. When the Input_file is read the second time, $2 and its count are printed (and the entry is deleted so each letter is printed only once).
In the second solution the same array A indexed by $2 is built; then in the END section the script goes through the array A and prints each index together with its value.
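For example, running the first solution on the sample Input_file keeps the letters in their original order:
$ awk 'FNR==NR{A[$2]++;next} A[$2]{print $2,A[$2];delete A[$2]}' Input_file Input_file
Z 2
F 2
A 1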
I would use sort | uniq for this purpose as these two utils are designed specifically for this kind of task:
cat <<END |
abcd Z
efdg Z
aqbs F
edf F
aasd A
END
awk '{print $2}' | sort -r | uniq -c | awk '{printf "%s %d\n", $2, $1}'
This would produce exactly the desired output:
Z 2
F 2
A 1
Here awk '{print $2}' is used to get the second column from a document with fields separated by one or more whitespace characters. If we knew the width of the columns is fixed, we could use a faster cut utility instead.
sort -r | uniq -c is doing the main algorithmic part of the task - sort the letters in reverse order and count the number of occurrences of each letter.
awk '{printf "%s %d\n", $2, $1}' does some reformatting of the uniq -c output to match the required format exactly.
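For reference, the intermediate output of awk '{print $2}' file | sort -r | uniq -c on the sample data looks something like this (the exact padding of the counts may vary):
      2 Z
      2 F
      1 A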
Update: AWK has powerful array support so this can be done with awk alone:
cat <<END |
abcd Z
efdg Z
aqbs F
edf F
aasd A
END
awk '{a[$2]++}
END {n=asorti(a,b,"@ind_str_desc");
for (k=1;k<=n;k++) {printf "%s %d\n", b[k], a[b[k]]} }'
We use the array a that is indexed with letters found in the input stream, and on each line the element indexed by the corresponding letter gets incremented.
In the END clause we sort the indices in descending order and output the array.
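Note that asorti with a sort-order argument such as "@ind_str_desc" is a GNU awk feature (gawk 4.0 or later). With gawk, the snippet above prints:
Z 2
F 2
A 1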
I have this data to sort. The 1st column is the item ID. The 2nd column is the numerical value. Some items do not have a numerical value.
03875334 -4.27
03860156 -7.27
03830332
19594535 7.87
01542392 -5.74
01481815 11.45
04213946 -10.06
03812865 -8.67
03831625
01552174 -9.28
13540266 -8.27
03927870 -7.25
00968327 -8.09
I want to use the Linux sort command to sort the items numerically in the ascending order of their value, but leave those empty items to the end. So, this is the expected output I want to obtain:
04213946 -10.06
01552174 -9.28
03812865 -8.67
13540266 -8.27
00968327 -8.09
03860156 -7.27
03927870 -7.25
01542392 -5.74
03875334 -4.27
19594535 7.87
01481815 11.45
03830332
03831625
I tried "sort -k2n" and "sort -k2g", but neither yielded the output I want. Any idea?
Here is a simple Schwartzian transform based on the assumption that all actual values are smaller than 123456789.
awk '{ printf "%s\t%s\n", ($2 == "" ? 123456789 : $2), $0 }' file |
sort -n | cut -f2- >output
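Before the sort, each line is decorated with its sort key (or the sentinel for an empty value), which for the first three input lines looks like this:
-4.27	03875334 -4.27
-7.27	03860156 -7.27
123456789	03830332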
Assuming the data is in d.txt and the blank entries have 4 spaces at the end
egrep " $" d.txt > blanks.txt ; egrep -v " $" d.txt | sort -n -k2 | cat - blanks.txt
This should work:
awk '$2 ~ /[0-9]$/' d.txt | sort -k2g && awk '$2 !~ /[0-9]$/' d.txt
I have a file that contains several lines of data. Some lines contain three columns, but most contain only two. All lines are single-tab separated. For those that contain three columns, the third column is typically redundant and contains the same data as the second so I'd like to remove it.
I imagine awk or cut would be appropriate, but I'm drawing a blank on how to test the row for three columns so my script will only work on those rows. I know awk is a very powerful language with logic and whatnot built into it, I'm just not that strong with it.
I looked at a similar question, but I'm not sure what is going on with the awk answer. Should the -4 be -1 since I only want to remove one column? What about if the row has two columns; will it remove the second even though I don't want to do anything?
I modified it to what I think it would be:
awk -F"\t" -v OFS="\t" '{ for (i=1;i<=NF-4;i++){ print $i }}'
But when I run it (with the file) nothing happens. If I change it to NF-1 or NF-2 I get some output, but it is only a handful of lines and only the first column.
Can anyone clue me into what I should be doing?
If you just want to remove the third column, you could just print the first and the second:
awk -F '\t' '{print $1 "\t" $2}'
And the equivalent with cut (whose default delimiter is already the tab used in the file):
cut -f 1,2
The awk variable NF gives you the number of fields. So an expression like this should work for you.
awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
Running it on an input file like so
a,b,c
x,y
u,v,w
l,m
gives me
$ cat test | awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
a,b
x,y
u,v
l,m
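Since the file in the question is tab-separated rather than comma-separated, the same approach with an explicit tab delimiter would look like this (an untested adaptation of the example above):
awk -F'\t' -v OFS='\t' 'NF == 3 {print $1, $2} NF != 3 {print $0}' file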
This might work for you (GNU sed):
sed 's/\t[^\t]*//2g' file
Restricts the file to two columns.
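The 2g flag makes the substitution start at the second match and continue to the end of the line, so every tab-delimited field after the second is deleted. A quick illustration (assuming GNU sed, which understands \t):
$ printf 'a\tb\tc\td\n' | sed 's/\t[^\t]*//2g'
a	b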
awk 'NF==3{print $1"\t"$2}NF==2{print}' your_file
Tested below:
> cat temp
1 2
3 4 5
6 7
8 9 10
>
> awk 'NF==3{print $1"\t"$2}NF==2{print}' temp
1 2
3 4
6 7
8 9
>
Or, in a much simpler way, in awk:
awk 'NF==3{print $1"\t"$2}NF==2' your_file
Or you can go with perl:
perl -lane 'print "$F[0]\t$F[1]"' your_file
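Note that -a splits on runs of whitespace by default; if your fields can themselves contain spaces, an explicit tab split keeps the columns intact (a sketch, assuming a strictly tab-delimited file):
perl -F'\t' -lane 'print "$F[0]\t$F[1]"' your_file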