How to merge column output to the end of a row in the previous column? - linux

I have a .csv file containing three columns and I need to merge the value of column 2 with the end of the row of column 1.
The .csv file contains thousands of rows and this needs to be done for each row.
I've tried using awk, but I'm finding it difficult to get the code right:
cat file.csv | awk '{print $1, $2}'
awk '{if ($2!= " ") {print $1+$2 }}'
These, of course, don't work.
Sample input:
The command used to produce the actual output is simply:
cat test.csv
[2,4,5,6,2,34,61,32,34,54,34, 22] 0.144354
[3,4,6,4,5,6,7,1,2,3,4,53,23, 34] 0.332453
[2,43,6,2,1,2,5,8,9,0,8,6,34, 21] 0.347643
Desired Output:
col1 col2
[2,4,5,6,2,34,61,32,34,54,34,22] 0.144354
[3,4,6,4,5,6,7,1,2,3,4,53,23,34] 0.332453
[2,43,6,2,1,2,5,8,9,0,8,6,34,21] 0.347643

Replace "comma followed by one or more spaces" with "comma":
sed 's/, \{1,\}/,/' file.csv
sed 's/, */,/g' file.csv
Print columns $1 and $2 as $1 (optionally separate with a tab):
awk '{print $1 $2, $3}' OFS='\t' file.csv

You can try:
awk '{printf("%s%s\t%s\n",$1,$2,$3)}' file.csv

I only see spaces after a comma when you don't want them.
$: sed -E 's/,\s+/,/' file.csv
[2,4,5,6,2,34,61,32,34,54,34,22] 0.144354
[3,4,6,4,5,6,7,1,2,3,4,53,23,34] 0.332453
[2,43,6,2,1,2,5,8,9,0,8,6,34,21] 0.347643
Add -i (after the -E) to make it an in-place edit.
$: sed -Ei 's/,\s+/,/' file.csv
$: cat file.csv
[2,4,5,6,2,34,61,32,34,54,34,22] 0.144354
[3,4,6,4,5,6,7,1,2,3,4,53,23,34] 0.332453
[2,43,6,2,1,2,5,8,9,0,8,6,34,21] 0.347643
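
If the list might contain more than one stray space, another option is to glue every field except the last back together. This is only a sketch: it assumes the last whitespace-separated field is always the numeric column, and the sample rows are made up for illustration.

```shell
# Hypothetical sample rows; the second one has two stray spaces inside the list.
input='[2,4,5, 22] 0.144354
[3,4, 6,4, 34] 0.332453'

# Join fields 1..NF-1 with no separator, then print the last field after a space.
result=$(printf '%s\n' "$input" | awk '{
    list = ""
    for (i = 1; i < NF; i++) list = list $i
    print list, $NF
}')

printf '%s\n' "$result"
```

Because awk splits on any run of whitespace by default, this does not depend on where in the list the spaces appear.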

Related

How do I compare two files in unix based on their columns

I am fairly new to Unix commands, but I have two .csv files and would like to compare their first columns with either diff or comm. Every line is different if I compare whole lines, which is why I want to compare only the first column of each file and have the result printed as a count, where each country code (land code) is counted no more than once. The first file also has a header that I want to skip in the comparison.
sample from file1:
iso_code,continent,location,date,total_cases
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,2.0
AFG,Oslo,Norway,2020-09-06,324.0
AZE,Hamburg,Germany,2020-03-30,29.0
sample from file2:
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,5.0
ABW,Chil Ukrain,Aruba,2020-10-06,4449.0
ALB,Upsala,Sweden,2020-08-275.0,
AFG,Afghanistan,,2020-09-06,324.0
The expected output should be "2", as there are two occurrences of the same country code in the two files. Duplicates of a country code should only be counted once, which is why the expected output is 2 and not 3.
I have tried multiple solutions:
awk 'NR==FNR{c[$1]++;next};c[$1] == 0' owid-covid-data-filtered.csv owid-covid-data.csv | wc -l
With the awk command I get the output: 1
and
diff owid-covid-data.csv owid-covid-data-filtered.csv |cut -d' ' -f1 owid-covid-data-filtered.csv| wc -l
Overall, I want the codes in column 1 that occur in both file1 and file2.
From the condition c[$1] == 0 in the awk script from the question I assumed you want to print lines from file2 that contain a code that is not present in file1.
As it is clarified now, that you want to count the codes that are present in both files, see below at the end of the answer for the reverse check.
Slight modifications to your script will fix the problems:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1]++ == 0' file1 file2
Option -F , specifies comma (,) as field separator.
The condition if(NR!=1)c[$1]++; skips the header line in file1.
The post-increment operator in c[$1]++ == 0 will make the condition fail for the second or later occurrence of the same code in file2.
I omit the trailing | wc -l here to show the output lines.
I modified file2 to contain two lines with the same code in column 1 that is not present in file1.
With file2 shown here
AND,Europe,Andorra,2020-07-26,897.0
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
ALB,Europe,Albania,2020-08-23,8275.1
ALB,Europe,Albania,2020-08-23,8275.2
AFG,Asia,Afghanistan,2020-09-06,38324.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
and file1 from the question I get this output:
AND,Europe,Andorra,2020-07-26,897.0
ALB,Europe,Albania,2020-08-23,8275.1
(Only the first line with ALB is printed.)
You can also implement the counting in awk instead of using wc -l.
awk -F , 'NR==FNR { if(NR!=1)c[$1]++; next } c[$1]++ == 0 {count++} END {print count}' file1 file2
If you want to print the lines from file2 that contain a code that is present in file1, the script can be modified like this:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1] { c[$1]=0; print}' file1 file2
This prints
ABW,North America,Aruba,2020-03-13,2.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
(The first line with code ABW.)
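
To get the count of distinct codes that occur in both files directly from awk, the same idea can be sketched as follows. The two sample files and the array name `seen` are made up for illustration; the sketch assumes that only file1 has a header line.

```shell
dir=$(mktemp -d)
# Hypothetical sample data; file1 has a header line, file2 does not.
printf '%s\n' 'iso_code,location' 'AND,Andorra' 'ABW,Aruba' 'AZE,Azerbaijan' > "$dir/file1"
printf '%s\n' 'ABW,Aruba' 'ABW,Aruba' 'ALB,Albania' 'AND,Andorra' > "$dir/file2"

count=$(awk -F, '
    NR == FNR { if (FNR > 1) seen[$1] = 1; next }   # first file: remember codes, skip header
    seen[$1] == 1 { seen[$1] = 2; n++ }             # second file: count each shared code once
    END { print n + 0 }                             # n + 0 prints 0 when nothing matched
' "$dir/file1" "$dir/file2")

echo "$count"
rm -rf "$dir"
```

Here ABW and AND occur in both files, and the repeated ABW line is counted only once, so the sketch prints 2.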
Alternative solution as requested in a comment.
tail -n +2 file1|cut -f1 -d,|sort -u>code1
cut -f1 -d, file2|sort -u>code2
fgrep -f code1 code2
rm code1 code2
Or combined in one command without using temporary files code1 and code2:
fgrep -f <(tail -n +2 file1|cut -f1 -d,|sort -u) <(cut -f1 -d, file2|sort -u)
Add | wc -l to count the lines instead of printing them.
Explanation:
tail -n +2: print everything starting from the 2nd line
cut -f1 -d,: print the first field, delimited with ,
sort -u: sort lines and remove duplicates
fgrep -f code1 code2: print all lines from code2 that contain any of the strings from code1
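
Since the question mentions comm: once both code lists are sorted and de-duplicated, comm -12 prints only the lines that appear in both. Unlike fgrep, it compares whole lines rather than substrings. A minimal sketch with made-up sample files:

```shell
dir=$(mktemp -d)
# Hypothetical sample data; file1 has a header line, file2 does not.
printf '%s\n' 'iso_code,location' 'AND,Andorra' 'ABW,Aruba' 'AZE,Azerbaijan' > "$dir/file1"
printf '%s\n' 'ABW,Aruba' 'ABW,Aruba' 'ALB,Albania' 'AND,Andorra' > "$dir/file2"

# comm needs sorted input, which sort -u provides.
tail -n +2 "$dir/file1" | cut -d, -f1 | sort -u > "$dir/code1"
cut -d, -f1 "$dir/file2" | sort -u > "$dir/code2"

# -1 and -2 suppress the lines unique to each input, leaving the common codes.
common=$(comm -12 "$dir/code1" "$dir/code2")

printf '%s\n' "$common"
rm -rf "$dir"
```

Piping the comm output into wc -l gives the count instead of the list.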
To print the lines of file2 whose column 1 value also occurs in file1:
$ awk -F, 'NR==FNR{a[$1];next}$1 in a' file1 file2
Output:
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
AFG,Asia,Afghanistan,2020-09-06,38324.0

Using awk with variable

I am using this awk command to extract three columns from a text file.
awk 'BEGIN {FS="\t";OFS=","}; {print $1,$3,$10}' $FILENAME > $OUTPUT
I wish to specify the column numbers as a variable separately so it will be easier to modify in the future like this:
COLUMNS=$1,$3,$10
awk 'BEGIN {FS="\t";OFS=","}; {print $COLUMNS}' $FILENAME > $OUTPUT
However it pulls all columns into the output, not only the 3 I specified. How do I do this properly?
Like this?
$ more file
a,b,c,d,e
1,2,3,4,5
$ a='$1,$2,$NF'
$ awk -F, "{print $a}" file
a b e
1 2 5
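
Some background on why the original attempt prints everything: inside the single-quoted awk program, `$COLUMNS` is not the shell variable. awk sees its own unset variable COLUMNS, which evaluates to 0, so `$COLUMNS` means `$0`, the whole line. If you would rather not splice shell text into the awk program, one alternative is to pass the column numbers with -v and build the output in a loop. This is only a sketch; the names `cols` and `want` are made up for illustration.

```shell
cols='1,2,5'   # hypothetical column list; edit this to change the selection

result=$(printf '%s\n' 'a,b,c,d,e' '1,2,3,4,5' |
    awk -F, -v OFS=, -v cols="$cols" '
    BEGIN { n = split(cols, want, ",") }    # want[1..n] holds the column numbers
    {
        line = $(want[1])
        for (i = 2; i <= n; i++) line = line OFS $(want[i])
        print line
    }')

printf '%s\n' "$result"
```

With this input the sketch prints a,b,e and 1,2,5. The awk program stays in single quotes, so the shell never touches the `$` signs.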

LINUX: Using cat to remove columns in CSV - some have commas in the data

I need to remove some columns from a CSV. Easy.
The problem is I have two columns with full text that actually has commas in them as a part of the data. My cols are enclosed with quotes and the cat is counting the commas in the text as columns. How can I do this so the commas enclosed with quotes are ignored?
example:
"first", "last", "dob", "some long sentence, it has commas in it,", "some data", "foo"
I want to print only columns 1-4 and 6.
You will save yourself a lot of aggravation by writing a short Perl script that uses Parse::CSV http://metacpan.org/pod/Parse::CSV
I am sure there is a Python way of doing this too.
cat file | sed -e 's|^"||;s|"$||' | awk 'BEGIN {FS="[\"], ?[\"]"}{print $2}'
Example:
http://ideone.com/g2gZmx
How it works:
Look at line:
"a,b","c,d","e,f"
We know that each field is surrounded by "". So we can split this line on ",":
cat file | awk 'BEGIN {FS="[\"], ?[\"]"}{print $2}'
and the fields will be:
"a,b c,d e,f"
But we still have an annoying " at the start and end of the line, so we remove them with sed:
cat file | sed -e 's|^"||;s|"$||' | awk 'BEGIN {FS="[\"], ?[\"]"}{print $2}'
And the fields will be
a,b c,d e,f
Then we can simply take the second field with awk '{print $2}'.
Read about regexp field splitting in awk: http://www.gnu.org/software/gawk/manual/html_node/Regexp-Field-Splitting.html
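
Applied to the example line from the question, the same technique can be sketched like this to keep columns 1-4 and 6 and put the quotes back. It assumes every field is quoted and that no field contains the sequence quote, comma, quote; for anything less regular, a real CSV parser is the safer choice.

```shell
line='"first", "last", "dob", "some long sentence, it has commas in it,", "some data", "foo"'

result=$(printf '%s\n' "$line" |
    sed -e 's/^"//' -e 's/"$//' |
    awk 'BEGIN { FS = "\", *\""; OFS = "\",\"" }
         { print "\"" $1, $2, $3, $4, $6 "\"" }')

printf '%s\n' "$result"
```

Setting OFS to "," (including the quotes) re-quotes the inner field boundaries, and the outer quotes are added back at both ends of the line.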

Removing last column from rows that have three columns using bash

I have a file that contains several lines of data. Some lines contain three columns, but most contain only two. All lines are single-tab separated. For those that contain three columns, the third column is typically redundant and contains the same data as the second so I'd like to remove it.
I imagine awk or cut would be appropriate, but I'm drawing a blank on how to test the row for three columns so my script will only work on those rows. I know awk is a very powerful language with logic and whatnot built into it, I'm just not that strong with it.
I looked at a similar question, but I'm not sure what is going on with the awk answer. Should the -4 be -1 since I only want to remove one column? What about if the row has two columns; will it remove the second even though I don't want to do anything?
I modified it to what I think it would be:
awk -F"\t" -v OFS="\t" '{ for (i=1;i<=NF-4;i++){ print $i }}'
But when I run it (with the file) nothing happens. If I change it to NF-1 or NF-2 I get some output, but it is only a handful of lines and only the first column.
Can anyone clue me into what I should be doing?
If you just want to remove the third column, you could just print the first and the second:
awk -F '\t' '{print $1 "\t" $2}'
And it's similar to cut:
cut -f 1,2
The awk variable NF gives you the number of fields. So an expression like this should work for you.
awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
Running it on an input file like so
a,b,c
x,y
u,v,w
l,m
gives me
$ cat test | awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
a,b
x,y
u,v
l,m
This might work for you (GNU sed):
sed 's/\t[^\t]*//2g' file
Restricts the file to two columns.
awk 'NF==3{print $1"\t"$2}NF==2{print}' your_file
Tested below:
> cat temp
1 2
3 4 5
6 7
8 9 10
>
> awk 'NF==3{print $1"\t"$2}NF==2{print}' temp
1 2
3 4
6 7
8 9
>
or, more simply, in awk:
awk 'NF==3{print $1"\t"$2}NF==2' your_file
Or you can also go with perl:
perl -lane 'print "$F[0]\t$F[1]"' your_file
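
Since the question says the file is single-tab separated, here is a sketch that combines the NF test with explicit tab separators, so fields containing spaces stay intact. The sample rows are made up for illustration.

```shell
# Hypothetical tab-separated sample: two rows with three columns, two with two.
input=$(printf 'a\tb\tb\nx\ty\nu\tv\tv\nl\tm\n')

result=$(printf '%s\n' "$input" | awk -F'\t' -v OFS='\t' '
    NF == 3 { $0 = $1 OFS $2 }   # rebuild three-column rows from their first two fields
    { print }                    # print every row, changed or not
')

printf '%s\n' "$result"
```

Rows with two columns pass through unchanged, which answers the concern about accidentally removing the second column.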

output the 2nd column of a file

Given a file with two columns, separated by standard whitespace
a b
c d
f g
h
how do I output the second column?
cut -d' ' -f2
awk '{print $2}'
Because the last line of your example data has no first column you'll have to parse it as fixed width columns:
awk 'BEGIN {FIELDWIDTHS = "2 1"} {print $2}'
Use cut with byte offsets:
cut -b 3
Use sed to remove the leading column:
sed s/..//
cut -c2 listdir