Awk to remove and move rows/records to another file where 3rd field value is empty - linux

I have below pipe separated pipe.
I want to remove the rows where 3rd field is blank in file1 and need to paste those removed line from File1 into another File(File2).
I tried the below code it is working fine and removing the rows for all three column where 3rd field is blank but not able to figure out how to paste those removed line in another file along with the below code.
So need to know single code statement to remove rows where 3rd column value is empty from File1 and paste those removed rows into another File(like File2)
awk -F"|" -v OFS"|" '$3!=""' File1.txt > test.txt
File1
billingtype|documentnumber|originaldocumentnumber
YMNC|420075416|765467
YMNC|429842808|74646464
YPBC|429842809
INV|430071605|7688888
YPBC|430071609
Output File
billingtype|documentnumber|originaldocumentnumber
YMNC|420075416|765467
YMNC|429842808|74646464
INV|430071605|7688888
File2
billingtype|documentnumber|originaldocumentnumber
YPBC|429842809
YPBC|430071609

$ awk 'BEGIN {FS=OFS="|"}
FNR==1 {print > "File2"}
{if($3!="") print;
else print > "File2"}' File1 > tmp && mv tmp File1
header will be printed in both files. Output to a temp file and move over to the input file. Missing field records will be printed to other file.

You can try this gnu sed
sed -i -Ee '1{WFile2' -e 'b' -e '}' -e '/(.*\|){2}/!{W File2' -e 'd}' File1

Related

How do I compare two files in unix based on their columns

I am fairly new to unix commands, but i have two .csv files where i would like to compare the first column either with diff or comm. Every line is different, if i were to compare the whole line, thats why i want to compare the first column in each file and then have the difference printed out in numbers where the landcode sould not be counted more than once. The first file has also has a header i want to skip when it compares.
sample from file1:
iso_code,continent,location,date,total_cases
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,2.0
AFG,Oslo,Norway,2020-09-06,324.0
AZE,Hamburg,Germany,2020-03-30,29.0
sample from file2:
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,5.0
ABW,Chil Ukrain,Aruba,2020-10-06,4449.0
ALB,Upsala,Sweden,2020-08-275.0,
AFG,Afghanistan,,2020-09-06,324.0
The expected output should be "2", as there are two occurrences of the same land code in the two files. Duplicates of the contry code sould only be counted one time. That is why expected out should be 2 and not 3
I have tried multiple solutions:
awk 'NR==FNR{c[$1]++;next};c[$1] == 0' owid-covid-data-filtered.csv owid-covid-data.csv | wc -l
with the awk i get output: 1
and
diff owid-covid-data.csv owid-covid-data-filtered.csv |cut -d' ' -f1 owid-covid-data-filtered.csv| wc -l
overall i want the occurrences that are similar in both file1 and file2 column 1
From the condition c[$1] == 0 in the awk script from the question I assumed you want to print lines from file2 that contain a code that is not present in file1.
As it is clarified now, that you want to count the codes that are present in both files, see below at the end of the answer for the reverse check.
Slight modifications to your script will fix the problems:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1]++ == 0' file1 file2
Option -F , specifies comma (,) as field separator.
The condition if(NR!=1)c[$1]++; skips the header line in file1.
The post-increment operator in c[$1]++ == 0 will make the condition fail for the second or later occurrence of the same code in file2.
I omit the trailing | wc -l here to show the output lines.
I modified file2 to contain two lines with the same code in column 1 that is not present in file1.
With file2 shown here
AND,Europe,Andorra,2020-07-26,897.0
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
ALB,Europe,Albania,2020-08-23,8275.1
ALB,Europe,Albania,2020-08-23,8275.2
AFG,Asia,Afghanistan,2020-09-06,38324.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
and file1 from the question I get this output:
AND,Europe,Andorra,2020-07-26,897.0
ALB,Europe,Albania,2020-08-23,8275.1
(Only the first line with ALB is printed`.)
You can also implemente the counting in awk instead of using wc -l.
awk -F , 'NR==FNR { if(NR!=1)c[$1]++; next } c[$1]++ == 0 {count++} END {print count}' file1 file2
If you want to print the lines from file2 that contain a code that is present in file1, the script can be modified like this:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1] { c[$1]=0; print}' file1 file2
This prints
ABW,North America,Aruba,2020-03-13,2.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
(The first line with code ABW.)
Alternative solution as requested in a comment.
tail -n +2 file1|cut -f1 -d,|sort -u>code1
cut -f1 -d, file2|sort -u>code2
fgrep -vf code1 code2
rm code1 code2
Or combined in one command without using temporary files code1 and code2:
fgrep -f <(tail -n +2 file1|cut -f1 -d,|sort -u) <(cut -f1 -d, file2|sort -u)
Add | wc -l to count the lines instead of printing them.
Explanation:
tail -n +2 print everything starting from the 2nd line
cut -f1 -d, print the first field, delimited with ,
sort -u sort lines and remove duplicates
fgrep -f code1 code2 print all lines from code2 that contain any of the strings from code1
occurrences that are similar in both file1 and file2 column 1:
$ awk -F, 'NR==FNR{a[$1];next}$1 in a' file1 file2
Output:
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
AFG,Asia,Afghanistan,2020-09-06,38324.0

How to merge column output to the end of a row in the previous column?

I have a .csv file containing three columns and I need to merge the value of column 2 with the end of the row of column 1.
The .csv file contains thousands of rows and this needs to be done for each row.
Iv'e tried using awk but I'm finding it difficult to get the code correct
cat file.csv | awk '{print $1, $2}'
awk '{if ($2!= " ") {print $1+$2 }}'
These of course don't work
Sample input:
The command used to produce the actual output is simply:
cat test.csv
[2,4,5,6,2,34,61,32,34,54,34, 22] 0.144354
[3,4,6,4,5,6,7,1,2,3,4,53,23, 34] 0.332453
[2,43,6,2,1,2,5,8,9,0,8,6,34, 21] 0.347643
Desired Output:
col1 col2
[2,4,5,6,2,34,61,32,34,54,34,22] 0.144354
[3,4,6,4,5,6,7,1,2,3,4,53,23,34] 0.332453
[2,43,6,2,1,2,5,8,9,0,8,6,34,21] 0.347643
Replace "comma followed by one or more spaces" with "comma":
sed 's/, \{1,\}/,/' file.csv
sed 's/, */,/g' file.csv
Print columns $1 and $2 as $1 (optionally separate with a tab):
awk '{print $1 $2, $3}' OFS='\t' file.csv
You can try:
awk '{printf("%s%s\t%s\n",$1,$2,$3)}' file.cvs
I only see spaces after a comma when you don't want them.
$: sed -E 's/,\s+/,/' file.csv
[2,4,5,6,2,34,61,32,34,54,34,22] 0.144354
[3,4,6,4,5,6,7,1,2,3,4,53,23,34] 0.332453
[2,43,6,2,1,2,5,8,9,0,8,6,34,21] 0.347643
Add -i (after the -E) to make it an in-place edit.
$: sed -Ei 's/,\s+/,/' file.csv
$: cat file.csv
[2,4,5,6,2,34,61,32,34,54,34,22] 0.144354
[3,4,6,4,5,6,7,1,2,3,4,53,23,34] 0.332453
[2,43,6,2,1,2,5,8,9,0,8,6,34,21] 0.347643

How to cut column data from flat file

I've data in format below;
111,Ja,M,Oes,2012-08-03 16:42:00,x,xz
112,Ln,d,D,Gn,2012-08-03 16:51:00,y,yx
I need to create files with data in the sequence below:
111,x,xz
112,y,yz
In output format, we've first value before comma and last two comma prefix values. Here we can have any number of commas in-between.
Kindly advise, how can generate required output file from input file in Linux machine.
The Awk statement for this is pretty straight-forward. Set the input and output field separators and print the fields using $1..$NF, where $NF is the value of the last column,
awk 'BEGIN{FS=OFS=","}{print $1,$(NF-1),$NF}' input.csv > newfile.csv
Not much to this one in awk:
awk -F"," 'BEGIN{OFS=","}{print $1,$(NF-1), $NF}' inFile > outFile
We split the lines in awk with a comma -F"," and then print the first field $1, the second to last field $(NF-1), and the last field $NF.
NF is the "Number of fields" so subtracting 1 from it will give you the second to last item.
with sed
$ sed -r 's/([^,]+).*(,[^,]+,[^,]+)/\1\2/' file
111,x,xz
112,y,yx
or
$ sed -r 's/([^,]+).*((,[^,]+){2})/\1\2/' file
awk '{print substr($1,1,4) substr($2,10,4)}' file
111,x,xz
112,y,yx

How to compare and replace fields using awk

I have a two files. If Field-9 of File-1 and Field-1 of File-2 is same then replace Field-1 of File-1 with Field-2 of File-1
file1:
12345||||||756432101000||756432||||
aaaaa||||||986754812345||986754||||
ccccc||||||134567222222||134567||||
file2:
756432|AAAAAAAAAAA
986754|20030040000
The expected output is:
12345||||||AAAAAAAAAAA||756432||||
aaaaa||||||20030040000||986754||||
ccccc||||||134567222222||134567|||
I tried this code
awk -F"|" 'NR==FNR{a[$1]=$2} NR>FNR{$7=a[$2];print}' OFS='|' file2 file1
but instead of replacing the field, it gets deleted.
You are using the wrong column as the index of the array in the second block, and you are not checking for missing keys. This produces the output you posted:
awk -F '|' -v OFS='|' 'NR==FNR{a[$1]=$2;next}$9 in a{$7=a[$9]}1' file2 file1

why does grep produce this behavior with these files

I have two files in this format, although they have the same amount of columns, the actual files have significantly more rows.
file1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
file2
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Both files have the same amount of columns and rows, and are formatted exactly the same, to my knowledge. None of the files have an extension, so no .csv or .txt it's simply the file name. I wanted to compare each row in the files and output the count of the matching rows. So look at row 1 in file1 and then row 1 in file2 and if the match that gets a one and so forth.
To do this I've done grep -c -f file1 file2 in the past however, this time when I try to do it, grep never stops running, and it takes up an entire core. Now when I do grep -w -c -f file1 file2 I do get back a count of the matching lines in the files, however what is strange is if I change the order so if I do grep -w -c -f file2 file1 I get back a different number.
The actual files each have 2983 rows and 61 columns. For those files if I do grep -w -c -f file1 file2 I get back 2021 but if I do grep -w -c -f file2 file1 I get back 1950
Also if I do diff file1 file2 I get this output at the end of the other output \ No newline at end of file but if I do diff -w file1 file2 the no newline message is no longer returned.
If you want to find the common lines, use comm or grep -Fx
comm -12 file1 file2
grep -Fx -f file1 file2
You comment: "it worked the way I want it to. by adding a column with numbers that correspond to the line number"
So, you only want to match if line number N in file1 is the same as line number N in file2:
awk 'NR==FNR {line[FNR] = $0; next} $0 == line[FNR]' file1 file2
or with bash (this will likely be slower, but consume less memory):
while IFS= read -u3 -r line1; IFS= read -u4 -r line2; do
[ "$line1" = "$line2" ] && echo "$line1"
done 3<file1 4<file2
There is an excellent tool called diff designed for exactly this purpose:
diff -c file1 file2 #Difference
comm -1 -2 file1 file2 #Common
This will tell you matching lines, and also exactly specify where lines differ.

Resources