Get a line from a file and add it as a column in another file - Linux

I have two files.
File A:
Unix
File B:
A,B
C,D
E,f
.,.
.,.
N,N
Expected Output:
A,B,Unix
C,D,Unix
E,f,Unix
.,.,Unix
.,.,Unix
N,N,Unix
How is this possible in a Linux shell script?

➜ cat a
A,B
C,D
E,f
.,.
.,.
N,N
➜ cat f2
Unix
➜ awk 'BEGIN{getline f2<"f2"}; {printf("%s,%s\n",$0,f2);}' a
A,B,Unix
C,D,Unix
E,f,Unix
.,.,Unix
.,.,Unix
N,N,Unix

Assuming fileA contains only 1 word, it's better to pass it to awk as a parameter.
awk -v v="Unix" 'BEGIN{FS=OFS=","}{$(NF+1)=v}1' fileB
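If the word should not be hardcoded, the same command can pick it up from fileA via command substitution (a sketch, still assuming fileA holds a single word):
awk -v v="$(cat fileA)" 'BEGIN{FS=OFS=","}{$(NF+1)=v}1' fileB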
If fileA contains more words, assuming there is only one per line, you could also use this:
awk 'BEGIN{FS=OFS=","}NR==FNR{a[++i]=$1;next} {for(j=1; j<=i; j++) $(NF+1)=a[j]}1' fileA fileB
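For instance, if fileA held the two lines Unix and Linux (an assumed example), every row of fileB would gain both columns:
A,B,Unix,Linux
C,D,Unix,Linux
and so on for the remaining rows.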

And then there is good old paste as an option:
$ cat file1
UNIX
$ cat file2
A,B
C,D
E,F
$ paste -d',' file2 <(yes `cat file1` | head -n $(cat file2 | wc -l))
A,B,UNIX
C,D,UNIX
E,F,UNIX
The tricky part here is that the number of rows differs between file1 and file2, so we need to repeat the UNIX row of file1 as often as necessary (i.e., as many times as there are rows in file2) to be able to use paste.
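If file1 really holds just that single line, sed can achieve the same without counting rows, by appending the word to every line of file2 (a sketch via command substitution; it assumes the word contains no sed metacharacters):
sed "s/\$/,$(cat file1)/" file2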

Related

How do I compare two files in unix based on their columns

I am fairly new to Unix commands, but I have two .csv files where I would like to compare the first column, either with diff or comm. Every line is different if I were to compare the whole line; that's why I want to compare the first column in each file and then have the difference printed out as a number, where the land code should not be counted more than once. The first file also has a header I want to skip during the comparison.
sample from file1:
iso_code,continent,location,date,total_cases
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,2.0
AFG,Oslo,Norway,2020-09-06,324.0
AZE,Hamburg,Germany,2020-03-30,29.0
sample from file2:
AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,5.0
ABW,Chil Ukrain,Aruba,2020-10-06,4449.0
ALB,Upsala,Sweden,2020-08-275.0,
AFG,Afghanistan,,2020-09-06,324.0
The expected output should be "2", as there are two occurrences of the same land code in the two files. Duplicates of the country code should only be counted one time. That is why the expected output should be 2 and not 3.
I have tried multiple solutions:
awk 'NR==FNR{c[$1]++;next};c[$1] == 0' owid-covid-data-filtered.csv owid-covid-data.csv | wc -l
With the awk command I get the output: 1
and
diff owid-covid-data.csv owid-covid-data-filtered.csv |cut -d' ' -f1 owid-covid-data-filtered.csv| wc -l
Overall, I want the occurrences that are similar in both file1 and file2 column 1.
From the condition c[$1] == 0 in the awk script from the question, I assumed you wanted to print lines from file2 that contain a code that is not present in file1.
As it is now clarified that you want to count the codes that are present in both files, see the end of this answer for the reverse check.
Slight modifications to your script will fix the problems:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1]++ == 0' file1 file2
Option -F , specifies comma (,) as field separator.
The if(NR!=1) guard in if(NR!=1)c[$1]++; skips the header line in file1 when recording codes.
The post-increment operator in c[$1]++ == 0 will make the condition fail for the second or later occurrence of the same code in file2.
I omit the trailing | wc -l here to show the output lines.
I modified file2 to contain two lines with the same code in column 1 that is not present in file1.
With file2 shown here
AND,Europe,Andorra,2020-07-26,897.0
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
ALB,Europe,Albania,2020-08-23,8275.1
ALB,Europe,Albania,2020-08-23,8275.2
AFG,Asia,Afghanistan,2020-09-06,38324.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
and file1 from the question I get this output:
AND,Europe,Andorra,2020-07-26,897.0
ALB,Europe,Albania,2020-08-23,8275.1
(Only the first line with ALB is printed.)
You can also implement the counting in awk instead of using wc -l.
awk -F , 'NR==FNR { if(NR!=1)c[$1]++; next } c[$1]++ == 0 {count++} END {print count}' file1 file2
If you want to print the lines from file2 that contain a code that is present in file1, the script can be modified like this:
awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1] { c[$1]=0; print}' file1 file2
This prints
ABW,North America,Aruba,2020-03-13,2.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
(Only the first of the two lines with code ABW is printed.)
An alternative solution, as requested in a comment:
tail -n +2 file1|cut -f1 -d,|sort -u>code1
cut -f1 -d, file2|sort -u>code2
fgrep -f code1 code2
rm code1 code2
Or combined in one command without using temporary files code1 and code2:
fgrep -f <(tail -n +2 file1|cut -f1 -d,|sort -u) <(cut -f1 -d, file2|sort -u)
Add | wc -l to count the lines instead of printing them.
Explanation:
tail -n +2 print everything starting from the 2nd line
cut -f1 -d, print the first field, delimited with ,
sort -u sort lines and remove duplicates
fgrep -f code1 code2 print all lines from code2 that contain any of the strings from code1
occurrences that are similar in both file1 and file2 column 1:
$ awk -F, 'NR==FNR{a[$1];next}$1 in a' file1 file2
Output:
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
AFG,Asia,Afghanistan,2020-09-06,38324.0
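If only the count of distinct common codes is wanted rather than the matching lines, the same membership test can feed a dedup counter; a sketch that also skips the header of file1:
awk -F, 'NR==FNR{if(FNR>1)a[$1];next} $1 in a && !seen[$1]++{n++} END{print n+0}' file1 file2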

combine two csv files based on common column using awk or sed [duplicate]

I have two CSV files which share a common column, along with duplicates in one file. How can I merge both CSV files using awk or sed?
CSV file 1
5/1/20,user,mark,Type1 445566
5/2/20,user,ally,Type1 445577
5/1/20,user,joe,Type1 445588
5/2/20,user,chris,Type1 445566
CSV file 2
Type1 445566,Name XYZ11
Type1 445577,Name AAA22
Type1 445588,Name BBB33
Type1 445566,Name XYZ11
What I want is:
5/1/20,user,mark,Type1 445566,Name XYZ11
5/2/20,user,ally,Type1 445577,Name AAA22
5/1/20,user,joe,Type1 445588,Name BBB33
5/2/20,user,chris,Type1 445566,Name XYZ11
So is there a bash command in Linux/Unix to achieve this? Can we do this using awk or sed?
Basically, I need to match column 4 of CSV file 1 with column 1 of CSV file 2 and merge both CSVs.
Tried the following command:
paste -d, <(cut -d, -f 1-2 ./test1.csv | sed 's/$/,Type1/') test2.csv
Got this result:
5/1/20,user,Type1,Type1 445566,Name XYZ11
If you are able to install the join utility, this command works:
join -t, -o 1.1 1.2 1.3 2.1 2.2 -1 4 -2 1 file1.csv file2.csv
Explanation:
-t, identify the field separator as comma (',')
-o 1.1 1.2 1.3 2.1 2.2 format the output to be "file1col1, file1col2, file1col3, file2col1, file2col2"
-1 4 join by column 4 in file1
-2 1 join by column 1 in file2
For additional usage information for join, reference the join manpage.
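One caveat: join requires both inputs to be sorted on their join fields. If they are not, they can be sorted on the fly with process substitution (a sketch, assuming bash and the file names above):
join -t, -o 1.1 1.2 1.3 2.1 2.2 -1 4 -2 1 <(sort -t, -k4,4 file1.csv) <(sort -t, -k1,1 file2.csv)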
Edit: You specifically asked for the solution using awk or sed so here is the awk implementation:
awk -F"," 'NR==FNR {a[$1] = $2; next} {print $1","$2","$3","$4"," a[$4]}' \
file2.csv \
file1.csv
Explanation:
-F"," Delimit by the comma character
NR==FNR Read the first file argument (notice in the above solution that we're passing file2 first)
{a[$1] = $2; next} In the current file, save the contents of Column2 in an array that uses Column1 as the key
{print $1","$2","$3","$4"," a[$4]} Read file1 and using Column4, match the value to the key's value from the array. Print Column1, Column2, Column3, Column4, and the key's value.
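One assumption baked into this script is that every column-4 value of file1 has a match in file2; unmatched rows would simply get an empty last field. A variant (a sketch) that prints only the rows that do match:
awk -F, 'NR==FNR {a[$1] = $2; next} $4 in a {print $0 "," a[$4]}' file2.csv file1.csv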
The two example input files seem to be already appropriately sorted, so you just have to put them side by side, and paste is good for this; however you want to remove some ,-separated columns from file1, and you can use cut for that; but you also want to insert another (constant) column, and sed can do it. A possible command is this:
paste -d, <(cut -d, -f 1-2 file1 | sed 's/$/,abcd/') file2
Actually sed can do the whole processing of file1, and the output can be piped into paste, which uses - to capture it from the standard input:
sed -E 's/^(([^,]+,){2}).*/\1abcd/' file1 | paste -d, - file2

Linux - Delete lines from file 1 in file 2 BIG DATA

I have two files:
file1:
a
b
c
d
file2:
a
b
f
c
d
e
output file (file2) should be:
f
e
I want the lines of file1 to be deleted directly in file2; the output should not be a new file but should replace file2 in place. Of course, a temp file can be created along the way.
The real file2 contains more than 300,000 lines. That is the reason why a solution like:
comm -13 file1 file2
doesn't work.
comm needs the input files to be sorted. You can use process substitution for that:
#!/bin/bash
comm -13 <(sort file1) <(sort file2) > tmp_file
mv tmp_file file2
Output:
e
f
Alternatively, if you have enough memory, you can use the following awk command which does not need the input to be sorted:
awk 'NR==FNR{a[$0];next} !($0 in a)' file1 file2
Output (original order of file2 preserved):
f
e
Keep in mind that the size of the array a directly depends on the size of file1.
PS: grep -vFf file1 file2 can also be used and the memory requirements are the same as for the awk solution. Given that, I would probably just use grep.
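Since the result should end up in file2 itself, the grep variant can be finished the same way as the comm script above, via a temp file (a sketch; -x is added so only whole-line matches are removed):
grep -vFxf file1 file2 > tmp_file && mv tmp_file file2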

Extract lines from File2 already found in File1

Using the Linux command line, I need to output the lines from text file2 that are already found in file1.
File1:
C
A
G
E
B
D
H
F
File2:
N
I
H
J
K
M
D
L
A
Output:
A
D
H
Thanks!
You are looking for the tool 'grep'.
Check this out.
Let's say you have inputs in the file1 & file2 files.
grep -f file1 file2
will return you
H
D
A
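Note that plain grep -f treats each line of file1 as a regular expression and also matches substrings. For exact whole-line matching, the stricter variant is:
grep -Fxf file1 file2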
A more flexible tool to use would be awk
awk 'NR==FNR{lines[$0]++; next} $1 in lines'
Example
$ awk 'NR==FNR{lines[$0]++; next} $1 in lines' file1 file2
H
D
A
What does it do?
NR==FNR{lines[$0]++; next}
NR==FNR checks whether the overall record number equals the record number within the current file. This is true only while reading the first file, file1.
lines[$0]++ Here we create an associative array indexed by the line, $0, of file1.
$0 in lines This runs only for the second file, because of the next in the previous action. It checks whether the line from file2 is present in the saved array lines; if yes, the default action of printing the entire line is taken.
awk is more flexible than grep, as you can match any column of file1 against any column of file2 and choose to print any column, rather than printing the entire line.
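For instance, to match column 1 of file1 against column 2 of file2 and print only column 3 of the matching lines, a sketch (column numbers chosen purely for illustration):
awk 'NR==FNR{keys[$1];next} $2 in keys{print $3}' file1 file2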
This is what the comm utility does, but you have to sort the files first. To get the lines in common between the two files:
comm -12 <(sort File1) <(sort File2)

extracting unique values between 2 sets/files

Working in linux/shell env, how can I accomplish the following:
text file 1 contains:
1
2
3
4
5
text file 2 contains:
6
7
1
2
3
4
I need to extract the entries in file 2 which are not in file 1. So '6' and '7' in this example.
How do I do this from the command line?
many thanks!
$ awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
6
7
Explanation of how the code works:
If we're working on file1, track each line of text we see.
If we're working on file2, and have not seen the line text, then print it.
Explanation of details:
FNR is the current file's record number
NR is the current overall record number from all input files
FNR==NR is true only when we are reading file1
$0 is the current line of text
a[$0] is a hash with the key set to the current line of text
a[$0]++ tracks that we've seen the current line of text
!($0 in a) is true only when we have not seen the line text
Print the line of text if the above pattern returns true; this is the default awk behavior when no explicit action is given.
Using some lesser-known utilities:
sort file1 > file1.sorted
sort file2 > file2.sorted
comm -1 -3 file1.sorted file2.sorted
This will output duplicates: if a value occurs, say, 3 times in file2 but only 2 times in file1, one occurrence will still be output. If this is not what you want, pipe the output from sort through uniq before writing it to a file:
sort file1 | uniq > file1.sorted
sort file2 | uniq > file2.sorted
comm -1 -3 file1.sorted file2.sorted
There are lots of utilities in the GNU coreutils package that allow for all sorts of text manipulations.
I was wondering which of the following solutions was the "fastest" for "larger" files:
awk 'FNR==NR{a[$0]++}FNR!=NR && !a[$0]{print}' file1 file2 # awk1 by SiegeX
awk 'FNR==NR{a[$0]++;next}!($0 in a)' file1 file2 # awk2 by ghostdog74
comm -13 <(sort file1) <(sort file2)
join -v 2 <(sort file1) <(sort file2)
grep -v -F -x -f file1 file2
Results of my benchmarks in short:
Do not use grep -Fxf; it's much slower (2-4 times in my tests).
comm is slightly faster than join.
If file1 and file2 are already sorted, comm and join are much faster than awk1 + awk2. (Of course, the awk solutions do not assume sorted files.)
awk1 + awk2 supposedly use more RAM and less CPU. Real run times are lower for comm, probably because the two sorts and comm run as concurrent processes. CPU times are lower for awk1 + awk2.
For the sake of brevity I omit full details. However, I assume that anyone interested can contact me or just repeat the tests. Roughly, the setup was
# Debian Squeeze, Bash 4.1.5, LC_ALL=C, slow 4 core CPU
$ wc file1 file2
321599 321599 8098710 file1
321603 321603 8098794 file2
Typical results of fastest runs
awk2: real 0m1.145s user 0m1.088s sys 0m0.056s user+sys 1.144
awk1: real 0m1.369s user 0m1.324s sys 0m0.044s user+sys 1.368
comm: real 0m0.980s user 0m1.608s sys 0m0.184s user+sys 1.792
join: real 0m1.080s user 0m1.756s sys 0m0.140s user+sys 1.896
grep: real 0m4.005s user 0m3.844s sys 0m0.160s user+sys 4.004
BTW, for the awkies: It seems that a[$0]=1 is faster than a[$0]++, and (!($0 in a)) is faster than (!a[$0]). So, for an awk solution I suggest:
awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2
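For reference, timings like these can be reproduced with the bash time keyword, redirecting output to /dev/null so that printing does not dominate the measurement (a sketch):
time awk 'FNR==NR{a[$0]=1;next}!($0 in a)' file1 file2 > /dev/null
time comm -13 <(sort file1) <(sort file2) > /dev/null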
How about:
diff file_1 file_2 | grep '^>' | cut -c 3-
This would print the entries in file_2 which are not in file_1. For the opposite result, one just has to replace '>' with '<'. cut removes the first two characters added by diff, which are not part of the original content.
The files don't even need to be sorted, though with unsorted files diff may also report lines that merely moved, since it compares sequences rather than sets.
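Concretely, the opposite direction (entries in file_1 which are not in file_2) would be:
diff file_1 file_2 | grep '^<' | cut -c 3-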
with grep:
grep -F -x -v -f file_1 file_2
here's another awk solution
$ awk 'FNR==NR{a[$0]++;next}(!($0 in a))' file1 file2
6
7
$ cat file1 file1 file2 | sort | uniq -u
6
7
uniq -- report or filter out repeated lines in a file
... Repeated lines in the input will not be detected if they are not adjacent, so it may be necessary to sort the files first.
-u      Only output lines that are not repeated in the input.
Print file1 twice to make sure all entries from file1 are skipped by uniq -u: in the sorted stream, every value from file1 then appears at least twice (1 1 1, 2 2 2, ..., 5 5 for the sample data), so only 6 and 7 survive.
cat file1 file2 | sort -u > unique
If you are really set on doing this from the command line, this site (search for "no duplicates found") has an awk example that searches for duplicates. It may be a good starting point.
However, I'd encourage you to use Perl or Python for this. Basically, the flow of the program, written out as runnable Python, would be:

def find_unique_values(file1, file2):
    # Read all values from both files, one value per line
    contents1 = [line.strip() for line in open(file1)]
    contents2 = [line.strip() for line in open(file2)]
    for value2 in contents2:
        found = False
        for value1 in contents1:
            if value2 == value1:
                found = True
        if not found:
            print(value2)

# find_unique_values("file1", "file2")  # prints 6 and 7 for the sample files
This isn't the most elegant way of doing it, since it has O(n^2) time complexity, but it will do the job. Replacing the inner loop with a set or hash lookup, as the awk solutions above effectively do, would bring it down to O(n).
