check if any value of a column in csv file exists in a column in a second csv file - linux

I want to compare columns in two csv files. Basically, check if any value in one column exists in another column, and if so, print out any such values.
Ex:
file1:
id,value
abc,789
efg,766
hij,456
file2:
id,value
klm,789
nop,766
abc,456
I need to check whether any values in file2's 'id' column exist in file1's 'id' column. In the example above, 'abc' is one value that is repeated and needs to be printed out.
Is there a bash script that can do this?

Using awk:
awk -F, 'FNR==1 { next } NR==FNR { map[$1]=$2;next } map[$1]!="" { print;print $1"\t"map[$1] } ' file1 file2
If the line number within the current file is 1 (FNR==1), skip to the next line; this drops the header of each file. While processing the first file (NR==FNR), create an array map with the first comma-separated field as the index and the second field as the value. Then, when processing the second file, if there is an entry in map for the first field, print the line along with the corresponding entry from the map array.
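For readability, here is the same program spread over multiple lines (functionally identical to the one-liner above; the comments are mine):
awk -F, '
    FNR == 1  { next }                  # skip the header line of each file
    NR == FNR { map[$1] = $2; next }    # first file (file1): remember value keyed by id
    map[$1] != "" {                     # second file (file2): id was also seen in file1
        print                           # print the matching line from file2
        print $1 "\t" map[$1]           # and the id with its value from file1
    }
' file1 file2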

If you are using Python, you can use the pandas library.
import pandas as pd
df = pd.DataFrame(yourdata, columns = ['X', 'Y', 'Z'])
duplicate = df[df.duplicated()]
print(duplicate)
For more detailed info, you can check this page.

Using join (and tail and sort and Bash's process substitution):
$ join -t, -j 1 -o "1.1" <(tail -n +2 file1 | sort) <(tail -n +2 file2 | sort)
abc
Explained:
join -t, -j 1 -o "1.1" use comma as the field separator, join on the first field, output the first field of the first file
<(...) Bash's process substitution
tail -n +2 file1 ditch the header
| sort join expects the files to be sorted
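For reference (my illustration with the sample data), this is roughly what each process substitution hands to join, i.e. the header-less, sorted file:
$ tail -n +2 file1 | sort
abc,789
efg,766
hij,456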
(Yeah, I'd use @RamanSailopal's awk solution, too, ++)

Related

Filtering on a condition using the column names and not numbers

I am trying to filter a text file with columns based on two conditions. Due to the size of the file, I cannot use the column numbers (as there are thousands of columns and they are unnumbered) but need to use the column names. I have searched and tried to come up with multiple ways to do this, but nothing is returned to the command line.
Here are a few things I have tried:
awk '($colname1==2 && $colname2==1) { count++ } END { print count }' file.txt
to filter out the columns based on their conditions
and
head -1 file.txt | tr '\t' | cat -n | grep "COLNAME
to try and return the possible column number related to the column.
An example file would be:
ID ad bd
1 a fire
2 b air
3 c water
4 c water
5 d water
6 c earth
Output would be:
2 (count of ad=c and bd=water)
with your input file and the implied conditions this should work
$ awk -v c1='ad' -v c2='bd' 'NR==1{n=split($0,h); for(i=1;i<=n;i++) col[h[i]]=i}
$col[c1]=="c" && $col[c2]=="water"{count++} END{print count+0}' file
2
or you can replace c1 and c2 with the values in the script as well.
to find the column indices you can run
$ awk -v cols='ad bd' 'BEGIN{n=split(cols,c); for(i=1;i<=n;i++) colmap[c[i]]}
NR==1{for(i=1;i<=NF;i++) if($i in colmap) print $i,i; exit}' file
ad 2
bd 3
or perhaps with this chain
$ sed 1q file | tr -s ' ' \\n | nl | grep -E 'ad|bd'
2 ad
3 bd
although may have false positives due to regex match...
You can rewrite the awk to be more succinct
$ awk -v cols='ad bd' '{while(++i<=NF) if(FS cols FS ~ FS $i FS) print $i,i;
exit}' file
ad 2
bd 3
As I mentioned in an earlier comment, the answer at https://unix.stackexchange.com/a/359699/133219 shows how to do this:
awk -F'\t' '
    NR==1 {
        for (i=1; i<=NF; i++) {
            f[$i] = i
        }
    }
    ($(f["ad"]) == "c") && ($(f["bd"]) == "water") { cnt++ }
    END { print cnt+0 }
' file
2
I'm assuming your input is tab-separated due to the tr '\t' in the command in your question that looks like you're trying to convert tabs to newlines to convert column names to numbers. If I'm wrong and they're just separated by any chains of white space then remove -F'\t' from the above.
Use the miller toolkit to manipulate tab-delimited files using column names. Below is a one-liner that filters a tab-delimited file (the delimiter is specified using --tsv) and writes the results to STDOUT together with the header. The header is then removed using tail and the lines are counted with wc.
mlr --tsv filter '$ad == "c" && $bd == "water"' file.txt | tail -n +2 | wc -l
Prints:
2
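As a side note (my suggestion, not part of the original answer), miller can also do the counting itself with its count verb chained via then, which avoids the tail/wc pipeline; the result is a tiny table with a single count column:
mlr --tsv filter '$ad == "c" && $bd == "water"' then count file.txt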
SEE ALSO:
miller manual
Note that miller can be easily installed, for example, using conda, like so:
conda create --name miller miller
For years it bugged me that there is no succinct way in Unix to do this sort of thing, although miller is a pretty good tool for it. Recently I wrote pick to choose columns by name, and additionally to modify, combine and add them by name, as well as to filter rows by clauses using column names. The solution to the above with pick is:
pick -h #ad=c #bd=water < data.txt | wc -l
By default pick prints the header of the selected columns; -h omits it. To print columns you simply name them on the command line, e.g.
pick ad water < data.txt | wc -l
Pick has many modes, all of them focused on manipulating columns and selecting/filtering rows with a minimal amount of syntax.

Linux split a file in two columns

I have the following file that contains 2 columns:
A:B:IP:80 apples
C:D:IP2:82 oranges
E:F:IP3:84 grapes
How is it possible to split the file into 2 other files, one column per file, like this:
File1
A:B:IP:80
C:D:IP2:82
E:F:IP3:84
File2
apples
oranges
grapes
Try:
awk '{print $1>"file1"; print $2>"file2"}' file
After running that command, we can verify that the desired files have been created:
$ cat file1
A:B:IP:80
C:D:IP2:82
E:F:IP3:84
And:
$ cat file2
apples
oranges
grapes
How it works
print $1>"file1"
This tells awk to write the first column to file1.
print $2>"file2"
This tells awk to write the second column to file2.
Perl 1-liner using (abusing) the fact that print goes to STDOUT, i.e. file descriptor 1, and warn goes to STDERR, i.e. file descriptor 2:
# perl -n means loop over the lines of input automatically
# perl -e means execute the following code
# chomp means remove the trailing newline from the expression
perl -ne 'chomp(my @cols = split /\s+/); # Split each line on whitespace
print $cols[0] . "\n";
warn $cols[1] . "\n"' <input 1>col1 2>col2
You could, of course, just use cut -b with the appropriate columns, but then you would need to read the file twice.
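For what it's worth, a field-based cut version (my sketch, using -d/-f rather than byte positions and assuming a single space between the columns) would indeed be two passes over the input:
cut -d' ' -f1 input > col1
cut -d' ' -f2 input > col2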
Here's an awk solution that'll work with any number of columns:
awk '{for(n=1;n<=NF;n++)print $n>"File"n}' input.txt
This steps through each field on the line and prints the field to a different output file based on the column number.
Note that blank fields -- or rather, lines with fewer fields than other lines -- will cause line numbers to mismatch. That is, if your input is:
A 1
B
C 3
Then File2 will contain:
1
3
If this is a concern, mention it in an update to your question.
You could of course do this in bash alone, in a number of ways. Here's one:
while read -r line; do
    a=($line)
    for m in "${!a[@]}"; do
        printf '%s\n' "${a[$m]}" >> File$((m+1))
    done
done < input.txt
This reads each line of input into $line, then word-splits $line into values in the $a[] array. It then steps through that array, printing each item to the appropriate file, named for the index of the array (plus one, since bash arrays start at zero).
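One caveat worth adding (my note): because the loop appends with >>, re-running it will append to any FileN output left over from a previous run, so you may want to clear the old files first, for example:
rm -f File1 File2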

Extract specific columns from delimited file (long row to next line)

Extracting 2 columns from a delimited file (delimiter '||') in unix can easily be done if the complete row is on one line, like below:
foo||bar||baz||quux
by
cut -d'||' -f1 file_name
but in my case records in file for a single row record went to next line for example:
foo||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
||quux||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
and the output from the above command is
foo
quux
instead it should be just "foo" because it is in the first column.
File content in row 1:
foo||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
||quux||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
File content in row 2:
foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2
||quux2||bar2||baz2||quux2||foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2
The output should be:
foo
foo2
Almost, but the -d switch only takes one char:
cut -d'|' -f1 file_name
Output:
foo
foo2
Note: since the delimiters are doubled, the -f switch won't work as expected if the field number is greater than 1. One way to handle that is to adjust the field number to equal "2n-1". So to get field #3, do -f$(( (3*2) - 1 )).
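To make that concrete (my example), field #3 in the '||' sense on the first sample line is baz, and 2*3-1 = 5 selects it with the single-character delimiter:
$ echo 'foo||bar||baz||quux' | cut -d'|' -f5
baz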
Using awk. Since it's the first field of every other record (NR%2), use:
$ awk -F\| 'NR%2{print $1}' file
foo
foo2
Data (four records):
$ cat file
foo||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
||quux||bar||baz||quux||foo||bar||baz||quux||foo||bar||baz||quux
foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2
||quux2||bar2||baz2||quux2||foo2||bar2||baz2||quux2||foo2||bar2||baz2||quux2
An interesting phenomenon is that mawk accepts -F"\|\|" (dual pipes) as the delimiter but GNU awk doesn't.
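If portability between awk implementations matters, spelling the separator as a bracket expression should sidestep the escaping difference (my suggestion; both mawk and gawk treat a multi-character FS as a regular expression):
$ awk -F'[|][|]' 'NR%2{print $1}' file
foo
foo2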

How do I sort a CSV file by a specific column?

I want to sort a csv as follows. What I want is:
sort by column 2
if column 2 is the same, sort by column 3 (numerically)
Here is what I do:
$ sort -t"," -k2 -nk3 /tmp/test.csv
55b64670abb9c0663e77de84,525e3bfad07b4377dc142a24:9999,0.081032
5510b33ec720d80086865312,525e3bfad07b4377dc142a24:9999,0.081033
55aca6a1d2e33dc888ddeb31,525e3bf7d07b4377d31429d2:2,0.081034
55aca6a1d2e33dc888ddeb31,525e3bf7d07b4377d31429d2:2,0.081034
5514548ec720d80086bfec46,525e3bfad07b4377dc142a24:9999,0.081035
551d4e21c720d80086084f45,525e3bfad07b4377dc142a24:9999,0.081036
557bff5276bd54a8df83268a,525e3bfad07b4377dc142a24:9999,0.081036
This result is strange; it appears to sort by column 3 first, then by column 2.
This command seems to yield correct output:
sort -t"," -k2,2 -k3,3n /tmp/test.csv
The ,2 and ,3 endings constrain each sort key to that single column (without them a key such as -k2 runs from field 2 all the way to the end of the line, which is why the original command appears to sort on column 3 first), and the n modifier on the third key makes it sort numerically.
It yields:
55aca6a1d2e33dc888ddeb31,525e3bf7d07b4377d31429d2:2,0.081034
55aca6a1d2e33dc888ddeb31,525e3bf7d07b4377d31429d2:2,0.081034
55b64670abb9c0663e77de84,525e3bfad07b4377dc142a24:9999,0.081032
5510b33ec720d80086865312,525e3bfad07b4377dc142a24:9999,0.081033
5514548ec720d80086bfec46,525e3bfad07b4377dc142a24:9999,0.081035
551d4e21c720d80086084f45,525e3bfad07b4377dc142a24:9999,0.081036
557bff5276bd54a8df83268a,525e3bfad07b4377dc142a24:9999,0.081036
sort works for sorting data in csv and txt files, and it prints the output to the console.
-t says the columns are delimited by '|', and -k1 -k2 says it will sort the data by column 1 and then by column 2.
$ sort -t '|' -k1 -k2 <INPUT_FILE>
For storing the result in output file use following command
$ sort -t '|' -k1 -k2 <INPUT_FILE> -o <OUTPUTFILE>
If you want to do it while ignoring the header line, then use the following command:
(head -n1 INPUT_FILE && sort <(tail -n+2 INPUT_FILE)) > OUTPUT_FILE
head -n1 INPUT_FILE prints only the first line of your file, i.e. the header, and the tail -n+2 part gets your file from the second line up to EOF.
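Combining that with the sort keys from this question (my combination, not from the original answer), a header-preserving sort could look like:
(head -n1 /tmp/test.csv && tail -n +2 /tmp/test.csv | sort -t, -k2,2 -k3,3n) > sorted.csv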
While the sort command has some hacks to partially handle CSV files, it won't handle all CSV format features. csvsort is a great option:
csvsort -c 2,3 /tmp/test.csv
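Note that csvsort is part of the csvkit suite, so if it is not already installed, something like the following should get it (assuming a Python environment with pip):
pip install csvkit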

Best way to print rows not common to two large files in Unix

I have two files which are of following format.
File1: - It contains 4 columns. The first field is an ID in text format and the rest of the columns are also text values.
id1 val12 val13 val14
id2 val22 val23 val24
id3 val32 val33 val34
File2 - In file two I only have IDs.
id1
id2
Output
id3 val32 val33 val34
My question is: how do I find rows from the first file whose ID (first field) does not appear in the second file? Both files are pretty large: file1 contains 42 million rows and is 8GB in size, and file2 contains 33 million IDs. The order of IDs in the two files might not be the same.
Assuming the two files are sorted by id, then something like
join "-t " -j 1 -v 1 file1 file2
should do it.
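If the files are not already sorted by id, one variation (my sketch) is to sort them on the fly with process substitution:
join "-t " -j 1 -v 1 <(sort file1) <(sort file2)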
You could do like this with awk:
awk 'FNR == NR { h[$1] = 1; next } !h[$1]' file2 file1
The first block gathers ids from file2 into the h hash. The last part (!h[$1]) executes the default block ({ print $0 }) if the id wasn't present in file2.
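For the record, with the sample files above this should print just the one unmatched row:
$ awk 'FNR == NR { h[$1] = 1; next } !h[$1]' file2 file1
id3 val32 val33 val34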
I don't claim that this is the "best" way to do it because best can include a number of trade-off criteria, but here's one way:
You can do this with the -f option to specify File2 as the file containing search patterns to grep:
grep -v -f File2 File1 > output
And as @glennjackman suggests:
One way to force the id to match at the beginning of the line: grep -vf <(sed 's/^/^/' File2) File1
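A possible refinement for inputs of this size (my addition, not from the original answers): telling grep that the patterns are fixed strings (-F) and must match whole words (-w) avoids both regex overhead and accidental substring matches:
grep -v -w -F -f File2 File1 > output
Even so, with 33 million patterns the awk and join approaches above will generally scale better.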
