Use awk to left outer join two CSV files based on multiple columns while keeping the order of the first file's observations - linux

I have two csv files.
File 1
ID,Name,Gender,Salary,DOB
11,Jim,M,200,90
12,David,M,100,89
12,David,M,300,89
13,Lucy,F,150,86
14,Lily,F,200,85
13,Lucy,F,100,86
File 2
DOB,Name,Children
90,Jim,2
88,Michael,4
88,Lily,1
85,Lily,0
What I want to do is to left outer join File 2 into File 1 based on DOB and Name while keeping the order of File 1 observations.
So the output is expected to be
ID,Name,Gender,Salary,DOB,Children
11,Jim,M,200,90,2
12,David,M,100,89,
12,David,M,300,89,
13,Lucy,F,150,86,
14,Lily,F,200,85,0
13,Lucy,F,100,86,
I learned that we need to sort the data if we use the join command, so I was wondering whether I could use awk to do this instead. But I am new to awk. Can anyone help me? Also, if the data is very big, can I drop the print command in awk and simply use > *.csv to save the result into a new CSV file? I ask because the solutions to related questions on this site often use {print ...}. Thank you.

awk to the rescue!
$ awk -F, 'NR==FNR{a[$1,$2]=$3; next} {print $0 FS a[$NF,$2]}' file2 file1
ID,Name,Gender,Salary,DOB,Children
11,Jim,M,200,90,2
12,David,M,100,89,
12,David,M,300,89,
13,Lucy,F,150,86,
14,Lily,F,200,85,0
13,Lucy,F,100,86,
join would require sorted input, and you would need extra work to recover the initial ordering. You can redirect the output to a file by appending > outputfile.csv to the command.
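For reference, here is the same one-liner spelled out with comments and redirected to a file (outputfile.csv is just an example name):
awk -F, '
NR == FNR { a[$1, $2] = $3; next }    # file2: a[DOB, Name] = Children
          { print $0 FS a[$NF, $2] }  # file1: DOB is the last field, Name is the 2nd
' file2 file1 > outputfile.csv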

Related

Retrieve different pieces of information from several files and bring them together in one (Bash)

I have a problem with my bash script: I would like to retrieve information contained in several files and gather it in a single one.
I have a file in this form which contains about 15000 lines: (file1)
1;1;A0200101C
2;2;A0200101C
3;3;A1160101A
4;4;A1160101A
5;5;A1130304G
6;6;A1110110U
7;7;A1110110U
8;8;A1030002V
9;9;A1030002V
10;10;A2120100C
11;11;A2120100C
12;12;A3410071A
13;13;A3400001A
14;14;A3385000G1
15;15;A3365070G1
I need to retrieve the first field (the id) from each row.
My second file is this; I just need to retrieve its 3rd line: (file2)
count
-------
131
(1 row)
I would therefore like to combine the id from (file1) with the 3rd line of (file2) in order to achieve this result:
1;131
2;131
3;131
4;131
5;131
6;131
7;131
8;131
9;131
11;131
12;131
13;131
14;131
15;131
Thank you.
One possible way:
#!/usr/bin/env bash
count=$(awk 'NR == 3 { print $1 }' file2)
while IFS=';' read -r id _; do
printf "%s;%s\n" "$id" "$count"
done < file1
First, read just the third line of file2 and save that in a variable.
Then read each line of file1 in a loop, extracting the first semicolon-separated field, and print it along with that saved value.
Using the same basic approach in a pure awk script instead of shell will be much faster and more efficient. Such a rewrite is left as an exercise for the reader. (Hint: In awk, FNR == NR is true while reading the first file given, and false on any later ones. Alternatively, look up how to pass a shell variable to an awk script; there are Q&As here on SO about it.)
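For what it's worth, here is a minimal sketch of that pure awk rewrite, assuming the same file names as above and only the sample data shown:
awk -F';' '
NR == FNR { if (FNR == 3) count = $1; next }   # file2: grab the value on line 3
          { print $1 ";" count }               # file1: print id;count
' file2 file1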

How to randomly sort one key while the other is kept in its original sort order with GNU "sort"

Given an input list like the following:
405:alice#level1
405:bob#level2
405:chuck#level1
405:don#level3
405:eric#level1
405:francis#level1
004:ac#jjj
004:la#jjj
004:za#zzz
101:amy#floor1
101:brian#floor3
101:christian#floor1
101:devon#floor1
101:eunuch#floor2
101:frank#floor3
005:artie#le2
005:bono#nuk1
005:bozo#nor2
(As you can see, the first field has been randomly sorted: the original input had the first field in numerical order, with 004 coming first, then 005, 101, 405, et al. Within each group, however, the second field is still in alphabetical order on its first character.)
What I want is a randomized sort keyed on the first field (the part before the colon ':'): all lines that share the same first-field value must stay grouped together, but the groups themselves should be randomly distributed throughout the file, and within each group the second field should be randomly shuffled as well. I am unable to get this result because I am not very familiar with sort keys.
The desired output would look similar to this:
405:francis#level1
405:don#level3
405:eric#level1
405:bob#level2
405:alice#level1
405:chuck#level1
004:za#zzz
004:ac#jjj
004:la#jjj
101:christian#floor1
101:amy#floor1
101:frank#floor3
101:eunuch#floor2
101:brian#floor3
101:devon#floor1
005:bono#nuk1
005:artie#le2
005:bozo#nor2
Does anyone know how to achieve this type of sort?
Thank you!
You can do this with awk pretty easily.
As a one-liner:
awk -F: 'BEGIN{cmd="sort -R"} $1 != key {close(cmd)} {key=$1; print | cmd}' input.txt
Or, broken apart for easier explanation:
-F: - Set awk's field separator to colon.
BEGIN{cmd="sort -R"} - before we start, set a variable that is a command to do the "randomized sort". This one works for me on FreeBSD. Should work with GNU sort as well.
$1 != key {close(cmd)} - If the current line has a different first field than the last one processed, close the output pipe...
{key=$1; print | cmd} - And finally, set the "key" var, and print the current line, piping output through the command stored in the cmd variable.
This usage takes advantage of a bit of awk awesomeness. When you pipe output to a command given as a string (whether it's stored in a variable or not), that pipe is automatically created on first use. You can close it at any time, and a subsequent use will reopen a new command.
The impact of this is that each time you close(cmd), you print the current set of randomly sorted lines. And awk closes cmd automatically once you come to the end of the file.
Of course, for this solution to work, it's vital that all lines with a shared first field are grouped together.
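If your input is not already grouped that way, one simple option (just a sketch; note it puts the groups in key order rather than random order) is to group it first and feed the result to the same awk program. The alternative further down avoids that assumption entirely.
sort -t: -k1,1 input.txt |
awk -F: 'BEGIN{cmd="sort -R"} $1 != key {close(cmd)} {key=$1; print | cmd}'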
Not as elegant, but a different method:
$ awk -F: '!($1 in a){a[$1]=c++} {print a[$1] "\t" $0}' file |
sort -R -k2 |
sort -nk1,1 -s |
cut -f2-
Or this alternative, which doesn't assume the input is grouped to begin with:
$ sort -R file |
awk -F: '!($1 in a){a[$1]=c++} {print a[$1] "\t" $0}' |
sort -nk1,1 -s |
cut -f2-

Bash - Compare 2 lists of files with their md5 check sums

I have 2 lists of files with their md5sum checks. The lists have different paths even though the files are the same. I want to check the md5 sums of each file. We are talking about thousands of files, and that's why I need a script to show me only the differences. The first list is the vanilla one and the second is the current state of the files. I want to find which files have been changed and differ from the original. To do that I want to compare the 2 lists. On every line there is an md5 sum and a file location/name. Does anyone have an idea how to do that? And what happens if there is one extra file in one of the lists?
Example of content in first file with check sums (vanilla list):
df7a0edcb7994581430379db56d8d53b /home/user/vanila/file-1.php
e1af39e94239a944440ab2925393ae60 /home/user/vanila/file-2.php
ce74e43d24d9c36cd579e932ee94b152 /home/user/vanila/file-3.php
95b7d47ed7134912270f8d3059100e8c /home/user/vanila/file-4.php
Example of content in second file with check sums (active list):
df7a0edcb7994581430379db56d8d53b /home/user/file-1.php
94b2a24a1fc9883246fc103f22818930 /home/user/file-1.1.php
e1af39e94239a944440ab2925393ae60 /home/user/file-2.php
ce74e43d24d9c36cd579e932ee94b152 /home/user/file-3.php
f5233ee990c50aade7c4e3ab9b4fe524 /home/user/file-4.php
Expecting results:
To show me that file-4.php is with different md5 sum.
If shows that there is an extra file (file-1.1.php) it's a bonus!
An attempt using awk, which is the right tool for this:
awk -F"/" 'FNR==NR{filearray[$1]=$NF; next }!($1 in filearray){printf "%s has a different md5sum\n",$NF}' file2 file1
file-4.php has a different md5sum
where file2 and file1 are as follows:
$ cat file1
df7a0edcb7994581430379db56d8d53b /home/user/vanila/file-1.php
e1af39e94239a944440ab2925393ae60 /home/user/vanila/file-2.php
ce74e43d24d9c36cd579e932ee94b152 /home/user/vanila/file-3.php
95b7d47ed7134912270f8d3059100e8c /home/user/vanila/file-4.php
$ cat file2
df7a0edcb7994581430379db56d8d53b /home/user/file-1.php
94b2a24a1fc9883246fc103f22818930 /home/user/file-1.1.php
e1af39e94239a944440ab2925393ae60 /home/user/file-2.php
ce74e43d24d9c36cd579e932ee94b152 /home/user/file-3.php
f5233ee990c50aade7c4e3ab9b4fe524 /home/user/file-4.php
To find a file that is present in one list but not in the other:
awk -F"/" 'FNR==NR{filelist[$NF]=$NF; next}!($NF in filelist){printf "%s is an extra file\n",$NF}' file1 file2
file-1.1.php is an extra file
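If it helps, the two checks can also be combined into a single pass. This is just a sketch along the same lines as the commands above, keyed on each file's basename:
awk -F'/' '
FNR == NR { md5[$NF] = $1; next }                              # file1: basename -> md5 prefix
!($NF in md5) { printf "%s is an extra file\n", $NF; next }    # not in the vanilla list at all
md5[$NF] != $1 { printf "%s has a different md5sum\n", $NF }   # present, but checksum changed
' file1 file2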

Filtering CSV File using AWK

I'm working on a CSV file.
This is my CSV file.
Command used for filtering: awk -F"," '{print $14}' out_file.csv > test1.csv
This is an example of what my data looks like; I have around 43 rows and 12,000 columns.
I planned to extract a single column using an awk command, but I can't manage to separate column 3 (disease) on its own.
I used the following command to get my output:
awk -F"," '{print $3}' out_file.csv > test1.csv
This is my file:
gender|gene_name |disease |1000g_oct2014|Polyphen |SNAP
male |RB1,GTF2A1L|cancer,diabetes |0.1 |0.46 |0.1
male |NONE,LOC441|diabetes |0.003 |0.52 |0.6
male |TBC1D1 |diabetes |0.940 |1 |0.9
male |BCOR |cancer |0 |0.31 |0.2
male |TP53 |diabetes |0 |0.54 |0.4
note "|" i did not use this a delimiter. it for show the row in an order my details looks exactly like this in the spreed sheet:
But i'm getting the output following way
Disease
GTF2A1L
LOC441
TBC1D1
BCOR
TP53
When I open it in a spreadsheet I get the results in the proper manner, but when I use awk, the commas embedded inside column 2 are also treated as separators. I don't know why.
Can anyone help me with this?
The root of your problem is that you have comma-separated values with embedded commas.
That makes life more difficult. I would suggest using a CSV parser.
I quite like perl and Text::CSV:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
open ( my $data, '<', 'data_file.csv' ) or die $!;
my $csv = Text::CSV -> new ( { binary => 1, sep_char => ',', eol => "\n" } );
while ( my $row = $csv -> getline ( $data ) ) {
    print $row -> [2],"\n";
}
Of course, I can't tell for sure if that actually works, because the data you've linked on your Google Drive doesn't actually match the question you've asked. (Note: Perl starts arrays at zero, so [2] is actually the 3rd field.)
But it should do the trick - Text::CSV handles quoted comma fields nicely.
Unfortunately the link you provided ("This is my file") points to two files, neither of which (at the time of this writing) seems to correspond with the sample you gave. However, if your file really is a CSV file with commas used both for separating fields and embedded within fields, then the advice given elsewhere to use a CSV-aware tool is very sound. (I would recommend considering a command-line program that can convert CSV to TSV so the entire *nix tool chain remains at your disposal.)
Your sample output and attendant comments suggest you may already have a way to convert it to a pipe-delimited or tab-delimited file. If so, then awk can be used quite effectively. (If you have a choice, then I'd suggest tabs, since then programs such as cut are especially easy to use.)
The general idea, then, is to use awk with "|" (or tab) as the primary separator (awk -F"|" or awk -F\\t), and to use awk's split function to parse the contents of each top-level field.
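As a rough sketch of that idea, assuming a pipe-delimited version of the sample above (with padding spaces around the fields) in a hypothetical file called pipe_delimited_file, something like this would pull the individual diseases out of the 3rd field:
awk -F'|' 'NR > 1 {
    field = $3
    gsub(/^ +| +$/, "", field)        # trim the alignment padding
    n = split(field, disease, ",")    # split the embedded comma-separated values
    for (i = 1; i <= n; i++) print disease[i]
}' pipe_delimited_file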
In the end, this is what I did to get my answer in a simple way; thanks to @peak I found the solution.
First I used csvfilter, a Python module for filtering CSV files.
I changed my delimiter using csvfilter with the following command:
csvfilter input_file.csv --out-delimiter="|" > out_file.csv
This command changes the delimiter from ',' to '|'.
Then I used awk to sort and filter:
awk -F"|" 'FNR == 1 {print} {if ($14 < 0.01) print }' out_file.csv > filtered_file.csv
Thanks for your help.

Split ordered file in Linux

I have a large delimited file (with pipe '|' as the delimiter) which I have managed to sort (using linux sort) according to first (numeric), second (numeric) and fourth column (string ordering since it is a timestamp value). The file is like this:
77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399
77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735
78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I was wondering if there is an easy way to split this file to multiple text files with an awk, sed, grep or perl one liner whenever the first column or the second column value changes. The final result for the example file should be 3 text files like that:
77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399
77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735
78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I could do that in Java, of course, but I think it would be overkill if it can be done with a script. Also, is it possible for the filenames created to use those two columns' values, something like 77_141.txt for the first file, 77_171.txt for the second, and 78_115.txt for the third?
awk is very handy for this kind of problem. This can be one approach:
awk -F"|" '{print >> $1"_"$2".txt"}' file
Explanation
-F"|" sets field separator as |.
{print > something} prints the lines into the file something.
$1"_"$2".txt" instead of something, set the output file as $1"_"$2, being $1 the first field based on the | separator. That is, 77, 78... And same for $2, being 141, 171...
