Filtering CSV File using AWK - linux

I'm working on a CSV file. The command I used for filtering is awk -F"," '{print $14}' out_file.csv > test1.csv
Below is an example of what my data looks like; I have around 43 columns and 12,000 rows.
I planned to pull out a single column with awk, but I am not able to extract column 3 (disease) on its own.
I used the following command to get my output:
awk -F"," '{print $3}' out_file.csv > test1.csv
This is my file:
gender|gene_name |disease |1000g_oct2014|Polyphen |SNAP
male |RB1,GTF2A1L|cancer,diabetes |0.1 |0.46 |0.1
male |NONE,LOC441|diabetes |0.003 |0.52 |0.6
male |TBC1D1 |diabetes |0.940 |1 |0.9
male |BCOR |cancer |0 |0.31 |0.2
male |TP53 |diabetes |0 |0.54 |0.4
note "|" i did not use this a delimiter. it for show the row in an order my details looks exactly like this in the spreed sheet:
But i'm getting the output following way
Disease
GTF2A1L
LOC441
TBC1D1
BCOR
TP53
When I open the file in a spreadsheet I get the results properly, but when I use awk, the "," inside column 2 is also treated as a separator, and I don't know why.
Can anyone help me with this?

The root of your problem is that you have comma-separated values with embedded commas. That makes life more difficult. I would suggest using a CSV parser.
I quite like Perl and Text::CSV:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

open ( my $data, '<', 'data_file.csv' ) or die $!;
my $csv = Text::CSV->new( { binary => 1, sep_char => ',', eol => "\n" } );
while ( my $row = $csv->getline($data) ) {
    print $row->[2], "\n";    # third field (disease); Perl arrays start at zero
}
Of course, I can't tell for sure whether that actually works, because the data you've linked on your Google Drive doesn't actually match the question you've asked. (Note: Perl starts arrays at zero, so [2] is the 3rd field and [3] would be the 4th.)
But it should do the trick: Text::CSV handles quoted fields containing commas nicely.

Unfortunately the link you provided ("This is my file") points to two files, neither of which (at the time of this writing) seems to correspond with the sample you gave. However, if your file really is a CSV file with commas used both for separating fields and embedded within fields, then the advice given elsewhere to use a CSV-aware tool is very sound. (I would recommend considering a command-line program that can convert CSV to TSV so the entire *nix tool chain remains at your disposal.)
Your sample output and attendant comments suggest you may already have a way to convert it to a pipe-delimited or tab-delimited file. If so, then awk can be used quite effectively. (If you have a choice, then I'd suggest tabs, since then programs such as cut are especially easy to use.)
The general idea, then, is to use awk with "|" (or tab) as the primary separator (awk -F"|" or awk -F\\t), and to use awk's split function to parse the contents of each top-level field.
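A minimal sketch of that idea, assuming the file has already been converted to pipe-delimited form, so that the third top-level field holds the comma-separated disease list (the file name converted.csv is illustrative):
awk -F"|" '{ n = split($3, d, ","); for (i = 1; i <= n; i++) print d[i] }' converted.csv
split() breaks the third field on "," and the loop prints each disease on its own line; a plain print $3 would keep the list intact.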

In the end, this is what I did to get my answer in a simple way. Thanks to @peak, I found the solution.
First I used csvfilter, a Python module for filtering CSV files.
I changed my delimiter with csvfilter using the following command:
csvfilter input_file.csv --out-delimiter="|" > out_file.csv
This command changes the delimiter ',' into '|'.
Then I used awk to filter:
awk -F"|" 'FNR == 1 {print} {if ($14 < 0.01) print }' out_file.csv > filtered_file.csv
Thanks for your help.

Related

Use awk to left outer join two CSV files based on multiple columns, keeping the order of the first file's observations

I have two csv files.
File 1
ID,Name,Gender,Salary,DOB
11,Jim,M,200,90
12,David,M,100,89
12,David,M,300,89
13,Lucy,F,150,86
14,Lily,F,200,85
13,Lucy,F,100,86
File 2
DOB,Name,Children
90,Jim,2
88,Michael,4
88,Lily,1
85,Lily,0
What I want to do is to left outer join File 2 into File 1 based on DOB and Name while keeping the order of File 1 observations.
So the output is expected to be
ID,Name,Gender,Salary,DOB,Children
11,Jim,M,200,90,2
12,David,M,100,89,
12,David,M,300,89,
13,Lucy,F,150,86,
14,Lily,F,200,85,0
13,Lucy,F,100,86,
I learned that we need to sort the data if we use the join command, so I was wondering whether I could use awk instead. But I am new to awk. Can anyone help me? By the way, if the data is very big, can I drop the print command in awk and simply use > *.csv to save into a new CSV file? I ask because the solutions to related questions on this site often use {print ...}. Thank you.
awk to the rescue!
$ awk -F, 'NR==FNR{a[$1,$2]=$3; next} {print $0 FS a[$NF,$2]}' file2 file1
ID,Name,Gender,Salary,DOB,Children
11,Jim,M,200,90,2
12,David,M,100,89,
12,David,M,300,89,
13,Lucy,F,150,86,
14,Lily,F,200,85,0
13,Lucy,F,100,86,
join requires sorted input, and you would need extra steps to recover the initial ordering. You can redirect the output to a file by appending > outputfile.csv to the command above.
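For readability, here is the same one-liner written out with comments (a sketch; file names as in the question):
awk -F, '
    NR == FNR {                        # first file on the command line: file2
        children[$1, $2] = $3          # index Children by (DOB, Name)
        next
    }
    {                                  # second file: file1
        print $0 FS children[$NF, $2]  # append the match, or an empty string if none
    }
' file2 file1 > joined.csv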

How to use awk for filtering (Perl automation)

This is my txt file
type=0
vcpu_count=10
maste=0
h=0
p=0
memory=23.59
num=2
I want to get the vcpu_count and memory values and store them in an array through Perl (an automation script).
awk -F'=' '/vcpu_count/{printf "\n",$1}' .vmConfig.txt
I am using this command just to test in the terminal, but I am getting a blank line. How do I do it? I need to get these two values and check a condition.
If you are using Perl anyway, just use Perl for this too.
my %array;
open ( my $config, "<", ".vmConfig.txt" ) or die "$0: Could not open .vmConfig.txt: $!\n";
while (<$config>) {
    next unless /^\s*(vcpu_count|memory)\s*=\s*(.*?)\s*\n/;
    $array{$1} = $2;
}
close($config);
If you don't want the result to be an associative array (aka hash), refactoring should be relatively easy.
The following awk may help you with the same.
First solution:
awk '/vcpu_count/{print;next} /memory/{print}' Input_file
Output will be as follows:
vcpu_count=10
memory=23.59
Second solution:
In case you want to print the values on a single line using printf, the following may help:
awk '/vcpu_count/{val=$0;next} /memory/{printf("%s AND %s\n",val,$0)}' Input_file
Output will be as follows:
vcpu_count=10 AND memory=23.59
When you use awk -F'=' '/vcpu_count/{printf "\n",$1}' .vmConfig.txt there are a couple of mistakes. Firstly, printf "\n" will only ever print a newline, as you have found. You need to add a format specifier: something like printf "%s\n", $2 will treat field 2 as a string and substitute it into the printed string. Checking out man printf at the command line will explain a bit more.
Secondly, as changed above, when you used $1 you were using the first field, which is the key in this case (while $0 is the whole line).
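Putting those two fixes together, the corrected one-liner would be something like this (same file name as in the question):
awk -F'=' '/vcpu_count/ { printf "%s\n", $2 }' .vmConfig.txt
which prints 10 for the sample file above; the same pattern with /memory/ prints 23.59.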
Triplee's solution is probably the most appropriate, but if there is a particular reason to run awk before Perl, the following may help.
As you have done, it splits on =, but then outputs as CSV, which you can change as appropriate. Even if the input lines are not always in the same order, it will print the values in a predictable order on a single line.
awk 'BEGIN {
    FS = "="
    OFS = ","    # or tabs etc. if wanted; delete for spaces
}
/vcpu_count/ { cpu = $2 }
/memory/     { mem = $2 }
END          { print cpu, mem }' .vmConfig.txt
This gives
10,23.59

How to randomly sort one key while the other is kept in its original sort order with GNU "sort"

Given an input list like the following:
405:alice#level1
405:bob#level2
405:chuck#level1
405:don#level3
405:eric#level1
405:francis#level1
004:ac#jjj
004:la#jjj
004:za#zzz
101:amy#floor1
101:brian#floor3
101:christian#floor1
101:devon#floor1
101:eunuch#floor2
101:frank#floor3
005:artie#le2
005:bono#nuk1
005:bozo#nor2
(As you can see, the first field was randomly sorted (the original input had all of the first field in numerical order, with 004 coming first, then 005, 101, 405, et al) but the second field is in alphabetical order on the first character.)
What is desired is a randomized sort in which the first field, as separated by a colon ':', is randomly sorted: all lines sharing the same first-field value must stay grouped together, but the groups themselves should be randomly distributed throughout the file. Within each group, the second field should be randomly sorted as well. I am unable to get this desired result, as I am not too familiar with sort keys.
The desired output would look similar to this:
405:francis#level1
405:don#level3
405:eric#level1
405:bob#level2
405:alice#level1
405:chuck#level1
004:za#zzz
004:ac#jjj
004:la#jjj
101:christian#floor1
101:amy#floor1
101:frank#floor3
101:eunuch#floor2
101:brian#floor3
101:devon#floor1
005:bono#nuk1
005:artie#le2
005:bozo#nor2
Does anyone know how to achieve this type of sort?
Thank you!
You can do this with awk pretty easily.
As a one-liner:
awk -F: 'BEGIN{cmd="sort -R"} $1 != key {close(cmd)} {key=$1; print | cmd}' input.txt
Or, broken apart for easier explanation:
-F: - Set awk's field separator to colon.
BEGIN{cmd="sort -R"} - before we start, set a variable that is a command to do the "randomized sort". This one works for me on FreeBSD. Should work with GNU sort as well.
$1 != key {close(cmd)} - If the current line has a different first field than the last one processed, close the output pipe...
{key=$1; print | cmd} - And finally, set the "key" var, and print the current line, piping output through the command stored in the cmd variable.
This usage takes advantage of a bit of awk awesomeness. When you pipe output to a command given as a string (whether stored in a variable or not), the pipe is created automatically on first use. You can close it at any time, and a subsequent use reopens a new command.
The effect is that each time you close(cmd), the current group of lines is flushed through the randomized sort and printed. awk closes cmd automatically once it reaches the end of the file.
Of course, for this solution to work, it's vital that all lines with a shared first field are grouped together.
Not as elegant, but a different method: tag each line with a group number, shuffle the lines, then use a stable numeric sort on the tag to regroup them, and finally strip the tag.
$ awk -F: '!($1 in a){a[$1]=c++} {print a[$1] "\t" $0}' file |
sort -R -k2 |
sort -nk1,1 -s |
cut -f2-
Or this alternative, which doesn't assume the input is initially grouped:
$ sort -R file |
awk -F: '!($1 in a){a[$1]=c++} {print a[$1] "\t" $0}' |
sort -nk1,1 -s |
cut -f2-

Split ordered file in Linux

I have a large delimited file (with pipe '|' as the delimiter) which I have managed to sort (using linux sort) according to first (numeric), second (numeric) and fourth column (string ordering since it is a timestamp value). The file is like this:
77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399
77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735
78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I was wondering if there is an easy way to split this file into multiple text files with an awk, sed, grep or perl one-liner, starting a new file whenever the first or second column value changes. The final result for the example file should be 3 text files like this:
77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399
77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735
78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I could do that in Java of course, but I think it would be overkill if it can be done with a script. Also, is it possible for the file names to be built from those two column values, something like 77_141.txt for the first file, 77_171.txt for the second and 78_115.txt for the third?
awk is very handy for this kind of problem. This can be one approach:
awk -F"|" '{print >> $1"_"$2".txt"}' file
Explanation
-F"|" sets field separator as |.
{print >> something} appends each line to the file named something.
$1"_"$2".txt" builds that file name from the first field ($1, based on the | separator: 77, 78, ...) and the second field ($2: 141, 171, ...), joined by an underscore, with a .txt extension.

Using Awk to process a file where each record has different fixed-width fields

I have some data files from a legacy system that I would like to process using Awk. Each file consists of a list of records. There are several different record types and each record type has a different set of fixed-width fields (there is no field separator character). The first two characters of the record indicate the type, from this you then know which fields should follow. A file might look something like this:
AAField1Field2LongerField3
BBField4Field5Field6VeryVeryLongField7Field8
CCField99
Using Gawk I can set the FIELDWIDTHS, but that applies to the whole file (unless I am missing some way of setting this on a record-by-record basis), or I can set FS to "" and process the file one character at a time, but that's a bit cumbersome.
Is there a good way to extract the fields from such a file using Awk?
Edit: Yes, I could use Perl (or something else). I'm still keen to know whether there is a sensible way of doing it with Awk though.
Hopefully this will lead you in the right direction. Assuming your multi-line records are guaranteed to be terminated by a 'CC'-type row, you can pre-process your text file using simple if-then logic. I have presumed you require fields 1, 5 and 7 on one row; a sample awk script would be:
BEGIN {
    field1 = ""
    field5 = ""
    field7 = ""
}
{
    record_type = substr($0, 1, 2)
    if (record_type == "AA") {
        field1 = substr($0, 3, 6)
    }
    else if (record_type == "BB") {
        field5 = substr($0, 9, 6)
        field7 = substr($0, 21, 18)
    }
    else if (record_type == "CC") {
        print field1 "|" field5 "|" field7
    }
}
Create an awk script file called program.awk, put that code into it, and execute the script using:
awk -f program.awk < my_multi_line_file.txt
Maybe you can use two passes:
1step.awk
/^AA/{printf "2 6 6 12" }
/^BB/{printf "2 6 6 6 18 6"}
/^CC/{printf "2 8" }
{printf "\n%s\n", $0}
2step.awk
NR%2 == 1 {FIELDWIDTHS=$0}
NR%2 == 0 {print $2}
And then
awk -f 1step.awk sample | awk -f 2step.awk
You probably need to suppress (or at least ignore) awk's built-in field separation code, and use a program along the lines of:
awk '/^AA/ { manually process record AA out of $0 }
/^BB/ { manually process record BB out of $0 }
/^CC/ { manually process record CC out of $0 }' file ...
The manual processing will be a bit fiddly - I suppose you'll need to use the substr function to extract each field by position, so what I've got as one line per record type will be more like one line per field in each record type, plus the follow-on printing.
I do think you might be better off with Perl and its unpack feature, but awk can handle it too, albeit verbosely.
Could you use Perl and then select an unpack template based on the first two chars of the line?
Better to use a fully featured scripting language like Perl or Ruby.
What about two scripts? E.g., the first script inserts field separators based on the first characters, and the second then processes the result.
Or, first of all, define a function in your awk script that splits the lines into variables based on the input. I would go this way, for the sake of re-use (see the sketch below).
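A minimal sketch of that last suggestion, assuming the field widths implied by the sample records above (the width strings and the fields printed are illustrative choices, not from the original answers):
# Cut "line" into the array "out" according to a space-separated list of widths.
function split_fixed(line, widths, out,    n, i, pos, w) {
    n = split(widths, w, " ")
    pos = 1
    for (i = 1; i <= n; i++) {
        out[i] = substr(line, pos, w[i])
        pos += w[i]
    }
    return n
}
/^AA/ { split_fixed($0, "2 6 6 12", f);     print f[2] }   # Field1
/^BB/ { split_fixed($0, "2 6 6 6 18 6", f); print f[5] }   # VeryVeryLongField7
/^CC/ { split_fixed($0, "2 7", f);          print f[2] }   # Field99
Saved as, say, fixed.awk, it would run as awk -f fixed.awk my_multi_line_file.txt and print the chosen field of each record on its own line.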
