Cleaning Data Text File - text

I have a large text file. The following is an example of the data:
3;N;X;01;A00;A00.-;A00;A00;Cholera;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.0;A00.0;A000;Cholera due to Vibrio cholerae 01, biovar cholerae;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.1;A00.1;A001;Cholera due to Vibrio cholerae 01, biovar eltor;001;4-002;3-003;2-001;1-002
4;T;X;01;A00;A00.9;A00.9;A009;Cholera, unspecified;001;4-002;3-003;2-001;1-002
3;N;X;01;A00;A01.-;A01;A01;Typhoid and paratyphoid fevers;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.0;A01.0;A010;Typhoid fever;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.1;A01.1;A011;Paratyphoid fever A;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.2;A01.2;A012;Paratyphoid fever B;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.3;A01.3;A013;Paratyphoid fever C;002;4-002;3-003;2-003;1-004
4;T;X;01;A00;A01.4;A01.4;A014;Paratyphoid fever, unspecified;002;4-002;3-003;2-003;1-004
I am only after the middle words. For example, in row 1 that would be "Cholera".
I don't really come from a programming background but could potentially use SAS or Excel.
Any help is really appreciated

If you just need to extract this middle field, then you can do it like this:
awk -F';' '{print $9}' file
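If you want to keep the result rather than just print it to the terminal, redirect it to a new file (the name diseases.txt is only an example):
awk -F';' '{print $9}' file > diseases.txt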

extracting specific lines containing text and numbers from log file using awk

I haven't used my linux skills in a while and I'm struggling with extracting certain lines out of a csv log file.
The file is structured as:
code,client_id,local_timestamp,operation_code,error_code,etc
I want to extract only those lines of the file with a specific code and a client_id greater than 0.
For example, if I have the lines:
message_received,1,134,20,0,xxx
message_ack,0,135,10,1,xxx
message_received,0,140,20,1,xxx
message_sent,1,150,30,0,xxx
I only want to extract those lines having code message_received and a client_id greater than 0, resulting in just the first line:
message_received,1,134,20,0,xxx
I want to use awk somewhat like this:
awk '/message_received,[[:digit:]]>0/' mylog.csv
which I know isn't quite correct, but how do I achieve this in a one-liner?
This is probably what you want:
awk -F, '($1=="message_received") && ($2>0)' mylog.csv
If not, edit your question to clarify.
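If the real log starts with a header line such as the one shown in the question (code,client_id,...), a variant that keeps it would be:
awk -F, 'NR==1 || ($1=="message_received" && $2>0)' mylog.csv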

Filtering CSV File using AWK

I'm working on a CSV file.
This is my CSV file.
The command I used for filtering: awk -F"," '{print $14}' out_file.csv > test1.csv
This is an example of what my data looks like; I have around 43 columns and 12,000 rows.
I planned to separate a single column using an awk command, but I am not able to separate column 3 (disease) alone.
I used the following command to get my output:
awk -F"," '{print $3}' out_file.csv > test1.csv
This is my file:
gender|gene_name |disease |1000g_oct2014|Polyphen |SNAP
male |RB1,GTF2A1L|cancer,diabetes |0.1 |0.46 |0.1
male |NONE,LOC441|diabetes |0.003 |0.52 |0.6
male |TBC1D1 |diabetes |0.940 |1 |0.9
male |BCOR |cancer |0 |0.31 |0.2
male |TP53 |diabetes |0 |0.54 |0.4
note "|" i did not use this a delimiter. it for show the row in an order my details looks exactly like this in the spreed sheet:
But i'm getting the output following way
Disease
GTF2A1L
LOC441
TBC1D1
BCOR
TP53
When I open the file in a spreadsheet I get the results in the proper manner, but when I use awk, the commas inside column 2 are also treated as separators. I don't know why.
Can anyone help me with this?
The root of your problem is that you have comma-separated values with embedded commas.
That makes life more difficult. I would suggest using a CSV parser.
I quite like Perl and Text::CSV:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

# Open the input file for reading.
open ( my $data, '<', 'data_file.csv' ) or die $!;

# Configure the parser: binary-safe, comma-separated, newline-terminated records.
my $csv = Text::CSV -> new ( { binary => 1, sep_char => ',', eol => "\n" } );

# getline() returns one parsed record at a time as an array reference.
while ( my $row = $csv -> getline ( $data ) ) {
    print $row -> [2], "\n";    # third field (indexes start at 0)
}
close $data;
Of course, I can't tell for sure if that actually works, because the data you've linked on your Google Drive doesn't actually match the question you've asked. (Note: Perl arrays start at zero, so [2] is actually the 3rd field.)
But it should do the trick - Text::CSV handles quoted comma fields nicely.
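For completeness, if GNU awk is available, its FPAT variable can also handle quoted comma fields; a rough sketch, assuming the embedded commas sit inside double quotes (which is also what Text::CSV expects) and that field 3 is the one wanted:
gawk -v FPAT='([^,]+)|("[^"]+")' '{print $3}' data_file.csv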
Unfortunately the link you provided ("This is my file") points to two files, neither of which (at the time of this writing) seems to correspond with the sample you gave. However, if your file really is a CSV file with commas used both for separating fields and embedded within fields, then the advice given elsewhere to use a CSV-aware tool is very sound. (I would recommend considering a command-line program that can convert CSV to TSV so the entire *nix tool chain remains at your disposal.)
Your sample output and attendant comments suggest you may already have a way to convert it to a pipe-delimited or tab-delimited file. If so, then awk can be used quite effectively. (If you have a choice, then I'd suggest tabs, since then programs such as cut are especially easy to use.)
The general idea, then, is to use awk with "|" (or tab) as the primary separator (awk -F"|" or awk -F\\t), and to use awk's split function to parse the contents of each top-level field.
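For example, a minimal sketch along those lines, assuming the file has already been converted to pipe-delimited form and that the disease values sit in field 3 (the field number and file name here are only illustrative):
awk -F'|' '{n = split($3, parts, ","); for (i = 1; i <= n; i++) print parts[i]}' converted_file
This would print each comma-separated value inside the disease field on its own line.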
At last, this is what I did to get my answer in a simple way. Thanks to @peak, I found the solution.
First I used csvfilter, which is a Python module for filtering CSV files.
I changed my delimiter using csvfilter with the following command:
csvfilter input_file.csv --out-delimiter="|" > out_file.csv
This command changes the delimiter ',' into '|'.
Then I used the awk command to sort and filter:
awk -F"|" 'FNR == 1 {print} {if ($14 < 0.01) print }' out_file.csv > filtered_file.csv
Thanks for your help.

Take the last word from a line and add it at the beginning using BASH

We have a requirement where the contents of our text files are like this:
[some-section-1]
big_msg_line1 var=random_value1
big_msg_line2 var=random_value2
big_msg_line3 var=random_value3
[some-section-2]
"lots of irrelevant data"
[some-section-3]
"lots of irrelevant data"
[some-section-4]
big_msg_line4 var=random_value4
big_msg_line5 var=random_value5
big_msg_line6 var=random_value6
big_msg_line7 var=random_value7
big_msg_line8 var=random_value8
[some-section-5]
"lots of irrelevant data"
All the lines that we want to modify start with common characters; in this example, all the lines we would like to modify start with the word "big". We would like to change them to something like this:
[some-section-1]
random_value1 msg=big_msg_line1
random_value2 msg=big_msg_line2
random_value3 msg=big_msg_line3
[some-section-2]
"lots of irrelevant data"
[some-section-3]
"lots of irrelevant data"
[some-section-4]
random_value4 msg=big_msg_line4
random_value5 msg=big_msg_line5
random_value6 msg=big_msg_line6
random_value7 msg=big_msg_line7
random_value8 msg=big_msg_line8
[some-section-5]
"lots of irrelevant data"
These are examples only. The original file contains far more data than this, in hundreds if not thousands of lines.
I am currently doing this using a for loop: reading each line, cutting the values, formatting them the way I want, putting them in a separate file, and then replacing the original file with the new one. Is there a way to achieve this using a one-liner? That would really be of great help. I hope my question is clear.
Thanks in advance.
From what I understood, this awk one-liner would do the job:
cat a
[some-section-1]
big_msg_line1 var=random_value1
big_msg_line2 var=random_value2
big_msg_line3 var=random_value3
[some-section-2]
lots of irrelevant data
[some-section-3]
lots of irrelevant data
[some-section-4]
big_msg_line4 var=random_value4
big_msg_line5 var=random_value5
big_msg_line6 var=random_value6
big_msg_line7 var=random_value7
big_msg_line8 var=random_value8
[some-section-5]
lots of irrelevant data
This:
awk '{FS="var="; if ($1~/big/) { print $2"\tmsg="$1} else {print }}' a
gives:
[some-section-1]
random_value1 msg=big_msg_line1
random_value2 msg=big_msg_line2
random_value3 msg=big_msg_line3
[some-section-2]
lots of irrelevant data
[some-section-3]
lots of irrelevant data
[some-section-4]
random_value4 msg=big_msg_line4
random_value5 msg=big_msg_line5
random_value6 msg=big_msg_line6
random_value7 msg=big_msg_line7
random_value8 msg=big_msg_line8
[some-section-5]
lots of irrelevant data
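For reference, the same rearrangement can also be written with the separator set up front via -F, which avoids the trailing space that the version above leaves on $1 (a sketch, only checked against this sample):
awk -F' var=' '/^big/ {print $2" msg="$1; next} {print}' a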
This command should do the job:
sed -e 's/\(big[^ ]*\)\([ ]*\)var=\([^ ]*\)/\3\2msg=\1/' [your file] > [output file]
EDIT: You might need to change the slashes (/) to a character which is not used in your file.
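For example, using # as the sed delimiter instead (assuming # does not appear in your data):
sed -e 's#\(big[^ ]*\)\([ ]*\)var=\([^ ]*\)#\3\2msg=\1#' [your file] > [output file]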

How do we build a Normalized table from a DeNormalized text file?

Thanks for your replies/time.
We need to build a normalized DB table from a denormalized text file. We explored a couple of options such as Unix shell and PostgreSQL. I am looking to learn better ideas for a resolution from this community.
The input text file has comma-delimited records of various lengths. The content may look like this:
XXXXXXXXXX , YYYYYYYYYY, TTTTTTTTTTT, UUUUUUUUUU, RRRRRRRRR,JJJJJJJJJ
111111111111, 22222222222, 333333333333, 44444444, 5555555, 666666
EEEEEEEE,WWWWWW,QQQQQQQ,PPPPPPPP
We would like to normalize it as follows (split & pair):
XXXXXXXXXX , YYYYYYYYYY
TTTTTTTTTTT, UUUUUUUUUU
RRRRRRRRR,JJJJJJJJJ
111111111111, 22222222222
333333333333, 44444444
5555555, 666666
EEEEEEEE,WWWWWW
QQQQQQQ,PPPPPPPP
Do we need to go with a pre-process-and-load approach?
If yes, what is the best way to pre-process?
Is there any single SQL/function approach to achieve the above?
Thanks for helping.
Using GNU awk (due to the regex RS):
awk '{$1=$1} NR%2==1 {printf "%s,",$0} NR%2==0' RS="[,\n]" file
XXXXXXXXXX,YYYYYYYYYY
TTTTTTTTTTT,UUUUUUUUUU
RRRRRRRRR,JJJJJJJJJ
111111111111,22222222222
333333333333,44444444
5555555,666666
EEEEEEEE,WWWWWW
QQQQQQQ,PPPPPPPP
{$1=$1} cleans up and removes extra spaces
NR%2==1 {printf "%s,",$0} prints the odd-numbered parts
NR%2==0 prints the even-numbered parts followed by a newline
RS="[,\n]" sets the record separator to , or newline
Here is an update; this is what I did on the Linux server.
sed -i 's/\,,//g' inputfile <------ clean up a lot of extra commas
awk '{$1=$1} NR%2==1 {printf "%s,",$0} NR%2==0' RS="[,\n]" inputfile <---- Jotne's idea
dos2unix -q -n inputfile outputfile <------ to remove ^M in some records
outputfile is now ready to process in comma-delimited format.
Any thoughts on improving the above steps further?
Thanks for helping.
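One possible refinement (untested against the real data): a single pipeline that strips carriage returns, collapses runs of commas into a single comma, and then applies the same awk step, so the input file is never modified in place. This assumes the stray commas and carriage returns are the only cleanup needed:
tr -d '\r' < inputfile | sed 's/,,*/,/g' | awk -v RS='[,\n]' '{$1=$1} NR%2==1 {printf "%s,",$0} NR%2==0' > outputfile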

Sort Tab-Delimited File with Variable Words and Spaces

I have a file with lines that all begin with a date, followed by a tab, followed by a random number of words and spaces—some of which include numbers. For example:
20140217 iPhone Upgrade Available
20131101 Job Application Due
20131219 Renew or return all library books
20131114 Pay cell phone bill
I'm trying to sort this file by the date string and only the date string.
As per this thread, I've tried all kinds of combinations of sort -t$'\t' and -k1, but I keep getting garbled results.
Any help would be much appreciated. Also, it IS possible for me to replace that tab with a space or another character, if that would help for any reason.
You may want to try:
sort -n -k1,1 file
The output is:
20131101 Job Application Due
20131114 Pay cell phone bill
20131219 Renew or return all library books
20140217 iPhone Upgrade Available
You can use it like this:
sort -k1,1 file
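Since the question specifically mentions a tab separator, an equivalent form that makes the separator explicit and sorts the first field numerically would be:
sort -t$'\t' -k1,1n file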
