How do we build a normalized table from a denormalized text file? (Linux)

Thanks for your replies and time.
We need to build a normalized DB table from a denormalized text file. We explored a couple of options, such as a Unix shell script and PostgreSQL. I am looking to learn better approaches from this community.
The input text file contains variable-length, comma-delimited records. The content may look like this:
XXXXXXXXXX , YYYYYYYYYY, TTTTTTTTTTT, UUUUUUUUUU, RRRRRRRRR,JJJJJJJJJ
111111111111, 22222222222, 333333333333, 44444444, 5555555, 666666
EEEEEEEE,WWWWWW,QQQQQQQ,PPPPPPPP
We would like to normalize it as follows (split and pair):
XXXXXXXXXX , YYYYYYYYYY
TTTTTTTTTTT, UUUUUUUUUU
RRRRRRRRR,JJJJJJJJJ
111111111111, 22222222222
333333333333, 44444444
5555555, 666666
EEEEEEEE,WWWWWW
QQQQQQQ,PPPPPPPP
Do we need to go with a pre-process-and-load approach?
If yes, what is the best way to pre-process?
Is there a single SQL/function approach to achieve the above?
Thanks for helping.

Using GNU awk (needed for the regex RS):
awk '{$1=$1} NR%2==1 {printf "%s,",$0} NR%2==0' RS="[,\n]" file
XXXXXXXXXX,YYYYYYYYYY
TTTTTTTTTTT,UUUUUUUUUU
RRRRRRRRR,JJJJJJJJJ
111111111111,22222222222
333333333333,44444444
5555555,666666
EEEEEEEE,WWWWWW
QQQQQQQ,PPPPPPPP
{$1=$1} cleans up and removes extra spaces
NR%2==1 {printf "%s,",$0} prints the odd-numbered values, each followed by a comma
NR%2==0 prints the even-numbered values, each followed by a newline
RS="[,\n]" sets the record separator to , or newline
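If GNU awk is not available (a regex RS is a gawk extension), a similar split-and-pair can be sketched with standard tools, assuming the values themselves never contain spaces or commas that must be preserved:
tr -d ' \t' < file | tr ',' '\n' | paste -d, - -
# the first tr strips blanks around the values, the second puts one value per line,
# and paste -d, - - joins every two consecutive lines with a comma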

Here is an update; this is what I did on the Linux server.
sed -i 's/\,,//g' inputfile <------ clean up the many repeated commas
awk '{$1=$1} NR%2==1 {printf "%s,",$0} NR%2==0' RS="[,\n]" inputfile <------ Jotne's idea
dos2unix -q -n inputfile outputfile <------ to remove the ^M (carriage returns) in some records
The output file is then ready to process in comma-delimited format.
Any thoughts on improving the above steps further?
Thanks for helping.
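For what it is worth, those three steps could probably be collapsed into a single pipeline. This is only a sketch, assuming GNU awk, that runs of commas should collapse into a single separator, and that outputfile is just a placeholder name:
tr -d '\r' < inputfile \
  | awk 'BEGIN { RS = "[,\n]+" }             # gawk: runs of commas/newlines all separate values
         { gsub(/^[ \t]+|[ \t]+$/, "") }     # trim blanks around each value
         NR % 2 == 1 { printf "%s,", $0 }    # odd values open a pair
         NR % 2 == 0 { print }               # even values close it
        ' > outputfile
The paired file could then be bulk-loaded into PostgreSQL, for example with psql's \copy (the table pairs and database mydb below are placeholders):
psql -d mydb -c "\copy pairs FROM 'outputfile' WITH (FORMAT csv)"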

Related

Split single record into Multiple records in Unix shell Script

I have a record.
Example:
EMP_ID|EMP_NAME|AGE|SALARAy
123456|XXXXXXXXX|30|10000000
Is there a way I can split the record into multiple records? The example output should look like:
EMP_ID|Attributes
123456|XXXXXXX
123456|30
123456|10000000
I want to split the same record into multiple records. Here the employee ID is my unique column, and I want to loop over the remaining 3 columns and create 3 records: EMP_ID|EMP_NAME, EMP_ID|AGE, EMP_ID|SALARY. I may have some more columns as well, but as a sample I have provided 3 columns along with the employee ID.
Please help me with any suggestions.
With bash:
record='123456|XXXXXXXXX|30|10000000'
IFS='|' read -ra fields <<<"$record"
for ((i=1; i < "${#fields[@]}"; i++)); do
  printf "%s|%s\n" "${fields[0]}" "${fields[i]}"
done
123456|XXXXXXXXX
123456|30
123456|10000000
For the whole file:
{
  IFS= read -r header    # read and discard the EMP_ID|EMP_NAME|... header line
  while IFS='|' read -ra fields; do
    for ((i=1; i < "${#fields[@]}"; i++)); do
      printf "%s|%s\n" "${fields[0]}" "${fields[i]}"
    done
  done
} < filename
Records whose fields are separated by a special delimiter character such as | can be manipulated with basic Unix command-line tools such as awk. For example, with your input records in the file records.txt:
awk -F\| 'NR>1{for(i=2;i<=NF;i++){print $1"|"$(i)}}' records.txt
I recommend reading an awk tutorial and playing around with it. Related command-line tools worth learning include grep, sort, wc, uniq, head, tail, and cut. If you regularly process delimiter-separated files, you will likely need them on a daily basis. As soon as your data format gets more complex (e.g. CSV with the possibility of the delimiter character also appearing inside field values), you need more specialized tools; for instance, see this question on CSV tools or jq for processing JSON. Still, knowledge of the basic Unix command-line tools will save you a lot of time.
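If you also want the EMP_ID|Attributes header line from your desired output, a small variation of the same one-liner (just a sketch) could be:
awk -F'|' 'NR==1 { print "EMP_ID|Attributes"; next }    # replace the original header line
           { for (i = 2; i <= NF; i++) print $1 "|" $i }' records.txt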

extracting specific lines containing text and numbers from log file using awk

I haven't used my Linux skills in a while and I'm struggling to extract certain lines from a CSV log file.
The file is structured as:
code,client_id,local_timestamp,operation_code,error_code,etc
I want to extract only those lines of the file that have a specific code and a client_id greater than 0.
For example, if I have the lines:
message_received,1,134,20,0,xxx
message_ack,0,135,10,1,xxx
message_received,0,140,20,1,xxx
message_sent,1,150,30,0,xxx
I only want to extract the lines with code message_received and client_id > 0, resulting in just the first line:
message_received,1,134,20,0,xxx
I want to use awk somewhat like:
awk '/message_received,[[:digit:]]>0/' mylog.csv
which I know isn't quite correct, but how do I achieve this in a one-liner?
This is probably what you want:
awk -F, '($1=="message_received") && ($2>0)' mylog.csv
If not, edit your question to clarify.
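If the file begins with the code,client_id,... header row shown above and you want to keep it in the output, a possible variant (just a sketch) is:
awk -F, 'NR==1 || ($1=="message_received" && $2>0)' mylog.csv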

Filtering CSV File using AWK

I'm working on a CSV file.
This is my CSV file.
The command used for filtering: awk -F"," '{print $14}' out_file.csv > test1.csv
This is an example of what my data looks like; I have around 43 rows and 12,000 columns.
I planned to extract a single column using an awk command, but I am not able to extract column 3 (disease) alone.
I used the following command to get my output:
awk -F"," '{print $3}' out_file.csv > test1.csv
This is my file:
gender|gene_name |disease |1000g_oct2014|Polyphen |SNAP
male |RB1,GTF2A1L|cancer,diabetes |0.1 |0.46 |0.1
male |NONE,LOC441|diabetes |0.003 |0.52 |0.6
male |TBC1D1 |diabetes |0.940 |1 |0.9
male |BCOR |cancer |0 |0.31 |0.2
male |TP53 |diabetes |0 |0.54 |0.4
Note: "|" is not the delimiter I used; it is only there to show the columns in order. My data looks exactly like this in the spreadsheet.
But I am getting the output in the following way:
Disease
GTF2A1L
LOC441
TBC1D1
BCOR
TP53
When I open the file in a spreadsheet I get the results in the proper manner, but when I use awk, the "," inside column 2 is also treated as a separator. I don't know why.
Can anyone help me with this?
The root of your problem is that you have comma-separated values with embedded commas.
That makes life more difficult. I would suggest using a CSV parser.
I quite like Perl and Text::CSV:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

open( my $data, '<', 'data_file.csv' ) or die $!;
my $csv = Text::CSV->new( { binary => 1, sep_char => ',', eol => "\n" } );
while ( my $row = $csv->getline($data) ) {
    print $row->[2], "\n";
}
Of course, I can't tell for sure whether that actually works, because the data you've linked on your Google Drive doesn't actually match the question you've asked. (Note: Perl arrays start at zero, so [2] is actually the 3rd field.)
But it should do the trick - Text::CSV handles quoted comma fields nicely.
Unfortunately the link you provided ("This is my file") points to two files, neither of which (at the time of this writing) seems to correspond with the sample you gave. However, if your file really is a CSV file with commas used both for separating fields and embedded within fields, then the advice given elsewhere to use a CSV-aware tool is very sound. (I would recommend considering a command-line program that can convert CSV to TSV so the entire *nix tool chain remains at your disposal.)
Your sample output and attendant comments suggest you may already have a way to convert it to a pipe-delimited or tab-delimited file. If so, then awk can be used quite effectively. (If you have a choice, then I'd suggest tabs, since then programs such as cut are especially easy to use.)
The general idea, then, is to use awk with "|" (or tab) as the primary separator (awk -F"|" or awk -F\\t), and to use awk's split function to parse the contents of each top-level field.
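For instance, a sketch of that split idea, assuming the data has already been converted to a pipe-delimited file genes.psv and that you want the individual gene names packed into column 2:
awk -F'|' 'NR > 1 { n = split($2, genes, ",")            # break the gene_name field on its commas
                    for (i = 1; i <= n; i++) print genes[i] }' genes.psv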
In the end, this is what I did to get my answer in a simple way; thanks to @peak I found the solution.
First I used csvfilter, a Python module for filtering CSV files.
I changed the delimiter using csvfilter with the following command:
csvfilter input_file.csv --out-delimiter="|" > out_file.csv
This command changes the delimiter ',' into '|'.
Now I used awk to sort and filter:
awk -F"|" 'FNR == 1 {print} {if ($14 < 0.01) print }' out_file.csv > filtered_file.csv
Thanks for your help.

Split ordered file in Linux

I have a large delimited file (with pipe '|' as the delimiter) which I have managed to sort (using Linux sort) by the first column (numeric), second column (numeric) and fourth column (string ordering, since it is a timestamp value). The file looks like this:
77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399
77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735
78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I was wondering if there is an easy way to split this file into multiple text files with an awk, sed, grep or perl one-liner whenever the first column or the second column value changes. The final result for the example file should be 3 text files like this:
First file:
77|141|243848|2014-01-10 20:06:15.722|2.5|1389391203399
77|141|243849|2014-01-10 20:06:18.222|2.695|1389391203399
77|141|243850|2014-01-10 20:06:20.917|3.083|1389391203399
Second file:
77|171|28563|2014-01-10 07:08:56|2.941|1389344702735
77|171|28564|2014-01-10 07:08:58.941|4.556|1389344702735
77|171|28565|2014-01-10 07:09:03.497|5.671|1389344702735
Third file:
78|115|28565|2014-01-10 07:09:03.497|5.671|1389344702735
I could do that in Java of course, but I think it would be kind of overkill if it can be done with a script. Also, is it possible for the filenames created to use those two column values, something like 77_141.txt for the first file, 77_171.txt for the second file and 78_115.txt for the third one?
awk is very handy for this kind of problem. This can be an approach:
awk -F"|" '{print >> ($1"_"$2".txt")}' file
Explanation
-F"|" sets the field separator to |.
{print >> file} appends each line to the output file file.
$1"_"$2".txt" builds that file name from the first and second fields (based on the | separator), that is 77, 78... for $1 and 141, 171... for $2, giving names such as 77_141.txt.
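Since the input is already sorted on the first two columns, a small variation (just a sketch) closes each output file as soon as its group is finished, which avoids hitting the open-file limit when there are very many distinct column pairs:
awk -F'|' '{ out = $1 "_" $2 ".txt" }
           out != prev { if (prev != "") close(prev); prev = out }   # done with the previous group's file
           { print >> out }' file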

Splitting A File On Delimiter

I have a file on a Linux system that is roughly 10GB. It contains 20,000,000 binary records, but each record is separated by an ASCII delimiter "$". I would like to use the split command or some combination thereof to chunk the file into smaller parts. Ideally I would be able to specify that the command should split every 1,000 records (therefore every 1,000 delimiters) into separate files. Can anyone help with this?
The only unorthodox part of the problem seems to be the record separator. I'm sure this is fixable in awk pretty simply - but I happen to hate awk.
I would transfer it into the realm of 'normal' problems first:
tr '$' '\n' < large_records.txt | split -l 1000
This will by default create xaa, xab, xac... files; look at man split for more options.
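If you want more control over the output names, GNU split also takes a prefix and a suffix-length option; a sketch (chunk_ is just an example prefix):
tr '$' '\n' < large_records.txt | split -l 1000 -a 4 - chunk_
# -a 4 allows up to 26^4 output files named chunk_aaaa, chunk_aaab, ...
# a chunk can be converted back to '$'-separated form with, e.g.: tr '\n' '$' < chunk_aaaa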
I love awk :)
BEGIN { RS="$"; chunk=1; count=0; size=1000 }
{
    print $0 > ("/tmp/chunk" chunk)   # parentheses keep the concatenation together
    if (++count >= size) {
        chunk++
        count = 0
    }
}
(Note that the redirection operator in awk only truncates/creates the file on its first use; subsequent writes to the same name are appended, unlike shell redirection.)
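A possible way to run it, assuming the program above is saved as chunker.awk (a hypothetical file name):
awk -f chunker.awk large_records.txt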
Note that by default Unix split uses a suffix length of 2 and will exhaust its suffixes once that limit is reached. More info at: https://www.gnu.org/software/coreutils/manual/html_node/split-invocation.html
