How to pad CSV file missing columns - linux

I have a problem with some CSV files coming from a piece of software, which I want to import into PostgreSQL (with COPY ... FROM ... CSV). The problem is that the last columns are sometimes missing, like this (letters for headers, numbers for values, _ for the TAB delimiter):
a_b_c_d
1_2_3_4
5_6_7 <- last column missing
8_9_0_1
2_6_7 <- last column missing
The result of COPY in_my_table FROM 'file.csv' is:
ERROR: missing data for column "d"
A sample of a correct file for import:
a_b_c_d
1_2_3_4
5_6_7_ <- null column but not missing
8_9_0_1
2_6_7_ <- null column but not missing
My question: is there a bash / Linux shell command to add the missing TAB delimiters, producing a correct / complete / padded CSV file with all columns?
Thanks for your help.

OK, so in fact I found this:
awk -F'\t' -v OFS='\t' 'NF=50' input.csv > output.csv
where 50 is the number of columns (the number of TABs plus one). Assigning to NF makes awk rebuild the record, so missing trailing fields are created empty and joined with OFS.
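For the four-column sample above, that would be:
awk -F'\t' -v OFS='\t' 'NF=4' input.csv > output.csv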

I don't know much about Linux, but this could easily be done in PostgreSQL with a simple command like:
copy tableName from '/filepath/name.csv' delimiter '_' csv WITH NULL AS 'null';
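With the question's actual TAB delimiter that would be something like the sketch below; note, though, that COPY still requires every row to carry all the columns, so short rows keep failing with the error shown above until the file is padded:
COPY in_my_table FROM '/filepath/name.csv' WITH (FORMAT csv, DELIMITER E'\t', NULL '');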

You can use a combination of sed and regular expressions:
sed -r 's/^[0-9](_[0-9]){2}$/\0_/g' file.csv
You only need to replace _ with your actual delimiter (\t). Note that this pattern assumes single-digit fields, as in the sample.
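A more general sketch of the same idea, assuming tab-delimited data with exactly four columns (so any line containing only two tabs gets one appended):
sed -r 's/^([^\t]*\t){2}[^\t]*$/&\t/' file.csv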

Awk is good for this.
awk -F"\t" '{ # Tell awk we are working with tabs
if ($4 =="") # If the last field is empty
print $0"\t" # print the whole line with a tab
else
print $0 # Otherwise just print the line
}' your.csv > your.fixed.csv

Perl has a CSV module, which might be handy to fix even more complicated CSV errors. On my Ubuntu test system it is part of the package libtext-csv-perl.
This fixes your problem:
#! /usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new ({ binary => 1, eol => $/, sep_char => '_' });

open my $broken, '<', 'broken.csv' or die "cannot open broken.csv: $!";
open my $fixed,  '>', 'fixed.csv'  or die "cannot open fixed.csv: $!";

while (my $row = $csv->getline ($broken)) {
    $#{$row} = 3;               # force exactly 4 fields, padding short rows with undef
    $csv->print ($fixed, $row);
}
Change sep_char to "\t" if you have a tab-delimited file, and keep in mind that Perl treats "\t" and '\t' differently.
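For example, the tab-delimited constructor would read:
my $csv = Text::CSV->new ({ binary => 1, eol => $/, sep_char => "\t" });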

Related

Reformat data using awk

I have a dataset that contains rows of UUIDs followed by locations and transaction IDs. The UUIDs are separated by a semi-colon (';') and the transactions are separated by tabs, like the following:
01234;LOC_1=ABC LOC_1=BCD LOC_2=CDE
56789;LOC_2=DEF LOC_3=EFG
I know all of the location codes in advance. What I want to do is transform this data into a format I can load into SQL/Postgres for analysis, like this:
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG
I'm pretty sure I can do this easily using awk (or similar) by looking up location IDs from a file (e.g. LOC_1), matching any instance of the location ID, and printing it out next to the UUID. I haven't been able to get it right yet, and any help is much appreciated!
My locations file is named location and my dataset is data. Note that I can edit the original file or write the results to a new file, either is fine.
awk without using split: use semicolon or tab as the field separator
awk -F'[;\t]' -v OFS=';' '{for (i=2; i<=NF; i++) print $1,$i}' file
I don't think you need to match against a known list of locations; you should be able to just print each line as you go:
$ awk '{print $1; split($1,a,";"); for (i=2; i<=NF; ++i) print a[1] ";" $i}' file
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG
Your comment about knowing the locations, and the mapping file, make me suspect that what your example does isn't exactly what is being asked. But it seems like you want to reformat each set of tab-delimited LOC= values into rows with their UUID in front.
If so, this will do the trick:
awk ' BEGIN {OFS=FS=";"} {split($2,locs,"\t"); for (n in locs) { print $1,locs[n]}}'
Given:
$ cat -A data.txt
01234;LOC_1=ABC^ILOC_1=BCD^ILOC_2=CDE$
56789;LOC_2=DEF^ILOC_3=EFG$
Then:
$ awk ' BEGIN {OFS=FS=";"} {split($2,locs,"\t"); for (n in locs) { print $1,locs[n]}}' data.txt
01234;LOC_1=ABC
01234;LOC_1=BCD
01234;LOC_2=CDE
56789;LOC_2=DEF
56789;LOC_3=EFG
The BEGIN {OFS=FS=";"} block sets the input and output delimiter to ;.
For each row, we then split the second field into an array named locs, splitting on tab, via - split($2,locs,"\t")
And then loop through locs printing the UUID and each loc value - for (n in locs) { print $1,locs[n]}
Here is an approach without a loop and without split (assuming the Input_file matches the samples shown):
awk 'BEGIN{FS=OFS=";"}{gsub(/[[:space:]]+/,"\n"$1 OFS)} 1' Input_file
It replaces each run of whitespace (the tabs) with a newline followed by the UUID from $1 and the ; separator.
This might work for you (GNU sed):
sed -r 's/((.*;)\S+)\s+(\S+)/\1\n\2\3/;P;D' file
Repeatedly replace the white space between locations with a newline, followed by the UUID and a ;, printing/deleting each line as it appears.

Remove CRLF character from a single data element in CSV file

Hi, I'm editing my question here; the requirement has changed slightly. The CSV file has only LF line endings to begin with. However, the file can also have LFs inside elements enclosed in double quotes. We want to retain the LFs within double quotes and replace the LF at the end of each line with CRLF.
So if my source file looks like this:
Date,Open,High,Low,Close,comments,Remark
5-Dec-16,8088.75,8141.9,8056.85,8128.75,"TEST1 <LF>
TEST2 <LF>
with NL",remark<LF>
6-Dec-16,8153.15,8178.7,8130.85,8143.15,AAAA,remark<LF>
7-Dec-16,8168.4,8190.45,8077.5,8102.05,BBBB,remark<LF>
8-Dec-16,8152.1,8256.25,8151.75,8246.85,"TEST1<LF>
TEST with NL",remark<LF>
9-Dec-16,8271.7,8274.95,8241.95,8261.75,CCCC,remark<LF>
then the desired output should be:
Date,Open,High,Low,Close,comments,Remark
5-Dec-16,8088.75,8141.9,8056.85,8128.75,"TEST1 <LF>
TEST2 <LF>
with NL",remark<CRLF>
6-Dec-16,8153.15,8178.7,8130.85,8143.15,AAAA,remark<CRLF>
7-Dec-16,8168.4,8190.45,8077.5,8102.05,BBBB,remark<CRLF>
8-Dec-16,8152.1,8256.25,8151.75,8246.85,"TEST1<LF>
TEST2 with NL",remark<CRLF>
9-Dec-16,8271.7,8274.95,8241.95,8261.75,CCCC,remark<CRLF>
Appreciate your help.
Thanks,
Chandan
Best to use a proper CSV parser that can handle newlines in quoted fields. Perl has one:
perl -MText::CSV -e '
    $csv = Text::CSV->new({ binary => 1 });
    while ($row = $csv->getline(\*STDIN)) {
        $row = [map { s/\n+/ /g; $_ } @$row];  # squash embedded newlines to spaces
        $csv->say(\*STDOUT, $row);
    }
' < file.csv
or Ruby:
ruby -rcsv -e '
    CSV.parse(readlines.join).each { |row|
        puts CSV.generate_line(row.collect { |elem| elem.to_s.gsub(/\n+/, " ") })
    }
' file
Chances are you're looking for:
awk -v RS='\r\n' '{gsub(/[\r\n]+/," ")}1' file
but without details on where the \rs and \ns appear in your input, that's just a guess. The above uses GNU awk for its multi-char RS. Besides replacing chains of carriage returns and/or linefeeds inside every field with blanks, it converts your line endings from \r\n (Windows style) to plain \n (UNIX style), making it easier to do anything else with them from that point onwards.
See also What's the most robust way to efficiently parse CSV using awk? for how to handle CSVs in general using awk.
A little state machine in awk: it uses a double quote as the field separator and acts on the number of fields. A line with an unterminated quote contains an odd number of quote characters, hence an even NF, so it is buffered until the record is complete:
awk -F '"' '
    partial     {$0 = partial OFS $0; partial = ""}
    NF % 2 == 0 {partial = $0; next}
                {print}
' file
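For example, a minimal illustration with a quoted field spanning two lines (sample input made up for demonstration):
$ printf 'a,"b\nc",d\n' | awk -F'"' 'partial{$0=partial OFS $0; partial=""} NF%2==0{partial=$0; next} {print}'
a,"b c",d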

Remove carriage returns from CSV data value

I am importing data from a pipe-delimited CSV into MySQL using a LOAD DATA INFILE statement, terminating lines with '\r\n'. My problem is that some of the data within the rows also contains '\r\n', causing the load to fail. I have similar files that just use '\n' within data to indicate line breaks, and those cause no issues.
Example GOOD CSV
School|City|State|Country\r
Harvard University|Cambridge|MA|USA\r
Princeton University|Princeton|New
Jersey
|USA\r
Example BAD CSV
School|City|State|Country\r
Harvard University|Cambridge|MA|USA\r
Princeton University|Princeton|New\r
Jersey\r
|USA\r
Is there a way to pre-process the CSV, using sed, awk, or perl, to clean up the extra carriage return in the column values?
This is one possible solution in Perl. It reads a line and, if it has fewer than 4 fields, keeps reading and merging the following lines until it does have 4 fields. Just change the value of $number_of_fields to the right number.
#!/usr/bin/perl
use strict;
use warnings;

my $number_of_fields = 4;

while (<STDIN>) {
    s/[\r\n]//g;                  # strip CR and LF
    my @fields = split(/\|/);
    next if $#fields == -1;       # skip empty lines
    while ($#fields < $number_of_fields - 1) {
        my $nextline = <STDIN> // last;    # stop cleanly at EOF
        $nextline =~ s/[\r\n]//g;
        my @tmpfields = split(/\|/, $nextline);
        next if $#tmpfields == -1;
        $fields[$#fields] .= "\n" . $tmpfields[0];  # the break belongs inside the last field
        shift @tmpfields;
        push @fields, @tmpfields;
    }
    print join("|", @fields), "\r\n";
}
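Assuming the script is saved as fix_fields.pl (a hypothetical name), it would be run as:
perl fix_fields.pl < bad.csv > good.csv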
With GNU awk for multi-char RS and RT:
$ awk -v RS='([^|]+[|]){3}[^|]+\r\n' -v ORS= '{$0=RT; gsub(/\r/,""); sub(/\n$/,"\r\n")} 1' file | cat -v
School|City|State|Country^M
Harvard University|Cambridge|MA|USA^M
Princeton University|Princeton|New
Jersey
|USA^M
Note that it assumes the number of fields is 4 so if you have some other number of fields then change 3 to that number minus 1. The script COULD instead calculate the number of fields by reading the first line of your input if that first line cannot have your problem:
$ awk '
BEGIN { RS="\r\n"; ORS=""; FS="|" }
FNR==1 { RS="([^|]+[|]){"NF-1"}[^|]+\r\n"; RT=$0 RT }
{ $0=RT; gsub(/\r/,""); sub(/\n$/,"\r\n"); print }
' file | cat -v
School|City|State|Country^M
Harvard University|Cambridge|MA|USA^M
Princeton University|Princeton|New
Jersey
|USA^M

remove lines from text file that contain specific text

I'm trying to remove lines that contain 0/0 or ./. in column 71 ("FORMAT.1.GT") from a tab-delimited text file.
I've tried the following code but it doesn't work. What is the correct way of accomplishing this? Thank you
my $cmd6 = `fgrep -v "0/0" | fgrep -v "./." $Variantlinestsvfile > $MDLtsvfile`; print "$cmd6";
You can call a one-liner, as Borodin and zdim showed. Which one is right for you is still not clear, because you don't say whether the 71st column means the 71st tab-separated field of a line or the 71st character of it. Consider:
12345\t6789
Now what is the 2nd column? Is it the character 2 or the field 6789? Borodin's answer assumes it's 6789, while zdim's assumes it's 2. Both showed a solution for their respective case, but those are stand-alone solutions: programs of their own, to be run from the command line.
If you want to integrate that into your Perl script you could do it like this:
Replace this line:
my $cmd6 = `fgrep -v "0/0" | fgrep -v "./." $Variantlinestsvfile > $MDLtsvfile`; print "$cmd6";
with this snippet:
open( my $fh_in,  '<', $Variantlinestsvfile ) or die "cannot open $Variantlinestsvfile: $!\n";
open( my $fh_out, '>', $MDLtsvfile )          or die "cannot open $MDLtsvfile: $!\n";

while ( my $line = <$fh_in> ) {
    # character-based:
    print $fh_out $line unless substr($line, 70, 3) =~ m{(?:0/0|\./\.)};

    # tab/field-based:
    my @fields = split(/\t/, $line);
    print $fh_out $line unless $fields[70] =~ m|([0.])/\1|;
}

close($fh_in);
close($fh_out);
Use either the character-based line or the tab/field-based lines. Not both!
Borodin and zdim condensed this snippet into one-liners, but those are meant to be run from the shell, not called from within a Perl script.
Since you need the exact position and know the string lengths, substr can find it:
perl -ne 'print if not substr($_, 70, 3) =~ m{(?:0/0|\./\.)}' filename
This prints a line only when the three-character string starting at the 71st column matches neither 0/0 nor ./.
The {} delimiters around the regex allow us to use / and | inside without escaping. The ?: is there so that the () are used only for grouping, and not capturing. It will work fine also without ?: which is there only for efficiency's sake.
perl -ane 'print unless $F[70] =~ m|([0.])/\1|' myfile > newfile
The problem with your command is that you are attempting to capture the output of a command which produces no output - all the matches are redirected to a file, so that's where all the output is going.
Anyway, calling grep from Perl is just wacky. Reading the file in Perl itself is the way to go.
If you do want a single shell command,
grep -Ev $'^([^\t]*\t){70}(\./\.|0/0)\t' file
would do what you are asking more precisely and elegantly. But you can use that regex straight off in your Perl program just as well.
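For instance, the same regex moved into a Perl one-liner (using the m{} delimiters mentioned earlier, so the slashes need no escaping):
perl -ne 'print unless m{^(?:[^\t]*\t){70}(?:\./\.|0/0)\t}' file > newfile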
Try it!
awk -F'\t' '{ if ($71 != "./." && $71 != "0/0") print }' old_file.txt > new_file.txt

Replace characters in specific columns only (CSV)

I have data like this:
1;2015-04-10;23:10:00;10.4.2015 23:10;8.9;1007.5;0.3;0.0;0;55
2;2015-04-10;23:20:00;10.4.2015 23:20;8.6;1007.8;0.4;0.0;0;56
3;2015-04-10;23:30:00;10.4.2015 23:30;8.5;1008.1;0.4;0.0;0;57
It has dot . as decimal separator but I need to use , instead.
Desired data:
1;2015-04-10;23:10:00;10.4.2015 23:10;8,9;1007,5;0,3;0,0;0;55
I tried using sed: with sed -i 's/\./,/g' myfile.csv I could replace all the dots with commas, but that would also destroy the dates in the fourth column. How can I change dots to commas everywhere else while leaving the fourth column as it is? If some other Linux tool is better for this task than sed, I could use it as well.
sed is for simple substitutions; for anything else just use awk. Starting the loop at field 5 leaves the first four fields, including the date, untouched:
$ awk 'BEGIN{FS=OFS=";"} {for (i=5;i<=NF;i++) sub(/\./,",",$i)} 1' file
1;2015-04-10;23:10:00;10.4.2015 23:10;8,9;1007,5;0,3;0,0;0;55
2;2015-04-10;23:20:00;10.4.2015 23:20;8,6;1007,8;0,4;0,0;0;56
3;2015-04-10;23:30:00;10.4.2015 23:30;8,5;1008,1;0,4;0,0;0;57
Perl and Text::CSV:
#! /usr/bin/perl
use warnings;
use strict;
use Text::CSV;

my $csv = 'Text::CSV'->new({ binary      => 1,
                             sep_char    => ';',
                             quote_space => 0,
                           }) or die 'Text::CSV'->error_diag;

open my $FH, '<:encoding(utf8)', 'input.csv' or die $!;
$csv->eol("\n");

while (my $row = $csv->getline($FH)) {
    s/\./,/g for @$row[ 0 .. 2, 4 .. $#$row ];  # skip index 3, the date column
    $csv->print(*STDOUT, $row);
}
You could go with:
awk 'BEGIN {FS=OFS=";"} {if (NF>=5) gsub(/\./,",",$5)} 1' filename
Here I have used gsub instead of sub; the difference is that sub will replace only the first occurrence, whereas gsub will replace all occurrences.
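A quick throwaway illustration of the difference:
$ echo "1.2.3" | awk '{sub(/\./, ","); print}'
1,2.3
$ echo "1.2.3" | awk '{gsub(/\./, ","); print}'
1,2,3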
This changes dots to commas in the second whitespace-separated column; since the only space on each line sits inside the date field, $2 is everything after that space, so the date's own dots stay untouched:
awk '{gsub(/\./,",",$2)}1' file
1;2015-04-10;23:10:00;10.4.2015 23:10;8,9;1007,5;0,3;0,0;0;55
2;2015-04-10;23:20:00;10.4.2015 23:20;8,6;1007,8;0,4;0,0;0;56
3;2015-04-10;23:30:00;10.4.2015 23:30;8,5;1008,1;0,4;0,0;0;57
