Remove CRLF character from a single data element in CSV file - string

Hi im editing my question here, the requirement has slightly changed wherein the CSV file has only LF to begin with . However the CSV file could also have LF between the element within double quotes. We want to retain the LF's within double quotes and replace the LF at the end of the line with CRLF.
so if my source file looks like this :
enter code here
Date,Open,High,Low,Close,comments,Remark
5-Dec-16,8088.75,8141.9,8056.85,8128.75,"TEST1 <LF>
TEST2 <LF>
with NL",remark<LF>
6-Dec-16,8153.15,8178.7,8130.85,8143.15,AAAA,remark<LF>
7-Dec-16,8168.4,8190.45,8077.5,8102.05,BBBB,remark<LF>
8-Dec-16,8152.1,8256.25,8151.75,8246.85,"TEST1<LF>
TEST with NL",remark<LF>
9-Dec-16,8271.7,8274.95,8241.95,8261.75,CCCC,remark<LF>
Date,Open,High,Low,Close,comments,Remark
5-Dec-16,8088.75,8141.9,8056.85,8128.75,"TEST1 <LF>
TEST2 <LF>
with NL",remark<CRLF>
6-Dec-16,8153.15,8178.7,8130.85,8143.15,AAAA,remark<CRLF>
7-Dec-16,8168.4,8190.45,8077.5,8102.05,BBBB,remark<CRLF>
8-Dec-16,8152.1,8256.25,8151.75,8246.85,"TEST1<LF>
TEST2 with NL",remark<CRLF>
9-Dec-16,8271.7,8274.95,8241.95,8261.75,CCCC,remark<CRLF>
enter code here
Appreciate your help.
Thanks,
Chandan

Best to use a proper CSV parser that can handle newlines in quoted fields. Perl has one:
perl -MText::CSV -e '
$csv = Text::CSV->new({ binary => 1 });
while ($row = $csv->getline(STDIN)) {
$row = [map {s/\n+/ /g; $_} #$row];
$csv->say(STDOUT, $row)
}
' < file.csv
or ruby
ruby -rcsv -e '
CSV.parse( readlines.join "" ).each {|row|
puts CSV.generate_line( row.collect {|elem| elem.gsub /\n+/, " "} )
}
' file

Chances are you're looking for:
awk -v RS='\r\n' '{gsub(/[\r\n]+/," ")}1' file
but without details on where the \rs and \ns appear in your input that's just a guess. The above uses GNU awk for multi-char RS and in addition to replacing chains of carriage returns and/or linefeeds from inside every field with blanks will convert your newlines from \r\n (Windows style) to just \n (UNIX style) to make it easier to do anything else with them from that point onwards.
See also What's the most robust way to efficiently parse CSV using awk? for how to handle CSVs in general using awk.

A little state machine in awk: uses a double quote as the field separator, and acts upon the number of fields:
awk -F '"' '
partial {$0 = partial OFS $0; partial = ""}
NF % 2 == 0 {partial = $0; next}
{print}
' file

Related

Extract a string between double quotes from the 6th line of a file in Unix and assign it to variable

Newbie to unix/shell/bash. I have a file name CellSite whose 6th line is as below:
btsName = "RV74XC038",
I want to extract the string from 6th line that is between double quotes (i.e.RV74XC038) and save it to a variable. Please note that the 6th line starts with 4 blank spaces. And this string would vary from file. So I am looking for a solution that would extract a string from 6th line between the double quotes.
I tried below. But does not work.
str2 = sed '6{ s/^btsName = \([^ ]*\) *$/\1/;q } ;d' CellSite;
Any help is much appreciated. TIA.
sed is a stream editor.
For just parsing files, you want to look into awk. Something like this:
awk -F \" '/btsName/ { print $2 }' CellSite
Where:
-F defines a "field separator", in your case the quotation marks "
the entire script consists of:
/btsName/ act only on lines that contain the regex "btsName"
from that line print out the second field; the first field will be everything before the first quotation marks, second field will be everything from the first quotes to the second quotes, third field will be everything after the second quotes
parse through the file named "CellSite"
There are possibly better alternatives, but you would have to show the rest of your file.
Using sed
$ str2=$(sed '6s/[^"]*"\([^"]*\).*/\1/' CellSite)
$ echo "$str2"
RV74XC038
You can use the following awk solution:
btsName=$(awk -F\" 'NR==6{print $2; exit}' CellSite)
Basically, get to the sixth line (NR==6), print the second field value (" is used to split records (lines) into fields) and then exit.
See the online demo:
#!/bin/bash
CellSite='Line 1
Line 2
Line 3
btsName = "NO74NO038",
Line 5
btsName = "RV74XC038","
Line 7
btsName = "no11no000",
'
btsName=$(awk -F\" 'NR==6{print $2; exit}' <<< "$CellSite")
echo "$btsName" # => RV74XC038
This might work for you (GNU sed):
var=$(sed -En '6s/.*"(.*)".*/\1/p;6q' file)
Simplify regexs and turn off implicit printing.
Focus on the 6th line only and print the value between double quotes, then quit.
Bash interpolates the sed invocation by means of the $(...) and the value extracted defines the variable var.

Remove carriage returns from CSV data value

I am importing data from a pipe-delimited CSV to MySQL using a LOAD DATA INFILE statement. I am terminating lines by using '\r\n'. My problem is that some of the data within each row has '\r\n' in it, causing the load to error. I have similar files that just use '\n' within data to indicate linebreaks, and that causes no issues.
Example GOOD CSV
School|City|State|Country\r
Harvard University|Cambridge|MA|USA\r
Princeton University|Princeton|New
Jersey
|USA\r
Example BAD CSV
School|City|State|Country\r
Harvard University|Cambridge|MA|USA\r
Princeton University|Princeton|New\r
Jersey\r
|USA\r
Is there a way to pre-process the CSV, using sed, awk, or perl, to clean up the extra carriage return in the column values?
This is one possible solution in perl. It reads in a line and if there are less than 4 fields, it keeps reading in the next line and merging it until it does have 4 fields. Just change the value of $number_of_fields to the right number.
#!/usr/bin/perl
use strict;
use warnings;
my $number_of_fields=4;
while(<STDIN>)
{
s/[\r\n]//g;
my #fields=split(/\|/);
next if($#fields==-1);
while($#fields<$number_of_fields-1)
{
my $nextline=<STDIN> || last;
$nextline =~ s/[\r\n]//g;
my #tmpfields=split(/\|/,$nextline);
next if($#tmpfields==-1);
$fields[$#fields] .= "\n".$tmpfields[0];
shift #tmpfields;
push #fields,#tmpfields;
}
print join("|",#fields),"\r\n";
}
With GNU awk for multi-char RS and RT:
$ awk -v RS='([^|]+[|]){3}[^|]+\r\n' -v ORS= '{$0=RT; gsub(/\r/,""); sub(/\n$/,"\r\n")} 1' file | cat -v
School|City|State|Country^M
Harvard University|Cambridge|MA|USA^M
Princeton University|Princeton|New
Jersey
|USA^M
Note that it assumes the number of fields is 4 so if you have some other number of fields then change 3 to that number minus 1. The script COULD instead calculate the number of fields by reading the first line of your input if that first line cannot have your problem:
$ awk '
BEGIN { RS="\r\n"; ORS=""; FS="|" }
FNR==1 { RS="([^|]+[|]){"NF-1"}[^|]+\r\n"; RT=$0 RT }
{ $0=RT; gsub(/\r/,""); sub(/\n$/,"\r\n"); print }
' file | cat -v
School|City|State|Country^M
Harvard University|Cambridge|MA|USA^M
Princeton University|Princeton|New
Jersey
|USA^M

How to pad CSV file missing columns

I have a problem with some CSV files comming from a soft and that I want to use to make PostgreSQL import (function COPY FROM CSV). The problem is that some last columns are missing like this (letter for headers, number for values, _ for the TAB delimiter):
a_b_c_d
1_2_3_4
5_6_7 <- last column missing
8_9_0_1
2_6_7 <- last column missing
COPY in_my_table FROM file.csv result is :
ERROR: missing data for column "d"
Sample of a correct file for import :
a_b_c_d
1_2_3_4
5_6_7_ <- null column but not missing
8_9_0_1
2_6_7_ <- null column but not missing
My question : is there some commands in bash / linux shell to add the TAB delimiter to make a correct / comlete / padded csv file with all columns.
Thanks for help.
Ok, so in fact I found this:
awk -F'\t' -v OFS='\t' 'NF=50' input.csv > output.csv
where 50 is the number of TAB + 1.
Don't knew much about linux but this could be easily done in postgresql via simple command like
copy tableName from '/filepath/name.csv' delimiter '_' csv WITH NULL AS 'null';
You can use a combination of sed and regular expressions:
sed -r 's/^[0-9](_[0-9]){2}$/\0_/g' file.csv
You only need to replace _ by your delimiter (\t).
Awk is good for this.
awk -F"\t" '{ # Tell awk we are working with tabs
if ($4 =="") # If the last field is empty
print $0"\t" # print the whole line with a tab
else
print $0 # Otherwise just print the line
}' your.csv > your.fixed.csv
Perl has a CSV module, which might be handy to fix even more complicated CSV errors. On my Ubuntu test system it is part of the package libtext-csv-perl.
This fixes your problem:
#! /usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new ({ binary => 1, eol => $/, sep_char => '_' });
open my $broken, '<', 'broken.csv';
open my $fixed, '>', 'fixed.csv';
while (my $row = $csv->getline ($broken)) {
$#{$row} = 3;
$csv->print ($fixed, $row);
}
Change sep_char to "\t", if you have a tabulator delimited file and keep in mind that Perl treats "\t" and '\t' differently.

Remove \r\n in awk

I have a simple awk command that converts a date from MM/DD/YYYY to YYYY/MM/DD. However, the file I'm using has \r\n at the end of the lines, and sometimes the date is at the end of the line.
awk '
BEGIN { FS = OFS = "|" }
{
split($27, date, /\//)
$27 = date[3] "/" date[1] "/" date[2]
print $0
}
' file.txt
In this case, if the date is MM/DD/YYYY\r\n then I end up with this in the output:
YYYY
/MM/DD
What is the best way to get around this? Keep in mind, sometimes the input is simply \r\n in which case the output SHOULD be // but instead ends up as
/
/
Given that the \r isn't always at the end of field $27, the simplest approach is to remove the \r from the entire line.
With GNU Awk or Mawk (one of which is typically the default awk on Linux platforms), you can simply define your input record separator, RS, accordingly:
awk -v RS='\r\n' ...
Or, if you want \r\n-terminated output lines too, set the output record separator, ORS, to the same value:
awk 'BEGIN { RS=ORS="\r\n"; ...
Optional reading: an aside for BSD/macOS Awk users:
BSD/macOS awk doesn't support multi-character RS values (in line with the POSIX Awk spec: "If RS contains more than one character, the results are unspecified").
Therefore, a sub call inside the Awk script is necessary to trim the \r instance from the end of each input line:
awk '{ sub("\r$", ""); ...
To also output \r\n-terminated lines, option -v ORS='\r\n' (or ORS="\r\n" inside the script's BEGIN block) will work fine, as with GNU Awk and Mawk.
If you're on a system where \n by itself is the newline, you should remove the \r from the record. You could do it like:
$ awk '{sub(/\r/,"",$NF); ...}'

bash Changing every other comma to point

I am working with set of data which is written in Swedish format. comma is used instead of point for decimal numbers in Sweden.
My data set is like this:
1,188,1,250,0,757,0,946,8,960
1,257,1,300,0,802,1,002,9,485
1,328,1,350,0,846,1,058,10,021
1,381,1,400,0,880,1,100,10,418
Which I want to change every other comma to point and have output like this:
1.188,1.250,0.757,0.946,8.960
1.257,1.300,0.802,1.002,9.485
1.328,1.350,0.846,1.058,10.021
1.381,1.400,0.880,1.100,10.418
Any idea of how to do that with simple shell scripting. It is fine If I do it in multiple steps. I mean if I change first the first instance of comma and then the third instance and ...
Thank you very much for your help.
Using sed
sed 's/,\([^,]*\(,\|$\)\)/.\1/g' file
1.188,1.250,0.757,0.946,8.960
1.257,1.300,0.802,1.002,9.485
1.328,1.350,0.846,1.058,10.021
1.381,1.400,0.880,1.100,10.418
For reference, here is a possible way to achieve the conversion using awk:
awk -F, '{for(i=1;i<=NF;i=i+2) {printf $i "." $(i+1); if(i<NF-2) printf FS }; printf "\n" }' file
The for loop iterates every 2 fields separated by a comma (set by the option -F,) and prints the current element and the next one separated by a dot.
The comma separator represented by FS is printed except at the end of line.
As a Perl one-liner, using split and array manipulation:
perl -F, -e '#a = #b = (); while (#b = splice #F, 0, 2) {
push #a, join ".", #b} print join ",", #a' file
Output:
1.188,1.250,0.757,0.946,8.960
1.257,1.300,0.802,1.002,9.485
1.328,1.350,0.846,1.058,10.021
1.381,1.400,0.880,1.100,10.418
Many sed dialects allow you to specify which instance of a pattern to replace by specifying a numeric option to s///.
sed -e 's/,/./9' -e 's/,/./7' -e 's/,/./5' -e 's/,/./3' -e 's/,/./'
ISTR some sed dialects would allow you to simplify this to
sed 's/,/./1,2'
but this is not supported on my Debian.
Demo: http://ideone.com/6s2lAl

Resources