Remove carriage returns from CSV data values - Linux

I am importing data from a pipe-delimited CSV into MySQL using a LOAD DATA INFILE statement, terminating lines with '\r\n'. My problem is that some of the data within each row has '\r\n' in it, causing the load to error. I have similar files that just use '\n' within data to indicate line breaks, and those cause no issues.
Example GOOD CSV
School|City|State|Country\r
Harvard University|Cambridge|MA|USA\r
Princeton University|Princeton|New
Jersey
|USA\r
Example BAD CSV
School|City|State|Country\r
Harvard University|Cambridge|MA|USA\r
Princeton University|Princeton|New\r
Jersey\r
|USA\r
Is there a way to pre-process the CSV, using sed, awk, or perl, to clean up the extra carriage return in the column values?

This is one possible solution in Perl. It reads in a line and, if there are fewer than 4 fields, keeps reading and merging the next line until it does have 4 fields. Just change the value of $number_of_fields to the right number.
#!/usr/bin/perl
use strict;
use warnings;

my $number_of_fields = 4;
while (<STDIN>) {
    s/[\r\n]//g;                     # strip CR and LF
    my @fields = split(/\|/);
    next if ($#fields == -1);        # skip empty lines
    while ($#fields < $number_of_fields - 1) {
        # Record is incomplete: pull in the next physical line and merge it
        my $nextline = <STDIN> || last;
        $nextline =~ s/[\r\n]//g;
        my @tmpfields = split(/\|/, $nextline);
        next if ($#tmpfields == -1);
        $fields[$#fields] .= "\n" . $tmpfields[0];   # re-join the broken field with a plain \n
        shift @tmpfields;
        push @fields, @tmpfields;
    }
    print join("|", @fields), "\r\n";
}
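To run it, assuming the script above is saved as fix_crlf.pl (a file name chosen here purely for illustration), feed the CSV on standard input and redirect standard output:
perl fix_crlf.pl < bad.csv > clean.csv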

With GNU awk for multi-char RS and RT:
$ awk -v RS='([^|]+[|]){3}[^|]+\r\n' -v ORS= '{$0=RT; gsub(/\r/,""); sub(/\n$/,"\r\n")} 1' file | cat -v
School|City|State|Country^M
Harvard University|Cambridge|MA|USA^M
Princeton University|Princeton|New
Jersey
|USA^M
Note that it assumes the number of fields is 4, so if you have some other number of fields, change the 3 to that number minus 1. The script could instead calculate the number of fields by reading the first line of your input, provided that first line cannot have your problem:
$ awk '
BEGIN { RS="\r\n"; ORS=""; FS="|" }
FNR==1 { RS="([^|]+[|]){"NF-1"}[^|]+\r\n"; RT=$0 RT }
{ $0=RT; gsub(/\r/,""); sub(/\n$/,"\r\n"); print }
' file | cat -v
School|City|State|Country^M
Harvard University|Cambridge|MA|USA^M
Princeton University|Princeton|New
Jersey
|USA^M

Related

Remove CRLF character from a single data element in CSV file

Hi, I'm editing my question here as the requirement has changed slightly: the CSV file has only LF to begin with, but it can also have LF between elements within double quotes. We want to retain the LFs within double quotes and replace the LF at the end of each line with CRLF.
So if my source file looks like this:
Date,Open,High,Low,Close,comments,Remark
5-Dec-16,8088.75,8141.9,8056.85,8128.75,"TEST1 <LF>
TEST2 <LF>
with NL",remark<LF>
6-Dec-16,8153.15,8178.7,8130.85,8143.15,AAAA,remark<LF>
7-Dec-16,8168.4,8190.45,8077.5,8102.05,BBBB,remark<LF>
8-Dec-16,8152.1,8256.25,8151.75,8246.85,"TEST1<LF>
TEST with NL",remark<LF>
9-Dec-16,8271.7,8274.95,8241.95,8261.75,CCCC,remark<LF>
The desired output keeps the LFs inside the quotes and ends each record with CRLF:
Date,Open,High,Low,Close,comments,Remark
5-Dec-16,8088.75,8141.9,8056.85,8128.75,"TEST1 <LF>
TEST2 <LF>
with NL",remark<CRLF>
6-Dec-16,8153.15,8178.7,8130.85,8143.15,AAAA,remark<CRLF>
7-Dec-16,8168.4,8190.45,8077.5,8102.05,BBBB,remark<CRLF>
8-Dec-16,8152.1,8256.25,8151.75,8246.85,"TEST1<LF>
TEST2 with NL",remark<CRLF>
9-Dec-16,8271.7,8274.95,8241.95,8261.75,CCCC,remark<CRLF>
Appreciate your help.
Thanks,
Chandan
Best to use a proper CSV parser that can handle newlines in quoted fields. Perl has one:
perl -MText::CSV -e '
    $csv = Text::CSV->new({ binary => 1 });
    while ($row = $csv->getline(STDIN)) {
        $row = [map { s/\n+/ /g; $_ } @$row];
        $csv->say(STDOUT, $row);
    }
' < file.csv
or ruby
ruby -rcsv -e '
    CSV.parse( readlines.join "" ).each { |row|
        puts CSV.generate_line( row.collect { |elem| elem.gsub /\n+/, " " } )
    }
' file
Chances are you're looking for:
awk -v RS='\r\n' '{gsub(/[\r\n]+/," ")}1' file
but without details on where the \rs and \ns appear in your input, that's just a guess. The above uses GNU awk for multi-char RS; in addition to replacing runs of carriage returns and/or linefeeds inside every field with blanks, it converts your line endings from \r\n (Windows style) to just \n (UNIX style), which makes it easier to do anything else with the file from that point onwards.
See also What's the most robust way to efficiently parse CSV using awk? for how to handle CSVs in general using awk.
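As a tiny illustration of that guess (my own made-up sample, GNU awk assumed; the quoted field contains a bare \n while records end in \r\n):
$ printf 'a,b\r\nc,"d\ne",f\r\n' | awk -v RS='\r\n' '{gsub(/[\r\n]+/," ")}1'
a,b
c,"d e",f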
A little state machine in awk: it uses the double quote as the field separator and acts on the number of fields. With FS set to a double quote, an even NF means an odd number of quotes so far, i.e. a quoted field is still open, so the line is saved and joined with the next one:
awk -F '"' '
partial {$0 = partial OFS $0; partial = ""}
NF % 2 == 0 {partial = $0; next}
{print}
' file
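If the goal from the question is specifically to keep the LFs inside quoted fields and only terminate complete records with CRLF, the same quote-counting idea can be adapted. Here is a minimal sketch of my own (not from the original answer), assuming LF-only input and no stray unpaired quotes; it buffers lines while the number of double quotes seen so far is odd:
awk '
{ buf = (buf == "" ? $0 : buf "\n" $0); n += gsub(/"/, "&") }
n % 2 == 0 { printf "%s\r\n", buf; buf = ""; n = 0 }
' file.csv > fixed.csv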

How to split a delimited text file in Linux, based on the number of records, when the end-of-record separator appears in data fields

Problem Statement:
I have a delimited text file offloaded from Teradata which happens to have "\n" (newline characters or EOL markers) inside data fields.
The same EOL marker also terminates each complete line, i.e. each record.
I need to split this file into two or more files (based on a number of records given by me) while retaining the newline characters inside data fields, splitting only at the line breaks that end each record.
Example:
1|Alan
Wake|15
2|Nathan
Drake|10
3|Gordon
Freeman|11
Expectation :
file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
file2.txt
3|Gordon
Freeman|11
What I have tried:
awk 'BEGIN{RS="\n"}NR%2==1{x="SplitF"++i;}{print > x}' inputfile.txt
The code can't distinguish between newlines inside data fields and the actual record-ending newlines. Is there a way this can be achieved?
EDIT: I have changed the problem statement and the example. Please share your thoughts on the new example.
Use the following awk approach:
awk '{ r=(r!="")?r RS $0 : $0; if(NR%4==0){ print r > "file"++i".txt"; r="" } }
END{ if(r) print r > "file"++i".txt" }' inputfile.txt
NR%4==0 - each logical record occupies two physical lines and we want two records per file, so we split after every 4 physical lines (a parameterized variant is sketched after the results below)
Results:
> cat file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
> cat file2.txt
3|Gordon
Freeman|11
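A parameterized variant of the same idea (my own sketch, not part of the original answer), where lines is the number of physical lines per logical record and recs is how many logical records you want per output file:
awk -v lines=2 -v recs=2 '
{ r = (r != "") ? r RS $0 : $0 }
NR % (lines * recs) == 0 { out = "file" (++i) ".txt"; print r > out; close(out); r = "" }
END { if (r != "") print r > ("file" (++i) ".txt") }
' inputfile.txt
With the sample input above and lines=2, recs=2 this produces the same file1.txt and file2.txt.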
If you are using GNU awk you can do this by setting RS appropriately, e.g.:
parse.awk
BEGIN { RS="[0-9]\\|" }
# Skip the empty first record by checking NF (Note: this will also skip
# any empty records later in the input)
NF {
# Send record with the appropriate key to a numbered file
printf("%s", d $0) > "file" i ".txt"
}
# When we found enough records, close current file and
# prepare i for opening the next one
#
# Note: NR-1 because of the empty first record
(NR-1)%n == 0 {
close("file" i ".txt")
i++
}
# Remember the record key in d, again,
# becuase of the empty first record
{ d=RT }
Run it like this:
gawk -f parse.awk n=2 infile
Where n is the number of records to put into each file.
Output:
file1.txt
1|Alan
Wake|15
2|Nathan
Drake|10
file2.txt
3|Gordon
Freeman|11

How to pad missing columns in a CSV file

I have a problem with some CSV files coming from a piece of software, which I want to use for a PostgreSQL import (the COPY ... FROM ... CSV command). The problem is that the last columns are sometimes missing, like this (letters for headers, numbers for values, _ for the TAB delimiter):
a_b_c_d
1_2_3_4
5_6_7 <- last column missing
8_9_0_1
2_6_7 <- last column missing
The result of COPY in_my_table FROM file.csv is:
ERROR: missing data for column "d"
Sample of a correct file for import:
a_b_c_d
1_2_3_4
5_6_7_ <- null column but not missing
8_9_0_1
2_6_7_ <- null column but not missing
My question: is there a command in bash / Linux shell to add the TAB delimiters and make a correct / complete / padded CSV file with all columns?
Thanks for the help.
OK, so in fact I found this:
awk -F'\t' -v OFS='\t' 'NF=50' input.csv > output.csv
where 50 is the number of TABs plus 1, i.e. the number of columns.
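If you would rather not hard-code the count, a variant of the same trick (my own sketch, using GNU awk and assuming the header line always has all of its columns) takes the column count from the first line:
awk -F'\t' -v OFS='\t' 'NR == 1 { cols = NF } { NF = cols; print }' input.csv > output.csv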
I don't know much about Linux, but this could easily be done in PostgreSQL via a simple command like:
copy tableName from '/filepath/name.csv' delimiter '_' csv WITH NULL AS 'null';
You can use a combination of sed and regular expressions:
sed -r 's/^[0-9](_[0-9]){2}$/\0_/g' file.csv
You only need to replace _ by your delimiter (\t).
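For a tab delimiter that could look like this (a sketch assuming GNU sed and, as in the pattern above, single-digit fields; & stands for the whole match):
sed -r 's/^[0-9](\t[0-9]){2}$/&\t/' file.csv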
Awk is good for this.
awk -F"\t" '{ # Tell awk we are working with tabs
if ($4 =="") # If the last field is empty
print $0"\t" # print the whole line with a tab
else
print $0 # Otherwise just print the line
}' your.csv > your.fixed.csv
Perl has a CSV module, which might be handy to fix even more complicated CSV errors. On my Ubuntu test system it is part of the package libtext-csv-perl.
This fixes your problem:
#! /usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new ({ binary => 1, eol => $/, sep_char => '_' });

open my $broken, '<', 'broken.csv';
open my $fixed, '>', 'fixed.csv';

while (my $row = $csv->getline ($broken)) {
    $#{$row} = 3;               # force exactly 4 fields; missing ones become undef (empty on output)
    $csv->print ($fixed, $row);
}
Change sep_char to "\t" if you have a tab-delimited file, and keep in mind that Perl treats "\t" and '\t' differently (only the double-quoted form is a tab character).

Get list of all duplicates based on first column within large text/csv file in linux/ubuntu

I am trying to extract all the duplicates based on the first column/index of my very large text/csv file (7+ GB / 100+ Million lines). Format is like so:
foo0:bar0
foo1:bar1
foo2:bar2
The first column is any lowercase UTF-8 string and the second column is any UTF-8 string. I have been able to sort my file based on the first column, and only the first column, with:
sort -t':' -k1,1 filename.txt > output_sorted.txt
I have also been able to drop all duplicates with:
sort -t':' -u -k1,1 filename.txt > output_uniq_sorted.txt
These operations take 4-8 min.
I am now trying to extract all duplicates based on the first column and only the first column, to make sure all entries in the second columns are matching.
I think I can achieve this with awk with this code:
BEGIN { FS = ":" }
{
    count[$1]++;
    if (count[$1] == 1) {
        first[$1] = $0;
    }
    if (count[$1] == 2) {
        print first[$1];
    }
    if (count[$1] > 1) {
        print $0;
    }
}
running it with:
awk -f awk.dups input_sorted.txt > output_dup.txt
Now the problem is that this takes way too long: 3+ hours and not yet done. I know uniq can get all duplicates with something like:
uniq -D sorted_file.txt > output_dup.txt
The problem is specifying the delimiter and using only the first column. I know uniq has -f N to skip the first N fields. Is there a way to get these results without having to change/process my data? Is there another tool that could accomplish this? I have already used Python + pandas with read_csv to get the duplicates, but this leads to errors (segmentation fault), and it is not efficient since I shouldn't have to load all the data into memory when the data is already sorted. I have decent hardware:
i7-4700HQ
16GB ram
256GB ssd samsung 850 pro
Anything that can help is welcome,
Thanks.
SOLUTION FROM BELOW
Using:
awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c'
with the command time I get the following performance.
real 0m46.058s
user 0m40.352s
sys 0m2.984s
If your file is already sorted, you don't need to store more than one line; try this:
$ awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c' sorted.input
If you try this please post the timings...
I have changed the awk script slightly because I couldn't fully understand what was happening in the above answer.
awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c>=1{if(c==1){print p0;} print $0}' sorted.input > duplicate.entries
I have tested this and it produces the same output as the above, but it might be easier to understand.
{if(p!=$1){p=$1; c=0; p0=$0} else c++}
If the first token on the line is not the same as the previous one, we save it in p, set c to 0, and save the whole line in p0. If it is the same, we increment c.
c>=1{if(c==1){print p0;} print $0}
In the case of a repeat, we check whether it is the first repeat. If it is, we print the saved line and the current line; if not, we just print the current line.
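To illustrate with a tiny sorted sample of my own (not from the original post), the one-liner prints every group whose first field repeats and skips the unique b line:
$ printf 'a:1\na:2\nb:3\nc:4\nc:5\nc:6\n' | awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c'
a:1
a:2
c:4
c:5
c:6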

Bash: How to keep lines in a file that have fields that match lines in another file?

I have two big files with a lot of text, and what I have to do is keep all lines in file A that have a field that matches a field in file B.
file A is something like:
Name (tab) # (tab) # (tab) KEYFIELD (tab) Other fields
For file B, I managed to use cut and sed and other things to basically get it down to one field per line, i.e. a list.
So the goal is to keep all lines in file A whose 4th field (the one marked KEYFIELD) matches one of the lines in file B. (It does NOT have to be an exact match, so if file B had Blah and file A said Blah_blah, it'd be OK.)
I tried to do:
grep -f fileBcutdown fileA > outputfile
EDIT: OK, I give up. I just force-killed it.
Is there a better way to do this? File A is 13.7MB and file B, after cutting it down, is 32.6MB, for anyone who cares.
EDIT: This is an example line in file A:
chr21 33025905 33031813 ENST00000449339.1 0 - 33031813 33031813 0 3 1835,294,104, 0,4341,5804,
example line from file B cut down:
ENST00000111111
Here's one way using GNU awk. Run like:
awk -f script.awk fileB.txt fileA.txt
Contents of script.awk:
FNR==NR {
    array[$0]++
    next
}

{
    line = $4
    sub(/\.[0-9]+$/, "", line)
    if (line in array) {
        print
    }
}
Alternatively, here's the one-liner:
awk 'FNR==NR { array[$0]++; next } { line = $4; sub(/\.[0-9]+$/, "", line); if (line in array) print }' fileB.txt fileA.txt
GNU awk can also perform the pre-processing of fileB.txt that you described using cut and sed. If you would like me to build this into the above script, you will need to provide an example of what this line looks like.
UPDATE using files HumanGenCodeV12 and GenBasicV12:
Run like:
awk -f script.awk HumanGenCodeV12 GenBasicV12 > output.txt
Contents of script.awk:
FNR==NR {
    gsub(/[^[:alnum:]]/,"",$12)
    array[$12]++
    next
}

{
    line = $4
    sub(/\.[0-9]+$/, "", line)
    if (line in array) {
        print
    }
}
This successfully prints lines in GenBasicV12 that can be found in HumanGenCodeV12. The output file (output.txt) contains 65340 lines. The script takes less than 10 seconds to complete.
You're hitting the limit of using the basic shell tools. Assuming about 40 characters per line, File A has 400,000 lines in it and File B has about 1,200,000 lines in it. You're basically running grep for each line in File A and having grep plow through 1,200,000 lines with each execution. That's 480 BILLION line comparisons you're doing. Unix tools are surprisingly quick, but even something fast done 480 billion times will add up.
You would be better off using a full programming scripting language like Perl or Python. You put all lines in File B in a hash. You take each line in File A, check to see if that fourth field matches something in the hash.
Reading in a few hundred thousand lines? Creating a 10,000,000 entry hash? Perl can parse both of those in a matter of minutes.
Something off the top of my head. You didn't give us much in the way of specs, so I didn't do any testing:
#! /usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);

# Create your index
open my $file_b, "<", "file_b.txt";
my %index;
while (my $line = <$file_b>) {
    chomp $line;
    $index{$line} = $line;    # Or however you do it...
}
close $file_b;

#
# Now check against file_a.txt
#
open my $file_a, "<", "file_a.txt";
while (my $line = <$file_a>) {
    chomp $line;
    my @fields = split /\s+/, $line;
    if (exists $index{$fields[3]}) {
        say "Line: $line";
    }
}
close $file_a;
The hash means you only have to read through file_b once instead of 400,000 times. Start the program, go grab a cup of coffee from the office kitchen. (Yum! non-dairy creamer!) By the time you get back to your desk, it'll be done.
grep -f seems to be very slow even for medium-sized pattern files (< 1MB). I guess it tries every pattern for each line in the input stream.
A solution that was faster for me was to use a while loop. This assumes that fileA is reasonably small (it is the smaller one in your example), so iterating multiple times over the smaller file is preferable to iterating over the larger file multiple times.
while read line; do
  grep -F "$line" fileA
done < fileBcutdown > outputfile
Note that this loop will output a line several times if it matches multiple patterns. To work around this limitation use sort -u, but this might be slower by quite a bit. You have to try.
while read line; do
  grep -F "$line" fileA
done < fileBcutdown | sort -u > outputfile
If you depend on the order of the lines, then I don't think you have any other option than using grep -f. But basically it boils down to trying m*n pattern matches.
Use the command below; it loads the single-field file B into an array, then prints every line of fileA whose fourth field is in that array:
awk 'FNR==NR{a[$0];next}($4 in a)' <your filtered fileB with single field> fileA
