Awk to get .csv column containing commas and newlines [duplicate]

This question already has answers here:
What's the most robust way to efficiently parse CSV using awk?
(6 answers)
Closed 3 years ago.
I have data in a .csv column that sometimes contains commas and newlines. If there is a comma in my data, I have enclosed the entire string in double quotes. How would I go about writing that column to a .txt file, taking the newlines and commas into consideration?
Sample data that doesn't work with my command:
,"This is some text with a , in it.", #data with commas are enclosed in double quotes
,line 1 of data
line 2 of data, #data with a couple of newlines
,"Data that may a have , in it and
also be on a newline as well.",
Here is what I have so far:
awk -F "\"*,\"*" '{print $4}' file.csv > column_output.txt

$ cat decsv.awk
BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")"; OFS="," }
{
# create strings that cannot exist in the input to map escaped quotes to
gsub(/a/,"aA")
gsub(/\\"/,"aB")
gsub(/""/,"aC")
# prepend previous incomplete record segment if any
$0 = prev $0
numq = gsub(/"/,"&")
if ( numq % 2 ) {
# this is inside double quotes so incomplete record
prev = $0 RT
next
}
prev = ""
for (i=1;i<=NF;i++) {
# map the replacement strings back to their original values
gsub(/aC/,"\"\"",$i)
gsub(/aB/,"\\\"",$i)
gsub(/aA/,"a",$i)
}
printf "Record %d:\n", ++recNr
for (i=0;i<=NF;i++) {
printf "\t$%d=<%s>\n", i, $i
}
print "#######"
}
$ awk -f decsv.awk file
Record 1:
$0=<,"This is some text with a , in it.", #data with commas are enclosed in double quotes>
$1=<>
$2=<"This is some text with a , in it.">
$3=< #data with commas are enclosed in double quotes>
#######
Record 2:
$0=<,"line 1 of data
line 2 of data", #data with a couple of newlines>
$1=<>
$2=<"line 1 of data
line 2 of data">
$3=< #data with a couple of newlines>
#######
Record 3:
$0=<,"Data that may a have , in it and
also be on a newline as well.",>
$1=<>
$2=<"Data that may a have , in it and
also be on a newline as well.">
$3=<>
#######
Record 4:
$0=<,"Data that \"may\" a have ""quote"" in it and
also be on a newline as well.",>
$1=<>
$2=<"Data that \"may\" a have ""quote"" in it and
also be on a newline as well.">
$3=<>
#######
The above uses GNU awk for FPAT and RT. I don't know of any CSV format that would allow you to have a newline in the middle of a field that's not enclosed by quotes (if it did you'd never know where any record ended) so the script doesn't allow for that. The above was run on this input file:
$ cat file
,"This is some text with a , in it.", #data with commas are enclosed in double quotes
,"line 1 of data
line 2 of data", #data with a couple of newlines
,"Data that may a have , in it and
also be on a newline as well.",
,"Data that \"may\" a have ""quote"" in it and
also be on a newline as well.",

Related

Extract a string between double quotes from the 6th line of a file in Unix and assign it to variable

Newbie to unix/shell/bash. I have a file name CellSite whose 6th line is as below:
btsName = "RV74XC038",
I want to extract the string on the 6th line that is between double quotes (i.e. RV74XC038) and save it to a variable. Please note that the 6th line starts with 4 blank spaces, and that this string varies from file to file. So I am looking for a solution that extracts the string between the double quotes on the 6th line.
I tried the command below, but it does not work.
str2 = sed '6{ s/^btsName = \([^ ]*\) *$/\1/;q } ;d' CellSite;
Any help is much appreciated. TIA.

sed is a stream editor.
For just parsing files, you want to look into awk. Something like this:
awk -F \" '/btsName/ { print $2 }' CellSite
Where:
-F defines a "field separator", in your case the quotation marks "
the entire script consists of:
/btsName/ act only on lines that contain the regex "btsName"
from that line print out the second field; the first field will be everything before the first quotation marks, second field will be everything from the first quotes to the second quotes, third field will be everything after the second quotes
parse through the file named "CellSite"
There are possibly better alternatives, but you would have to show the rest of your file.

Using sed
$ str2=$(sed -n '6s/[^"]*"\([^"]*\).*/\1/p' CellSite)
$ echo "$str2"
RV74XC038

You can use the following awk solution:
btsName=$(awk -F\" 'NR==6{print $2; exit}' CellSite)
Basically, get to the sixth line (NR==6), print the second field value (" is used to split records (lines) into fields) and then exit.
See the online demo:
#!/bin/bash
CellSite='Line 1
Line 2
Line 3
btsName = "NO74NO038",
Line 5
btsName = "RV74XC038","
Line 7
btsName = "no11no000",
'
btsName=$(awk -F\" 'NR==6{print $2; exit}' <<< "$CellSite")
echo "$btsName" # => RV74XC038

This might work for you (GNU sed):
var=$(sed -En '6s/.*"(.*)".*/\1/p;6q' file)
Use extended regexes (-E) and turn off implicit printing (-n).
Focus on the 6th line only: print the value between double quotes, then quit.
Bash runs the sed invocation through command substitution, $(...), and assigns the extracted value to the variable var.
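None of the commands above tell the caller whether line 6 actually contained a quoted string. A minimal sketch of a wrapper that does, assuming the awk approach from the earlier answer; quoted_on_line is a hypothetical helper name, not something from the answers:

```shell
# quoted_on_line N FILE: print the first double-quoted string on line N of FILE.
# Hypothetical helper; exits non-zero if line N is missing or has no quoted string.
quoted_on_line() {
  awk -F'"' -v n="$1" '
  NR == n {
    # A complete "..." pair splits the line into at least three fields.
    if (NF >= 3) { print $2; found = 1 }
    exit
  }
  END { exit !found }' "$2"
}
```

It could then be used as str2=$(quoted_on_line 6 CellSite) || echo "no quoted value on line 6" >&2, so an unexpected file layout is reported instead of silently yielding an empty variable.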

how do you count and replace a string in a text file that starts at the end of one line and continues on the next using linux commands?

I have a large (4 GB) Windows .csv text file (each line ends in "\r\n") in a Linux environment that was supposed to have been a delimited file (delimiter = '|', text qualifier = '"'), with each field separated by a pipe and enclosed in double quotes. Any narrative text field with embedded double quotes was supposed to have each double quote escaped with a second double quote (i.e. "the quick "brown" fox" was supposed to have been represented as "the quick ""brown"" fox"). Unfortunately, the embedded double quotes were not escaped. Further, the text fields may include embedded newlines (i.e. Windows CRLF (\r\n)), which need to be retained.
Sample lines might look as follows:
"1234567890123456"|"2016-07-30"|"2016-08-01"|"123"|"456"|"789"|"text narrative field starts\r\n
with text lines that may have embedded double quotes "For example"\r\n
and may include measurements such as 1/2" x 2" with \r\n
the text continuing and includes embedded line breaks \r\n
which will finally be terminated with a double quote"\r\n
"9876543210654321"|"2017-01-31"|"2018-08-01"|"123"|"456"|"789"|"text narrative field"\r\n
"2345678901234567"|"...."\r\n
with the objective to have the output appear as follows:
~1234567890123456~|~2016-07-30~|~2016-08-01~|~123~|~456~|~789~|~text narrative field starts\r\n
with text lines that may have embedded double quotes ""For example""\r\n
and may include measurements such as 1/2"" x 2"" with \r\n
the text continuing and includes embedded line breaks \r\n
which will finally be terminated with a double quote~\r\n
~9876543210654321~|~2017-01-31~|~2018-08-01~|~123~|~456~|~789~|~text narrative field~\r\n
~2345678901234567~|~....~\r\n
The solution I was attempting to implement was to:
SUCCESSFUL: change all the "|" sequences to ~|~
SUCCESSFUL: change the double quote (")at the start of the first line and end of the last line to a tilde (~)
change the ending and starting double quotes to tildes for any lines ending in a double quote at the end of the first line and terminated with a CR (\r\n) (eg. ..."\r\n) and the next line begins with a double quote, followed by 16 digit number and a tilde (eg. "1234567890123456~...) (i.e. it is the start of a new record)
convert all remaining double quote characters to two successive double quotes (change " to "")
then reverse the first 3 steps above changing all ~ back to double quotes.
I started by using sed to replace all strings with double quote, followed by a pipe, followed by a double quote (i.e. "|") with a tilde, pipe, tilde (i.e. ~|~). I then manually replaced the first and last doublequote in the file with a tilde.
This is where I ran into issues as I tried to count the number of occurrences where a line ends with a doublequote(") and the start of the next line begins with a doublequote followed by a 16 digit number and a "~" which will tell me the actual number of csv records in the file (minus one) as opposed to the number of lines. I attempted to do this using grep: grep '"\r\n"\d{16}~' | wc -l but that didn't work
I then need to replace those double quotes wherein a double quote ends a record and the succeeding record begins with a double quote followed by a 16 digit number and a "~" leaving everything else intact.
I tried to use sed: sed 's/"\r\n"(\d{16}~)/~\r\n~\1' windows_file.txt but it is not working as hoped.
I would welcome any recommendations as to how to accomplish the above.

The script below does what you expect using awk, except for the very last line in the file, since awk does not know where that record ends.
That could be fixed by counting the lines in the file, but it would be impractical since it's a big file.
Looking at the data structure, records are separated by "\r\n" (quote, CRLF, quote) and fields by "|" (quote, pipe, quote), so let's use that with awk.
gawk 'BEGIN{
RS="\"\r\n\"" # input record separator RS, 2 double quotes with a DOS line ending in the middle
FS="\"\\|\"" # input field separator FS, 2 double quotes with a pipe in the middle
ORS="~\r\n~" # your record separator
OFS="~|~" # your field separator
} {
$1=$1 # trick awk into believing something has changed
if (NR == 1){ # first record, replace first character
print "~" substr($0,2)
}else{
print $0
}
} ' test.txt
Result (assuming lines end with \r\n):
~1234567890123456~|~2016-07-30~|~2016-08-01~|~123~|~456~|~789~|~text narrative field starts
with text lines that may have embedded double quotes "For example"
and may include measurements such as 1/2" x 2" with
the text continuing and includes embedded line breaks
which will finally be terminated with a double quote~
~9876543210654321~|~2017-01-31~|~2018-08-01~|~123~|~456~|~789~|~text narrative field~
~10654321~|~2018-09-31~|~2018-08-01~|~123~|~456~|~789~|~asdasdasdasdad asasda"
~
~
PS: this will break if a field contains a line that starts with " and the preceding line within the same field ends with "\r\n, since that pattern will match the proposed RS. For example:
"10654321"|"2018-09-31"|"2018-08-01"|"123"|"456"|"789"|"asdasdasdasdad asasda"\r\n
"some more"\r\n
"22222"|".... (another record)
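The counting step from the question (grep cannot match a pattern that spans two lines) can be sketched in POSIX awk by remembering how the previous line ended. This is a sketch under the question's stated assumption that a new record starts with a double quote, a 16-digit number and another double quote, on a line that follows a line ending in a double quote; count_records is a hypothetical name, and it works on the raw file, before any tilde substitution.

```shell
# count_records: count logical records in a CRLF-terminated file read from stdin.
# Assumption (from the question): a record starts with a quote, 16 digits and a
# quote, and the line before it ends with a quote.
count_records() {
  awk '
  function starts_record(s) {
    return substr(s, 1, 1) == "\"" && substr(s, 2, 16) ~ /^[0-9]+$/ &&
           substr(s, 18, 1) == "\""
  }
  { sub(/\r$/, "") }                                   # drop the CR of the CRLF ending
  NR == 1 || (prev_closed && starts_record($0)) { n++ }
  { prev_closed = ($0 ~ /"$/) }                        # remember how this line ended
  END { print n + 0 }'
}
```

Because it insists on the 16-digit key, this test is stricter than the RS used above, so a narrative line such as "some more" is not mistaken for the start of a record.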

Remove CRLF character from a single data element in CSV file

Hi, I'm editing my question here. The requirement has slightly changed: the CSV file has only LF line endings to begin with. However, the CSV file could also have LFs inside an element enclosed in double quotes. We want to retain the LFs within double quotes and replace the LF at the end of each line with CRLF.
so if my source file looks like this :
Date,Open,High,Low,Close,comments,Remark
5-Dec-16,8088.75,8141.9,8056.85,8128.75,"TEST1 <LF>
TEST2 <LF>
with NL",remark<LF>
6-Dec-16,8153.15,8178.7,8130.85,8143.15,AAAA,remark<LF>
7-Dec-16,8168.4,8190.45,8077.5,8102.05,BBBB,remark<LF>
8-Dec-16,8152.1,8256.25,8151.75,8246.85,"TEST1<LF>
TEST with NL",remark<LF>
9-Dec-16,8271.7,8274.95,8241.95,8261.75,CCCC,remark<LF>
then the expected output is:
Date,Open,High,Low,Close,comments,Remark
5-Dec-16,8088.75,8141.9,8056.85,8128.75,"TEST1 <LF>
TEST2 <LF>
with NL",remark<CRLF>
6-Dec-16,8153.15,8178.7,8130.85,8143.15,AAAA,remark<CRLF>
7-Dec-16,8168.4,8190.45,8077.5,8102.05,BBBB,remark<CRLF>
8-Dec-16,8152.1,8256.25,8151.75,8246.85,"TEST1<LF>
TEST2 with NL",remark<CRLF>
9-Dec-16,8271.7,8274.95,8241.95,8261.75,CCCC,remark<CRLF>
Appreciate your help.
Thanks,
Chandan

Best to use a proper CSV parser that can handle newlines in quoted fields. Perl has one:
perl -MText::CSV -e '
$csv = Text::CSV->new({ binary => 1 });
while ($row = $csv->getline(STDIN)) {
$row = [map {s/\n+/ /g; $_} @$row];
$csv->say(STDOUT, $row)
}
' < file.csv
or ruby
ruby -rcsv -e '
CSV.parse( readlines.join "" ).each {|row|
puts CSV.generate_line( row.collect {|elem| elem.gsub /\n+/, " "} )
}
' file

Chances are you're looking for:
awk -v RS='\r\n' '{gsub(/[\r\n]+/," ")}1' file
but without details on where the \rs and \ns appear in your input that's just a guess. The above uses GNU awk for multi-char RS and in addition to replacing chains of carriage returns and/or linefeeds from inside every field with blanks will convert your newlines from \r\n (Windows style) to just \n (UNIX style) to make it easier to do anything else with them from that point onwards.
See also What's the most robust way to efficiently parse CSV using awk? for how to handle CSVs in general using awk.

A little state machine in awk: uses a double quote as the field separator, and acts upon the number of fields:
awk -F '"' '
partial {$0 = partial OFS $0; partial = ""}
NF % 2 == 0 {partial = $0; next}
{print}
' file
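For the edited requirement (keep LF inside quoted fields, end each record with CRLF), the same quote-parity idea can be applied without joining lines. A minimal sketch in POSIX awk, assuming the input uses LF only and that any quotes inside a field are escaped by doubling, so the parity stays correct; crlf_record_ends is a hypothetical name:

```shell
# crlf_record_ends: read LF-terminated CSV on stdin; keep LF inside quoted
# fields and terminate each complete record with CRLF.
crlf_record_ends() {
  awk '
  {
    line = $0
    quotes += gsub(/"/, "", line)     # running total of double quotes seen so far
    # Odd total: this line ends inside a quoted field, keep the bare LF.
    # Even total: the line ends a record, terminate it with CRLF.
    printf "%s%s", $0, (quotes % 2 ? "\n" : "\r\n")
  }'
}
```

Usage would be along the lines of crlf_record_ends < file.csv > file_crlf.csv.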

how to replace every 6th occurrence of blank space with new line from a file?

For example if a file has words (or strings) separated by space as show below:
cat bat mat ram sdk kgb fsb cia
After every 6th word a new line should start, and the words on each line should be separated by commas. I'm not sure how to do this using awk.
cat ,bat, mat ,ram ,sdk ,kgb (new line )
fsb ,cia

This is actually a pretty tricky thing to do. Here are a few ways:
sed: convert all whitespace to commas, then replace every 6th comma with a newline.
sed -r 's/[[:blank:]]+/,/g; s/([^,]+(,[^,]+){5}),/\1\n/g' file
awk, print each field and decide what separator to use for each one.
awk '{
for (i=1; i<=NF; i++)
printf "%s%s", $i, (i == NF ? "" : ( i%6 == 0 ? "\n" : ","))
print ""
}' file
bash
myjoin() { local IFS=$1; shift; echo "$*"; }
while read -ra words; do
while (( ${#words[@]} > 0 )); do
myjoin , "${words[@]:0:6}"
words=( "${words[@]:6}" )
done
done < file
this one's my favourite: tr to convert whitespace to newlines; paste to print 6 fields per line; sed to clean up trailing commas on the last line
tr -s '[:blank:]' '\n' < file | paste -d, - - - - - - | sed '$s/,\+$//'
This one acts differently from the others though: if your input file has 3 lines of 8 words, all the other methods will output 6 lines, odd-numbered lines with 6 fields and even-numbered lines with 2 fields. This answer will print 4 lines, all having 6 fields. So, it depends on your need.
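The stream-of-words behaviour of the tr/paste pipeline can also be sketched in a single awk call by keeping the word counter across input lines. This is a sketch only; six_per_line is a hypothetical name, and like the pipeline it ignores the original line breaks:

```shell
# six_per_line: read whitespace-separated words on stdin and print them
# six per line, comma-separated, counting words across input lines.
six_per_line() {
  awk '
  {
    for (i = 1; i <= NF; i++) {
      # Separator goes before each word: nothing first, newline every 6th, comma otherwise.
      printf "%s%s", (n % 6 ? "," : (n ? "\n" : "")), $i
      n++
    }
  }
  END { if (n) print "" }'
}
```

Writing the separator before each word rather than after it avoids the trailing-comma cleanup that the paste version needs on the last line.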

How to Convert a tab delimited file with commas in values to .CSV and the values with commas to be enclosed in double quotes?

I have a .CSV file (Lets say tab_delimited_file.csv) that I download from a web portal of a particular vendor. When I moved the file to one of my Linux directories, I noticed that this particular .CSV file is actually a tab delimited file which is named as .CSV. Please find below few sample records of the file.
"""column1""" """column2""" """column3""" """column4""" """column5""" """column6""" """column7"""
12 455 string with quotes, and with a comma in between 4432 6787 890 88
4432 6787 another, string with quotes, and with two comma in between 890 88 12 455
11 22 simple string 77 777 333 22
The above sample records are separated by tabs. I know the header of the file is very weird but this is the way I received the file format to be.
I tried to use tr command to replace the tabs with commas but the file gets messed up completely due to the extra commas in the record values. I need the record values with commas in them to be enclosed in double quotes. The command I used is as below.
tr '\t' ',' < tab_delimited_file.csv > comma_separated_file.csv
This converts the file into the below format.
"""column1""","""column2""","""column3""","""column4""","""column5""","""column6""","""column7"""
12,455,string with quotes, and with a comma in between,4432,6787,890,88
4432,6787,another, string with quotes, and with two comma in between,890,88,12,455
11,22,simple string,77,777,333,22
I need help to convert the sample file into the below format.
column1,column2,column3,column4,column5,column6,column7
12,455,"string with quotes, and with a comma in between",4432,6787,890,88
4432,6787,"another, string with quotes, and with two comma in between",890,88,12,455
11,22,"simple string",77,777,333,22
Any solution in either using sed or awk will be very useful.

This will produce the output you asked for. It's not clear, though, whether the criterion I'm assuming for which fields to put in quotes (any field containing a comma or whitespace) is actually what you want, so test it yourself with other input:
$ awk 'BEGIN { FS=OFS="\t" }
{
gsub(/"/,"")
for (i=1;i<=NF;i++)
if ($i ~ /[,[:space:]]/)
$i = "\"" $i "\""
gsub(OFS,",")
print
}
' file
column1,column2,column3,column4,column5,column6,column7
12,455,"string with quotes, and with a comma in between",4432,6787,890,88
4432,6787,"another, string with quotes, and with two comma in between",890,88,12,455
11,22,"simple string",77,777,333,22

One way using awk:
awk '
BEGIN { FS = "\t"; OFS = "," }
FNR == 1 {
for ( i = 1; i <= NF; i++ ) { gsub( /"+/, "", $i ) }
print $0
next
}
FNR > 1 {
for ( i = 1; i <= NF; i++ ) {
w = split( $i, _, " " )
if ( w > 1 ) { $i = "\"" $i "\"" }
}
print $0
}
' infile
It uses a tab to split fields on input and a comma to join them on output. The header is easy: simply remove all double quotes. For data lines, split each field on spaces and surround it with double quotes only if the split returned more than one word.
It yields:
column1,column2,column3,column4,column5,column6,column7
12,455,"string with quotes, and with a comma in between",4432,6787,890,88
4432,6787,"another, string with quotes, and with two comma in between",890,88,12,455
11,22,"simple string",77,777,333,22
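Both answers quote any field that contains a space. If the consumer of the CSV only needs quoting where it is required to keep the file parseable, a variant can quote just the fields that contain a comma or a double quote, doubling any embedded quotes. This is a sketch under that assumption, not what the question's sample output shows (there "simple string" is quoted); tsv_to_csv is a hypothetical name, and the header quotes are assumed to be decoration only:

```shell
# tsv_to_csv: convert tab-separated input on stdin to CSV, quoting only
# fields that contain a comma or a double quote.
tsv_to_csv() {
  awk 'BEGIN { FS = "\t"; OFS = "," }
  NR == 1 { gsub(/"/, "") }          # assumption: header quotes are decoration only
  {
    for (i = 1; i <= NF; i++)
      if ($i ~ /[",]/) {
        gsub(/"/, "\"\"", $i)        # escape embedded quotes by doubling them
        $i = "\"" $i "\""
      }
    $1 = $1                          # force $0 to be rebuilt with OFS
    print
  }'
}
```

Run as tsv_to_csv < tab_delimited_file.csv > comma_separated_file.csv; to reproduce the question's sample output exactly, the field test would have to match spaces as well, as in the first answer.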
