I have a Perl script that accepts a comma-separated CSV file as input.
I would like to discard the last column (the column number is known in advance).
The problem is that the last column may contain quoted strings with commas, in which case I would like to cut the entire string.
Example:
colA,colB,colC
1,2,3
4,5,"6,6"
What I would like to end up with is:
colA,colB
1,2
4,5
The current solution I have is using Linux cut command in the following manner:
cat $file | cut -d ',' -f 3 --complement
Which outputs the following:
colA,colB
1,2
4,5,6"
Which works great unless the last column is a quoted string with commas in it.
I can only use native Perl/Linux commands to solve this.
Appreciate your help
Using Text::CSV, as a script to process STDIN into STDOUT:
use strict;
use warnings;
use Text::CSV 'csv';
my $csv = csv(in => \*STDIN, keep_headers => \my @headers,
    auto_diag => 2, encoding => 'UTF-8');
pop @headers;
csv(in => $csv, out => \*STDOUT, headers => \@headers,
    auto_diag => 2, encoding => 'UTF-8');
The obvious benefit of this approach is handling all common edge cases automatically.
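For reference, saving the above as drop_last.pl (a hypothetical name) and feeding it the sample input should give:
$ perl drop_last.pl < input.csv
colA,colB
1,2
4,5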
Try this, based on GNU awk's regex field matching (FPAT):
awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=',' '{print $1,$2}' "$file"
Example
echo '"4,4",5,"6,6"' | awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=',' '{print $1,$2}'
"4,4",5
If quoted strings with commas are the only trouble you are facing, you can use this:
$ sed -E 's/,"[^"]*"$|,[^,]*$//' ip.txt
colA,colB
1,2
4,5
,"[^"]*"$ will match , followed by " followed by non " characters followed by " at the end of line
,[^,]*$ will match , followed by non , characters at end of line
The double-quoted alternative starts matching earlier in the string (at the comma before the opening quote), so a quoted last column gets deleted completely, inner commas included
The equivalent in perl would be perl -lpe 's/,"[^"]*"$|,[^,]*$//' ip.txt
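To see the alternation doing its job on the tricky line (assuming GNU sed for -E, as above):
$ echo '4,5,"6,6"' | sed -E 's/,"[^"]*"$|,[^,]*$//'
4,5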
I believe sungtm's answer is correct and requires some explanation:
awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=',' '{print $1,$2}'
Is equivalent to:
script.awk
BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"; # gnu awk specific: FPAT is a regex pattern identifying each field's content
                                   # [^,]+        regex matching a run of chars that are not ","
                                   # \"[^\"]+\"   regex matching quoted content, including the quotes
                                   # ( )|( )      alternation: a field is either of the two
    OFS = ",";                     # output field separator
}
{                   # for each input line/record
    print $1, $2;   # print "1st field" OFS "2nd field"
}
Running
awk -f script.awk input.txt
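With the question's sample as input.txt, this should print:
colA,colB
1,2
4,5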
Save the script in any file, say script.pl.
Execute it as: perl script.pl /opt/filename.csv
"1","2,3",4,"test, test" ==> "1","2,3",4
1,"2,3,4","5 , 6","7,8" ==> 1,"2,3,4","5 , 6"
0,0,0,"test" ==> 0,0,0
It handles the above cases.
use strict;
use warnings;

if (scalar(@ARGV) != 1) {
    print "usage: perl script.pl absolute_file_path";
    exit;
}
my $filename = $ARGV[0];    # complete file path here
open(my $in, '<', $filename)
    or die "Could not open file '$filename' $!";
my @lines = <$in>;
close($in);
my $counter = 0;
open my $fo, '>', $filename;
foreach my $line (@lines) {
    chomp($line);
    # the capture group makes split return the captured CSV fields
    my @update = split '(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)', $line;
    my @update2;
    foreach (@update) {
        if ($_ =~ /\w+/) {    # keep only the non-empty pieces
            push(@update2, $_);
        }
    }
    pop(@update2);            # drop the last column
    print @update2;           # echo kept fields to stdout
    my $str = join(',', @update2);
    print $fo "$str";
    unless (++$counter == scalar(@lines)) {
        print $fo "\n";
    }
}
close $fo;
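For reference, with the question's sample saved in /tmp/input.csv, running perl script.pl /tmp/input.csv should rewrite the file in place as (the script also echoes the kept fields to stdout):
colA,colB
1,2
4,5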
Well, this case is quite interesting - please see my solution below.
You can set $debug = 1; to see what happens and how this mechanism works.
use strict;
use warnings;

my $debug = 0;

while ( <DATA> ) {
    print "IN: $_" if $debug;
    chomp;
    s/"(.+?)"/replace($1)/ge;       # do magic replacement: , -> ___ inside blocks of interest
    print "REP: $_\n" if $debug;
    my @data = split /,/;           # split into array
    pop @data;                      # pop last element of array
    my $line = join ',', @data;     # merge array back into a string
    $line =~ s/___/,/g;             # undo magic replacement
    $line =~ s/\|/"/g;              # restore | -> "
    printf "%s$line\n", $debug ? "OUT: " : '';  # print result
}

sub replace {
    my $line = shift;
    $line =~ s/,/___/g;             # do magic replacement inside our block
    return "|$line|";               # put | around block of interest
}
__DATA__
colA,colB,colC
1,2,3
4,5,"6,6"
8,3,"1,2",37,82
64,12,"1,2,3,4",42,56
"3,4,7,8",2,8,"8,7,6,5,4",2,8
"3,4,7,8",2,8,"8,7,6,5,4",2,8,"2,8,4,1"
"3,4,7,8",2,8,"8,7,6,5,4",2,8,"2,8,4,1",3,4
Appreciate your help. Below is the solution I ended up using:
cat file.csv | perl -MText::ParseWords -nle '@f = parse_line(",", 2, $_); tr/,/$/ for @f; print join ",", @f' | cut -d ',' -f 3 --complement | tr '$' ','
This will replace commas in fields surrounded by quotes with a $ sign, to be replaced back after discarding the unwanted last column.
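For reference, the intermediate stream after the first perl stage should look like this (commas inside quoted fields turned into $):
colA,colB,colC
1,2,3
4,5,"6$6"
so the subsequent cut can safely treat every remaining comma as a field separator.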
Recently, I had to sort several files according to records' ID; the catch was that there can be several types of records, and in each of those the field I had to use for sorting is on a different position. The fields, however, are easily identifiable thanks to key=value structure. To show a simple sample of the general structure:
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
I came up with a pipeline as follows, which did the job:
awk -F'[|=]' '{for(i=1; i<=NF; i++) {if($i ~ "id") {i++; print $i"?"$0} }}' tester.txt | sort -n | awk -F'?' '{print $2}'
In other words the algorithm is as follows:
Split the record by both field and key-value separators (| and =)
Iterate through the elements and search for the id key
Print the next element (value of id key), a separator, and the whole line
Sort numerically
Remove prepended identifier to preserve records' structure
Processing the sample gives the output:
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
Is there a way, though, to do this task using single awk command?
You may try this gnu-awk code to do this in a single command:
awk -F'|' '{
for(i=1; i<=NF; ++i)
if ($i ~ /^id=/) {
a[gensub(/^id=/, "", 1, $i)] = $0
break
}
}
END {
PROCINFO["sorted_in"] = "@ind_num_asc"
for (i in a)
print a[i]
}' file
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
We are using | as the field delimiter; when a field starts with id= we store the full record in array a, with the text after = as the index and the record as the value.
Using PROCINFO["sorted_in"] = "@ind_num_asc" we sort array a by the numerical value of its indices, and then the for loop prints the value part to get the sorted output.
Using GNU awk for the 3rd arg to match() and sorted_in:
$ cat tst.awk
match($0,/(^|\|)id=([0-9]+)/,a) {
ids2vals[a[2]] = $0
}
END {
PROCINFO["sorted_in"] = "@ind_num_asc"
for ( id in ids2vals ) {
print ids2vals[id]
}
}
$ awk -f tst.awk file
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
Try Perl: perl -e 'print map { s/^.*? //; $_ } sort { $a <=> $b } map { ($id) = /id=(\d+)/; "$id $_" } <>' file
Some explanation of the code I use:
print #print the resulting list of lines
map {
s/^.*? //;
$_
} #remove numeric id from start of line
sort { $a <=> $b } #sort numerically
map {
($id) = /id=(\d+)/;
"$id $_"
} # capture id and place it in start of line
<> # read all lines from file
Or try sed and sort: sed 's/^\(.*id=\([0-9][0-9]*\).*\)$/\2 \1/' file | sort -n | sed 's/^[^ ][^ ]* //'
With your shown samples only, please try the following (awk + sort + cut) solution. Written and tested in GNU awk; it should work in any awk.
awk '
match($0,/id=[0-9]+/){
print substr($0,RSTART,RLENGTH)";"$0
}
' Input_file | sort -t'=' -k2n | cut -d';' -f2-
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
match($0,/id=[0-9]+/){ ##Using awk match function to match id= followed by digits.
print substr($0,RSTART,RLENGTH)";"$0 ##printing the matched substring, then a semicolon, then the whole current line.
}
' Input_file | ##Mentioning Input_file here and passing awk output as a standard input to next command.
sort -t'=' -k2n | ##Sorting output with delimiter of = and by 2nd field then passing output to next command as an input.
cut -d';' -f2- ##Using cut command making delimiter as ; and printing everything from 2nd field onwards.
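To make the pipeline concrete, the intermediate stream between awk and sort should look like this (each record prefixed with its id= match and a semicolon):
id=2;fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
id=1;fieldD=valueD|recordType=B|id=1|fieldE=valueE
id=3;fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
sort -t'=' -k2n then orders these by the number after the first =, and cut strips the prefix.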
I have some data in this format saved in a string:
data = some-data.in.this.format
How can I perform a cut on $data so that I am only left with some-data?
You can use awk with -F (field-separator)
$ data=some-data.in.this.format
$ echo ${data} | awk -F "." '{print $1}'
some-data
Using a custom delimiter, in this case a dot (.), could do the trick. Try:
echo "data = some-data.in.this.format" | cut -d\. -f1
However, that returns everything before the first dot, which is literally what you asked for, so you will have:
data = some-data
Thus, if you want to get only some-data, I would use:
echo "data = some-data.in.this.format" | cut -d\. -f1 | awk '{ print $3 }'
If data = is always going to be the first thing on the line, you can do a substring parameter expansion:
offset=7
str="data = some-data.in.this.format"
value=${str:${offset}}
echo "$value"
Output:
some-data.in.this.format
If both the key and value are unknown, you can split on the first = in the string:
str="data = some-data.in.this.format"
key="${str%% =*}"
value="${str#*= }"
echo "$key"
echo "$value"
Output
data
some-data.in.this.format
See:
${parameter:offset}
${parameter#word}
${parameter%%word}
at Shell Parameter Expansion
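Putting the two expansions together gets some-data directly, with no external commands (a sketch using only the syntax referenced above):
str="data = some-data.in.this.format"
value="${str#*= }"      # strip the key and separator: some-data.in.this.format
value="${value%%.*}"    # strip everything from the first dot: some-data
echo "$value"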
I have one file, file1.dml, and another fixed-length data file, file2.dat.
The data in file1.dml looks like:
start
integer(16) field1 ;
string(1) filed2 ;
string(80) filed3 ;
decimal(16.2) field4;
string(1) newline = "\n";
end;
The data in file2.dat looks like:
12345678 ABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 1234567890
I need the output file to look like this:
field1="12345678 "
filed2="A"
filed3="BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB "
field4="1234567890 "
newline="\n"
I have written the below function, which accepts file1.dml and file2.dat and generates the exact result, but I want to simplify this using awk. Thanks in advance for any help.
function myfunc1
{
    if [ -z "$1" ] || [ -z "$2" ] || [ ! -f "$1" ] || [ ! -f "$2" ]; then
        echo "Input files not present"
    else
        dml_file=$1   # input parameter, dml file
        cntl_file=$2  # input parameter, dat file
        start_pos=1
        end_pos=0
        sed '1d;$d' "$dml_file" | while IFS= read -r line
        do
            counter=`echo $line | cut -d'(' -f2 | cut -d')' -f1`
            fld_name=`echo $line | cut -d'(' -f2 | cut -d')' -f2 | sed 's/;//g'`
            # check decimal or not
            if [[ $counter == +([0-9]) ]]; then
                end_pos=$((counter+start_pos))
            else
                counter=`echo $counter | cut -d'.' -f1`
                end_pos=$((counter+end_pos))
            fi
            newline_check=`echo $fld_name | grep -ic 'newline'`
            if [ "$newline_check" -gt 0 ]; then
                fld_name="newline"
                fld_val="\n"
                # write below line in one file
                echo "$fld_name : \"$fld_val\""
            else
                fld_val=`cut -c$start_pos-$end_pos "$cntl_file"`
                # write below line in one file
                echo "$fld_name : \"$fld_val\""
            fi
            start_pos=$((start_pos+counter))
        done
    fi
}
Since file2.dat only has a single line I'd start by reading this into a variable so that we don't have to continually scan the file; eg:
$ IFS= read -r rawdata < file2.dat # 'IFS=' needed in order to retain trailing white space
$ echo ".${rawdata}." # periods included to show that trailing white space retained
.12345678 ABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 1234567890 .
At this point we can pass the rawdata variable to an awk solution:
awk -v rd="${rawdata}" -F'[();=]' '
BEGIN { s = 1 ; nl = "\\n" }
/start|end/ { next }
/newline/ { gsub(/ /,"",$4)
nl = $4
next
}
{ split($2,a,".")
len = a[1]
gsub(/ /,"",$3)
fname = $3
printf "%s=\"%s\"\n", fname, substr(rd,s,len)
s += len
}
END { printf "newline=%s\n", nl }
' file1.dml
Where:
-v rd="${rawdata}" - define awk variable rd as containing the current value of ${rawdata}
-F '[();=]' - define 4 different input/field delimiters ((, ), ;, =); $2=field length, $3=field name, $4=newline character(s)
BEGIN { s = 1 ; nl = "\\n" } - initialize our start position in rd and a default newline character (\n)
/start|end/ { next } - ignore lines that contain start or end
/newline/ .... - if we see the pattern newline then remove spaces and set nl to this new value ($4)
next - stop processing for current line and go to next line of input
NOTE: for the rest of the lines in our input file:
split($2,a,".") / len = a[1] - split the 'length' field ($2) based on a period delimiter into array a, then set len to the first element of array a
gsub(/ /,"",$3) / fname = $3 - remove white space from the field name ($3) and assign resulting value to local variable fname
printf ... - output our line of data; use s and len to pull the desired substring from rd
s += len - add current len to s to get a new start position for next pass through the logic
END ... - when all input processing is done, print our newline record to stdout
Running the above generates the following:
field1="12345678 "
filed2="A"
filed3="BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB "
field4="1234567890 "
newline="\n"
Through a request made to the command line, I get a response in csv format. This is the file:
"STACK";"STACK_ID";"NAME";"DESCRIPTION";"CREATION_TIME";"DELETION_TIME";"STATUS"
"STACK";"arn:aws:cloudformation:us-west-1:222222222222:stack/LiveStream2/00000000-6000-00e2-acd3-333333333333";"LiveStream2";"(nil)";"2013-01-23T16:01:36Z";"2013-01-23T16:22:57Z";"DELETE_COMPLETE"
"STACK";"arn:aws:cloudformation:us-west-1:444444444444:stack/LiveStream/00000000-6000-00e2-acd3-222222222222";"LiveStream";"(nil)";"2013-01-23T13:53:13Z";"2013-01-23T15:20:29Z";"DELETE_COMPLETE"
With a script I would like to take the value of the NAME field and put it in a variable.
Can you help me?
If you don't know that NAME is the 3rd field, you have to hunt for it in the header line:
awk -F \; '
NR==1 {
for (i=1; i<=NF; i++) {
if ($i == "\"NAME\"") {
name_field = i
break
}
}
next
}
{ print $name_field }
' < filename
this outputs
"LiveStream2"
"LiveStream"
Well, the cut command should work well for you.
For example, if your file is test.txt,
cat test.txt | cut -d';' -f3 will get field 3.
If you want to process the value line by line, then use this:
while IFS= read -r i
do
    MYVAR=$(echo "$i" | cut -d';' -f3)
    # do something with the variable $MYVAR
done < test.txt
NOTE: this is just one approach to your question. There are several ways to achieve what you've asked.
This is the example code.
#!/bin/sh
OLDIFS=$IFS
IFS=';'
while read f1 f2 f3 rest
do
    echo "Name variable is: $f3"
done < filename.csv
IFS=$OLDIFS
The third field (NAME) is read into the variable f3, and its value is echoed for every line of the file.
Awk would be my best go at it.
declare NAMES=( $( cmd_that_generates_output | awk -F ';' 'NR==1{next}; { if( $3 ~ /^[^"].*[^"]$/){$3="\""$3"\""}; printf "%s ", $3 }' ) )
So bash executes the command substitution, i.e. everything inside the $( ); the cmd generates your csv and pipes it to awk.
Awk checks whether the input is the first line ( NR==1 ) and moves on if it is ( {next} ).
For every other line it checks that the 3rd column (fields being separated by ';', i.e. the name column) is properly encased in quotes:
if( $3 ~ /^[^"].*[^"]$/){$3="\""$3"\""};
then prints out the 3rd column followed by a space:
printf "%s ", $3
The $( ) has then expanded to a space-separated list of names, e.g.
declare NAMES=( name1 name2 )
Then the declare is executed, which creates an array called NAMES with each name as a single element. So:
x.txt is the output you gave in your question.
pete.mccabe@jackfrog$ cat x.txt
"STACK";"STACK_ID";"NAME";"DESCRIPTION";"CREATION_TIME";"DELETION_TIME";"STATUS"
"STACK";"arn:aws:cloudformation:us-west-1:222222222222:stack/LiveStream2/00000000-6000-00e2-acd3-333333333333";"LiveStream2";"(nil)";"2013-01-23T16:01:36Z";"2013-01-23T16:22:57Z";"DELETE_COMPLETE"
"STACK";"arn:aws:cloudformation:us-west-1:444444444444:stack/LiveStream/00000000-6000-00e2-acd3-222222222222";Live Stream;"(nil)";"2013-01-23T13:53:13Z";"2013-01-23T15:20:29Z";"DELETE_COMPLETE"
pete.mccabe@jackfrog$ declare NAMES=( $( cat x.txt | awk -F ';' 'NR==1{next}; { if( $3 ~ /^[^"].*[^"]$/){$3="\""$3"\""}; printf "%s ", $3 }' ) )
pete.mccabe@jackfrog$ echo ${NAMES[@]}
"LiveStream2" "Live Stream"
In Linux, if I have a file with entries like:
My Number is = 1234; #This is a random number
Can I use sed or anything else to replace all spaces after '#' with '+', so that the output looks like:
My Number is = 1234; #This+is+a+random+number
One way using awk:
awk -F# 'OFS=FS { gsub(" ", "+", $2) }1' file.txt
Result:
My Number is = 1234; #This+is+a+random+number
EDIT:
After reading comments below, if your file contains multiple #, you can try this:
awk -F# 'OFS=FS { for (i=2; i <= NF; i++) gsub(" ", "+", $i); print }' file.txt
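For reference, a sample run of the multiple-# variant should give:
$ echo 'My Number is = 1234; #This is a # random number' | awk -F# 'OFS=FS { for (i=2; i <= NF; i++) gsub(" ", "+", $i); print }'
My Number is = 1234; #This+is+a+#+random+number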
You can do this in pure shell...
$ foo="My Number is = 1234; #This is a random number"
$ echo -n "${foo%%#*}#"; echo "${foo#*#}" | tr ' ' '+'
My Number is = 1234; #This+is+a+random+number
$
Capturing this data to variables for further use is left as an exercise for the reader. :-)
Note that this also withstands multiple # characters on the line:
$ foo="My Number is = 1234; #This is a # random number"
$ echo -n "${foo%%#*}#"; echo "${foo#*#}" | tr ' ' '+'
My Number is = 1234; #This+is+a+#+random+number
$
Or if you'd prefer to create a variable rather than pipe through tr:
$ echo -n "${foo%%#*}#"; bar="${foo#*#}"; echo "${bar// /+}"
My Number is = 1234; #This+is+a+#+random+number
And finally, if you don't mind subshells with pipes, you could do this:
$ bar=$(echo -n "$foo" | tr '#' '\n' | sed -ne '2,$s/ /+/g;p' | tr '\n' '#')
$ echo "$bar"
My Number is = 1234; #This+is+a+#+random+number
$
And for the fun of it, here's a short awk solution:
$ echo $foo | awk -vRS=# -vORS=# 'NR>1 {gsub(/ /,"+")} 1'
My Number is = 1234; #This+is+a+#+random+number
#$
Note the trailing ORS. I don't know if it's possible to avoid a final record separator. I suppose you could get rid of that by piping the line above through head -1, assuming you're only dealing with the one line of input data.
Not terribly efficient, but:
perl -pe '1 while (s/(.*#[^ ]*) /$1+/);'
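For reference, this should behave like the other answers on a line with several # characters:
$ echo 'My Number is = 1234; #This is a # random number' | perl -pe '1 while (s/(.*#[^ ]*) /$1+/);'
My Number is = 1234; #This+is+a+#+random+number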
This might work for you (GNU sed):
echo 'My Number is = 1234; #This is a random number' |
sed 's/#/\n&/;h;s/.*\n//;y/ /+/;H;g;s/\n.*\n//'
My Number is = 1234; #This+is+a+random+number
Here is yet another perl one-liner:
echo 'My Number is = 1234; #This is a random number' \
| perl -F'#' -lane '$_ = join "#", @F[1..$#F]; s/ /+/g; print $F[0], "#", $_'
-F specifies how to split each input line into the @F array.
-an wraps stdin with:
while (<>) {
    @F = split('#');
    # code from -e goes here
}
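With the echoed sample line, this should print:
My Number is = 1234; #This+is+a+random+number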