I have a Perl script that accepts a comma-separated CSV file as input.
I would like to discard the last column (the column number is known in advance).
The problem is that the last column may contain quoted strings with commas, in which case I would like to cut the entire string.
Example:
colA,colB,colC
1,2,3
4,5,"6,6"
What I would like to end up with is:
colA,colB
1,2
4,5
The current solution I have is using Linux cut command in the following manner:
cat $file | cut -d ',' -f 3 --complement
Which outputs the following:
colA,colB
1,2
4,5,6"
Which works great unless the last column is a quoted string with commas in it.
I can only use native Perl/Linux commands to solve this.
Appreciate your help
Using Text::CSV, as a script to process STDIN into STDOUT:
use strict;
use warnings;
use Text::CSV 'csv';
my $csv = csv(in => \*STDIN, keep_headers => \my @headers,
    auto_diag => 2, encoding => 'UTF-8');
pop @headers;
csv(in => $csv, out => \*STDOUT, headers => \@headers,
    auto_diag => 2, encoding => 'UTF-8');
The obvious benefit of this approach is handling all common edge cases automatically.
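For reference, saving the above as drop_last.pl (a hypothetical name) and feeding it the sample input should give:
$ perl drop_last.pl < input.csv
colA,colB
1,2
4,5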
Try this, based on GNU awk's regex field matching (FPAT):
awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=',' '{print $1,$2}' "$file"
Example
echo '"4,4",5,"6,6"' | awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=',' '{print $1,$2}'
"4,4",5
If quoted strings with commas are the only trouble you are facing, you can use this:
$ sed -E 's/,"[^"]*"$|,[^,]*$//' ip.txt
colA,colB
1,2
4,5
,"[^"]*"$ will match , followed by " followed by non " characters followed by " at the end of line
,[^,]*$ will match , followed by non , characters at end of line
The double-quoted alternative starts matching earlier in the string (at the comma before the opening quote), so a quoted last column gets deleted completely, inner commas included
The equivalent in perl would be perl -lpe 's/,"[^"]*"$|,[^,]*$//' ip.txt
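To see the alternation doing its job on the tricky line (assuming GNU sed for -E, as above):
$ echo '4,5,"6,6"' | sed -E 's/,"[^"]*"$|,[^,]*$//'
4,5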
I believe sungtm's answer is correct and requires some explanation:
awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=',' '{print $1,$2}'
Is equivalent to:
script.awk
BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"; # gnu awk specific: FPAT is a regex pattern identifying each field's content
                                   # [^,]+        regex matching a run of chars that are not ","
                                   # \"[^\"]+\"   regex matching quoted content, including the quotes
                                   # ( )|( )      alternation: a field is either of the two
    OFS = ",";                     # output field separator
}
{                   # for each input line/record
    print $1, $2;   # print "1st field" OFS "2nd field"
}
Running
awk -f script.awk input.txt
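With the question's sample as input.txt, this should print:
colA,colB
1,2
4,5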
Save the script in any file, say script.pl.
Execute it as: perl script.pl /opt/filename.csv
"1","2,3",4,"test, test" ==> "1","2,3",4
1,"2,3,4","5 , 6","7,8" ==> 1,"2,3,4","5 , 6"
0,0,0,"test" ==> 0,0,0
It handles the above cases.
use strict;
use warnings;

if (scalar(@ARGV) != 1) {
    print "usage: perl script.pl absolute_file_path";
    exit;
}
my $filename = $ARGV[0];    # complete file path here
open(my $in, '<', $filename)
    or die "Could not open file '$filename' $!";
my @lines = <$in>;
close($in);
my $counter = 0;
open my $fo, '>', $filename;
foreach my $line (@lines) {
    chomp($line);
    # the capture group makes split return the captured CSV fields
    my @update = split '(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)', $line;
    my @update2;
    foreach (@update) {
        if ($_ =~ /\w+/) {    # keep only the non-empty pieces
            push(@update2, $_);
        }
    }
    pop(@update2);            # drop the last column
    print @update2;           # echo kept fields to stdout
    my $str = join(',', @update2);
    print $fo "$str";
    unless (++$counter == scalar(@lines)) {
        print $fo "\n";
    }
}
close $fo;
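For reference, with the question's sample saved in /tmp/input.csv, running perl script.pl /tmp/input.csv should rewrite the file in place as (the script also echoes the kept fields to stdout):
colA,colB
1,2
4,5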
Well, this case is quite interesting - please see my solution below.
You can set $debug = 1; to see what happens and how this mechanism works.
use strict;
use warnings;

my $debug = 0;

while ( <DATA> ) {
    print "IN: $_" if $debug;
    chomp;
    s/"(.+?)"/replace($1)/ge;       # do magic replacement: , -> ___ inside blocks of interest
    print "REP: $_\n" if $debug;
    my @data = split /,/;           # split into array
    pop @data;                      # pop last element of array
    my $line = join ',', @data;     # merge array back into a string
    $line =~ s/___/,/g;             # undo magic replacement
    $line =~ s/\|/"/g;              # restore | -> "
    printf "%s$line\n", $debug ? "OUT: " : '';  # print result
}

sub replace {
    my $line = shift;
    $line =~ s/,/___/g;             # do magic replacement inside our block
    return "|$line|";               # put | around block of interest
}
__DATA__
colA,colB,colC
1,2,3
4,5,"6,6"
8,3,"1,2",37,82
64,12,"1,2,3,4",42,56
"3,4,7,8",2,8,"8,7,6,5,4",2,8
"3,4,7,8",2,8,"8,7,6,5,4",2,8,"2,8,4,1"
"3,4,7,8",2,8,"8,7,6,5,4",2,8,"2,8,4,1",3,4
Appreciate your help. Below is the solution I ended up using:
cat file.csv | perl -MText::ParseWords -nle '@f = parse_line(",", 2, $_); tr/,/$/ for @f; print join ",", @f' | cut -d ',' -f 3 --complement | tr '$' ','
This will replace commas in fields surrounded by quotes with a $ sign, to be replaced back after discarding the unwanted last column.
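For reference, the intermediate stream after the first perl stage should look like this (commas inside quoted fields turned into $):
colA,colB,colC
1,2,3
4,5,"6$6"
so the subsequent cut can safely treat every remaining comma as a field separator.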
Recently, I had to sort several files according to records' ID; the catch was that there can be several types of records, and in each of those the field I had to use for sorting is on a different position. The fields, however, are easily identifiable thanks to key=value structure. To show a simple sample of the general structure:
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
I came up with a pipeline as follows, which did the job:
awk -F'[|=]' '{for(i=1; i<=NF; i++) {if($i ~ "id") {i++; print $i"?"$0} }}' tester.txt | sort -n | awk -F'?' '{print $2}'
In other words the algorithm is as follows:
Split the record by both field and key-value separators (| and =)
Iterate through the elements and search for the id key
Print the next element (value of id key), a separator, and the whole line
Sort numerically
Remove prepended identifier to preserve records' structure
Processing the sample gives the output:
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
Is there a way, though, to do this task using single awk command?
You may try this gnu-awk code to do this in a single command:
awk -F'|' '{
for(i=1; i<=NF; ++i)
if ($i ~ /^id=/) {
a[gensub(/^id=/, "", 1, $i)] = $0
break
}
}
END {
PROCINFO["sorted_in"] = "@ind_num_asc"
for (i in a)
print a[i]
}' file
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
We are using | as the field delimiter; when a field starts with id= we store the full record in array a, with the text after = as the index and the record as the value.
Using PROCINFO["sorted_in"] = "@ind_num_asc" we sort array a by the numerical value of its indices, and then the for loop prints the value part to get the sorted output.
Using GNU awk for the 3rd arg to match() and sorted_in:
$ cat tst.awk
match($0,/(^|\|)id=([0-9]+)/,a) {
ids2vals[a[2]] = $0
}
END {
PROCINFO["sorted_in"] = "@ind_num_asc"
for ( id in ids2vals ) {
print ids2vals[id]
}
}
$ awk -f tst.awk file
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
Try Perl: perl -e 'print map { s/^.*? //; $_ } sort { $a <=> $b } map { ($id) = /id=(\d+)/; "$id $_" } <>' file
Some explanation of the code I use:
print #print the resulting list of lines
map {
s/^.*? //;
$_
} #remove numeric id from start of line
sort { $a <=> $b } #sort numerically
map {
($id) = /id=(\d+)/;
"$id $_"
} # capture id and place it in start of line
<> # read all lines from file
Or try sed and sort: sed 's/^\(.*id=\([0-9][0-9]*\).*\)$/\2 \1/' file | sort -n | sed 's/^[^ ][^ ]* //'
With your shown samples only, please try the following (awk + sort + cut) solution. Written and tested in GNU awk; it should work in any awk.
awk '
match($0,/id=[0-9]+/){
print substr($0,RSTART,RLENGTH)";"$0
}
' Input_file | sort -t'=' -k2n | cut -d';' -f2-
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
match($0,/id=[0-9]+/){ ##Using awk match function to match id= followed by digits.
print substr($0,RSTART,RLENGTH)";"$0 ##printing the matched substring, then a semicolon, then the whole current line.
}
' Input_file | ##Mentioning Input_file here and passing awk output as a standard input to next command.
sort -t'=' -k2n | ##Sorting output with delimiter of = and by 2nd field then passing output to next command as an input.
cut -d';' -f2- ##Using cut command making delimiter as ; and printing everything from 2nd field onwards.
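To make the pipeline concrete, the intermediate stream between awk and sort should look like this (each record prefixed with its id= match and a semicolon):
id=2;fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
id=1;fieldD=valueD|recordType=B|id=1|fieldE=valueE
id=3;fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
sort -t'=' -k2n then orders these by the number after the first =, and cut strips the prefix.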
I have some data in this format saved in a string:
data = some-data.in.this.format
How can I perform a cut on $data so that I am only left with some-data?
You can use awk with -F (field-separator)
$ data=some-data.in.this.format
$ echo ${data} | awk -F "." '{print $1}'
some-data
Using a custom delimiter, in this case a dot (.), could do the trick. Try:
echo "data = some-data.in.this.format" | cut -d\. -f1
However, that returns everything before the first dot, which is literally what you asked for, so you will have:
data = some-data
Thus, if you want to get only some-data, I would use:
echo "data = some-data.in.this.format" | cut -d\. -f1 | awk '{ print $3 }'
If data = is always going to be the first thing on the line, you can do a substring parameter expansion:
offset=7
str="data = some-data.in.this.format"
value=${str:${offset}}
echo "$value"
Output:
some-data.in.this.format
If both the key and value are unknown, you can split on the first = in the string:
str="data = some-data.in.this.format"
key="${str%% =*}"
value="${str#*= }"
echo "$key"
echo "$value"
Output
data
some-data.in.this.format
See:
${parameter:offset}
${parameter#word}
${parameter%%word}
at Shell Parameter Expansion
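Putting the two expansions together gets some-data directly, with no external commands (a sketch using only the syntax referenced above):
str="data = some-data.in.this.format"
value="${str#*= }"      # strip the key and separator: some-data.in.this.format
value="${value%%.*}"    # strip everything from the first dot: some-data
echo "$value"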
I have one file, file1.dml, and another fixed-length data file, file2.dat.
The data in file1.dml looks like:
start
integer(16) field1 ;
string(1) filed2 ;
string(80) filed3 ;
decimal(16.2) field4;
string(1) newline = "\n";
end;
The data in file2.dat looks like:
12345678 ABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 1234567890
I need the output file to look like this:
field1="12345678 "
filed2="A"
filed3="BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB "
field4="1234567890 "
newline="\n"
I have written the below function, which accepts file1.dml and file2.dat and generates the exact result, but I want to simplify this using awk. Thanks in advance for any help.
function myfunc1
{
    if [ -z "$1" ] || [ -z "$2" ] || [ ! -f "$1" ] || [ ! -f "$2" ]; then
        echo "Input files not present"
    else
        dml_file=$1   # input parameter, dml file
        cntl_file=$2  # input parameter, dat file
        start_pos=1
        end_pos=0
        sed '1d;$d' "$dml_file" | while IFS= read -r line
        do
            counter=`echo $line | cut -d'(' -f2 | cut -d')' -f1`
            fld_name=`echo $line | cut -d'(' -f2 | cut -d')' -f2 | sed 's/;//g'`
            # check decimal or not
            if [[ $counter == +([0-9]) ]]; then
                end_pos=$((counter+start_pos))
            else
                counter=`echo $counter | cut -d'.' -f1`
                end_pos=$((counter+end_pos))
            fi
            newline_check=`echo $fld_name | grep -ic 'newline'`
            if [ "$newline_check" -gt 0 ]; then
                fld_name="newline"
                fld_val="\n"
                # write below line in one file
                echo "$fld_name : \"$fld_val\""
            else
                fld_val=`cut -c$start_pos-$end_pos "$cntl_file"`
                # write below line in one file
                echo "$fld_name : \"$fld_val\""
            fi
            start_pos=$((start_pos+counter))
        done
    fi
}
Since file2.dat only has a single line I'd start by reading this into a variable so that we don't have to continually scan the file; eg:
$ IFS= read -r rawdata < file2.dat # 'IFS=' needed in order to retain trailing white space
$ echo ".${rawdata}." # periods included to show that trailing white space retained
.12345678 ABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 1234567890 .
At this point we can pass the rawdata variable to an awk solution:
awk -v rd="${rawdata}" -F'[();=]' '
BEGIN { s = 1 ; nl = "\\n" }
/start|end/ { next }
/newline/ { gsub(/ /,"",$4)
nl = $4
next
}
{ split($2,a,".")
len = a[1]
gsub(/ /,"",$3)
fname = $3
printf "%s=\"%s\"\n", fname, substr(rd,s,len)
s += len
}
END { printf "newline=%s\n", nl }
' file1.dml
Where:
-v rd="${rawdata}" - define awk variable rd as containing the current value of ${rawdata}
-F '[();=]' - define 4 different input/field delimiters ((, ), ;, =); $2=field length, $3=field name, $4=newline character(s)
BEGIN { s = 1 ; nl = "\\n" } - initialize our start position in rd and a default newline character (\n)
/start|end/ { next } - ignore lines that contain start or end
/newline/ .... - if we see the pattern newline then remove spaces and set nl to this new value ($4)
next - stop processing for current line and go to next line of input
NOTE: for the rest of the lines in our input file:
split($2,a,".") / len = a[1] - split the 'length' field ($2) based on a period delimiter into array a, then set len to the first element of array a
gsub(/ /,"",$3) / fname = $3 - remove white space from the field name ($3) and assign resulting value to local variable fname
printf ... - output our line of data; use s and len to pull the desired substring from rd
s += len - add current len to s to get a new start position for next pass through the logic
END ... - when all input processing is done, print our newline record to stdout
Running the above generates the following:
field1="12345678 "
filed2="A"
filed3="BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB "
field4="1234567890 "
newline="\n"
Through a request made to the command line, I get a response in csv format. This is the file:
"STACK";"STACK_ID";"NAME";"DESCRIPTION";"CREATION_TIME";"DELETION_TIME";"STATUS"
"STACK";"arn:aws:cloudformation:us-west-1:222222222222:stack/LiveStream2/00000000-6000-00e2-acd3-333333333333";"LiveStream2";"(nil)";"2013-01-23T16:01:36Z";"2013-01-23T16:22:57Z";"DELETE_COMPLETE"
"STACK";"arn:aws:cloudformation:us-west-1:444444444444:stack/LiveStream/00000000-6000-00e2-acd3-222222222222";"LiveStream";"(nil)";"2013-01-23T13:53:13Z";"2013-01-23T15:20:29Z";"DELETE_COMPLETE"
With a script I would like to take the value of the NAME field and put it in a variable.
Can you help me?
If you don't know that NAME is the 3rd field, you have to hunt for it in the header line:
awk -F \; '
NR==1 {
for (i=1; i<=NF; i++) {
if ($i == "\"NAME\"") {
name_field = i
break
}
}
next
}
{ print $name_field }
' < filename
this outputs
"LiveStream2"
"LiveStream"
Well, the cut command should work well for you.
For example, if your file is test.txt,
cat test.txt | cut -d';' -f3 will get field 3.
If you want to process the value line by line, then use this:
while IFS= read -r i
do
    MYVAR=$(echo "$i" | cut -d';' -f3)
    # do something with the variable $MYVAR
done < test.txt
NOTE: this is just one approach to your question. There are several ways to achieve what you've asked.
This is the example code.
#!/bin/sh
OLDIFS=$IFS
IFS=';'
while read f1 f2 f3 rest
do
    echo "Name variable is: $f3"
done < filename.csv
IFS=$OLDIFS
The third field (NAME) is read into the variable f3, and its value is echoed for every line of the file.
Awk would be my best go at it.
declare NAMES=( $( cmd_that_generates_output | awk -F ';' 'NR==1{next}; { if( $3 ~ /^[^"].*[^"]$/){$3="\""$3"\""}; printf "%s ", $3 }' ) )
So bash executes the command substitution, i.e. everything inside the $( ); the cmd generates your csv and pipes it to awk.
Awk checks whether the input is the first line ( NR==1 ) and moves on if it is ( {next} ).
For every other line it checks that the 3rd column (fields being separated by ';', i.e. the name column) is properly encased in quotes:
if( $3 ~ /^[^"].*[^"]$/){$3="\""$3"\""};
then prints out the 3rd column followed by a space:
printf "%s ", $3
The $( ) has then expanded to a space-separated list of names, e.g.
declare NAMES=( name1 name2 )
Then the declare is executed, which creates an array called NAMES with each name as a single element. So:
x.txt is the output you gave in your question.
pete.mccabe@jackfrog$ cat x.txt
"STACK";"STACK_ID";"NAME";"DESCRIPTION";"CREATION_TIME";"DELETION_TIME";"STATUS"
"STACK";"arn:aws:cloudformation:us-west-1:222222222222:stack/LiveStream2/00000000-6000-00e2-acd3-333333333333";"LiveStream2";"(nil)";"2013-01-23T16:01:36Z";"2013-01-23T16:22:57Z";"DELETE_COMPLETE"
"STACK";"arn:aws:cloudformation:us-west-1:444444444444:stack/LiveStream/00000000-6000-00e2-acd3-222222222222";Live Stream;"(nil)";"2013-01-23T13:53:13Z";"2013-01-23T15:20:29Z";"DELETE_COMPLETE"
pete.mccabe@jackfrog$ declare NAMES=( $( cat x.txt | awk -F ';' 'NR==1{next}; { if( $3 ~ /^[^"].*[^"]$/){$3="\""$3"\""}; printf "%s ", $3 }' ) )
pete.mccabe@jackfrog$ echo ${NAMES[@]}
"LiveStream2" "Live Stream"
In Linux, if I have a file with entries like:
My Number is = 1234; #This is a random number
Can I use sed or anything else to replace all spaces after '#' with '+', so that the output looks like:
My Number is = 1234; #This+is+a+random+number
One way using awk:
awk -F# 'OFS=FS { gsub(" ", "+", $2) }1' file.txt
Result:
My Number is = 1234; #This+is+a+random+number
EDIT:
After reading comments below, if your file contains multiple #, you can try this:
awk -F# 'OFS=FS { for (i=2; i <= NF; i++) gsub(" ", "+", $i); print }' file.txt
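For reference, a sample run of the multiple-# variant should give:
$ echo 'My Number is = 1234; #This is a # random number' | awk -F# 'OFS=FS { for (i=2; i <= NF; i++) gsub(" ", "+", $i); print }'
My Number is = 1234; #This+is+a+#+random+number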
You can do this in pure shell...
$ foo="My Number is = 1234; #This is a random number"
$ echo -n "${foo%%#*}#"; echo "${foo#*#}" | tr ' ' '+'
My Number is = 1234; #This+is+a+random+number
$
Capturing this data to variables for further use is left as an exercise for the reader. :-)
Note that this also withstands multiple # characters on the line:
$ foo="My Number is = 1234; #This is a # random number"
$ echo -n "${foo%%#*}#"; echo "${foo#*#}" | tr ' ' '+'
My Number is = 1234; #This+is+a+#+random+number
$
Or if you'd prefer to create a variable rather than pipe through tr:
$ echo -n "${foo%%#*}#"; bar="${foo#*#}"; echo "${bar// /+}"
My Number is = 1234; #This+is+a+#+random+number
And finally, if you don't mind subshells with pipes, you could do this:
$ bar=$(echo -n "$foo" | tr '#' '\n' | sed -ne '2,$s/ /+/g;p' | tr '\n' '#')
$ echo "$bar"
My Number is = 1234; #This+is+a+#+random+number
$
And for the fun of it, here's a short awk solution:
$ echo $foo | awk -vRS=# -vORS=# 'NR>1 {gsub(/ /,"+")} 1'
My Number is = 1234; #This+is+a+#+random+number
#$
Note the trailing ORS. I don't know if it's possible to avoid a final record separator. I suppose you could get rid of that by piping the line above through head -1, assuming you're only dealing with the one line of input data.
Not terribly efficient, but:
perl -pe '1 while (s/(.*#[^ ]*) /$1+/);'
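For reference, this should behave like the other answers on a line with several # characters:
$ echo 'My Number is = 1234; #This is a # random number' | perl -pe '1 while (s/(.*#[^ ]*) /$1+/);'
My Number is = 1234; #This+is+a+#+random+number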
This might work for you (GNU sed):
echo 'My Number is = 1234; #This is a random number' |
sed 's/#/\n&/;h;s/.*\n//;y/ /+/;H;g;s/\n.*\n//'
My Number is = 1234; #This+is+a+random+number
Here is yet another perl one-liner:
echo 'My Number is = 1234; #This is a random number' \
| perl -F'#' -lane '$_ = join "#", @F[1..$#F]; s/ /+/g; print $F[0], "#", $_'
-F specifies how to split each input line into the @F array.
-an wraps stdin with:
while (<>) {
    @F = split('#');
    # code from -e goes here
}
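With the echoed sample line, this should print:
My Number is = 1234; #This+is+a+random+number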