Excel and awk disagree about CSV totals - linux

I have a CSV file that I'm totaling up two ways: one using Excel and the other using awk. Here are the totals of my first 8 columns in Excel:
1) 2640502474.00
2) 1272849386284.00
3) 36785.00
4)
5) 107.00
6) 239259.00
7) 0.00
8) 7418570893330.00
And here's my awk output:
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$1} END {printf("%01.2f\n", s)}'
2640502474.00
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$2} END {printf("%01.2f\n", s)}'
1272849386284.00
$ cat /home/jason/import.csv | awk -F "\"*,\"*" '{s+=$8} END {printf("%01.2f\n", s)}'
7411306364347.00
Notice how 1 and 2 match exactly but 8 is off by many millions. I'm assuming Excel's total is the correct one, so why is awk handling this file differently?

You likely have a comma formatted number contained in quotes. Excel will properly handle that number as a single field. Your regex for field separation in awk won't - a comma internal to a number is a valid separator according to that regex. It is very hard (and mostly futile) to try and handle optional nested escaping like what is possible in csv with a regex.
Compare the following to see what is likely going on:
$ echo '"1","10","15","1,000","14"' | awk -F "\"*,\"*" '{print $4}'
1
$ echo '"1","10","15","1,000","14"' | awk -F "\",\"" '{print $4}'
1,000
Note that the second regex above still has a problem with a trailing " in the last field and only works at all if all fields are consistently quoted. It is for illustration purposes only.
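To make the point concrete, here is a minimal sketch of totaling a column without relying on a field-separator regex. It walks each line character by character, tracks whether it is inside quotes, and strips thousands separators before adding. The function name sum_csv_col is made up for this example, and it assumes no embedded newlines or escaped quotes inside fields, so treat it as an illustration rather than a full CSV parser.

```shell
# sum_csv_col COL FILE  (hypothetical helper, not from the original answer)
# Totals one column of a CSV whose fields may be quoted and may
# contain commas inside the quotes, such as "1,000".
sum_csv_col() {
  awk -v col="$1" '
    {
      n = 0; field = ""; inq = 0
      len = length($0)
      for (i = 1; i <= len; i++) {
        c = substr($0, i, 1)
        if (c == "\"")
          inq = !inq                      # entering or leaving a quoted field
        else if (c == "," && !inq) {
          f[++n] = field; field = ""      # an unquoted comma ends the field
        } else
          field = field c
      }
      f[++n] = field                      # last field on the line
      v = (col <= n) ? f[col] : ""
      gsub(/,/, "", v)                    # drop thousands separators
      s += v
    }
    END { printf "%.2f\n", s }
  ' "$2"
}
```

With the question's file this would be called as sum_csv_col 8 /home/jason/import.csv. Whether the result then matches Excel depends on what is actually in the file.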

Related

combine consecutive awk calls into one

I'm trying to use awk to get a specific number out of a text file. The number can be identified by consecutively applying three rules:
Get only lines starting with the string Name(s):
In the 6th of the said lines, get the 3rd element. Elements are separated by one or more spaces
take 100 minus the number found
I got it working with two piped awk calls:
cat file | awk '/^Name\(s\):/' | awk -F " " 'NR==6 {printf "%2.2f", 100 - $3; exit}'
How can I combine the two awk calls into one?
Untested, as no sample file was provided, but:
$ awk '/^Name\(s\):/ && ++c==6 {printf "%2.2f", 100 - $3; exit}' file
You can put the awk statements into a file (for example, myprogram.awk) and use it like this:
awk -f myprogram.awk
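A small sketch of the counter idea with the line number made into a parameter. The function name nth_name_value is invented for this example. It assumes, as in the question, that fields are whitespace-separated and that the 3rd field of the matching line is numeric.

```shell
# nth_name_value N FILE  (hypothetical helper)
# Among lines starting with "Name(s):", take the Nth one and
# print 100 minus its 3rd field, then stop reading.
nth_name_value() {
  awk -v n="$1" '
    /^Name\(s\):/ && ++c == n {   # c counts only the matching lines
      printf "%.2f\n", 100 - $3
      exit
    }
  ' "$2"
}
```

The pattern does the job of the first awk call and the counter c replaces NR in the second, so nth_name_value 6 file corresponds to the original pipeline (apart from the trailing newline added here).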

Get last n characters of one field and complete second field of a string in Linux

I have 2 lines in a file :
MUMBAI,918889986665,POSTPAID,CRBT123,CRBT,SYSTEM,151004,MONTHLY,160201,160302
MUMBAI,912398456781,POSTPAID,SEGP,SEGP30,SMS,151004,MONTHLY,160201,160302
I wanted to cut field 2 and 4 in above lines. Condition is: from field 2, I need only ten digits.
Desired output:
8889986665,CRBT
2398456781,SEGP30
I am trying below command :
cut -d',' -f2 test.txt | cut -c3-12 && cut -d',' -f4 test.txt
My output:
8889986665
2398456781
CRBT
SEGP30
Kindly help me to achieve desired output.
Solution 2:
Here is the solution which will serve the purpose:
cut -d',' -f2,4 1 | sed 's/.*\([0-9]\{10\}\),\(.*\)/\1,\2/'
8889986665,CRBT123
2398456781,SEGP
cut will give us the second and fourth fields.
Inside sed, .* skips the initial characters until the first pattern is encountered.
The first pattern is 10 digits followed by a comma:
\([0-9]\{10\}\),
The second pattern is the rest of the line: \(.*\)
Now we print both patterns with a comma in between: \1,\2
Note that the number 10 can be replaced by the number of characters to be extracted before the delimiter, and [0-9] can be replaced by . if these characters can be of any type.
Solution 1:
Using cut will be easiest for you in this case.
You first need to get desired fields (2,4) filtered from the line and then do more filtering (only 10 characters from field #2)
$ cut -d',' -f2,4 test.txt | cut -c3-
8889986665,CRBT123
2398456781,SEGP
This job is best done using awk:
awk -F, -v n=10 '{print substr($2, length($2)-n+1, n) FS $5}' file
8889986665,CRBT
2398456781,SEGP30
The substr function returns the last n characters of the 2nd column.
sed -r 's/[^,]+,..([^,]+,)([^,]+,)([^,]+),.*/\1\3/' file
8889986665,CRBT123
2398456781,SEGP
cat test.txt | cut -f 2,4 -d ","
assuming your file is test.txt

Removing last column from rows that have three columns using bash

I have a file that contains several lines of data. Some lines contain three columns, but most contain only two. All lines are single-tab separated. For those that contain three columns, the third column is typically redundant and contains the same data as the second so I'd like to remove it.
I imagine awk or cut would be appropriate, but I'm drawing a blank on how to test the row for three columns so my script will only work on those rows. I know awk is a very powerful language with logic and whatnot built into it, I'm just not that strong with it.
I looked at a similar question, but I'm not sure what is going on with the awk answer. Should the -4 be -1 since I only want to remove one column? What about if the row has two columns; will it remove the second even though I don't want to do anything?
I modified it to what I think it would be:
awk -F"\t" -v OFS="\t" '{ for (i=1;i<=NF-4;i++){ print $i }}'
But when I run it (with the file) nothing happens. If I change it to NF-1 or NF-2 I get some output, but it is only a handful of lines and only the first column.
Can anyone clue me into what I should be doing?
If you just want to remove the third column, you could just print the first and the second:
awk -F '\t' '{print $1 "\t" $2}'
And it's similar to cut:
cut -f 1,2
The awk variable NF gives you the number of fields. So an expression like this should work for you.
awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
Running it on an input file like so
a,b,c
x,y
u,v,w
l,m
gives me
$ cat test | awk -F, 'NF == 3 {print $1 "," $2} NF != 3 {print $0}'
a,b
x,y
u,v
l,m
This might work for you (GNU sed):
sed 's/\t[^\t]*//2g' file
Restricts the file to two columns.
awk 'NF==3{print $1"\t"$2}NF==2{print}' your_file
Tested below:
> cat temp
1 2
3 4 5
6 7
8 9 10
>
> awk 'NF==3{print $1"\t"$2}NF==2{print}' temp
1 2
3 4
6 7
8 9
>
or in a simpler way in awk:
awk 'NF==3{print $1"\t"$2}NF==2' your_file
Or you can also go with perl:
perl -lane 'print "$F[0]\t$F[1]"' your_file
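Since the question's file is tab-separated, here is a sketch that sets both the input and output separators to a tab, so that fields containing spaces stay intact. The function name drop_third_column is made up for this example. It assumes rows have either two or three columns, as described in the question.

```shell
# drop_third_column FILE  (hypothetical helper)
# Rows with exactly three tab-separated columns lose the third;
# every other row is printed unchanged.
drop_third_column() {
  awk -F'\t' -v OFS='\t' '
    NF == 3 { print $1, $2; next }   # the comma in print inserts OFS (a tab)
    { print }
  ' "$1"
}
```

Unlike the plain cut -f 1,2 approach, this only touches rows that really have three columns, which matters if the file ever contains rows with four or more.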

unix - count of columns in file

Given a file with data like this (i.e. stores.dat file)
sid|storeNo|latitude|longitude
2|1|-28.03720000|153.42921670
9|2|-33.85090000|151.03274200
What would be a command to output the number of column names?
i.e. In the example above it would be 4. (number of pipe characters + 1 in the first line)
I was thinking something like:
awk '{ FS = "|" } ; { print NF}' stores.dat
but it returns all lines instead of just the first and for the first line it returns 1 instead of 4
awk -F'|' '{print NF; exit}' stores.dat
Just quit right after the first line.
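As for why the attempt in the question prints 1 for the first line: an assignment to FS inside a main action only affects records read after it, and by then the first line has already been split on whitespace. A sketch that sets FS in a BEGIN block instead, which is equivalent to -F'|' (the function name count_cols is made up for this example):

```shell
# count_cols FILE  (hypothetical helper)
# Prints the number of pipe-separated fields on the first line only.
count_cols() {
  awk '
    BEGIN   { FS = "|" }          # set before any input is read
    NR == 1 { print NF; exit }    # stop after the header line
  ' "$1"
}
```

For the stores.dat sample in the question, count_cols stores.dat prints 4.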
This is a workaround (for me: I don't use awk very often):
Display the first row of the file containing the data, replace all pipes with newlines and then count the lines:
$ head -1 stores.dat | tr '|' '\n' | wc -l
Unless you're using spaces in there, you should be able to use | wc -w on the first line.
wc is "Word Count", which simply counts the words in the input file. If you send only one line, it'll tell you the number of columns.
You could try
cat FILE | awk '{print NF}'
Perl solution similar to Mat's awk solution:
perl -F'\|' -lane 'print $#F+1; exit' stores.dat
I've tested this on a file with 1000000 columns.
If the field separator is whitespace (one or more spaces or tabs) instead of a pipe:
perl -lane 'print $#F+1; exit' stores.dat
If you have python installed you could try:
python -c 'import sys;f=open(sys.argv[1]);print(len(f.readline().split("|")))' \
stores.dat
This is usually what I use for counting the number of fields:
head -n 1 file.name | awk -F'|' '{print NF; exit}'
select any row in the file (in the example below, it's the 2nd row) and count the number of columns, where the delimiter is a space:
sed -n 2p text_file.dat | tr ' ' '\n' | wc -l
Proper pure bash way
Simply counting columns in file
Under bash, you could simply:
IFS=\| read -ra headline <stores.dat
echo ${#headline[@]}
4
Much quicker since there are no forks, and reusable since $headline holds the full header line. You could, for example:
printf " - %s\n" "${headline[@]}"
- sid
- storeNo
- latitude
- longitude
Note: this syntax correctly handles spaces and other characters in column names.
Alternative: strict check for the maximum number of columns across all rows
What if some rows contain extra columns?
This command finds the longest line, counting only separators:
tr -dc $'\n|' <stores.dat |wc -L
3
If there are at most 3 separators, then there are 4 fields. Or, if you consider that
each separator (|) is preceded by a Before and followed by an After, each trimmed to one letter:
tr -dc $'\n|' <stores.dat|sed 's/./b&a/g;s/ab/a/g;s/[^ab]//g'|wc -L
4
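wc -L is a GNU extension, so where it is not available the same check can be sketched in awk by tracking the largest NF seen. The function name max_cols is made up for this example, and it assumes the separator never appears inside a field.

```shell
# max_cols FILE  (hypothetical helper)
# Prints the largest number of pipe-separated fields found on any line.
max_cols() {
  awk -F'|' '
    NF > max { max = NF }      # remember the widest row so far
    END      { print max + 0 } # "+ 0" prints 0 for an empty file
  ' "$1"
}
```

If the result differs from the header's column count, at least one row carries extra columns.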
Counting columns in a CSV file
Under bash, you may use the csv loadable builtin:
enable -f /usr/lib/bash/csv csv
IFS= read -r line <file.csv
csv -a fields <<<"$line"
echo ${#fields[@]}
4
For more information, see How to parse a CSV file in Bash?.
Based on Cat Kerr's response.
This command works on Solaris:
awk '{print NF; exit}' stores.dat
You may try the following, which counts the separators on the first line (add 1 to get the number of columns):
head -1 stores.dat | grep -o \| | wc -l

How to cut first n and last n columns?

How can I cut off the first n and the last n columns from a tab delimited file?
I tried this to cut the first n columns, but I have no idea how to combine the first and last n columns:
cut -f 1-10 -d "<CTR>v <TAB>" filename
Cut can take several ranges in -f:
Columns up to 4 and from 7 onwards:
cut -f -4,7-
or for fields 1,2,5,6 and from 10 onwards:
cut -f 1,2,5,6,10-
etc
The first part of your question is easy. As already pointed out, cut accepts omission of either the starting or the ending index of a column range, interpreting this as meaning either “from the start to column n (inclusive)” or “from column n (inclusive) to the end,” respectively:
$ printf 'this:is:a:test' | cut -d: -f-2
this:is
$ printf 'this:is:a:test' | cut -d: -f3-
a:test
It also supports combining ranges. If you want, e.g., the first 3 and the last 2 columns in a row of 7 columns:
$ printf 'foo:bar:baz:qux:quz:quux:quuz' | cut -d: -f-3,6-
foo:bar:baz:quux:quuz
However, the second part of your question can be a bit trickier depending on what kind of input you’re expecting. If by “last n columns” you mean “last n columns (regardless of their indices in the overall row)” (i.e. because you don’t necessarily know how many columns you’re going to find in advance) then sadly this is not possible to accomplish using cut alone. In order to effectively use cut to pull out “the last n columns” in each line, the total number of columns present in each line must be known beforehand, and each line must be consistent in the number of columns it contains.
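When every line does have the same number of columns, the count can be taken from the first line and turned into a cut range. A minimal sketch, assuming a tab-delimited file whose rows are all as wide as the first one and contain more than 2n columns (the function name trim_columns is made up for this example):

```shell
# trim_columns N FILE  (hypothetical helper)
# Drops the first N and the last N tab-separated columns,
# using the first line to learn the total column count.
trim_columns() {
  n=$1
  file=$2
  total=$(head -n 1 "$file" | awk -F'\t' '{ print NF }')
  # cut uses a tab as its default delimiter, so -d is not needed here
  cut -f "$((n + 1))-$((total - n))" "$file"
}
```

If a row is wider or narrower than the first line, the range no longer lines up with its last n columns, which is exactly the limitation described above.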
If you do not know how many “columns” may be present in each line (e.g. because you’re working with input that is not strictly tabular), then you’ll have to use something like awk instead. E.g., to use awk to pull out the last 2 “columns” (awk calls them fields, the number of which can vary per line) from each line of input:
$ printf '/a\n/a/b\n/a/b/c\n/a/b/c/d\n' | awk -F/ '{print $(NF-1) FS $(NF)}'
/a
a/b
b/c
c/d
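The same idea generalized to the last n fields, as a sketch. The function name last_fields and its argument order are made up for this example. Lines with fewer than n fields are printed whole.

```shell
# last_fields N SEP [FILE...]  (hypothetical helper)
# Prints the last N fields of each line, joined by SEP.
# Reads standard input when no file is given.
last_fields() {
  n=$1
  sep=$2
  shift 2
  awk -F"$sep" -v n="$n" '{
    start = NF - n + 1
    if (start < 1) start = 1            # short line: keep every field
    out = ""
    for (i = start; i <= NF; i++)
      out = out (i > start ? FS : "") $i
    print out
  }' "$@"
}
```

For example, printf 'a:b:c:d\n' | last_fields 2 : keeps only c:d, regardless of how many fields precede them.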
You can cut using the following, where -d sets the delimiter, -f selects the fields,
and \t stands for the tab that separates the fields:
cut -d$'\t' -f 1-3,7-
To use AWK to cut off the first and last fields:
awk '{$1 = ""; $NF = ""; print}' inputfile
Unfortunately, that leaves the field separators, so
aaa bbb ccc
becomes
[space]bbb[space]
To do this using kurumi's answer which won't leave extra spaces, but in a way that's specific to your requirements:
awk '{delim = ""; for (i=2;i<=NF-1;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
This also fixes a couple of problems in that answer.
To generalize that:
awk -v skipstart=1 -v skipend=1 '{delim = ""; for (i=skipstart+1;i<=NF-skipend;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
Then you can change the number of fields to skip at the beginning or end by changing the variable assignments at the beginning of the command.
You can use Bash for that:
while read -a cols; do echo ${cols[@]:0:1} ${cols[@]:1,-1}; done < file.txt
You can use awk. For example, to cut off the 1st, 2nd and last 3 columns:
awk '{for(i=3;i<=NF-3;i++) print $i}' file
If you have a programming language such as Ruby (1.9+):
$ ruby -F"\t" -ane 'print $F[2..-3].join("\t")' file
Try the following:
echo a#b#c | awk -F"#" '{$1 = ""; $NF = ""; print}' OFS=""
Use
cut -b COLUMN_N_BEGINS-COLUMN_N_UNTIL INPUT.TXT > OUTPUT.TXT
-f doesn't work if you have "tabs" in the text file.

Resources