LINUX: Using cat to remove columns in CSV - some have commas in the data - linux

I need to remove some columns from a CSV. Easy.
The problem is I have two columns with full text that actually has commas in them as a part of the data. My cols are enclosed with quotes and the cat is counting the commas in the text as columns. How can I do this so the commas enclosed with quotes are ignored?
example:
"first", "last", "dob", "some long sentence, it has commas in it,", "some data", "foo"
i want to print only rows 1-4, 6

You will save yourself a lot of aggravation by writing a short Perl script that uses Parse::CSV http://metacpan.org/pod/Parse::CSV
I am sure there is a Python way of doing this too.

cat file | sed -e 's|^"||;s|"$||' | awk 'BEGIN {FS="[\"], ?[\"]"}{print $2}'
Example:
http://ideone.com/g2gZmx
How it works:
Look at line:
"a,b","c,d","e,f"
We know that each row is surrounded by "". So we can split this line by ",":
cat file | awk 'BEGIN {FS="[\"], ?[\"]"}{print $2}'
and rows will be:
"a,b c,d e,f"
But we have annoying " in the start and end of line. So we remove it with sed:
cat file | sed -e 's|^"||;s|"$||' | awk 'BEGIN {FS="[\"], ?[\"]"}{print $2}'
And rows will be
a,b c,d e,f
Then we can simply take second row by awk '{print $2}.
Read about regexp field splitting in awk: http://www.gnu.org/software/gawk/manual/html_node/Regexp-Field-Splitting.html

Related

How to merge column output to the end of a row in the previous column?

I have a .csv file containing three columns and I need to merge the value of column 2 with the end of the row of column 1.
The .csv file contains thousands of rows and this needs to be done for each row.
Iv'e tried using awk but I'm finding it difficult to get the code correct
cat file.csv | awk '{print $1, $2}'
awk '{if ($2!= " ") {print $1+$2 }}'
These of course don't work
Sample input:
The command used to produce the actual output is simply:
cat test.csv
[2,4,5,6,2,34,61,32,34,54,34, 22] 0.144354
[3,4,6,4,5,6,7,1,2,3,4,53,23, 34] 0.332453
[2,43,6,2,1,2,5,8,9,0,8,6,34, 21] 0.347643
Desired Output:
col1 col2
[2,4,5,6,2,34,61,32,34,54,34,22] 0.144354
[3,4,6,4,5,6,7,1,2,3,4,53,23,34] 0.332453
[2,43,6,2,1,2,5,8,9,0,8,6,34,21] 0.347643
Replace "comma followed by one or more spaces" with "comma":
sed 's/, \{1,\}/,/' file.csv
sed 's/, */,/g' file.csv
Print columns $1 and $2 as $1 (optionally separate with a tab):
awk '{print $1 $2, $3}' OFS='\t' file.csv
You can try:
awk '{printf("%s%s\t%s\n",$1,$2,$3)}' file.cvs
I only see spaces after a comma when you don't want them.
$: sed -E 's/,\s+/,/' file.csv
[2,4,5,6,2,34,61,32,34,54,34,22] 0.144354
[3,4,6,4,5,6,7,1,2,3,4,53,23,34] 0.332453
[2,43,6,2,1,2,5,8,9,0,8,6,34,21] 0.347643
Add -i (after the -E) to make it an in-place edit.
$: sed -Ei 's/,\s+/,/' file.csv
$: cat file.csv
[2,4,5,6,2,34,61,32,34,54,34,22] 0.144354
[3,4,6,4,5,6,7,1,2,3,4,53,23,34] 0.332453
[2,43,6,2,1,2,5,8,9,0,8,6,34,21] 0.347643

How to read a .csv file with shell command? [duplicate]

This question already has answers here:
Bash: Parse CSV with quotes, commas and newlines
(10 answers)
Closed 2 years ago.
I have a .csv file which I need to extract values from. It is formatted like this :
First line of the file (no data)
1;Jack;Daniels;Madrid;484016;
2;Alice;Morgan;London;564127;
etc...
I would need a shell command that read all lines of a specific column within a .csv, compare each with a string and return a value whenever it finds a matching line. In Java i would define it something like :
> boolean findMatchInCSV(String valueToFind, int colNumber, String
> colSeparator)
The separator between columns may indeed change that is why I would like a something quite generic if possible :)
But I need it as a shell command, is that possible ?
Thanks
I would need a shell command that read all lines
cat 1.csv # read the file
of a specific column within a .csv
cat 1.csv | cut -f5 -d';' # keep only the field #5 (use ';' as separator)
compare each with a string
# keep only the row where the value of the field is exactly 'foo'
cat 1.csv | cut -f5 -d';' | grep '^foo$'
return a value whenever it finds a matching line.
This last one request is unclear.
The code above displays the searched string (foo) once for each row where it is the value of column #5 (start counting from 1). The columns are separated by ;.
Unfortunately, it doesn't handle quoted strings. If the value in any field contains the separator (;), the CSV format allows enclosing the field value into double quotes (") to prevent the separator character be interpreted as a separator (forcing its literal value).
I assume you're looking for something like
FILE=data.csv
VALUE="$1"
COLNUM=$2
IFS="$3"
while read -r -a myArray
do
if "$myArray[$COLNUM]"=="$VALUE"; then
exit 0
fi
done < tail -n +2 $FILE
exit 1
grep "my_string" file |awk -F ";" '{print $5}'
or
awk -F ";" '/my_string/ {print $5}' file
For 2nd column:
awk -F ";" '$2 ~ /my_string/ {print $5}' file
For exact matching:
awk -F ";" '$2 == "my_string" {print $5}' file

How to cut column data from flat file

I've data in format below;
111,Ja,M,Oes,2012-08-03 16:42:00,x,xz
112,Ln,d,D,Gn,2012-08-03 16:51:00,y,yx
I need to create files with data in the sequence below:
111,x,xz
112,y,yz
In output format, we've first value before comma and last two comma prefix values. Here we can have any number of commas in-between.
Kindly advise, how can generate required output file from input file in Linux machine.
The Awk statement for this is pretty straight-forward. Set the input and output field separators and print the fields using $1..$NF, where $NF is the value of the last column,
awk 'BEGIN{FS=OFS=","}{print $1,$(NF-1),$NF}' input.csv > newfile.csv
Not much to this one in awk:
awk -F"," 'BEGIN{OFS=","}{print $1,$(NF-1), $NF}' inFile > outFile
We split the lines in awk with a comma -F"," and then print the first field $1, the second to last field $(NF-1), and the last field $NF.
NF is the "Number of fields" so subtracting 1 from it will give you the second to last item.
with sed
$ sed -r 's/([^,]+).*(,[^,]+,[^,]+)/\1\2/' file
111,x,xz
112,y,yx
or
$ sed -r 's/([^,]+).*((,[^,]+){2})/\1\2/' file
awk '{print substr($1,1,4) substr($2,10,4)}' file
111,x,xz
112,y,yx

unix - count of columns in file

Given a file with data like this (i.e. stores.dat file)
sid|storeNo|latitude|longitude
2|1|-28.03720000|153.42921670
9|2|-33.85090000|151.03274200
What would be a command to output the number of column names?
i.e. In the example above it would be 4. (number of pipe characters + 1 in the first line)
I was thinking something like:
awk '{ FS = "|" } ; { print NF}' stores.dat
but it returns all lines instead of just the first and for the first line it returns 1 instead of 4
awk -F'|' '{print NF; exit}' stores.dat
Just quit right after the first line.
This is a workaround (for me: I don't use awk very often):
Display the first row of the file containing the data, replace all pipes with newlines and then count the lines:
$ head -1 stores.dat | tr '|' '\n' | wc -l
Unless you're using spaces in there, you should be able to use | wc -w on the first line.
wc is "Word Count", which simply counts the words in the input file. If you send only one line, it'll tell you the amount of columns.
You could try
cat FILE | awk '{print NF}'
Perl solution similar to Mat's awk solution:
perl -F'\|' -lane 'print $#F+1; exit' stores.dat
I've tested this on a file with 1000000 columns.
If the field separator is whitespace (one or more spaces or tabs) instead of a pipe:
perl -lane 'print $#F+1; exit' stores.dat
If you have python installed you could try:
python -c 'import sys;f=open(sys.argv[1]);print len(f.readline().split("|"))' \
stores.dat
This is usually what I use for counting the number of fields:
head -n 1 file.name | awk -F'|' '{print NF; exit}'
select any row in the file (in the example below, it's the 2nd row) and count the number of columns, where the delimiter is a space:
sed -n 2p text_file.dat | tr ' ' '\n' | wc -l
Proper pure bash way
Simply counting columns in file
Under bash, you could simply:
IFS=\| read -ra headline <stores.dat
echo ${#headline[#]}
4
A lot quicker as without forks, and reusable as $headline hold the full head line. You could, for sample:
printf " - %s\n" "${headline[#]}"
- sid
- storeNo
- latitude
- longitude
Nota This syntax will drive correctly spaces and others characters in column names.
Alternative: strong binary checking for max columns on each rows
What if some row do contain some extra columns?
This command will search for bigger line, counting separators:
tr -dc $'\n|' <stores.dat |wc -L
3
If there are max 3 separators, then there are 4 fields... Or if you consider:
each separator (|) is prepended by a Before and followed by an After, trimed to 1 letter by word:
tr -dc $'\n|' <stores.dat|sed 's/./b&a/g;s/ab/a/g;s/[^ab]//g'|wc -L
4
Counting columns in a CSV file
Under bash, you may use csv loadable plugins:
enable -f /usr/lib/bash/csv csv
IFS= read -r line <file.csv
csv -a fields <<<"$line"
echo ${#fields[#]}
4
For more infos, see How to parse a CSV file in Bash?.
Based on Cat Kerr response.
This command is working on solaris
awk '{print NF; exit}' stores.dat
you may try:
head -1 stores.dat | grep -o \| | wc -l

output the 2nd column of a file

given a file with two columns, separatedly by standard white space
a b
c d
f g
h
how do I output the second column
cut -d' ' -f2
awk '{print $2}'
Because the last line of your example data has no first column you'll have to parse it as fixed width columns:
awk 'BEGIN {FIELDWIDTHS = "2 1"} {print $2}'
Use cut with byte offsets:
cut -b 3
Use sed to remove trailing columns:
sed s/..//
cut -c2 listdir
Here you can see for visualization:

Resources