How to cut column data from flat file - Linux

I have data in the format below:
111,Ja,M,Oes,2012-08-03 16:42:00,x,xz
112,Ln,d,D,Gn,2012-08-03 16:51:00,y,yx
I need to create files with data in the sequence below:
111,x,xz
112,y,yx
The output keeps the first comma-separated value and the last two values; any number of fields may appear in between.
How can I generate the required output file from the input file on a Linux machine?

The awk statement for this is pretty straightforward: set the input and output field separators and print the fields using $1..$NF, where $NF holds the value of the last column:
awk 'BEGIN{FS=OFS=","}{print $1,$(NF-1),$NF}' input.csv > newfile.csv
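Run against the sample input above, this produces exactly the required lines:
$ awk 'BEGIN{FS=OFS=","}{print $1,$(NF-1),$NF}' input.csv
111,x,xz
112,y,yx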

Not much to this one in awk:
awk -F"," 'BEGIN{OFS=","}{print $1,$(NF-1), $NF}' inFile > outFile
We split the lines in awk on commas with -F",", then print the first field $1, the second-to-last field $(NF-1), and the last field $NF.
NF is the "number of fields", so subtracting 1 from it gives you the second-to-last field.
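To see that NF really does vary per line here, print it directly; with the sample input the two lines have different field counts, yet $(NF-1) and $NF still land on the right values:
$ awk -F, '{print NF}' inFile
7
8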

with sed
$ sed -r 's/([^,]+).*(,[^,]+,[^,]+)/\1\2/' file
111,x,xz
112,y,yx
or
$ sed -r 's/([^,]+).*((,[^,]+){2})/\1\2/' file
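The {2} quantifier makes this variant easy to generalize: to keep, say, the last three fields instead, bump it to {3} (a sketch along the same lines):
$ sed -r 's/([^,]+).*((,[^,]+){3})/\1\2/' file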

awk '{print substr($1,1,4) substr($2,10,4)}' file
111,x,xz
112,y,yx
This relies on the default whitespace splitting (the space inside the timestamp divides each line into $1 and $2) and on fixed-width values, so it breaks as soon as the field widths change.

Related

How to read a .csv file with shell command?

I have a .csv file which I need to extract values from. It is formatted like this:
First line of the file (no data)
1;Jack;Daniels;Madrid;484016;
2;Alice;Morgan;London;564127;
etc...
I would need a shell command that reads all lines of a specific column within a .csv, compares each with a string, and returns a value whenever it finds a matching line. In Java I would define it as something like:
> boolean findMatchInCSV(String valueToFind, int colNumber, String
> colSeparator)
The separator between columns may change, which is why I would like something fairly generic if possible :)
But I need it as a shell command. Is that possible?
Thanks
I would need a shell command that reads all lines
cat 1.csv # read the file
of a specific column within a .csv
cat 1.csv | cut -f5 -d';' # keep only the field #5 (use ';' as separator)
compares each with a string
# keep only the row where the value of the field is exactly 'foo'
cat 1.csv | cut -f5 -d';' | grep '^foo$'
returns a value whenever it finds a matching line.
This last request is unclear.
The code above displays the searched string (foo) once for each row where it is the value of column #5 (counting from 1). The columns are separated by ;.
Unfortunately, it doesn't handle quoted strings. If a field value contains the separator (;), the CSV format allows enclosing the value in double quotes (") so that the separator character is taken literally instead of splitting the field.
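For example (a hypothetical row, to illustrate): a quoted field containing the separator shifts every later field, so cut picks the wrong column:
$ echo '3;Anna;"Morgan;Jr";London;564127;' | cut -f5 -d';'
London
Field 5 should be 564127, but the ; inside the quotes is counted as a separator.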
I assume you're looking for something like
FILE=data.csv
VALUE="$1"
COLNUM=$2   # zero-based index into the bash array below
SEP="$3"
while IFS="$SEP" read -r -a myArray
do
    if [[ "${myArray[$COLNUM]}" == "$VALUE" ]]; then
        exit 0
    fi
done < <(tail -n +2 "$FILE")   # skip the header line
exit 1
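Usage might look like this (assuming the script is saved as findmatch.sh and made executable; the column index is zero-based because it indexes a bash array):
$ ./findmatch.sh Madrid 3 ';' && echo "match found"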
grep "my_string" file |awk -F ";" '{print $5}'
or
awk -F ";" '/my_string/ {print $5}' file
For 2nd column:
awk -F ";" '$2 ~ /my_string/ {print $5}' file
For exact matching:
awk -F ";" '$2 == "my_string" {print $5}' file

Subtract a constant number from a column

I have two large files (~10GB) as follows:
file1.csv
name,id,dob,year,age,score
Mike,1,2014-01-01,2016,2,20
Ellen,2, 2012-01-01,2016,4,35
.
.
file2.csv
id,course_name,course_id
1,math,101
1,physics,102
1,chemistry,103
2,math,101
2,physics,102
2,chemistry,103
.
.
I want to subtract 1 from the "id" columns of these files:
file1_updated.csv
name,id,dob,year,age,score
Mike,0,2014-01-01,2016,2,20
Ellen,1, 2012-01-01,2016,4,35
file2_updated.csv
id,course_name,course_id
0,math,101
0,physics,102
0,chemistry,103
1,math,101
1,physics,102
1,chemistry,103
I have tried awk '{print ($1 - 1) "," $0}' file2.csv, but it did not produce the correct result:
-1,id,course_name,course_id
0,1,math,101
0,1,physics,102
0,1,chemistry,103
1,2,math,101
1,2,physics,102
1,2,chemistry,103
You've added an extra column in your attempt. Instead, set your first field $1 to $1-1:
awk -F"," 'BEGIN{OFS=","} {$1=$1-1;print $0}' file2.csv
The semicolon separates the two commands. We set the input delimiter to comma (-F",") and the Output Field Separator to comma (BEGIN{OFS=","}). The subtraction on the first field executes first and the print executes second, so the entire record $0 contains the new $1 value by the time it's printed.
It might be helpful to only subtract 1 from records that are not your header. So you can add a condition to the first command:
awk -F"," 'BEGIN{OFS=","} NR>1{$1=$1-1} {print $0}' file2.csv
Now we only subtract when the record number (NR) is greater than 1. Then we just print the entire record.
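The same pattern covers file1.csv, where id is the second field rather than the first (a sketch following the same approach):
awk -F"," 'BEGIN{OFS=","} NR>1{$2=$2-1} {print $0}' file1.csv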

Is there a way to remove only consecutive duplicates?

I have a CSV input with these lines:
1,zzzz,xxxx,
1,xxxx,xyxy,
2,xxxx,xxxx,
3,yyyy,xxxx,
3,xxxx,yyyy,
3,xxxx,zzzz,
1,ffff,xxxx,
1,aaaa,xxxx,
And I need to discard lines where the first field matches that of the preceding line:
1,zzzz,xxxx,
2,xxxx,xxxx,
3,yyyy,xxxx,
1,ffff,xxxx,
I tried sort | uniq alone, but that didn't work: the lines differ in everything except the first field, and sorting would destroy the original order anyway.
Use awk instead of uniq:
awk -F, '$1 != last { last=$1; print }'
-F, sets the field separator to comma. $1 is the contents of the first field, so this prints the line whenever the first field changes.
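With the sample input this keeps exactly one line per run of equal keys:
$ awk -F, '$1 != last { last=$1; print }' file
1,zzzz,xxxx,
2,xxxx,xxxx,
3,yyyy,xxxx,
1,ffff,xxxx,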
I got the wanted output with uniq --check-chars=N: uniq then compares only the first N characters of each line, and since it only collapses adjacent duplicate lines, the same leading key can still appear again later in the list.
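For the sample input, where the key is a single character wide, that looks like this (note it only works while the key has a fixed width; the awk version keys on the whole field):
$ uniq --check-chars=1 file
1,zzzz,xxxx,
2,xxxx,xxxx,
3,yyyy,xxxx,
1,ffff,xxxx,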

LINUX: Using cat to remove columns in CSV - some have commas in the data

I need to remove some columns from a CSV. Easy.
The problem is that two of my columns hold free text that contains commas as part of the data. Those columns are enclosed in quotes, and my command counts the commas inside the text as column separators. How can I make the commas enclosed in quotes be ignored?
example:
"first", "last", "dob", "some long sentence, it has commas in it,", "some data", "foo"
I want to print only columns 1-4 and 6.
You will save yourself a lot of aggravation by writing a short Perl script that uses Parse::CSV http://metacpan.org/pod/Parse::CSV
I am sure there is a Python way of doing this too.
cat file | sed -e 's|^"||;s|"$||' | awk 'BEGIN {FS="[\"], ?[\"]"}{print $2}'
Example:
http://ideone.com/g2gZmx
How it works:
Look at the line:
"a,b","c,d","e,f"
We know that each field is enclosed in double quotes, so we can split the line on ",":
cat file | awk 'BEGIN {FS="[\"], ?[\"]"}{print $2}'
and the fields will be:
"a,b    c,d    e,f"
But we still have an annoying " at the start and end of the line, so we remove it with sed:
cat file | sed -e 's|^"||;s|"$||' | awk 'BEGIN {FS="[\"], ?[\"]"}{print $2}'
And the fields will be:
a,b    c,d    e,f
Then we can simply take the second field with awk '{print $2}'.
Read about regexp field splitting in awk: http://www.gnu.org/software/gawk/manual/html_node/Regexp-Field-Splitting.html
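If GNU awk 4+ is available, its FPAT variable describes what a field looks like instead of what separates fields, which handles the quoted commas directly (a sketch; the leading sed strips the spaces after the separating commas, which would otherwise confuse the field pattern):
$ sed 's/, "/,"/g' file | gawk -v FPAT='([^,]+)|("[^"]+")' -v OFS=',' '{print $1, $2, $3, $4, $6}'
This prints columns 1-4 and 6, quotes included.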

How to cut first n and last n columns?

How can I cut off the first n and the last n columns from a tab delimited file?
I tried this to cut the first n columns, but I have no idea how to combine the first and last n columns:
cut -f 1-10 -d "<Ctrl-V><Tab>" filename
Cut can take several ranges in -f:
Columns up to 4 and from 7 onwards:
cut -f -4,7-
or for fields 1,2,5,6 and from 10 onwards:
cut -f 1,2,5,6,10-
etc
The first part of your question is easy. As already pointed out, cut accepts omission of either the starting or the ending index of a column range, interpreting this as meaning either “from the start to column n (inclusive)” or “from column n (inclusive) to the end,” respectively:
$ printf 'this:is:a:test' | cut -d: -f-2
this:is
$ printf 'this:is:a:test' | cut -d: -f3-
a:test
It also supports combining ranges. If you want, e.g., the first 3 and the last 2 columns in a row of 7 columns:
$ printf 'foo:bar:baz:qux:quz:quux:quuz' | cut -d: -f-3,6-
foo:bar:baz:quux:quuz
However, the second part of your question can be a bit trickier depending on what kind of input you’re expecting. If by “last n columns” you mean “last n columns (regardless of their indices in the overall row)” (i.e. because you don’t necessarily know how many columns you’re going to find in advance) then sadly this is not possible to accomplish using cut alone. In order to effectively use cut to pull out “the last n columns” in each line, the total number of columns present in each line must be known beforehand, and each line must be consistent in the number of columns it contains.
If you do not know how many “columns” may be present in each line (e.g. because you’re working with input that is not strictly tabular), then you’ll have to use something like awk instead. E.g., to use awk to pull out the last 2 “columns” (awk calls them fields, the number of which can vary per line) from each line of input:
$ printf '/a\n/a/b\n/a/b/c\n/a/b/c/d\n' | awk -F/ '{print $(NF-1) FS $(NF)}'
/a
a/b
b/c
c/d
You can cut using the following: -d sets the delimiter, -f the fields, and $'\t' denotes a tab for tab-separated files:
cut -d$'\t' -f 1-3,7-
To use AWK to cut off the first and last fields:
awk '{$1 = ""; $NF = ""; print}' inputfile
Unfortunately, that leaves the field separators, so
aaa bbb ccc
becomes
[space]bbb[space]
To do this without leaving the extra separators, building on kurumi's answer but in a way that's specific to your requirements:
awk '{delim = ""; for (i=2;i<=NF-1;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
This also fixes a couple of problems in that answer.
To generalize that:
awk -v skipstart=1 -v skipend=1 '{delim = ""; for (i=skipstart+1;i<=NF-skipend;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}' inputfile
Then you can change the number of fields to skip at the beginning or end by changing the variable assignments at the beginning of the command.
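For example, skipping one field at each end:
$ echo 'aaa bbb ccc' | awk -v skipstart=1 -v skipend=1 '{delim = ""; for (i=skipstart+1;i<=NF-skipend;i++) {printf delim "%s", $i; delim = OFS}; printf "\n"}'
bbb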
You can use Bash for that:
while read -r -a cols; do echo "${cols[@]:1:${#cols[@]}-2}"; done < file.txt   # drops the first and last field of each line
you can use awk, for example, to cut off the 1st, 2nd and last 3 columns:
awk '{for(i=3;i<=NF-3;i++) printf "%s%s", $i, (i<NF-3 ? OFS : ORS)}' file
if you have a programming language such as Ruby (1.9+):
$ ruby -F"\t" -ane 'puts $F[2..-3].join("\t")' file
Try the following:
echo a#b#c | awk -F"#" '{$1 = ""; $NF = ""; print}' OFS=""
Use
cut -b COLUMN_N_BEGINS-COLUMN_N_UNTIL INPUT.TXT > OUTPUT.TXT
to select by byte position when the columns have fixed widths; -f doesn't apply there, because it splits on a delimiter (a tab by default) rather than on position.
