Working with complex CSV from Linux command line - linux

I have a complex CSV file (linked externally, because even a small part of it wouldn't look nice on SO) in which a particular column may itself be composed of several columns separated by spaces. Here are the header rows and a few data rows:
reset,angle,sine,multiStepPredictions.actual,multiStepPredictions.1,anomalyScore,multiStepBestPredictions.actual,multiStepBestPredictions.1,anomalyLabel,multiStepBestPredictions:multiStep:errorMetric='altMAPE':steps=[1]:window=1000:field=sine,multiStepBestPredictions:multiStep:errorMetric='aae':steps=[1]:window=1000:field=sine
int,string,string,string,string,string,string,string,string,float,float
R,,,,,,,,,,
0,0.0,0.0,0.0,None,1.0,0.0,None,[],0,0
0,0.0314159265359,0.0314107590781,0.0314107590781,{0.0: 1.0},1.0,0.0314107590781,0.0,[],100.0,0.0314107590781
0,0.0628318530718,0.0627905195293,0.0627905195293,{0.0: 0.0039840637450199202 0.03141075907812829: 0.99601593625497931},1.0,0.0627905195293,0.0314107590781,[],66.6556977331,0.0313952597647
0,0.0942477796077,0.0941083133185,0.0941083133185,{0.03141075907812829: 1.0},1.0,0.0941083133185,0.0314107590781,[],66.63923621,0.0418293579232
0,0.125663706144,0.125333233564,0.125333233564,{0.06279051952931337: 0.98942669172932329 0.03141075907812829: 0.010573308270676691},1.0,0.125333233564,0.0627905195293,[],59.9506102238,0.0470076969512
0,0.157079632679,0.15643446504,0.15643446504,{0.03141075907812829: 0.0040463956041429626 0.09410831331851431: 0.94917381047888194 0.06279051952931337: 0.046779793916975114},1.0,0.15643446504,0.0941083133185,[],53.2586756624,0.0500713879053
0,0.188495559215,0.187381314586,0.187381314586,{0.12533323356430426: 0.85789473684210527 0.09410831331851431: 0.14210526315789476},1.0,0.187381314586,0.125333233564,[],47.5170631454,0.0520675034246
For viewing I am using this trick: column -s,$'\t' -t < *.csv | less -#2 -N -S, which is an upgraded version borrowed from Command line CSV viewer. With this trick it is explicitly clear which is the 1st, 2nd, 3rd, ... column, and which data in a particular column are composed of several space-separated values.
My question is whether there is any trick for manipulating such a complex CSV. I know I can use awk to filter the 5th column and then, from that filtered column, filter the 2nd part again to get the desired portion of the composed data, but I need to watch whether there was another composed column before the 5th (in which case I actually need the 6th column, and so on), and some columns may also contain a mix of composed and non-composed data. So awk is probably not the right tool.
The CSV viewer link mentions a tool called csvlook, which adds pipes to the output as separators. This could be easier to filter, because the pipes would delimit columns while whitespace would delimit the composed data within a column. But I cannot run csvlook with multiple delimiters (comma and tab) as I did with column, so it did not generate the data properly. What is the most comfortable way of handling this?

As long as your input doesn't contain columns with escaped embedded , chars., you should be able to parse it with awk, using , as the field separator; e.g.:
awk -F, '{ n = split($5, subField, "[[:blank:]]+"); for (i=1;i<=n;++i) print subField[i] }' file.csv
The above splits the 5th field into sub-fields by whitespace, using the split() function.
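For instance, to pull a single sub-field out of that column for every data row, something like this should work (a sketch; the number of header rows to skip and the sub-field index are illustrative assumptions, not from the question):
# skip the 3 header rows, then print the 2nd whitespace-separated sub-field of column 5
awk -F, 'NR > 3 { n = split($5, subField, "[[:blank:]]+"); if (n >= 2) print subField[2] }' file.csv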

Take a look at the cut command. You can specify a list of fields, or a range of fields.
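For example (a sketch; the field numbers are purely illustrative):
# print columns 1 and 5 through 7 of a comma-separated file
cut -d, -f1,5-7 file.csv
Note that cut cannot reorder fields and is unaware of quoted or embedded commas.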

Related

Combine first two columns of a single csv file into another column

So I have a large CSV file (gigabytes in size) with multiple columns; the first two columns are:
Invoice number|Line Item Number
I want a Unix/Linux/Ubuntu command that can merge these two columns into a new column whose parts are separated by ':'. For example, if the invoice number is 64789544 and the line item number is 234533, then my merged value should be
64789544:234533
Can this really be achieved? If yes, is it possible to add the merged column back to the source CSV file?
You can use the following sed command:
$ cat large.csv
Invoice number|Line Item Number|Other1|Other2
64789544|234533|abc|134
64744123|232523|cde|awc
$ sed -i.bak 's/^\([^|]*\)|\([^|]*\)/\1:\2/' large.csv
$ cat large.csv
Invoice number:Line Item Number|Other1|Other2
64789544:234533|abc|134
64744123:232523|cde|awc
Just be aware that it takes a backup of your input file (just in case), so you need to have enough space on your file system.
Explanations:
s/^\([^|]*\)|\([^|]*\)/\1:\2/ captures the first two |-separated fields of your CSV and, using back references, replaces the | between them with :, which merges the two columns.
If you are sure about what you are doing, you can change -i.bak to -i to avoid taking a backup of the CSV file.
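If you would rather keep the two original columns and append the merged value as a new last column instead, a small awk sketch could do it (this is an assumption about what you want, not part of the sed answer above; it reuses the | delimiter from the sample):
# append "invoice:lineitem" as an extra |-separated column on every line
awk -F'|' 'BEGIN { OFS = FS } { print $0, $1 ":" $2 }' large.csv > merged.csv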
Perhaps with this simple sed (without the g flag, s/// replaces only the first | on each line, which is exactly the separator between the first two fields):
sed 's/|/:/' infile

AWK - Show lines where column contains a specific string

I have a document (.txt) structured like this:
info1: info2: info3: info4
And I want to show some information by column.
For example, I have different information in the "info3" field, and I want to see only the lines that contain "test" in the "info3" column.
I think I have to use sort, but I'm not sure.
Any idea?
The previous answers are assuming that the third column is exactly equal to test. It looks like you were looking for columns where the value included test. We need to use awk's match function
awk -F: 'match($3, "test")' file
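An equivalent form, if you prefer awk's regular-expression match operator over the match() function, would be (a sketch, not part of the original answer):
awk -F: '$3 ~ /test/' file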
You can use awk for this. Assuming your columns are delimited by : and column 3 has entries equal to test, the command below lists only those lines:
awk -F':' '$3=="test"' input-file
Assuming that the spacing is consistent, and you're looking for only test in the third column, use
grep ".*:.*: test:.*" file.txt
Or to take care of any spacing that might occur
grep ".*:.*: *test *:.*" file.txt

use uniq -d on a particular column?

I have a text file like this:
john,3
albert,4
tom,3
junior,5
max,6
tony,5
I'm trying to fetch the records where the column-2 value is the same. My desired output:
john,3
tom,3
junior,5
tony,5
I'm wondering whether we can use uniq -d on the second column.
Here's one way using awk. It reads the input file twice, but avoids the need to sort:
awk -F, 'FNR==NR { a[$2]++; next } a[$2] > 1' file file
Results:
john,3
tom,3
junior,5
tony,5
Brief explanation:
FNR==NR is a common awk idiom that is true only while the first file in the argument list is being read. Here, the value of column two is used as an array key and its count is incremented. On the second read of the file, we simply check whether the count for that column-two value is greater than one (the next keyword skips the rest of the code during the first pass).
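If reading the file twice is not an option (for example, when the data arrives through a pipe), a one-pass variant can buffer the lines in memory instead; this is a sketch that trades memory for the second read:
# count column-2 values while buffering every line, then print only the lines
# whose column-2 value occurred more than once (input order is preserved)
awk -F, '{ count[$2]++; line[NR] = $0; key[NR] = $2 }
         END { for (i = 1; i <= NR; i++) if (count[key[i]] > 1) print line[i] }' file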
You can use uniq on fields (columns), but not easily in your case.
Uniq's -f and -s options skip fields and characters respectively when comparing lines. However, neither of these quite does what you want.
-f divides fields by whitespace and you separate them with commas.
-s skips a fixed number of characters and your names are of variable length.
Overall though, uniq is used to compress input by consolidating duplicates into unique lines. You are actually wishing to retain duplicates and eliminate singletons, which is the opposite of what uniq is used to do. It would appear you need a different approach.
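If you really want uniq in the pipeline, one workaround (a sketch, and it reads the file twice) is to let uniq -d list the duplicated column-2 values and then pull the matching lines back out with awk:
# 1) extract column 2, sort it, keep only the values that repeat
# 2) feed those values to awk on stdin ("-") and print the lines of the
#    original file whose 2nd field is one of them
cut -d, -f2 file | sort | uniq -d |
  awk -F, 'NR==FNR { dup[$1]; next } $2 in dup' - file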

Output columns format sort linux

Basically, after I sort I want my columns to be separated by tabs; right now they are separated by two spaces. The man pages did not have anything related to output formatting (at least I didn't notice it).
If it's not possible, I guess I have to use awk to sort and print. Any better alternative?
EDIT:
To clarify the question, the location of the double spaces is not consistent. I actually have data like this:
<date>\t<user>\t<message>.
I sort by date (year, month, day and time), which looks like
Wed Jan 11 23:44:30 CST 2012
and then want the output of the sorted data in the same format as the original file, that is
<date>\t<user>\t<message>.
EDIT 2: Seems like my test for tabs was wrong. I was copy-pasting a raw line from bash to my Windows box; that's why it wasn't recognized as a tab and instead showed as spaces. I downloaded the whole file to Windows and now I can see that the fields are tab-separated.
Also, I figured out that the field separation (\t, \n, ,, :, ;, etc.) is the same in the new file after sorting. That means that if I have tab-separated fields in the original file, my sorted file is also going to be tab-separated.
One last thing, the "correct" answer was not exactly the correct solution to the problem. I don't know if I can comment on my own thread and mark it as correct. If it is OK to do that, please let me know.
Thanks for the comments guys. Really appreciate your help!
Pipe your output to column:
sort <whatever> | column -t -s$'\t'
You can use sed:
sort data.txt | sed 's/  /\t/g'
(the pattern between the first two slashes is two blank spaces)
This will take the output of your sort operation and substitute a single tab for each occurrence of 2 consecutive blanks.
From what I understood, the file is already sorted and what you want is to replace the two separating spaces with a TAB character. In that case, use the following:
sed 's/  /\t/g' < sorted_file > new_formatted_file
(Be careful to copy/paste correctly the two spaces in the regular expression)

Sorting on a numerical value using csvfix for linux - turns numbers to strings

I'm using csvfix to sort a CSV file based on an integer (counter) value in the second column. However it seems that csvfix puts double quotes around all fields in the file, turning them to strings, before it performs the sort. The result is that the rows are sorted by the string value, such that "1000" comes before "2".
There is a command-line option -smq that is supposed to apply "smart quoting" but that's not helping me. If I use the command csvfix echo -smq file.csv, the output has no quotes around numerical fields, but when I pipe that into csvfix sort -f 2 file.csv, the file is written without quotes but still sorted in "string order". It makes no difference whether I include the -smq flag in the sort command or not.
Additionally I would like csvfix to ignore the first row of string headers. The csvfix issue tracker claims this is already implemented, but I can only find the -ifn flag, which seems to cut the header row out entirely.
These seem like pretty basic pieces of functionality for this tool, so I'm probably missing something very simple. Hoping someone on here has used csvfix and has figured this out.
According to the online documentation for csvfix, sort has an N option for numeric sorts:
csvfix sort -f 2:N file.csv
Having said this, CSV isn't a particularly good format for text manipulation. If possible, you're much better off choosing DSV (delimiter separated values) such as Tab or Pipe separated, so that you can simply pipe the output to sort, which has ample capability to sort by field, using whatever collation method you need.
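For instance, with pipe-separated data a numeric sort on the second field becomes a one-liner with GNU sort (a sketch, assuming there is no header row to skip):
# sort numerically on the 2nd |-separated field
sort -t'|' -k2,2n file.psv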
