Combine multiple CSV files into one using awk - Linux

I want to combine two .csv files based on a unique id that exists in both files.
The first file consists of 17 columns and the second of 2 columns, where in both files the first column is the same unique id.
In the resulting third file I would like 18 columns.
I have been trying paste:
paste -d ' ' SPOOL1.csv SPOOL2.csv > MERGED.csv
but that of course does not take the unique id column into consideration.
I'm not proficient in awk, so all help is appreciated.
Thanks

Sounds like, if the files are sorted, then
join SPOOL1 SPOOL2 > MERGED
should get you closer, provided you deal with the delimiters (not shown).
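To make that concrete, here is a minimal sketch, assuming both files are comma-separated and the first field is the shared unique id. The join variant (bash, using process substitution to sort on the fly) handles the sorted requirement; the awk variant does not require sorted input, and assumes every id in SPOOL1.csv also appears in SPOOL2.csv:

join -t ',' <(sort -t ',' -k1,1 SPOOL1.csv) <(sort -t ',' -k1,1 SPOOL2.csv) > MERGED.csv

awk -F',' -v OFS=',' '
    NR == FNR { extra[$1] = $2; next }    # first pass: SPOOL2.csv (id, value)
    $1 in extra { print $0, extra[$1] }   # second pass: SPOOL1.csv, append value as column 18
' SPOOL2.csv SPOOL1.csv > MERGED.csv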

Related

Split one large .csv file into multiple csv files based on row count using c#

I have a CSV file with the below data. I need to split it based on row count using C#.
H1,H2,H3,H4,H5
1,2,3,4,5
12,11,8,7,6
23,23,34,1,0
23,23,32,1,0
For example, if the row count is 2, it will split into two files:
File1.csv (first two records)
H1,H2,H3,H4,H5
1,2,3,4,5
12,11,8,7,6
File2.csv (Next two records)
H1,H2,H3,H4,H5
23,23,34,1,0
23,23,32,1,0
Can anyone help with this?
Thanks
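No answer is shown here. As a hedged shell sketch (awk rather than the requested C#), the split can be done while repeating the header in every chunk; the input name input.csv and the output names File1.csv, File2.csv, ... are assumptions for illustration:

awk -v rows=2 '
    NR == 1 { header = $0; next }      # remember the header line
    (NR - 2) % rows == 0 {             # start of a new chunk
        if (out) close(out)
        out = "File" (++n) ".csv"
        print header > out
    }
    { print > out }
' input.csv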

How do I compare two spreadsheets to identify missing line items and add them?

I am trying to compare two .xlsx files. What I am looking to do is basically the following:
Does any cell in column B of file1 exist in column B of file2?
If yes, continue.
Else, add the row to file2
The structure of the files is different, so I would also need to reorganize the information being added to file2 to match its format, but I think I could do that myself once I know how to do the transfer.
The files are basically a vulnerability export from ACAS and a POA&M. I want to add any existing vulnerabilities from the export that are not already represented on the POA&M.
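No answer is shown here. A minimal shell sketch, assuming both workbooks are first exported to CSV (acas_export.csv and poam.csv are hypothetical names), that column B is the second comma-separated field, and that no field contains embedded commas:

awk -F',' '
    NR == FNR { seen[$2] = 1; next }   # first pass: poam.csv, remember its column B values
    !($2 in seen)                      # second pass: print export rows not already in the POA&M
' poam.csv acas_export.csv > missing_rows.csv
cat missing_rows.csv >> poam.csv

The appended rows would still need their columns rearranged to match the POA&M layout, as noted in the question.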

Combine first two columns of a single csv file into another column

So I have a large CSV file (gigabytes in size) with multiple columns, where the first two columns are:
Invoice number|Line Item Number
I want a unix / linux / ubuntu command which can merge these two columns into a new column joined by the separator ':'. For example, if the Invoice number is 64789544 and the Line Item Number is 234533, then my merged value should be
64789544:234533
Can it really be achieved? If yes, can the merged column be added back to the source csv file?
You can use the following sed command:
$ cat large.csv
Invoice number|Line Item Number|Other1|Other2
64789544|234533|abc|134
64744123|232523|cde|awc
$ sed -i.bak 's/^\([^|]*\)|\([^|]*\)/\1:\2/' large.csv
$ cat large.csv
Invoice number:Line Item Number|Other1|Other2
64789544:234533|abc|134
64744123:232523|cde|awc
Just be aware that it will take a backup of your input file just in case, so you need to have enough space in your file system.
Explanations:
s/^\([^|]*\)|\([^|]*\)/\1:\2/ matches the first two |-separated fields of your CSV and replaces the separator between them with :, using back references, which merges the two columns.
If you are sure about what you are doing, you can change -i.bak to -i to avoid taking a backup of the CSV file.
Or perhaps with this simpler sed:
sed 's/|/:/' infile
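An awk alternative (not from the original answers), assuming the same |-delimited layout and no embedded | characters, which joins the first two fields with : and leaves the rest of each line unchanged:

awk -F'|' '{
    out = $1 ":" $2                              # merge the first two fields
    for (i = 3; i <= NF; i++) out = out "|" $i   # re-append the remaining fields
    print out
}' large.csv > merged.csv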

How to get a count of same values in the same column of two files in linux shell?

If you have two files in the same tab-separated format, and you want to get a count of how many values in a given column are the same between the two files, what would be the best way to do that?
Example:
I have five columns of tab-separated data; column two of file1 is as follows:
234839
349583
444995
694038
785948
and column 2 of file2 is this:
123943
234839
338273
349583
785948
The expected output would be 3.
It depends: do you want a mapping between values and counts, or is the value one of the inputs?
Either way, you can probably do it by piping cat, cut, grep, and wc -l.
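A minimal sketch of that pipeline (file1 and file2 are the names used in the question; column 2 is the second tab-separated field):

cut -f2 file1 | sort -u > vals1                      # unique column-2 values of file1
cut -f2 file2 | sort -u | grep -F -x -c -f vals1     # count which of file2's values also appear in vals1

For the example above this prints 3.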

Working with complex CSV from Linux command line

I have a complex CSV file (here is an external link because even a small part of it wouldn't look nice on SO) where a particular column may be composed of several sub-columns separated by spaces.
reset,angle,sine,multiStepPredictions.actual,multiStepPredictions.1,anomalyScore,multiStepBestPredictions.actual,multiStepBestPredictions.1,anomalyLabel,multiStepBestPredictions:multiStep:errorMetric='altMAPE':steps=[1]:window=1000:field=sine,multiStepBestPredictions:multiStep:errorMetric='aae':steps=[1]:window=1000:field=sine
int,string,string,string,string,string,string,string,string,float,float
R,,,,,,,,,,
0,0.0,0.0,0.0,None,1.0,0.0,None,[],0,0
0,0.0314159265359,0.0314107590781,0.0314107590781,{0.0: 1.0},1.0,0.0314107590781,0.0,[],100.0,0.0314107590781
0,0.0628318530718,0.0627905195293,0.0627905195293,{0.0: 0.0039840637450199202 0.03141075907812829: 0.99601593625497931},1.0,0.0627905195293,0.0314107590781,[],66.6556977331,0.0313952597647
0,0.0942477796077,0.0941083133185,0.0941083133185,{0.03141075907812829: 1.0},1.0,0.0941083133185,0.0314107590781,[],66.63923621,0.0418293579232
0,0.125663706144,0.125333233564,0.125333233564,{0.06279051952931337: 0.98942669172932329 0.03141075907812829: 0.010573308270676691},1.0,0.125333233564,0.0627905195293,[],59.9506102238,0.0470076969512
0,0.157079632679,0.15643446504,0.15643446504,{0.03141075907812829: 0.0040463956041429626 0.09410831331851431: 0.94917381047888194 0.06279051952931337: 0.046779793916975114},1.0,0.15643446504,0.0941083133185,[],53.2586756624,0.0500713879053
0,0.188495559215,0.187381314586,0.187381314586,{0.12533323356430426: 0.85789473684210527 0.09410831331851431: 0.14210526315789476},1.0,0.187381314586,0.125333233564,[],47.5170631454,0.0520675034246
For viewing I am using this trick: column -s,$'\t' -t < *.csv | less -#2 -N -S, which is an upgraded version borrowed from Command line CSV viewer. With this trick it is explicitly clear which is the 1st, 2nd, 3rd ... column and which data in a particular column is composed of several space-separated values.
My question is whether there is any trick for manipulating such a complex CSV. I know that I can use awk to filter the 5th column, and then filter that again to get the desired portion of the composed data, but I would need to check whether there was another composed column before the 5th (in which case I would actually need the 6th column, not the 5th, and so on); some columns may also contain a mix of composed and non-composed data. So awk is probably not the right tool.
The CSV viewer link mentions a tool called csvlook, which adds pipes to the output as separators. This could be easier to filter, because pipes would delimit columns and whitespace would delimit the composed data within one column. But I cannot run csvlook with multiple delimiters (comma and tab) as I did for column, so it did not generate the data properly. What is the most comfortable way of handling this?
As long as your input doesn't contain columns with escaped embedded , chars., you should be able to parse it with awk, using , as the field separator; e.g.:
awk -F, '{ n = split($5, subField, "[[:blank:]]+"); for (i=1;i<=n;++i) print subField[i] }' file.csv
The above splits the 5th field into sub-fields by whitespace, using the split() function.
Take a look at the cut command. You can specify a list of fields or a range of fields.
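For example (the field numbers here are just for illustration), to pull a single comma-delimited field or a range of them:

cut -d',' -f5 file.csv         # only the 5th field
cut -d',' -f2,5-7 file.csv     # fields 2 and 5 through 7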
