Combine first two columns of a single csv file into another column - linux

I have a large CSV file (several GB) with multiple columns. The first two columns are:
Invoice number|Line Item Number
I want a Unix/Linux (Ubuntu) command that merges these two columns into a new column, using ':' as the separator. For example, if the invoice number is 64789544 and the line item number is 234533, the merged value should be
64789544:234533
Can this be done? If so, can the merged column be added back to the source CSV file?

You can use the following sed command:
$ cat large.csv
Invoice number|Line Item Number|Other1|Other2
64789544|234533|abc|134
64744123|232523|cde|awc
$ sed -i.bak 's/^\([^|]*\)|\([^|]*\)/\1:\2/' large.csv
$ cat large.csv
Invoice number:Line Item Number|Other1|Other2
64789544:234533|abc|134
64744123:232523|cde|awc
Be aware that -i.bak keeps a backup of your input file (large.csv.bak), so you need enough free space on the file system for a second copy.
Explanation:
s/^\([^|]*\)|\([^|]*\)/\1:\2/ captures the first two |-separated fields of each line and, using back references, writes them back joined by : instead of |, which merges the two columns.
If you are sure about what you are doing, you can change -i.bak to -i to skip the backup of the CSV file.
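The sed command above replaces the two original columns with the merged one. If the goal is instead to keep both columns and add the merged value as an extra column, a minimal awk sketch could look like the following. It assumes the merged column should go at the end of each row; merged.csv is a hypothetical output name, and the sample rows are the ones shown above.

```shell
# Recreate the small sample from the answer above.
cat > large.csv <<'EOF'
Invoice number|Line Item Number|Other1|Other2
64789544|234533|abc|134
64744123|232523|cde|awc
EOF

# Keep every original field and append "field1:field2" as a new last column.
# merged.csv is a hypothetical output name.
awk 'BEGIN { FS = OFS = "|" } { print $0, $1 ":" $2 }' large.csv > merged.csv

cat merged.csv
```

Once the result looks right, mv merged.csv large.csv would put it back under the original name. Like sed -i.bak, this needs room for a second copy of the file while both exist.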

Perhaps with this simple sed, which replaces only the first | on each line:
sed 's/|/:/' infile


Merging two excel file into a third one with the common heading

I have two Excel files that share the headings "StudentID" and "StudentName". I want to merge these two files into a third Excel file that contains all the records from both, under the common headings. How can I do this with Linux commands?
I assumed these were CSV files, as it would be far more complicated with .xlsx files.
cp first_file.csv third_file.csv
tail -n +2 second_file.csv >> third_file.csv
The first line copies your first file into a new file called third_file.csv. The second line appends the content of the second file, starting from its second line (which skips the header).
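The same idea extends to any number of input files with a single awk call. This is only a sketch, assuming every file starts with the same one-line header; the sample rows below are hypothetical.

```shell
# Hypothetical sample inputs, each with the same header line.
cat > first_file.csv <<'EOF'
StudentID,StudentName
1,John Smith
EOF
cat > second_file.csv <<'EOF'
StudentID,StudentName
2,Arthur Dent
EOF

# FNR restarts at 1 for every input file, NR keeps counting.
# "FNR == 1 && NR > 1" is therefore true only for the header of the
# second and later files, and those lines are skipped.
awk 'FNR == 1 && NR > 1 { next } { print }' first_file.csv second_file.csv > third_file.csv

cat third_file.csv
```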
Due to your requirement to do this with "Linux commands" I assume that you have two CSV files rather than XLSX files.
If so, the Linux join command is a good fit for a problem like this.
Imagine your two files are:
# file1.csv
Student ID,Student Name,City
1,John Smith,London
2,Arthur Dent,Newcastle
3,Sophie Smith,London
and:
# file2.csv
Student ID,Student Name,Subjects
1,John Smith,Maths
2,Arthur Dent,Philosophy
3,Sophie Smith,English
We want to do an equality join on the Student ID field (or we could use Student Name, it doesn't matter since both are common to each).
We can do this using the following command:
$ join -1 1 -2 1 -t, -o 1.1,1.2,1.3,2.3 file1.csv file2.csv
Student ID,Student Name,City,Subjects
1,John Smith,London,Maths
2,Arthur Dent,Newcastle,Philosophy
3,Sophie Smith,London,English
By way of explanation, this join command written as SQL would be something like:
SELECT `Student ID`, `Student Name`, `City`, `Subjects`
FROM `file1.csv`, `file2.csv`
WHERE `file1.Student ID` = `file2.Student ID`
The options to join mean:
The "SELECT" clause:
-o 1.1,1.2,1.3,2.3 means select the first file's first field, the first file's second field, the first file's third field, and the second file's third field.
The "FROM" clause:
file1.csv file2.csv, i.e. the two filename arguments passed to join.
The "WHERE" clause:
-1 1 means join from the 1st field from the Left table
-2 1 means join to the 1st field from the Right table (-1 = Left; -2 = Right)
Also:
-t, tells join to use the comma as the field separator
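One thing the example relies on: join expects both inputs to be sorted on the join field, and the sample files happen to be in order already. For files where that is not guaranteed, a sketch along these lines might help. It assumes each file has exactly one header line; body1.tmp, body2.tmp and merged.csv are hypothetical names, and the rows are the sample rows above in shuffled order.

```shell
# Sample data from the answer above, deliberately out of order.
cat > file1.csv <<'EOF'
Student ID,Student Name,City
3,Sophie Smith,London
1,John Smith,London
2,Arthur Dent,Newcastle
EOF
cat > file2.csv <<'EOF'
Student ID,Student Name,Subjects
2,Arthur Dent,Philosophy
3,Sophie Smith,English
1,John Smith,Maths
EOF

# Sort only the data rows on field 1; the header must not take part.
# The sort is textual (no -n), which is the order join expects.
tail -n +2 file1.csv | sort -t, -k1,1 > body1.tmp
tail -n +2 file2.csv | sort -t, -k1,1 > body2.tmp

{
  # Header: all of file1's columns plus the third column name of file2.
  printf '%s,%s\n' "$(head -n 1 file1.csv)" "$(head -n 1 file2.csv | cut -d, -f3)"
  # join uses field 1 of both files by default.
  join -t, -o 1.1,1.2,1.3,2.3 body1.tmp body2.tmp
} > merged.csv

rm -f body1.tmp body2.tmp
cat merged.csv
```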
@Corentin Limier Thanks for the answer.
I was able to achieve the same in a similar way, shown below.
Say there are two files, a.xls and b.xls, to be merged into a third file, c.xls:
cat a.xls > c.xls && tail -n +2 b.xls >> c.xls

How to select rows with information in a text file by using linux command

I want to select certain information that sits on separate rows of a text file. How do I do this?
I want to select the row that contains "SUBSCRIBERIDENTIFIER" and the row that contains "LATEST_OFFLINE_TIME".
My output should look like the following:
SUBSCRIBERIDENTIFIER=23481XXXXXX02
LATEST_OFFLINE_TIME=20170330191209
$ awk '/'"SUBSCRIBERIDENTIFIER"'/,/'"LATEST_OFFLINE_TIME"'/' file.txt
SUBSCRIBERIDENTIFIER=1923UIO1U23I1O
LATEST_OFFLINE_TIME=128390812903810983019
$ awk '/SUBSCRIBERIDENTIFIER/,/LATEST_OFFLINE_TIME/' file.txt
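Note that a range pattern prints every line from the first match through the second, including anything in between. If only the two named lines are wanted, matching each key on its own is an alternative. A minimal sketch, assuming each record is a KEY=value line; the STATUS line is hypothetical filler to show that it is left out.

```shell
# Hypothetical input: the two wanted lines with an unrelated line between them.
cat > file.txt <<'EOF'
SUBSCRIBERIDENTIFIER=23481XXXXXX02
STATUS=ACTIVE
LATEST_OFFLINE_TIME=20170330191209
EOF

# Print a line only if it starts with one of the two keys.
awk '/^SUBSCRIBERIDENTIFIER=/ || /^LATEST_OFFLINE_TIME=/' file.txt
```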

Working with complex CSV from Linux command line

I have a complex CSV file (linked externally, because even a small part of it wouldn't look nice on SO) in which a particular column may itself be composed of several values separated by spaces.
reset,angle,sine,multiStepPredictions.actual,multiStepPredictions.1,anomalyScore,multiStepBestPredictions.actual,multiStepBestPredictions.1,anomalyLabel,multiStepBestPredictions:multiStep:errorMetric='altMAPE':steps=[1]:window=1000:field=sine,multiStepBestPredictions:multiStep:errorMetric='aae':steps=[1]:window=1000:field=sine
int,string,string,string,string,string,string,string,string,float,float
R,,,,,,,,,,
0,0.0,0.0,0.0,None,1.0,0.0,None,[],0,0
0,0.0314159265359,0.0314107590781,0.0314107590781,{0.0: 1.0},1.0,0.0314107590781,0.0,[],100.0,0.0314107590781
0,0.0628318530718,0.0627905195293,0.0627905195293,{0.0: 0.0039840637450199202 0.03141075907812829: 0.99601593625497931},1.0,0.0627905195293,0.0314107590781,[],66.6556977331,0.0313952597647
0,0.0942477796077,0.0941083133185,0.0941083133185,{0.03141075907812829: 1.0},1.0,0.0941083133185,0.0314107590781,[],66.63923621,0.0418293579232
0,0.125663706144,0.125333233564,0.125333233564,{0.06279051952931337: 0.98942669172932329 0.03141075907812829: 0.010573308270676691},1.0,0.125333233564,0.0627905195293,[],59.9506102238,0.0470076969512
0,0.157079632679,0.15643446504,0.15643446504,{0.03141075907812829: 0.0040463956041429626 0.09410831331851431: 0.94917381047888194 0.06279051952931337: 0.046779793916975114},1.0,0.15643446504,0.0941083133185,[],53.2586756624,0.0500713879053
0,0.188495559215,0.187381314586,0.187381314586,{0.12533323356430426: 0.85789473684210527 0.09410831331851431: 0.14210526315789476},1.0,0.187381314586,0.125333233564,[],47.5170631454,0.0520675034246
For viewing I use the trick column -s,$'\t' -t < *.csv | less -#2 -N -S, which is an upgraded version of one borrowed from Command line CSV viewer. With this trick it is clear which column is the 1st, 2nd, 3rd and so on, and which data within a particular column is composed of several space-separated values.
My question is whether there is any trick for manipulating such a complex CSV. I know that I can use awk to filter the 5th column and then filter the 2nd sub-column from that to get the desired portion of the composed data, but I need to watch for another composed column occurring before the 5th (in which case I actually need the 6th column, not the 5th, and so on). Some columns may also contain a mix of composed and non-composed data. So awk is probably not the right tool.
The CSV viewer link mentions a tool called csvlook, which adds pipes to the output as separators. This could be easier to filter, because pipes would delimit columns and whitespace would delimit the composed data within one column. But I cannot run csvlook with multiple delimiters (comma and tab) as I did for column, so it did not generate the data properly. What is the most comfortable way of handling this?
As long as your input doesn't contain columns with quoted or escaped embedded commas, you should be able to parse it with awk, using , as the field separator; e.g.:
awk -F, '{ n = split($5, subField, "[[:blank:]]+"); for (i=1;i<=n;++i) print subField[i] }' file.csv
The above splits the 5th field into sub-fields by whitespace, using the split() function.
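Building on that, the sub-fields in this data come in "key: value" pairs wrapped in braces, so they can be paired up after splitting. The following is only a sketch under that assumption; split_field5 and sample.csv are hypothetical names, and the row is one of the data rows from the question.

```shell
# One data row copied from the question (sample.csv is a hypothetical name).
cat > sample.csv <<'EOF'
0,0.0628318530718,0.0627905195293,0.0627905195293,{0.0: 0.0039840637450199202 0.03141075907812829: 0.99601593625497931},1.0,0.0627905195293,0.0314107590781,[],66.6556977331,0.0313952597647
EOF

# split_field5 is a hypothetical helper: it prints one "key<TAB>value"
# line per pair found in the 5th comma-separated field.
split_field5() {
  awk -F, '{
    field = $5
    gsub(/[{}]/, "", field)          # drop the surrounding braces
    n = split(field, part, " ")      # " " splits on runs of whitespace
    for (i = 1; i < n; i += 2) {
      key = part[i]
      sub(/:$/, "", key)             # strip the colon that ends each key
      print key "\t" part[i + 1]
    }
  }' "$@"
}

split_field5 sample.csv
```

Rows whose 5th field holds a single token such as None produce no output, because there is no complete pair to print.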
Take a look at the cut command. You can specify a list of fields or a range of fields.

Adding null/Zero values to comma delimited file using unix scripting

I have a requirement: I receive files from a source with varying numbers of delimited fields, and I need to bring them all to one standard number of fields.
source file1:
AA,BB,CC,0,0
AC,BD,DB,1,0
EE,ER,DR,0,0
What I want to do is append three extra zeros to the end of each row:
AA,BB,CC,0,0,0,0,0
AC,BD,DB,1,0,0,0,0
EE,ER,DR,0,0,0,0,0
The source file always contains fewer columns than required. Can anyone help with this?
Thanks in advance
Try this; it appends the given string to the end of each line of the file:
sed '1,$ s/$/,0,0,0/' infile > outfile
Here is what I tried;
sed can do it in place with the -i flag
sed -i 's/$/,0,0,0/' file
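Both sed commands always append exactly three zeros. Since the question mentions files with differing numbers of fields, padding each row up to a fixed width may be closer to what is needed. A minimal awk sketch, assuming the target is 8 fields (the width of the desired output above); the variable name want is arbitrary and the short third row is hypothetical.

```shell
# Two rows from the question plus one hypothetical shorter row.
cat > infile <<'EOF'
AA,BB,CC,0,0
AC,BD,DB,1,0
EE,ER,DR
EOF

# Assigning to a field beyond NF extends the record, and awk rebuilds
# the line with OFS, so every row ends up with "want" fields.
awk -F, -v OFS=, -v want=8 '{
  for (i = NF + 1; i <= want; i++) $i = 0
  print
}' infile > outfile

cat outfile
```

Rows that already have 8 or more fields are printed unchanged.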

Chunk a large file based on regex (Linux)

I have a large text file that I want to split into smaller files based on the distinct values of a column. Columns are separated by commas (it's a CSV file) and there are lots of distinct values:
e.g.
1012739937,2006-11-28,d_02245211
1012739937,2006-11-28,d_02238545
1012739937,2006-11-28,d_02236564
1012739937,2006-11-28,d_01918338
1012739937,2006-11-28,d_02148765
1012739937,2006-11-28,d_00868949
1012739937,2006-11-28,d_01908448
1012740478,1998-06-26,d_01913689
1012740478,1998-06-26,i_4869
1012740478,1998-06-26,d_02174766
I want to split the file into smaller files such that each file contains the records belonging to one year (one for the records of 2006, one for the records of 1998, etc.)
(here we may have a limited number of years, but I want to do the same thing with a larger number of distinct values of a specific column)
You can use awk:
awk -F, '{split($2,d,"-");print > d[1]}' file
Explanation:
-F, tells awk that input fields are separated by ','
split($2,d,"-") splits the second column (the date) by '-'
and puts the bits into the array 'd'
print > d[1] prints the whole input line into a file named after the year
A quick awk solution, if slightly fragile (it assumes the second column, if it exists, always starts with yyyy):
awk -F, '$2{print > (substr($2,1,4) ".csv")}' test.in
It will split input into files yyyy.csv; make sure they don't exist in your current directory or they will be overwritten.
A different awk take: use a slightly more complicated field separator:
awk -F '[,-]' '{print > $2}' file
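The one-liners above keep one output file open per distinct value. With a very large number of distinct values, some awk implementations may run into the limit on simultaneously open files. One way around that is to append and to close the previous file whenever the key changes. This is a sketch, not a tuned solution; the chunks directory and input.csv are hypothetical names, and the rows are taken from the question, with one moved to the end to show a key that reappears.

```shell
# Rows from the question; the last one repeats an earlier year on purpose.
cat > input.csv <<'EOF'
1012739937,2006-11-28,d_02245211
1012739937,2006-11-28,d_02238545
1012740478,1998-06-26,d_01913689
1012740478,1998-06-26,i_4869
1012739937,2006-11-28,d_02236564
EOF

# Start from an empty output directory: ">>" appends, so leftovers
# from an earlier run would otherwise end up in the result.
rm -rf chunks
mkdir chunks

awk -F, -v dir=chunks '
  {
    out = dir "/" substr($2, 1, 4) ".csv"
    if (out != prev) {              # key changed: release the old handle
      if (prev != "") close(prev)
      prev = out
    }
    print >> out                    # append, so a reopened file is not truncated
  }
' input.csv

wc -l chunks/*.csv
```

If the input is sorted on the key first, each file is opened only once, which keeps the number of close and reopen cycles low.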
