I want to combine two .csv files based on the unique id that exists in both files.
The first file consists of 17 columns and the second of 2 columns, where in both files the first column is the same unique id.
In the third file to be created, I would like 18 columns.
I have been trying paste:
paste -d ' ' SPOOL1.csv SPOOL2.csv > MERGED.csv
but that of course does not take the unique id column into consideration.
I'm not proficient in awk, so all help is appreciated.
Thanks
Sounds like if the files are sorted, then
join SPOOL1 SPOOL2 > MERGED
should get you closer, once you deal with the delimiters not shown.
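For instance, a minimal sketch assuming both files are comma-separated and keyed on their first column (the .sorted.csv names are hypothetical intermediates):
# sort each file on its first (id) field so join can pair the lines up
sort -t, -k1,1 SPOOL1.csv > SPOOL1.sorted.csv
sort -t, -k1,1 SPOOL2.csv > SPOOL2.sorted.csv
# join on the common first field, keeping the comma as the delimiter
join -t, SPOOL1.sorted.csv SPOOL2.sorted.csv > MERGED.csv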
I have two Excel files with the common headings "StudentID" and "StudentName" in both files. I want to merge these two Excel files into a third Excel file containing all the records from the two, along with the common heading. How can I do the same through Linux commands?
I assumed these were CSV files, as it would be way more complicated with .xlsx files.
cp first_file.csv third_file.csv
tail -n +2 second_file.csv >> third_file.csv
The first line copies your first file into a new file called third_file.csv. The second line appends the content of the second file starting from its second line (skipping the header).
Due to your requirement to do this with "Linux commands" I assume that you have two CSV files rather than XLSX files.
If so, the Linux join command is a good fit for a problem like this.
Imagine your two files are:
# file1.csv
Student ID,Student Name,City
1,John Smith,London
2,Arthur Dent,Newcastle
3,Sophie Smith,London
and:
# file2.csv
Student ID,Student Name,Subjects
1,John Smith,Maths
2,Arthur Dent,Philosophy
3,Sophie Smith,English
We want to do an equality join on the Student ID field (or we could use Student Name; it doesn't matter, since both are common to each file).
We can do this using the following command:
$ join -1 1 -2 1 -t, -o 1.1,1.2,1.3,2.3 file1.csv file2.csv
Student ID,Student Name,City,Subjects
1,John Smith,London,Maths
2,Arthur Dent,Newcastle,Philosophy
3,Sophie Smith,London,English
By way of explanation, this join command written as SQL would be something like:
SELECT `Student ID`, `Student Name`, `City`, `Subjects`
FROM `file1.csv`, `file2.csv`
WHERE `file1.Student ID` = `file2.Student ID`
The options to join mean:
The "SELECT" clause:
-o 1.1,1.2,1.3,2.3 means select the first file's first field, the first file's second field, the first file's third field, and the second file's third field.
The "FROM" clause:
file1.csv file2.csv, i.e. the two filename arguments passed to join.
The "WHERE" clause:
-1 1 means join on the 1st field of the left table
-2 1 means join on the 1st field of the right table (-1 = left; -2 = right)
Also:
-t, tells join to use the comma as the field separator
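One caveat not shown above: join expects both inputs to be sorted on the join field (the sample files above already are). If yours are not, a sketch that sorts the body while keeping the header line in place (file1.sorted.csv is a hypothetical intermediate):
# keep the header line, sort the remaining rows lexically on field 1
(head -n 1 file1.csv && tail -n +2 file1.csv | sort -t, -k1,1) > file1.sorted.csv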
@Corentin Limier Thanks for the answer.
I was able to achieve the same in a similar way, shown below.
Let's say we have two files a.xls and b.xls and want to merge them into a third file c.xls:
cat a.xls > c.xls && tail -n +2 b.xls >> c.xls
So I have a large CSV file (gigabytes in size) with multiple columns, where the first two columns are:
Invoice number|Line Item Number
I want a Unix/Linux/Ubuntu command which can merge these two columns into a new column separated by ':'. For example, if the invoice number is 64789544 and the line item number is 234533, then my merged value should be
64789544:234533
Can this really be achieved? If yes, is it possible to add the merged column back to the source CSV file?
You can use the following sed command:
$ cat large.csv
Invoice number|Line Item Number|Other1|Other2
64789544|234533|abc|134
64744123|232523|cde|awc
$ sed -i.bak 's/^\([^|]*\)|\([^|]*\)/\1:\2/' large.csv
$ cat large.csv
Invoice number:Line Item Number|Other1|Other2
64789544:234533|abc|134
64744123:232523|cde|awc
Just be aware that it takes a backup of your input file (just in case), so you need to have enough space in your file system.
Explanations:
s/^\([^|]*\)|\([^|]*\)/\1:\2/ matches the first two fields of your CSV (separated by |) and replaces the separator with :, using back-references, which merges the two columns.
If you are sure about what you are doing, you can change -i.bak to -i to avoid taking a backup of the CSV file.
Perhaps with this simple sed, which replaces only the first | on each line (redirect the output to a new file, or add -i to edit in place):
sed 's/|/:/' infile
If you have two files in the same tab-separated format, and you want to count how many values in a given column are the same between the two files, what would be the best way to do that?
Example:
I have five columns of tab-separated data; column two of file1 is as follows:
234839
349583
444995
694038
785948
and in file2 column 2 is this:
123943
234839
338273
349583
785948
The expected output would be 3.
It depends: do you want a mapping between values and counts, or is the value one of the inputs?
Either way you can probably do it by piping cat, cut, grep, and wc -l.
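A minimal sketch for the overlap count, assuming the files are named file1 and file2 (hypothetical names) and each value appears at most once per file; it uses bash process substitution, and comm needs its inputs sorted:
# extract column 2 from each file, sort, keep only the lines common to both, count them
comm -12 <(cut -f2 file1 | sort) <(cut -f2 file2 | sort) | wc -l
For the example data above this prints 3.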
I have several files and I only want to take specific columns from it. At the moment, I am using the following code:
$ cut -f 1,2,5 AD0062-C.vcf > cutAD0062.txt
However, to speed up the process, I was wondering if I could cut the same columns (fields 1, 2, 5) in multiple files and print the output to several different files. I.e. columns 1, 2, 5 of files AD0063-C.vcf, AD0064-C.vcf, AD0065-C.vcf should output results to separate files: cutAD0063.txt, cutAD0064.txt, cutAD0065.txt?
You can write a for loop:
for i in AD*-C.vcf
do
    # ${i%-C.vcf} strips the -C.vcf suffix, so AD0063-C.vcf writes to cutAD0063.txt
    cut -f 1,2,5 "$i" > "cut${i%-C.vcf}.txt"
done