Merge two CSVs while resolving duplicates - linux

I have a couple of csv's that i need to merge. I want to consider entries which have the same first and second columns as duplicates.
I know the command for this is something like
sort -t"," -u -k 1,1 -k 2,2 file1 file2
I additionally want to resolve the duplicates in such a way that the entry from the second file is chosen everytime. What is the way to do that?

If the suggestion to reverse the order of files to the sort command doesn't work (see other answer), another way to do this would be to concatenate the files, file2 first, and then sort them with the -s switch.
cat file2 file1 | sort -t"," -u -k 1,1 -k 2,2 -s
-s forces a stable sort, meaning that identical lines will appear in the same relative order. Since the input to sort has all of the lines from file2 before file1, all of the duplicates in the output should come from file2.
The sort man page doesn't explicitly state that input files will be read in the order that they're supplied on the command line, so I guess it's possible that an implementation could read the files in reverse order, or alternating lines, or whatever. But if you concatenate the files first then there's no ambiguity.

Change the order of the two files and add -s(#Jim Mischel give the hit) would solve your problem.
sort -t"," -u -k 1,1 -k 2,2 -s file2 file1
man sort
-u, --unique
with -c, check for strict ordering; without -c, output only the
first of an equal run
-s, --stable
stabilize sort by disabling last-resort comparison
Short answer
awk -F"," '{out[$1$2]=$0} END {for(i in out) {print out[i]}}' file1 file2
A bit long answer:
awk 'BEGIN {
FS=OFS=","; # set ',' as field separator
}
{
out[$1$2]=$0; # save the value to dict, new value would replace old value.
}
END {
for (i in out) { # in the end, print all value of the dict
print out[i];
}
}' file1 file2

Related

Difference between using the uniq command with sort or without it in linux

When I use uniq -u data.txt lists the whole file and when I use sort data.txt | uniq -u it omits repeated lines. Why does this happen?
uniq man says that -u, --unique only prints unique lines. I don't understand why I need to use pipe to get correct output.
uniq removes adjacent duplicates. If you want to omit duplicates that are not adjacent, you'll have to sort the data first.

Find common lines between two files

File 1:
6
9219045
71608707
105853666
106000373
106000464
106000814
106001204
106001483
106002054
File 2:
6,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQufMrUK+8A4e0iJV4ktLQgXVxAH4ABQAAAEBNoyuUZLYRLaBqLvsvzHxxv63pO+4UPsRqpp/oHURcBdT6NES2G5H6+Kc3yjZOXDIIhHN1efAxyM/iWD0qDev9dAAwY29tLmFtYXpvbi5wb2ludHMuZW5jcnlwdGlvbi5rZXkuYWNjb3VudHNzZXJ2aWNlc3IADmphdmEubGFuZy5Mb25nO4vkkMyPI98CAAFKAAV2YWx1ZXhyABBqYXZhLmxhbmcuTnVtYmVyhqyVHQuU4IsCAAB4cAAAAAAAAAAB,jp-points
55555,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQ5C9LG75v8+ENmmteRa/bBHVxAH4ABQAAAFBgXjgKk6KvTg4FiPfWF/7Ittzk/MpmlBecYkc9Bc+3mAV7R58rcl1hGkFdk3MagFXjUsunbE0qcV+Gy+DwhUWpBYDpA3p9q9oO8zwDJfFqCHQAMGNvbS5hbWF6b24ucG9pbnRzLmVuY3J5cHRpb24ua2V5LmFjY291bnRzc2VydmljZXNyAA5qYXZhLmxhbmcuTG9uZzuL5JDMjyPfAgABSgAFdmFsdWV4cgAQamF2YS5sYW5nLk51bWJlcoaslR0LlOCLAgAAeHAAAAAAAAAAAQ==,jp-points
74292,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQPxjL0KWZoaYxWY7clP57tnVxAH4ABQAAAFB6WiMY05SU2WiYqaC7CzwMP2kQ51ec9mkIPh7R4fz2LPwfT8VNpAwH0QLM3I497D2JLfK13S6S90dxpU1ny2VBwaU4imxVchwo7YrcvwvEZXQAMGNvbS5hbWF6b24ucG9pbnRzLmVuY3J5cHRpb24ua2V5LmFjY291bnRzc2VydmljZXNyAA5qYXZhLmxhbmcuTG9uZzuL5JDMjyPfAgABSgAFdmFsdWV4cgAQamF2YS5sYW5nLk51bWJlcoaslR0LlOCLAgAAeHAAAAAAAAAAAQ==,jp-points
File 1 has only one column and I am sorting the file with the command sort -n file1
File 2 has three columns and I am sorting the file with command sort -t "," -k 1n,1 file2 which is sorting on the basis of ist column.
Now, I want to find the rows in file2 that are starting from lines in file1
Commands that I have tried:
grep -w -f file1 file2
join -t "," -1 1 -2 1 -o 2.2 file1 file2
But, I am not getting desired results. Please provide me with alternate approach. File 1 has rows 7124458 and File 2 has row 42987432.
Use awk:
awk -F, 'FNR == NR { ++a[$0]; next } $1 in a' file1 file2
Output:
6,rO0ABXNyADljb20uYW1hem9uLnBvaW50c3BsYXRmb3JtLnV0aWwuUG9pbnRzUGxhdGZvcm1DcnlwdE1lc3NhZ2Xio1+sC+m4CAIABFsACGNpcGhlcklWdAACW0JbAApjaXBoZXJUZXh0cQB+AAFMAAxtYXRlcmlhbE5hbWV0ABJMamF2YS9sYW5nL1N0cmluZztMAA5tYXRlcmlhbFNlcmlhbHQAEExqYXZhL2xhbmcvTG9uZzt4cHVyAAJbQqzzF/gGCFTgAgAAeHAAAAAQufMrUK+8A4e0iJV4ktLQgXVxAH4ABQAAAEBNoyuUZLYRLaBqLvsvzHxxv63pO+4UPsRqpp/oHURcBdT6NES2G5H6+Kc3yjZOXDIIhHN1efAxyM/iWD0qDev9dAAwY29tLmFtYXpvbi5wb2ludHMuZW5jcnlwdGlvbi5rZXkuYWNjb3VudHNzZXJ2aWNlc3IADmphdmEubGFuZy5Mb25nO4vkkMyPI98CAAFKAAV2YWx1ZXhyABBqYXZhLmxhbmcuTnVtYmVyhqyVHQuU4IsCAAB4cAAAAAAAAAAB,jp-point
join(1) assumes both files are sorted alphabetically on the join fields. Try sorting the inputs without -n.
(To be more precise, it depends on the LC_COLLATE setting. If you are sorting for the benefit of two programs talking to each other, it is probably more reliable to set LC_ALL=C for both join and sort to avoid any glitches due to locale settings.)

how to subtract the two files in linux

I have two files like below:
file1
"Connect" CONNECT_ID="12"
"Connect" CONNECT_ID="11"
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
file2
"Quit" CONNECT_ID="12"
"Quit" CONNECT_ID="11"
The file contents are not exactly same but similar to above and the number of records are minimum 100,000.
Now i want to get the result as show below into file1 (means the final result should be there in file1)
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
I have used a while loop something like below:
awk {'print $2'} file2 | sed "s/CONNECTION_ID=//g" > sample.txt
while read actual; do
grep -w -v $actual file1 > file1_tmp
mv -f file1_tmp file1
done < sample.txt
Here I have adjusted my code according to example. So it may or may not work.
My problem is the loop is repeating for more than 1 hour to complete the process.
So can any one suggest me how to achieve the same with any other ways like using diff or comm or sed or awk or any other linux command which will run faster?
Here mainly I want to eliminate this big typical while loop.
Most UNIX tools are line based and as you don't have whole line matches that means grep, comm and diff are out the window. To extract field based information like you want awk is perfect:
$ awk 'NR==FNR{a[$2];next}!($2 in a)' file2 file1
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
To store the results back to file1 you'll need to redict the output to a temporary file and then move the file into file1 like so:
$ awk 'NR==FNR{a[$2];next}!($2 in a)' file2 file1 > tmp && mv tmp file1
Explanation:
The awk variable NR increments for every record read, that is each line in every file. The FNR variable increments for every record but gets reset for every file.
NR==FNR # This condition is only true when reading file1
a[$2] # Add the second field in file1 into array as a lookup table
next # Get the next line in file1 (skips any following blocks)
!($2 in a) # We are now looking at file2 if the second field not in the look up
# array execute the default block i.e print the line
To modify this command you just need to change the fields that matched. In your real case if you want to match field 1 from file1 with field 4 from file2 then you would do:
$ awk 'NR==FNR{a[$1];next}!($4 in a)' file2 file1
This might work for you (GNU sed):
sed -r 's|\S+\s+(\S+)|/\1/d|' file2 | sed -f - -i file1
The tool best suited to this job is join(1). It joins two files based on values in a given column of each file. Normally it just outputs the lines that match across the two files, but it also has a mode to output the lines from one of the files that do not match the other file.
join requires that the files be sorted on the field(s) you are joining on, so either pre-sort the files, or use process substitution (a bash feature - as in the example below) to do it on the one command line:
$ join -j 2 -v 1 -o "1.1 1.2" <(sort -k2,2 file1) <(sort -k2,2 file2)
"Connect" CONNECT_ID="122"
"Connect" CONNECT_ID="109"
-j 2 says to join the files on the second field for both files.
-v 1 says to only output fields from file 1 that do not match any in file 2
-o "1.1 1.2" says to order the output with the first field of file 1 (1.1) followed by the second field of file 1 (1.2). Without this, join will output the join column first followed by the remaining columns.
You may need to analyze file2 at fist, and append all ID which have appered to a cache(eg. memory)
Than scan file1 line by line to adjust whether the ID in the cache.
python code like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
p = re.compile(r'CONNECT_ID="(.*)"')
quit_ids = set([])
for line in open('file2'):
m = p.search(line)
if m:
quit_ids.add(m.group(1))
output = open('output_file', 'w')
for line in open('file1'):
m = p.search(line)
if m and m.group(1) not in quit_ids:
output.write(line)
output.close()
The main bottleneck is not really the while loop, but the fact that you rewrite the output file thousands of times.
In your particular case, you might be able to get away with just this:
cut -f2 file2 | grep -Fwvf - file1 >tmp
mv tmp file1
(I don't think the -w option to grep is useful here, but since you had it in your example, I retained it.)
This presupposes that file2 is tab-delimited; if not, the awk '{ print $2 }' file2 you had there is fine.

How do I randomly merge two input files to one output file using unix tools?

I have two text files, of different sizes, which I would like to merge into one file, but with the content mixed randomly; this is to create some realistic data for some unit tests. One text file contains the true cases, while the other the false.
I would like to use standard Unix tools to create the merged output. How can I do this?
Random sort using -R:
$ sort -R file1 file2 -o file3
My version of sort also does not support -R. So here is an alternative using awk by inserting a random number in front of each line and sorting according to those numbers, then strip off the number.
awk '{print int(rand()*1000), $0}' file1 file2 | sort -n | awk '{$1="";print $0}'
This adds a random number to the beginning of each line with awk, sorts based on that number, and then removes it. This will even work if you have duplicates (as pointed out by choroba) and is slightly more cross platform.
awk 'BEGIN { srand() } { print rand(), $0 }' file1 file2 |
sort -n |
cut -f2- -d" "

sort unix file by id

I want to sort a unix file by the id column but when I use sort -k4,4 or -k4,4n I do not get the expected result.
The column of interest should be sorted like this:
id1
id2
id3
id4
etc.
Instead it is sorted like this when i do sort -k4,4
id1
id10
id100
id1000
id10000
id10001
etc.
My unix version uses the following sort function:
sort --help
Usage: sort [OPTION]... [FILE]...
Write sorted concatenation of all FILE(s) to standard output.
Mandatory arguments to long options are mandatory for short options too.
Ordering options:
-b, --ignore-leading-blanks ignore leading blanks
-d, --dictionary-order consider only blanks and alphanumeric characters
-f, --ignore-case fold lower case to upper case characters
-g, --general-numeric-sort compare according to general numerical value
-i, --ignore-nonprinting consider only printable characters
-M, --month-sort compare (unknown) < `JAN' < ... < `DEC'
-n, --numeric-sort compare according to string numerical value
-r, --reverse reverse the result of comparisons
Other options:
-c, --check check whether input is sorted; do not sort
-k, --key=POS1[,POS2] start a key at POS1, end it at POS2 (origin 1)
-m, --merge merge already sorted files; do not sort
-o, --output=FILE write result to FILE instead of standard output
-s, --stable stabilize sort by disabling last-resort comparison
-S, --buffer-size=SIZE use SIZE for main memory buffer
-t, --field-separator=SEP use SEP instead of non-blank to blank transition
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
multiple options specify multiple directories
-u, --unique with -c, check for strict ordering;
without -c, output only the first of an equal run
-z, --zero-terminated end lines with 0 byte, not newline
--help display this help and exit
--version output version information and exit
Use the -V or --version-sort option for version sort
sort -V -k4,4 file.txt
Example:
$ cat file.txt
id5
id3
id100
id1
id10
Ouput:
$ sort -V file.txt
id1
id3
id5
id10
id100
EDIT:
If your implementation of sort doesn't have the -V option then a work-around using sed to remove id so a numeric sort -n can be done and then replace id back with sed, like this:
sed -E 's/id([0-9]+)/\1/' file.txt | sort -n -k4,4 | sed -E 's/( *)([0-9]+)( *|$)/\1id\2\3/'
Note: this solution is dependent on the data, only works if no columns containing purely numbers are found before the ID column.
As sudo_o has already mentioned, the easiest would be to use --version-sort which does natural sorting of numbers that occur within text.
If your version of sort does not have that option, a hacky way to approach this would be to temporarily remove the "id" prefix before a sort, then replace them. Here's one way, using awk:
awk 'sub("^id", "", $4)' file.txt | sort -k4,4n | awk 'sub("^", "id", $4)'
If your sort supports it, you can also use the syntax F.C to use specific characters from a field.
This would sort on field 4, from chars 3 to 10, numerical value:
sort -bn -k 4.3,4.10 file
And this would sort on field 4, from chars 3 to end of field, numerical value:
sort -bn -k 4.3,4 file

Resources