Linux: Comparing two files by content only, regardless of line position - linux

I am trying to use the comm or diff Linux commands to compare two different files. Each file has a list of volume names. File A has 1500 volumes and file B has those same 1500 volumes plus another 200, for a total of 1700. I am looking for a way to find just those 200 volumes. I don't care if the volumes match and are on different lines; I only want the mismatched volumes, but the diff and comm commands seem to only compare line by line. Does anyone know another command, or a way to use comm or diff, to find these 200 volumes?
First 5 lines of both files: (BTW there is only one volume on each line so File A has 1500 lines and File B has 1700 lines)
File A:
B00004
B00007
B00010
B00011
B00013
File B:
B00003
B00004
B00007
B00008
B00010
So I would want the command to show me B00003 and B00008 just from the first 5 lines because those volumes are not in File A

awk can also help.
awk 'NR==FNR {a[$1]=$1; next}!($1 in a) {print $0}' fileA fileB
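For readers unfamiliar with the NR==FNR idiom, here is the same one-liner unpacked with comments, as a sketch. The temporary directory and the five-line sample files are assumptions taken from the excerpt in the question, and the array name seen is arbitrary.

```shell
#!/bin/sh
# Assumed sample data: the first five lines of each file from the question.
tmpdir=$(mktemp -d)
printf '%s\n' B00004 B00007 B00010 B00011 B00013 > "$tmpdir/fileA"
printf '%s\n' B00003 B00004 B00007 B00008 B00010 > "$tmpdir/fileB"

missing=$(awk '
    # NR == FNR holds only while the first file is being read:
    # remember each volume name, then skip to the next line.
    NR == FNR { seen[$1]; next }
    # Remaining lines come from the second file: print those not seen before.
    !($1 in seen)
' "$tmpdir/fileA" "$tmpdir/fileB")

printf '%s\n' "$missing"   # prints B00003 and B00008, one per line
rm -rf "$tmpdir"
```

Because the lookup is a hash table rather than a line-by-line comparison, neither file needs to be sorted.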

Try
comm -23 <( sort largerFile) <(sort smallerFile)
This assumes that your Vol name will be the first "field" in the data. If not, check man sort for ways to sort files on alternate fields (and combinations of fields).
The <( ....) construct is known as process substitution. If you're using a really old shell/unix or reduced functionality shell (dash?), process substitution may not be available. Then you'll have to sort your files before you run comm and manage what you do with the unsorted file.
Note that comm -23 means "suppress lines unique to the 2nd file" (-2) and "suppress lines common to both files" (-3), so the remaining output is the lines found in file1 that are not in file2. This is why I list largerFile first.
IHTH
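If process substitution is not available, the same comparison can be sketched with temporary sorted copies. This is only an illustration: the temporary directory and the sample contents (the first five lines of each file from the question) are assumptions.

```shell
#!/bin/sh
# Assumed sample data: the first five lines of each file from the question.
tmpdir=$(mktemp -d)
printf '%s\n' B00004 B00007 B00010 B00011 B00013 > "$tmpdir/fileA"
printf '%s\n' B00003 B00004 B00007 B00008 B00010 > "$tmpdir/fileB"

# comm needs sorted input, so sort each file into a temporary copy first.
sort "$tmpdir/fileA" > "$tmpdir/fileA.sorted"
sort "$tmpdir/fileB" > "$tmpdir/fileB.sorted"

# -2 drops lines found only in the second file, -3 drops lines found in
# both, leaving the lines that appear only in the first file (fileB here).
only_in_b=$(comm -23 "$tmpdir/fileB.sorted" "$tmpdir/fileA.sorted")

printf '%s\n' "$only_in_b"   # prints B00003 and B00008, one per line
rm -rf "$tmpdir"
```

The original, unsorted files are left untouched; only the sorted copies are compared.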

Related

Is there a way to compare N files at once, and only leave lines unique to each file?

Background
I have five files that I am trying to make unique relative to each other. In other words, I want to make it so that the lines of text in each file have no commonality with each other.
Attempted solution
So far, I have been able to run the grep -vf command comparing one file with each of the other 4, like so:
grep -vf file2.txt file1.txt
grep -vf file3.txt file1.txt
...
This prints the lines in file1 that are not in file2, not in file3, etc. However, this becomes cumbersome because I would need to do it for every pairing of files. In other words, to truly reduce each file to the lines of text found only in that file, I would have to run grep -vf on every combination of files. Given that this sounds cumbersome to me, I wanted to know...
Question
What is the command/series of commands in Linux to find the lines of text in each file that do not appear in any of the other files?
You could just do:
awk '!a[$0]++ { out=sprintf("%s.out", FILENAME); print > out}' file*
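This writes each distinct line to the .out file of the input file in which it first appears (for example, lines first seen in file1.txt go to file1.txt.out); later duplicates of that line, in the same file or in any other file, are suppressed. Note that a line shared by several files is still written once, to the output of the first file that contains it.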
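If the goal is to keep only lines that occur in exactly one of the files, one possible approach is to read the files twice: once to count how many files contain each line, and once to print. This is a different technique from the one-liner in the answer, sketched under assumptions; the file names, contents, and the variable names pass, seen and count are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample files used only for illustration.
tmpdir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$tmpdir/file1.txt"
printf '%s\n' banana date > "$tmpdir/file2.txt"
printf '%s\n' cherry elder > "$tmpdir/file3.txt"

# The file list is given twice; the pass=N assignments placed between the
# lists tell the awk program which pass it is in.
awk '
    # Pass 1: count, for each line, the number of distinct files holding it.
    pass == 1 { if (!seen[FILENAME, $0]++) count[$0]++; next }
    # Pass 2: keep a line only if exactly one file contains it.
    count[$0] == 1 { print > (FILENAME ".out") }
' pass=1 "$tmpdir"/file?.txt pass=2 "$tmpdir"/file?.txt

unique1=$(cat "$tmpdir/file1.txt.out")   # apple
unique2=$(cat "$tmpdir/file2.txt.out")   # date
unique3=$(cat "$tmpdir/file3.txt.out")   # elder
printf '%s\n' "$unique1" "$unique2" "$unique3"
rm -rf "$tmpdir"
```

A file with no exclusive lines produces no .out file at all, and a line repeated only within a single file is written once per occurrence.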

Shell Script - How to merge two text files without repeating lines

My case is apparently easy, but I couldn't find a simple way to do it, and I need one because the real files are very large.
So, I have two txt files and I would like to generate a new file containing the content of both without duplicating any lines. Something like this:
file1.txt
192.168.0.100
192.168.0.101
192.168.0.102
file2.txt
192.168.0.100
192.168.0.101
192.168.1.200
192.168.1.201
I would like to merge these files above and generate another one like this:
result.txt
192.168.0.100
192.168.0.101
192.168.0.102
192.168.1.200
192.168.1.201
Any simple suggestions?
Thank you
If changing the order is not an issue:
sort -u file1.txt file2.txt > result.txt
This sorts the lines of both files together, then runs through them and outputs each distinct line only once (the -u flag).
There's a semi-standard idiom in awk for removing duplicates:
awk '!a[$0]++ {print}' file1.txt file2.txt
The array a counts occurrences of each line, but only prints a line the first time it is added (i.e., when a[$0] is 0 before it is incremented).
This is asymptotically faster than sorting the input (and preserves the input order), but requires more memory.
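Both approaches can be tried side by side on the addresses from the question. This is a small sketch; the temporary directory and the variable names by_sort and by_awk are assumptions, and LC_ALL=C is set only to make the sort order predictable.

```shell
#!/bin/sh
# Sample data copied from the question.
tmpdir=$(mktemp -d)
printf '%s\n' 192.168.0.100 192.168.0.101 192.168.0.102 > "$tmpdir/file1.txt"
printf '%s\n' 192.168.0.100 192.168.0.101 192.168.1.200 192.168.1.201 > "$tmpdir/file2.txt"

# Sorted merge: output order is the sort order, not the input order.
by_sort=$(LC_ALL=C sort -u "$tmpdir/file1.txt" "$tmpdir/file2.txt")

# Hash-based merge: each line stays where it first appeared.
by_awk=$(awk '!a[$0]++' "$tmpdir/file1.txt" "$tmpdir/file2.txt")

printf 'sort -u:\n%s\n\nawk:\n%s\n' "$by_sort" "$by_awk"
rm -rf "$tmpdir"
```

With this input both commands give the five lines of result.txt in the same order, because the inputs happen to be sorted already; with unsorted input only the awk version keeps the original order.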

Split large gz files while preserving rows

I have a larger .gz file (2.1G) that I am trying to load into R, but it is large enough that I have to split it into pieces and load each individually before recombining them. However, I am having difficulty in splitting the file in a way that preserves the structure of the data. The file itself, with the exception of the first two rows, is a 56318 x 9592 matrix with non-homogenous entries.
I'm using Ubuntu 16.04. First, I tried using the split command from terminal as suggested by this link (https://askubuntu.com/questions/54579/how-to-split-larger-files-into-smaller-parts?rq=1)
$ split --lines=10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
Doing this, though, creates far more files than I would have expected (since my matrix has 57000 rows, I was hoping to output 6 files, each 10000 rows in size). When reading one of these into R and investigating the dimensions, I see that each is a matrix of 62x9592, indicating that the columns have all been preserved, but I'm getting significantly fewer rows than I would have hoped. Further, when reading it in, I get an error specifying an unexpected end of file. My thought is that it's not reading in how I want it to.
I found two possible alternatives here - https://superuser.com/questions/381394/unix-split-a-huge-gz-file-by-line
In particular, I've tried piping different arguments using gunzip, and then passing the output through to split (with the assumption that perhaps the file being compressed is what led to inconsistent end lines). I tried
$ zcat originalFile.gct.gz | split -l 10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
but, doing this, I ended up with the exact same splits that I had previously. I have the same problem replacing "zcat" with "gunzip -c", which should have sent the uncompressed output to the split command.
Another answer on that link suggested piping to head or tail with something like zcat, for example
$ zcat originalFile.gct.gz | head -n 10000 >> "originalFile.gct.gz.1"
With zcat, this works perfectly, and it's exactly what I want. The dimension for this ends up being 10000x9592, so this is the ideal solution. One thing that I'll note is that this output is an ASCII text file rather than a compressed file, and I'm perfectly OK with that.
However, I want to be able to do this until the end of the file, making an additional output file for each 10000 rows. For this particular case, it's not a huge deal to just make the six, but I have tens of files like this, some of which are >10 GB. My question, then, is how can I use the split command to take each successive 10000 lines of the unzipped file and output them, automatically updating the suffix with each new file? Basically, I want the output that I got from using "head", but with "split" so that I can do it over the entire file.
Here is the solution that ended up working for me
$ zcat originalFile.gct.gz | split -l 10000 - "originalFile.gct.gz-"
As Guido mentioned in the comment, my original command
$ zcat originalFile.gct.gz | split -l 10000 "originalFile.gct.gz" "originalFile.gct.gz.part-"
was discarding the output of zcat, and split was once again reading from the compressed data. By passing "-" as the input-file argument to split, I was able to feed the standard output of zcat into split, and now the piping works as I was expecting it to.
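The effect of the "-" argument can be checked on a small file. The following is a sketch with made-up data: the file name sample.gz, the 25 generated rows and the chunk size of 10 are assumptions, and gzip -dc is used as the equivalent of zcat.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
# Hypothetical stand-in for the real data: 25 numbered rows, gzip-compressed.
awk 'BEGIN { for (i = 1; i <= 25; i++) print "row" i }' | gzip > "$tmpdir/sample.gz"

# "-" tells split to read standard input; the last argument is the prefix
# for the output file names.
gzip -dc "$tmpdir/sample.gz" | split -l 10 - "$tmpdir/sample.part-"

part_count=$(ls "$tmpdir"/sample.part-* | wc -l)
last_rows=$(wc -l < "$tmpdir/sample.part-ac")
echo "parts: $part_count, rows in last part: $last_rows"
rm -rf "$tmpdir"
```

This yields three uncompressed parts with the suffixes aa, ab and ac, holding 10, 10 and 5 rows.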
When you want to control your splitting better, you can use awk.
You mentioned that the first two rows were special.
Try something like this, which skips the first two rows and starts a new output file every 5 rows (use 10000 instead of 5 for your case):
zcat originalFile.gct.gz |
awk 'NR<3 {next} i++%5==0 {close(out); out = "originalFile.gct.part" (++j)} {print > out}'
When you want your outfiles compressed, modify the awk command: let it print the names of the completed files and use xargs to gzip them.
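One way the "print the completed file names and gzip them with xargs" idea might look is sketched below. The sample data, the prefix variable and the chunk size of 5 are assumptions for illustration, and the sketch assumes output paths without whitespace, since xargs splits its input on blanks.

```shell
#!/bin/sh
tmpdir=$(mktemp -d)
# Hypothetical input: two header rows followed by 12 data rows.
{
    printf '%s\n' header1 header2
    awk 'BEGIN { for (i = 1; i <= 12; i++) print "row" i }'
} | gzip > "$tmpdir/sample.gz"

# awk writes 5 data rows per part and prints the name of each part once it
# is complete; xargs hands those names to gzip.
gzip -dc "$tmpdir/sample.gz" |
awk -v prefix="$tmpdir/sample.part" '
    NR < 3 { next }                        # skip the two header rows
    i++ % 5 == 0 {                         # time to start a new part
        if (out != "") { close(out); print out }
        out = sprintf("%s%d", prefix, ++j)
    }
    { print > out }
    END { if (out != "") { close(out); print out } }
' | xargs gzip

last_part=$(gzip -dc "$tmpdir/sample.part3.gz")
printf '%s\n' "$last_part"   # prints row11 and row12, one per line
rm -rf "$tmpdir"
```

Closing each part before printing its name ensures the file is complete on disk before gzip sees it, and keeps the number of open files constant.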
If splitting based on the content of the file works for you, try:
zcat originalFile.gct.gz | awk -F',' '{print $0 | ("gzip > /tmp/file_" $1 ".gct.gz")}'
An example line of my file was:
2014,daniel,2,1,2,3
So I was splitting the file by year (the first column) using the variable $1,
getting an output of:
/tmp/file_2014.gct.gz
/tmp/file_2015.gct.gz
/tmp/file_2016.gct.gz
/tmp/file_2017.gct.gz
/tmp/file_2018.gct.gz

How to compare two text files for the same exact text using BASH?

Let's say I have two text files that I need to extract data out of. The text of the two files is as follows:
File 1:
1name - randomemail#email.com
2Name - superrandomemail#email.com
3Name - 123random#email.com
4Name - random123#email.com
File 2:
email.com
email.com
email.com
anotherwebsite.com
File 2 is File 1's list of domain names, extracted from the email addresses.
These are not the same domain names by any means, and are quite random.
How can I get the results of the domain names that match File 2 from File 1?
Thank you in advance!
Assuming that order does not matter,
grep -F -f FILE2 FILE1
should do the trick. (This works because of a little-known fact: the -F option to grep doesn't just mean "match this fixed string," it means "match any of these newline-separated fixed strings.")
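Run against the sample data from the question, the command behaves as sketched here. The temporary directory and the variable names matches and match_count are assumptions.

```shell
#!/bin/sh
# Sample data copied from the question.
tmpdir=$(mktemp -d)
printf '%s\n' \
    '1name - randomemail#email.com' \
    '2Name - superrandomemail#email.com' \
    '3Name - 123random#email.com' \
    '4Name - random123#email.com' > "$tmpdir/FILE1"
printf '%s\n' email.com email.com email.com anotherwebsite.com > "$tmpdir/FILE2"

# -f reads one pattern per line from FILE2; -F treats each pattern as a
# literal string, so the dot in "email.com" matches only a real dot.
matches=$(grep -F -f "$tmpdir/FILE2" "$tmpdir/FILE1")
match_count=$(printf '%s\n' "$matches" | grep -c .)

printf '%s\n' "$matches"
echo "$match_count of 4 lines in FILE1 mention a domain listed in FILE2"
rm -rf "$tmpdir"
```

All four lines of FILE1 are printed, in their original order. The match is a plain substring test, so a pattern such as email.com would also match an address ending in myemail.com.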
The recipe:
join <(sed 's/^.*#//' file1|sort -u) <(sort -u file2)
It outputs the intersection of the domain names in file1 and file2.
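The same recipe can be sketched without process substitution by writing the two sorted domain lists to temporary files. The file names domains1 and domains2 are assumptions, and LC_ALL=C is set so that sort and join agree on the ordering.

```shell
#!/bin/sh
# Sample data copied from the question.
tmpdir=$(mktemp -d)
printf '%s\n' \
    '1name - randomemail#email.com' \
    '2Name - superrandomemail#email.com' \
    '3Name - 123random#email.com' \
    '4Name - random123#email.com' > "$tmpdir/file1"
printf '%s\n' email.com email.com email.com anotherwebsite.com > "$tmpdir/file2"

# Strip everything up to the "#" to leave the domain, then sort and
# de-duplicate both lists, since join expects sorted input.
sed 's/^.*#//' "$tmpdir/file1" | LC_ALL=C sort -u > "$tmpdir/domains1"
LC_ALL=C sort -u "$tmpdir/file2" > "$tmpdir/domains2"

common=$(LC_ALL=C join "$tmpdir/domains1" "$tmpdir/domains2")
printf '%s\n' "$common"   # email.com
rm -rf "$tmpdir"
```

Unlike the grep approach, this reports each shared domain once rather than the matching lines of file1.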
See BashFAQ/036 for the list of usual solutions to this type of problem.
Use the vimdiff command; it gives a nice presentation of the differences.
If I got you right, you want to filter for all addresses with the host mentioned in File 2.
You could then just loop over File 2 and grep for #<line>, accumulating the result in a new file or something similar.
Example:
sort -u file2 | while read -r host; do grep -F "#$host" file1; done > filtered

Diff-ing files with Linux command

What Linux command allows me to check if all the lines in file A exist in file B? (It's almost like a diff, but not quite.) Also, file A has unique lines, as is the case with file B as well.
The comm command compares two sorted files, line by line, and is part of GNU coreutils.
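As a sketch of how comm could answer the yes/no question, the example below checks whether comm -23 prints anything. The sample contents of A and B and the variable names are assumptions.

```shell
#!/bin/sh
# Hypothetical sample files: every line of A also occurs in B.
tmpdir=$(mktemp -d)
printf '%s\n' alpha beta > "$tmpdir/A"
printf '%s\n' gamma alpha beta > "$tmpdir/B"

# comm requires sorted input.
sort "$tmpdir/A" > "$tmpdir/A.sorted"
sort "$tmpdir/B" > "$tmpdir/B.sorted"

# comm -23 prints the lines that occur only in the first file.
only_in_a=$(comm -23 "$tmpdir/A.sorted" "$tmpdir/B.sorted")

if [ -z "$only_in_a" ]; then
    verdict="all lines of A exist in B"
else
    verdict="A has lines that are not in B"
fi
echo "$verdict"   # all lines of A exist in B
rm -rf "$tmpdir"
```

When the check fails, the variable only_in_a holds exactly the lines of A that are missing from B.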
Are you looking for a better diff tool?
https://stackoverflow.com/questions/12625/best-diff-tool
So, what if A has
a
a
b
and b has
a
b
What would you want the output to be (yes or no)?
Use diff command.
Here is a useful video with complete usage of the diff command in under 3 minutes.
if cat A A B | sort | uniq -c | grep -E '^[[:space:]]*2[[:space:]]' > /dev/null; then
    echo "A has lines that are not in B."
fi
If you do not redirect the output, you will get a list of all the lines that are in A that are not in B (except each line will have a 2 in front of it). This relies on the lines in A being unique, and the lines in B being unique.
If they aren't, and you don't care about counting duplicates, it's relatively simple to transform each file into a list of unique lines using sort and uniq.
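The counting trick, including the de-duplication step for files that are not already unique, can be sketched as follows. The sample contents and the .uniq file names are assumptions; the sed call only strips the leading count that uniq -c adds.

```shell
#!/bin/sh
# Hypothetical sample files; A contains a duplicate line on purpose.
tmpdir=$(mktemp -d)
printf '%s\n' a a b c > "$tmpdir/A"
printf '%s\n' a c d > "$tmpdir/B"

# Reduce each file to a list of unique lines first.
sort -u "$tmpdir/A" > "$tmpdir/A.uniq"
sort -u "$tmpdir/B" > "$tmpdir/B.uniq"

# Each line of A is counted twice and each line of B once, so a count of
# exactly 2 means "in A but not in B" (3 = in both, 1 = only in B).
only_in_a=$(cat "$tmpdir/A.uniq" "$tmpdir/A.uniq" "$tmpdir/B.uniq" |
    sort | uniq -c |
    grep -E '^[[:space:]]*2[[:space:]]' |
    sed 's/^[[:space:]]*2[[:space:]]//')

printf '%s\n' "$only_in_a"   # b
rm -rf "$tmpdir"
```

Without the sort -u step, the duplicated line a would be counted five times and could no longer be told apart reliably by its count.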
