Comparing two different files - linux

Say I have two data files that could look like this:
A dog 3
A cat 1
A mouse 4
A chicken 4
and
B tiger 2
B chicken 1
B dog 3
B wolf 2
How would I be able to look at only the animals that are common to both files? Ideally, I would like the output to look something like
dog 3 3
chicken 4 1
But even outputting just the animals that are common to both files, along with their values, is good enough for me. Thanks.

This one-liner should do it:
awk 'NR==FNR{a[$2]=$2 FS $3;next}a[$2]{print a[$2],$3}' f1 f2
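For readability, here is a commented sketch of what the one-liner does (same logic, same file names f1 and f2):
awk '
    # while reading the first file (NR==FNR), index by animal name ($2)
    # and remember "name value"
    NR == FNR { a[$2] = $2 FS $3; next }
    # while reading the second file, print the stored "name value" from f1
    # plus the value from f2 whenever the animal was also seen in f1
    a[$2] { print a[$2], $3 }
' f1 f2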

@Kent has done some serious one-liner magic. Anyway, I did a shell script you could try. Simply run ./script [file1] [file2]
#!/bin/bash
# Read input: extract the word (2nd field) and value (3rd field) from each file
words1=$(cat "$1" | sed -r "s/.*\ (.*)\ .*/\1/")
val1=$(cat "$1" | sed -r "s/.*\ .*\ (.*)/\1/")
words2=$(cat "$2" | sed -r "s/.*\ (.*)\ .*/\1/")
val2=$(cat "$2" | sed -r "s/.*\ .*\ (.*)/\1/")
# Convert to arrays
words1=($words1)
val1=($val1)
words2=($words2)
val2=($val2)
# Iterate and print the words present in both files, with both values
for i in "${!words1[@]}"; do
    for j in "${!words2[@]}"; do
        if [ "${words1[i]}" == "${words2[j]}" ]; then
            echo "${words1[i]} ${val1[i]} ${val2[j]}"
            break
        fi
    done
done
exit 0

I'm not sure why this is a Linux/Unix question. It looks like you'll need to write a simple program, as this isn't a basic compare-two-files issue that would be generally covered by applications like Beyond Compare.
Let's assume these files are basic text files that contain one record per line with space-delimited values. (Using space as the delimiter is dangerous, but that's what you have above.) You'll need to read in each file, storing both as [iterable collection], and have each object either be a string that you act on in each run of a loop, or that you break into pieces as you build it from the file. You'll then compare [linepart 1] from the first file to each [linepart 1] in the second file, and whenever you find a match, break and output [linepart 1] [A.linepart 2] [B.linepart 2].
I can't think of any existing program that would do this for you, but it's fairly simple (assuming you think file IO is simple) to handle with Java, C#, etc.
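As a side note, the same compare-on-a-common-key idea could also be sketched with the standard join utility, assuming both files are sorted on the animal name (field 2); the <( ) process substitution assumes bash, and the output order is alphabetical:
join -1 2 -2 2 -o 1.2,1.3,2.3 <(sort -k2,2 f1) <(sort -k2,2 f2)
For the sample files above this should print chicken 4 1 and dog 3 3.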

Related

How to reverse each word in a text file with linux commands without changing order of words

There are lots of questions indicating how to reverse each word in a sentence, and I could readily do this in Python or JavaScript for example, but how can I do it with Linux commands? It looks like tac might be an option, but it seems like this would likely reverse lines as well as words, rather than just words? What other tools can do this? I literally have no idea. I know rev and tac and awk all seem like contenders...
So I'd like to go from:
cat dog sleep
pillow green blue
to:
tac god peels
wollip neerg eulb
Slight follow-up:
From this reference it looks like I could use awk to break each field up into an array of single characters and then write a for loop to manually reverse each word in this way. This is quite awkward. Surely there's a better/more succinct way to do this?
Try this on for size:
sed -e 's/\s\+/ /g' -e 's/ /\n/g' < file.txt | rev | tr '\n' ' ' ; echo
It collapses all the whitespace (the output ends up on a single line) and counts punctuation as part of "words", but it looks like it (at least mostly) works. Hooray for sh!
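If you want to keep the line structure intact, another possible sketch (assuming rev and awk are available) lets rev reverse each whole line and then has awk print the fields in reverse order, which restores the original word order:
rev file.txt | awk '{ for (i = NF; i > 0; i--) printf "%s%s", $i, (i > 1 ? OFS : ORS) }'
For the sample input above this prints "tac god peels" and "wollip neerg eulb" on separate lines.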

How do I do a one way diff in Linux?

Normal behavior of diff:
Normally, diff will tell you all the differences between two files: anything that is in file A but not in file B, and also everything that is in file B but not in file A. For example:
File A contains:
cat
good dog
one
two
File B contains:
cat
some garbage
one
a whole bunch of garbage
something I don't want to know
If I do a regular diff as follows:
diff A B
the output would be something like:
2c2
< good dog
---
> some garbage
4c4,5
< two
---
> a whole bunch of garbage
> something I don't want to know
What I am looking for:
What I want is just the first part: I want to know everything that is in file A but not in file B, and I want it to ignore everything that is in file B but not in file A.
What I want is the command, or series of commands:
???? A B
that produces the output:
2c2
< good dog
4c4,5
< two
I believe a solution could be achieved by piping the output of diff into sed or awk, but I am not familiar enough with those tools to come up with a solution. I basically want to remove all lines that begin with --- and >.
Edit: I edited the example to account for multiple words on a line.
Note: This is a "sub-question" of: Determine list of non-OS packages installed on a RedHat Linux machine
Note: This is similar to, but not the same as, the question asked here (i.e. not a dupe):
One-way diff file
An alternative, if your files consist of single-line entities only, and the output order doesn't matter (the question as worded is unclear on this), would be:
comm -23 <(sort A) <(sort B)
comm requires its inputs to be sorted, and the -2 means "don't show me the lines that are unique to the second file", while -3 means "don't show me the lines that are common between the two files".
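For reference, applied to the sample files A and B from the question, this should print just the lines that are unique to A, in sorted order:
comm -23 <(sort A) <(sort B)
which outputs:
good dog
two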
If you need the "differences" to be presented in the order they occur, though, the diff / grep approach shown in the other answer is ok (although the grep bit isn't really necessary; it could be diff A B | awk '/^</ { $1 = ""; print }').
EDIT: fixed which set of lines to report - I read it backwards originally...
As stated in the comments, one mostly correct answer is
diff A B | grep '^<'
although this would give the output
< good dog
< two
rather than
2c2
< good dog
4c4,5
< two
diff A B|grep '^<'|awk '{print $2}'
grep '^<' selects the rows that start with <
awk '{print $2}' selects the second column
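Since the sample file A has a line with more than one word (good dog), a variant that keeps the whole line instead of only the second column might look like:
diff A B | awk '/^</ { $1 = ""; sub(/^ /, ""); print }'
Here $1 = "" blanks out the leading < marker and sub(/^ /, "") strips the space it leaves behind, so "good dog" and "two" are printed in full.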
If you want to also see the files in question, in case of diffing folders, you can use
diff public_html temp_public_html/ | grep '^[^>]'
to match all but lines starting with >

Grep (a.txt - English word list, b.txt - one string in each line) Q: is each string from b.txt built only from words or not?

I have a list of English words (one per line, around 100,000) in a.txt, and b.txt contains strings (around 50,000 lines, one string per line; a string can be a pure word, a word plus something else, or garbage). I would like to know which strings from b.txt contain English words only (without any additional characters).
Can I do this with grep?
Example:
a.txt:
apple
pie
b.txt:
applepie
applebs
bspie
bsabcbs
Output:
c.txt:
applepie
Since your question is underspecified, maybe this answer can help as a shot in the dark to clarify your question:
c='cat b.txt'
while IFS='' read -r line
do
    c="$c | grep '$line'"
done < a.txt
eval "$c" > c.txt
But this would also match a line like this is my apple on a pie. I don't know if that's what you want.
This is another try:
re=''
while IFS='' read -r line
do
    re="$re${re:+|}$line"
done < a.txt
grep -E "^($re)*$" b.txt > c.txt
This will let pass only the lines which have nothing but a concatenation of these words. But it will also let pass things like 'appleapplepieapplepiepieapple'. Again, I don't know if this is what you want.
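For the small example above, the loop builds the pattern apple|pie, so the final command should be equivalent to:
grep -E '^(apple|pie)*$' b.txt > c.txt
which keeps only applepie. (An empty line would also match, since the pattern allows zero repetitions.)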
Given your latest explanation in the question I would propose another approach (because building such a list out of 100000+ words is not going to work).
A working approach for this amount of words could be to remove all recognized words from the text and see which lines get emptied in the process. This can easily be done iteratively without exploding the memory usage or other resources. It will take time, though.
cp b.txt inprogress.txt
while IFS='' read -r line
do
    sed -i "s/$line//g" inprogress.txt
done < a.txt
for lineNumber in $(grep -n '^$' inprogress.txt | sed 's/://')
do
    sed -n "${lineNumber}p" b.txt
done
rm inprogress.txt
But this still would not really solve your issue. Consider having the words to and potato in your list: if to gets removed first, that leaves pota in your text file, and pota is not a word, so it would never be removed.
You could address that issue by sorting your word file by word length (longest words first), but that would still be problematic for some compound words, e.g. redart (red + art): dart would be removed first, so re would remain, and if re is not in your word list, you would not recognize the word.
Actually, your problem is one of logic programming and natural language processing and probably does not really fit SO. You should have a look at the language Prolog, which is designed around problems like yours.
I will post this as an answer as well since I feel this is the correct answer to your specific question.
Your requirement is to find non-English words in a file (b.txt) based on a word list (a.txt) which contains a list of English words. Based on the example in your question, said word list does not contain compound words (e.g. applepie), but you would still like to match the file against compound words built from words in your word list (e.g. apple and pie).
There are two problems you are facing:
Not every permutation of words in a.txt will be a valid English compound word, so just based on this your problem is already impossible to solve.
If you nonetheless were to attempt building a list of compound words yourself by compiling all possible permutations, you could not easily do this because of the size of your word list (and the resulting memory problems). You would most probably have to store your words in a more complex data structure, e.g. a tree, and build permutations on the fly by traversing it, which is not doable in shell scripting.
Because of these points, and because your actual question is "can this be done with grep?", the answer is no, this is not possible.

How to determine if the content of one file is included in the content of another file

First, my apologies for what is perhaps a rather stupid question that doesn't quite belong here.
Here's my problem: I have two large text files containing a lot of file names, let's call them A and B, and I want to determine whether A is a subset of B, disregarding order, i.e. for each file name in A, check whether it is also in B; if any is not, A is not a subset.
I know how to preprocess the files (to strip everything but the file name itself and normalize capitalization), but now I'm left to wonder if there is a simple way to perform the task with a shell command.
Diff probably doesn't work, right? Even if I 'sort' the two files first, so that at least the files that are present in both will be in the same order, since A is probably a proper subset of B, diff will just tell me that every line is different.
Again, my apologies if the question doesn't belong here, and in the end, if there is no easy way to do it I will just write a small program to do the job, but since I'm trying to get a better handle on the shell commands, I thought I'd ask here first.
Do this:
cat b | sort -u | wc
cat a b | sort -u | wc
If you get the same result, a is a subset of b.
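A compact sketch of the same idea, comparing only the counts of unique lines (assuming the files are simply named a and b):
(( $(sort -u b | wc -l) == $(sort -u a b | wc -l) )) && echo "a is a subset of b"
If a is a subset of b, adding a to the mix cannot introduce any new unique lines, so the two counts are equal.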
Here's how to do it in awk
awk '
    # read A, the supposed subset file
    FNR == NR {a[$0]; next}
    # process file B: remove every line of A that also appears in B
    $0 in a {delete a[$0]}
    END {if (length(a) == 0) {print "A is a subset of B"}}
' A B
Test if an XSD file is a subset of a WSDL file:
xmllint --format file.wsdl | awk '{$1=$1};1' | sort -u | wc
xmllint --format file.wsdl file.xsd | awk '{$1=$1};1' | sort -u | wc
This adapts the elegant concept of RichieHindle's prior answer using:
xmllint --format instead of cat, to pretty print the XML so that each XML element is on one line, as required by sort -u | wc. Other pretty printing commands might work here, e.g. jq . for JSON.
an awk command to normalise the whitespace: strip leading and trailing whitespace (because the indentation is different in both files), and collapse internal whitespace. Caveat: it does not consider XML attribute order within the element.

sort across multiple files in linux

I have multiple (many) files; each very large:
file0.txt
file1.txt
file2.txt
I do not want to join them into a single file because the resulting file would be 10+ Gigs. Each line in each file contains a 40-byte string. The strings are fairly well ordered right now (about 1 in 10 steps is a decrease in value instead of an increase).
I would like the lines ordered (in place, if possible?). This means some of the lines from the end of file0.txt will be moved to the beginning of file1.txt and vice versa.
I am working on Linux and fairly new to it. I know about the sort command for a single file, but am wondering if there is a way to sort across multiple files. Or maybe there is a way to make a pseudo-file made from smaller files that linux will treat as a single file.
What I know I can do:
I can sort each file individually and read through file1.txt to find the values larger than the largest in file0.txt (and similarly grab the lines from the end of file0.txt), join and then sort... but this is a pain and assumes no values from file2.txt belong in file0.txt (however highly unlikely that is in my case).
Edit
To be clear, if the files look like this:
f0.txt
DDD
XXX
AAA
f1.txt
BBB
FFF
CCC
f2.txt
EEE
YYY
ZZZ
I want this:
f0.txt
AAA
BBB
CCC
f1.txt
DDD
EEE
FFF
f2.txt
XXX
YYY
ZZZ
I don't know about a command doing in-place sorting, but I think a faster "merge sort" is possible:
for file in *.txt; do
    sort -o "$file" "$file"
done
sort -m *.txt | split -d -l 1000000 - output
The sort in the for loop makes sure the content of the input files is sorted. If you don't want to overwrite the original, simply change the value after the -o parameter. (If you expect the files to be sorted already, you could change the sort statement to "check-only": sort -c $file || exit 1)
The second sort does efficient merging of the input files, all while keeping the output sorted.
This is piped to the split command which will then write to suffixed output files. Notice the - character; this tells split to read from standard input (i.e. the pipe) instead of a file.
Also, here's a short summary of how the merge sort works:
sort reads a line from each file.
It orders these lines and selects the one which should come first. This line gets sent to the output, and a new line is read from the file which contained this line.
Repeat step 2 until there are no more lines in any file.
At this point, the output should be a perfectly sorted file.
Profit!
It isn't exactly what you asked for, but the sort(1) utility can help, a little, using the --merge option. Sort each file individually, then sort the resulting pile of files:
for f in file*.txt ; do sort -o "$f" < "$f" ; done
sort --merge file*.txt | split -l 100000 - sorted_file
(That's 100,000 lines per output file. Perhaps that's still way too small.)
I believe that this is your best bet, using stock linux utilities:
sort each file individually, e.g. for f in file*.txt; do sort "$f" > "sorted_$f"; done
sort everything using sort -m sorted_file*.txt | split -d -l <lines> - <prefix>, where <lines> is the number of lines per file, and <prefix> is the filename prefix. (The -d tells split to use numeric suffixes).
The -m option to sort lets it know the input files are already sorted, so it can be smart.
mmap() the 3 files; as all lines are 40 bytes long, you can easily sort them in place (SIP :-). Don't forget the msync at the end.
If the files are sorted individually, then you can use sort -m file*.txt to merge them together - read the first line of each file, output the smallest one, and repeat.
