Diff-ing files with Linux command - linux

What Linux command allows me to check whether all the lines in file A exist in file B? (It's almost like a diff, but not quite.) File A contains only unique lines, as does file B.

The comm command compares two sorted files, line by line, and is part of GNU coreutils.
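For instance, a minimal sketch with comm, assuming the files are literally named A and B (comm requires sorted input, which the process substitutions provide):
# Lines present in A but not in B; empty output means every line of A is in B.
comm -23 <(sort A) <(sort B)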

Are you looking for a better diff tool?
https://stackoverflow.com/questions/12625/best-diff-tool

So, what if A has
a
a
b
and B has
a
b
What would you want the output to be (yes or no)?

Use the diff command.
Here is a useful video with complete usage of the diff command in under 3 minutes:
Click Here
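If you do want to stick with diff for this, a hedged sketch on sorted copies of the files (assuming they are named A and B):
# Lines only in A are prefixed with '<', lines only in B with '>'.
diff <(sort A) <(sort B)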

if cat A A B | sort | uniq -c | egrep -e '^[[:space:]]*2[[:space:]]' > /dev/null; then
    echo "A has lines that are not in B."
fi
If you do not redirect the output, you will get a list of all the lines that are in A but not in B (except each line will have a 2 in front of it). This relies on the lines in A being unique, and the lines in B being unique.
If they aren't, and you don't care about counting duplicates, it's relatively simple to transform each file into a list of unique lines using sort and uniq.
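A quick sketch of that deduplication step (the .uniq filenames are just placeholders):
# Collapse duplicate lines in each file before running the check above.
sort -u A > A.uniq
sort -u B > B.uniq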

Related

What is the usage of the sorted command?

I have read most of the examples that come with the sort command. However, I am not sure what the usage of the sort command is in this style:
sort <word> sorted
That would just be two file names, as in
sort file1 file2 file3...
If you pass multiple file names, sort concatenates them and sorts all of them together.
If you're asking how to sort a string with the sort command:
echo "tatoine" | grep -o . | sort | tr -d "\n"
aeinott
because sort operates on lines, so you've got to cut the string into multiple lines with one letter on each (grep -o .), and after sorting you just delete the newlines with the tr command.
Are those < and > symbols explicit, or do they indicate a parameter that is to be replaced? If the former, then you're reading from a file called "word" and writing the sorted data to a file called "sorted".
Are you trying to save the content in a sorted order?
Let's say you have a file name.txt with the following content.
Zoe
John
Amy
Mary
Mark
Peter
You can use the sort command "sort name.txt" and the output goes to the console
You can save the output using "sort name.txt -o sortedname.txt"
e.g.
Amy
John
Mark
Mary
Peter
Zoe
You can find more options with the commands "man sort" and "info sort"
rojomoke was right about the > and < symbols. Those are redirection operators.
We usually read data from standard input (stdin), and output goes to standard output (stdout), i.e. the screen.
< means get the input from somewhere else, e.g. a file.
> means redirect the output to somewhere else, e.g. a file.
So the command above, "sort name.txt -o sortedname.txt", could also have been written as follows.
sort < name.txt > sortedname.txt
You can read more about the redirection in this wiki entry.
https://en.wikipedia.org/wiki/Redirection_(computing)
Operators like | and >> will come in handy down the road.
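For example, a couple of hedged one-liners using the same hypothetical name.txt:
sort name.txt >> sortedname.txt   # >> appends to sortedname.txt instead of overwriting it
sort name.txt | head -n 3         # | feeds the sorted output into another command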

How can I find which lines in a certain file are not started by lines from another file using bash?

I have two text files, A and B:
A:
a start
b stop
c start
e start
B:
b
c
How can I find which lines in A do not start with a line from B, using a shell (bash...) command? In this case, I want to get this answer:
a start
e start
Can I implement this using a single command line?
This should do:
sed '/^$/d;s/^/^/' B | grep -vf - A
The sed command takes all non-empty lines (that's the /^$/d command) from file B and prepends a caret ^ to each line (so as to obtain an anchor for grep's regexp), then writes all this to stdout. grep then takes its patterns from a file thanks to the -f option (the file happens to be stdin here, thanks to the - symbol) and does an inverted match (thanks to the -v option) on file A. Done.
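To illustrate with the example B above, the pattern list that grep receives on stdin would be:
^b
^c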
I think this should do it:
sed 's/^/\^/g' B > C.tmp
grep -vEf C.tmp A
rm C.tmp
You can try using a combination of cat, sed, and grep.
Save the first letters of each line of B into FIRSTLETTERLIST (a plain string of characters such as "bc"). You can do this with some cat and sed work.
The idea is to take the blacklist and then match it against the interesting file.
grep "^[^$FIRSTLETTERLIST]" file1.txt
This is untested, so I won't guarantee it will work, but it should point you in the right direction.

How to find the particular text stored in the file "data.txt" that occurs only once

The line I seek is stored in the file data.txt and is the only line of text that occurs only once.
How do I go about finding that particular line using Linux?
This is a little bit old, but I think you are looking for this...
cat data.txt | sort | uniq -u
This will show the unique values that occur only once in the file. I assume you are familiar with OverTheWire if you are asking? If so, this is what you are looking for.
To provide some context (I need more rep to comment) this is a question that features in an online "wargame" called Bandit that involves using the command line to discover passwords on an online Linux server to advance up the levels.
For those who would like to see data.txt in full, I've Pastebin'd it here; it looks like this:
NN4e37KW2tkIb3dC9ZHyOPdq1FqZwq9h
jpEYciZvDIs6MLPhYoOGWQHNIoQZzE5q
3rpovhi1CyT7RUTunW30goGek5Q5Fu66
JOaWd4uAPii4Jc19AP2McmBNRzBYDAkO
JOaWd4uAPii4Jc19AP2McmBNRzBYDAkO
9WV67QT4uZZK7JHwmOH0jnhurJMwoGZU
a2GjmWtTe3tTM0ARl7TQwraPGXgfkH4f
7yJ8imXc7NNiovDuAl1ZC6xb0O0mMBx1
UsvVyFSfZZWbi6wgC7dAFyFuR6jQQUhR
FcOJhZkHlnwqcD8QbvjRyn886rCrnWZ7
E3ugYDa6Wh2y8C8xQev7vOS8O3OgG1Hw
E3ugYDa6Wh2y8C8xQev7vOS8O3OgG1Hw
ME7nnzbId4W3dajsl6Xtviyl5uhmMenv
J5lN3Qe4s7ktiwvcCj9ZHWrAJcUWEhUq
aouHvjzagN8QT2BCMB6e9rlN4ffqZ0Qq
ZRF5dlSuwuVV9TLhHKvPvRDrQ2L5ODfD
9ZjR3NTHue4YR6n4DgG5e0qMQcJjTaiM
QT8Bw9ofH4x3MeRvYAVbYvV1e1zq3Xim
i6A6TL6nqvjCAPvOdXZWjlYgyvqxmB7k
tx7tQ6kgeJnC446CHbiJY7fyRwrwuhrs
One way to do it is to use:
sort data.txt | uniq -u
The sort command is like cat in that it displays the contents of the file, but it sorts the file lexicographically by line (it reorders the lines alphabetically so that matching ones are together).
The | is a pipe that redirects the output from one command into another.
The uniq command reports or omits repeated lines and by passing it the -u argument we tell it to report only unique lines.
Used together like this, the command will sort data.txt lexicographically by each line, find the unique line and print it back in the terminal for you.
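Note that sorting first matters because uniq only collapses adjacent duplicates; a quick sketch of the difference:
uniq -u data.txt          # may still print repeated lines if the duplicates are not adjacent
sort data.txt | uniq -u   # prints only the line that occurs exactly once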
sort -u data.txt | while read -r line; do if [ "$(grep -c "$line" data.txt)" -eq 1 ]; then echo "$line"; fi; done
was my solution, until I saw the easier one here:
sort data.txt | uniq -u
Add more information to your post.
What does data.txt look like?
Like this:
11111111
11111111
pass1111
11111111
Or like this
afawfdgd
password
somethin
gelse...
Also, do you know whether the password is in the file, or are you searching for a non-repeated string?
If you know the password, use something like this:
cat data.txt | grep 'password'
If you don't know the password and the password is the only unique line in the file, you must create a script.
For example, in Python:
with open("data.txt") as f:
    for line in f:
        if 'pass' in line:
            print(line)
Of course, replace 'pass' with something else, for example some slice of the line.
And here is one using only a single tool, awk:
awk '{a[$1]++}END{for(i in a){if(a[i] == 1){print i} }}' data.txt
sort data.txt | uniq -c | grep '^ *1 '
and it will print only the text that occurs exactly once
(the pattern matches a count of exactly 1 at the start of uniq -c's output)
sort data.txt | uniq -c | grep ' 1 '
you will find the only one that occurs one time

How to compare two text files for the same exact text using BASH?

Let's say I have two text files that I need to extract data out of. The text of the two files is as follows:
File 1:
1name - randomemail#email.com
2Name - superrandomemail#email.com
3Name - 123random#email.com
4Name - random123#email.com
File 2:
email.com
email.com
email.com
anotherwebsite.com
File 2 is File 1's list of domain names, extracted from the email addresses.
These are not the same domain names by any means, and are quite random.
How can I get the results of the domain names that match File 2 from File 1?
Thank you in advance!
Assuming that order does not matter,
grep -F -f FILE2 FILE1
should do the trick. (This works because of a little-known fact: the -F option to grep doesn't just mean "match this fixed string," it means "match any of these newline-separated fixed strings.")
The recipe:
join <(sed 's/^.*#//' file1|sort -u) <(sort -u file2)
It will output the intersection of all domain names in file1 and file2.
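With the example files above (assuming they are saved as file1 and file2), the only domain present in both is email.com, so the expected output is simply:
email.com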
See BashFAQ/036 for the list of usual solutions to this type of problem.
Use the vimdiff command; it gives a nice presentation of the differences.
If I got you right, you want to filter for all addresses with the host mentioned in File 2.
You could then just loop over File 2 and grep for #<line>, accumulating the result in a new file or something similar.
Example:
cat file2 | sort -u | while read host; do grep "#$host" file1; done > filtered

Sort & uniq in Linux shell

What is the difference between the following two commands?
sort -u FILE
sort FILE | uniq
Using sort -u does less I/O than sort | uniq, but the end result is the same. In particular, if the file is big enough that sort has to create intermediate files, there's a decent chance that sort -u will use slightly fewer or slightly smaller intermediate files as it could eliminate duplicates as it is sorting each set. If the data is highly duplicative, this could be beneficial; if there are few duplicates in fact, it won't make much difference (definitely a second order performance effect, compared to the first order effect of the pipe).
Note that there are times when the piping is appropriate. For example:
sort FILE | uniq -c | sort -n
This sorts the file into order of the number of occurrences of each line in the file, with the most repeated lines appearing last. (It wouldn't surprise me to find that this combination, which is idiomatic for Unix or POSIX, can be squished into one complex 'sort' command with GNU sort.)
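A quick, hedged demonstration of that idiom on a throwaway file (demo.txt is a made-up name):
printf 'b\na\nb\nb\na\nc\n' > demo.txt
sort demo.txt | uniq -c | sort -n
# prints: 1 c, then 2 a, then 3 b (least repeated first, most repeated last)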
There are times when not using the pipe is important. For example:
sort -u -o FILE FILE
This sorts the file 'in situ'; that is, the output file is specified by -o FILE, and this operation is guaranteed safe (the file is read before being overwritten for output).
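For contrast, a naive attempt to do the same thing with the pipe is unsafe, because the shell truncates the output file before sort ever reads it (a cautionary sketch, not from the original answer):
sort FILE | uniq > FILE   # DANGER: the redirection empties FILE before sort reads it
sort -u -o FILE FILE      # safe: sort reads FILE completely before writing to it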
There is one slight difference: the return code.
The thing is that, unless pipefail is set (set -o pipefail), the return code of a pipeline is the return code of the last command, and uniq always returns zero (success). Try examining the exit code, and you'll see something like this (pipefail is not set here):
pavel@lonely ~ $ sort -u file_that_doesnt_exist ; echo $?
sort: open failed: file_that_doesnt_exist: No such file or directory
2
pavel@lonely ~ $ sort file_that_doesnt_exist | uniq ; echo $?
sort: open failed: file_that_doesnt_exist: No such file or directory
0
Other than this, the commands are equivalent.
Beware! While it's true that "sort -u" and "sort|uniq" are equivalent, any additional options to sort can break the equivalence. Here's an example from the coreutils manual:
For example, 'sort -n -u' inspects only the value of the initial numeric string when checking for uniqueness, whereas 'sort -n | uniq' inspects the entire line.
Similarly, if you sort on key fields, the uniqueness test used by sort won't necessarily look at the entire line anymore. After being bitten by that bug in the past, these days I tend to use "sort|uniq" when writing Bash scripts. I'd rather have higher I/O overhead than run the risk that someone else in the shop won't know about that particular pitfall when they modify my code to add additional sort parameters.
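A small, hedged demonstration of that pitfall (nums.txt is a made-up sample file):
printf '1 apple\n1 banana\n' > nums.txt
sort -n -u nums.txt       # prints only one of the two lines: uniqueness is checked on the numeric key alone
sort -n nums.txt | uniq   # prints both lines: uniq compares the entire line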
sort -u will be slightly faster, because it does not need to pipe the output between two commands.
Also see my question on the topic: calling uniq and sort in different orders in shell.
I have worked on some servers where sort doesn't support the '-u' option. There we have to use
sort xyz | uniq
Nothing, they will produce the same result
