Inner join on two text files - linux

Looking to perform an inner join on two different text files. Basically I'm looking for the inner join equivalent of the GNU join program. Does such a thing exist? If not, an awk or sed solution would be most helpful, but my first choice would be a Linux command.
Here's an example of what I'm looking to do
file 1:
0|Alien Registration Card LUA|Checklist Update
1|Alien Registration Card LUA|Document App Plan
2|Alien Registration Card LUA|SA Application Nbr
3|Alien Registration Card LUA|tmp_preapp-DOB
0|App - CSCE Certificate LUA|Admit Type
1|App - CSCE Certificate LUA|Alias 1
2|App - CSCE Certificate LUA|Alias 2
3|App - CSCE Certificate LUA|Alias 3
4|App - CSCE Certificate LUA|Alias 4
file 2:
Alien Registration Card LUA
Results:
0|Alien Registration Card LUA|Checklist Update
1|Alien Registration Card LUA|Document App Plan
2|Alien Registration Card LUA|SA Application Nbr
3|Alien Registration Card LUA|tmp_preapp-DOB

Here's an awk option, so you can avoid the bash dependency (for portability):
$ awk -F'|' 'NR==FNR{check[$0];next} $2 in check' file2 file1
How does this work?
-F'|' -- sets the field separator
'NR==FNR{check[$0];next} -- if the total record number matches the file record number (i.e. we're reading the first file provided), then we populate an array and continue.
$2 in check -- If the second field was mentioned in the array we created, print the line (which is the default action if no actions are provided).
file2 file1 -- the files. Order is important due to the NR==FNR construct.
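Run against the sample files in the question, it prints exactly the Results shown above:
$ awk -F'|' 'NR==FNR{check[$0];next} $2 in check' file2 file1
0|Alien Registration Card LUA|Checklist Update
1|Alien Registration Card LUA|Document App Plan
2|Alien Registration Card LUA|SA Application Nbr
3|Alien Registration Card LUA|tmp_preapp-DOB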

Should not the file2 contain LUA at the end?
If yes, you can still use join:
join -t'|' -12 <(sort -t'|' -k2 file1) file2
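Note that join expects both inputs to be sorted on the join field; file2 here is a single line and therefore already sorted, but with a multi-line file2 a sketch would be:
join -t'|' -12 <(sort -t'|' -k2 file1) <(sort file2)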

Looks like you just need
grep -F -f file2 file1

You may modify this script:
cat file2 | while read line; do
grep "$line" file1 # or whatever you want to do with the $line variable
done
The while loop reads file2 line by line and passes each line to grep, which searches for it in file1. There may be some extra output, which can be removed with grep options.
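A slightly safer variant of the same loop (a sketch: -F treats each line as a fixed string rather than a regex, -- guards against lines starting with a dash, and read -r avoids backslash mangling):
while read -r line; do
    grep -F -- "$line" file1
done < file2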

You can use the paste command to combine files:
paste [options] source-files [> destination-file]
For your example it would be:
paste file1.txt file2.txt >result.txt

Related

How to simply keep up-to-date records in file

I would like to keep up-to-date records in file A, given file B, using bash on Linux.
Both the A and B files have the same structure.
Each line of a file is a record consisting of a public key and a comment, separated by a space. The comment has the form user#hostname and is unique within the file.
Example:
B file
xxxxxx user1#hostname1
yyyyyy user2#hostname2
wwwwww user3#hostname3
A file
yxxxxx user1#hostname1
zzzzzz user4#hostname4
yyyyyy user2#hostname2
Which should result in:
A file
xxxxxx user1#hostname1
zzzzzz user4#hostname4
yyyyyy user2#hostname2
wwwwww user3#hostname3
I know I can read file B line by line and check whether file A contains a record with the same comment: if not, append the record; if yes, check whether to update it. However, that involves multiple lines of bash script.
Can it be done simpler?
A little awk script
awk '
NR == FNR {print; seen[$2]; next}
!($2 in seen)
' B A
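Against the sample files from the question, it prints B's records first and then the record that exists only in A:
$ awk 'NR == FNR {print; seen[$2]; next} !($2 in seen)' B A
xxxxxx user1#hostname1
yyyyyy user2#hostname2
wwwwww user3#hostname3
zzzzzz user4#hostname4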
And to save the changes back to file A, pick one of
awk '...' B A | sponge A # from the `moreutils` package
tmp=$(mktemp)
awk '...' B A > "$tmp" && mv "$tmp" A
Yet another way to get the same result records, only sorted:
join -a1 -a2 -j2 <(sort -k2 B) <(sort -k2 A) | awk '{print $2, $1}'
I suppose A is the original list and B is the update list.
Find users in B for which updates exist.
$ cut -d ' ' -f 2 B
user1#hostname1
user2#hostname2
user3#hostname3
Take A and remove all lines with users from B. These are the lines of A for which no update exists.
$ grep -v -f <(cut -d ' ' -f 2 B) A
zzzzzz user4#hostname4
Append B to the above list:
$ grep -v -f <(cut -d ' ' -f 2 B) A; cat B
zzzzzz user4#hostname4
xxxxxx user1#hostname1
yyyyyy user2#hostname2
wwwwww user3#hostname3
Notice: the above works only as long as no comment is a substring of another comment. If that cannot be guaranteed, you have to use extended regular expressions with word boundaries.
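A tighter variant (a sketch, assuming the comment is always the last space-separated field and contains no regex metacharacters that would need escaping) anchors each pattern to the end of the line:
grep -v -f <(cut -d ' ' -f 2 B | sed 's/^/ /; s/$/$/') A; cat B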
You can use the diff command with grep:
diff a.txt b.txt | grep -Po "^(<|>) \K.*"
If you are looking for a way to manage SSH keys in an authorized_keys file, my suggestion would be to generate this file from the *.pub keys in the current directory. Now the problem is reduced to adding a new key file, or removing or renaming a key file you want to exclude, and rerunning
cat *.pub >authorized_keys
(perhaps by way of a Makefile if that is a mechanism your users are familiar and comfortable with).
Obviously, there is a usability problem for users who forget or are unaware of this mechanism; but in many environments, this is acceptable and manageable with documentation and training.
The general mechanism of splitting monolithic configuration or data files into individual smaller files with simple fragments you can enable or disable individually is a good one to know about anyway. It is used in many places e.g. in Debian (see also run-parts for example) and systemd, so it should be easily recognizable and appreciated by admins.

Delete lines from a file matching first 2 fields from a second file in shell script

Suppose I have setA.txt:
a|b|0.1
c|d|0.2
b|a|0.3
and I also have setB.txt:
c|d|200
a|b|100
Now I want to delete from setA.txt the lines whose first two fields match a line in setB.txt, so the output should be:
b|a|0.3
I tried:
comm -23 <(sort setA.txt) <(sort setB.txt)
But comm compares whole lines, so it won't work. How can I do this?
$ awk -F\| 'FNR==NR{seen[$1,$2]=1;next;} !seen[$1,$2]' setB.txt setA.txt
b|a|0.3
This reads through setB.txt just once, extracts the needed information from it, and then reads through setA.txt while deciding which lines to print.
How it works
-F\|
This sets the field separator to a vertical bar, |.
FNR==NR{seen[$1,$2]=1;next;}
FNR is the number of lines read so far from the current file and NR is the total number of lines read. Thus, when FNR==NR, we are reading the first file, setB.txt. If so, set the value of associative array seen to true, 1, for the key consisting of fields one and two. Lastly, skip the rest of the commands and start over on the next line.
!seen[$1,$2]
If we get to this command, we are working on the second file, setA.txt. Since ! means negation, the condition is true if seen[$1,$2] is false which means that this combination of fields one and two was not in setB.txt. If so, then the default action is performed which is to print the line.
This should work:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p' setB.txt |sed -f- setA.txt
How this works:
sed -n 's#\(^[^|]*|[^|]*\)|.*#/^\1/d#p'
generates an output:
/^c|d/d
/^a|b/d
which is then used as a sed script for the next sed after the pipe and outputs:
b|a|0.3
(IFS=$'|'; cat setA.txt | while read x y z; do grep -q -P "\Q$x|$y|\E" setB.txt || echo "$x|$y|$z"; done; )
Explanation: grep -q means only test whether the regexp matches, but do not output anything; -P means use Perl syntax, so that the | is matched literally because of the \Q...\E construct.
IFS=$'|' makes bash use | instead of whitespace (space, tab, etc.) as the token separator.
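An equivalent without PCRE (a sketch; -F makes grep treat the pattern as a fixed string, which serves the same purpose as \Q...\E here, and read -r avoids backslash mangling):
(IFS='|'; while read -r x y z; do
    grep -q -F "$x|$y|" setB.txt || echo "$x|$y|$z"
done < setA.txt)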

What is the usage of sorted command?

I have read most of the examples that come with the sort command. However, I am not sure what the usage of the sort command is in this style:
sort <word> sorted
That would just be two file names, as in
sort file1 file2 file3...
If you pass multiple file names, sort concatenates them and sorts all of them together.
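For example (with hypothetical file contents):
$ cat list1
pear
apple
$ cat list2
banana
$ sort list1 list2
apple
banana
pear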
If you're asking how to sort a string with the sort command:
echo "tatoine" | grep -o . | sort | tr -d "\n"
aeinott
because sort operates on lines, you've got to split the string into multiple lines with one letter on each (grep -o .), and after sorting you just delete the newlines with the tr command.
Are those < and > symbols explicit, or do they indicate a parameter that is to be replaced? If the former, then you're reading from a file called "word" and writing the sorted data to a file called "sorted".
Are you trying to save the content in a sorted order?
Let's say you have a file name.txt with the following content.
Zoe
John
Amy
Mary
Mark
Peter
You can use the sort command "sort name.txt" and the output goes to the console.
You can save the output using "sort name.txt -o sortedname.txt"
e.g.
Amy
John
Mark
Mary
Peter
Zoe
You can find more options with the commands "man sort" and "info sort".
rojomoke was right about the > and < symbols. Those are redirection operators.
We usually read data from standard input (stdin), and output goes to standard output, a.k.a. the screen (stdout).
< means get the data from somewhere else. e.g. a file.
> means redirect the output to somewhere else e.g. a file.
So for the command above "sort name.txt -o sortedname.txt", I could have written as follow.
sort < name.txt > sortedname.txt
You can read more about the redirection in this wiki entry.
https://en.wikipedia.org/wiki/Redirection_(computing)
Operators like | and >> will come in handy down the road.

How to find the particular text stored in the file "data.txt" when it occurs only once

The line I seek is stored in the file data.txt and is the only line of text that occurs only once.
How do I go about finding that particular line using Linux?
This is a little bit old, but I think you are looking for this...
cat data.txt | sort | uniq -u
This will show only the values that occur exactly once in the file. I assume you are familiar with OverTheWire if you are asking; if so, this is what you are looking for.
To provide some context (I need more rep to comment): this is a question that features in an online "wargame" called Bandit, which involves using the command line to discover passwords on an online Linux server to advance up the levels.
For those who would like to see data.txt in full, I've Pastebin'd it here; however, it looks like this:
NN4e37KW2tkIb3dC9ZHyOPdq1FqZwq9h
jpEYciZvDIs6MLPhYoOGWQHNIoQZzE5q
3rpovhi1CyT7RUTunW30goGek5Q5Fu66
JOaWd4uAPii4Jc19AP2McmBNRzBYDAkO
JOaWd4uAPii4Jc19AP2McmBNRzBYDAkO
9WV67QT4uZZK7JHwmOH0jnhurJMwoGZU
a2GjmWtTe3tTM0ARl7TQwraPGXgfkH4f
7yJ8imXc7NNiovDuAl1ZC6xb0O0mMBx1
UsvVyFSfZZWbi6wgC7dAFyFuR6jQQUhR
FcOJhZkHlnwqcD8QbvjRyn886rCrnWZ7
E3ugYDa6Wh2y8C8xQev7vOS8O3OgG1Hw
E3ugYDa6Wh2y8C8xQev7vOS8O3OgG1Hw
ME7nnzbId4W3dajsl6Xtviyl5uhmMenv
J5lN3Qe4s7ktiwvcCj9ZHWrAJcUWEhUq
aouHvjzagN8QT2BCMB6e9rlN4ffqZ0Qq
ZRF5dlSuwuVV9TLhHKvPvRDrQ2L5ODfD
9ZjR3NTHue4YR6n4DgG5e0qMQcJjTaiM
QT8Bw9ofH4x3MeRvYAVbYvV1e1zq3Xim
i6A6TL6nqvjCAPvOdXZWjlYgyvqxmB7k
tx7tQ6kgeJnC446CHbiJY7fyRwrwuhrs
One way to do it is to use:
sort data.txt | uniq -u
The sort command is like cat in that it displays the contents of the file, but it sorts the file lexicographically by lines (it reorders them so that matching lines end up next to each other).
The | is a pipe that redirects the output from one command into another.
The uniq command reports or omits repeated lines and by passing it the -u argument we tell it to report only unique lines.
Used together like this, the command will sort data.txt lexicographically by each line, find the unique line and print it back in the terminal for you.
sort -u data.txt | while read -r line; do if [ "$(grep -c "$line" data.txt)" = 1 ]; then echo "$line"; fi; done
was my solution, until I saw the easier one here:
sort data.txt | uniq -u
Add more information to your post.
What does data.txt look like?
Like this:
11111111
11111111
pass1111
11111111
Or like this
afawfdgd
password
somethin
gelse...
Also, do you know that the password is in the file, or are you searching for a string that does not repeat?
If you know the password, use something like this:
cat data.txt | grep 'password'
If you don't know the password and the password is the only unique line in the file, you must create a script.
For example in Python
with open("data.txt", "r") as f:
    for line in f:
        if 'pass' in line:
            print(line)
Of course, replace 'pass' with something else, for example some slice of the line.
And one with only one tool in use, awk:
awk '{a[$1]++}END{for(i in a){if(a[i] == 1){print i} }}' data.txt
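If the lines could contain spaces, a variant keyed on the whole line rather than the first field (an assumption; the sample data has one field per line, so both behave the same there) would be:
awk '{a[$0]++} END{for (i in a) if (a[i] == 1) print i}' data.txt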
sort data.txt | uniq -c | grep 1\ ?*
and it will print the only text that occurs exactly once.
Do not forget to put a space after the backslash.
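A tighter pattern (a sketch) anchors on the count column that uniq -c prepends, so counts such as 11 or 21 are not matched by accident:
sort data.txt | uniq -c | grep -E '^ *1 '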
sort data.txt | uniq -c | grep 1
you will find the only one that occurs one time

How to compare two text files for the same exact text using BASH?

Let's say I have two text files that I need to extract data out of. The text of the two files is as follows:
File 1:
1name - randomemail#email.com
2Name - superrandomemail#email.com
3Name - 123random#email.com
4Name - random123#email.com
File 2:
email.com
email.com
email.com
anotherwebsite.com
File 2 is File 1's list of domain names, extracted from the email addresses.
These are not the same domain names by any means, and are quite random.
How can I get the results of the domain names that match File 2 from File 1?
Thank you in advance!
Assuming that order does not matter,
grep -F -f FILE2 FILE1
should do the trick. (This works because of a little-known fact: the -F option to grep doesn't just mean "match this fixed string," it means "match any of these newline-separated fixed strings.")
The recipe:
join <(sed 's/^.*#//' file1|sort -u) <(sort -u file2)
It will output the intersection of the domain names in file1 and file2.
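With the sample files above, the run would look like this (only one domain is shared):
$ join <(sed 's/^.*#//' file1 | sort -u) <(sort -u file2)
email.com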
See BashFAQ/036 for the list of usual solutions to this type of problem.
Use the vimdiff command; it gives a nice presentation of the differences.
If I got you right, you want to filter for all addresses with the host mentioned in File 2.
You could then just loop over File 2 and grep for #<line>, accumulating the result in a new file or something similar.
Example:
cat file2 | sort -u | while read host; do grep "#$host" file1; done > filtered
