How to determine if the content of one file is included in the content of another file - linux

First, my apologies for what is perhaps a rather stupid question that doesn't quite belong here.
Here's my problem: I have two large text files containing a lot of file names, let's call them A and B, and I want to determine if A is a subset of B, disregarding order, i.e. for each file name in A, find if file name is also in B, otherwise A is not a subset.
I know how to preprocess the files (to remove anything but the file name itself, removing different capitalization), but now I'm left to wonder if there is a simple way to perform the task with a shell command.
Diff probably doesn't work, right? Even if I 'sort' the two files first, so that at least the files that are present in both will be in the same order, since A is probably a proper subset of B, diff will just tell me that every line is different.
Again, my apologies if the question doesn't belong here, and in the end, if there is no easy way to do it I will just write a small program to do the job, but since I'm trying to get a better handle on the shell commands, I thought I'd ask here first.

Do this:
cat b | sort -u | wc
cat a b | sort -u | wc
If you get the same result, a is a subset of b.
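A minimal sketch of automating that comparison in a shell test (same file names, comparing unique line counts):
if [ $(sort -u b | wc -l) -eq $(sort -u a b | wc -l) ]; then
    echo "a is a subset of b"
fi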

Here's how to do it in awk
awk '
# read A, the supposed subset file
FNR == NR {a[$0]; next}
# process file B
$0 in a {delete a[$0]}
END {if (length(a) == 0) {print "A is a subset of B"}}
' A B
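If the files have already been preprocessed to one name per line, comm can perform the same check (a sketch; the <( ) process substitution assumes bash or a similar shell):
comm -23 <(sort -u A) <(sort -u B)
Empty output means every line of A also appears in B.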

Test if an XSD file is a subset of a WSDL file:
xmllint --format file.wsdl | awk '{$1=$1};1' | sort -u | wc
xmllint --format file.wsdl file.xsd | awk '{$1=$1};1' | sort -u | wc
This adapts the elegant concept of RichieHindle's prior answer using:
xmllint --format instead of cat, to pretty-print the XML so each XML element is on its own line, as required by sort -u | wc. Other pretty-printing commands might work here, e.g. jq . for JSON.
an awk command to normalise the whitespace: strip leading and trailing whitespace (because the indentation differs between the two files) and collapse internal whitespace. Caveat: this does not account for XML attribute order within an element.
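For JSON, the analogous check might look like this (a sketch only; jq . pretty-prints JSON so each key/value lands on its own line, and big.json / small.json are hypothetical file names):
jq . big.json | awk '{$1=$1};1' | sort -u | wc
jq . big.json small.json | awk '{$1=$1};1' | sort -u | wc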

Related

Grep multiple expressions in one pass, output matches to each expression in separate file

I want to use grep on the top utility. Iterating through top 5 times, here are my criteria:
grep two different expressions in a single pass on the top output
expression 1: grep for the line on overall cpu usage, then output it to file: cpu_stats.txt
expression 2: grep for the line on overall memory usage, then output it to file: memory_stats.txt
Here is what I have now:
top -b -n 5 | egrep "\%Cpu\(s\):|KiB Mem :" > both_cpu_and_memory.txt
This successfully grabs the desired top output, but notice it is putting both expression matches in the exact same file.
Where I am stuck: I do not know how to, in a single pass, output the matches from one expression to one file, and how to output the matches from the other expression to another file.
Is this possible? Is it possible to grep multiple expressions in one pass and have the matches for each expression written to a separate file?
Unless you do something convoluted with saving the output in a temporary file and making a couple of passes over it, you can't do what you want with grep. It's really easy with awk, though:
top -b -n 5 | awk '/%Cpu\(s\):/ { print > "cpu_stats.txt" }
/KiB Mem :/ { print > "memory_stats.txt" }'
grep cannot do what you ask. This is a job for awk, or indeed any other scripting language capable of parsing input or using regular expressions.
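That said, if you are willing to run two greps on a duplicated stream, a possible single-pass workaround (a sketch, assuming bash, which provides process substitution) is:
top -b -n 5 | tee >(grep '%Cpu(s):' > cpu_stats.txt) | grep 'KiB Mem :' > memory_stats.txt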

Prefix search names to output in bash

I have a simple egrep command searching for multiple strings in a text file which outputs either null or a value. Below is the command and the output.
cat Output.txt|egrep -i "abc|def|efg"|cut -d ':' -f 2
Output is:
xxx
(null)
yyy
Now, I am trying to prefix each search term to its output, like below.
abc:xxx
def:
efg:yyy
Any help on the code to achieve this or where to start would be appreciated.
-Abhi
Since I do not know your exact input file content (it is not specified properly in the question), I will make some assumptions in order to answer your question.
Case 1: the patterns you are looking for are always located in the same column
If it is the case, the answer is quite straightforward:
$ cat grep_file.in
abc:xxx:uvw
def:::
efg:yyy:toto
xyz:lol:hey
$ egrep -i "abc|def|efg" grep_file.in | cut -d':' -f1,2
abc:xxx
def:
efg:yyy
After the grep, just use cut with the two columns you are looking for (here, fields 1 and 2).
REMARK:
Do not cat the file and then pipe it into grep; that does the work twice! Your grep command will already read the file, so do not read it twice. It might not matter much on small files, but you will feel the difference on a 10 GB file, for example.
Case 2: the patterns you are looking for are NOT located in the same column
In this case it is a bit more tricky, but not impossible. There are many ways of doing it; here I will detail the awk way:
$ cat grep_file2.in
abc:xxx:uvw
::def:
efg:yyy:toto
xyz:lol:hey
If your input file is in this format, with your pattern possibly located anywhere:
$ awk 'BEGIN{FS=":";ORS=FS}{tmp=0;for(i=1;i<=NF;i++){tmp=match($i,/abc|def|efg/);if(tmp){print $i;break}}if(tmp){printf "%s\n", $2}}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
FS=":";ORS=FS define your input/output field separator at : Then on each line you define a test variable that will become true when you reach your pattern, you loop on all the fields of the line until you reach it if it is the case you print it, break the loop and print the second field + an EOL char.
If you do not meet your pattern you do nothing.
If you prefer the sed way, you can use the following command:
$ sed -n '/abc\|def\|efg/{h;s/.*\(abc\|def\|efg\).*/\1:/;x;s/^[^:]*:\([^:]*\):.*/\1/;H;x;s/\n//p}' grep_file2.in
abc:xxx
def:
efg:yyy
Explanations:
/abc\|def\|efg/{} filters the lines that contain at least one of the given patterns and runs the instructions in the block on them. h;s/.*\(abc\|def\|efg\).*/\1:/; saves the line in the hold space and replaces the line with the matched pattern followed by a colon. x;s/^[^:]*:\([^:]*\):.*/\1/; exchanges the pattern and hold spaces and extracts the second column. Last but not least, H;x;s/\n//p regroups both extracted elements on one line and prints it.
Try this:
$ egrep -io "(abc|def|efg):[^:]*" file
This will print the match and the next token after the delimiter.
If we can assume that there are only two fields, that abc etc will always match in the first field, and that getting the last match on a line which contains multiple matches is acceptable, a very simple sed script could work.
sed -n 's/^[^:]*\(abc\|def\|efg\)[^:]*:\([^:]*\)/\1:\2/p' file
If other but similar conditions apply (e.g. there are three fields or more but we don't care about matches in the first two) the required modifications are trivial. If not, you really need to clarify your question.

How do I do a one way diff in Linux?

Normal behavior of diff:
Normally, diff will tell you all the differences between two files: everything that is in file A but not in file B, and everything that is in file B but not in file A. For example:
File A contains:
cat
good dog
one
two
File B contains:
cat
some garbage
one
a whole bunch of garbage
something I don't want to know
If I do a regular diff as follows:
diff A B
the output would be something like:
2c2
< good dog
---
> some garbage
4c4,5
< two
---
> a whole bunch of garbage
> something I don't want to know
What I am looking for:
What I want is just the first part: I want to know everything that is in file A but not in file B, and I want it to ignore everything that is in file B but not in file A.
What I want is the command, or series of commands:
???? A B
that produces the output:
2c2
< good dog
4c4,5
< two
I believe a solution could be achieved by piping the output of diff into sed or awk, but I am not familiar enough with those tools to come up with a solution. I basically want to remove all lines that begin with --- and >.
Edit: I edited the example to account for multiple words on a line.
Note: This is a "sub-question" of: Determine list of non-OS packages installed on a RedHat Linux machine
Note: This is similar to, but not the same as, the question asked here (i.e. not a dupe):
One-way diff file
An alternative, if your files consist of single-line entities only, and the output order doesn't matter (the question as worded is unclear on this), would be:
comm -23 <(sort A) <(sort B)
comm requires its inputs to be sorted, and the -2 means "don't show me the lines that are unique to the second file", while -3 means "don't show me the lines that are common between the two files".
If you need the "differences" to be presented in the order they occur, though, the above diff / awk solution is ok (although the grep bit isn't really necessary - it could be diff A B | awk '/^</ { $1 = ""; print }'.
EDIT: fixed which set of lines to report - I read it backwards originally...
As stated in the comments, one mostly correct answer is
diff A B | grep '^<'
although this would give the output
< good dog
< two
rather than
2c2
< good dog
4c4,5
< two
diff A B|grep '^<'|awk '{print $2}'
grep '^<' selects the rows that start with <
awk '{print $2}' prints the second column
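Note that $2 keeps only the first word of each line; if the lines can contain several words, a hedged alternative is to strip the two-character "< " prefix instead:
diff A B | grep '^<' | cut -c3-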
If you want to also see the files in question, in case of diffing folders, you can use
diff public_html temp_public_html/ | grep '^[^>]'
to match all but lines starting with >

egrep not writing to a file

I am using the following command in order to extract domain names & the full domain extension from a file. Ex: www.abc.yahoo.com, www.efg.yahoo.com.us.
egrep '[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sort | uniq | sed -e 's/www.//' > dest.txt
The command writes correctly when I specify a small maximum match count, e.g. -m 100, after source.txt. The problem occurs if I don't specify it, or if I specify a huge number. However, I could previously write to files with grep (not egrep) using similarly huge numbers, and that was successful. I also checked the last-modified date and time while the command was executing, and it seems no modification happens to the destination file. What could be the problem?
As I mentioned in your earlier question, it's probably not an issue with egrep, but that your file is too big and that sort won't output anything (to uniq) until egrep is done. I suggested that you split the file into manageable chunks using the split command. Something like this:
split -l 10000000 source.txt split_source.
This will split the source.txt file into 10-million-line chunks called split_source.aa, split_source.ab, split_source.ac, etc. You can then run the entire command on each of those files (perhaps changing the redirection at the end to append: >> dest.txt).
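A sketch of those per-chunk runs (it reuses the egrep pattern from the question; adjust any flags as needed):
for f in split_source.*; do
    egrep '[a-z0-9\-]+\.com(\.[a-z]{2})?' "$f" | sort | uniq | sed -e 's/www.//' >> dest.txt
done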
The problem here is that you can get duplicates across multiple files, so at the end you may need to run
sort dest.txt | uniq > dest_uniq.txt
Your question is missing information.
That aside, a few thoughts. First, to debug and isolate your problem:
Run egrep <params> | less so you can see what egrep is doing, and eliminate any problem coming from sort, uniq, or sed (my bet is on sort).
How big is your input? Any chance sort is dying from too much input?
Gonna need to see the full command to make further comments.
Second, to improve your script:
You may want to sort | uniq AFTER sed, otherwise you could end up with duplicates in your result set, AND an unsorted result set. Maybe that's what you want.
Consider wrapping your regular expressions with "^...$", if it's appropriate to establish beginning of line (^) and end of line ($) anchors. Otherwise you'll be matching portions in the middle of a line.
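A sketch of the pipeline with the first suggestion applied (it reuses the egrep pattern from the question; the ^ and $ anchors are left out because they depend on the exact line format):
egrep '[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sed -e 's/www.//' | sort -u > dest.txt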

Sorting on the last field of a line

What is the simplest way to sort a list of lines, sorting on the last field of each line? Each line may have a variable number of fields.
Something like
sort -k -1
is what I want, but sort(1) does not take negative numbers to select fields from the end instead of the start.
I'd also like to be able to choose the field delimiter too.
Edit: To add some specificity to the question: The list I want to sort is a list of pathnames. The pathnames may be of arbitrary depth hence the variable number of fields. I want to sort on the filename component.
This additional information may change how one manipulates the line to extract the last field (basename(1) may be used), but does not change sorting requirements.
e.g.
/a/b/c/10-foo
/a/b/c/20-bar
/a/b/c/50-baz
/a/d/30-bob
/a/e/f/g/h/01-do-this-first
/a/e/f/g/h/99-local
I want this list sorted on the filenames, which all start with numbers indicating the order the files should be read.
I've added my answer below which is how I am currently doing it. I had hoped there was a simpler way - maybe a different sort utility - perhaps without needing to manipulate the data.
awk '{print $NF,$0}' file | sort | cut -f2- -d' '
Basically, this command does:
Repeat the last field at the beginning, separated with a whitespace (default OFS)
Sort, resolve the duplicated filenames using the full path ($0) for sorting
Cut the repeated first field, f2- means from the second field to the last
Here's a Perl command line (note that your shell may require you to escape the $s):
perl -e "print sort {(split '/', $a)[-1] <=> (split '/', $b)[-1]} <>"
Just pipe the list into it or, if the list is in a file, put the filename at the end of the command line.
Note that this script does not actually change the data, so you don't have to be careful about what delimiter you use.
Here's sample output:
>perl -e "print sort {(split '/', $a)[-1] <=> (split '/', $b)[-1]} " files.txt
/a/e/f/g/h/01-do-this-first
/a/b/c/10-foo
/a/b/c/20-bar
/a/d/30-bob
/a/b/c/50-baz
/a/e/f/g/h/99-local
Something like this:
awk '{print $NF"|"$0}' file | sort -t"|" -k1 | awk -F"|" '{print $NF }'
A one-liner in perl for reversing the order of the fields in a line:
perl -lne 'print join " ", reverse split / /'
You could use it once, pipe the output to sort, then pipe it back and you'd achieve what you want. You can change / / to / +/ so it squeezes spaces. And you're of course free to use whatever regular expression you want to split the lines.
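For example, the round trip described above might look like this (a sketch; file is a placeholder name, and the fields are assumed to be space-separated):
perl -lne 'print join " ", reverse split / /' file | sort | perl -lne 'print join " ", reverse split / /'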
I think the only solution would be to use awk:
Put the last field to the front using awk.
Sort lines.
Put the first field to the end again.
Replace the last delimiter on the line with another delimiter that does not otherwise appear in the list, sort on the second field using that other delimiter as the sort(1) delimiter, and then revert the delimiter change.
delim=/
new_delim=" "
cat $list \
| sed "s|\(.*\)$delim|\1$new_delim|" \
| sort -t"$new_delim" -k 2,2 \
| sed "s|$new_delim|$delim|"
The problem is knowing what delimiter to use that does not appear in the list. You can make multiple passes over the list and then grep for a succession of potential delimiters, but it's all rather nasty - particularly when the concept of "sort on the last field of a line" is so simply expressed, yet the solution is not.
Edit: One safe delimiter to use for $new_delim is NUL since that cannot appear in filenames, but I don't know how to put a NUL character into a bourne/POSIX shell script (not bash) and whether sort and sed will properly handle it.
#!/usr/bin/ruby
f = ARGF.read
lines = f.lines
broken = lines.map {|l| l.split(/:/) }
sorted = broken.sort {|a, b|
a[-1] <=> b[-1]
}
fixed = sorted.map {|s| s.join(":") }
puts fixed
If all the answers involve perl or awk, might as well solve the whole thing in the scripting language. (Incidentally, I tried in Perl first and quickly remembered that I dislike Perl's lists-of-lists. I'd love to see a Perl guru's version.)
I want this list sorted on the filenames, which all start with numbers
indicating the order the files should be read.
find . | sed 's#.*/##' | sort
The sed strips everything up to and including the last slash from each result; the filenames are what's left, and you sort on that.
Here is a Python one-liner version. Note that it assumes the last field is an integer; you can change that as needed.
cat file.txt | python3 -c 'import sys; list(map(sys.stdout.write, sorted(sys.stdin, key=lambda x: int(x.rsplit(" ", 1)[-1]))))'
| sed "s#(.*)/#\1"\\$'\x7F'\# \
| sort -t\\$'\x7F' -k2,2 \
| sed s\#\\$'\x7F'"#/#"
Still way worse than simple negative field indexes for sort(1) but using the DEL character as delimiter shouldn’t cause any problem in this case.
I also like how symmetrical it is.
sort allows you to specify the delimiter with the -t option, if I remember correctly. To compute the last field index, you can count the number of delimiters in a line and add one. For instance, something like this (assuming the ":" delimiter):
d=`head -1 FILE | tr -cd : | wc -c`
d=`expr $d + 1`
($d now contains the last field index).
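Presumably it would then be used along these lines (a sketch; it assumes every line has the same number of fields, since $d is computed from the first line only):
sort -t: -k"$d","$d" FILE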
