Check file's whole raw text content exists as part of another file - linux

big1.txt:
a
b
c
d
e
big2.txt:
f
c
g
h
i
b
small.txt:
b
c
Within a bash script, how can I tell that the whole ordered content of small.txt exists in another file?
Example:
??? small.txt big1.txt should return true
??? small.txt big2.txt should return false

If big1.txt and big2.txt are not too big (i.e., they can be loaded into memory), the following test may be sufficient.
# to store file content into variables
big1=$(< big1.txt)
big2=$(< big2.txt)
small=$(< small.txt)
# or, to run the test case without reading the files
big1=$'a\nb\nc\nd\ne\n'
big2=$'f\nc\ng\nh\ni\nb\n'
small=$'b\nc\n'
if [[ ${big1} = *${small}* ]]; then echo "big1"; fi
if [[ ${big2} = *${small}* ]]; then echo "big2"; fi
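One subtlety worth noting: $(< file) strips the trailing newline, so the bare pattern can also match across partial lines (e.g. a line ending in b followed by a line starting with c). A minimal refinement of the same approach that anchors matches on line boundaries (a hedged sketch, not a drop-in replacement):
# re-add newlines on both sides so only whole lines can match;
# "${small}" is quoted so any glob characters in the file stay literal
big1=$(< big1.txt)
small=$(< small.txt)
if [[ $'\n'${big1}$'\n' == *$'\n'"${small}"$'\n'* ]]; then echo "big1"; fi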

Sometimes the way to discover that two complicated things are 'equal' is to run a cheap test that is always true when they are equal and only rarely true when they are not. Candidates that pass this heuristic test are then checked more carefully, so the full equality test can be expensive and yet not be triggered on every compare.
What I would do in this circumstance is take all the files and sort their lines. (You might want to suppress blank lines if you are looking for matching text, and strip trailing blanks from lines, but that's your choice.) It is probably also useful to delete duplicate lines.
Now compare each file to all longer files to see if it is a prefix. (It can't be a prefix if the other file is shorter, so we eliminate half the compares based on size alone.) If sorted file A is a prefix of sorted file B, then you can run the more complicated test to see whether the real file A is embedded in file B (which I suspect will be true with high probability if the sorted files pass the prefix test).
Having had this idea, we can now optimize it. Instead of storing lines of text, we take each file and hash each line, giving a file of hash codes. Sort these, then follow the rest of the procedure.
Next trick: make the hash codes 8 or 16 bits in size, so each one fits into a character of your favorite programming language. Now the prefix-compare test consists of collecting the character-sized hash codes per file and doing a string compare of the shorter ones against the longer ones. At this point we've moved the problem from reading the disk to comparing efficiently in memory; we probably can't speed it up much beyond that, because disk reads are very expensive compared to in-memory computations.
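As a concrete cheap filter along these lines: note that the sorted small.txt (b, c) is not literally a prefix of the sorted big1.txt (a, b, c, d, e), so in practice the cheap test is a subset check rather than a strict prefix compare. A minimal sketch with sort and comm (standard tools, file names from the question):
# cheap filter: every distinct line of small.txt must also occur in big1.txt
sort -u small.txt > small.sorted
sort -u big1.txt > big1.sorted
if comm -23 small.sorted big1.sorted | grep -q .; then
    echo "cheap test failed: small.txt cannot be embedded in big1.txt"
else
    echo "cheap test passed: worth running the expensive ordered check"
fi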

$ diff small.txt big1.txt | grep -q '^<'
$ echo $?
1
$ diff small.txt big2.txt | grep -q '^<'
$ echo $?
0
$ ! (diff small.txt big1.txt | grep -q '^<')
$ echo $?
0
$ ! (diff small.txt big2.txt | grep -q '^<')
$ echo $?
1
$ if diff small.txt big1.txt | grep -q '^<'; then echo "does not exist"; else echo "does exist"; fi
does exist
$ if diff small.txt big2.txt | grep -q '^<'; then echo "does not exist"; else echo "does exist"; fi
does not exist
Note that diff emits '<' lines only for lines of small.txt with no in-order match in the big file, so this test accepts an in-order subsequence: it still reports a match when other lines are interleaved between small.txt's lines in the big file.

Or try this Perl one-liner:
if perl -0777 -e '$n = <>; $h = <>; exit(index($h,$n)<0)' small.txt big1.txt
then echo small.txt is found in big1.txt
fi
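A note on how that works: -0777 puts perl into slurp mode, so the first <> reads all of small.txt into $n and the second reads all of big1.txt into $h; index returns -1 when the needle is absent, so the exit status is 0 (success) exactly when the whole content is found. Like the bash pattern match above, a raw index can match across partial lines; a hedged variant that anchors the match on a line boundary (assuming both files end with a newline):
if perl -0777 -e '$n = <>; $h = <>; exit(index("\n".$h, "\n".$n) < 0)' small.txt big1.txt
then echo small.txt is found in big1.txt
fi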

Related

linux - simplify semantic versioning script

I have semantic versioning for an app component, and this is how I update the major version number and then store it back in version.txt. This seems like a lot of lines for a simple operation. Could someone please help me trim this down? I can't use bc, as the Docker python image I'm on doesn't seem to have that command.
This is extracted from a yml file, and version.txt only contains a major and minor number (1.3, for example). The code below updates only the major number (1) and resets the minor number. So if I ran the code on 1.3, I would get 2.
- echo $(<version.txt) 1 | awk '{print $1 + $2}' > version.txt
VERSION=$(<version.txt)
VERSION=${VERSION%.*}
echo $VERSION > version.txt
echo "New version = $(<version.txt)"
About Simplicity
"Simple" and "short" are not the same thing. echo $foo is shorter than echo "$foo", but it actually does far more things: It splits the value of foo apart on characters in IFS, evaluates each result of that split as a glob expression, and then recombines them.
Similarly, making your code simpler -- as in, limiting the number of steps in the process it goes through -- is not at all the same thing as making it shorter.
Incrementing One Piece, Leaving Others Unmodified
if IFS=. read -r major rest <version.txt || [ -n "$major" ]; then
    echo "$((major + 1)).$rest" >"version.txt.$$" && mv "version.txt.$$" version.txt
else
    echo "ERROR: Unable to read version number from version.txt" >&2
    exit 1
fi
Incrementing Major Version, Discarding Others
if IFS=. read -r major rest <version.txt || [ -n "$major" ]; then
    echo "$((major + 1))" >"version.txt.$$" && mv "version.txt.$$" "version.txt"
else
    echo "ERROR: Unable to read version number from version.txt" >&2
    exit 1
fi
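For example, starting from the question's 1.3, a run of the second snippet (saved here as a hypothetical bump-major.sh) behaves as asked:
$ echo 1.3 > version.txt
$ sh bump-major.sh
$ cat version.txt
2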
Rationale
Both of the above are POSIX-compliant, and avoid relying on any capabilities not built into the shell.
IFS=. read -r first second third <input reads the first line of input and splits it on .s into the shell variables first, second and third; notably, the third variable in this example receives everything after the first two fields, so if you had a.b.c.d.e.f, you would get first=a; second=b; third=c.d.e.f -- hence the name rest to make this clear. See BashFAQ #1 for a detailed explanation.
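A quick demonstration of that splitting behavior (a here-document stands in for the input file):
IFS=. read -r first second third <<EOF
a.b.c.d.e.f
EOF
echo "$first / $second / $third"    # prints: a / b / c.d.e.f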
$(( ... )) creates an arithmetic context in all POSIX-compliant shells. It's only useful for integer math, but since we split the pieces out with the read, we only need integer math. See http://wiki.bash-hackers.org/syntax/arith_expr
Writing to version.txt.$$ and renaming if that write is successful prevents version.txt from being left empty or corrupt if a failure takes place between the open and the write. (A version that was worried about symlink attacks would use mktemp, instead of relying on $$ to generate a unique tempfile name).
Proceeding through to the write only if the read succeeds or [ -n "$major" ] is true prevents the code from resetting the version to 1 (by adding 1 to an empty string, which evaluates in an arithmetic context as 0) if the read fails.

Finding a line that shows in a file only once

Assume I have a file with 100 lines. Many of the lines repeat themselves in the file, and only one line does not.
I want to find the line that shows up only once. Is there a command for that, or do I have to build a complicated loop like the one below?
My code so far:
#!/bin/bash
filename="repeat_lines.txt"
var="$(wc -l < "$filename")"
echo "length: $var"
for ((index=1; index <= var; index++)); do
    one="$(head -n "$index" "$filename" | tail -1)"
    counter=0
    for ((index2=1; index2 <= var; index2++)); do
        two="$(head -n "$index2" "$filename" | tail -1)"
        if [ "$one" == "$two" ]; then
            counter=$((counter+1))
        fi
    done
    echo "$one is $counter times in the text"
done
If I understood your question correctly, then
sort repeat_lines.txt | uniq -u should do the trick.
e.g. for a file containing:
a
b
a
c
b
it will output c.
For further reference, see the sort and uniq manpages.
You've got a reasonable answer that uses standard shell tools sort and uniq. That's probably the solution you want to use, if you want something that is portable and doesn't require bash.
But an alternative would be to use functionality built into your bash shell. One method might be to use an associative array, which is a feature of bash 4 and above.
$ cat file.txt
a
b
c
a
b
$ declare -A lines
$ while read -r x; do ((lines[$x]++)); done < file.txt
$ for x in "${!lines[@]}"; do [[ ${lines["$x"]} -gt 1 ]] && unset lines["$x"]; done
$ declare -p lines
declare -A lines='([c]="1" )'
What we're doing here is:
declare -A creates the associative array. This is the bash 4 feature I mentioned.
The while loop reads each line of the file, and increments a counter that uses the content of a line of the file as the key in the associative array.
The for loop steps through the array, deleting any element whose counter is greater than 1.
declare -p prints the details of an array in a predictable, re-usable format. You could alternately use another for loop to step through the remaining array elements (of which there might be only one) in order to do something with them.
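For instance, that final loop might look like this (a minimal sketch that just prints whatever unique lines remain):
$ for x in "${!lines[@]}"; do printf '%s\n' "$x"; done
c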
Note that this solution, while fine for small files (say, up to a few thousand lines), may not scale well for very large files of, say, millions of lines. Bash isn't the fastest at reading input this way, and one must be cognizant of memory limits when using arrays.
The sort alternative has the benefit of memory optimization using files on disk for extremely large files, at the expense of speed.
If you're dealing with files of only a few hundred lines, then it's hard to predict which solution will be faster. In the end, the form of output may dictate your choice of solution. The sort | uniq pipe generates a list to standard output. The bash solution above generates the same list as keys in an array. Otherwise, they are functionally equivalent.

split file with output file with numeric suffix but without begin zero

Suppose I have a file temp.txt with 100 lines. I would like to split into 10 parts.
I use the following command:
split -a 1 -l 10 -d temp.txt temp_
But I got temp_0, temp_1, temp_2, ..., temp_9. I want output like this: temp_1, temp_2, ..., temp_10.
From man split
I got
-d, --numeric-suffixes
use numeric suffixes instead of alphabetic
I tried to use
split -l 10 --suffix-length=1 --numeric-suffixes=1 Temp.txt temp_
It says split: option '--numeric-suffixes' doesn't allow an argument
Then, I tried to use
split -l 10 --suffix-length=1 --numeric-suffixes 1 Temp.txt temp_
It says
split: extra operand 'temp_'
The output of split --version is
split (GNU coreutils) 8.4
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Torbjörn Granlund and Richard M. Stallman.
I tried to use split -a 1 -l 10 -d 1 Temp.txt temp_. But it shows the error split: extra operand 'temp_'.
-d doesn't take an argument. It should be written as you originally tried:
split -a 1 -l 10 -d Temp.txt temp_
But, forgetting the syntax variations for a moment:
you're asking it to split a 100-line file into 10 parts, with a suffix length of 1, starting at 1.
This scenario is erroneous, as it asks the command to process 100 lines while giving it fixed parameters that leave only the suffixes 1 through 9, i.e. at most 9 files of 10 lines (90 lines).
If you're willing to extend your allowable suffix length to 2, then you will at least get uniform two-digit temp files starting at 01:
split -a 2 -l 10 --numeric-suffixes=1 Temp.txt temp_
Will create: temp_01 thru temp_10
You can actually omit the -a and -d arguments altogether:
split -l 10 --numeric-suffixes=1 Temp.txt temp_
Will also create: temp_01 thru temp_10
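A quick way to sanity-check this (assuming a GNU coreutils recent enough to accept --numeric-suffixes=1, which the 8.4 build above is not):
$ seq 100 > Temp.txt
$ split -l 10 --numeric-suffixes=1 Temp.txt temp_
$ ls temp_*
temp_01  temp_02  temp_03  temp_04  temp_05  temp_06  temp_07  temp_08  temp_09  temp_10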
If for some reason this was a fixed and absolute requirement or a permanent solution (i.e. integrating with something else you have no control over), and it was always going to be an exactly 100-line file, then you could always do it in two passes:
head -n 90 Temp.txt | split -a 1 -l 10 --numeric-suffixes=1 - temp_
tail -n 10 Temp.txt | split -a 2 -l 10 --numeric-suffixes=10 - temp_
Then you would get temp_1 thru temp_10
Just to throw out a possible alternative, you can accomplish this task manually by running a couple of loops. The outer loop iterates over the file chunks and the inner loop iterates over the lines within the chunk.
{
    suf=1
    read -r; rc=$?
    while [[ $rc -eq 0 || -n "$REPLY" ]]; do
        line=0
        while [[ ($rc -eq 0 || -n "$REPLY") && line -lt 10 ]]; do
            printf '%s\n' "$REPLY"
            read -r; rc=$?
            let ++line
        done >temp_$suf
        let ++suf
    done
} <temp.txt
Notes:
The test $rc -eq 0 || -n "$REPLY" is necessary to continue processing if either we've not yet reached end-of-file (in which case $rc -eq 0 is true) or we have reached end-of-file but there was a non-empty final line in the input file (in which case -n "$REPLY" is true). It's good to try to support the case of a non-empty final line with no end-of-line delimiter, which sometimes happens. In this case read will return a failing status code but will still correctly set $REPLY to contain the non-empty final line content. I've tested the split utility and it correctly handles this case as well.
By calling read once prior to the outer loop and then once after each print, we ensure that we always test if the read was successful prior to printing the resulting line. A more naïve design might read and print in immediate succession with no check in between, which would be incorrect.
I've used the -r option of read to prevent backslash interpolation, which you probably don't want; I assume you want to preserve the contents of temp.txt verbatim.
Obviously there are tradeoffs in this solution. On the one hand, it demands a fair amount of complexity and code verbosity (13 lines the way I've written it). But the advantage is complete control over the behavior of the split operation; you can customize the script to your liking, such as dynamically changing the suffix based on the line number, using a prefix or infix or combination thereof, or even taking into account the contents of the individual file lines in $REPLY.

Find files not in numerical list

I have a giant list of files that are all currently numbered in sequential order with different file extensions.
3400.PDF
3401.xls
3402.doc
There are roughly 1400 of these files in a directory. What I would like to know is how to find numbers that do not exist in the sequence.
I've tried to write a bash script for this but my bash-fu is weak.
I can get a list of the files without their extensions by using
FILES=$(ls -1 | sed -e 's/\..*$//')
but a few places I've seen say to not use ls in this manner.
(15 days after asking, I couldn't relocate where I read this, if it existed at all...)
I can also get the first file via ls | head -n 1, but I'm pretty sure I'm making this a whole lot more complicated than I need to.
Sounds like you want to do something like this:
shopt -s nullglob
for i in {1..1400}; do
    files=($i.*)
    (( ${#files[@]} > 0 )) || echo "no files beginning with $i"
done
This uses a glob to make an array of all files matching 1.*, 2.*, etc. It then compares the length of the array to 0. If there are no files matching the pattern, the message is printed.
Enabling nullglob is important: otherwise, when no files match, the array will contain one element, the literal pattern '1.*'.
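For example, with a gap at 3402 (file names from the question):
$ touch 3400.PDF 3401.xls 3403.doc
$ shopt -s nullglob
$ for i in {3400..3403}; do files=($i.*); (( ${#files[@]} > 0 )) || echo "no files beginning with $i"; done
no files beginning with 3402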
Based on a deleted answer that was largely correct:
for i in $(seq 1 1400); do ls "$i".* > /dev/null 2>&1 || echo "$i"; done
ls [0-9]* \
| awk -F. '!seen[$1]++ { ++N }
           END { for (n=1; N; ++n) if (!seen[n]) print n; else --N }'
This will stop once it has filled the last gap; substitute N > 0 || n < 3000 as the loop condition to go at least that far.
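A quick check of how the counting works, with hypothetical file names: each distinct numeric prefix increments N, and the END loop counts up from 1, printing missing numbers and decrementing N for each seen one, so it stops after the last gap:
$ touch 1.a 2.b 4.c
$ ls [0-9]* | awk -F. '!seen[$1]++ { ++N }
      END { for (n=1; N; ++n) if (!seen[n]) print n; else --N }'
3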

Why do seemingly empty files and strings produce md5sums?

Consider the following:
% md5sum /dev/null
d41d8cd98f00b204e9800998ecf8427e /dev/null
% touch empty; md5sum empty
d41d8cd98f00b204e9800998ecf8427e empty
% echo '' | md5sum
68b329da9893e34099c7d8ad5cb9c940 -
% perl -e 'print chr(0)' | md5sum
93b885adfe0da089cdf634904fd59f71 -
% md5sum ''
md5sum: : No such file or directory
First of all, I'm surprised by the output of all these commands. If anything, I would expect the sum to be the same for all of them.
The md5sum of "nothing" (a zero-length stream of characters) is d41d8cd98f00b204e9800998ecf8427e, which you're seeing in your first two examples.
The third and fourth examples are processing a single character. In the "echo" case, it's a newline, i.e.
$ echo -ne '\n' | md5sum
68b329da9893e34099c7d8ad5cb9c940 -
In the perl example, it's a single byte with value 0x00, i.e.
$ echo -ne '\x00' | md5sum
93b885adfe0da089cdf634904fd59f71 -
You can reproduce the empty checksum using "echo" as follows:
$ echo -n '' | md5sum
d41d8cd98f00b204e9800998ecf8427e -
...and using Perl as follows:
$ perl -e 'print ""' | md5sum
d41d8cd98f00b204e9800998ecf8427e -
In all four cases, you should expect the same output from checksumming the same data, but different data should produce a wildly different checksum (that's the whole point -- even if it's only a single character that differs.)
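For example (the first digest is a well-known test vector from RFC 1321):
$ echo -n 'a' | md5sum
0cc175b9c0f1b6a831c399e269772661  -
$ echo -n 'b' | md5sum
92eb5ffee6ae2fec3ad71c777531578f  -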
Why do seemingly empty files and strings produce md5sums?
Because the "sum" in the md5sum is somewhat misleading. It's not like e.g. CRC32 checksum, that is zero for the empty file.
MD5 is one of message digest algorithms. You can imagine it as a box that produces fixed-length random-looking value (hash) depending on its internal state. You change the internal state by feeding in the data.
And that box internal state is predefined, such that that it yields randomly looking hash value even before any data is fed in. For MD5, it happens to be d41d8cd98f00b204e9800998ecf8427e.
No need for surprise. The first two produce true empty inputs to md5sum. The echo produces a newline (echo -n '' should produce an empty output; I don't have a linux machine here to check). The perl produces a single zero byte (not to be confused with C where a zero byte marks end of string). The last command is looking for a file with the empty string as its file name.
