I'm trying to preset zlib's dictionary for compression. As of Python 3.3, the zlib.compressobj function offers that option. The docs say it should be a bytearray or a bytes object, e.g. b"often-found".
Now: how do I pass multiple strings, ordered ascending by their likeliness to occur, as suggested in the docs? Is there a secret delimiter, e.g. b"likely,more-likely,most-likely"?
No, there is no delimiter needed. The dictionary is simply a resource in which to look for strings that match portions of the data to be compressed. Therefore strings that are likely to occur can simply be concatenated, or even overlapped if starts and ends match. For example, if you want the words lighthouse and household to be available, you can just put lighthousehold in the dictionary.
Since it takes more bits to represent matches that are further back, you would put the most likely matches at the end of the dictionary.
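For example, a minimal sketch (the dictionary contents here are made up; the most likely matches go at the end, and the same dictionary must be supplied when decompressing):

import zlib

zdict = b"likelymore-likelymost-likely"  # concatenated, most likely matches last

comp = zlib.compressobj(level=9, zdict=zdict)
data = b"most-likely payload with more-likely content"
compressed = comp.compress(data) + comp.flush()

decomp = zlib.decompressobj(zdict=zdict)
assert decomp.decompress(compressed) == data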
Assuming that I want things to be automatic and default, I can do something like this:
import h5py

with h5py.File('store_str_2.hdf5', 'w') as hf:
    variable_length_str = ['abcd', 'bce', 'cd']
    hf.create_dataset('variable_length_str', data=variable_length_str)
But on the Internet, I can find solutions like:
import numpy as np

with h5py.File('store_str.hdf5', 'w') as hf:
    dt = h5py.special_dtype(vlen=str)
    variable_length_str = np.array(['abcd', 'bce', 'cd'], dtype=dt)
    hf.create_dataset('variable_length_str', data=variable_length_str)
So what is the difference between the two? Why not just use the simple one to store a list of variable-length strings? Will it have consequences, such as taking more space, etc.?
Another question: if I want to save space (with compression), what would be the better way to store a list of strings in HDF5?
Q1: What is the difference between the two?
h5py is designed to use NumPy arrays to hold HDF5 data, so the typical behavior is fixed-length strings ('S10', for example). The dtype you found is the older h5py mechanism to support variable-length strings. The current implementation uses h5py.string_dtype(encoding=, length=), with length=None for variable-length strings. Note: the same old/new API split applies to variable-length (aka 'ragged') arrays, which use the associated vlen dtype.
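For illustration, a minimal sketch using the current API (the file name here is arbitrary):

import h5py

with h5py.File('store_str_new.hdf5', 'w') as hf:
    dt = h5py.string_dtype(encoding='utf-8', length=None)  # variable-length strings
    hf.create_dataset('variable_length_str', data=['abcd', 'bce', 'cd'], dtype=dt)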
Q2: Why don't just use the simple one to store list of variable-length strings?
Q3: Will it cause some consequences like it will take more spaces, etc?
You can use the simple string dtype, but all saved strings will be the same length. You will have to allocate enough space for the longest string you want to save; shorter strings will be padded.
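For comparison, a sketch of the fixed-length alternative (the 'S10' size is arbitrary):

import numpy as np
import h5py

with h5py.File('store_str_fixed.hdf5', 'w') as hf:
    # Every element occupies 10 bytes, so short strings waste space
    # and anything longer than 10 bytes would not fit.
    data = np.array([b'abcd', b'bce', b'cd'], dtype='S10')
    hf.create_dataset('fixed_length_str', data=data)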
For details, see h5py documentation here:
h5py: Variable-length strings
Note, the API was updated in h5py 2.10. The older API is documented here: h5py: Older vlength API
I am reading about LCP arrays and their use, in conjunction with suffix arrays, in solving the "Longest common substring" problem. This video states that the sentinels used to separate individual strings must be unique, and not be contained in any of the strings themselves.
Unless I am mistaken, the reason for this is so when we construct the LCP array (by comparing how many characters adjacent suffixes have in common) we don't count the sentinel value in the case where two sentinels happen to be at the same index in both the suffixes we are comparing.
This means we can write code like this:
for each index i in the shortest suffix
    if suffix_1[i] == suffix_2[i]
        increment count of common characters
However, in order to facilitate this, we need to jump through some hoops to ensure we use unique sentinels, which I asked about here.
However, would a simpler (to implement) solution not be to simply count the number of characters in common, stopping when we reach the (single, unique) sentinel character, like this:
set sentinel = '#'
for each index i in the shortest suffix
    if suffix_1[i] == suffix_2[i]
        if suffix_1[i] != sentinel
            increment count of common characters
        else
            return
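In Python, the same idea might look like this (a minimal sketch, assuming the sentinel '#' never appears in either input string):

def common_prefix_length(suffix_1, suffix_2, sentinel='#'):
    # Count matching characters, stopping at the first mismatch or at the sentinel.
    count = 0
    for a, b in zip(suffix_1, suffix_2):  # zip stops at the end of the shorter suffix
        if a != b or a == sentinel:
            break
        count += 1
    return count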
Or, am I missing something fundamental here?
Actually I just devised an algorithm that doesn't use sentinels at all: https://github.com/BurntSushi/suffix/issues/14
When concatenating the strings, also record the boundary indexes (e.g. for 3 strings of lengths 4, 2 and 5, the boundaries 4, 6, and 11 will be recorded, so we know that concatenated_string[5] belongs to the second original string because 4 <= 5 < 6).
Then, to identify which original string every suffix belongs to, just do a binary search.
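A sketch of that lookup in Python, using the example boundaries above:

import bisect

def owner_of_suffix(start_index, boundaries):
    # boundaries are the running end indexes of the concatenated strings,
    # e.g. [4, 6, 11] for three strings of lengths 4, 2 and 5.
    return bisect.bisect_right(boundaries, start_index)

print(owner_of_suffix(5, [4, 6, 11]))  # -> 1, i.e. the second original string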
The short version is "this is mostly an artifact of how suffix array construction algorithms work and has nothing to do with LCP calculations, so provided your suffix array building algorithm doesn't need those sentinels, you can safely skip them."
The longer answer:
At a high level, the basic algorithm described in the video goes like this:
Construct a generalized suffix array for the strings T1 and T2.
Construct an LCP array for that resulting suffix array.
Iterate across the LCP array, looking for adjacent pairs of suffixes that come from different strings.
Find the largest LCP between any two such strings; call it k.
Extract the first k characters from either of the two suffixes.
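For concreteness, here is a naive Python sketch of those steps. It sorts the suffixes directly instead of using a linear-time SACA, and it uses a single internal separator that is assumed not to occur in either string:

def longest_common_substring(t1, t2):
    s = t1 + "\x01" + t2      # one shared separator is enough here, since the sort is naive
    boundary = len(t1)        # suffixes starting before this index belong to t1
    suffixes = sorted(range(len(s)), key=lambda i: s[i:])   # step 1: generalized suffix array

    def lcp(i, j):            # step 2: LCP, computed on demand
        n = 0
        while i + n < len(s) and j + n < len(s) and s[i + n] == s[j + n] and s[i + n] != "\x01":
            n += 1
        return n

    best = ""
    for a, b in zip(suffixes, suffixes[1:]):          # step 3: adjacent pairs...
        if (a < boundary) != (b < boundary):          # ...that come from different strings
            k = lcp(a, b)                             # step 4: largest such LCP
            if k > len(best):
                best = s[a:a + k]                     # step 5: extract the first k characters
    return best

print(longest_common_substring("lighthouse", "household"))  # -> "house"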
So, where do sentinels appear in here? They mostly come up in steps (1) and (2). The video alludes to using a linear-time suffix array construction algorithm (SACA). Most fast SACAs for generating suffix arrays for two or more strings assume, as part of their operation, that there are distinct endmarkers at the ends of those strings, and often the internal correctness of the algorithm relies on this. So in that sense, the endmarkers might need to get added in purely to use a fast SACA, completely independent of any later use you might have.
(Why do SACAs need this? Some of the fastest SACAs, such as the SA-IS algorithm, assume the last character of the string is unique, lexicographically precedes everything, and doesn't appear anywhere else. In order to use that algorithm with multiple strings, you need some sort of internal delimiter to mark where one string ends and another starts. That character needs to act as a strong "and we're now done with the first string" character, which is why it needs to lexicographically precede all the other characters.)
Assuming you're using a SACA as a black box this way, from this point forward, those sentinels are completely unnecessary. They aren't used to tell which suffix comes from which string (this should be provided by the SACA), and they can't be a part of the overlap between adjacent strings.
So in that sense, you can think of these sentinels as an implementation detail needed to use a fast SACA, which you'd need to do in order to get the fast runtime.
I was recently asked a question in an interview. How will you find the top 10 longest strings in a list of a billion strings?
My answer was that we need to write a Comparator that compares the lengths of 2 strings and then use the TreeSet(Comparator) constructor.
Once you start adding the strings to the TreeSet, it will keep them sorted according to the comparator.
Then just pop the top 10 elements of the TreeSet.
The interviewer wasn't happy with that. The argument was that, to hold a billion strings, I would have to use a supercomputer.
Is there any other data structure that can deal with this kind of data?
Given what you stated about the interviewer saying you would need a super computer, I am going to assume that the strings would come in a stream one string at a time.
Given the immense size due to no knowledge of how large the individual strings are (they could be whole books), I would read them in one at a time from the stream. I would then compare the current string to an ordered list of the top ten longest strings found before it and place it accordingly in the ordered list. I would then remove the smallest length one from the list and proceed to read the next string. That would mean only 11 strings were being stored at one time, the current top 10 and the one currently being processed.
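A minimal Python sketch of that process, keeping an ordered list of at most ten (length, string) pairs:

import bisect

def top_10_longest(stream):
    top = []  # kept sorted ascending by length, as (length, string) pairs
    for s in stream:
        bisect.insort(top, (len(s), s))
        if len(top) > 10:
            top.pop(0)  # drop the current shortest, so at most 11 strings are ever held
    return [s for _, s in reversed(top)]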
Most languages have a built-in sort that is pretty speedy.
stringList.sort(key=len, reverse=True)
in Python would work. Then just grab the first 10 elements.
Also, your interviewer sounds behind the times. One billion strings is pretty small nowadays.
I remember studying a similar data structure for such scenarios, called a Trie.
The height of the tree will always give the longest string.
A special kind of trie, called a suffix tree, can be used to index all suffixes in a text in order to carry out fast full text searches.
The point is you do not need to STORE all strings.
Let's consider a simplified version: find the longest 2 strings (assuming no ties).
You can always use an online algorithm with 2 variables s1 & s2, where s1 is the longest string you have encountered so far and s2 is the second longest.
Then you read the strings one by one in O(N), replacing s1 or s2 whenever possible. This uses O(2N) = O(N).
For the top 10 strings, it is as simple as the top 2 case. You can still do it in O(10N) = O(N) and store only 10 strings.
There is a faster way, described below, but for a small constant like 2 or 10 you may not need it.
For top-K strings in general, you can use a structure like std::set in C++ (with longer strings having higher priority) to store the top-K strings; when a new string comes, you simply insert it and remove the shortest one, both in O(lg K). So in total you can do it in O(N lg K) with O(K) space.
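A sketch of the same O(N lg K) idea in Python, with a min-heap keyed by length standing in for the C++ set:

import heapq

def top_k_longest(stream, k=10):
    heap = []  # min-heap of (length, string); the shortest kept string sits at heap[0]
    for s in stream:
        if len(heap) < k:
            heapq.heappush(heap, (len(s), s))
        elif len(s) > heap[0][0]:
            heapq.heapreplace(heap, (len(s), s))  # O(lg K) insert-and-evict
    return [s for _, s in sorted(heap, reverse=True)]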
I am implementing a different string representation where accessing a string in a non-sequential manner is very costly. To avoid this I try to implement certain position caches or character blocks so one can jump to certain locations and scan from there.
In order to do so, I need a list of algorithms where scanning a string from right to left or random access of its characters is required, so I have a set of test cases to do some actual benchmarking and to create a model I can use to find a local/global optimum for my efforts.
Basically I know of:
String.charAt
String.lastIndexOf
String.endsWith
One scenario where one needs right to left access of strings is extracting the file extension and the file name (item) of paths.
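For instance, a small sketch of that right-to-left scan (assuming '/' as the path separator):

def name_and_extension(path):
    slash = path.rfind('/')   # scan from the right for the last separator
    name = path[slash + 1:]
    dot = name.rfind('.')     # then from the right for the last dot
    ext = name[dot + 1:] if dot > 0 else ''
    return name, ext

print(name_and_extension('/usr/share/doc/readme.txt'))  # -> ('readme.txt', 'txt')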
For random access I find no algorithm at all, unless one has prefix tables and accesses the string more randomly, checking all those positions for strings longer than the prefix.
Does anyone know other algorithms where either right-to-left or random access of string characters is required?
[Update]
The hash code of a String is calculated using every character, accessed from left to right, with the running value stored in a local primitive variable. So this is not a case of random access.
The MD5 and CRC algorithms likewise process the complete string, so I do not find any random-access examples at all.
One interesting algorithm is Boyer-Moore searching, which involves both skipping forward by a variable number of characters and comparing backwards. If those two operations are not O(1), then KMP searching becomes more attractive, but BM searching is much faster for long search patterns (except in rare cases where the search pattern contains lots of repetitions of its own prefix). For example, BM shines for patterns which must be matched at word-boundaries.
BM can be implemented for certain variable-length encodings. In particular, it works fine with UTF-8 because misaligned false positives are impossible. With a larger class of variable-length encodings, you might still be able to implement a variant of BM which allows forward skips.
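As an illustration of that access pattern (a Boyer-Moore-Horspool sketch rather than full Boyer-Moore): the pattern is compared right to left at each alignment, and the alignment itself skips forward by a variable amount.

def horspool_search(text, pattern):
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1 if m > n else 0
    # Bad-character table: skip distance for the text character aligned
    # with the last pattern position.
    skip = {pattern[i]: m - 1 - i for i in range(m - 1)}
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:  # right-to-left comparison
            j -= 1
        if j < 0:
            return i
        i += skip.get(text[i + m - 1], m)            # variable forward skip
    return -1

print(horspool_search("here is a simple example", "example"))  # -> 17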
There are a number of algorithms which require the ability to reset the string pointer to a previously encountered point; one example is word-wrapping an input to a specific line length. Those won't be impeded by your encoding provided your API allows for saving a copy of an iterator.
Say I have millions of lines of unique strings spread across hundreds of text files (the "dataset"). Now I want to check to see if any of those text files contain any of 2 million unique strings that are listed in another text file ("tofind"). What would be the most efficient way to go about this? Some extra application-specific info:
must be case sensitive
the string to find would match the found string in full (ie, it is NOT a substring)
each text file in the "dataset" contains approx 700K lines and is 50MB, though some can be several hundred MB.
again, the strings in both the "dataset" and "tofind" are unique. Indexing won't help.
There is no need to be able to search live (ie, as someone starts typing). I just want to output any matches to a text file with the match and the file it was found in.
I have 32GB of RAM and an i7 3930K
My options include using simple command line/batch "findstr", etc, or possibly writing a search program in vbscript or c# (Java or Python if necessary, but I'm not as familiar with them). What would be the most efficient solution for this particular application?
If you have enough memory to load all the strings from tofind into memory, then you could create a map from string length to the set of strings of that length. Load all your strings from tofind into this structure, keyed by their length: a string of 5 characters would be stored in the set under key 5, a string of 10 characters under key 10 (you can refine this even further by also grouping on the first character, but I won't describe that here, as I want to share the idea in the simplest possible way).
Then you can load the other strings and search for their occurrences. A string of length 10 would be looked up in the set stored under key 10, for instance.
If the size of the data set is just too large, then you can do the same by loading a batch of strings at a time and then purge the structure and rebuild it with the next batch.
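A sketch of that structure in Python (the file paths here are hypothetical):

from collections import defaultdict

def build_buckets(tofind_path):
    # Group the "tofind" strings by length, as described above.
    buckets = defaultdict(set)
    with open(tofind_path, encoding="utf-8") as f:
        for line in f:
            s = line.rstrip("\n")
            buckets[len(s)].add(s)
    return buckets

def find_matches(dataset_path, buckets):
    # Check each dataset line only against the bucket for its own length.
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            s = line.rstrip("\n")
            if s in buckets.get(len(s), ()):
                yield s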
Since you do not have to do this in real time, it gives you a lot of latitude in designing a search process. I have not thought this through very carefully, but it seems to me that you could do this in a couple of steps:
Step 1
Eliminate those strings from the dataset that you know do not match any of the strings in the tofind string list. A Bloom Filter is a very effective way to accomplish this. It has a zero false negative rate, that is, if there isn't a hit on the Bloom Filter then none of the strings match and the string can be eliminated.
The strings that hit on the Bloom Filter then need to be verified to ensure you did not get a false positive. Bloom Filters are prone to false positives. However, if you pay close attention to selecting good hashing functions and allocate a large enough filter, the false positive rate can be quite low.
For each string where there is a hit on the Bloom Filter, save that string and the position in the string where the hit was made. This information is passed to Step 2.
Step 2
Verify strings that hit on the Bloom Filter. Now you need to verify them against the tofind strings using an efficient exact string matching function. A Trie seems like a good candidate for this function. Load the tofind strings into the Trie and then search it starting at the position found by the Bloom Filter. At this point you will either have a hit, in which case a match was found, or a miss, in which case the Bloom Filter reported a false positive.
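A rough sketch of the filter-then-verify idea in Python, simplified to the whole-line matches the question asks for; the Bloom Filter sizes and file names are made up, and a plain set stands in for the Trie in the verification step:

import hashlib

class BloomFilter:
    # Minimal Bloom filter; the default size and hash count are illustrative only.
    def __init__(self, size_bits=10_000_000, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (i, item)).encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Step 1: build the filter from the tofind strings.
with open("tofind.txt", encoding="utf-8") as f:
    tofind = [line.rstrip("\n") for line in f]
bloom = BloomFilter()
for s in tofind:
    bloom.add(s)

# Step 2: exact verification (a set here; a Trie would serve the same purpose).
exact = set(tofind)
with open("dataset_part1.txt", encoding="utf-8") as f, open("matches.txt", "w") as out:
    for line in f:
        s = line.rstrip("\n")
        if bloom.might_contain(s) and s in exact:  # filter first, then verify
            out.write(s + "\t" + "dataset_part1.txt" + "\n")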
Note: This process assumes that Step 1 can eliminate a significant number of strings from the dataset. If you expect that most dataset strings will contain a match in tofind, then it might not be worth the effort.