Implementations for Pattern/String mining using Suffix Arrays/Trees

Implementations for Pattern/String mining using Suffix Arrays/Trees - string

I am trying to solve a pattern mining problem for strings and I think that suffix trees or arrays might be a good option to solve this problem.
I will quickly outline the problem:
I have a set strings of different lengths (quotation are just to mark repetitions for the explanation):
C"BECB"ECCECCECEEB"BECB"FCCECCECCECCECCFCBFCBFCC
DCBBDCDDCCECBCEC"BECB""BECB"BECB"BECB"BECB"EDB"BECB""BECB"BEC
etc.
I now would like to find repeated patterns within each string and repeated patterns that are common between the strings. A repeated pattern in string (1) would be "BECB". Moreover the pattern "BECB" is also repeated in string (2). Of course there are several criteria that need to be decided on as for example the minimum number of repetitions or the minimum number of symbols in a pattern.
From the book by Dan Gusfield "Algorithms on Strings, Trees and Sequences" I know that it is possible to find such repeats (e.g. maximal pairs, maximal repetitive structures etc.) using certain algorithms and a suffix tree data structure. This comes in handy as I would like to use probabilistic suffix trees to also calculate some predictions on these sequences. (But this is not the topic of this post.)
Unfortunately I can't find any implementations of these algorithms. Hence, I am wondering if I am even on the right path using suffix trees to solve the mentioned problem. It seems very strange to me that for such a well described problem no packages are available (in R or Python for example).
Are there any alternatives (with existing packages) that solve my problem?
Or do you know any implementation of algorithms for suffix trees?

Here is an implementation in C++ that follows the approach of Dan Gusfield's book : https://cp-algorithms.com/string/suffix-tree-ukkonen.html
You could rewrite it in python. Such algorithms are quite specialised for high performance applications, so it's quite normal that they don't appear in any standard library; nonetheless they are still well known so you can usually find good implementations on the net.

Both suffix trees and suffix arrays are good data structures to help in solving the kinds of problems you want to solve.
Building a (multi-string) suffix tree in python is probably not a good idea -- it involves a lot of operations on individual characters and the resulting data structure consumes a lot of memory unless you spend a lot of code avoiding that.
Building a suffix array in python is more approachable, and the resulting data structure (probably just an array of integers) is reasonably compact.
It doesn't seem too difficult to find suffix array code in python on the web, and there's a nice explanation here: https://louisabraham.github.io/articles/suffix-arrays
It would be more difficult to find one that already supports multiple strings, so you would have to decide how you want to do that. In any case, the prefix doubling algorithm is easy to implement if you leverage the standard built-in sort(), and in python that will probably produce the fastest result.

Related

Regular expression-like languages to search and replace in tree-like structures or even graphs

First such example is XSLT. Second example may be a hypothetical language that is basically well-known regular expression language, but additionally has a special construct that matches any number of _balanced_ parenthesis. (Note the difference of approaches of first and second example — first one transforms trees, while the second one transforms strings that are treated as trees. Also, this hypothetical language seems to me to be very useful.)
I do know that there are a lot of such languages, but the importance of tree-like structures in programmers' job and convenience of match and replace approach justifies such a broad question.
Please do not tell about languages with closed source implementation or implementation, severely restricted in other ways. However, if you know really nice language without working implementation, it may be worth to mention. Thanks.

Another language, which is an alternative to XSLT for searching, transforming, and generating XML data, is XQuery (with XQuery Update).

Suffix tries vs dynamic programming for string algorithms

It seems that many difficult string algorithms can be solved both using suffix tries(trees) and Dynamic Programming.
But I am not sure which approach is best to use and when.
Additionally which approach is better to master on the specific area of algorithms and have it in your arsenal in the area of job interviews? I assume it would be the one that would be used more frequently by a programmer in any task or something like that?
This is more of which algorithmic technique is more useful to master as most frequent to use in your job than simply comparing asymptotic notations

Think of a problem requiring the Lexicographically nth substring of a given string : A suffix array is just what you need...and it is easy to learn the bare essentials for solving most problems involving suffix arrays..
On the other hand DP is an algorithmic technique..MASTER IT and you will be able to solve a HUGE number of problems..not only strings.
For an interview though i will take DP anyday...for interviewers, a DP problem lets them make it knotty that is almost impossible to solve without DP (within given constraints) but the solution would mean that you give them a basic recursion and how DP helps you solve it.If it were a suffix-array-only-problem that would mean that they are assessing you over a single data structure( easy once learned) rather than an more general technique which requires mastery.
PS: I had put off learning DP until recently when i got fed up trying to solve problems (that require DP ) using any advanced data structures and would invariably fail ( Case in point : UVA 1394 -- simple problem now that i know how to solve it using DP but instead went on to study segment trees and achieved a O(nlgn) whereas DP gave me O(n). So final advice : if one hasn't studied DP drop everything else and go for it.

honestly, for job interview, no suffix tree is needed. that's too difficult and beyond the scope. however, DP is widely used in interviews for some famous companies like google and facebook.
suffix tree has limitation for solving problems compared with DP. usually it is used to solve string related problems. but DP can solve many different areas.

Text comparison algorithm

We have a requirement in the project that we have to compare two texts (update1, update2) and come up with an algorithm to define how many words and how many sentences have changed.
Are there any algorithms that I can use?
I am not even looking for code. If I know the algorithm, I can code it in Java.

Typically this is accomplished by finding the Longest Common Subsequence (commonly called the LCS problem). This is how tools like diff work. Of course, diff is a line-oriented tool, and it sounds like your needs are somewhat different. However, I'm assuming that you've already constructed some way to compare words and sentences.

An O(NP) Sequence Comparison Algorithm is used by subversion's diff engine.
For your information, there are implementations with various programming languages by myself in following page of github.
https://github.com/cubicdaiya/onp

Some kind of diff variant might be helpful, eg wdiff
If you decide to devise your own algorithm, you're going to have to address the situation where a sentence has been inserted. For example for the following two documents:
The men are bad. I hate the men
and
The men are bad. John likes the men. I hate the men
Your tool should be able to look ahead to recognise that in the second, I hate the men has not been replaced by John likes the men but instead is untouched, and a new sentence inserted before it. i.e. it should report the insertion of a sentence, not the changing of four words followed by a new sentence.

The specific algorithm used by diff and most other comparison utilities is Eugene Myer's An O(ND) Difference Algorithm and Its Variations. There's a Java implementation of it available in the java-diff-utils package.

Here are two papers that describe other text comparison algorithms that should generally output 'better' (e.g. smaller, more meaningful) differences:
Tichy, Walter F., "The String-to-String Correction Problem with Block Moves" (1983). Computer Science Technical Reports. Paper 378.
Paul Heckel, "A Technique for Isolating Differences Between Files", Communications of the ACM, April 1978, Volume 21, Number 4
The first paper cites the second and mentions this about its algorithm:
Heckel[3] pointed out similar problems with LCS techniques and proposed a
linear-lime algorithm to detect block moves. The algorithm performs adequately
if there are few duplicate symbols in the strings. However, the algorithm gives
poor results otherwise. For example, given the two strings aabb and bbaa,
Heckel's algorithm fails to discover any common substring.
The first paper was mentioned in this answer and the second in this answer, both to the similar SO question:
Is there a diff-like algorithm that handles moving block of lines? - Stack Overflow

The difficulty comes when comparing large files efficiently and with good performance. I therefore implemented a variation of Myers O(ND) diff algorithm - which performs quite well and accurate (and supports filtering based on regular expression):
Algorithm can be tested out here: becke.ch compare tool web application
And a little bit more information on the home page: becke.ch compare tool

The most famous algorithm is O(ND) Difference Algorithm, also used in Notepad++ compare plugin (written in C++) and GNU diff(1). You can find a C# implementation here:
http://www.mathertel.de/Diff/default.aspx

Software to Tune/Calibrate Properties for Heuristic Algorithms

Today I read that there is a software called WinCalibra (scroll a bit down) which can take a text file with properties as input.
This program can then optimize the input properties based on the output values of your algorithm. See this paper or the user documentation for more information (see link above; sadly doc is a zipped exe).
Do you know other software which can do the same which runs under Linux? (preferable Open Source)
EDIT: Since I need this for a java application: should I invest my research in java libraries like gaul or watchmaker? The problem is that I don't want to roll out my own solution nor I have time to do so. Do you have pointers to an out-of-the-box applications like Calibra? (internet searches weren't successfull; I only found libraries)
I decided to give away the bounty (otherwise no one would have a benefit) although I didn't found a satisfactory solution :-( (out-of-the-box application)

Some kind of (Metropolis algorithm-like) probability selected random walk is a possibility in this instance. Perhaps with simulated annealing to improve the final selection. Though the timing parameters you've supplied are not optimal for getting a really great result this way.
It works like this:
You start at some point. Use your existing data to pick one that look promising (like the highest value you've got). Set o to the output value at this point.
You propose a randomly selected step in the input space, assign the output value there to n.
Accept the step (that is update the working position) if 1) n>o or 2) the new value is lower, but a random number on [0,1) is less than f(n/o) for some monotonically increasing f() with range and domain on [0,1).
Repeat steps 2 and 3 as long as you can afford, collecting statistics at each step.
Finally compute the result. In your case an average of all points is probably sufficient.
Important frill: This approach has trouble if the space has many local maxima with deep dips between them unless the step size is big enough to get past the dips; but big steps makes the whole thing slow to converge. To fix this you do two things:
Do simulated annealing (start with a large step size and gradually reduce it, thus allowing the walker to move between local maxima early on, but trapping it in one region later to accumulate precision results.
Use several (many if you can afford it) independent walkers so that they can get trapped in different local maxima. The more you use, and the bigger the difference in output values, the more likely you are to get the best maxima.
This is not necessary if you know that you only have one, big, broad, nicely behaved local extreme.
Finally, the selection of f(). You can just use f(x) = x, but you'll get optimal convergence if you use f(x) = exp(-(1/x)).
Again, you don't have enough time for a great many steps (though if you have multiple computers, you can run separate instances to get the multiple walkers effect, which will help), so you might be better off with some kind of deterministic approach. But that is not a subject I know enough about to offer any advice.

There are a lot of genetic algorithm based software that can do exactly that. Wrote a PHD about it a decade or two ago.
A google for Genetic Algorithms Linux shows a load of starting points.

Intrigued by the question, I did a bit of poking around, trying to get a better understanding of the nature of CALIBRA, its standing in academic circles and the existence of similar software of projects, in the Open Source and Linux world.
Please be kind (and, please, edit directly, or suggest editing) for the likely instances where my assertions are incomplete, inexact and even flat-out incorrect. While working in related fields, I'm by no mean an Operational Research (OR) authority!
[Algorithm] Parameter tuning problem is a relatively well defined problem, typically framed as one of a solution search problem whereby, the combination of all possible parameter values constitute a solution space and the parameter tuning logic's aim is to "navigate" [portions of] this space in search of an optimal (or locally optimal) set of parameters.
The optimality of a given solution is measured in various ways and such metrics help direct the search. In the case of the Parameter Tuning problem, the validity of a given solution is measured, directly or through a function, from the output of the algorithm [i.e. the algorithm being tuned not the algorithm of the tuning logic!].
Framed as a search problem, the discipline of Algorithm Parameter Tuning doesn't differ significantly from other other Solution Search problems where the solution space is defined by something else than the parameters to a given algorithm. But because it works on algorithms which are in themselves solutions of sorts, this discipline is sometimes referred as Metaheuristics or Metasearch. (A metaheuristics approach can be applied to various algorihms)
Certainly there are many specific features of the parameter tuning problem as compared to the other optimization applications but with regard to the solution searching per-se, the approaches and problems are generally the same.
Indeed, while well defined, the search problem is generally still broadly unsolved, and is the object of active research in very many different directions, for many different domains. Various approaches offer mixed success depending on the specific conditions and requirements of the domain, and this vibrant and diverse mix of academic research and practical applications is a common trait to Metaheuristics and to Optimization at large.
So... back to CALIBRA...
From its own authors' admission, Calibra has several limitations
Limit of 5 parameters, maximum
Requirement of a range of values for [some of ?] the parameters
Works better when the parameters are relatively independent (but... wait, when that is the case, isn't the whole search problem much easier ;-) )
CALIBRA is based on a combination of approaches, which are repeated in a sequence. A mix of guided search and local optimization.
The paper where CALIBRA was presented is dated 2006. Since then, there's been relatively few references to this paper and to CALIBRA at large. Its two authors have since published several other papers in various disciplines related to Operational Research (OR).
This may be indicative that CALIBRA hasn't been perceived as a breakthrough.

State of the art in that area ("parameter tuning", "algorithm configuration") is the SPOT package in R. You can connect external fitness functions using a language of your choice. It is really powerful.
I am working on adapters for e.g. C++ and Java that simplify the experimental setup, which requires some getting used to in SPOT. The project goes under name InPUT, and a first version of the tuning part will be up soon.

what is the fastest word search on index?

i'm coding a query engine to search through a very large sorted index file. so here is my plan, use binary search scan together with Levenshtein distance word comparison for a match. is there a better or faster ways than this? thanks.

You may want to look into Tries, and in many cases they are faster than binary search.

If you were searching for exact words, I'd suggest a big hash table, which would give you results in a single lookup.
Since you're looking at similar words, maybe you can group the words into many files by something like their soundex, giving you much shorter lists of words to compute the distances to. http://en.wikipedia.org/wiki/Soundex

In your shoes, I would not reinvent the wheel - rather I'd reach for the appropriate version of the Berkeley DB (now owned by Oracle, but still open-source just as it was back when it was owned and developed by the UC at Berkeley, and later when it was owned and developed by Sleepycat;-).
The native interfaces are C and Java (haven't tried the latter actually), but the Python interface is also pretty good (actually better now that it's not in Python's standard library any more, as it can better keep pace with upstream development;-), C++ is of course not a problem, etc etc -- I'm pretty sure you can use if from most any language.
And, you get your choice of "BTree" (actually more like a B*Tree) and hash (as well as other approaches that don't help in your case) -- benchmark both with realistic data, btw, you might be surprised (one way or another) at performance and storage costs.
If you need to throw multiple machines at your indexing problem (because it becomes too large and heavy for a single one), a distributed hash table is a good idea -- the original one was Chord but there are many others now (unfortunately my first-hand experience is currently limited to proprietary ones so I can't really advise you here).

after your comment on David's answer, I'd say that you need two different indexes:
the 'inverted index', where you keep all the words, each with a list of places found
an index into that file, to quickly find any word. Should easily fit in RAM, so it can be a very efficient structure, like a Hash table or a Red/Black tree. I guess the first index isn't updated frequently, so maybe it's possible to get a perfect hash.
or, just use Xapian, Lucene, or any other such library. There are several widely used and optimized.
Edit: I don't know much about word-comparison algorithms but I guess most aren't compatible with hashing. In that case, R/B Trees or Tries might be the best way.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string