Detection of similar sequences in ordered event lists - linux

I have logs from a bunch (millions) of small experiments.
Each log contains a list (tens to hundreds) of entries. Each entry is a timestamp and an event ID (there are several thousand event IDs, each of which may occur many times across the logs):
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 alpha
1403973098 delta
I know that one event may trigger other events later.
I am researching this dataset. I am looking for "stable" sequences of events that occur often enough in the experiments.
Is there a way to do this without writing too much code and without using proprietary software? The solution should be scalable enough, and work on large datasets.
I think that this task is similar to what bioinformatics does: finding sequences in DNA and such. Only my task involves many more than four letters in the alphabet... (Update, thanks to @JayInNyc: proteomics deals with larger alphabets than mine.)
(Note, BTW, that I do not know beforehand how stable and how similar I want my sequences to be, what the minimal sequence length is, etc. I'm researching the dataset, and will have to figure this out as I go.)
Anyway, any suggestions on the approaches/tools/libraries I could use?
Update: Some answers to the questions in comments:
Stable sequences: found often enough across the experiments. (How often is enough? Don't know yet. Looks like I need to compute a ranking of the chains and discard the rarest.)
Similar sequences: sequences that look similar. "Are the sequences 'A B C D E' and 'A B C E D' (minor difference in order) similar according to you? Are the sequences 'A B C D E' and 'A B C 1 D E' (order of occurrence of the selected events is the same) also similar according to you?" — Yes to both questions. More drastic mutations are probably also OK. Again, I'd like to be able to compute a ranking and discard the most dissimilar...
Timing: I can discard timing information for now (but not order). But it would be cool to have it in a similarity index formula.
Update 2: Expected output.
In the end I would like to have a ranking of the most popular, longest, most stable chains. A combination of all three factors should contribute to the calculation of the rating score.
A chain in such a ranking is, obviously, really a cluster of sufficiently similar chains.
A synthetic example of a chain-cluster:
alpha
beta
gamma
[garbage]
[garbage]
delta
another:
alpha
beta
gamma|zeta|epsilon
delta
(or whatever variant has not come to my mind right now.)
So, the end output would be something like this (numbers are completely random in this example):
Chain cluster ID | Times found | Time stab. factor | Chain stab. factor | Length | Score
A | 12345 | 123 | 3 | 5 | 100000
B | 54321 | 12 | 30 | 3 | 700000

I have thought about this setup for the past day or so -- how to do it in a sane, scalable way in bash, etc. The answer really depends on the relational information you want to draw from the data and the apparent size of the dataset you currently have. The cleanest solution will be to load your datasets into a relational database (MariaDB would be my recommendation).
Since your data already exists in a fairly clean format, you have two options for getting it into a database. (1) If the files have the data in a usable row/column layout, you can simply use LOAD DATA INFILE to bring your data into the database; or (2) parse the files with bash in a while read line; do loop, massage the data into the table format you want, and use mysql batch mode to load the information into MySQL in a single pass. The general form of the bash command would be mysql -uUser -hHost database -Bse "your insert command".
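If you prefer option (1), here is a sketch of the preprocessing step in Python rather than a while read loop: it flattens per-experiment logs into one TSV ready for LOAD DATA INFILE. The column layout (experiment_id, seq_no, ts, event) and the experiment-ID-from-filename convention are my own assumptions, not anything prescribed above.

```python
import csv
import glob
import os

def logs_to_tsv(log_glob, out_path):
    """Flatten per-experiment logs into one TSV with columns:
    experiment_id, seq_no, ts, event -- ready for LOAD DATA INFILE."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for path in sorted(glob.glob(log_glob)):
            # Use the file name (minus extension) as the experiment ID.
            exp_id = os.path.splitext(os.path.basename(path))[0]
            with open(path) as f:
                for seq_no, line in enumerate(f):
                    parts = line.split()
                    if len(parts) != 2:
                        continue  # skip malformed lines
                    ts, event = parts
                    writer.writerow([exp_id, seq_no, ts, event])
```

The seq_no column preserves within-experiment order even when timestamps collide (as in your gamma/alpha example). A hypothetical load would then be something like LOAD DATA INFILE 'events.tsv' INTO TABLE events; with a matching four-column table.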
Once the data is in a relational database, you have the proper tool for running flexible queries against it in a sane manner, instead of continually writing and re-writing bash snippets to handle the data a different way each time. That is probably the scalable solution you are looking for: a little more work up-front, but a much better setup going forward.

Wikipedia defines an algorithm as 'a precise list of precise steps': 'I am looking for "stable" sequences of events that occur often enough in the experiments.' "Stable" and "often enough" without definitions make the task of giving you an algorithm impossible.
So I give you the trivial one that calculates the frequency of sequences of length 2. I will ignore the time stamps. Here is the awk code (pW stands for previous Word, pcs stands for pair counters):
#!/usr/bin/awk -f
BEGIN { getline; pW = $2; }    # prime the loop with the first event
{ pcs[pW, $2]++; pW = $2; }    # count each consecutive pair
END {
    for (i in pcs)
        print i, pcs[i];       # SUBSEP is non-printing, so pairs appear fused
}
I duplicated your sample (with small variations) to show something meaningful-looking:
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 alpha
1403973098 delta
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 beta
1403973098 delta
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 beta
1403973098 delta
1403973044 alpha
1403973045 beta
1403973070 gamma
1403973070 beta
1403973098 delta
Running the code above on it gives:
gammaalpha 1
alphabeta 4
gammabeta 3
deltaalpha 3
betagamma 4
alphadelta 1
betadelta 3
which can be interpreted as: alpha followed by beta, and beta followed by gamma, are the most frequent length-two sequences, each occurring 4 times in the sample. I guess that would be your definition of a stable sequence occurring often enough.
What's next?
(1) You can easily adapt the code above to sequences of length N, and to find sequences occurring often enough you can sort (-k2nr) the output on the second column.
(2) To put a limit on N you can stipulate that no event triggers itself, which gives you a cut-off point. Or you can place a limit on the timestamps, i.e. on the difference between consecutive events.
(3) So far those sequences were really strings and I used exact matching between them (CLRS terminology). Nothing prevents you from using your favourite similarity measure instead:
{ pcs[CLIFY(pW, $2)]++; pW=$2; }
CLIFY would be a function which takes k consecutive events and puts them into a bin, i.e. maybe you want ABCDE and ABDCE to go into the same bin. CLIFY could of course take the set of bins so far as an additional argument.
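The CLIFY idea can be sketched in Python. The particular binning rule below (keep the first event, sort the rest) is just one assumption about what "similar" means; it makes minor reorderings like ABCDE and ABDCE land in the same bin, as discussed in the question.

```python
from collections import Counter

def kgrams(events, k):
    """All windows of k consecutive events."""
    return [tuple(events[i:i + k]) for i in range(len(events) - k + 1)]

def clify(gram):
    """One possible binning: keep the first event and sort the rest,
    so minor reorderings (ABCDE vs ABDCE) share a bin."""
    return (gram[0],) + tuple(sorted(gram[1:]))

def count_bins(events, k):
    """Frequency of each bin of k-length event windows."""
    return Counter(clify(g) for g in kgrams(events, k))
```

Sorting the output of count_bins by frequency then plays the role of the sort -k2nr step above.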
The choice of awk is for convenience. For millions of logs a single process wouldn't fly, but you can easily run many instances in parallel.
It is unclear what you want to use this for, but a Google search for Markov chains and Mark V. Shaney would probably help.

Related

Transportation problem to minimize the cost using genetic algorithm

I am new to genetic algorithms, and here is a simple part of what I am working on:
There are factories (1, 2, 3) which can serve any of the following customers (A, B, C), and the transportation costs are given in the table below. There are also fixed costs for A, B, C (2, 4, 1).
A B C
1 5 2 3
2 2 4 6
3 8 5 5
How can I solve this transportation problem to minimize the cost using a genetic algorithm?
First of all, you should understand what a genetic algorithm is and why it is called that: it mimics biological evolution, applying crossovers and mutations to a population of candidate solutions to reach a better state.
So, you need to design your chromosome first. In your situation, let's pick one side, customers or factories. Let's take customers. Your solution will look like:
1 -> A
2 -> B
3 -> C
So, your example chromosome is "ABC". Then create another chromosome ("BCA" for example)
Now you need a fitness function which you wish to minimize/maximize.
This function determines your chromosomes' breeding chances. In your situation, that will be the total cost.
Write a function that calculates the cost for a given factory and a given customer.
Now, what you're going to do is,
Pick 2 chromosomes weighted randomly. (Weights are calculated by the fitness function.)
Pick an index into the 2 chromosomes and create new chromosomes by swapping their parts at that index.
If a new chromosome has invalid parts (such as "ABA" in your situation), make a fixing move (turn one of the "A"s into a "C", for example). We call this a "mutation".
Add your new chromosome to the chromosome set if it wasn't there before.
Go back to the first step and repeat.
You'll do this for a number of iterations. You may end up with thousands of chromosomes. When you think it's enough, stop the process and sort the chromosome set ascending/descending by fitness. The first chromosome will be your result.
I'm aware this makes the process dependent on time and chance. I'm aware you may or may not find an optimum (the fittest, in biological terms) chromosome if you do not run it long enough. But that's what a genetic algorithm is. Even your first and second runs may produce different results, and that's fine.
Just for your situation, the set of possible chromosomes is very small, so I guarantee that you will find an optimum in a second or two, because the entire chromosome set is ["ABC", "BCA", "CAB", "BAC", "CBA", "ACB"].
In summary, you need three pieces of information to apply a genetic algorithm:
How should my chromosome be? (And initial chromosome set)
What is my fitness function?
How to make cross-overs in my chromosomes?
There are some other things to care about this problem:
Without mutation, a genetic algorithm can get stuck at a local optimum. With it, a genetic algorithm can still be used for optimization problems with constraints.
Even if a chromosome has a very low chance of being picked for crossover, you shouldn't sort and truncate the chromosome set until the end of the iterations. Otherwise, you may get stuck at a local extremum or, worse, you may end up with an ordinary solution candidate instead of the global optimum.
To speed up the process, pick dissimilar initial chromosomes. Without a sufficient mutation rate, finding the global optimum can be a real pain.
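A minimal sketch of the recipe above, assuming (as in the question) that chromosome position i is the customer assigned to factory i. The selection scheme, the repair-style crossover, and the "random immigrant" added each generation are my own choices, not prescribed by the answer; the fixed costs add a constant 7 to every assignment.

```python
import random

# Cost tables from the question: COST[factory][customer], plus fixed costs.
COST = {"1": {"A": 5, "B": 2, "C": 3},
        "2": {"A": 2, "B": 4, "C": 6},
        "3": {"A": 8, "B": 5, "C": 5}}
FIXED = {"A": 2, "B": 4, "C": 1}

def cost(chrom):
    """Total cost when factory i serves customer chrom[i]."""
    return sum(COST[f][c] + FIXED[c] for f, c in zip("123", chrom))

def crossover(p1, p2):
    """One-point crossover with repair: copy a prefix of p1, then fill
    with p2's customers in order, skipping duplicates."""
    cut = random.randint(1, len(p1) - 1)
    head = p1[:cut]
    return head + "".join(c for c in p2 if c not in head)

def swap_mutate(chrom):
    """Swap mutation: exchange the customers at two random positions."""
    i, j = random.sample(range(len(chrom)), 2)
    s = list(chrom)
    s[i], s[j] = s[j], s[i]
    return "".join(s)

def solve(generations=300, seed=0):
    random.seed(seed)
    pop = ["".join(random.sample("ABC", 3)) for _ in range(6)]
    best = min(pop, key=cost)
    for _ in range(generations):
        # Pick two parents weighted by fitness (lower cost -> higher weight).
        weights = [1.0 / cost(c) for c in pop]
        p1, p2 = random.choices(pop, weights=weights, k=2)
        child = swap_mutate(crossover(p1, p2))
        # A "random immigrant" per generation keeps the search exploring.
        immigrant = "".join(random.sample("ABC", 3))
        pop = sorted(set(pop) | {child, immigrant}, key=cost)[:6]
        best = min(best, pop[0], key=cost)
    return best, cost(best)
```

With the search space being only the 6 permutations listed above, this converges essentially immediately; the structure is what matters for the larger instances discussed below.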
As mentioned in nejdetckenobi's answer, in this case the solution search space is very small, i.e. only 6 feasible solutions: ["ABC", "BCA", "CAB", "BAC", "CBA", "ACB"]. I assume this is only a simplified version of your problem, and your actual problem contains more factories and customers (with the numbers of factories and customers equal). In that case, you can make use of special mutation and crossover operators to avoid infeasible solutions with repeated customers, e.g. "ABA", "CCB", etc.
For mutation, I suggest using a swap mutation, i.e. randomly pick two customers and swap their corresponding factories (positions):
ABC mutate to ACB
ABC mutate to CBA

Create (mathematical) function from set of predefined values

I want to create an Excel table that will help me estimate implementation times for the tasks I am given. To do so, I derived 4 categories in which I individually rate each task from 1 to 10.
Those are: complexity of the system (simple scripts or entire business systems), state of the requirements (well defined or very soft), knowledge about the system (how much I know about the system and the code base), and plan for implementation (do I know what to do, or do I have no plan for what to do or where to start).
After rating each task in these categories, I want a resulting factor of how expensive the task will be and how long it will likely take, as a very rough estimate that I can give my bosses.
What I thought about doing
I thought to create a function where I define the inputs and then get the result as a number, see:
| a | b | c | d | Result |
| 1 | 1 | 1 | 1 | 160 |
| 5 | 5 | 5 | 5 | 80 |
| 10 | 10 | 10 | 10 | 2 |
And I want to create a function that, when given a, b, c, d, will produce the results above for the extreme cases (max, min, avg) and of course any values (floats) in between.
How can I go about doing this? I imagine this is some form of polynomial interpolation problem, but how can I actually construct the function that produces these results?
I have tasks like this often, so it would be nice to have a pattern to follow whenever I need to create such functions for any number of parameters and results.
I tried using Wolfram Alpha's interpolating polynomial command for this, but the result is just a mess of extremely large fractions...
How can I create this function properly with reasonable results?
While writing this edit, I realize this may be better suited over at programmers.SE - If no one answers here, I will move the question there.
You don't have enough data as it is. The simplest formula which takes all four explanatory variables into account would be linear:
x0 + x1*a + x2*b + x3*c + x4*d
If you formulate a set of equations for this, you have three equations but five unknowns, which means you don't have a unique solution. On the other hand, the data points you did provide prove that the relation between scores and time is not exactly linear. So you might have to look at some family of functions which is even more complex, and therefore has even more parameters to tune. While it would be easy to tune parameters to match the input, that choice would be pretty arbitrary, and therefore without predictive power.
So while your system of four distinct scores might be useful in the long run, I'd not use it at the moment. I'd suggest you collect some more data points, see how long a given task actually took you, and only use that fine-grained a model once you have enough data points to fit all of its parameters.
In the meantime, aggregate all four numbers into a single number, e.g. by taking their average. Then decide on a formula, e.g. a quadratic one:
182 - 22.9*a + 0.49*a*a
That's a fair fit for your requirements, and not too complex or messy. But the choice of function, i.e. a polynomial one, is still pretty arbitrary. So revisit that choice once you have more data. Note that this polynomial is almost the one Wolfram Alpha found for your data:
1642/9 - 344/15*a + 22/45*a*a
I only converted these rational numbers to decimal notation, truncating them early since all of this is very rough in any case.
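The exact fit above can be reproduced without Wolfram Alpha. This sketch uses only the standard library and the Lagrange basis to interpolate the quadratic through the three calibration points; the choice of a quadratic is, as said, arbitrary.

```python
from fractions import Fraction

# Calibration points from the question: (average score, estimate).
points = [(1, 160), (5, 80), (10, 2)]

def interpolating_quadratic(pts):
    """Exact coefficients (c0, c1, c2) of c0 + c1*a + c2*a^2 through
    three points, via the Lagrange basis, using rationals."""
    c0 = c1 = c2 = Fraction(0)
    for i, (xi, yi) in enumerate(pts):
        (x1, _), (x2, _) = [p for j, p in enumerate(pts) if j != i]
        # Contribution of yi * (a - x1)(a - x2) / ((xi - x1)(xi - x2)).
        w = Fraction(yi, (xi - x1) * (xi - x2))
        c2 += w
        c1 -= w * (x1 + x2)
        c0 += w * x1 * x2
    return c0, c1, c2

c0, c1, c2 = interpolating_quadratic(points)
```

The result is exactly 1642/9 - 344/15*a + 22/45*a*a, i.e. the rational coefficients quoted above.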
On the whole, this question appears more suited to Cross Validated than to Programmers SE, in my opinion. But don't bother them unless you have sufficient data to actually fit a model.

build shortest string from a collection of its subsequence

Given a collection of subsequences of a string.
For example:
abc
acd
bcd
The problem is: how do we determine the shortest string that contains all of these as subsequences?
For the above example, the shortest string is abcd.
Here subsequence means part of a string, but not necessarily consecutive; e.g. acd is a subsequence of the string abcd.
Edit: This problem actually comes from Project Euler problem 79. In that problem, we have 50 subsequences, each of length 3, and all characters are digits.
This problem is well studied and is called the "Shortest common supersequence" problem.
For two strings it can be solved in polynomial time: compute the longest common subsequence with dynamic programming in O(nm), where n and m are the string lengths, and then merge the two strings around it. (The linear-time suffix-array approach applies to the longest common substring, not subsequence.)
For more than two strings, it is NP-Complete.
Complexity
In the general case, as mentioned above, this problem is NP-hard. There is a way to attack it with suffix structures, but you can also use a directed graph. I cannot say for sure whether that is better in some sense, but it may have advantages in difficult corner cases.
Graph solution
To see how you can use a graph, you just need to build it properly. Let the letters of the strings be vertices, and let the order of the letters define the edges. That means ab yields vertices a and b and a connection between them. Of course, connections are directed (i.e. if a is connected to b, it doesn't follow that b is connected to a; so ab means a --> b).
After these definitions you'll be able to build the whole graph. For your sample it's simple enough (graph images omitted here). But abcd can also be represented with length-two strings as ab, ac, ad, bc, bd, cd, so I'll consider the graph for that too (just for more clarity).
Then, to find your solution, you need to find the longest path in this graph. Obviously, this is where the NP-hardness comes from. In both cases above the maximum length is 4, reached by starting from vertex a and traversing the graph along the path a->b->c->d.
Corner cases
Non-unique solution: in fact, you may face string sets which do not strictly define a single supersequence. Example: ab, ad, bd, ac, cd. Both abcd and acbd fit those substrings. But this is actually not too bad a problem. Look at the graph:
(I chose that example deliberately: it's like the second graph but without one connection, which is why the result is ambiguous.) The maximum path length is now 3, but it can be reached by two paths: abd and acd. So how do we restore at least one solution? Simple: since the result strings have the same length (which follows from how we found them), we can walk from the start of the first string and check the symbols of the second string. If they match, the symbol's position is fixed; if they don't, we are free to choose either order for the current symbols. So that will be:
[a] matches [a], so current result is a
[b] mismatches [c], so we can place either bc or cb. Let it be first. Then result is abc
[d] matches [d], so current result is abc+d, i.e. abcd
This is a kind of "merge" where we are free to choose either result. However, this is a slightly twisted case, because now we can't use just any found path: we must find all paths of maximum length (otherwise we won't be able to restore the full supersequence).
Next, the non-existent solution: there may be cases when the order of symbols in the strings cannot be used to reproduce a supersequence. That means one order violates another, i.e. there are two strings in which some two symbols appear in different orders. Again, the simplest case: abc, cbd. On the graph:
The immediate consequence is a cycle in the graph. It may not be as simple as in the graph above, but there will always be one if the order is broken. Thus, all you need is to detect a cycle, and in fact this isn't something you must do separately: you can just add cycle detection to the longest-path search.
Third, repeated symbols: this is the trickiest case. If a string contains repeated symbols, graph creation is more complicated, but still possible. For example, suppose we have three strings: aac, abc, bbc. A solution for this is aabbc. How do we build the graph? Now we can't just add links (because of loops, at least), but I suggest the following approach:
Process all the strings, assigning indexes to symbols. The index is reset per string. If a symbol appears only once, its index will be 1. For our strings that means: a1a2c1, a1b1c1 and b1b2c1. After that, store the maximum index found for each symbol. For the sample above that is 2 for a, 2 for b and 1 for c.
If two connected indexed symbols have the same original symbol, the connection is added "as is". For example, a1a2 produces only one connection, from a1 to a2.
If two connected indexed symbols have different original symbols, then any copy of the first symbol may be connected to any copy of the second symbol. For example, bc results in a (b1 to c1) connection and a (b2 to c1) connection. How do we know how many indexed copies a symbol has? We found that in the first step (the maximum indexes).
Thus, graph for sample above will look like:
+------------+
| |
| |
+------>[a2]------+ |
| | | |
| V V |
[a1]---->[c1]<----[b2] |
| ^ ^ |
| | | |
+------>[b1]------+ |
^ |
| |
+------------+
So we'll have a maximum path length of 5, reached by the paths a1a2b2b1c1 and a1a2b1b2c1. We can choose either of them and ignore the indexes, giving the result string aabbc.
Maybe there is no general efficient algorithm for this kind of problem, but here is a solution for this particular problem, Project Euler 79. Observe the input file: you will find that 7 only ever appears at the beginning of the sequences, so 7 should be put first. Then digit 3 becomes the only character that appears only at the beginning, so you put 3 in the second position. Continuing in this way, you get 73162890. The special case is the last 2 digits: both 0 and 9 appear at the beginning, so you have 2 choices; you try both and find that 90 is the optimal ending.
More generally, you can use a depth-first-search algorithm. The trick is: when you find a character which only appears at the beginning, choose it; it will lead to the optimal solution.
I think we can start with graphs.
I am not sure of correctness, but say we build a graph with a->b if b comes after a, with all edges of length 1.
Now we have to find the longest distance (DFS can help).
I will try to provide an example.
Say the strings are:
abc
acd
bcd
bce
we form a graph:
Now the main thing left is to combine nodes e and d, because the required string can be abcde or abced.
This is the part I am not sure how to do, so maybe somebody can help!
Sorry for posting this as an answer; comments can't include pictures.
Construct the graph from the subsequences and then find a topological sort of the graph in O(V+E) using DFS to get the desired shortest string that has every subsequence in it. But as you may have noticed, the topological sort is invalid if the graph has cycles; in that case characters in the cycles need to be repeated, which is harder to solve.
Conclusion: you get lucky if there are no cycles in the graph and solve it in O(V+E); otherwise you get unlucky and end up doing brute force.
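For the lucky case (no cycles, no symbol needs to repeat), the topological-sort approach above can be sketched with Kahn's algorithm. The smallest-first tie-break is my own choice for determinism; any tie-break yields a valid supersequence of the same length.

```python
import heapq
from collections import defaultdict

def shortest_supersequence(subseqs):
    """Kahn's topological sort over the precedence graph of the
    subsequences; assumes the cycle-free, no-repeated-symbol case."""
    edges = defaultdict(set)
    nodes = set()
    for s in subseqs:
        nodes.update(s)
        for a, b in zip(s, s[1:]):
            edges[a].add(b)  # a must come somewhere before b
    indeg = {n: 0 for n in nodes}
    for a in edges:
        for b in edges[a]:
            indeg[b] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    heapq.heapify(ready)
    result = []
    while ready:
        n = heapq.heappop(ready)  # smallest-first; any tie-break is valid
        result.append(n)
        for m in edges[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                heapq.heappush(ready, m)
    if len(result) != len(nodes):
        raise ValueError("cycle: some characters would have to repeat")
    return "".join(result)
```

On the abc/acd/bcd example this yields abcd; on an order conflict like abc/cbd it raises instead of silently producing a wrong answer.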

Learning/Detecting Mutatable Parts of a URL in Logs

Say you have a webserver log (Apache, nginx, whatever). From it you extract a large list of URLs:
/article/1/view
/article/2/view
/article/1/view
/article/1323/view
/article/1/edit
/help
/article/1/view
/contact
/contact/thank-you
/article/8/edit
...
or
/blog/2012/06/01/how-i-will-spend-my-summer-vacation
/blog/2012/08/30/how-i-wasted-my-summer-vacation
...
You explode these URLs into their pieces such that you have ['article', '1323', 'view'] or ['blog', '2012', '08', '30', 'how-i-wasted-my-summer-vacation'].
How would one go about analyzing and comparing these URLs to detect and call out "variables" in the URL path? That is, you would want to recognize things like /article/XXX/view, /article/XXX/edit, and /blog/XXX/XXX/XXX/XXX, so that you can summarize information about those lines in the logs.
I assume there will need to be some statistical threshold for the number of differences that constitute a mutable piece versus a similar-looking but different template. I am also unsure what data structure would make this analysis quick and easy.
I would like the script to output what it thinks are all the URL templates present on the server, possibly with a confidence value where appropriate.
A simple solution would be to count path-prefix occurrences and learn which values correspond to templates. Assume that the file input contains the URLs from your first snippet. Then compute the per-prefix visit counts:
awk -F '/' '{ for (i=2; i<=NF; ++i) { for (j=2; j<=i; ++j) printf "/%s", $j; printf "\n" }}' input \
| sort \
| uniq -c \
| sort -rn
This yields:
7 /article
4 /article/1
3 /article/1/view
2 /contact
1 /help
1 /contact/thank-you
1 /article/8/edit
1 /article/8
1 /article/2/view
1 /article/2
1 /article/1323/view
1 /article/1323
1 /article/1/edit
Now you have a weight for each path prefix, which you can feed into a score function f(x, y), where x represents the count and y the depth of the path. For example, the first line would result in the invocation f(7, 2), and f may return a value in [0, 1], say 0.8, to tell you that the given parametrization corresponds to a template with 80% confidence. Of course, all the magic happens in f, and you would have to come up with reasonable values based on the paths you see being accessed. To develop a good f, you could run logistic regression on a small data set and see how well it predicts the binary feature of being a template or not.
You can also take a mundane route: just drop the tail, e.g., all values <= 1.
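One way to make the "drop the tail" idea concrete is a trie over path segments: wherever a position has many distinct values under the same prefix, collapse them into a wildcard. The threshold of 4 distinct values and the shallow merge are assumptions of this sketch, not part of the answer above.

```python
END = "<end>"  # marks that a URL terminates at this node

def build_trie(urls):
    root = {}
    for u in urls:
        node = root
        for seg in u.strip("/").split("/"):
            node = node.setdefault(seg, {})
        node[END] = {}
    return root

def collapse(node, threshold=4):
    """Where a position has >= threshold distinct values, replace them
    with a single '*' child (shallow merge; adequate for a sketch)."""
    kids = [k for k in node if k != END]
    if len(kids) >= threshold:
        merged = {}
        for k in kids:
            for k2, v2 in node[k].items():
                merged.setdefault(k2, {}).update(v2)
        node = {"*": merged, **{k: node[k] for k in node if k == END}}
    return {k: collapse(v, threshold) for k, v in node.items()}

def templates(node, prefix=""):
    """Enumerate the (possibly wildcarded) URL templates in the trie."""
    for k, v in sorted(node.items()):
        if k == END:
            yield prefix or "/"
        else:
            yield from templates(v, prefix + "/" + k)
```

On the /article sample this yields /article/*/view and /article/*/edit while leaving /help and /contact/thank-you intact; a per-node confidence could then come from the f(count, depth) idea above.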
How about using a DAWG (directed acyclic word graph)? Except the nodes would store not letters, but URI segments (diagram omitted).
This is a very nice data structure: it has pretty minimal memory requirements, it's easy to traverse, and, being a DAG, there are plenty of easy and well-researched algorithms for it. It also happens to describe the state machine that accepts all URLs in the sample and rejects all others (so we might actually build a regular expression out of it, which is very neat, but I'm not clever enough to know how to go about it from there).
Anyhow, with a structure like this, your problem translates into that of finding the "bottlenecks". I'd guess there are proper algorithms for that, but with a large enough sample where variables vary wildly enough, it's basically this: the more nodes there are at a certain depth, the more likely it's a mutable part.
A probably naive approach would be this: keeping a separate DAWG for every starting part, I'd find the mean width of the DAWG (possibly weighted by depth). And if a level's width is above that mean, I'd consider it a variable, with the probability depending on how far it is from the mean. You may very well unleash the power of statistics at this point, modeling the distribution of the widths.
This approach wouldn't fare well with independent patterns starting with the same part, like "shop/?/?" and "shop/admin/?/edit". That could perhaps be mitigated by examining the DAWGs in a more dynamic fashion, using a sliding window of sorts, always examining only part of the DAWG at once, but I don't know how. Oh, and the whole thing fails horribly if the very first part is a variable, but thankfully that's rare.
You may also look out for little things like all nodes of the same level having numerical values (more likely to be a variable), and I'd certainly check for common date patterns in the sample before building the DAWGs; factoring them out would make handling the blog-like patterns easier.
(Oh and, adding the "algorithm" tag would probably attract more attention to the question.)

cluster short, homogeneous strings (DNA) according to common sub-patterns and extract consensus of classes

Task:
to cluster a large pool of short DNA fragments into classes that share common sub-sequence patterns, and to find the consensus sequence of each class.
Pool: ca. 300 sequence fragments
8 - 20 letters per fragment
4 possible letters: a,g,t,c
each fragment is structured in three regions:
5 generic letters
8 or more positions of g's and c's
5 generic letters
(As regex that would be [gcta]{5}[gc]{8,}[gcta]{5})
Plan:
to perform a multiple alignment (e.g. with ClustalW2) to find classes that share common sequences in region 2, and their consensus sequences.
Questions:
Are my fragments too short, and would it help to increase their size?
Is region 2 too homogeneous, with only two allowed letter types, for showing patterns in its sequence?
Which alternative methods or tools can you suggest for this task?
Best regards,
Simon
Yes, 300 is FAR TOO FEW, considering that this is the human genome and you're essentially just looking for a particular 8-mer. There are 65,536 possible 8-mers and 3,000,000,000 bases in the genome (assuming you're looking at the entire genome and not just genic or coding regions). You'll find G/C-only 8-mers 3,000,000,000 / 65,536 * 2^8 =~ 12,000,000 times (and probably many more, since the genome is full of CpG islands). Why choose only 300?
You don't want to use regexes for this task. Just start at chromosome 1, look for the first CG or GC, and extend until you hit the first non-G-or-C. Then take that sequence and its context and save it (in a DB). Rinse and repeat.
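That scan-and-extend step might look like this (a sketch, assuming lowercase input; the run length of 8 and the 5-letter flanks come from the question's fragment structure):

```python
def gc_runs(seq, min_len=8, context=5):
    """Scan a DNA string for maximal runs of g/c of at least min_len,
    returning (flank5, run, flank3) with up to `context` letters each side."""
    runs = []
    i, n = 0, len(seq)
    while i < n:
        if seq[i] in "gc":
            j = i
            while j < n and seq[j] in "gc":
                j += 1          # extend until the first non-G-or-C
            if j - i >= min_len:
                runs.append((seq[max(0, i - context):i],
                             seq[i:j],
                             seq[j:j + context]))
            i = j
        else:
            i += 1
    return runs
```

Applied chromosome by chromosome, each returned triple is one database row: the GC region plus its generic-letter context.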
For this project, Clustal may be overkill, but I don't know your objectives so I can't be sure. If you're only interested in the GC region, then you can do some simple clustering like so:
Make a database entry for each G/C 8-mer (2^8 = 256 in all).
Take each GC-region and walk it to see which 8-mers it contains.
Tag each GC-region with the sequences it contains.
Now, for each 8-mer, you have thousands of sequences which contain it. I'll leave the analysis of the data up to your own objectives.
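The tagging steps above can be sketched as an inverted index (a minimal version; storing it in an actual database rather than a dict is left out):

```python
from collections import defaultdict

def index_by_8mer(gc_regions):
    """Steps 1-3 above: tag each GC-region with every 8-mer it contains,
    building an inverted index from 8-mer to the regions containing it."""
    index = defaultdict(set)
    for region in gc_regions:
        for i in range(len(region) - 7):
            index[region[i:i + 8]].add(region)
    return index
```

Each index entry then gives you the cluster of GC-regions sharing that 8-mer, ready for whatever downstream analysis your objectives call for.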
Your region two, with only 2 letters, may end up a bit too similar; increasing the length or the variability (e.g. more letters) could help.
