Commutative (operational transform) diffs for databases - etherpad

What Unix program generates "diff"s between text files (or
INSERT/UPDATE/DELETEs for databases) in such a way that the order in which the "diffs" are applied is irrelevant and the result is the same
regardless of order?
Etherpad used to do something like this.
Example (for a given document or database):
% Adam makes a change X, then Bob makes a change Y, then Adam makes
another change Z.
% However, because of network latency, Adam sees the changes in this
order: XZY, while Bob sees them in this order: YXZ.
% Nevertheless, the code/changes are written so that XZY and YXZ yield the
same result.
Note: ideally, this can be done without having to apply the inverse of X, Y, or Z at any
point.
I have read the question "Operational Transformation library?",
but I'm not sure it really does what I want.
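To make the requirement concrete, here is a minimal sketch of the operational-transform idea for two concurrent text insertions. It is not Etherpad's actual algorithm; the Insert type, the site tie-breaker, and the transform function are assumptions made for illustration, and deletions and chains of more than two concurrent changes are not handled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Insert:
    pos: int    # index in the document where the text goes
    text: str
    site: int   # author id, used only to break ties at equal positions

def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]

def transform(op: Insert, applied: Insert) -> Insert:
    """Rewrite `op` so that it can run after the concurrent `applied`."""
    shifts = applied.pos < op.pos or (
        applied.pos == op.pos and applied.site < op.site)
    if shifts:
        return Insert(op.pos + len(applied.text), op.text, op.site)
    return op

doc = "abc"
x = Insert(1, "X", site=1)   # Adam's change
y = Insert(1, "Y", site=2)   # Bob's change, made concurrently

adam_view = apply(apply(doc, x), transform(y, x))  # Adam applies X, then Y
bob_view = apply(apply(doc, y), transform(x, y))   # Bob applies Y, then X
# Both views converge to "aXYbc".
```

Note that no inverse operation is needed: each side only adjusts the incoming change against what it has already applied.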

Git (or any smart version control system) will provide this functionality.


How would you implement a heuristic search in Haskell?

In Haskell or some other functional programming language, how would you implement a heuristic search?
Take as an example search space, the nine-puzzle, that is a 3x3 grid with 8 tiles and 1 hole, and you move tiles into the hole until you have correctly assembled a picture. The heuristic is the "Manhattan heuristic", which evaluates a board position adding up the distance each tile is from its target position, taking as the distance the number of squares horizontally plus the number of squares vertically each tile needs to be moved to get to the correct location.
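As a small illustration of the Manhattan heuristic just described, here is a sketch in Python. The board encoding (a tuple of 9 integers in row-major order, with 0 for the hole) and the GOAL layout are assumptions, not something fixed by the puzzle.

```python
def manhattan(board, goal):
    """Sum of horizontal plus vertical distances of each tile from its target.

    Boards are tuples of 9 ints in row-major order; 0 marks the hole,
    which is not counted.
    """
    total = 0
    for index, tile in enumerate(board):
        if tile == 0:
            continue
        target = goal.index(tile)
        total += abs(index // 3 - target // 3) + abs(index % 3 - target % 3)
    return total

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)
one_move_away = (1, 2, 3, 4, 5, 6, 7, 0, 8)  # tile 8 is one square off
```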
I have been reading John Hughes's paper on pretty printing, because I know that the pretty printer backtracks to find better solutions. I am trying to understand how to generalise a heuristic search along these lines.
===
Note that my ultimate aim here is not to write a solver for the 9-puzzle, but to learn some general techniques for writing efficient heuristic searches in FP languages. I am also interested to learn if there is code that can be generalised and re-used across a wider class of such problems, rather than solving any specific problem.
For example, a search space can be characterised by a function that maps a State to a List of States together with some 'operation' that describes how one state is transitioned into another. There could also be a goal function, mapping a State to Bool, indicating when a goal State has been reached. And of course, the heuristic function mapping a State to a Number reflecting how well it is estimated to score. Other descriptions of the search are possible.
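The characterisation above (a successor function, a goal function, and a heuristic function) can be turned into a reusable search routine. Below is a sketch of an A*-style best-first search in Python rather than Haskell; the function name, the unit step cost, and the callback signatures are assumptions chosen for the example.

```python
import heapq
from itertools import count

def best_first_search(start, successors, is_goal, heuristic):
    """A*-style search. Returns the list of operations reaching a goal, or None.

    successors(state) yields (operation, next_state) pairs; each step costs 1.
    States must be hashable.
    """
    tie = count()  # tie-breaker so the heap never compares states directly
    frontier = [(heuristic(start), next(tie), 0, start, [])]
    best_cost = {start: 0}
    while frontier:
        _, _, cost, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        if cost > best_cost.get(state, cost):
            continue  # stale entry: a cheaper route to this state was found
        for operation, nxt in successors(state):
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(nxt), next(tie),
                                          new_cost, nxt, path + [operation]))
    return None
```

The returned path is only guaranteed to be shortest when the heuristic never overestimates the remaining number of steps; with other heuristics the routine still works as a greedy-leaning search.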
I don't think it's necessarily very specific to FP or Haskell (unless you utilize lists as "multiple possibility" monads, as in Learn You A Haskell For Great Good).
One way to do it would be by writing a recursive function taking the following:
the current state (that is the board configuration)
possibly some path metadata, e.g., the number of steps from the initial configuration (which is just the recursion depth), or a memoization-map of all the states already considered
possibly some decision metadata, e.g., a pseudo-random number generator
Within each recursive call, the function would take the state and check whether it is the required result. If not, it would:
if it uses a memoization map, check whether a choice was already considered
if it uses a recursive-step count, check whether to pursue the choices further
if it decides to recursively call itself on the possible choices emanating from this state (e.g., if there are different tiles which can be pushed into the hole), do so in an order based on the heuristic (or possibly in a pseudo-random order biased by the heuristic)
The function would return whether it succeeded, and, if they are used, updated versions of the memoization map and/or pseudo-random number generator.
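The recursive function described above might look roughly like the following Python sketch. The name heuristic_dfs, the callback signatures, and the use of a dictionary as the memoization map are assumptions; the map is mutated in place rather than returned, and the pseudo-random variant is left out.

```python
def heuristic_dfs(state, successors, is_goal, heuristic, max_depth, seen=None):
    """Depth-limited recursive search that tries the most promising moves first.

    Returns the list of states from `state` to a goal, or None on failure.
    `seen` maps each visited state to the largest remaining depth with which
    it has been explored, so revisits with no more budget are skipped.
    """
    if seen is None:
        seen = {}
    if is_goal(state):
        return [state]
    if max_depth == 0 or seen.get(state, -1) >= max_depth:
        return None
    seen[state] = max_depth
    # Lower heuristic values are assumed to mean "closer to the goal".
    for nxt in sorted(successors(state), key=heuristic):
        rest = heuristic_dfs(nxt, successors, is_goal, heuristic,
                             max_depth - 1, seen)
        if rest is not None:
            return [state] + rest
    return None
```

For the nine-puzzle, successors would return the boards reachable by sliding one tile into the hole, and heuristic would be the Manhattan distance to the goal board.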

Algorithm to detect if a file (or string) has been patched

This question is about string algorithms, not version control or management tools.
I learnt the diff algorithm and tried to implement one. That is, given string A and string B, diff calculates a sequence of actions that converts A into B.
I wonder whether it is possible, given a string S and a sequence of actions that a diff algorithm can produce, for an algorithm to tell whether S is (a) the original string A, (b) the patched string B, or (c) an unrelated string. And what if S can only be one of A or B?
Actually, what I'm really doing is researching a method that can tell whether a patch has been applied (at the source code level or the binary code level). I searched Google for some time, but didn't find anything useful.
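At the pure string level, one simple way to approach this is to check whether S matches the "before" side or the "after" side of every change in the patch. The sketch below assumes a hypothetical patch format: a list of (offset_in_A, old, new) hunks, sorted by offset and non-overlapping. It only inspects the changed regions, so text outside the hunks is never checked, and pure insertions or deletions (an empty old or new) cannot be told apart without surrounding context.

```python
def classify(s, hunks):
    """Classify `s` against a patch given as (offset_in_A, old, new) hunks.

    Returns "original", "patched", "ambiguous" (both sides fit),
    or "unrelated".
    """
    fits_a = fits_b = True
    shift = 0  # how far earlier hunks have moved later text in B
    for offset, old, new in hunks:
        if s[offset:offset + len(old)] != old:
            fits_a = False
        start_b = offset + shift
        if s[start_b:start_b + len(new)] != new:
            fits_b = False
        shift += len(new) - len(old)
    if fits_a and fits_b:
        return "ambiguous"
    if fits_a:
        return "original"
    if fits_b:
        return "patched"
    return "unrelated"

hunks = [(4, "quick", "slow"), (10, "brown", "red")]
before = "the quick brown fox"
after = "the slow red fox"
```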
It's pretty complicated, but it can be done, on some level.
Essentially, you parse the source into tokens, and then you build the abstract syntax tree. Once that is done, you must build a diff tool that can do semantic differential analysis between abstract syntax trees. SemanticMerge, for example, does that.
Once that is done, you have the semantic difference between the two versions of the source code, and then you need to define what exactly constitutes a patch.
Some of the rules can be:
1) Variable content was changed
2) An if check was added
The bottom line is that differentiating between a patch and new functionality is not an easy task. The most reliable way is probably to check the binary file version numbers and understand the versioning scheme.
E.g., only the minor version is updated if patches are applied.

Learning/Detecting Mutable Parts of a URL in Logs

Say you have a webserver log (apache, nginx, whatever). From it you extract a large list of URLs:
/article/1/view
/article/2/view
/article/1/view
/article/1323/view
/article/1/edit
/help
/article/1/view
/contact
/contact/thank-you
/article/8/edit
...
or
/blog/2012/06/01/how-i-will-spend-my-summer-vacation
/blog/2012/08/30/how-i-wasted-my-summer-vacation
...
You explode these urls into their pieces such that you have ['article', '1323', 'view'] or ['blog', '2012', '08', '30', 'how-i-wasted-my-summer-vacation'].
How would one go about analyzing and comparing these urls to detect and call out "variables" in the url path? That is to say, you would want to recognize things like /article/XXX/view, /article/XXX/edit, and /blog/XXX/XXX/XXX/XXX such that you can summarize information about those lines in the logs.
I assume that there will need to be some statistical threshold for the number of differences that constitute a mutable piece vs a similar looking but different template. I am also unsure as to what data structure would make this analysis quick and easy.
I would like the script to output what it thinks are all the url templates that are present on the server, possibly with some confidence value if appropriate.
A simple solution would be to count path occurrences and learn which values correspond to templates. Assume that the file input contains the URLs from your first snippet. Then compute the per-path visits:
awk -F '/' '{ for (i=2; i<=NF; ++i) { for (j=2; j<=i; ++j) printf "/%s", $j; printf "\n" }}' input \
| sort \
| uniq -c \
| sort -rn
This yields:
7 /article
4 /article/1
3 /article/1/view
2 /contact
1 /help
1 /contact/thank-you
1 /article/8/edit
1 /article/8
1 /article/2/view
1 /article/2
1 /article/1323/view
1 /article/1323
1 /article/1/edit
Now you have a weight for each path which you can feed into a score function f(x, y), where x represents the count and y the depth of the path. For example, the first line would result in the invocation f(7,2) and may return a value in [0,1], say 0.8, to tell you that the given parametrization corresponds to a template with 80% probability. Of course, all the magic happens in f and you would have to come up with reasonable values based on the paths that you see being accessed. To develop a good f, you could use logistic regression on a small data set and see whether it predicts well the binary feature of being a template or not.
You can also take a mundane route: just drop the tail, e.g., all values <= 1.
How about using a DAWG (directed acyclic word graph)? Except the nodes would store not letters, but the URI pieces.
This is a very nice data structure: it has pretty minimal memory requirements, it's easy to traverse, and, being a DAG, there are plenty of easy and well-researched algorithms for it. It also happens to describe the state machine that accepts all URLs in the sample and rejects all others (so we might actually build a regular expression out of it, which is very neat, but I'm not clever enough to know how to go about it from there).
Anyhow, with a structure like this, your problem translates into that of finding the "bottlenecks". I'd guess there are proper algorithms for that, but with a large enough sample where variables vary wildly enough, it's basically this: the more nodes there are at a certain depth, the more likely it's a mutable part.
A probably naive approach to do it would be like this: keeping separate DAWGs for every starting part, I'd find the mean width of the DAWG (possibly weighted based on the depth). And if a level's width is above that mean, I'd consider it a variable with the probability depending on how far away it is from the mean. You may very well unleash the power of statistics at this point, modeling the distribution of the width.
This approach wouldn't fare well with independent patterns starting with the same part, like "shop/?/?" and "shop/admin/?/edit". This could perhaps be mitigated by examining the DAWGs in a more dynamic fashion, using a sliding window of sorts, always examining only a part of the DAWG at once, but I don't know how. Oh and, the whole thing fails horribly if the very first part is a variable, but that's thankfully rare.
You may also look out for certain little things like all nodes of the same level having numerical values (more likely to be a variable), and I'd certainly check for common date patterns in the sample before building the DAWGs; factoring them out would make handling the blog-like patterns easier.
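A much simplified version of these ideas can be sketched in Python. Instead of a real DAWG it walks the URLs level by level (effectively a trie) and treats a level as a variable when all of its values are numeric or when it has more distinct values than a threshold. The function names, the XXX placeholder, and the max_distinct threshold are assumptions for the example, and no statistical modelling of the width is attempted.

```python
from collections import Counter

def split_path(url):
    return [piece for piece in url.split("/") if piece]

def infer_templates(urls, max_distinct=5):
    """Return a Counter mapping each inferred URL template to its hit count."""
    counts = Counter()

    def walk(paths, prefix):
        depth = len(prefix)
        ended = [p for p in paths if len(p) == depth]
        if ended:
            counts["/" + "/".join(prefix)] += len(ended)
        rest = [p for p in paths if len(p) > depth]
        if not rest:
            return
        values = {p[depth] for p in rest}
        # A "wide" or all-numeric level is assumed to be a variable.
        if all(v.isdigit() for v in values) or len(values) > max_distinct:
            walk(rest, prefix + ("XXX",))
        else:
            for v in sorted(values):
                walk([p for p in rest if p[depth] == v], prefix + (v,))

    walk([tuple(split_path(u)) for u in urls], ())
    return counts

sample = ["/article/1/view", "/article/2/view", "/article/1/view",
          "/article/1323/view", "/article/1/edit", "/help",
          "/article/1/view", "/contact", "/contact/thank-you",
          "/article/8/edit"]
templates = infer_templates(sample)
```

On the sample from the question this groups the article URLs under /article/XXX/view and /article/XXX/edit while leaving /help, /contact, and /contact/thank-you as literal paths.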
(Oh and, adding the "algorithm" tag would probably attract more attention to the question.)

A reverse inference engine (find a random X for which foo(X) is true)

I am aware that languages like Prolog allow you to write things like the following:
mortal(X) :- man(X). % All men are mortal
man(socrates). % Socrates is a man
?- mortal(socrates). % Is Socrates mortal?
yes
What I want is something like this, but backwards. Suppose I have this:
mortal(X) :- man(X).
man(socrates).
man(plato).
man(aristotle).
I then ask it to give me a random X for which mortal(X) is true (thus it should give me one of 'socrates', 'plato', or 'aristotle' according to some random seed).
My questions are:
Does this sort of reverse inference have a name?
Are there any languages or libraries that support it?
EDIT
As somebody below pointed out, you can simply ask mortal(X) and it will return all X, from which you can simply pick a random one from the list. What if, however, that list would be very large, perhaps in the billions? Obviously in that case it wouldn't do to generate every possible result before picking one.
To see how this would be a practical problem, imagine a simple grammar that generated a random sentence of the form "adjective1 noun1 adverb transitive_verb adjective2 noun2". If the lists of adjectives, nouns, verbs, etc. are very large, you can see how the combinatorial explosion is a problem. If each list had 1000 words, you'd have 1000^6 possible sentences.
Instead of the depth-first search of Prolog, a randomized depth-first search strategy could easily be implemented. All that is required is to randomize the program flow at choice points, so that every time a disjunction is reached a random branch of the search tree (= Prolog program) is selected instead of the first.
Note, though, that this approach does not guarantee that all the solutions will be equally probable. To guarantee that, it is necessary to know in advance how many solutions will be generated by every branch, so that the randomization can be weighted accordingly.
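The difference between the plain and the weighted randomization can be sketched in Python on an explicit search tree. The representation is an assumption for the example: a list is a choice point, anything else is a solution, and every branch is assumed to contain at least one solution. Counting the solutions under each branch is exactly the advance knowledge mentioned above, and it is recomputed naively here.

```python
import random

def count_solutions(tree):
    """A leaf (any non-list value) is one solution; a list is a choice point."""
    if not isinstance(tree, list):
        return 1
    return sum(count_solutions(branch) for branch in tree)

def random_solution(tree, rng, uniform=True):
    """Walk from the root to a leaf, choosing one branch at each choice point.

    With uniform=True, branches are weighted by how many solutions they hold,
    so every solution is equally likely. With uniform=False, branches are
    picked with equal probability, which favours solutions in small subtrees.
    """
    while isinstance(tree, list):
        weights = [count_solutions(b) for b in tree] if uniform else None
        tree = rng.choices(tree, weights=weights, k=1)[0]
    return tree

tree = [["socrates"], ["plato", "aristotle", "zeno"]]
pick = random_solution(tree, random.Random(42))
```

In this tree the unweighted walk returns socrates half of the time, while the weighted walk returns each of the four names with probability 1/4.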
I've never used Prolog or anything similar, but judging by what Wikipedia says on the subject, asking
?- mortal(X).
should list everything for which mortal is true. After that, just pick one of the results.
So to answer your questions,
I'd go with "a query with a variable in it"
From what I can tell, Prolog itself should support it quite fine.
I don't think that you can calculate the nth solution directly, but you can calculate the first n solutions (with n picked randomly) and take the last one. Of course this would be problematic if n=10^(big_number)...
You could also do something like
mortal(ID,X) :- man(ID,X).
man(X):- random(1,4,ID), man(ID,X).
man(1,socrates).
man(2,plato).
man(3,aristotle).
but the problem is that if not every man were mortal, for example if only 1 out of 1000000 were mortal, you would have to search a lot. It would be like searching for solutions to an equation by trying random numbers till you find one.
You could develop some sort of heuristic to find a solution close to the number but that may affect (negatively) the randomness.
I suspect that there is no way to do it more efficiently: you either have to calculate the set of solutions and pick one or pick one member of the superset of all solutions till you find one solution. But don't take my word for it xd
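The "pick a member of the superset until it is a solution" strategy is rejection sampling. A minimal Python sketch follows; the function name, the max_tries cut-off, and the toy predicate are assumptions for the example.

```python
import random

def sample_by_rejection(candidates, predicate, rng, max_tries=10_000):
    """Draw random candidates until one satisfies `predicate`.

    Each accepted value is uniform over the solutions, but the expected
    number of tries is len(candidates) / number_of_solutions.
    Returns None if no solution is found within `max_tries` draws.
    """
    for _ in range(max_tries):
        x = candidates[rng.randrange(len(candidates))]
        if predicate(x):
            return x
    return None

# Toy example: 1 candidate in 100 is a "solution".
found = sample_by_rejection(range(1, 1001), lambda n: n % 100 == 0,
                            random.Random(0))
```

This never enumerates the full solution set, but as noted above it degrades badly when solutions are rare among the candidates.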

Update the quantile for a dataset when a new datapoint is added

Suppose I have a list of numbers and I've computed the q-quantile (using Quantile).
Now a new datapoint comes along and I want to update my q-quantile, without having stored the whole list of previous datapoints.
What would you recommend?
Perhaps it can't be done exactly without, in the worst case, storing all previous datapoints.
In that case, can you think of something that would work well enough?
One idea I had, if you can assume normality, is to use the inverse CDF instead of the q-quantile.
Keep track of the sample mean and sample variance as you go and then you can compute InverseCDF[NormalDistribution[sampleMean, Sqrt[sampleVariance]], q] (NormalDistribution takes the standard deviation, not the variance, as its second argument), which should be the value such that a fraction q of the values are smaller, which is what the q-quantile is.
(I see belisarius was thinking along the same lines.
Here's the link he pointed to: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm )
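For illustration, here is the same idea sketched in Python rather than Mathematica: the running mean and variance are updated with Welford's on-line algorithm (the one described at the link above), and the quantile is read off a fitted normal distribution. The class name is an assumption, and the result is only meaningful if the data really are roughly normal.

```python
from statistics import NormalDist

class RunningNormalQuantile:
    """Keeps only n, the mean, and the sum of squared deviations."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        # Welford's update: no previous datapoints need to be stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def quantile(self, q):
        # Needs at least two distinct datapoints and 0 < q < 1.
        variance = self.m2 / (self.n - 1)  # sample variance
        return NormalDist(self.mean, variance ** 0.5).inv_cdf(q)

tracker = RunningNormalQuantile()
for value in [1, 2, 3, 4, 5]:
    tracker.add(value)
median_estimate = tracker.quantile(0.5)
```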
Unless you know that your underlying data comes from some distribution, it is not possible to update arbitrary quantiles without retaining the original data. You can, as others suggested, assume that the data has some sort of distribution and store the quantiles this way, but this is a rather restrictive approach.
Alternatively, have you thought of programming this somewhere besides Mathematica? For example, you could create a class for your datapoints that contains (1) the Double value and (2) some timestamp for when the data came in. In a SortedList of these datapoint objects (which compares based on value), you could get the quantile very fast by simply referencing the index of the datapoint. Want to get a historical quantile? Simply filter on the timestamps in your sorted list.
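A rough Python equivalent of this sorted-list approach is sketched below. It does store every datapoint, so it trades memory for exactness. The class name, the nearest-rank quantile definition, and the since parameter are assumptions for the example.

```python
import bisect
import math

class QuantileStore:
    def __init__(self):
        self._points = []  # (value, timestamp) tuples kept sorted by value

    def add(self, value, timestamp):
        # Binary search for the slot; the list insert itself is O(n).
        bisect.insort(self._points, (value, timestamp))

    def quantile(self, q, since=None):
        """Nearest-rank q-quantile, optionally only over points with
        timestamp >= since. Filtering keeps the values in sorted order."""
        values = [v for v, t in self._points if since is None or t >= since]
        if not values:
            raise ValueError("no datapoints in range")
        index = max(0, math.ceil(q * len(values)) - 1)
        return values[index]

store = QuantileStore()
for timestamp, value in enumerate([5, 1, 3, 2, 4]):
    store.add(value, timestamp)
```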
