Identifying frequent formulas in a codebase - programming-languages

My company maintains a domain-specific language that syntactically resembles the Excel formula language. We're considering adding new builtins to the language. One way to do this is to identify verbose commands that are repeatedly used in our codebase. For example, if we see people always write the same 100-character command to trim whitespace from the beginning and end of a string, that suggests we should add a trim function.
Seeing a list of frequent substrings in the codebase would be a good start (though sometimes the frequently used commands differ by a few characters because of different variable names used).
I know there are well-established algorithms for doing this, but first I want to see if I can avoid reinventing the wheel. For example, I know this concept is the basis of many compression algorithms, so is there a compression module that lets me retrieve the dictionary of frequent substrings? Any other ideas would be appreciated.

The string matching is just the low hanging fruit, the obvious cases. The harder cases are where you're doing similar things but in different order. For example suppose you have:
X+Y
Y+X
Your string matching approach won't realize that those are effectively the same. If you want to go a bit deeper I think you need to parse the formulas into an AST and actually compare the AST's. If you did that you could see that the tree's are actually the same since the binary operator '+' is commutative.
You could also apply reduction rules so you could evaluate complex functions into simpler ones, for example:
(X * A) + ( X * B)
X * ( A + B )
Those are also the same! String matching won't help you there.
Parse into AST
Reduce and Optimize the functions
Compare the resulting AST to other ASTs
If you find a match then replace them with a call to a shared function.

I would think you could use an existing full-text indexer like Lucene, and implement your own Analyzer and Tokenizer that is specific to your formula language.
You then would be able to run queries, and be able to see the most used formulas, which ones appear next to each other, etc.
Here's a quick article to get you started:
Lucene Analyzer, Tokenizer and TokenFilter

You might want to look into tag-cloud generators. I couldn't find any source in the minute that I spent looking, but here's an online one:
http://tagcloud.oclc.org/tagcloud/TagCloudDemo which probably won't work since it uses spaces as delimiters.

Related

Ansible: Any way to do a non-lexicographic sorting of strings containing integers?

I have multiple versions of scripts files within the same directory. I want ansible via the find: module, through some sort of sorting comparison to always choose the highest version file it can. The issue is that the integer sorting I desire won't work because the strings in question aren't a pure integer comparison, and the lexicographic sorting won't give me the proper expected version. These filenames have a naming "convention" but there is no exact hard-coded filename versioning to work with, the versions are random within each project, but if integer ordering could be used, I/we could determine which file to use for each task.
Example, lets say I have the following 3 strings:
script_v3.sh
script_v9.sh
script_v10.sh
The normal sorting methods/filters within ansible (such as the sort filter) will try to do a lexicographic comparison of these strings in order. This means the one with the "highest" value will be script_v9.sh which is bad as I/we expect it should be script_v10.sh this is no good, as it will try to use script_v9.sh and then cause the rest of that process/task to fail. I would love to turn them into integers to do an integer comparison/sort, but as there are other non-numerical characters in the string every attempt so far to do so has been a failure. Note that I/we also need to occasionally use the lowest version in some tasks as well, which also screws up utilizing our example if lexicographic sorting is used.
I would like to know if this is possible to accomplish through some convenient comparison method, or filter which I have overlooked, or if anyone has any better ideas? The only thing I could possibly come up with is to use a regex to strip out the integers, compare them by themselves as integers, and then finally match up the result to the filename which contains the highest 10 value and then have the task use that. However, I'm horrible at regular expressions, and I'm not even certain that's the most elegant way to approach this. Anyone who could help me out would be highly appreciated.

Find any values from a set/list within a string

Long time lurker, first time poster. I'm hoping to get some advice from the brilliant minds in this community. In the project I'm working in, the goal is to look at a user-provided string and determine if the content of that string contains any (one or many) matches to a list of match criteria. For example:
User-provided string: "I like thing a and thing b"
Match List:
Match Criteria
Match Type
Category
Foo
Exact (Case Insensitive)
Bar
Thing a
Contains (Case Insensitive)
Things
Thing b
Contains (Case Insensitive)
Stuff
In this case, it would return the following matches:
Thing a > Things
Thing b > Stuff
As of now, my approach is to iterate through the match criteria list and check each list item against the user-supplied string using the Match Type specified (Exact, Contains, Regular Expression), returning a list of the matches and then doing some stuff with that list. This approach works, even when matching ~100 rules and handling a 200-record batch, but it seems obvious that the performance will be pretty terrible if a large number of rules is introduced.
Is there a better way to do this that would be supported in Apex called by a trigger? I would love to learn a more sophisticated approach if there is one.
Thanks in advance!
What do you need it for. Is it a pure apex exercise or is it "close" to certain standard sObjects? There are lots of built-in features around "fuzzy matching".
In no specific order...
Have you looked into "all things Einstein", from categorising leads to predicting how likely this opportunity is to close. Might not be direction you expected to take but who knows
Obviously SOSL comes to mind, like what powers the global search. It automatically does some substitutions for you like Mike -> Michael
Matching rules, duplicate rules. You'd have a limit of say 5 active rules but you could hook to them up from Apex, including creative abuse of the system. "Dear salesforce, let's pretend I'm making such and such Opportunity, can you find me similar Opportunities?" (plot twist- you're not making an Oppty at all, you're creating account of some venture capitalist looking for investments that match his preferences). Give matching rules a go, if not for everything then at least for more creative fuzzy matching. You really don't want to implement soundex, levenshtein etc manually...
tags? There's somewhat forgotten feature from SF classic, it creates bunch of tables (AccountTag, ContactTag). This plus SOSL could be close to what you need.
Additionally if you need this for anything close to Knowledge Base:
Data Categories come to mind
KB supports synonyms, letting you define your (not very intuitive) "thing b => stuff" mapping.
and it should survive translations

Data.Text vs String

While the general opinion of the Haskell community seems to be that it's always better to use Text instead of String, the fact that still the APIs of most of maintained libraries are String-oriented confuses the hell out of me. On the other hand, there are notable projects, which consider String as a mistake altogether and provide a Prelude with all instances of String-oriented functions replaced with their Text-counterparts.
So are there any reasons for people to keep writing String-oriented APIs except backwards- and standard Prelude-compatibility and the "switch-making inertia"?
Are there possibly any other drawbacks to Text as compared to String?
Particularly, I'm interested in this because I'm designing a library and trying to decide which type to use to express error messages.
My unqualified guess is that most library writers don't want to add more dependencies than necessary. Since strings are part of literally every Haskell distribution (it's part of the language standard!), it is a lot easier to get adopted if you use strings and don't require your users to sort out Text distributions from hackage.
It's one of those "design mistakes" that you just have to live with unless you can convince most of the community to switch over night. Just look at how long it has taken to get Applicative to be a superclass of Monad – a relatively minor but much wanted change – and imagine how long it would take to replace all the String things with Text.
To answer your more specific question: I would go with String unless you get noticeable performance benefits by using Text. Error messages are usually rather small one-off things so it shouldn't be a big problem to use String.
On the other hand, if you are the kind of ideological purist that eschews pragmatism for idealism, go with Text.
* I put design mistakes in scare quotes because strings as a list-of-chars is a neat property that makes them easy to reason about and integrate with other existing list-operating functions.
If your API is targeted at processing large amounts of character oriented data and/or various encodings, then your API should use Text.
If your API is primarily for dealing with small one-off strings, then using the built-in String type should be fine.
Using String for large amounts of text will make applications using your API consume significantly more memory. Using it with foreign encodings could seriously complicate usage depending on how your API works.
String is quite expensive (at least 5N words where N is the number of Char in the String). A word is same number of bits as the processor architecture (ex. 32 bits or 64 bits):
http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html
There are at least three reasons to use [Char] in small projects.
[Char] does not rely on any arcane staff, like foreign pointers, raw memory, raw arrays, etc that may work differently on different platforms or even be unavailable altogether
[Char] is the lingua franka in haskell. There are at least three 'efficient' ways to handle unicode data in haskell: utf8-bytestring, Data.Text.Text and Data.Vector.Unboxed.Vector Char, each requiring dealing with extra package.
by using [Char] one gains access to all power of [] monad, including many specific functions (alternative string packages do try to help with it, but still)
Personally, I consider utf16-based Data.Text one of the most questionable desicions of the haskell community, since utf16 combines flaws of both utf8 and utf32 encoding while having none of their benefits.
I wonder if Data.Text is always more efficient than Data.String???
"cons" for instance is O(1) for Strings and O(n) for Text. Append is O(n) for Strings and O(n+m) for strict Text's. Likewise,
let foo = "foo" ++ bigchunk
bar = "bar" ++ bigchunk
is more space efficient for Strings than for strict Texts.
Other issue not related to efficiency is pattern matching (perspicuous code) and lazyness (predictably per-character in Strings, somehow implementation dependent in lazy Text).
Text's are obviously good for static character sequences and for in-place modification. For other forms of structural editing, Data.String might have advantages.
I do not think there is a single technical reason for String to remain.
And I can see several ones for it to go.
Overall I would first argue that in the Text/String case there is only one best solution :
String performances are bad, everyone agrees on that
Text is not difficult to use. All functions commonly used on String are available on Text, plus some useful more in the context of strings (substitution, padding, encoding)
having two solutions creates unnecessary complexity unless all base functions are made polymorphic. Proof : there are SO questions on the subject of automatic conversions. So this is a problem.
So one solution is less complex than two, and the shortcomings of String will make it disappear eventually. The sooner the better !

Search with attribute values correspondence in Lucene

Here's a text with ambiguous words:
"A man saw an elephant."
Each word has attributes: lemma, part of speech, and various grammatical attributes depending on its part of speech.
For "saw" it is like:
{lemma: see, pos: verb, tense: past}, {lemma: saw, pos: noun, number: singular}
All this attributes come from the 3rd party tools, Lucene itself is not involved in the word disambiguation.
I want to perform a query like "pos=verb & number=singular" and NOT to get "saw" in the result.
I thought of encoding distinct grammatical annotations into strings like "l:see;pos:verb;t:past|l:saw;pos:noun;n:sg" and searching for regexp "pos\:verb[^\|]+n\:sg", but I definitely can't afford regexp queries due to performance issues.
Maybe some hacks with posting list payloads can be applied?
UPD: A draft of my solution
Here are the specifics of my project: there is a fixed maximum of parses a word can have (say, 8).
So, I thought of inserting the parse number in each attribute's payload and use this payload at the posting lists intersectiion stage.
E.g., we have a posting list for 'pos = Verb' like ...|...|1.1234|...|..., and a posting list for 'number = Singular': ...|...|2.1234|...|...
While processing a query like 'pos = Verb AND number = singular' at all stages of posting list processing the 'x.1234' entries would be accepted until the intersection stage where they would be rejected because of non-corresponding parse numbers.
I think this is a pretty compact solution, but how hard would be incorporating it into Lucene?
So... the cheater way of doing this is (indeed) to control how you build the lucene index.
When constructing the lucene index, modify each word before Lucene indexes it so that it includes all the necessary attributes of the word. If you index things this way, you must do a lookup in the same way.
One way:
This means for each type of query you do, you must also build an index in the same way.
Example:
saw becomes noun-saw -- index it as that.
saw also becomes noun-past-see -- index it as that.
saw also becomes noun-past-singular-see -- index it as that.
The other way:
If you want attribute based lookup in a single index, you'd probably have to do something like permutation completion on the word 'saw' so that instead of noun-saw, you'd have all possible permutations of the attributes necessary in a big logic statement.
Not sure if this is a good answer, but that's all I could think of.

A reverse inference engine (find a random X for which foo(X) is true)

I am aware that languages like Prolog allow you to write things like the following:
mortal(X) :- man(X). % All men are mortal
man(socrates). % Socrates is a man
?- mortal(socrates). % Is Socrates mortal?
yes
What I want is something like this, but backwards. Suppose I have this:
mortal(X) :- man(X).
man(socrates).
man(plato).
man(aristotle).
I then ask it to give me a random X for which mortal(X) is true (thus it should give me one of 'socrates', 'plato', or 'aristotle' according to some random seed).
My questions are:
Does this sort of reverse inference have a name?
Are there any languages or libraries that support it?
EDIT
As somebody below pointed out, you can simply ask mortal(X) and it will return all X, from which you can simply pick a random one from the list. What if, however, that list would be very large, perhaps in the billions? Obviously in that case it wouldn't do to generate every possible result before picking one.
To see how this would be a practical problem, imagine a simple grammar that generated a random sentence of the form "adjective1 noun1 adverb transitive_verb adjective2 noun2". If the lists of adjectives, nouns, verbs, etc. are very large, you can see how the combinatorial explosion is a problem. If each list had 1000 words, you'd have 1000^6 possible sentences.
Instead of the deep-first search of Prolog, a randomized deep-first search strategy could be easyly implemented. All that is required is to randomize the program flow at choice points so that every time a disjunction is reached a random pole on the search tree (= prolog program) is selected instead of the first.
Though, note that this approach does not guarantees that all the solutions will be equally probable. To guarantee that, it is required to known in advance how many solutions will be generated by every pole to weight the randomization accordingly.
I've never used Prolog or anything similar, but judging by what Wikipedia says on the subject, asking
?- mortal(X).
should list everything for which mortal is true. After that, just pick one of the results.
So to answer your questions,
I'd go with "a query with a variable in it"
From what I can tell, Prolog itself should support it quite fine.
I dont think that you can calculate the nth solution directly but you can calculate the n first solutions (n randomly picked) and pick the last. Of course this would be problematic if n=10^(big_number)...
You could also do something like
mortal(ID,X) :- man(ID,X).
man(X):- random(1,4,ID), man(ID,X).
man(1,socrates).
man(2,plato).
man(3,aristotle).
but the problem is that if not every man was mortal, for example if only 1 out of 1000000 was mortal you would have to search a lot. It would be like searching for solutions for an equation by trying random numbers till you find one.
You could develop some sort of heuristic to find a solution close to the number but that may affect (negatively) the randomness.
I suspect that there is no way to do it more efficiently: you either have to calculate the set of solutions and pick one or pick one member of the superset of all solutions till you find one solution. But don't take my word for it xd

Resources