Algorithm to detect if a file (or string) has been patched

This question is about string algorithms, not version control or management tools.
I learned the diff algorithm and tried to implement one. That is, given string A and string B, the diff computes a sequence of actions that can convert A into B.
I wonder if it is possible, given a string S and a sequence of actions that the diff algorithm can produce, for an algorithm to tell whether S is (a) the original string A, (b) the patched string B, or (c) an unrelated string. And what if S is guaranteed to be one of A and B?
Actually, what I'm really researching is a method that can tell whether a patch has been applied (at the source code or binary code level). I tried googling for some time, but didn't find anything useful.

It's pretty complicated, but it can be done, to some extent.
Essentially, you parse the source into tokens and then build an abstract syntax tree (AST). Once that is done, you need a diff tool that can perform semantic differential analysis between abstract syntax trees. SemanticMerge, for example, does that.
At that point you have the semantic difference between the two source files, and then you need to define what exactly constitutes a patch.
Some of the rules can be:
1) A variable's content was changed
2) An if check was added
The bottom line is, differentiating between a patch and new functionality is not an easy task. The most reliable way is probably to check the binary's file version numbers and understand the versioning scheme.
E.g., only the minor version is updated when patches are applied.
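
As a side note on the original question: whether S can be classified at all depends on what the diff actions record. If each action carries the text it expects (as context lines do in a unified diff), the check becomes mechanical: the script applies cleanly to A, and its inverse applies cleanly to B. Here is a minimal Haskell sketch under that assumption; the Action type and all names are illustrative, not any standard diff format.

import Data.List (stripPrefix)

-- A diff script whose actions carry the expected text.
data Action
  = Keep String    -- text present in both A and B
  | Delete String  -- text present only in A
  | Insert String  -- text present only in B

-- Try to apply the script to s; success means s looked like A.
apply :: [Action] -> String -> Maybe String
apply [] ""             = Just ""
apply [] _              = Nothing   -- leftover input: mismatch
apply (Keep t   : as) s = (t ++) <$> (stripPrefix t s >>= apply as)
apply (Delete t : as) s = stripPrefix t s >>= apply as
apply (Insert t : as) s = (t ++) <$> apply as s

-- Swapping Insert and Delete turns an A->B script into a B->A script.
invert :: [Action] -> [Action]
invert = map f
  where
    f (Insert t) = Delete t
    f (Delete t) = Insert t
    f (Keep t)   = Keep t

classify :: [Action] -> String -> String
classify script s
  | Just _ <- apply script s          = "original (A)"
  | Just _ <- apply (invert script) s = "patched (B)"
  | otherwise                         = "unrelated"

If the actions are purely positional (e.g. "delete 5 characters at offset 12") and carry no expected text, an unrelated string of the right length is indistinguishable from A, so classification is not possible in general.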

Related

How to implement source map in a compiler?

I'm implementing a compiler in Haskell that compiles a source language to an assembly-like target language.
For debugging purposes, a source map is needed to map each target-language assembly instruction to its corresponding source position (line and column).
I've read extensively about compiler implementation, but none of the material covers source maps.
Can anyone please point me in the right direction on how to generate a source map?
Code samples, books, etc. are all welcome. Haskell is preferred, but other languages are fine too.
Details depend on the compilation technique you're applying.
If you're doing it via a sequence of transforms over intermediate languages, as most sane compilers do these days, your options are the following:
1) Annotate all intermediate representation (IR) nodes with source location information. Introduce special nodes for preserving variable names (they'll all be gone after you do, say, an SSA transform, so you need to track their origins separately)
2) Inject tons of intrinsic function calls (see how it's done in LLVM IR) instead of annotating each node
3) Do a mixture of the above
The first option can even be done nearly automatically: if each transform preserves the source location of an original node in all the nodes it creates from it, you'd only have to manually adjust some non-trivial annotations.
Also keep in mind that some optimisations may render your source location information absolutely meaningless. E.g., value numbering will collapse a number of similar expressions into one, probably preserving the source location of one arbitrary origin. The same goes for rematerialisation.
With Haskell, the first approach will result in a lot of boilerplate in your ADT definitions and pattern matching, even if you sugar-coat it with something like Scrap Your Boilerplate (SYB), so I'd recommend the second approach, which is extensively documented and nicely demonstrated in LLVM IR. A sketch of the first approach follows.
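
To make option 1 concrete, here is a minimal Haskell sketch; the SrcLoc, Expr and Instr types are illustrative, not from any particular compiler. The key discipline is that every transform stamps the nodes it creates with the location of the node it rewrote, and the emitter pairs each instruction with the location of the node that produced it.

data SrcLoc = SrcLoc { srcLine :: Int, srcCol :: Int }
  deriving (Eq, Show)

-- Every IR node carries the source span it came from.
data Expr
  = Lit SrcLoc Int
  | Var SrcLoc String
  | Add SrcLoc Expr Expr
  deriving (Show)

exprLoc :: Expr -> SrcLoc
exprLoc (Lit l _)   = l
exprLoc (Var l _)   = l
exprLoc (Add l _ _) = l

-- A transform that rewrites a node stamps the replacement with the
-- old node's location, so the mapping survives the pass automatically.
constFold :: Expr -> Expr
constFold (Add l a b) =
  case (constFold a, constFold b) of
    (Lit _ x, Lit _ y) -> Lit l (x + y)
    (a', b')           -> Add l a' b'
constFold e = e

data Instr = Push Int | Load String | AddI
  deriving (Show)

-- The emitted (instruction, location) pairs are the source map.
emit :: Expr -> [(Instr, SrcLoc)]
emit e@(Lit _ n)   = [(Push n, exprLoc e)]
emit e@(Var _ x)   = [(Load x, exprLoc e)]
emit e@(Add _ a b) = emit a ++ emit b ++ [(AddI, exprLoc e)]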

HashMap implementation - RPGLE

Is it feasible to implement a sort of hash map in RPGLE?
How would you begin thinking about it?
Should I look at the Java source code and "copy" that style?
The HashMap should ultimately be compatible with every data type.
I'd start here: Implementing a HashMap
You should be able to use the C code as a basis for an RPGLE version. The core structure is small, as the sketch after this answer shows.
Or you could just build the procedures in C and call them from RPGLE.
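
For illustration, here is a minimal sketch of the classic chained hash map (an array of buckets, each holding key/value pairs). It's written in Haskell purely for compactness; the shape is what you'd port to C or RPGLE with a fixed-size array. All names are illustrative.

import Data.Char (ord)

-- n buckets; each bucket is a list of key/value pairs.
data HashMap v = HashMap Int [[(String, v)]]

empty :: Int -> HashMap v
empty n = HashMap n (replicate n [])

-- Classic multiplicative string hash, reduced modulo the bucket count.
hash :: Int -> String -> Int
hash n = foldl (\h c -> (h * 31 + ord c) `mod` n) 0

insert :: String -> v -> HashMap v -> HashMap v
insert k v (HashMap n buckets) = HashMap n (zipWith update [0 ..] buckets)
  where
    i = hash n k
    update j b
      | j == i    = (k, v) : filter ((/= k) . fst) b  -- replace existing key
      | otherwise = b

lookupKey :: String -> HashMap v -> Maybe v
lookupKey k (HashMap n buckets) = lookup k (buckets !! hash n k)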
Depending on your needs (if you don't need a specific order for your elements) you could also use an existing tree-based map: http://rpgnextgen.com/index.php?content=libtree . It uses the red-black tree implementation from the libtree project on GitHub (which is wonderfully compatible C code; congrats to the developer).
The project on RPG Next Gen provides wrappers for character and integer keys. You can store any value in it, as you pass a pointer and a length for it.
And yes, there is a need for data structures like lists, maps, and trees. I use them often for passing data between procedures where I don't know how many elements may be returned. In most programming languages, lists, maps, and trees are part of the language or at least part of the runtime library. Sadly, not so in RPG.
In the end I did my own implementation.
You can find it here:
GitHub - HASHMAP.RPGLE
It is based on the JDK implementation, but the hash code is calculated from a SHA-1 hash, and a modulo operation is used instead of bit shifting.

Convert string version information to integer for easy comparison

I am in the process of designing a program that runs end-user scripts written in Lua. The program defines an interface that specifies how many parameters are passed to which function, their order, and so on. At some point in the future the interface may change, so I would like to pass the host program's version information to a script. My common practice is to use the MAJOR.MINOR.PATCH format.
The goal is to keep the representation as simple as possible, so that the information can be processed using arithmetic operations, like so:
if program.version < SCRIPT_SUPPORTS_VERSION then
error ("This version is not supported, please upgrade...")
end
Common sense suggests using integers written in base sixteen, where each byte represents one part of the version format, like so:
-- Version 1.3.2
-- Compatible with versions >= 1.2.0
--
local program = {}
program.version = 0x010302 -- one byte per part: 1 * 0x10000 + 3 * 0x100 + 2
local SCRIPT_SUPPORTS_VERSION = 0x010200
As I mentioned before, this is the first thing that came to mind when thinking about the problem, and it seems to work. On the other hand, hexadecimal numbers may look scary to some.
I would like to hear your opinions; please propose alternative solutions.

Haskell: should I use Data.Text.Lazy.Builder to construct my Text values?

I'm working on a large application that constructs a lot of Data.Text values on the fly. I've been building all my Text values using (<>) and Data.Text.concat.
I only recently learned of the existence of the Builder type. The Beginning Haskell book has this to say about it:
Every time two elements are concatenated, a new Text value has to be created, and this comes with some overhead to allocate memory, to copy data, and also to keep track of the value and release it when it's no longer needed... Both the text and bytestring packages provide a Builder data type that can be used to efficiently generate large text values. [pg 240]
However, the book doesn't give any indication of exactly what is meant by "large text values."
So, I'm wondering whether or not I should refactor my code to use Builder. Maybe you can help me make that decision.
Specifically, I have these questions:
1) Are there any guidelines or "best practices" regarding when one should choose Builder over concatenation? Or, how do I know that a given Text value is "large" enough that it merits using Builder?
2) Is using Builder a "no brainer," or would it be worthwhile doing some profiling to confirm its benefits before undertaking a large-scale refactoring?
Thanks!
Data.Text.concat is an O(n+m) operation, where n and m are the lengths of the strings you want to concatenate. This is because a new memory buffer of size n + m must be allocated to store the result of the concatenation.
Builder is specifically optimized for the mappend operation. It's a cheap O(1) operation (function composition, which is also excellently optimized by GHC). With Builder you are essentially building up the instructions for how to produce the final string, but delaying the actual creation until you do the final Builder -> Text conversion.
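For illustration, the pattern looks like this: accumulate cheap Builder appends and pay for the actual Text construction once at the end with toLazyText. The row/render names are just for this sketch.

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import Data.Text.Lazy.Builder (Builder, fromText, toLazyText)

-- Each piece is a cheap O(1) append onto the Builder.
row :: T.Text -> T.Text -> Builder
row k v = fromText k <> "=" <> fromText v <> "\n"

-- One O(n) materialization at the very end.
render :: [(T.Text, T.Text)] -> TL.Text
render = toLazyText . mconcat . map (uncurry row)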
To answer your questions: you should choose Builder if you have profiled your application and discovered that calls to Text.concat dominate the run time. This will obviously depend on your needs and application. There is no general rule for when you should use Builder, but for short Text literals there is probably no need.
Profiling would definitely be worthwhile if using Builder would involve "undertaking a large-scale refactoring". Although it goes without saying that Haskell will naturally make this kind of refactoring much less painful than you might be used to in less developer-friendly languages, so it might not be such a difficult undertaking after all.

Identifying frequent formulas in a codebase

My company maintains a domain-specific language that syntactically resembles the Excel formula language. We're considering adding new builtins to the language. One way to decide what to add is to identify verbose commands that are repeatedly used in our codebase. For example, if we see people always writing the same 100-character command to trim whitespace from the beginning and end of a string, that suggests we should add a trim function.
Seeing a list of frequent substrings in the codebase would be a good start (though sometimes the frequently used commands differ by a few characters because of the different variable names used).
I know there are well-established algorithms for doing this, but first I want to see if I can avoid reinventing the wheel. For example, I know this concept is the basis of many compression algorithms, so is there a compression module that lets me retrieve its dictionary of frequent substrings? Any other ideas would be appreciated.
String matching is just the low-hanging fruit, covering the obvious cases. The harder cases are where you're doing similar things but in a different order. For example, suppose you have:
X+Y
Y+X
Your string-matching approach won't realize that those are effectively the same. If you want to go a bit deeper, I think you need to parse the formulas into an AST and actually compare the ASTs. If you did that, you could see that the trees are actually the same, since the binary operator '+' is commutative.
You could also apply reduction rules to rewrite complex expressions into simpler ones, for example:
(X * A) + ( X * B)
X * ( A + B )
Those are also the same! String matching won't help you there.
1) Parse into an AST
2) Reduce and optimize the functions (e.g., normalize commutative operands, as in the sketch below)
3) Compare the resulting AST to other ASTs
4) If you find a match, replace them with a call to a shared function.
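
A minimal Haskell sketch of the normalization idea (the Ast type and operator set here are made up for illustration): sort the operands of commutative operators so that structurally equivalent formulas compare equal.

import Data.List (sort)

-- A tiny formula AST; deriving Ord gives us a sort order for operands.
data Ast = Num Int | Ref String | Op String [Ast]
  deriving (Eq, Ord, Show)

-- Sort operand lists of commutative operators, recursively.
normalize :: Ast -> Ast
normalize (Op op args)
  | op `elem` ["+", "*"] = Op op (sort (map normalize args))
  | otherwise            = Op op (map normalize args)
normalize ast = ast

-- X+Y and Y+X now normalize to the same tree:
-- normalize (Op "+" [Ref "X", Ref "Y"]) == normalize (Op "+" [Ref "Y", Ref "X"])
same :: Ast -> Ast -> Bool
same a b = normalize a == normalize b

Counting occurrences of normalized trees across the codebase then surfaces the most frequent formulas; distributivity-style reductions like the example above would be additional rewrite rules applied before comparison.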
I would think you could use an existing full-text indexer like Lucene and implement your own Analyzer and Tokenizer specific to your formula language.
You would then be able to run queries and see the most-used formulas, which ones appear next to each other, etc.
Here's a quick article to get you started:
Lucene Analyzer, Tokenizer and TokenFilter
You might want to look into tag-cloud generators. I couldn't find any source code in the minute I spent looking, but here's an online one:
http://tagcloud.oclc.org/tagcloud/TagCloudDemo (which probably won't work for your case, since it uses spaces as delimiters).
