How do I get information about branches in Java using ASM? - java-bytecode-asm

I want to collect some information about branches in ASM such as:
Start line
End line
Branch type (part of an "if", a boolean sub-expression within a larger and/or expression, while, return, etc.).
I'd appreciate any hints on how to approach the problem and on the best way to write reliable tests.
Thanks.

Related

By means of what language design principle is a string instantiated either as a string or as a variable in GNU Octave?

Having an Octave script (in the sense of dynamic languages here) move.m defining function move(direction), it can be invoked from another script (alternatively from the command line) in different ways: move left, move('left') or move(left). While the first two will instantiate direction with the string 'left', the last one will consider left as a variable.
The question is about the formal principle in language definition behind this. I understand that in the first mode, the script is invoked as a command, considering that the rest of the command line is just data, not variables (pretty much as in a Linux prompt); while in the last two it is called as a function, interpreting what follows (between parentheses) as either data or variables. If this is a general design criterion among scripting languages, what is the principle behind it?
To answer your question, yes, this is by design, and it's syntactic sugar offered by MATLAB (and hence Octave) for running certain functions that expect only string arguments. Here is the relevant section in the MATLAB manual: https://uk.mathworks.com/help/matlab/matlab_prog/command-vs-function-syntax.html
I should clarify some misconceptions though. First, it's not "data" vs "variables". Any argument supplied in command syntax is simply interpreted as a string. So these two are equivalent:
fprintf("1")
fprintf 1
I.e., in fprintf 1, the 1 is not numeric data. It's a string.
Secondly, not all m files are "scripts". You calling your m file a script caused me some confusion. Your particular file contains a function definition and nothing else, so it's a function, 100%.
The reason this is important here, is that all functions can be called either via functional syntax or command syntax (as long as it makes sense in terms of the expected arguments being strings), whereas scripts take no arguments, so there is no functional / command syntax at play, and if you were passing 'arguments' to a script you're doing something wrong.
I understand that in the first mode, the script is invoked as a command [...]
As far as Octave goes, you are better off forgetting about that distinction. I'm not sure if a "command" ever existed, but it certainly does not exist now. The command syntax is just syntactic sugar in Octave. It makes interactive plot adjustment simpler, since the arguments of plotting functions are mainly strings.

Groovy function call omitting the parentheses

According to the gradle documentation/section 13.5.2 we can omit parentheses in a method call:
Parentheses are optional for method calls.
But it seems it doesn't work when we try to apply the java plugin. If a script contains the following line:
apply [plugin: 'java']
We'll get the error:
Maybe something should be set in parentheses or a comma is missing?
# line 1, column 8.
apply [plugin: 'java']
^
But if we put this Map literal in parentheses, it works fine.
apply([plugin: 'java'])
So we can't omit the parentheses when the argument is a Map, can we?
As the specification says, parentheses can be omitted when there is no ambiguity. I suspect the ambiguity in this case arises because the statement without parentheses looks a lot like array index syntax and the parser has trouble working out whether you are calling a method named 'apply' or trying to do something with an array named 'apply'.
Personally, this is why I tend to always use parentheses - if the parser can't work it out I'm sure another programmer reading the code won't either.
While it's true that the usual syntax of an array or map in Groovy uses brackets (for example, for empty ones you typically write [] or [:], respectively), the same bracket symbols are interpreted as the index operator if they follow an identifier. Groovy then tries to interpret apply as a property of the Gradle project, even if it does not exist. Groovy, as a dynamic language, allows us to define properties dynamically, so there is not always a way to tell at compile time if a property is going to exist. While it is questionable whether that is a good design or not, Gradle's ExtraPropertiesExtension very much makes use of this dynamic nature of Groovy for convenience. If you prefer stricter typing, I suggest you try the Kotlin DSL, which has far fewer problems of this kind (I do not think they are completely gone, as we can still explicitly declare variables as dynamic type).
On the other hand, one purpose of a DSL is to be concise and remove useless ceremony. That is one thing Groovy excels at, because if the only parameter you need to pass to a method or closure is just an array or a map, you can just omit all kinds of brackets: apply plugin: 'java'. (Anything more complex than that is questionable for a DSL.)
And that is why I think that adding parentheses everywhere (e.g., apply([plugin: 'java'])) is not the right approach for build scripts, whose code is supposed to look declarative and DSL-like, at least not if you use Groovy, which was designed exactly for that purpose. Never using a language feature, even where it might have an advantage, is what led to books like Douglas Crockford's JavaScript: The Good Parts, where the author recommends we always use semicolons; that is one of the most argued-over practices on the Internet, and the JavaScript Standard Style, for example, disagrees with it. I personally think that it is more important to know the language you work with (e.g., to know when a semicolon is needed) in order to produce the best quality code. The semicolon example may admittedly not be a strong one, since it does little beyond reducing ceremony. But just because a language feature can be misused, it does not mean we should ban its use. A kitchen knife is not bad just because we can hurt people with it. And the same can be said about Kotlin's most criticized features.
Although we now have the plugins DSL, which is the preferred way of loading plugins (even if plugins are put in buildSrc, which will just need some metadata definition), this question may still be relevant for the following reasons:
programmatically applying plugins, e.g. subprojects { apply plugin: 'java' },
including custom build scripts using a similar syntax: apply from: 'myscript.gradle',
and maybe also if for some reason you can only refer to a plugin by its class name.
Therefore the question is still valid.

Algorithm to detect if a file (or string) has been patched

This question is about string algorithms, not version control or management tools.
I learnt the diff algorithm and tried to implement one. That is, given string A and string B, the diff calculates a sequence of actions that converts A into B.
I wonder whether it is possible, given a string S and a sequence of actions that a diff algorithm can produce, for an algorithm to tell whether S is (a) the original string A, (b) the patched string B, or (c) an unrelated string. And what if S can only be one of A and B?
Actually, what I'm really doing is researching a method that can tell whether a patch has been applied (at source code level or binary code level). I searched Google for some time, but didn't find anything useful.
It's pretty complicated, but it can be done, on some level.
Essentially, you parse the source code into tokens and then build the abstract syntax tree. Once that is done, you must build a diff tool that can do semantic differential analysis between abstract syntax trees. SemanticMerge, for example, does that.
Once that is done, you have the semantic difference between the two source codes, and then you need to define what exactly constitutes a patch.
Some of the rules can be:
1) Variable content was changed
2) An if check was added
The bottom line is that differentiating between a patch and new functionality is not an easy task. The most reliable way is probably to check the binary file version numbers and understand the versioning scheme.
E.g., only the minor version is updated if patches are applied.
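For the purely string-level version of the question, here is a minimal sketch in Python. It assumes the edit script keeps the unchanged runs as well as the changed ones (as a diff with full context would), so both A and B can be rebuilt from the script alone. The helper names edit_script and classify are made up for this illustration, and difflib merely stands in for whatever diff implementation you wrote.

```python
from difflib import SequenceMatcher

def edit_script(a, b):
    """Edit script turning a into b as (tag, old_text, new_text) triples.

    Unchanged runs are kept too (tag 'equal'), which is the assumption
    that makes the classification below possible.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]

def classify(s, script):
    """Return 'original', 'patched' or 'unrelated' for the string s."""
    original = "".join(old for _tag, old, _new in script)  # rebuilds A
    patched = "".join(new for _tag, _old, new in script)   # rebuilds B
    if s == patched:
        return "patched"
    if s == original:
        return "original"
    return "unrelated"

script = edit_script("int x = 1;", "int x = 2;")
print(classify("int x = 1;", script))
print(classify("int x = 2;", script))
print(classify("float y;", script))
```

If the script stores only positions and changed text, without any unchanged context, the answer is not always decidable: a script that only inserts text applies equally well to A and to B.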

What programming languages will let me manipulate the sequence of instructions in a method?

I have an upcoming project in which a core requirement will be to mutate the way a method works at runtime. Note that I'm not talking about a higher level OO concept like "shadow one method with another", although the practical effect would be similar.
The key properties I'm after are:
I must be able to modify the method in such a way that I can add new expressions, remove existing expressions, or modify any of the expressions that take place in it.
After modifying the method, subsequent calls to that method would invoke the new sequence of operations. (Or, if the language binds methods rather than evaluating every single time, provide me a way to unbind/rebind the new method.)
Ideally, I would like to manipulate the atomic units of the language (e.g., "invoke method foo on object bar") and not the assembly directly (e.g. "pop these three parameters onto the stack"). In other words, I'd like to be able to have high confidence that the operations I construct are semantically meaningful in the language. But I'll take what I can get.
If you're not sure if a candidate language meets these criteria, here's a simple litmus test:
Can you write another method called clean which:
accepts a method m as input
returns another method m2 that performs the same operations as m
such that m2 is identical to m, but doesn't contain any calls to the print-to-standard-out method in your language (puts, System.Console.WriteLine, println, etc.)?
I'd like to do some preliminary research now and figure out what the strongest candidates are. Having a large, active community is as important to me as the practicality of implementing what I want to do. I am aware that there may be some unforged territory here, since manipulating bytecode directly is not typically an operation that needs to be exposed.
What are the choices available to me? If possible, can you provide a toy example in one or more of the languages that you recommend, or point me to a recent example?
Update: The reason I'm after this is that I'd like to write a program which is capable of modifying itself at runtime in response to new information. This modification goes beyond mere parameters or configurable data, but full-fledged, evolved changes in behavior. (No, I'm not writing a virus. ;) )
Well, you could always use .NET and the Expression libraries to build up expressions. That I think is really your best bet as you can build up representations of commands in memory and there is good library support for manipulating, traversing, etc.
Well, those languages with really strong macro support (in particular Lisps) could qualify.
But are you sure you actually need to go this deeply? I don't know what you're trying to do, but I suppose you could emulate it without actually getting too deeply into metaprogramming. Say, instead of using a method and manipulating it, use a collection of functions (with some way of sharing state, e.g. an object holding state passed to each).
I would say Groovy can do this.
For example
class Foo {
    void bar() {
        println "foobar"
    }
}
Foo.metaClass.bar = {->
    println "barfoo"
}
Or for a specific instance of Foo, without affecting other instances:
fooInstance.metaClass.bar = {->
println "instance barfoo"
}
Using this approach I can modify, remove or add expressions in the method, and subsequent calls will use the new method. You can do quite a lot with the Groovy metaClass.
In Java, many professional frameworks do so using the open source ASM framework.
Here is a list of all famous Java apps and libs including ASM.
A few years ago BCEL was also very much used.
There are languages/environments that allow real runtime modification, for example Common Lisp, Smalltalk and Forth. Use one of them if you really know what you're doing. Otherwise you can simply employ an interpreter pattern for the evolving part of your code; that is possible (and trivial) in any OO or functional language.
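As a rough illustration of the interpreter-pattern idea, here is a small Python sketch in which a "method" is just a list of step functions, so the litmus-test clean function becomes a list filter. All names here (make_method, emit, increment, the is_print marker) are invented for the example and are not taken from any library.

```python
def make_method(steps):
    """Build a callable that runs the given steps in order on a state dict."""
    def method(state):
        for step in steps:
            step(state)
        return state
    method.steps = list(steps)  # keep the steps inspectable and editable
    return method

def emit(message, sink=print):
    """A step that writes a formatted message; tagged so clean() can find it."""
    def step(state):
        sink(message.format(**state))
    step.is_print = True
    return step

def increment(name):
    """A step that adds 1 to state[name], starting from 0."""
    def step(state):
        state[name] = state.get(name, 0) + 1
    return step

def clean(m):
    """Return a new method with the same steps as m minus the print steps."""
    return make_method([s for s in m.steps if not getattr(s, "is_print", False)])

m = make_method([increment("n"), emit("n is {n}"), increment("n")])
m2 = clean(m)
m({})   # writes one line through the sink, returns the final state
m2({})  # same state changes, no output
```

The trade-off is that only code written in this step-list form can be rewritten this way; ordinary compiled methods still need something like the bytecode or metaclass approaches mentioned above.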

Identifying frequent formulas in a codebase

My company maintains a domain-specific language that syntactically resembles the Excel formula language. We're considering adding new builtins to the language. One way to do this is to identify verbose commands that are repeatedly used in our codebase. For example, if we see people always write the same 100-character command to trim whitespace from the beginning and end of a string, that suggests we should add a trim function.
Seeing a list of frequent substrings in the codebase would be a good start (though sometimes the frequently used commands differ by a few characters because of different variable names used).
I know there are well-established algorithms for doing this, but first I want to see if I can avoid reinventing the wheel. For example, I know this concept is the basis of many compression algorithms, so is there a compression module that lets me retrieve the dictionary of frequent substrings? Any other ideas would be appreciated.
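One low-tech way to get such a list, sketched in Python: tokenize each formula, replace variable names with a placeholder so that commands differing only in variable names collapse together, and count token n-grams. The tokenizer is an assumption about what the formula language looks like (identifiers, numbers, double-quoted strings, single-character operators), and the rule "an identifier followed by ( is a function name" is a guess; both would need adjusting for the real DSL.

```python
import re
from collections import Counter

# Identifiers, numbers, double-quoted strings, or any single other character.
TOKEN = re.compile(r'[A-Za-z_]\w*|\d+(?:\.\d+)?|"[^"]*"|\S')

def tokens(formula):
    """Tokenize a formula, replacing variable names with the marker VAR."""
    found = TOKEN.findall(formula)
    out = []
    for i, tok in enumerate(found):
        is_identifier = re.match(r"[A-Za-z_]", tok) is not None
        is_call = i + 1 < len(found) and found[i + 1] == "("
        out.append("VAR" if is_identifier and not is_call else tok)
    return out

def frequent_ngrams(formulas, n, top=5):
    """Return the `top` most common runs of n consecutive tokens."""
    counts = Counter()
    for formula in formulas:
        toks = tokens(formula)
        for i in range(len(toks) - n + 1):
            counts[" ".join(toks[i:i + n])] += 1
    return counts.most_common(top)

codebase = ["MID(a,1,3)+LEN(a)", "MID(b, 1, 3)", "LEN(c)"]
print(frequent_ngrams(codebase, n=8, top=1))
```

Running this over several values of n and looking at the long n-grams with high counts gives a first list of candidates for new builtins.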
The string matching is just the low hanging fruit, the obvious cases. The harder cases are where you're doing similar things but in different order. For example suppose you have:
X+Y
Y+X
Your string matching approach won't realize that those are effectively the same. If you want to go a bit deeper, I think you need to parse the formulas into an AST and actually compare the ASTs. If you did that, you could see that the trees are actually the same, since the binary operator '+' is commutative.
You could also apply reduction rules so you could evaluate complex functions into simpler ones, for example:
(X * A) + ( X * B)
X * ( A + B )
Those are also the same! String matching won't help you there.
Parse into AST
Reduce and Optimize the functions
Compare the resulting AST to other ASTs
If you find a match then replace them with a call to a shared function.
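The parse-and-compare steps can be sketched roughly as follows. This uses Python's own ast module and Python expression syntax as a stand-in for the formula language, which is an assumption; a real implementation would need a parser for the DSL. The canonical and formula_key helpers are names made up for the illustration.

```python
import ast

COMMUTATIVE = (ast.Add, ast.Mult)

def canonical(node):
    """Return a string form of an expression that ignores operand order
    for commutative binary operators."""
    if isinstance(node, ast.BinOp):
        left, right = canonical(node.left), canonical(node.right)
        if isinstance(node.op, COMMUTATIVE) and right < left:
            left, right = right, left  # put operands in a fixed order
        return f"({type(node.op).__name__} {left} {right})"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    return ast.dump(node)  # fallback for node types not handled above

def formula_key(formula):
    """Parse one expression and return its canonical form."""
    return canonical(ast.parse(formula, mode="eval").body)

print(formula_key("X + Y") == formula_key("Y + X"))
print(formula_key("X - Y") == formula_key("Y - X"))
```

This only covers the operand-order case. It does not flatten chains such as X+Y+Z, and it does not apply reduction rules like the distributive one, so (X * A) + (X * B) and X * (A + B) still get different keys; those would be extra rewriting passes before the comparison.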
I would think you could use an existing full-text indexer like Lucene, and implement your own Analyzer and Tokenizer that is specific to your formula language.
You then would be able to run queries, and be able to see the most used formulas, which ones appear next to each other, etc.
Here's a quick article to get you started:
Lucene Analyzer, Tokenizer and TokenFilter
You might want to look into tag-cloud generators. I couldn't find any source in the minute that I spent looking, but here's an online one:
http://tagcloud.oclc.org/tagcloud/TagCloudDemo which probably won't work since it uses spaces as delimiters.
