I want to create a tool that can analyze C and C++ code and detect unwanted behaviors, based on a config file. I thought about using ANTLR for this task, as I already created a simple compiler with it from scratch a few years ago (variables, condition, loops, and functions).
I grabbed C.g4 and CPP14.g4 from ANTLR grammars repository. However, I came to notice that they don't support the pre-processing parsing, as that's a different step in the compilation.
I tried to find a grammar that does the pre-processing part (updated to ANTLR4) with no luck. Moreover, I also understood that if I'll go with two-steps parsing I won't be able to retain the original locations of each character, as I'd already modified the input stream.
I wonder if there's a good ANTLR grammar or program (preferably Python, but can deal with other languages as well) that can help me to pre-process the C code. I also thought about using gcc -E, but then I won't be able to inspect the macro definitions (for example, I want to warn if a user used a #pragma GCC (some students at my university, for which I write this program to, used this to bypass some of the course coding style restrictions). Moreover, gcc -E will include library header contents, which I don't want to process.
My question is, therefore, if you can recommend me a grammar/program that I can use to pre-process C and C++ code. Alternatively, if you can guide me on how to create a grammar myself that'd be perfect. I was able to write the basic #define, #pragma etc. processings, but I'm unable to deal with conditions and with macro functions, as I'm unsure how to deal with them.
Thanks in advance!
This question is almost off-topic as it asks for an external resource. However, it also bears a part that deserves some attention.
The term "preprocessor" already indicates what the handling of macros etc. is about. The parser never sees the disabled parts of the input, which also means it can be anything, which might not be part of the actual language to parse. Hence a good approach for parsing C-like languages is to send the input through a preprocessor (which can be a specialized input stream) to strip out all preprocessing constructs, to resolve macros and remove disabled text. The parse position is not a problem, because you can push the current token position before you open a new input stream and restore that when you are done with it. Store reported errors together with your input stream stack. This way you keep the correct token positions. I have used exactly this approach in my Windows resource file parser.
Related
I understand VC++ will let you emit C++ source files which are the result of preprocessor operations e.g. macros are expanded and includes "copy-pasted in line".
Is it possible to restrict this simply to embed included files, which are files in my own project rather than standard libraries?
From the outside there's no way you can tell from which syntax form (<> or "") the content is being preprocessed. Unless a kind of API was exposed by the preprocessor, which is not the case here.
A not so elegant (and not strictly correct) solution I could propose would be to index a preprocessed version of all Standard headers (there are not that many) and after preprocessing the source of interest you could run a string matching script to detect the known files and remove the corresponding content from the final output.
Notice this is subjected to flaws because the #include system is purely textual and influenced by whatever macros are (un)defined at the time of inclusion and order matters. But depending on the complexity of the code you're working on this might give reasonable results.
By the way, may I ask what is the ultimate goal of your task?
Edit: Or actually... Maybe it's possible that you filter the sources before-hand to remove the undesired #includes and then submit it to preprocessing?
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
So is a decompiler really a thing that gives gives the source of a compiled/interpreted piece of code? Because to me that sounds impossible. How would you get the names of the functions, variables, classes, etc if it is compiled. Or am I misinterpreting the definition? How does it work? And what is the general principal behind making one?
You're right about your definition of a decompiler: it takes a compiled application and produces source code to match. However, it does not in most cases know the name and structure of variables/functions/classes--it just guesses. It analyzes the flow of the program and tries to find a way to represent that flow through a certain programming language, typically C. However, because the programming language of choice (C, in this example) is often at a higher level than the state of the underlying program (a binary executable), some parts of the program might be impossible to represent accurately; in this case, the decompiler would fail and you would need to use a disassembler. This is why many people like to obfuscate their code: it makes it much harder for decompilers to open it.
Building a decompiler is not a simple task. Basically, you have to take the application that you are decompiling (be it an executable or some other form of compiled application) and parse it into some kind of tree you can work with in memory. You would then analyze the flow of the program and try to find patters that might suggest that an if statement/variable/function/etc was used in a certain location in the code. It's all really just a guessing game: you'd have to know the patterns that the compiler makes in compiled code, then search for those patterns and replace them with equivalent human-readable source code.
This is all much simpler for higher-level programs like Java or .NET, where you don't have to deal with assembly instructions, and things like variables are mostly taken care of for you. There, you don't have to guess as much as just directly translate. You might not have exact variable/method names, but you can at least deduce the program structure fairly easily.
Disclaimer: I have never written a decompiler and thus don't know every detail of what I'm talking about. If you are really interested in writing a decompiler, you should get a book on the topic.
A decompiler basically takes the machine code and reverts it back to the language it was formatted in. If I'm not mistaken, I think the decompiler needs to know what language it was compiled in, otherwise it won't work.
The basic purpose of the decompiler is to get back to your source code; for example, one time my Java file got corrupted and the only thing I could so to bring it back was by using a decompiler (since the class file wasn't corrupted).
It works by deducing a "reasonable" (based on some heuristics) representation of what's in the object code. The degree of resemblance between what it produces and what was originally there tends to depend heavily upon how much information is contained in binary it starts from. If you start with basically a "pure" binary, it's generally stuck with just making up "reasonable" names for the variables, such as using things like i, j and k for loop indexes, and longer names for most others.
On the other hand, a language that supports introspection needs to embed a great deal more information about variable names, types, etc., into the executable. In a case like this, decompiling can produce something much closer to the original, such as typically retaining the original names for functions, variables, etc. In such a case, the decompiler can often produce something quite similar to the original -- possibly losing little more than formatting and comments.
That depends on what language you are decompiling. If you are decompiling something like C or C++, then the only information provided to you is function names and arguments (In DLLs). If you are dealing with java, then the compiler usually inserts line numbers, variable names, field and method names, and so on. If there are no variable names, then you would get names like localInt1, localInt2, localException1. Or whatever the compiler is. And it can tell the spacing between lines, because of the line numbers.
I need to document the software I'm currently working on. The software consists of several programming languages and scripts which got me thinking. If a new developers comes along and needs to fix something, they might know Java but maybe not bash scripting. It would be nice if there was a program which would help to understand what
for f in "$#" ; do
means. I was thinking of something that creates a static HTML page with the code plus syntax highlighting and if you hover over something (like the "for"), it would display a pop-up with an explanation:
for starts a loop which iterates over all values that follow in. In the loop, you can access each value via the variable $f. The loop body is between do and done
Does something like that already exist?
[EDIT] This is just an example. You'll get another help for f, in, "$#", ; and do, i.e. each and every element of the line should be explained. Unknown elements (like command names) should link to Google. So you can understand what it does even if you're missing some detail.
[EDIT2] I'm aware that you can't write a program which understands what another program does. What I'm looking for is a simple tool which will do "extended syntax highlighting" in the sense that it will color an expression and give a short explanation what it means (plus maybe a link to some in-depth reference).
This is meant for someone who knows how to program but maybe hasn't seen some obscure construct before. Say
echo "Error" 1>&2
Every bash programmer knows what this means but a Java developer might be puzzled by the 1>&2 despite the fact that they can guess that echo == System.out.println. A simple "Redirects stdout to stderr" will clear things up and give that instant "AHA!" which allows them to stay in their current train of thought.
A tool like this could be built using ANTLR, i.e. parse the code into an abstract syntax tree using an ANTLR grammar for that language, and write an HTML generator which produced the annotated code.
It sounds like a useful tool to have for language learning, or exploring source code of projects you're not maintaining -- but is it appropriate for documentation?
Why is it important to help the programmers of other languages understand the code at this level of implementation detail? Anyone maintaining the implementation at this level will obviously have to know the language and will probably have an IDE to do most of this.
That said, I'd definitely consider a tool like this as a learning aid.
IMO it would be simpler and more effective to just collect links to good language-specific references and tutorials on a Wiki page.
For all mainstream languages, such sources exist and are maintained regularly. If you try to create your own reference, you need to maintain it too. Fair enough, bash syntax is not going to change very often, but other languages do develop faster, so it is going to be a burden.
If you think about it, it's not that useful to have a tool that explains the syntax. Developers could just google for keywords instead of browsing a website in a similar fashion to http://www.codeweblog.com/source/ .
I believe that good comments will be by far more useful, plus there are tools to extract the documentation by using the comments (for example, HappyDoc does that for Python).
It is a very tricky thing. First of all by definition it can be proven that program that will "understand" any program down't exist. However, you can still use existing documentation. Maybe using tools like Doxygen can help you. You would need to document your code through comments and the documentation will be generated from them.
A language cannot be explained only through its syntax. The runtime environment plays a great part, together with the underlying philosophy of the language and libraies.
Moreover, syntax is not that complex for most common languages (given that code has been written with maintainability in mind).
Going on with bash example, you cannot deeply understand bash if you know nothing about processes & job control, environment variables, a big list of unix commands (tr, sort, cut, paste, sed, awk, find, ...) and many other features that don't appear in syntax.
If the tool produced
for starts a loop which iterates over
all values that follow in. In the
loop, you can access each value via
the variable $f. The loop body is
between do and done
it would be pretty worthless. This is exactly the kind of comment that trainee (human) programmers are told nver to write.
I know that its possible to read from a .txt file and then convert various parts of that into string, char, and int values, but is it possible to take a string and use it as real code in the program?
Code:
string codeblock1="cout<<This is a test;";
string codeblock2="int array[5]={0,6,6,3,5};}";
int i;
cin>>i;
if(i)
{
execute(codeblock1);
}
else
{
execute(codeblock2);
}
Where execute is a function that converts from text to actual code (I don't know if there actually is a function called execute, I'm using it for the purpose of my example).
In C++ there's no simple way to do this. This feature is available in higher-level languages like Python, Lisp, Ruby and Perl (usually with some variation of an eval function). However, even in these languages this practice is frowned upon, because it can result in very unreadable code.
It's important you ask yourself (and perhaps tell us) why you want to do it?
Or do you only want to know if it's possible? If so, it is, though in a hairy way. You can write a C++ source file (generate whatever you want into it, as long as it's valid C++), then compile it and link to your code. All of this can be done automatically, of course, as long as a compiler is available to you in runtime (and you just execute it with system). I know someone who did this for some heavy optimization once. It's not pretty, but can be made to work.
You can create a function and parse whatever strings you like and create a data structure from it. This is known as a parse tree. Subsequently you can examine your parse tree and generate the necessary dynamic structures to perform the logic therin. The parse tree is subsequently converted into a runtime representation that is executed.
All compilers do exactly this. They take your code and they produce machine code based on this. In your particular case you want a language to write code for itself. Normally this is done in the context of a code generator and it is part of a larger build process. If you write a program to parse your language (consider flex and bison for this operation) that generates code you can achieve the results you desire.
Many scripting languages offer this sort of feature, going all the way back to eval in LISP - but C and C++ don't expose the compiler at runtime.
There's nothing in the spec that stops you from creating and executing some arbitrary machine language, like so:
char code[] = { 0x2f, 0x3c, 0x17, 0x43 }; // some machine code of some sort
typedef void (FuncType*)(); // define a function pointer type
FuncType func = (FuncType)code; // take the address of the code
func(); // and jump to it!
but most environments will crash if you try this, for security reasons. (Many viruses work by convincing ordinary programs to do something like this.)
In a normal environment, one thing you could do is create a complete program as text, then invoke the compiler to compile it and invoke the resulting executable.
If you want to run code in your own memory space, you could invoke the compiler to build you a DLL (or .so, depending on your platform) and then link in the DLL and jump into it.
First, I wanted to say, that I never implemented something like that myself and I may be way off, however, did you try CodeDomProvider class in System.CodeDom.Compiler namespace? I have a feeling the classes in System.CodeDom can provide you with the functionality you are looking for.
Of course, it will all be .NET code, not any other platform
Go here for sample
Yes, you just have to build a compiler (and possibly a linker) and you're there.
Several languages such as Python can be embedded into C/C++ so that may be an option.
It's kind of sort of possible, but not with just straight C/C++. You'll need some layer underneath such as LLVM.
Check out c-repl and ccons
One way that you could do this is with Boost Python. You wouldn't be using C++ at that point, but it's a good way of allowing the user to use a scripting language to interact with the existing program. I know it's not exactly what you want, but perhaps it might help.
Sounds like you're trying to create "C++Script", which doesn't exist as far as I know. C++ is a compiled language. This means it always must be compiled to native bytecode before being executed. You could wrap the code as a function, run it through a compiler, then execute the resulting DLL dynamically, but you're not going to get access to anything a compiled DLL wouldn't normally get.
You'd be better off trying to do this in Java, JavaScript, VBScript, or .NET, which are at one stage or another interpreted languages. Most of these languages either have an eval or execute function for just that, or can just be included as text.
Of course executing blocks of code isn't the safest idea - it will leave you vulnerable to all kinds of data execution attacks.
My recommendation would be to create a scripting language that serves the purposes of your application. This would give the user a limited set of instructions for security reasons, and allow you to interact with the existing program much more dynamically than a compiled external block.
Not easily, because C++ is a compiled language. Several people have pointed round-about ways to make it work - either execute the compiler, or incorporate a compiler or interpreter into your program. If you want to go the interpreter route, you can save yourself a lot of work by using an existing open source project, such as Lua
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
What would be the best way to detect what programming language is used in a snippet of code?
I think that the method used in spam filters would work very well. You split the snippet into words. Then you compare the occurences of these words with known snippets, and compute the probability that this snippet is written in language X for every language you're interested in.
http://en.wikipedia.org/wiki/Bayesian_spam_filtering
If you have the basic mechanism then it's very easy to add new languages: just train the detector with a few snippets in the new language (you could feed it an open source project). This way it learns that "System" is likely to appear in C# snippets and "puts" in Ruby snippets.
I've actually used this method to add language detection to code snippets for forum software. It worked 100% of the time, except in ambiguous cases:
print "Hello"
Let me find the code.
I couldn't find the code so I made a new one. It's a bit simplistic but it works for my tests. Currently if you feed it much more Python code than Ruby code it's likely to say that this code:
def foo
puts "hi"
end
is Python code (although it really is Ruby). This is because Python has a def keyword too. So if it has seen 1000x def in Python and 100x def in Ruby then it may still say Python even though puts and end is Ruby-specific. You could fix this by keeping track of the words seen per language and dividing by that somewhere (or by feeding it equal amounts of code in each language).
class Classifier
def initialize
#data = {}
#totals = Hash.new(1)
end
def words(code)
code.split(/[^a-z]/).reject{|w| w.empty?}
end
def train(code,lang)
#totals[lang] += 1
#data[lang] ||= Hash.new(1)
words(code).each {|w| #data[lang][w] += 1 }
end
def classify(code)
ws = words(code)
#data.keys.max_by do |lang|
# We really want to multiply here but I use logs
# to avoid floating point underflow
# (adding logs is equivalent to multiplication)
Math.log(#totals[lang]) +
ws.map{|w| Math.log(#data[lang][w])}.reduce(:+)
end
end
end
# Example usage
c = Classifier.new
# Train from files
c.train(open("code.rb").read, :ruby)
c.train(open("code.py").read, :python)
c.train(open("code.cs").read, :csharp)
# Test it on another file
c.classify(open("code2.py").read) # => :python (hopefully)
Language detection solved by others:
Ohloh's approach: https://github.com/blackducksw/ohcount/
Github's approach: https://github.com/github/linguist
Guesslang is a possible solution:
http://guesslang.readthedocs.io/en/latest/index.html
There's also SourceClassifier:
https://github.com/chrislo/sourceclassifier/tree/master
I became interested in this problem after finding some code in a blog article which I couldn't identify. Adding this answer since this question was the first search hit for "identify programming language".
An alternative is to use highlight.js, which performs syntax highlighting but uses the success-rate of the highlighting process to identify the language. In principle, any syntax highlighter codebase could be used in the same way, but the nice thing about highlight.js is that language detection is considered a feature and is used for testing purposes.
UPDATE: I tried this and it didn't work that well. Compressed JavaScript completely confused it, i.e. the tokenizer is whitespace sensitive. Generally, just counting highlight hits does not seem very reliable. A stronger parser, or perhaps unmatched section counts, might work better.
First, I would try to find the specific keyworks of a language e.g.
"package, class, implements "=> JAVA
"<?php " => PHP
"include main fopen strcmp stdout "=>C
"cout"=> C++
etc...
It's very hard and sometimes impossible. Which language is this short snippet from?
int i = 5;
int k = 0;
for (int j = 100 ; j > i ; i++) {
j = j + 1000 / i;
k = k + i * j;
}
(Hint: It could be any one out of several.)
You can try to analyze various languages and try to decide using frequency analysis of keywords. If certain sets of keywords occur with certain frequencies in a text it's likely that the language is Java etc. But I don't think you will get anything that is completely fool proof, as you could name for example a variable in C the same name as a keyword in Java, and the frequency analysis will be fooled.
If you take it up a notch in complexity you could look for structures, if a certain keyword always comes after another one, that will get you more clues. But it will also be much harder to design and implement.
It would depend on what type of snippet you have, but I would run it through a series of tokenizers and see which language's BNF it came up as valid against.
I needed this so i created my own.
https://github.com/bertyhell/CodeClassifier
It's very easily extendable by adding a training file in the correct folder.
Written in c#. But i imagine the code is easily converted to any other language.
Best solution I have come across is using the linguist gem in a Ruby on Rails app. It's kind of a specific way to do it, but it works. This was mentioned above by #nisc but I will tell you my exact steps for using it. (Some of the following command line commands are specific to ubuntu but should be easily translated to other OS's)
If you have any rails app that you don't mind temporarily messing with, create a new file in it to insert your code snippet in question. (If you don't have rails installed there's a good guide here although for ubuntu I recommend this. Then run rails new <name-your-app-dir> and cd into that directory. Everything you need to run a rails app is already there).
After you have a rails app to use this with, add gem 'github-linguist' to your Gemfile (literally just called Gemfile in your app directory, no ext).
Then install ruby-dev (sudo apt-get install ruby-dev)
Then install cmake (sudo apt-get install cmake)
Now you can run gem install github-linguist (if you get an error that says icu required, do sudo apt-get install libicu-dev and try again)
(You may need to do a sudo apt-get update or sudo apt-get install make or sudo apt-get install build-essential if the above did not work)
Now everything is set up. You can now use this any time you want to check code snippets. In a text editor, open the file you've made to insert your code snippet (let's just say it's app/test.tpl but if know the extension of your snippet, use that instead of .tpl. If you don't know the extension, don't use one). Now paste your code snippet in this file. Go to command line and run bundle install (must be in your application's directory). Then run linguist app/test.tpl (more generally linguist <path-to-code-snippet-file>). It will tell you the type, mime type, and language. For multiple files (or for general use with a ruby/rails app) you can run bundle exec linguist --breakdown in your application's directory.
It seems like a lot of extra work, especially if you don't already have rails, but you don't actually need to know ANYTHING about rails if you follow these steps and I just really haven't found a better way to detect the language of a file/code snippet.
This site seems to be pretty good at identifying languages, if you want a quick way to paste a snippet into a web form, rather than doing it programmatically: http://dpaste.com/
Nice puzzle.
I think it is imposible to detect all languages. But you could trigger on key tokens. (certain reserved words and often used character combinations).
Ben there are a lot of languages with similar syntax. So it depends on the size of the snippet.
Prettify is a Javascript package that does an okay job of detecting programming languages:
http://code.google.com/p/google-code-prettify/
It is mainly a syntax highlighter, but there is probably a way to extract the detection part for the purposes of detecting the language from a snippet.
I wouldn't think there would be an easy way of accomplishing this. I would probably generate lists of symbols/common keywords unique to certain languages/classes of languages (e.g. curly brackets for C-style language, the Dim and Sub keywords for BASIC languages, the def keyword for Python, the let keyword for functional languages). You then might be able to use basic syntax features to narrow it down even further.
I think the biggest distinction between languages is its structure. So my idea would be to look at certain common elements across all languages and see how they differ. For example, you could use regexes to pick out things such as:
function definitions
variable declarations
class declarations
comments
for loops
while loops
print statements
And maybe a few other things that most languages should have. Then use a point system. Award at most 1 point for each element if the regex is found. Obviously, some languages will use the exact same syntax (for loops are often written like for(int i=0; i<x; ++i) so multiple languages could each score a point for the same thing, but at least you're reducing the likelihood of it being an entirely different language). Some of them might scores 0s across the board (the snippet doesnt contain a function at all, for example) but thats perfectly fine.
Combine this with Jules' solution, and it should work pretty well. Maybe also look for frequencies of keywords for an extra point.
Interesting. I have a similar task to recognize text in different formats. YAML, JSON, XML, or Java properties? Even with syntax errors, for example, I should tell apart JSON from XML with confidence.
I figure how we model the problem is critical. As Mark said, single-word tokenization is necessary but likely not enough. We will need bigrams, or even trigrams. But I think we can go further from there knowing that we are looking at programming languages. I notice that almost any programming language has two unique types of tokens -- symbols and keywords. Symbols are relatively easy (some symbols might be literals not part of the language) to recognize. Then bigrams or trigrams of symbols will pick up unique syntax structures around symbols. Keywords is another easy target if the training set is big and diverse enough. A useful feature could be bigrams around possible keywords. Another interesting type of token is whitespace. Actually if we tokenize in the usual way by white space, we will loose this information. I'd say, for analyzing programming languages, we keep the whitespace tokens as this may carry useful information about the syntax structure.
Finally if I choose a classifier like random forest, I will crawl github and gather all the public source code. Most of the source code file can be labeled by file suffix. For each file, I will randomly split it at empty lines into snippets of various sizes. I will then extract the features and train the classifier using the labeled snippets. After training is done, the classifier can be tested for precision and recall.
I believe that there is no single solution that could possibly identify what language a snippet is in, just based upon that single snippet. Take the keyword print. It could appear in any number of languages, each of which are for different purposes, and have different syntax.
I do have some advice. I'm currently writing a small piece of code for my website that can be used to identify programming languages. Like most of the other posts, there could be a huge range of programming languages that you simply haven't heard, you can't account for them all.
What I have done is that each language can be identified by a selection of keywords. For example, Python could be identified in a number of ways. It's probably easier if you pick 'traits' that are also certainly unique to the language. For Python, I choose the trait of using colons to start a set of statements, which I believe is a fairly unique trait (correct me if I'm wrong).
If, in my example, you can't find a colon to start a statement set, then move onto another possible trait, let's say using the def keyword to define a function. Now this can causes some problems, because Ruby also uses the keyword def to define a function. The key to telling the two (Python and Ruby) apart is to use various levels of filtering to get the best match. Ruby use the keyword end to finish a function, whereas Python doesn't have anything to finish a function, just a de-indent but you don't want to go there. But again, end could also be Lua, yet another programming language to add to the mix.
You can see that programming languages simply overlay too much. One keyword that could be a keyword in one language could happen to be a keyword in another language. Using a combination of keywords that often go together, like Java's public static void main(String[] args) helps to eliminate those problems.
Like I've already said, your best chance is looking for relatively unique keywords or sets of keywords to separate one from the other. And, if you get it wrong, at least you had a go.
Set up the random scrambler like
matrix S = matrix(GF(2),k,[random()<0.5for _ in range(k^2)]); while (rank(S) < k) : S[floor(k*random()),floor(k*random())] +=1;