Translating code comments written in another spoken language

I've just inherited some C code from a German programmer, and all of the comments are, naturally, in German. As I've forgotten most of my high school German, this is a slight problem.
Does anyone know of any translation tools that are code-aware, meaning they will only translate the natural language within comments? The project has many files; being able to operate on all of them at once would also be fantastic.
I'm currently copying-and-pasting into Google Translate, and while this is less than ideal, it can at least get me some answers.

I only know exactly how I would do this in Java, but I am sure it can be done for C as well, since the tools exist:
1. Grab a parser that understands C source files (this one sounds OK, but I don't know much about C).
2. Build a syntax tree. Iterate over all nodes of the tree, replacing the text of all comment nodes with translated text.
3. Write the tree back to a new source file (perhaps in a different directory).

Very broadly, this should be possible using Google Translate's Ajax API and a regex function that can handle callbacks. I don't think JS's built-in regex functions are up to the task, but I'm sure there are libraries out there. You would have to build a regular expression that can isolate the comments, send each chunk to the API, and return the translated result in the callback function.
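A minimal sketch of that comment-isolation idea, in Python rather than JS since re.sub takes a callback directly; the translate() function here is a hypothetical stand-in for whatever translation API client you actually use:

import re

# Matches /* ... */ block comments and // line comments in C source.
# (Naive: it will also fire on comment-like text inside string literals.)
COMMENT_RE = re.compile(r'/\*.*?\*/|//[^\n]*', re.DOTALL)

def translate(text):
    # Hypothetical stand-in: call your translation API of choice here.
    raise NotImplementedError

def translate_comments(source):
    # re.sub with a function argument is exactly "a regex with a callback":
    # each comment match is passed to the function, which returns its replacement.
    return COMMENT_RE.sub(lambda m: translate(m.group(0)), source)

To cover the whole project, you would walk the source tree, run translate_comments over each file, and write the results to a parallel directory, as the answer above suggests.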

Related

How does one extract URLs from a string? (Any language)

I realise this question has been asked many times across Stack Overflow and across the web, in fact, I have about 20 tabs open just now with apparent solutions to this problem.
The thing is every single answer says something along the lines of
You could use Regex, but it's not a good idea and doesn't reliably work, but I won't offer any alternatives.
So my question is this - Is there really no reliable, definitive way we can extract URLs from text?
Regular Expressions are extremely powerful tools. Like most powerful tools, they are seriously misunderstood, dangerous in the hands of many of their users, and the best answer to certain tasks. Matching known patterns in strings is what they exist for. Once you have a good URL pattern in hand it will work all the time in the context it was designed for. The reason everyone shies away from using them is that creating a good URL pattern for a specific context is difficult work. The pattern will vary by the execution environment (e.g., operating system for file: URLs), by the programming language and/or library in use, etc.
For the specific case of HTTP URLs, there is a clear definition that is mostly adhered to, and you can build a reliable regular expression from it with almost any language or library.
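For illustration, here is one such pattern in Python. It is a workable starting point rather than a definitive, spec-complete expression; in particular, the trailing character class is something you would tune for your context:

import re

# Rough HTTP/HTTPS matcher: the scheme, then a run of characters that
# rarely terminate a URL in surrounding prose.
URL_RE = re.compile(r'''https?://[^\s<>"')\]]+''')

text = 'See http://en.wikipedia.org/wiki/URI_scheme (and https://example.com/a?b=1).'
print(URL_RE.findall(text))
# ['http://en.wikipedia.org/wiki/URI_scheme', 'https://example.com/a?b=1']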
If you want to extract URLs from arbitrary strings, there is no real alternative to regex.
In fact the URI scheme is defined (see http://en.wikipedia.org/wiki/URI_scheme), and if you work through all its aspects, regex is very reliable.
Is there really no reliable, definitive way we can extract URLs from text?
Well, anything that arrives as string-formatted data needs careful exception handling. That said, once you have that handling in place it should work fine.
A regexp keyed to a URI scheme may do the trick; it might look something like:
<a href="(?<url>http://.*?)".*>(?<text>.+?)<\/a>
That's a .NET regexp though, so you may need to modify it to work on your platform language.
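For reference, roughly the same pattern in Python's named-group syntax (an assumed adaptation, with the usual caveat that regex-based HTML extraction is fragile):

import re

ANCHOR_RE = re.compile(r'<a href="(?P<url>https?://.*?)"[^>]*>(?P<text>.+?)</a>')

html = '<a href="http://example.com/x">Example</a>'
m = ANCHOR_RE.search(html)
print(m.group('url'), m.group('text'))  # http://example.com/x Example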

Are there any context-sensitive code search tools?

I have been getting very frustrated recently dealing with a massive bulk of legacy code that I am trying to get familiar with.
Say I search for a particular function call: I get loads of results that turn out to be completely irrelevant. Some of them are easy to spot, e.g. a comment saying
// Fixed functionality in foo() so don't need to handle this here any more
But others are much harder to spot manually, because they turn out to be calls from other functions in modules that are only compiled in certain cases, or are part of a much larger block of code that is #if 0'd out in its entirety.
What I'd like would be a search tool that would allow me to search for a term and give me the choice to include or exclude commented out or #if 0'd out code. Then the search results would be displayed alongside a list of #defines that are required in order for that snippet of code to be relevant.
I'm working in C / C++, but other than the specific comment syntax I guess the techniques should be more generally applicable.
Does such a tool exist?
Not entirely what you're after, but I find this quite handy.
GrepWin - A free visual "grep" tool for searching files.
I find it quite helpful because:
It's a separate app (doesn't lock up my editor)
Handles regular expressions
It's fast
Can specify what folder to search, and what filetypes (handles regexes here too)
Can limit by file size
Can include subdirs (or exclude by regex)
etc.
Almost any decent source browser will let you go to where a function is defined, and/or list all the calls of that function and take you directly to a call site. This will normally be based on a fairly complete parse of the source code so it will ignore comments, code that's excluded by the preprocessor, and so on (in fact, in at least one case, the parser used by the source browser is almost certainly better than the one used in the compiler itself).
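Lacking such a browser, here is a rough sketch of the filtering the question asks for: strip comments and #if 0 regions before searching. This is deliberately naive, assuming no nested conditionals and no comment markers inside string literals, so treat it as a starting point rather than a real tool:

import re
import sys

COMMENT_RE = re.compile(r'/\*.*?\*/|//[^\n]*', re.DOTALL)
IF0_RE = re.compile(r'^[ \t]*#if\s+0\b.*?^[ \t]*#endif[^\n]*',
                    re.DOTALL | re.MULTILINE)

def strip_dead_code(source):
    # Drop #if 0 ... #endif regions first, then comments.
    return COMMENT_RE.sub('', IF0_RE.sub('', source))

def search(path, term):
    live = strip_dead_code(open(path).read())
    for lineno, line in enumerate(live.splitlines(), 1):
        if term in line:
            # Note: line numbers refer to the stripped text, not the original file.
            print(f'{path}:{lineno}: {line.strip()}')

if __name__ == '__main__':
    search(sys.argv[1], sys.argv[2])

Producing the list of #defines under which a hit is live, as the question asks, is the genuinely hard part; that needs a real preprocessor-aware parser rather than regexes.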

What is a decompiler and how does it work? [closed]

So is a decompiler really a thing that gives you the source of a compiled/interpreted piece of code? Because to me that sounds impossible. How would you get the names of the functions, variables, classes, etc. if it is compiled? Or am I misinterpreting the definition? How does it work? And what is the general principle behind making one?
You're right about your definition of a decompiler: it takes a compiled application and produces source code to match. However, it does not in most cases know the name and structure of variables/functions/classes--it just guesses. It analyzes the flow of the program and tries to find a way to represent that flow through a certain programming language, typically C. However, because the programming language of choice (C, in this example) is often at a higher level than the state of the underlying program (a binary executable), some parts of the program might be impossible to represent accurately; in this case, the decompiler would fail and you would need to use a disassembler. This is why many people like to obfuscate their code: it makes it much harder for decompilers to open it.
Building a decompiler is not a simple task. Basically, you have to take the application that you are decompiling (be it an executable or some other form of compiled application) and parse it into some kind of tree you can work with in memory. You would then analyze the flow of the program and try to find patterns that might suggest that an if statement/variable/function/etc. was used in a certain location in the code. It's all really just a guessing game: you have to know the patterns that the compiler produces in compiled code, then search for those patterns and replace them with equivalent human-readable source code.
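As a toy illustration of that pattern-replacement idea (nothing like how a real decompiler is structured, which matches idioms against a control-flow graph rather than raw text), here is a sketch that recognizes one hand-picked instruction shape, a counted loop, in a pseudo-assembly listing and emits the C it plausibly came from:

import re

def match_counted_loop(listing):
    # Look for: mov R, 0 ; L: ; <body> ; inc R ; cmp R, N ; jl L
    # i.e. the shape a compiler might emit for "for (R = 0; R < N; R++)".
    for i, line in enumerate(listing):
        m = re.fullmatch(r'mov (\w+), 0', line)
        if not m or i + 1 >= len(listing):
            continue
        reg = m.group(1)
        lbl = re.fullmatch(r'(\w+):', listing[i + 1])
        if not lbl:
            continue
        for j in range(i + 2, len(listing) - 2):
            if (listing[j] == f'inc {reg}'
                    and (c := re.fullmatch(rf'cmp {reg}, (\d+)', listing[j + 1]))
                    and listing[j + 2] == f'jl {lbl.group(1)}'):
                body = listing[i + 2:j]
                return f'for (int {reg} = 0; {reg} < {c.group(1)}; {reg}++)', body
    return None

listing = ['mov ecx, 0', 'L1:', 'add eax, ecx', 'inc ecx', 'cmp ecx, 10', 'jl L1']
print(match_counted_loop(listing))
# => ('for (int ecx = 0; ecx < 10; ecx++)', ['add eax, ecx'])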
This is all much simpler for higher-level programs like Java or .NET, where you don't have to deal with assembly instructions, and things like variables are mostly taken care of for you. There, you don't have to guess as much as just directly translate. You might not have exact variable/method names, but you can at least deduce the program structure fairly easily.
Disclaimer: I have never written a decompiler and thus don't know every detail of what I'm talking about. If you are really interested in writing a decompiler, you should get a book on the topic.
A decompiler basically takes the machine code and reverts it back to the language it was written in. If I'm not mistaken, the decompiler needs to know which language the program was compiled from; otherwise it won't work.
The basic purpose of a decompiler is to get back to your source code. For example, one time my Java file got corrupted and the only thing I could do to bring it back was to use a decompiler (since the class file wasn't corrupted).
It works by deducing a "reasonable" (based on some heuristics) representation of what's in the object code. The degree of resemblance between what it produces and what was originally there tends to depend heavily upon how much information is contained in binary it starts from. If you start with basically a "pure" binary, it's generally stuck with just making up "reasonable" names for the variables, such as using things like i, j and k for loop indexes, and longer names for most others.
On the other hand, a language that supports introspection needs to embed a great deal more information about variable names, types, etc., into the executable. In a case like this, decompiling can produce something much closer to the original, such as typically retaining the original names for functions, variables, etc. In such a case, the decompiler can often produce something quite similar to the original -- possibly losing little more than formatting and comments.
That depends on what language you are decompiling. If you are decompiling something like C or C++, then the only information available to you is function names and arguments (in DLLs). If you are dealing with Java, the compiler usually inserts line numbers, variable names, field and method names, and so on. If there are no variable names, then you get generated names like localInt1, localInt2, localException1, or whatever the compiler emits. And the spacing between lines can be recovered from the line numbers.

Is there a program which can help understand another program?

I need to document the software I'm currently working on. The software consists of several programming languages and scripts, which got me thinking: if a new developer comes along and needs to fix something, they might know Java but maybe not bash scripting. It would be nice if there were a program which would help them understand what
for f in "$@" ; do
means. I was thinking of something that creates a static HTML page with the code plus syntax highlighting and if you hover over something (like the "for"), it would display a pop-up with an explanation:
for starts a loop which iterates over all values that follow in. In the loop, you can access each value via the variable $f. The loop body is between do and done
Does something like that already exist?
[EDIT] This is just an example. You'll get similar help for for, in, "$@", ; and do, i.e. each and every element of the line should be explained. Unknown elements (like command names) should link to Google. So you can understand what it does even if you're missing some detail.
[EDIT2] I'm aware that you can't write a program which understands what another program does. What I'm looking for is a simple tool which will do "extended syntax highlighting" in the sense that it will color an expression and give a short explanation what it means (plus maybe a link to some in-depth reference).
This is meant for someone who knows how to program but maybe hasn't seen some obscure construct before. Say
echo "Error" 1>&2
Every bash programmer knows what this means but a Java developer might be puzzled by the 1>&2 despite the fact that they can guess that echo == System.out.println. A simple "Redirects stdout to stderr" will clear things up and give that instant "AHA!" which allows them to stay in their current train of thought.
A tool like this could be built using ANTLR, i.e. parse the code into an abstract syntax tree using an ANTLR grammar for that language, and write an HTML generator which produced the annotated code.
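ANTLR is one route; as a lighter-weight sketch of the same output, here is Pygments (a Python highlighting library) used to wrap tokens in spans whose hover text comes from a lookup table. The EXPLANATIONS entries are stand-ins you would have to write per language, and whether a construct like "$@" arrives as one token or several depends on the lexer, so real rules would key on token types as well as text:

import html
from pygments.lexers import BashLexer

# Stand-in annotation table: token text -> tooltip.
EXPLANATIONS = {
    'for': 'starts a loop over the values listed after "in"',
    'do': 'begins the loop body (ended by "done")',
    'done': 'ends the loop body',
}

def annotate(code):
    out = []
    for tok_type, value in BashLexer().get_tokens(code):
        text = html.escape(value)
        tip = EXPLANATIONS.get(value.strip())
        if tip:
            out.append(f'<span title="{html.escape(tip)}">{text}</span>')
        else:
            out.append(text)
    return '<pre>' + ''.join(out) + '</pre>'

print(annotate('for f in "$@" ; do\n  echo "$f"\ndone\n'))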
It sounds like a useful tool to have for language learning, or exploring source code of projects you're not maintaining -- but is it appropriate for documentation?
Why is it important to help the programmers of other languages understand the code at this level of implementation detail? Anyone maintaining the implementation at this level will obviously have to know the language and will probably have an IDE to do most of this.
That said, I'd definitely consider a tool like this as a learning aid.
IMO it would be simpler and more effective to just collect links to good language-specific references and tutorials on a Wiki page.
For all mainstream languages, such sources exist and are maintained regularly. If you try to create your own reference, you need to maintain it too. Fair enough, bash syntax is not going to change very often, but other languages do develop faster, so it is going to be a burden.
If you think about it, it's not that useful to have a tool that explains the syntax. Developers could just google for keywords instead of browsing a website in a similar fashion to http://www.codeweblog.com/source/ .
I believe that good comments will be by far more useful, plus there are tools to extract the documentation by using the comments (for example, HappyDoc does that for Python).
It is a very tricky thing. First of all, it can be proven that a program that will "understand" any arbitrary program doesn't exist. However, you can still use existing documentation. Tools like Doxygen can help you: you document your code through comments, and the documentation is generated from them.
A language cannot be explained only through its syntax. The runtime environment plays a great part, together with the underlying philosophy of the language and its libraries.
Moreover, syntax is not that complex for most common languages (given that code has been written with maintainability in mind).
Going on with bash example, you cannot deeply understand bash if you know nothing about processes & job control, environment variables, a big list of unix commands (tr, sort, cut, paste, sed, awk, find, ...) and many other features that don't appear in syntax.
If the tool produced
for starts a loop which iterates over all values that follow in. In the loop, you can access each value via the variable $f. The loop body is between do and done
it would be pretty worthless. This is exactly the kind of comment that trainee (human) programmers are told never to write.

Detecting programming language from a snippet [closed]

What would be the best way to detect what programming language is used in a snippet of code?
I think that the method used in spam filters would work very well. You split the snippet into words. Then you compare the occurrences of these words with known snippets, and compute the probability that this snippet is written in language X, for every language you're interested in.
http://en.wikipedia.org/wiki/Bayesian_spam_filtering
If you have the basic mechanism then it's very easy to add new languages: just train the detector with a few snippets in the new language (you could feed it an open source project). This way it learns that "System" is likely to appear in C# snippets and "puts" in Ruby snippets.
I've actually used this method to add language detection to code snippets for forum software. It worked 100% of the time, except in ambiguous cases:
print "Hello"
Let me find the code.
I couldn't find the code so I made a new one. It's a bit simplistic but it works for my tests. Currently if you feed it much more Python code than Ruby code it's likely to say that this code:
def foo
  puts "hi"
end
is Python code (although it really is Ruby). This is because Python has a def keyword too. So if it has seen def 1000 times in Python and 100 times in Ruby, then it may still say Python even though puts and end are Ruby-specific. You could fix this by keeping track of the words seen per language and dividing by that somewhere (or by feeding it equal amounts of code in each language).
class Classifier
  def initialize
    @data = {}
    @totals = Hash.new(1)
  end

  def words(code)
    code.split(/[^a-z]/).reject { |w| w.empty? }
  end

  def train(code, lang)
    @totals[lang] += 1
    @data[lang] ||= Hash.new(1)
    words(code).each { |w| @data[lang][w] += 1 }
  end

  def classify(code)
    ws = words(code)
    @data.keys.max_by do |lang|
      # We really want to multiply here but I use logs
      # to avoid floating point underflow
      # (adding logs is equivalent to multiplication)
      Math.log(@totals[lang]) +
        ws.map { |w| Math.log(@data[lang][w]) }.reduce(0, :+)
    end
  end
end

# Example usage
c = Classifier.new

# Train from files
c.train(open("code.rb").read, :ruby)
c.train(open("code.py").read, :python)
c.train(open("code.cs").read, :csharp)

# Test it on another file
c.classify(open("code2.py").read) # => :python (hopefully)
Language detection solved by others:
Ohloh's approach: https://github.com/blackducksw/ohcount/
Github's approach: https://github.com/github/linguist
Guesslang is a possible solution:
http://guesslang.readthedocs.io/en/latest/index.html
There's also SourceClassifier:
https://github.com/chrislo/sourceclassifier/tree/master
I became interested in this problem after finding some code in a blog article which I couldn't identify. Adding this answer since this question was the first search hit for "identify programming language".
An alternative is to use highlight.js, which performs syntax highlighting but uses the success-rate of the highlighting process to identify the language. In principle, any syntax highlighter codebase could be used in the same way, but the nice thing about highlight.js is that language detection is considered a feature and is used for testing purposes.
UPDATE: I tried this and it didn't work that well. Compressed JavaScript completely confused it, i.e. the tokenizer is whitespace sensitive. Generally, just counting highlight hits does not seem very reliable. A stronger parser, or perhaps unmatched section counts, might work better.
First, I would try to find the specific keywords of a language, e.g.
"package, class, implements "=> JAVA
"<?php " => PHP
"include main fopen strcmp stdout "=>C
"cout"=> C++
etc...
It's very hard and sometimes impossible. Which language is this short snippet from?
int i = 5;
int k = 0;
for (int j = 100 ; j > i ; i++) {
  j = j + 1000 / i;
  k = k + i * j;
}
(Hint: It could be any one out of several.)
You can try to analyze various languages and try to decide using frequency analysis of keywords. If certain sets of keywords occur with certain frequencies in a text, it's likely that the language is Java, etc. But I don't think you will get anything completely foolproof, as you could, for example, give a variable in C the same name as a keyword in Java, and the frequency analysis would be fooled.
If you take it up a notch in complexity you could look for structures, if a certain keyword always comes after another one, that will get you more clues. But it will also be much harder to design and implement.
It would depend on what type of snippet you have, but I would run it through a series of tokenizers and see which language's BNF it came up as valid against.
I needed this, so I created my own:
https://github.com/bertyhell/CodeClassifier
It's very easily extendable by adding a training file in the correct folder.
Written in C#, but I imagine the code is easily converted to any other language.
Best solution I have come across is using the linguist gem in a Ruby on Rails app. It's kind of a specific way to do it, but it works. This was mentioned above by @nisc, but I will tell you my exact steps for using it. (Some of the following command-line commands are specific to Ubuntu but should be easily translated to other OSes.)
If you have any rails app that you don't mind temporarily messing with, create a new file in it to insert your code snippet in question. (If you don't have Rails installed, there's a good guide here, although for Ubuntu I recommend this. Then run rails new <name-your-app-dir> and cd into that directory. Everything you need to run a rails app is already there.)
After you have a rails app to use this with, add gem 'github-linguist' to your Gemfile (literally just called Gemfile in your app directory, no ext).
Then install ruby-dev (sudo apt-get install ruby-dev)
Then install cmake (sudo apt-get install cmake)
Now you can run gem install github-linguist (if you get an error that says icu required, do sudo apt-get install libicu-dev and try again)
(You may need to do a sudo apt-get update or sudo apt-get install make or sudo apt-get install build-essential if the above did not work)
Now everything is set up. You can use this any time you want to check code snippets. In a text editor, open the file you've made to insert your code snippet (let's just say it's app/test.tpl, but if you know the extension of your snippet, use that instead of .tpl; if you don't know the extension, don't use one). Now paste your code snippet into this file. Go to the command line and run bundle install (must be in your application's directory). Then run linguist app/test.tpl (more generally, linguist <path-to-code-snippet-file>). It will tell you the type, MIME type, and language. For multiple files (or for general use with a ruby/rails app) you can run bundle exec linguist --breakdown in your application's directory.
It seems like a lot of extra work, especially if you don't already have rails, but you don't actually need to know ANYTHING about rails if you follow these steps and I just really haven't found a better way to detect the language of a file/code snippet.
This site seems to be pretty good at identifying languages, if you want a quick way to paste a snippet into a web form, rather than doing it programmatically: http://dpaste.com/
Nice puzzle.
I think it is impossible to detect all languages, but you could trigger on key tokens (certain reserved words and often-used character combinations).
But there are a lot of languages with similar syntax, so it also depends on the size of the snippet.
Prettify is a Javascript package that does an okay job of detecting programming languages:
http://code.google.com/p/google-code-prettify/
It is mainly a syntax highlighter, but there is probably a way to extract the detection part for the purposes of detecting the language from a snippet.
I wouldn't think there would be an easy way of accomplishing this. I would probably generate lists of symbols/common keywords unique to certain languages/classes of languages (e.g. curly brackets for C-style languages, the Dim and Sub keywords for BASIC dialects, the def keyword for Python, the let keyword for functional languages). You then might be able to use basic syntax features to narrow it down even further.
I think the biggest distinction between languages is its structure. So my idea would be to look at certain common elements across all languages and see how they differ. For example, you could use regexes to pick out things such as:
function definitions
variable declarations
class declarations
comments
for loops
while loops
print statements
And maybe a few other things that most languages should have. Then use a point system: award at most 1 point for each element if the regex is found. Obviously, some languages will use the exact same syntax (for loops are often written like for(int i=0; i<x; ++i)), so multiple languages could each score a point for the same thing, but at least you're reducing the likelihood of it being an entirely different language. Some of them might score 0 across the board (the snippet doesn't contain a function at all, for example), but that's perfectly fine.
Combine this with Jules' solution, and it should work pretty well. Maybe also look for frequencies of keywords for an extra point.
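A minimal sketch of that point system; the pattern table here is tiny and the regexes are rough guesses, so treat both the languages and the patterns as placeholders to be extended:

import re

# language -> regexes, each worth one point when found in the snippet
SIGNATURES = {
    'python': [r'\bdef \w+\(.*\):', r'^\s*import \w+', r'\bprint\('],
    'java':   [r'\bpublic (static )?\w+', r'System\.out\.println'],
    'c':      [r'#include\s*<\w+\.h>', r'\bint main\s*\(', r'\bprintf\('],
    'ruby':   [r'\bdef \w+$', r'^end$', r'\bputs\b'],
}

def guess_language(snippet):
    scores = {
        lang: sum(1 for p in patterns if re.search(p, snippet, re.MULTILINE))
        for lang, patterns in SIGNATURES.items()
    }
    return max(scores, key=scores.get), scores

print(guess_language('def foo\n  puts "hi"\nend\n'))
# => ('ruby', {'python': 0, 'java': 0, 'c': 0, 'ruby': 3})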
Interesting. I have a similar task: recognizing text in different formats. YAML, JSON, XML, or Java properties? For example, even with syntax errors present, I should be able to tell apart JSON from XML with confidence.
I figure how we model the problem is critical. As Mark said, single-word tokenization is necessary but likely not enough. We will need bigrams, or even trigrams. But I think we can go further, knowing that we are looking at programming languages. I notice that almost any programming language has two distinctive types of tokens: symbols and keywords. Symbols are relatively easy to recognize (some symbols might be literals, not part of the language). Then bigrams or trigrams of symbols will pick up unique syntax structures around symbols. Keywords are another easy target if the training set is big and diverse enough; a useful feature could be bigrams around possible keywords. Another interesting type of token is whitespace. If we tokenize in the usual way by whitespace, we will lose this information. I'd say that, for analyzing programming languages, we should keep the whitespace tokens, as they may carry useful information about the syntax structure.
Finally, if I choose a classifier like random forest, I will crawl GitHub and gather all the public source code. Most source code files can be labeled by file suffix. For each file, I will randomly split it at empty lines into snippets of various sizes. I will then extract the features and train the classifier using the labeled snippets. After training is done, the classifier can be tested for precision and recall.
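A sketch of that setup with scikit-learn; the three inline snippets are stand-ins for a real crawled corpus. Character n-grams are used instead of word tokens so that symbols and whitespace survive as features, which is exactly the point made above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus: in practice, crawl GitHub, label files by suffix,
# and split them at empty lines into snippets of various sizes.
snippets = [
    'def foo(x):\n    return x + 1\n',
    'int main(void) { return 0; }\n',
    '{"key": [1, 2, 3], "flag": true}\n',
]
labels = ['python', 'c', 'json']

pipeline = make_pipeline(
    # char n-grams keep symbols and whitespace; lowercase=False keeps case
    CountVectorizer(analyzer='char', ngram_range=(1, 3), lowercase=False),
    RandomForestClassifier(n_estimators=100),
)
pipeline.fit(snippets, labels)
print(pipeline.predict(['def bar(y):\n    return y * 2\n']))  # ['python']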
I believe that there is no single solution that could possibly identify what language a snippet is in based only on that snippet. Take the keyword print: it could appear in any number of languages, each with different purposes and different syntax.
I do have some advice. I'm currently writing a small piece of code for my website that can be used to identify programming languages. As most of the other posts note, there is a huge range of programming languages you may simply never have heard of; you can't account for them all.
What I have done is identify each language by a selection of keywords. For example, Python could be identified in a number of ways. It's probably easier if you pick 'traits' that are almost certainly unique to the language. For Python, I chose the trait of using colons to start a set of statements, which I believe is a fairly unique trait (correct me if I'm wrong).
If, in my example, you can't find a colon starting a statement set, then move on to another possible trait; let's say using the def keyword to define a function. Now this can cause some problems, because Ruby also uses the keyword def to define a function. The key to telling the two (Python and Ruby) apart is to use various levels of filtering to get the best match. Ruby uses the keyword end to finish a function, whereas Python doesn't have anything to finish a function, just a de-indent, but you don't want to go there. But again, end could also be Lua, yet another programming language to add to the mix.
You can see that programming languages simply overlap too much. One keyword in one language could happen to be a keyword in another language. Using a combination of keywords that often go together, like Java's public static void main(String[] args), helps to eliminate those problems.
Like I've already said, your best chance is looking for relatively unique keywords or sets of keywords to separate one from the other. And, if you get it wrong, at least you had a go.

Resources