Stata programming language without syntax?

I recently got into Stata coming from a procedural/OO/functional background, and am having trouble understanding the basic elements of the language.
For example, I discovered that there is a syntax command which "allows programs to interpret the arguments the user types according to a grammar, such as standard Stata syntax". I infer this is the reason why some commands require a list of variables given as arguments to be separated by whitespace while others require a comma-separated list. But the idea of a program defining its own syntax, instead of having its (parameter) syntax enforced by the language, seems plain weird.
Another quite interesting construct is the syntax for macro definition and expansion (`macro') and the apparent absence of local variables as known in other languages.
Is there something like a "Stata for Java developers" document explaining the basic concepts of the language to people with my background?
PS: Apologies if this question seems unclear. Unfortunately, I can't formulate more concrete/clear questions at this point :(

I'm not exactly sure what you are looking for... but here are a few related points. Stata is kind of like writing a Unix shell script or a Windows batch file. Each line executes a command, and the first word is the command name. By convention, most commands have the following structure:
command [varlist] [=exp] [if expression] [in range] [weight] [using filename] [, options]
Brackets [.] mean it's optional (or unavailable, depending on the command). Some commands can be prefixed (such as by:, xi:, or svy:). The syntax of commands written by StataCorp and experienced users is pretty consistent. But, because Stata users also write commands, you occasionally see things that are wacky.
When Stata users write commands, they are saved in .ado files (not .do) and are defined using the program command. (See help program and the "Ado files" section of the manual.) Writing a command is akin to writing a function in other languages (e.g., MATLAB).
The syntax command is used to help you write your own command. When you execute a command, everything following the command's name (command above) is passed to the program in the local macro `0'. The syntax command parses this local macro, so that you can reference `varlist' or `if' and so on. In theory, you could parse `0' yourself, but the syntax command makes it much easier for you and your users (as long as you are following the conventional syntax). I put an example at the bottom.
I don't know exactly what you mean by "apparent absence of local variables as known in other languages." Macros store a single string or a single number in memory. Here's a comment I wrote about Stata's local/global macros. They are indeed a unique feature of Stata's programming language. As their names imply, "local" macros are only available within a specific program (command) or .do file, while "global" macros are available throughout a Stata session.
I found that, once I got used to macros in Stata, I started to miss them in other languages. They are pretty handy. In addition to (local/global) macros and the main data set, you can also store "things" in memory with the scalar and matrix commands (and one or two other obscure things).
I hope that helps. Here's a list of resources that might help.
Example:
program define myprogram
    syntax varlist [if], [hello(string) yes]
    macro list _0 _varlist _if _hello _yes
    summarize `varlist' `if'
    display "Here's the string in my hello option: `hello'"
    if !missing("`yes'") di "Yes is on"
    else di "Yes is off"
end
sysuse auto.dta
myprogram rep78 headroom if price > 5000 , hello("world") yes

A few books offer an "X for Y users" approach, but generally between statistical software packages. Regarding your question, I would recommend using instinct first.
I started reading (programming and markup) code about ten years ago, and even though I cannot code in a large number of languages, I can read a few languages rather easily. I found Stata easy because most of its core commands are straightforward, with recurrent optional statements like over, if or replace (to take a deliberately diverse set of statements) that are easy to understand and then apply.
When I teach Stata, I always have problems getting students to use the help pages as much as I do (and I love the fact they can be accessed so easily, just like in R). I explain the paradox by considering the fact that I can read the syntax indications straightaway. Syntax is very well covered by the previous reply to your question.
The extra mile consists in opening the [R], [U] and especially [P] handbooks that come with Stata in the utilities folder. There is a wealth of details there, which will interest both programmers and training statisticians. This is where I learnt to use macros and loops, beyond the obvious logic of commands like local/global and foreach/while (if I understand the term correctly, Stata is Turing-complete).
Stata is sometimes a bit of a pain when it comes to using single/double quotes in macro loops, but it's pretty straightforward otherwise. Have fun!


Is there a program which can help understand another program?

I need to document the software I'm currently working on. The software consists of several programming languages and scripts, which got me thinking. If a new developer comes along and needs to fix something, they might know Java but maybe not bash scripting. It would be nice if there was a program which would help them understand what
for f in "$@" ; do
means. I was thinking of something that creates a static HTML page with the code plus syntax highlighting and if you hover over something (like the "for"), it would display a pop-up with an explanation:
for starts a loop which iterates over all values that follow in. In the loop, you can access each value via the variable $f. The loop body is between do and done
Does something like that already exist?
[EDIT] This is just an example. You'll get similar help for f, in, "$@", ;, and do, i.e. each and every element of the line should be explained. Unknown elements (like command names) should link to Google. So you can understand what it does even if you're missing some detail.
[EDIT2] I'm aware that you can't write a program which understands what another program does. What I'm looking for is a simple tool which will do "extended syntax highlighting" in the sense that it will color an expression and give a short explanation what it means (plus maybe a link to some in-depth reference).
This is meant for someone who knows how to program but maybe hasn't seen some obscure construct before. Say
echo "Error" 1>&2
Every bash programmer knows what this means but a Java developer might be puzzled by the 1>&2 despite the fact that they can guess that echo == System.out.println. A simple "Redirects stdout to stderr" will clear things up and give that instant "AHA!" which allows them to stay in their current train of thought.
A tool like this could be built using ANTLR, i.e. parse the code into an abstract syntax tree using an ANTLR grammar for that language, and write an HTML generator which produces the annotated code.
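To make the idea concrete, here is a minimal, hypothetical Python sketch of the "extended syntax highlighting" described above: it wraps tokens it knows about in a <span> whose title attribute holds the explanation (so it appears on hover) and links unknown tokens to a web search. The token table, the naive whitespace tokenizer and all names are illustrative assumptions, not part of any existing tool or ANTLR grammar.
import html
import re
from urllib.parse import quote_plus

# Hypothetical explanation table for a handful of bash tokens.
EXPLANATIONS = {
    "for":  "starts a loop that iterates over the values listed after 'in'",
    "in":   "separates the loop variable from the list of values",
    "do":   "marks the beginning of the loop body",
    "done": "marks the end of the loop body",
    '"$@"': "expands to all positional parameters, each as a separate word",
    "1>&2": "redirects stdout (fd 1) to wherever stderr (fd 2) points",
}

TOKEN_RE = re.compile(r'\S+')

def annotate(line):
    """Return the line as HTML, wrapping known tokens in a <span> with a tooltip."""
    def repl(match):
        token = match.group(0)
        text = html.escape(token)
        tip = EXPLANATIONS.get(token)
        if tip is None:
            # Unknown token: link it to a web search instead of explaining it.
            return '<a href="https://www.google.com/search?q=%s">%s</a>' % (quote_plus(token), text)
        return '<span class="tok" title="%s">%s</span>' % (html.escape(tip), text)
    return TOKEN_RE.sub(repl, line)

print(annotate('for f in "$@" ; do'))
print(annotate('echo "Error" 1>&2'))
A real implementation would of course need a proper per-language lexer (which is where ANTLR would come in), but the shape of the output would be the same.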
It sounds like a useful tool to have for language learning, or exploring source code of projects you're not maintaining -- but is it appropriate for documentation?
Why is it important to help the programmers of other languages understand the code at this level of implementation detail? Anyone maintaining the implementation at this level will obviously have to know the language and will probably have an IDE to do most of this.
That said, I'd definitely consider a tool like this as a learning aid.
IMO it would be simpler and more effective to just collect links to good language-specific references and tutorials on a Wiki page.
For all mainstream languages, such sources exist and are maintained regularly. If you try to create your own reference, you need to maintain it too. Fair enough, bash syntax is not going to change very often, but other languages do develop faster, so it is going to be a burden.
If you think about it, it's not that useful to have a tool that explains the syntax. Developers could just google for keywords instead of browsing a website in a similar fashion to http://www.codeweblog.com/source/ .
I believe that good comments will be by far more useful, plus there are tools to extract the documentation by using the comments (for example, HappyDoc does that for Python).
It is a very tricky thing. First of all, it can be proven that a program that will "understand" any program doesn't exist. However, you can still use existing documentation. Maybe using tools like Doxygen can help you. You would need to document your code through comments, and the documentation will be generated from them.
A language cannot be explained only through its syntax. The runtime environment plays a great part, together with the underlying philosophy of the language and its libraries.
Moreover, syntax is not that complex for most common languages (given that code has been written with maintainability in mind).
Continuing with the bash example, you cannot deeply understand bash if you know nothing about processes & job control, environment variables, a big list of Unix commands (tr, sort, cut, paste, sed, awk, find, ...) and many other features that don't appear in the syntax.
If the tool produced
for starts a loop which iterates over all values that follow in. In the loop, you can access each value via the variable $f. The loop body is between do and done
it would be pretty worthless. This is exactly the kind of comment that trainee (human) programmers are told never to write.

For what reasons do some programmers vehemently hate languages where whitespace matters (e.g. Python)? [closed]

C++ is my first language, and as such I'm used to whitespace being ignored. However, I've been toying around with Python, and I don't find it too hard to get used to the whitespace rules. It seems, however, that a lot of programmers on the Internet can't get past the whitespace rules. From what I've seen, people's C++ programs tend to be formatted very consistently with respect to whitespace (or else it's pretty hard to read), so why do some people have such a problem with whitespace-based languages like Python?
It violates the Principle of Least Astonishment, because we have it ingrained in ourselves (whether for good or bad) that whitespace Does Not Matter in a programming language. Whitespace is one of those issues that has been left up to personal style.
I still have bad memories from my student days of learning the hard way that 8 spaces are not equivalent to a tab in a Makefile... Ah, the sleep I lost...
The only valid reason I have come across is that refactoring using cut-and-paste (not copy) without refactoring tools (or syntax-aware cut-and-paste) can end up changing semantics if an easy mistake is made.
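To make that hazard concrete, here is a small, purely illustrative Python example: both versions are syntactically valid, so a line pasted at the wrong indentation level silently changes the meaning instead of producing an error.
def scaled_copy(values, factor):
    """Multiply every value by factor and collect the results."""
    result = []
    for v in values:
        scaled = v * factor
        result.append(scaled)    # intended: append inside the loop
    return result

def scaled_copy_after_bad_paste(values, factor):
    """The same code after pasting the append one level too shallow."""
    result = []
    for v in values:
        scaled = v * factor
    result.append(scaled)        # now only the last value is kept, with no error
    return result

print(scaled_copy([1, 2, 3], 10))                  # [10, 20, 30]
print(scaled_copy_after_bad_paste([1, 2, 3], 10))  # [30]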
There are several different types of whitespace (spaces, tabs, weird unicode characters, carriage returns, line breaks, etc.), they aren't necessarily visually distinct, and languages and editors may treat them capriciously. This isn't an argument against well-designed whitespace semantics, but many people are against all forms of it simply because of the possibility of poor design.
People hate it because it violates common sense. Not a single one of the replies I have read here decided that it was OK to simply forgo periods and other punctuation. In fact, the grammar has been very good. If the nonsense about indentation actually carrying the meaning were true, we would all just forget about using punctuation entirely.
No one learned that newlines terminate a sentence in a horizontal language like English; instead we learned to infer when a sentence ended regardless of whether the punctuation was present.
The same is true for programming languages, especially for those of us who started out with a programming language that did use explicit block termination. You learn to infer where a block starts and stops over time; it was not the spacing that did that for you, but the semantics of the language itself.
Most literate people would have no problem understanding posts without punctuation. Having to rely on what is a representation of the absence of a character is not a good idea. Do any of you count from zero when you make your to-do list?
Alright, this is a very narrow perspective, but I haven't seen it mentioned elsewhere: keeping track of white space is a hassle if you are trying to autogenerate a script.
When I first encountered Python, I don't remember the details, but I had developed a Windows tool with a GUI that allowed novice users to configure several settings, and then press OK. The output of the tool was a script, which the user could copy to a Unix machine, and then execute it there to do something or other that was too complicated or tedious for them to do manually. Since nobody maintained the generated scripts, there was no reason they needed to look nice. So, keeping track of indentation seemed like an unnecessary burden from that perspective.
For most purposes, though, I find that Python is much easier than any other language.
Perhaps your C++ background (and thus who your peers are) is clouding your perception of this (i.e. selective sampling), but in my experience the reaction to Python's "whitespace as indentation" approach ranges anywhere from ambivalence to absolutely loving it. The reason a lot of people love it is that it forces people to format their code.
I can't say I've ever met anyone who "hates" it because hating it is much like hating the idea of well-formatted code.
Edit: let me put this in some perspective.
In the Java world there are two main methods of packaging and deploying Web apps: Ant and Maven.
Ant is basically an XML-based Make facility that has tasks for the common things you do. It's a blank slate, which is powerful, but it also means you have to write a lot of common things yourself and every installation is free to do things slightly differently. All of this is well-intentioned but can make it hard to figure out someone's Ant scripts.
Maven is far more fully featured. It has archetypes, which are basically project types. Depending on which archetype(s) you use, you won't have to write any tasks to start, stop, clean, build, etc., but you will have a mandated directory structure, which is quite deep.
The advantage of that is if you've seen one Maven Web app you've seen them all. You know the commands. You know the structure. That's extremely useful.
But you have people who absolutely hate Maven and I think it comes down to this: they don't like giving up control, even when it's ultimately in their interest to do so. Also, you'll find a certain brand of person who thinks that their use case is a justifiable exception. You see this personality trait a lot. For example, I think an old Joel post mentioned a story where someone wanted to use "enter" to go from the username to password form fields even though the convention was that enter executed the default action (usually "OK") so they had to write a custom dialog class for Windows for this.
Basically some people just don't like being told what to do and others are completely obstinate in their belief that they're right even when all evidence points to the contrary.
This probably explains why some supposedly hate Python's white space: they don't like being told how to format their code. They like the freedom of C/C++.
Because change is scary. And maybe, among certain developers, there are some faint memories of languages with capricious rules about whitespacing that were hard to remember and arbitrary, meant more for compiler convenience than expressiveness.
Most likely, not giving whitespace-significance a fair shake before dismissing it is the real reason. Ask someone to fix a bug in a reasonably complex but well-written Python program, then ask them to go fix a bug in a 20 year old system in C, VB or Cobol and ask them which they prefer.
As for me, I have as much trouble with whitespace in Python or Boo as I have with parentheses in Lisp. Which is to say, none.
They will have to get used to it. Initially I had a problem myself trying to read some examples, but after using the language for some time I started liking it.
I believe it is a habit that people have to overcome.
Some have developed habits (for example: deeply nested loops, unnecessarily large functions) that they perceive would be hard to support in a whitespace sensitive language.
Some have developed an aesthetic dislike for hanging indents.
Because they are used to languages like C and JavaScript where they can align items as they please.
When it comes to Python, you have to indent code based on its context:
def Print():
    ManyArgumentFunction(LongParam1,LongParam2,LongParam3,LongParam4...
In C, you could do:
void Print()
{
    ManyArgumentFunction(LongParam1,
                         LongParam2,
                         LongParam3,...
}
The only complaints I (also of C++ background) have heard about Python are from people who don't like using the "Replace Tabs with Space" option in their IDE.

naming convention for shell script and makefile

I have a few makefiles that store shared variables, such as CC=gcc. How should I name them?
The candidates are:
common.mk
Make.common
Makefile.common
Which is more classic? Is there a standard?
Similarly, I have some shell scripts; which should I choose among the following:
do_this_please.sh
do-this-please.sh
DoThisPlease.sh
doThisPlease.sh
Is there a generally accepted 'case' and suffix for these?
Google's open-source style uses underscores as separators: https://google.github.io/styleguide/shellguide.html#s7.4-source-filenames.
So in your case, it would be do_this_please.sh or do_this_please using this style.
What you've got are the bits that glue a build together. Build scripts, auto-generated configs, makefiles that other makefiles include - questioning how that stuff should be named is a good idea.
Most of all, be consistent.
I've seen a lot of .mk extensions for files included via the Makefile. However, as Gyom suggests, it's a very subjective question.
Whatever makes the syntax highlighter in your editor of choice happy is probably a good choice. If you're on a team where everyone uses something different, ask folks. For me, naming a makefile include with a .mk extension highlights correctly for everyone. Naming shell scripts with a .sh suffix helps in a similar way.
In short, make the file names obvious and try to make syntax highlighting work on as many editors / IDEs as possible. Makefile.common might not do that, common.mk may have a better shot.
I would go for do_this_please.sh. Here is my reasoning:
From a readability perspective, spacing is preferable. This takes us away from the camelcase.
The dash has special meaning in most shells, so pretty-print lexers will pick it up and give odd visualisations when pretty-printing something involving one of your files. You see a case above.
This leaves the underscored version. Some additional points in favour:
This convention is the same as the one for Python module names (cf. PEP 8). Modules in Python group code together. Your shell script could also be seen as such a grouping.
Pythonistas will have an immediate idea as to how much and what scope of functionality you offer. Python is very popular, with its home turf in scripting, so its conventions are probably a good template.
Since the purpose of the shell script is to execute command line commands, IMO the naming convention of the shell script should follow the naming conventions for files because it's more related to the operating system rather than a particular programming language. Since most operating systems use - or _ as word separators, it's better to name the script do-this-please.sh or do_this_please.sh. I personally use do_this_please.sh, because I think it looks better.
I would recommend always adding .sh extension to shell scripts, because if other people are looking through your project, do_this_please won't give them enough information about the purpose of the file without opening it. On the other hand when they see the extension, they know the file is a shell script and seeing the name of the script, they know the whole purpose of the file without having to open it.
It is quite a subjective question, so I'll answer it with a subjective answer :-)
IMHO, go for Makefile.common and do-this-please (with or without .sh suffix). I've seen these a lot and they are indisputably readable.

What is the worst programming language you ever worked with? [closed]

If you have an interesting story to share, please post an answer, but do not abuse this question for bashing a language.
We are programmers, and our primary tool is the programming language we use.
While there is a lot of discussion about the best one, I'd like to hear your stories about
the worst programming languages you ever worked with and I'd like to know exactly what annoyed you.
I'd like to collect these stories partly to avoid common pitfalls while designing a language (especially a DSL) and partly to avoid quirky languages in the future in general.
This question is not subjective. If a language supports only single character identifiers (see my own answer) this is bad in a non-debatable way.
EDIT
Some people have raised concerns that this question attracts trolls.
Wading through all your answers made one thing clear.
The large majority of answers is appropriate, useful and well written.
UPDATE 2009-07-01 19:15 GMT
The language overview is now complete, covering 103 different languages from 102 answers.
I decided to be lax about what counts as a programming language and included
anything reasonable. Thank you David for your comments on this.
Here are all programming languages covered so far
(alphabetical order, linked with answer, new entries in bold):
ABAP, all 20th century languages, all drag and drop languages, all proprietary languages, APF, APL (1), AS400, Authorware, Autohotkey, BancaStar, BASIC, Bourne Shell, Brainfuck, C++, Centura Team Developer, Cobol (1), Cold Fusion, Coldfusion, CRM114, Crystal Syntax, CSS, Dataflex 2.3, DB/c DX, dbase II, DCL, Delphi IDE, Doors DXL, DOS batch (1), Excel Macro language, FileMaker, FOCUS, Forth, FORTRAN, FORTRAN 77, HTML, Illustra web blade, Informix 4th Generation Language, Informix Universal Server web blade, INTERCAL, Java, JavaScript (1), JCL (1), karol, LabTalk, Labview, Lingo, LISP, Logo, LOLCODE, LotusScript, m4, Magic II, Makefiles, MapBasic, MaxScript, Meditech Magic, MEL, mIRC Script, MS Access, MUMPS, Oberon, object extensions to C, Objective-C, OPS5, Oz, Perl (1), PHP, PL/SQL, PowerDynamo, PROGRESS 4GL, prova, PS-FOCUS, Python, Regular Expressions, RPG, RPG II, Scheme, ScriptMaker, sendmail.conf, Smalltalk, Smalltalk, SNOBOL, SpeedScript, Sybase PowerBuilder, Symbian C++, System RPL, TCL, TECO, The Visual Software Environment, Tiny praat, TransCAD, troff, uBasic, VB6 (1), VBScript (1), VDF4, Vimscript, Visual Basic (1), Visual C++, Visual Foxpro, VSE, Webspeed, XSLT
The answers covering 80386 assembler, VB6 and VBScript have been removed.
PHP (In no particular order)
Inconsistent function names and argument orders
Because there are a zillion functions, each one of which seems to use a different naming convention and argument order. "Let's see... is it foo_bar or foobar or fooBar... and is it needle, haystack or haystack, needle?" The PHP string functions are a perfect example of this. Half of them use str_foo and the other half use strfoo.
Non-standard date format characters
Take j for example
In UNIX (which, by the way, is what everyone else uses as a guide for date string formats) %j returns the day of the year with leading zeros.
In PHP's date function j returns the day of the month without leading zeros.
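For comparison, here is a tiny illustrative snippet using Python's strftime, which follows the C/Unix directives the answer refers to (it shows only the Unix-style behaviour, not PHP's):
from datetime import date

d = date(2009, 2, 3)
print(d.strftime("%j"))   # "034" - day of the year, zero-padded (the Unix meaning)
print(d.strftime("%d"))   # "03"  - day of the month, zero-padded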
Still No Support for Apache 2.0 MPM
It's not recommended.
Why isn't this supported? "When you make the underlying framework more complex by not having completely separate execution threads, completely separate memory segments and a strong sandbox for each request to play in, feet of clay are introduced into PHP's system." So... it's not supported 'cause it makes things harder? 'Cause only the things that are easy are worth doing right? (To be fair, as Emil H pointed out, this is generally attributed to bad 3rd-party libs not being thread-safe, whereas the core of PHP is.)
No native Unicode support
Native Unicode support is slated for PHP6
I'm sure glad that we haven't lived in a global environment where we might have need to speak to people in other languages for the past, oh 18 years. Oh wait. (To be fair, the fact that everything doesn't use Unicode in this day and age really annoys me. My point is I shouldn't have to do any extra work to make Unicode happen. This isn't only a PHP problem.)
I have other beefs with the language. These are just some.
Jeff Atwood has an old post about why PHP sucks. He also says it doesn't matter. I don't agree but there we are.
XSLT.
XSLT is baffling, to begin with. The metaphor is completely different from anything else I know.
The thing was designed by a committee so deep in angle brackets that it comes off as a bizarre frankenstein.
The weird incantations required to specify the output format.
The built-in, invisible rules.
The odd bolt-on stuff, like scripts.
The dependency on XPath.
Tool support has been pretty slim until lately. Debugging XSLT in the early days was an exercise in navigating in complete darkness. The tools change that but, still, XSLT tops my list.
XSLT is weird enough that most people just ignore it. If you must use it, you need an XSLT Shaman to give you the magic incantations to make things go.
DOS Batch files. Not sure if this qualifies as programming language at all.
It's not that you can't solve your problems, but if you are used to bash...
Just my two cents.
Not sure if it's a true language, but I hate Makefiles.
Makefiles have meaningful differences between space and TAB, so even if two lines appear identical, they do not run the same.
Make also relies on a complex set of implicit rules for many languages, which are difficult to learn, but then are frequently overridden by the make file.
A Makefile system is typically spread over many, many files, across many directories.
With virtually no scoping or abstraction, a change to a makefile several directories away can prevent my source from building. Yet the error message is invariably a compilation error, not a meaningful error about make, or the makefiles.
Any environment I've worked in that uses makefiles successfully has a full-time Make expert. And all this to shave a few minutes off compilation??
The worst language I've ever seen comes from the tool praat, which is a good audio analysis tool. It does a pretty good job until you use the script language. sigh bad memories.
Tiny praat script tutorial for beginners
Function call
We've listed at least 3 different function-calling syntaxes:
The regular one
string = selected("Strings")
Nothing special here, you assign to the variable string the result of the selected function. Not really scary... yet.
The "I'm invoking some GUI command with parameters"
Create Strings as file list... liste 'path$'/'type$'
As you can see, the function name starts at "Create" and finishes with the "...". The command "Create Strings as file list" is the text displayed on a button or a menu (I'm too scared to check) in praat. This command takes 2 parameters: liste and an expression. I'm going to look deeper into the expression 'path$'/'type$'
Hmm. Yep. No spaces. If spaces were introduced, they would be separate arguments. As you can imagine, parentheses don't work. At this point of the description I would like to point out the suffix of the variable names. I won't develop it in this paragraph, I'm just teasing.
The "Oh, but I want to get the result of the GUI command in my variable"
noliftt = Get number of strings
Yes, we can see a pattern here: long and weird function name, so it must be a GUI call. But there's no '...' so no parameters. I don't want to see what the parser looks like.
The incredible type system (AKA Haskell and OCaml, praat is coming to you)
Simple natives types
windowname$ = left$(line$,length(line$)-4)
So, what's going on there?
It's now time to look at the convention and types of expression, so here we got :
left$ :: (String, Int) -> String
length :: (String) -> Int
windowname$ :: String
line$ :: String
As you can see, variable names and function names are suffixed with their type or return type. If the suffix is a '$', then it returns a string or is a string. If there is nothing, it's a number. I can see the point of prefixing a variable with its type to ease implementation, but suffixing? No, sorry, I can't.
Array type
To show the array type, let me introduce a 'tiny' loop :
for i from 1 to 4
Select... time time
bandwidth'i'$ = Get bandwidth... i
forhertz'i'$ = Get formant... i
endfor
We got i which is a number and... (no it's not a function)
bandwidth'i'$
What it does is create string variables: bandwidth1$, bandwidth2$, bandwidth3$, bandwidth4$ and give them values. As you can expect, you can't create a two-dimensional array this way; you must do something like this:
band2D__'i'__'j'$
The special string invocation
outline$ = "'time'#F'i':'forhertznum'Hz,'bandnum'Hz, 'spec''newline$'"
outline$ >> 'outfile$'
Strings are handled weirdly (to say the least) in the language. The '' is used to interpolate the value of a variable inside the surrounding "" string. This is _weird_. It goes against all the conventions built into many languages, from bash to PHP by way of PowerShell. And look, it even has redirection. Don't be fooled, it doesn't work like in your beloved shell. No, you have to get the variable value with the ''.
Da Wonderderderfulful execution model
I'm going to put the final touch to this wonderderderfulful presentation by talking to you about the execution model. As in every procedural language, instructions are executed from top to bottom; then there are the variables and the praat GUI. That is, you code everything against the praat GUI, invoking commands written on menus/buttons.
The main window of praat contains a list of items which can be:
files
list of files (created by a function with a wonderderfulful long long name)
Spectrogramm
Strings (don't ask)
So if you want to perform an operation on a given file, you must select the file in the list programmatically and then push the different buttons to take some actions. If you want to pass parameters to a GUI action, you have to follow the GUI layout of the form for your arguments; for example, "To Spectrogram... 0.005 5000 0.002 20 Gaussian" looks like that because it follows the layout of that GUI form.
Needless to say, my nightmares are filled with praat scripts dancing around me and shouting "DEBUG MEEEE!!".
More information at the praat site, under the well-named section "easy programmable scripting language"
Well since this question refuses to die and since the OP did prod me into answering...
I humbly proffer for your consideration Authorware (AW) as the worst language it is possible to create. (n.b. I'm going off recollection here, it's been ~6 years since I used AW, which of course means there's a number of awful things I can't even remember)
the horror, the horror http://img.brothersoft.com/screenshots/softimage/a/adobe_authorware-67096-1.jpeg
Let's start with the fact that it's a Macromedia product (-10 points), a proprietary language (-50 more) primarily intended for creating e-learning software, and moreover software that could be created by non-programmers and programmers alike, implemented as an iconic language AND a text language (-100).
Now if that last statement didn't scare you then you haven't had to fix WYSIWYG generated code before (hello Dreamweaver and Frontpage devs!), but the salient point is that AW had a library of about 12 or so elements which could be dragged into a flow. Like "Page" elements, Animations, IFELSE, and GOTO (-100). Of course removing objects from the flow created any number of broken connections and artifacts which the IDE had variable levels of success coping with. Naturally the built in wizards (-10) were a major source of these.
Fortunately you could always step into a code view, and eventually you'd have to because with a limited set of iconic elements some things just weren't possible otherwise. The language itself was based on TUTOR (-50) - a candidate for worst language itself if only it had the ambition and scope to reach the depths AW would strive for - about which wikipedia says:
...the TUTOR language was not easy to learn. In fact, it was even suggested that several years of experience with the language would be required before programmers could build programs worth keeping.
An excellent foundation then, which was built upon in the years before the rise of the internet with exactly nothing. Absolutely no form of data structure beyond an array (-100), certainly no sugar (real men don't use switch statements?) (-10), and a large splash of syntactic vinegar ("--" was the comment indicator so no decrement operator for you!) (-10). Language reference documentation was provided in paper or zip file formats (-100), but at least you had the support of the developer-run user group and could quickly establish that the solution to your problem was to use the DLL or SWF importing features of AW to enable you to do the actual coding in a real language.
AW was driven by a flow (with necessary PAUSE commands) and therefore has all the attendant problems of a linear rather than event based system (-50), and despite the outright marketing lies of the documentation it was not object oriented (-50) either. All code reuse was achieved through GOTO. No scope, lots of globals (-50).
It's not the language's fault directly, but obviously no source control integration was possible, and certainly no TDD, documentation generation or any other add-on tool you might like.
Of course Macromedia met the challenge of the internet head on with a stubborn refusal to engage for years, eventually producing the buggy, hard to use, security nightmare which is Shockwave (-100) to essentially serialise desktop versions of the software through a required plugin (-10). As HTML rose, AW stagnated, still persisting with its Shockwave delivery even in the face of IEEE SCORM JavaScript standards.
Ultimately after years of begging and promises Macromedia announced a radical new version of AW in development to address these issues, and a few years later offshored the development and then cancelled the project. Although of course Macromedia are still selling it (EVIL BONUS -500).
If anything else needs to be said, this is the language which allows spaces in variable names (-10000).
If you ever want to experience true pain, try reading somebody else's uncommented hungarian notation in a language which isn't case sensitive and allows variable name spaces.
Total Annakata Arbitrary Score (AAS): -11300
Adjusted for personal experience: OutOfRangeException
(apologies for length, but it was cathartic)
Seriously: Perl.
It's just a pain in the ass to code with, for beginners and even for semi-professionals who work with Perl on a daily basis. I can constantly see my colleagues struggle with the language, building the worst scripts, like 2000 lines with no regard for any well-accepted coding standard. It's the worst mess I've ever seen in programming.
Now, you can always say that those people are bad at coding (despite the fact that some of them have used Perl for a lot of years now), but the language just encourages all that freaking shit that makes me scream when I have to read a script by some other guy.
MS Access Visual Basic for Applications (VBA) was also pretty bad. Access was bad altogether in that it forced you down a weak paradigm and was deceptively simple to get started, but a nightmare to finish.
No answer about Cobol yet? :O
Old-skool BASICs with line numbers would be my choice. When you had no space between line numbers to add new lines, you had to run a renumber utility, which caused you to lose any mental anchors you had to what was where.
As a result, you ended up squeezing in too many statements on a single line (separated by colons), or you did a goto or gosub somewhere else to do the work you couldn't cram in.
MUMPS
I worked in it for a couple years, but have done a complete brain dump since then. All I can really remember was no documentation (at my location) and cryptic commands.
It was horrible. Horrible! HORRIBLE!!!
There are just two kinds of languages: the ones everybody complains about and the ones nobody uses.
Bjarne Stroustrup
I haven't yet worked with many languages and deal mostly with scripting languages; out of these VBScript is the one I like least. Although it has some handy features, some things really piss me off:
Object assignments are made using the Set keyword:
Set foo = Nothing
Omitting Set is one of the most common causes of run-time errors.
No such thing as structured exception handling. Error checking is like this:
On Error Resume Next
' Do something
If Err.Number <> 0 Then
' Handle error
Err.Clear
End If
' And so on
Enclosing the procedure call parameters in parentheses requires using the Call keyword:
Call Foo (a, b)
Its English-like syntax is way too verbose. (I'm a fan of curly braces.)
Logical operators don't short-circuit. If you need to test a compound condition where the subsequent condition relies on the success of the previous one, you need to put the conditions into separate If statements.
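To illustrate what short-circuiting buys you, here is a small illustrative Python sketch (Python is used only because it is compact; the nested version is essentially the separate-If-statements workaround VBScript forces on you):
def has_items(container):
    # Safe in a short-circuiting language: the right-hand side is only
    # evaluated when the left-hand side is true.
    return container is not None and len(container) > 0

# Without short-circuiting you have to nest the checks, which is essentially
# the separate-If-statements workaround described above:
def has_items_nested(container):
    if container is not None:
        if len(container) > 0:
            return True
    return False

print(has_items(None), has_items([]), has_items([1]))                        # False False True
print(has_items_nested(None), has_items_nested([]), has_items_nested([1]))  # False False True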
Lack of parameterized class constructors.
To wrap a statement into several lines, you have to use an underscore:
str = "Hello, " & _
"world!"
Lack of multiline comments.
Edit: found this article: The Flangy Guide to Hating VBScript. The author sums up his complaints as "VBS isn't Python" :)
Objective-C.
The annotations are confusing, using brackets to call methods still does not compute in my brain, and what is worse is that all of the library functions from C are called using the standard operators in C, -> and ., and it seems like the only company that is driving this language is Apple.
I admit I have only used the language when programming for the iPhone (and looking into programming for OS X), but it feels as if C++ were merely forked, adding in annotations and forcing the implementation and the header files to be separate would make much more sense.
PROGRESS 4GL (apparently now known as "OpenEdge Advanced Business Language").
PROGRESS is both a language and a database system. The whole language is designed to make it easy to write crappy green-screen data-entry screens. (So start by imagining how well this translates to Windows.) Anything fancier than that, whether pretty screens, program logic, or batch processing... not so much.
I last used version 7, back in the late '90s, so it's vaguely possible that some of this is out-of-date, but I wouldn't bet on it.
It was originally designed for text-mode data-entry screens, so on Windows, all screen coordinates are in "character" units, which are some weird number of pixels wide and a different number of pixels high. But of course they default to a proportional font, so the number of "character units" doesn't correspond to the actual number of characters that will fit in a given space.
No classes or objects.
No language support for arrays or dynamic memory allocation. If you want something resembling an array, you create a temporary in-memory database table, define its schema, and then get a cursor on it. (I saw a bit of code from a later version, where they actually built and shipped a primitive object-oriented system on top of these in-memory tables. Scary.)
ISAM database access is built in. (But not SQL. Who needs it?) If you want to increment the Counter field in the current record in the State table, you just say State.Counter = State.Counter + 1. Which isn't so bad, except...
When you use a table directly in code, then behind the scenes, they create something resembling an invisible, magic local variable to hold the current cursor position in that table. They guess at which containing block this cursor will be scoped to. If you're not careful, your cursor will vanish when you exit a block, and reset itself later, with no warning. Or you'll start working with a table and find that you're not starting at the first record, because you're reusing the cursor from some other block (or even your own, because your scope was expanded when you didn't expect it).
Transactions operate on these wild-guess scopes. Are we having fun yet?
Everything can be abbreviated. For some of the offensively long keywords, this might not seem so bad at first. But if you have a variable named Index, you can refer to it as Index or as Ind or even as I. (Typos can have very interesting results.) And if you want to access a database field, not only can you abbreviate the field name, but you don't even have to qualify it with the table name; they'll guess the table too. For truly frightening results, combine this with:
Unless otherwise specified, they assume everything is a database access. If you access a variable you haven't declared yet (or, more likely, if you mistype the variable name), there's no compiler error: instead, it goes looking for a database field with that name... or a field that abbreviates to that name.
The guessing is the worst. Between the abbreviations and the field-by-default, you could get some nasty stuff if you weren't careful. (Forgot to declare I as a local variable before using it as a loop variable? No problem, we'll just randomly pick a table, grab its current record, and completely trash an arbitrarily-chosen field whose name starts with I!)
Then add in the fact that an accidental field-by-default access could change the scope it guessed for your tables, thus breaking some completely unrelated piece of code. Fun, yes?
They also have a reporting system built into the language, but I have apparently repressed all memories of it.
When I got another job working with Netscape LiveWire (an ill-fated attempt at server-side JavaScript) and classic ASP (VBScript), I was in heaven.
The worst language? BancStar, hands down.
3,000 predefined variables, all numbered, all global. No variable declaration, no initialization. Half of them, scattered over the range, reserved for system use, but you can use them at your peril. A hundred or so are automatically filled in as a result of various operations, and no list of which ones those are. They all fit in 38k bytes, and there is no protection whatsoever for buffer overflow. The system will cheerfully let users put 20 bytes in a ten byte field if you declared the length of an input field incorrectly. The effects are unpredictable, to say the least.
This is a language that will let you declare a calculated gosub or goto; due to its limitations, this is frequently necessary. Conditionals can be declared forward or reverse. Picture an "If" statement that terminates 20 lines before it begins.
The return stack is very shallow, (20 Gosubs or so) and since a user's press of any function key kicks off a different subroutine, you can overrun the stack easily. The designers thoughtfully included a "Clear Gosubs" command to nuke the stack completely in order to fix that problem and to make sure you would never know exactly what the program would do next.
There is much more. Tens of thousands of lines of this Lovecraftian horror.
The .bat files scripting language on DOS/Windows. God only knows how un-powerful this one is, especially if you compare it to the Unix shell languages (which aren't so powerful either, but are way better nonetheless).
Just try to concatenate two strings or make a for loop. Nah.
VSE, The Visual Software Environment.
This is a language that a prof of mine (Dr. Henry Ledgard) tried to sell us on back in undergrad/grad school. (I don't feel bad about giving his name because, as far as I can tell, he's still a big proponent and would welcome the chance to convince some folks it's the best thing since sliced bread). When describing it to people, my best analogy is that it's sort of a bastard child of FORTRAN and COBOL, with some extra bad thrown in. From the only really accessible folder I've found with this material (there's lots more in there that I'm not going to link specifically here):
VSE Overview (pdf)
Chapter 3: The VSE Language (pdf) (Not really an overview of the language at all)
Appendix: On Strings and Characters (pdf)
The Software Survivors (pdf) (Fevered ramblings attempting to justify this turd)
VSE is built around what they call "The Separation Principle". The idea is that Data and Behavior must be completely segregated. Imagine C's requirement that all variables/data must be declared at the beginning of the function, except now move that declaration into a separate file that other functions can use as well. When other functions use it, they're using the same data, not a local copy of data with the same layout.
Why do things this way? We learn that from The Software Survivors that Variable Scope Rules Are Hard. I'd include a quote but, like most fools, it takes these guys forever to say anything. Search that PDF for "Quagmire Of Scope" and you'll discover some true enlightenment.
They go on to claim that this somehow makes it more suitable for multi-proc environments because it more closely models the underlying hardware implementation. Riiiight.
Another choice theme that comes up frequently:
INCREMENT DAY COUNT BY 7 (or DAY COUNT = DAY COUNT + 7)
DECREMENT TOTAL LOSS BY GROUND_LOSS
ADD 100.3 TO TOTAL LOSS(LINK_POINTER)
SET AIRCRAFT STATE TO ON_THE_GROUND
PERCENT BUSY = (TOTAL BUSY CALLS * 100)/TOTAL CALLS
Although not earthshaking, the style of arithmetic reflects ordinary usage, i.e., anyone can read and understand it - without knowing a programming language. In fact, VisiSoft arithmetic is virtually identical to FORTRAN, including embedded complex arithmetic. This puts programmers concerned with their professional status and corresponding job security ill at ease.
Ummm, not that concerned at all, really. One of the key selling points that Bill Cave uses to try to sell VSE is the democratization of programming so that business people don't need to indenture themselves to programmers who use crazy, arcane tools for the sole purpose of job security. He leverages this irrational fear to sell his tool. (And it works-- the federal gov't is his biggest customer). I counted 17 uses of the phrase "job security" in the document. Examples:
... and fit only for those desiring artificial job security.
More false job security?
Is job security dependent upon ensuring the other guy can't figure out what was done?
Is job security dependent upon complex code...?
One of the strongest forces affecting the acceptance of new technology is the perception of one's job security.
He uses this paranoia to drive wedge between the managers holding the purse strings and the technical people who have the knowledge to recognize VSE for the turd that it is. This is how he squeezes it into companies-- "Your technical people are only saying it sucks because they're afraid it will make them obsolete!"
A few additional choice quotes from the overview documentation:
Another consequence of this approach is that data is mapped into memory on a "What You See Is What You Get" basis, and maintained throughout. This allows users to move a complete structure as a string of characters into a template that describes each individual field. Multiple templates can be redefined for a given storage area. Unlike C and other languages, substructures can be moved without the problems of misalignment due to word boundary alignment standards.
Now, I don't know about you, but I know that a WYSIWYG approach to memory layout is at the top of my priority list when it comes to language choice! Basically, they ignore alignment issues because only old languages that were designed in the '60's and '70's care about word alignment. Or something like that. The reasoning is bogus. It made so little sense to me that I proceeded to forget it almost immediately.
There are no user-defined types in VSE. This is a far-reaching decision that greatly simplifies the language. The gain from a practical point of view is also great. VSE allows the designer and programmer to organize a program along the same lines as a physical system being modeled. VSE allows structures to be built in an easy-to-read, logical attribute hierarchy.
Awesome! User-defined types are lame. Why would I want something like an InputMessage object when I can have:
LINKS_IN_USE INTEGER
INPUT_MESSAGE
1 ORIGIN INTEGER
1 DESTINATION INTEGER
1 MESSAGE
2 MESSAGE_HEADER CHAR 10
2 MESSAGE_BODY CHAR 24
2 MESSAGE_TRAILER CHAR 10
1 ARRIVAL_TIME INTEGER
1 DURATION INTEGER
1 TYPE CHAR 5
OUTPUT_MESSAGE CHARACTER 50
You might look at that and think, "Oh, that's pretty nicely formatted, if a bit old-school." Old-school is right. Whitespace is significant-- very significant. And redundant! The 1's must be in column 3. The 1 indicates that it's at the first level of the hierarchy. The symbol name must be in column 5. Your hierarchies are limited to a depth of 9.
Well, ok, but is that so awful? Just wait:
It is well known that for reading text, use of conventional upper/lower case is more readable. VSE uses all upper case (except for comments). Why? The literature in psychology is based on prose. Programs, simply, are not prose. Programs are more like math, accounting, tables. Program fonts (usually Courier) are almost universally fixed-pitch, and for good reason – vertical alignment among related lines of code. Programs in upper case are nicely readable, and, after a time, much better in our opinion.
Nothing like enforcing your opinion at the language level! That's right, you cannot use any lower case in VSE unless it's in a comment. Just keep your CAPSLOCK on, it's gonna be stuck there for a while.
VSE subprocedures are called processes. This code sample contains three processes:
PROCESS_MUSIC
    EXECUTE INITIALIZE_THE_SCENE
    EXECUTE PROCESS_PANEL_WIDGET
INITIALIZE_THE_SCENE
    SET TEST_BUTTON PANEL_BUTTON_STATUS TO ON
    MOVE ' ' TO TEST_INPUT PANEL_INPUT_TEXT
    DISPLAY PANEL PANEL_MUSIC
PROCESS_PANEL_WIDGET
    ACCEPT PANEL PANEL_MUSIC
    *** CHECK FOR BUTTON CLICK
    IF RTG_PANEL_WIDGET_NAME IS EQUAL TO 'TEST_BUTTON'
       MOVE 'I LIKE THE BEATLES!' TO TEST_INPUT PANEL_INPUT_TEXT.
    DISPLAY PANEL PANEL_MUSIC
All caps as expected. After all, that's easier to read. Note the whitespace. It's significant again. All process names must start in column 0. The initial level of instructions must start on column 4. Deeper levels must be indented exactly 3 spaces. This isn't a big deal, though, because you aren't allowed to do things like nest conditionals. You want a nested conditional? Well just make another process and call it. And note the delicious COBOL-esque syntax!
You want loops? Easy:
EXECUTE NEXT_CALL
EXECUTE NEXT_CALL 5 TIMES
EXECUTE NEXT_CALL TOTAL CALL TIMES
EXECUTE NEXT_CALL UNTIL NO LINES ARE AVAILABLE
EXECUTE NEXT_CALL UNTIL CALLS_ANSWERED ARE EQUAL TO CALLS_WAITING
EXECUTE READ_MESSAGE UNTIL LEAD_CHARACTER IS A DELIMITER
Ugh.
Here is the contribution to my own question:
Origin LabTalk
My all-time favourite in this regard is Origin LabTalk.
In LabTalk, the maximum length of a string variable identifier is one character.
That is, there are only 26 string variables in total. Even worse, some of them are used by Origin itself, and it is not clear which ones.
From the manual:
LabTalk uses the % notation to define
a string variable. A legal string
variable name must be a % character
followed by a single alphabetic
character (a letter from A to Z).
String variable names are
caseinsensitive. Of all the 26 string
variables that exist, Origin itself
uses 14.
Doors DXL
The second worst in my opinion is Doors DXL.
Programming languages can be divided into two groups: those with manual memory management (e.g. delete, free) and those with a garbage collector. Some languages offer both, but DXL is probably the only language in the world that supports neither. OK, to be honest this is only true for strings, but hey, strings aren't exactly the most rarely used data type in requirements engineering software.
The consequence is that memory used by a string can never be reclaimed, and DOORS DXL leaks like a sieve.
There are countless other quirks in DXL, just to name a few:
DXL function syntax
DXL arrays
Cold Fusion
I guess it's good for designers but as a programmer I always felt like one hand was tied behind my back.
The worst two languages I've worked with were APL, which is relatively well known for languages of its age, and TECO, the language in which the original Emacs was written. Both are notable for their terse, inscrutable syntax.
APL is an array processing language; it's extremely powerful, but nearly impossible to read, since every character is an operator, and many don't appear on standard keyboards.
TECO had a similar look, and for a similar reason. Most characters are operators, and this special purpose language was devoted to editing text files. It was a little better, since it used the standard character set. And it did have the ability to define functions, which was what gave life to emacs--people wrote macros, and only invoked those after a while. But figuring out what a program did or writing a new one was quite a challenge.
LOLCODE:
HAI
CAN HAS STDIO?
VISIBLE "HAI WORLD!"
KTHXBYE
Seriously, the worst programming language ever is that of Makefiles. Totally incomprehensible: tabs have syntactic meaning, and there isn't even a debugger to find out what's going on.
I'm not sure if you meant to include scripting languages, but I've seen TCL (which is also annoying), but... the mIRC scripting language annoys me to no end.
Because of some oversight in the parsing, it's whitespace significant when it's not supposed to be. Conditional statements will sometimes be executed when they're supposed to be skipped because of this. Opening a block statement cannot be done on a separate line, etc.
Other than that it's just full of messy, inconsistent syntax that was probably designed that way to make very basic stuff easy, but at the same time makes anything a little more complex barely readable.
I lost most of my mIRC scripts, or I could have probably found some good examples of what a horrible mess it forces you to create :(
Regular expressions
It's a write-only language, and it's hard to verify whether it works correctly for the right inputs.
Visual Foxpro
I can't believe nobody has said this one:
LotusScript
I think it's far worse than PHP, at least.
It's not about the language itself, which follows a syntax similar to Visual Basic; it's the fact that it seems to have a lot of functions for extremely useless things that you will never (or one in a million times) use, but lacks support for things you will use every day.
I don't remember any concrete example, but they were like:
"OK, I have an event to check whether the mouse pointer is in the upper corner of the form, and I don't have a double-click event for the Form!? WTF?"
Twice I've had to work in 'languages' where you drag-n-dropped modules onto the page and linked them together with lines to show data flow. (One claimed to be an RDBMS, and the other a general-purpose data acquisition and number crunching language.)
Just thinking of it makes me want to throttle someone. Or puke. Or both.
Worse, neither exposed a text language that you could hack directly.
I find myself avoiding VBScript/Visual Basic 6 the most.
I use primarily C++, javascript, Java for most tasks and dabble in ruby, scala, erlang, python, assembler, perl when the need arises.
I, like most other reasonably minded polyglots/programmers, strongly feel that you have to use the right tool for the job - this requires you to understand your domain and to understand your tools.
My issue with VBScript and VB6 is when I use them to script Windows or Office applications (the right domain for them) - I find myself struggling with the language (they fall short of being the right tool).
VBScript's lack of easy-to-use native data structures (such as associative containers/maps) and other quirks (such as Set for assignment to objects) is a needless and frustrating annoyance, especially for a scripting language. Contrast it with JavaScript (which I now use to program wscript/cscript windows and do ActiveX automation scripts), which is far more expressive. While there are certain things that work better with VBScript (passing arrays back and forth from COM objects is slightly easier, although it is easier to pass event handlers into COM components with JScript), I am still surprised by the number of coders that still use VBScript to script Windows - I bet if they wrote the same program in both languages they would find that JScript works with you much more than VBScript, because of JScript's native hash data types and closures.
VB6/VBA, though a little better than VBScript in general, still have many similar issues where (for their domain) they require much more boilerplate to do simple tasks than what I would like and have seen in other scripting languages.
In 25+ years of computer programming, by far the worst thing I've ever experienced was a derivative of MUMPS called Meditech Magic. It's much more evil than PHP could ever hope to be.
It doesn't even use '=' for assignment! 100^b assigns a value of 100 to b and is read as "100 goes to b". Basically, this language invented its own syntax from top to bottom. So no matter how many programming languages you know, Magic will be a complete mystery to you.
Here is 100 bottles of beer on the wall written in this abomination of a language:
BEERv1.1,
100^b,T("")^#,DO{b'<1 NN(b,"bottle"_IF{b=1 " ";"s "}_"of beer on the wall")^#,
N(b,"bottle"_IF{b=1 " ";"s "}_"of beer!")^#,
N("You take one down, pass it around,")^#,b-1^b,
N(b,"bottle"_IF{b=1 " ";"s "}_"of beer on the wall!")^#},
END;
TCL. It only compiles code right before it executes, so if your code never went down branch A while testing, and then one day in the field it does go down branch A, it can hit a SYNTAX ERROR!

Detecting programming language from a snippet [closed]

What would be the best way to detect what programming language is used in a snippet of code?
I think that the method used in spam filters would work very well. You split the snippet into words. Then you compare the occurrences of these words with known snippets, and compute the probability that this snippet is written in language X, for every language you're interested in.
http://en.wikipedia.org/wiki/Bayesian_spam_filtering
If you have the basic mechanism then it's very easy to add new languages: just train the detector with a few snippets in the new language (you could feed it an open source project). This way it learns that "System" is likely to appear in C# snippets and "puts" in Ruby snippets.
I've actually used this method to add language detection to code snippets for forum software. It worked 100% of the time, except in ambiguous cases:
print "Hello"
Let me find the code.
I couldn't find the code so I made a new one. It's a bit simplistic but it works for my tests. Currently if you feed it much more Python code than Ruby code it's likely to say that this code:
def foo
puts "hi"
end
is Python code (although it really is Ruby). This is because Python has a def keyword too. So if it has seen 1000x def in Python and 100x def in Ruby, then it may still say Python even though puts and end are Ruby-specific. You could fix this by keeping track of the words seen per language and dividing by that somewhere (there is a sketch of this after the example usage below), or by feeding it equal amounts of code in each language.
class Classifier
  def initialize
    @data = {}
    @totals = Hash.new(1)
  end
  def words(code)
    code.split(/[^a-z]/).reject { |w| w.empty? }
  end
  def train(code, lang)
    @totals[lang] += 1
    @data[lang] ||= Hash.new(1)
    words(code).each { |w| @data[lang][w] += 1 }
  end
  def classify(code)
    ws = words(code)
    @data.keys.max_by do |lang|
      # We really want to multiply here but I use logs
      # to avoid floating point underflow
      # (adding logs is equivalent to multiplication)
      Math.log(@totals[lang]) +
        ws.map { |w| Math.log(@data[lang][w]) }.reduce(:+)
    end
  end
end
# Example usage
c = Classifier.new
# Train from files
c.train(open("code.rb").read, :ruby)
c.train(open("code.py").read, :python)
c.train(open("code.cs").read, :csharp)
# Test it on another file
c.classify(open("code2.py").read) # => :python (hopefully)
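As a rough sketch of the per-language normalization mentioned above (my own illustration, not from the original answer; the NormalizedClassifier name and the @word_totals counter are assumptions), you can track how many words each language was trained on and divide by that count before taking the log:
class NormalizedClassifier < Classifier
  def initialize
    super
    # Words seen per language; starting at 1 avoids division by zero / log(0)
    @word_totals = Hash.new(1)
  end
  def train(code, lang)
    super
    @word_totals[lang] += words(code).size
  end
  def classify(code)
    ws = words(code)
    @data.keys.max_by do |lang|
      # Same log-sum as before, but each word count is divided by the
      # total number of words seen for that language
      Math.log(@totals[lang]) +
        ws.map { |w| Math.log(@data[lang][w].to_f / @word_totals[lang]) }.reduce(0.0, :+)
    end
  end
end
It can be trained and used exactly like the example above.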
Language detection solved by others:
Ohloh's approach: https://github.com/blackducksw/ohcount/
Github's approach: https://github.com/github/linguist
Guesslang is a possible solution:
http://guesslang.readthedocs.io/en/latest/index.html
There's also SourceClassifier:
https://github.com/chrislo/sourceclassifier/tree/master
I became interested in this problem after finding some code in a blog article which I couldn't identify. Adding this answer since this question was the first search hit for "identify programming language".
An alternative is to use highlight.js, which performs syntax highlighting but uses the success-rate of the highlighting process to identify the language. In principle, any syntax highlighter codebase could be used in the same way, but the nice thing about highlight.js is that language detection is considered a feature and is used for testing purposes.
UPDATE: I tried this and it didn't work that well. Compressed JavaScript completely confused it, i.e. the tokenizer is whitespace sensitive. Generally, just counting highlight hits does not seem very reliable. A stronger parser, or perhaps unmatched section counts, might work better.
First, I would try to find the specific keywords of a language, e.g.
"package, class, implements "=> JAVA
"<?php " => PHP
"include main fopen strcmp stdout "=>C
"cout"=> C++
etc...
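A minimal sketch of that keyword-signature idea, using hand-picked lists like the ones above (the SIGNATURES table and guess_language are illustrative assumptions, not a tuned rule set):
# Guess a language by counting hits from hand-picked keyword lists.
SIGNATURES = {
  java: %w[package class implements public static void],
  php:  %w[<?php echo function],
  c:    %w[include main fopen strcmp printf],
  cpp:  %w[cout cin std vector]
}
def guess_language(snippet)
  # Pick the language whose signature words appear most often in the snippet
  SIGNATURES.max_by { |_lang, words| words.count { |w| snippet.include?(w) } }.first
end
p guess_language("#include <stdio.h>\nint main(void) { return 0; }")   # likely :c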
It's very hard and sometimes impossible. Which language is this short snippet from?
int i = 5;
int k = 0;
for (int j = 100 ; j > i ; i++) {
j = j + 1000 / i;
k = k + i * j;
}
(Hint: It could be any one out of several.)
You can try to analyze various languages and try to decide using frequency analysis of keywords. If certain sets of keywords occur with certain frequencies in a text, it's likely that the language is Java, etc. But I don't think you will get anything that is completely foolproof, since you could, for example, give a variable in C the same name as a keyword in Java, and the frequency analysis would be fooled.
If you take it up a notch in complexity you could look for structures, if a certain keyword always comes after another one, that will get you more clues. But it will also be much harder to design and implement.
It would depend on what type of snippet you have, but I would run it through a series of tokenizers and see which language's BNF it came up as valid against.
I needed this, so I created my own.
https://github.com/bertyhell/CodeClassifier
It's very easily extendable by adding a training file in the correct folder.
Written in C#, but I imagine the code is easily converted to any other language.
Best solution I have come across is using the linguist gem in a Ruby on Rails app. It's kind of a specific way to do it, but it works. This was mentioned above by @nisc, but I will tell you my exact steps for using it. (Some of the following command-line commands are specific to Ubuntu but should be easily translated to other OSes.)
If you have any Rails app that you don't mind temporarily messing with, create a new file in it to insert your code snippet in question. (If you don't have Rails installed, there's a good guide here, although for Ubuntu I recommend this. Then run rails new <name-your-app-dir> and cd into that directory. Everything you need to run a Rails app is already there.)
After you have a rails app to use this with, add gem 'github-linguist' to your Gemfile (literally just called Gemfile in your app directory, no ext).
Then install ruby-dev (sudo apt-get install ruby-dev)
Then install cmake (sudo apt-get install cmake)
Now you can run gem install github-linguist (if you get an error that says icu required, do sudo apt-get install libicu-dev and try again)
(You may need to do a sudo apt-get update or sudo apt-get install make or sudo apt-get install build-essential if the above did not work)
Now everything is set up. You can now use this any time you want to check code snippets. In a text editor, open the file you've made to insert your code snippet (let's just say it's app/test.tpl, but if you know the extension of your snippet, use that instead of .tpl; if you don't know the extension, don't use one). Now paste your code snippet in this file. Go to the command line and run bundle install (you must be in your application's directory). Then run linguist app/test.tpl (more generally, linguist <path-to-code-snippet-file>). It will tell you the type, MIME type, and language. For multiple files (or for general use with a Ruby/Rails app) you can run bundle exec linguist --breakdown in your application's directory.
It seems like a lot of extra work, especially if you don't already have rails, but you don't actually need to know ANYTHING about rails if you follow these steps and I just really haven't found a better way to detect the language of a file/code snippet.
This site seems to be pretty good at identifying languages, if you want a quick way to paste a snippet into a web form, rather than doing it programmatically: http://dpaste.com/
Nice puzzle.
I think it is impossible to detect all languages. But you could trigger on key tokens (certain reserved words and often-used character combinations).
But there are a lot of languages with similar syntax. So it depends on the size of the snippet.
Prettify is a JavaScript package that does an okay job of detecting programming languages:
http://code.google.com/p/google-code-prettify/
It is mainly a syntax highlighter, but there is probably a way to extract the detection part for the purposes of detecting the language from a snippet.
I wouldn't think there would be an easy way of accomplishing this. I would probably generate lists of symbols/common keywords unique to certain languages or classes of languages (e.g. curly brackets for C-style languages, the Dim and Sub keywords for BASIC languages, the def keyword for Python, the let keyword for functional languages). You then might be able to use basic syntax features to narrow it down even further.
I think the biggest distinction between languages is their structure. So my idea would be to look at certain common elements across all languages and see how they differ. For example, you could use regexes to pick out things such as:
function definitions
variable declarations
class declarations
comments
for loops
while loops
print statements
And maybe a few other things that most languages should have. Then use a point system. Award at most 1 point for each element if the regex is found. Obviously, some languages will use the exact same syntax (for loops are often written like for(int i=0; i<x; ++i)), so multiple languages could each score a point for the same thing, but at least you're reducing the likelihood of it being an entirely different language. Some of them might score 0 across the board (the snippet doesn't contain a function at all, for example), but that's perfectly fine.
Combine this with Jules' solution, and it should work pretty well. Maybe also look for frequencies of keywords for an extra point.
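A minimal sketch of such a point system (the per-language patterns below are rough, illustrative assumptions, not tuned regexes):
# Score each language by how many of its structural patterns the snippet matches;
# each pattern contributes at most one point.
PATTERNS = {
  ruby:   { function: /\bdef\s+\w+/, block_end: /\bend\b/, comment: /#[^\n]*/ },
  python: { function: /\bdef\s+\w+\(.*\):/, for_loop: /\bfor\s+\w+\s+in\b/, comment: /#[^\n]*/ },
  c_like: { function: /\w+\s+\w+\s*\([^)]*\)\s*\{/, for_loop: /\bfor\s*\(.*;.*;.*\)/, comment: %r{//|/\*} }
}
def scores(snippet)
  PATTERNS.transform_values { |pats| pats.count { |_name, re| snippet =~ re } }
end
snippet = "def foo\n  puts 'hi'\nend\n"
p scores(snippet)                          # e.g. {:ruby=>2, :python=>0, :c_like=>0}
p scores(snippet).max_by { |_lang, s| s }  # the highest-scoring language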
Interesting. I have a similar task: recognizing text in different formats. YAML, JSON, XML, or Java properties? Even with syntax errors, for example, I should be able to tell JSON apart from XML with confidence.
I figure that how we model the problem is critical. As Mark said, single-word tokenization is necessary but likely not enough. We will need bigrams, or even trigrams. But I think we can go further, knowing that we are looking at programming languages. I notice that almost any programming language has two unique types of tokens: symbols and keywords. Symbols are relatively easy to recognize (some symbols might be literals that are not part of the language). Then bigrams or trigrams of symbols will pick up unique syntax structures around symbols. Keywords are another easy target if the training set is big and diverse enough. A useful feature could be bigrams around possible keywords. Another interesting type of token is whitespace. If we tokenize in the usual way by whitespace, we will lose this information. I'd say that, for analyzing programming languages, we should keep the whitespace tokens, as they may carry useful information about the syntax structure (a tokenization sketch follows after the next paragraph).
Finally, if I chose a classifier like random forest, I would crawl GitHub and gather all the public source code. Most of the source code files can be labeled by file suffix. For each file, I would randomly split it at empty lines into snippets of various sizes. I would then extract the features and train the classifier using the labeled snippets. After training is done, the classifier can be tested for precision and recall.
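A minimal sketch of that kind of tokenization, keeping identifiers, single symbol characters, and runs of whitespace as tokens, then emitting bigrams as features (the token regex is a rough assumption):
# Split code into identifier, whitespace, and symbol tokens, then pair them up.
def tokens(code)
  code.scan(/[A-Za-z_]\w*|\s+|[^\sA-Za-z_]/)
end
def bigrams(code)
  tokens(code).each_cons(2).to_a
end
p bigrams("if (x == 1) {").first(5)
# => [["if", " "], [" ", "("], ["(", "x"], ["x", " "], [" ", "="]]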
I believe that there is no single solution that could possibly identify what language a snippet is in, just based upon that single snippet. Take the keyword print. It could appear in any number of languages, each of which is for different purposes and has different syntax.
I do have some advice. I'm currently writing a small piece of code for my website that can be used to identify programming languages. As most of the other posts point out, there is a huge range of programming languages that you simply haven't heard of; you can't account for them all.
What I have done is identify each language by a selection of keywords. For example, Python could be identified in a number of ways. It's probably easier if you pick 'traits' that are fairly certain to be unique to the language. For Python, I chose the trait of using colons to start a set of statements, which I believe is a fairly unique trait (correct me if I'm wrong).
If, in my example, you can't find a colon to start a statement set, then move on to another possible trait, let's say using the def keyword to define a function. Now this can cause some problems, because Ruby also uses the keyword def to define a function. The key to telling the two (Python and Ruby) apart is to use various levels of filtering to get the best match. Ruby uses the keyword end to finish a function, whereas Python doesn't have anything to finish a function, just a de-indent, but you don't want to go there. But again, end could also be Lua, yet another programming language to add to the mix.
You can see that programming languages simply overlap too much. A keyword in one language can happen to be a keyword in another. Using a combination of keywords that often go together, like Java's public static void main(String[] args), helps to eliminate those problems.
Like I've already said, your best chance is looking for relatively unique keywords or sets of keywords to separate one from the other. And, if you get it wrong, at least you had a go.
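As a minimal sketch of that cascading check (the traits and the guess function are illustrative assumptions, not a complete rule set):
# Check the most distinctive trait first and fall back to weaker ones.
def guess(snippet)
  return :python if snippet =~ /^\s*(def|if|for|while|class)\b.*:\s*$/   # block headers end with a colon
  return :ruby   if snippet =~ /\bdef\s+\w+/ && snippet =~ /\bend\b/     # def ... end pairs
  return :lua    if snippet =~ /\bfunction\b/ && snippet =~ /\bend\b/    # function ... end pairs
  :unknown
end
p guess("def foo\n  puts 'hi'\nend")     # => :ruby
p guess("def foo():\n    print('hi')")   # => :python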

Resources