Fixing English text with no spaces - text

I have a lot of lines of English text with mostly no spaces between the words. The text is normal English from 19th century historical records. I can look at the text and add spaces, but it is very time consuming, not to mention boring. Is there a "simple" script or program that could work out where to put the spaces? For some definition of "simple"? Clearly it would need a dictionary. I would prefer a script language I could adjust a bit and hopefully it would run on linux/BSD/MacOS.

This is (close to) impossible. You need something that can understand the meaning of the text.
Remember the old joke about expertsexchange.com, the domain name of Stack Overflow's "competitor", Experts Exchange … or is it Expert Sex Change? You cannot correctly place the spaces without knowing and understanding the context. Note that both have the same grammatical function, so even grammatical analysis cannot tell you which is correct.
There is nothing "simple" about this, you are in territory that could earn you a degree in Artificial Intelligence and Natural Language Processing if you get it right. At least it will be publishable in a reputable top-tier scientific journal.

Related

Detecting the start and end of dialog sections in prose

I have looked through a lot of the open source NLP tools (OpenNLP primarily) and I do not see anything which automates the task of detecting the start and end of dialog.
The sentence detection tools find the boundaries of the full sentence. The tokenizers accurately tokenize the punctuation, but still don't detect start and end. I've read many scholarly articles (such as) where dialog detection is assumed. But I don't see any tools which automate this as general purpose dialog detection.
For instance, text like this:
"I am happy," she said.
Should have "I am happy," defined as dialog. Text like this:
"This is a really long piece of dialog spoken by a character.
"That spans across multiple paragraphs."
Should have the whole thing identified as dialog (even though the end of the first paragraph is missing the closing quotation mark). Also there are weirder ways of specifying dialog. Such as with dashes:
They were walking when Joe spoke up.
--I really like walking.
Plus, often internal dialog will be denoted with italics, such as:
Joe walked down the street. *I really hope I don't get hit by a bus.*
Is there an NLP tool that can detect dialog sections like this? Or a way to do this with OpenNLP that I just missed?
I'm not aware of any tool that does this, out of the box, domain-independent. Probably for specific domains you could either train something. In a call transcript for example, it's quite likely that you have an A-B-A-B (etc.) structure, where two people take turns in talking. But when more people are involved in a dialog, things get a lot more complicated. Also, whether you can do this with orthographic features (like single/double quotes) or not, also depends on whether the people who constructed your text/corpus bothered to do this in a neat and consistent way or not.
I recently stumbled upon a tool that does discourse parsing: http://alt.qcri.org/tools/discourse-parser/
This provides you with something called a rhetorical structure tree, which is a representation of the input document that clarifies which sentence has which relation to another sentence. I haven't tried it for dialogs and have no idea about performance there. But it is/relies on a somewhat more semantically aware way of cutting up texts in pieces. Maybe that could help you. The tool is not as user-friendly as the corenlp/opennlp bunch though, and it requires (at least it did for me) quite some fiddling around to get up and running.
Anyway; probably (way) too much information, short answer; as far as I know, there is no easy to implement and use tool for this.
After some searching, it looks like the Stanford NLP tools have a "QuoteAnnotator" that is exactly what I am looking for.

How do I pad a Unicode string to a specific visible length?

I'd like to create a left pad function in a programming language. The function pads a string with leading characters to a specified total length. Strings are UTF-16 encoded in this language.
There are a few things in Unicode that make it complicated:
Surrogates: 2 surrogate characters = 1 unicode character
Combining characters: 1 non-combining character + any number of combining characters = 1 visible character
Invisible characters: 1 invisible character = 0 visible characters
What other factors have to be taken into consideration, and how would they be dealt with?
When you’re first starting out trying to understand something, it’s really frustrating. We’ve all been there. But while it’s very easy to call it stupid and everyone who made it stupid, you’re not going to get very far doing that. With an attitude like that, you’re implying that people who do understand it are also stupid for wasting their time on something so obviously stupid. After calling the people who do understand it stupid, it’s extremely unlikely that anyone who does understand it will take the time to explain it to you.
I understand the frustration. Unicode’s really complicated and it was a huge pain for me before I understood it and it’s still a pain for a lot of things I don’t have experience with. But the reason it’s so complicated isn’t because the people who made it were stupid and trying to ruin your life. It’s complicated because it attempts to provide a standard way of representing every human writing system ever used. Writing systems are insanely complicated, and throughout history developing a new and different writing system has been a fairly standard part of identifying yourself as a different culture from the people across the river or over the next mountain range. You yourself start off by identifying yourself as Hungarian based on the language you speak. Having once tried to pronounce a Hungarian professor’s name, I know that Hungarian is very complicated compared to English, just as English is very complicated compared to Hungarian. How would you feel if I was having trouble with Hungarian and asked you, “Boy, Hungarian sure is a stupid language! It must have been designed by idiots! By the way, how do I pronounce this word??”
There’s just no simple way to express something that’s inherently complicated in a very simple way. Human writing systems are inherently complicated and intentionally different from each other. As complicated as Unicode is, it’s better than what people had to do before, when instead of one single complicated standard there were multiple complicated standards in every country and you’d have to understand all of the different ‘standards.’
I’m not sure what your general life strategy is, but what I usually do when I don’t understand something is to pick up a few textbooks on the topic, read the textbooks through, and work out the examples. A good textbook will not only tell you how things are and what you need to do, but also how they go to be that way and why you need to do what you need to do.
I found Unicode Demysitifed to be an excellent book, and the newer book Unicode Explained has even higher ratings on amazon.

unicode characters

In my application I have unicode strings, I need to tell in which language the string is in,
I want to do it by narrowing list of possible languages by determining in which range the characters of string are.
Ranges I have from http://jrgraphix.net/research/unicode_blocks.php
And possible languages from http://unicode-table.com/en/
The problem is that algorithm has to detect all languages, does someone know more wide mapping of unicode ranges to languages ?
Thanks
Wojciech
This is not really possible, for a couple of reasons:
Many languages share the same writing system. Look at English and Dutch, for example. Both use the Basic Latin alphabet. By only looking at the range of code points, you simply cannot distinguish between them.
Some languages use more characters, but there is no guarantee that a
specific piece of text contains them. German, for example, uses the
Basic Latin alphabet plus "ä", "ö", "ü" and "ß". While these letters
are not particularly rare, you can easily create whole sentences
without them. So, a short text might not contain them. Thus, again,
looking at code points alone is not enough.
Text is not always "pure". An English text may contain French letters
because of a French loanword (e.g. "déjà vu"). Or it may contain
foreign words, because the text is talking about foreign things (e.g.
"Götterdämmerung is an opera by Richard Wagner...", or "The Great
Wall of China (万里长城) is..."). Looking at code points alone would be
misleading.
To sum up, no, you cannot reliably map code point ranges to languages.
What you could do: Count how often each character appears in the text and heuristically compare with statistics about known languages. Or analyse word structures, e.g. with Markov chains. Or search for the words in dictionaries (taking inflection, composition etc. into account). Or a combination of these.
But this is hard and a lot of work. You should rather use an existing solution, such as those recommended by deceze and Esailija.
I like the suggestion of using something like google translate -- as they will be doing all the work for you.
You might be able to build a rule-based system that gets you part of the way there. Build heuristic rules for languages and see if that is sufficient. Certain Tibetan characters do indicate Tibetan, and there are unique characters in many languages that will be a give away. But as the other answer pointed out, a limited sample of text may not be that accurate, as you may not have a clear indicator.
Languages will however differ in the frequencies that each character appears, so you could have a basic fingerprint of each language you need to classify and make guesses based on letter frequency. This probably goes a bit further than a rule-based system. Probably a good tool to build this would be a text classification algorithm, which will do all the analysis for you. You would train an algorithm on different languages, instead of having to articulate the actual rules yourself.
A much more sophisticated version of this is presumably what Google does.

For what reasons do some programmers vehemently hate languages where whitespace matters (e.g. Python)? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
C++ is my first language, and as such I'm used to whitespace being ignored. However, I've been toying around with Python, and I don't find it too hard to get used to the whitespace rules. It seems, however, that a lot of programmers on the Internet can't get past the whitespace rules. From what I've seen, peoples' C++ programs tend to be formatted very consistently with respect to whitespace (or else it's pretty hard to read), so why do some people have such a problem with whitespace-based languages like Python?
It violates the Principle of Least Astonishment, because we have it ingrained in ourselves (whether for good or bad) that whitespace Does Not Matter in a programming language. Whitespace is one of those issues that has been left up to personal style.
I still have bad memories back from being a student of learning the hard way that 8 spaces is not equivalent to a tab in a Makefile... Ah, the sleep I lost...
The only valid reason I have come across is that refactoring using cut-and-paste (not copy) without refactoring tools (or syntax-aware cut-andpaste), can end up changing semantics if an easy mistake is made.
There are several different types of whitespace (spaces, tabs, weird unicode characters, carriage returns, line breaks, etc.), they aren't necessarily visually distinct, and languages and editors may treat them capriciously. This isn't an argument against well-designed whitespace semantics, but many people are against all forms of it simply because of the possibility of poor design.
People hate it because it violates common sense. Not a single one of the replies I have read here decided that it was ok to simply forgo periods and other punctuations. In fact the grammar has been very good. If the nonsense about indentation actually carrying the meaning were true we would all just forget about using punctuations entirely.
No one learned that newlines terminate a sentence in a horizontal language like English, instead we learned to infer when a sentence ended regardless of whether or not the punctuation was present or not.
The same is true for programming languages, especially for those of us who started out with a programming language that did use explicit block termination. You learn to infer where a block starts and stops over time, it does not mean that the spacing did that for you, the semantics of the language itself did.
Most literate people would have no problem understanding posts without punctuations. Having to rely on what is a representation of the absence of a character is not a good idea. Do any of you count from zero when you make your to-do list?
Alright, this is a very narrow perspective, but I haven't seen it mentioned elsewhere: keeping track of white space is a hassle if you are trying to autogenerate a script.
When I first encountered Python, I don't remember the details, but I had developed a Windows tool with a GUI that allowed novice users to configure several settings, and then press OK. The output of the tool was a script, which the user could copy to a Unix machine, and then execute it there to do something or other that was too complicated or tedious for them to do manually. Since nobody maintained the generated scripts, there was no reason they needed to look nice. So, keeping track of indentation seemed like an unnecessary burden from that perspective.
For most purposes, though, I find that Python is much easier than any other language.
Perhaps your C++ background (and thus who your peers are) is clouding your perception of this (ie selective sampling) but in my experience the reaction to Python's "white space is intent" meme is anywhere from ambivalent to they absolutely love it. The reason a lot of people love it is that it forces people to format their code.
I can't say I've ever met anyone who "hates" it because hating it is much like hating the idea of well-formatted code.
Edit: let me put this in some perspective.
In the Java world there are two main methods of packaging and deploying Web apps: Ant and Maven.
Ant is basically an XML-based Make facility that has tasks for the common things you do. It's a blank slate, which is powerful, but it also means you have to write a lot of common things yourself and every installation is free to do things slightly differently. All of this is well-intentioned but can make it hard to figure out someone's Ant scripts.
Maven is far more fully features. It has archetypes, which are basically project types. Depending on which archetype(s) you use, you won't have to write any tasks to start, stop, clean, build, etc but you will have a mandated directory structure, which is quite deep.
The advantage of that is if you've seen one Maven Web app you've seen them all. You know the commands. You know the structure. That's extremely useful.
But you have people who absolutely hate Maven and I think it comes down to this: they don't like giving up control, even when it's ultimately in their interest to do so. Also, you'll find a certain brand of person who thinks that their use case is a justifiable exception. You see this personality trait a lot. For example, I think an old Joel post mentioned a story where someone wanted to use "enter" to go from the username to password form fields even though the convention was that enter executed the default action (usually "OK") so they had to write a custom dialog class for Windows for this.
Basically some people just don't like being told what to do and others are completely obstinate in their belief that they're right even when all evidence points to the contrary.
This probably explains why some supposedly hate Python's white space: they don't like being told how to format their code. They like the freedom of C/C++.
Because change is scary. And maybe, among certain developers, there are some faint memories of languages with capricious rules about whitespacing that were hard to remember and arbitrary, meant more for compiler convenience than expressiveness.
Most likely, not giving whitespace-significance a fair shake before dismissing it is the real reason. Ask someone to fix a bug in a reasonably complex but well-written Python program, then ask them to go fix a bug in a 20 year old system in C, VB or Cobol and ask them which they prefer.
As for me, I have as much trouble with whitespace in Python or Boo as I have with parentheses in Lisp. Which is to say, none.
They will have to get used to it. Initially I had a problem my self trying to read some examples but after using language for some time I started liking it.
I believe it is a habit that people has to overcome.
Some have developed habits (for example: deeply nested loops, unnecessarily large functions) that they perceive would be hard to support in a whitespace sensitive language.
Some have developed an aesthetic dislike for hanging indents.
Because they are used to languages like C and JavaScript where they can align items as they please.
When it comes to Python, you have to indent code based on its context:
def Print():
ManyArgumentFunction(LongParam1,LongParam2,LongParam3,LongParam4...
In C, you could do:
void Print()
{
ManyArgumentFunction(LongParam1,
LongParam2,
LongParam3,...
}
The only complaints I (also of C++ background) have heard about Python are from people who don't like using the "Replace Tabs with Space" option in their IDE.

What is the worst programming language you ever worked with? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 13 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
If you have an interesting story to
share, please post an answer, but
do not abuse this question for bashing
a language.
We are programmers, and our primary tool is the programming language we use.
While there is a lot of discussion about the best one, I'd like to hear your stories about
the worst programming languages you ever worked with and I'd like to know exactly what annoyed you.
I'd like to collect this stories partly to avoid common pitfalls while designing a language (especially a DSL) and partly to avoid quirky languages in the future in general.
This question is not subjective. If a language supports only single character identifiers (see my own answer) this is bad in a non-debatable way.
EDIT
Some people have raised concerns that this question attracts trolls.
Wading through all your answers made one thing clear.
The large majority of answers is appropriate, useful and well written.
UPDATE 2009-07-01 19:15 GMT
The language overview is now complete, covering 103 different languages from 102 answers.
I decided to be lax about what counts as a programming language and included
anything reasonable. Thank you David for your comments on this.
Here are all programming languages covered so far
(alphabetical order, linked with answer, new entries in bold):
ABAP,
all 20th century languages,
all drag and drop languages,
all proprietary languages,
APF,
APL
(1),
AS400,
Authorware,
Autohotkey,
BancaStar,
BASIC,
Bourne Shell,
Brainfuck,
C++,
Centura Team Developer,
Cobol
(1),
Cold Fusion,
Coldfusion,
CRM114,
Crystal Syntax,
CSS,
Dataflex 2.3,
DB/c DX,
dbase II,
DCL,
Delphi IDE,
Doors DXL,
DOS batch
(1),
Excel Macro language,
FileMaker,
FOCUS,
Forth,
FORTRAN,
FORTRAN 77,
HTML,
Illustra web blade,
Informix 4th Generation Language,
Informix Universal Server web blade,
INTERCAL,
Java,
JavaScript
(1),
JCL
(1),
karol,
LabTalk,
Labview,
Lingo,
LISP,
Logo,
LOLCODE,
LotusScript,
m4,
Magic II,
Makefiles,
MapBasic,
MaxScript,
Meditech Magic,
MEL,
mIRC Script,
MS Access,
MUMPS,
Oberon,
object extensions to C,
Objective-C,
OPS5,
Oz,
Perl
(1),
PHP,
PL/SQL,
PowerDynamo,
PROGRESS 4GL,
prova,
PS-FOCUS,
Python,
Regular Expressions,
RPG,
RPG II,
Scheme,
ScriptMaker,
sendmail.conf,
Smalltalk,
Smalltalk ,
SNOBOL,
SpeedScript,
Sybase PowerBuilder,
Symbian C++,
System RPL,
TCL,
TECO,
The Visual Software Environment,
Tiny praat,
TransCAD,
troff,
uBasic,
VB6
(1),
VBScript
(1),
VDF4,
Vimscript,
Visual Basic
(1),
Visual C++,
Visual Foxpro,
VSE,
Webspeed,
XSLT
The answers covering 80386 assembler, VB6 and VBScript have been removed.
PHP (In no particular order)
Inconsistent function names and argument orders
Because there are a zillion functions, each one of which seems to use a different naming convention and argument order. "Lets see... is it foo_bar or foobar or fooBar... and is it needle, haystack or haystack, needle?" The PHP string functions are a perfect example of this. Half of them use str_foo and the other half use strfoo.
Non-standard date format characters
Take j for example
In UNIX (which, by the way, is what everyone else uses as a guide for date string formats) %j returns the day of the year with leading zeros.
In PHP's date function j returns the day of the month without leading zeros.
Still No Support for Apache 2.0 MPM
It's not recommended.
Why isn't this supported? "When you make the underlying framework more complex by not having completely separate execution threads, completely separate memory segments and a strong sandbox for each request to play in, feet of clay are introduced into PHP's system." Link So... it's not supported 'cause it makes things harder? 'Cause only the things that are easy are worth doing right? (To be fair, as Emil H pointed out, this is generally attributed to bad 3rd-party libs not being thread-safe, whereas the core of PHP is.)
No native Unicode support
Native Unicode support is slated for PHP6
I'm sure glad that we haven't lived in a global environment where we might have need to speak to people in other languages for the past, oh 18 years. Oh wait. (To be fair, the fact that everything doesn't use Unicode in this day and age really annoys me. My point is I shouldn't have to do any extra work to make Unicode happen. This isn't only a PHP problem.)
I have other beefs with the language. These are just some.
Jeff Atwood has an old post about why PHP sucks. He also says it doesn't matter. I don't agree but there we are.
XSLT.
XSLT is baffling, to begin with. The metaphor is completely different from anything else I know.
The thing was designed by a committee so deep in angle brackets that it comes off as a bizarre frankenstein.
The weird incantations required to specify the output format.
The built-in, invisible rules.
The odd bolt-on stuff, like scripts.
The dependency on XPath.
The tools support has been pretty slim, until lately. Debugging XSLT in the early days was an exercise in navigating in complete darkness. The tools change that but, still XSLT tops my list.
XSLT is weird enough that most people just ignore it. If you must use it, you need an XSLT Shaman to give you the magic incantations to make things go.
DOS Batch files. Not sure if this qualifies as programming language at all.
It's not that you can't solve your problems, but if you are used to bash...
Just my two cents.
Not sure if its a true language, but I hate Makefiles.
Makefiles have meaningful differences between space and TAB, so even if two lines appear identical, they do not run the same.
Make also relies on a complex set of implicit rules for many languages, which are difficult to learn, but then are frequently overridden by the make file.
A Makefile system is typically spread over many, many files, across many directories.
With virtually no scoping or abstraction, a change to a make file several directories away can prevent my source from building. Yet the error message is invariably a compliation error, not a meaningful error about make, or the makefiles.
Any environment I've worked in that uses makefiles successfully has a full-time Make expert. And all this to shave a few minutes off compilation??
The worse language I've ever seen come from the tool praat, which is a good audio analysis tool. It does a pretty good job until you use the script language. sigh bad memories.
Tiny praat script tutorial for beginners
Function call
We've listed at least 3 different function calling syntax :
The regular one
string = selected("Strings")
Nothing special here, you assign to the variable string the result of the selected function. Not really scary... yet.
The "I'm invoking some GUI command with parameters"
Create Strings as file list... liste 'path$'/'type$'
As you can see, the function name start at "Create" and finish with the "...". The command "Create Strings as file list" is the text displayed on a button or a menu (I'm to scared to check) on praat. This command take 2 parameters liste and an expression. I'm going to look deeper in the expression 'path$'/'type$'
Hmm. Yep. No spaces. If spaces were introduced, it would be separate arguments. As you can imagine, parenthesis don't work. At this point of the description I would like to point out the suffix of the variable names. I won't develop it in this paragraph, I'm just teasing.
The "Oh, but I want to get the result of the GUI command in my variable"
noliftt = Get number of strings
Yes we can see a pattern here, long and weird function name, it must be a GUI calling. But there's no '...' so no parameters. I don't want to see what the parser looks like.
The incredible type system (AKA Haskell and OCaml, praat is coming to you)
Simple natives types
windowname$ = left$(line$,length(line$)-4)
So, what's going on there?
It's now time to look at the convention and types of expression, so here we got :
left$ :: (String, Int) -> String
lenght :: (String) -> Int
windowname$ :: String
line$ :: String
As you can see, variable name and function names are suffixed with their type or return type. If their suffix is a '$', then it return a string or is a string. If there is nothing it's a number. I can see the point of prefixing the type to a variable to ease implementation, but to suffix, no sorry, I can't
Array type
To show the array type, let me introduce a 'tiny' loop :
for i from 1 to 4
Select... time time
bandwidth'i'$ = Get bandwidth... i
forhertz'i'$ = Get formant... i
endfor
We got i which is a number and... (no it's not a function)
bandwidth'i'$
What it does is create string variables : bandwidth1$, bandwidth2$, bandwidth3$, bandwidth4$ and give them values. As you can expect, you can't create two dimensional array this way, you must do something like that :
band2D__'i'__'j'$
The special string invocation
outline$ = "'time'#F'i':'forhertznum'Hz,'bandnum'Hz, 'spec''newline$'"
outline$ >> 'outfile$'
Strings are weirdly (at least) handled in the language. the '' is used to call the value of a variable inside the global "" string. This is _weird_. It goes against all the convention built into many languages from bash to PHP passing by the powershell. And look, it even got redirection. Don't be fooled, it doesn't work like in your beloved shell. No you have to get the variable value with the ''
Da Wonderderderfulful execution model
I'm going to put the final touch to this wonderderderfulful presentation by talking to you about the execution model. So as in every procedural languages you got instruction executed from top to bottom, there is the variables and the praat GUI. That is you code everything on the praat gui, you invoke commands written on menu/buttons.
The main window of praat contain a list of items which can be :
files
list of files (created by a function with a wonderderfulful long long name)
Spectrogramm
Strings (don't ask)
So if you want to perform operation on a given file, you must select the file in the list programmatically and then push the different buttons to take some actions. If you wanted to pass parameters to a GUI action, you have to follow the GUI layout of the form for your arguments, for example "To Spectrogram... 0.005 5000 0.002 20 Gaussian
" is like that because it follows this layout:
Needless to say, my nightmares are filled with praat scripts dancing around me and shouting "DEBUG MEEEE!!".
More information at the praat site, under the well-named section "easy programmable scripting language"
Well since this question refuses to die and since the OP did prod me into answering...
I humbly proffer for your consideration Authorware (AW) as the worst language it is possible to create. (n.b. I'm going off recollection here, it's been ~6 years since I used AW, which of course means there's a number of awful things I can't even remember)
the horror, the horror http://img.brothersoft.com/screenshots/softimage/a/adobe_authorware-67096-1.jpeg
Let's start with the fact that it's a Macromedia product (-10 points), a proprietary language (-50 more) primarily intended for creating e-learning software and moreover software that could be created by non-programmers and programmers alike implemented as an iconic language AND a text language (-100).
Now if that last statement didn't scare you then you haven't had to fix WYSIWYG generated code before (hello Dreamweaver and Frontpage devs!), but the salient point is that AW had a library of about 12 or so elements which could be dragged into a flow. Like "Page" elements, Animations, IFELSE, and GOTO (-100). Of course removing objects from the flow created any number of broken connections and artifacts which the IDE had variable levels of success coping with. Naturally the built in wizards (-10) were a major source of these.
Fortunately you could always step into a code view, and eventually you'd have to because with a limited set of iconic elements some things just weren't possible otherwise. The language itself was based on TUTOR (-50) - a candidate for worst language itself if only it had the ambition and scope to reach the depths AW would strive for - about which wikipedia says:
...the TUTOR language was not easy to
learn. In fact, it was even suggested
that several years of experience with
the language would be required before
programmers could build programs worth
keeping.
An excellent foundation then, which was built upon in the years before the rise of the internet with exactly nothing. Absolutely no form of data structure beyond an array (-100), certainly no sugar (real men don't use switch statements?) (-10), and a large splash of syntactic vinegar ("--" was the comment indicator so no decrement operator for you!) (-10). Language reference documentation was provided in paper or zip file formats (-100), but at least you had the support of the developer run usegroup and could quickly establish the solution to your problem was to use the DLL or SWF importing features of AW to enable you to do the actual coding in a real language.
AW was driven by a flow (with necessary PAUSE commands) and therefore has all the attendant problems of a linear rather than event based system (-50), and despite the outright marketing lies of the documentation it was not object oriented (-50) either. All code reuse was achieved through GOTO. No scope, lots of globals (-50).
It's not the language's fault directly, but obviously no source control integration was possible, and certainly no TDD, documentation generation or any other add-on tool you might like.
Of course Macromedia met the challenge of the internet head on with a stubborn refusal to engage for years, eventually producing the buggy, hard to use, security nightmare which is Shockwave (-100) to essentially serialise desktop versions of the software through a required plugin (-10). AS HTML rose so did AW stagnate, still persisting with it's shockwave delivery even in the face of IEEE SCORM javascript standards.
Ultimately after years of begging and promises Macromedia announced a radical new version of AW in development to address these issues, and a few years later offshored the development and then cancelled the project. Although of course Macromedia are still selling it (EVIL BONUS -500).
If anything else needs to be said, this is the language which allows spaces in variable names (-10000).
If you ever want to experience true pain, try reading somebody else's uncommented hungarian notation in a language which isn't case sensitive and allows variable name spaces.
Total Annakata Arbitrary Score (AAS): -11300
Adjusted for personal experience: OutOfRangeException
(apologies for length, but it was cathartic)
Seriously: Perl.
It's just a pain in the ass to code with for beginners and even for semi-professionals which work with perl on a daily basis. I can constantly see my colleagues struggle with the language, building the worst scripts, like 2000 lines with no regard of any well accepted coding standard. It's the worst mess i've ever seen in programming.
Now, you can always say, that those people are bad in coding (despite the fact that some of them have used perl for a lot of years, now), but the language just encourages all that freaking shit that makes me scream when i have to read a script by some other guy.
MS Access Visual Basic for Applications (VBA) was also pretty bad. Access was bad altogether in that it forced you down a weak paradigm and was deceptively simple to get started, but a nightmare to finish.
No answer about Cobol yet? :O
Old-skool BASICs with line numbers would be my choice. When you had no space between line numbers to add new lines, you had to run a renumber utility, which caused you to lose any mental anchors you had to what was where.
As a result, you ended up squeezing in too many statements on a single line (separated by colons), or you did a goto or gosub somewhere else to do the work you couldn't cram in.
MUMPS
I worked in it for a couple years, but have done a complete brain dump since then. All I can really remember was no documentation (at my location) and cryptic commands.
It was horrible. Horrible! HORRIBLE!!!
There are just two kinds of languages: the ones everybody complains about and the ones nobody uses.
Bjarne Stroustrup
I haven't yet worked with many languages and deal mostly with scripting languages; out of these VBScript is the one I like least. Although it has some handy features, some things really piss me off:
Object assignments are made using the Set keyword:
Set foo = Nothing
Omitting Set is one of the most common causes of run-time errors.
No such thing as structured exception handling. Error checking is like this:
On Error Resume Next
' Do something
If Err.Number <> 0
' Handle error
Err.Clear
End If
' And so on
Enclosing the procedure call parameters in parentheses requires using the Call keyword:
Call Foo (a, b)
Its English-like syntax is way too verbose. (I'm a fan of curly braces.)
Logical operators are long-circuit. If you need to test a compound condition where the subsequent condition relies on the success of the previous one, you need to put conditions into separate If statements.
Lack of parameterized class constructors.
To wrap a statement into several lines, you have to use an underscore:
str = "Hello, " & _
"world!"
Lack of multiline comments.
Edit: found this article: The Flangy Guide to Hating VBScript. The author sums up his complaints as "VBS isn't Python" :)
Objective-C.
The annotations are confusing, using brackets to call methods still does not compute in my brain, and what is worse is that all of the library functions from C are called using the standard operators in C, -> and ., and it seems like the only company that is driving this language is Apple.
I admit I have only used the language when programming for the iPhone (and looking into programming for OS X), but it feels as if C++ were merely forked, adding in annotations and forcing the implementation and the header files to be separate would make much more sense.
PROGRESS 4GL (apparently now known as "OpenEdge Advanced Business Language").
PROGRESS is both a language and a database system. The whole language is designed to make it easy to write crappy green-screen data-entry screens. (So start by imagining how well this translates to Windows.) Anything fancier than that, whether pretty screens, program logic, or batch processing... not so much.
I last used version 7, back in the late '90s, so it's vaguely possible that some of this is out-of-date, but I wouldn't bet on it.
It was originally designed for text-mode data-entry screens, so on Windows, all screen coordinates are in "character" units, which are some weird number of pixels wide and a different number of pixels high. But of course they default to a proportional font, so the number of "character units" doesn't correspond to the actual number of characters that will fit in a given space.
No classes or objects.
No language support for arrays or dynamic memory allocation. If you want something resembling an array, you create a temporary in-memory database table, define its schema, and then get a cursor on it. (I saw a bit of code from a later version, where they actually built and shipped a primitive object-oriented system on top of these in-memory tables. Scary.)
ISAM database access is built in. (But not SQL. Who needs it?) If you want to increment the Counter field in the current record in the State table, you just say State.Counter = State.Counter + 1. Which isn't so bad, except...
When you use a table directly in code, then behind the scenes, they create something resembling an invisible, magic local variable to hold the current cursor position in that table. They guess at which containing block this cursor will be scoped to. If you're not careful, your cursor will vanish when you exit a block, and reset itself later, with no warning. Or you'll start working with a table and find that you're not starting at the first record, because you're reusing the cursor from some other block (or even your own, because your scope was expanded when you didn't expect it).
Transactions operate on these wild-guess scopes. Are we having fun yet?
Everything can be abbreviated. For some of the offensively long keywords, this might not seem so bad at first. But if you have a variable named Index, you can refer to it as Index or as Ind or even as I. (Typos can have very interesting results.) And if you want to access a database field, not only can you abbreviate the field name, but you don't even have to qualify it with the table name; they'll guess the table too. For truly frightening results, combine this with:
Unless otherwise specified, they assume everything is a database access. If you access a variable you haven't declared yet (or, more likely, if you mistype the variable name), there's no compiler error: instead, it goes looking for a database field with that name... or a field that abbreviates to that name.
The guessing is the worst. Between the abbreviations and the field-by-default, you could get some nasty stuff if you weren't careful. (Forgot to declare I as a local variable before using it as a loop variable? No problem, we'll just randomly pick a table, grab its current record, and completely trash an arbitrarily-chosen field whose name starts with I!)
Then add in the fact that an accidental field-by-default access could change the scope it guessed for your tables, thus breaking some completely unrelated piece of code. Fun, yes?
They also have a reporting system built into the language, but I have apparently repressed all memories of it.
When I got another job working with Netscape LiveWire (an ill-fated attempt at server-side JavaScript) and classic ASP (VBScript), I was in heaven.
The worst language? BancStar, hands down.
3,000 predefined variables, all numbered, all global. No variable declaration, no initialization. Half of them, scattered over the range, reserved for system use, but you can use them at your peril. A hundred or so are automatically filled in as a result of various operations, and no list of which ones those are. They all fit in 38k bytes, and there is no protection whatsoever for buffer overflow. The system will cheerfully let users put 20 bytes in a ten byte field if you declared the length of an input field incorrectly. The effects are unpredictable, to say the least.
This is a language that will let you declare a calculated gosub or goto; due to its limitations, this is frequently necessary. Conditionals can be declared forward or reverse. Picture an "If" statement that terminates 20 lines before it begins.
The return stack is very shallow, (20 Gosubs or so) and since a user's press of any function key kicks off a different subroutine, you can overrun the stack easily. The designers thoughtfully included a "Clear Gosubs" command to nuke the stack completely in order to fix that problem and to make sure you would never know exactly what the program would do next.
There is much more. Tens of thousands of lines of this Lovecraftian horror.
The .bat files scripting language on DOS/Windows. God only knows how un-powerful is this one, specially if you compare it to the Unix shell languages (that aren't so powerful either, but way better nonetheless).
Just try to concatenate two strings or make a for loop. Nah.
VSE, The Visual Software Environment.
This is a language that a prof of mine (Dr. Henry Ledgard) tried to sell us on back in undergrad/grad school. (I don't feel bad about giving his name because, as far as I can tell, he's still a big proponent and would welcome the chance to convince some folks it's the best thing since sliced bread). When describing it to people, my best analogy is that it's sort of a bastard child of FORTRAN and COBOL, with some extra bad thrown in. From the only really accessible folder I've found with this material (there's lots more in there that I'm not going to link specifically here):
VSE Overview (pdf)
Chapter 3: The VSE Language (pdf) (Not really an overview of the language at all)
Appendix: On Strings and Characters (pdf)
The Software Survivors (pdf) (Fevered ramblings attempting to justify this turd)
VSE is built around what they call "The Separation Principle". The idea is that Data and Behavior must be completely segregated. Imagine C's requirement that all variables/data must be declared at the beginning of the function, except now move that declaration into a separate file that other functions can use as well. When other functions use it, they're using the same data, not a local copy of data with the same layout.
Why do things this way? We learn that from The Software Survivors that Variable Scope Rules Are Hard. I'd include a quote but, like most fools, it takes these guys forever to say anything. Search that PDF for "Quagmire Of Scope" and you'll discover some true enlightenment.
They go on to claim that this somehow makes it more suitable for multi-proc environments because it more closely models the underlying hardware implementation. Riiiight.
Another choice theme that comes up frequently:
INCREMENT DAY COUNT BY 7 (or DAY COUNT = DAY COUNT + 7)
DECREMENT TOTAL LOSS BY GROUND_LOSS
ADD 100.3 TO TOTAL LOSS(LINK_POINTER)
SET AIRCRAFT STATE TO ON_THE_GROUND
PERCENT BUSY = (TOTAL BUSY CALLS * 100)/TOTAL CALLS
Although not earthshaking, the style
of arithmetic reflects ordinary usage,
i.e., anyone can read and understand
it - without knowing a programming
language. In fact, VisiSoft arithmetic
is virtually identical to FORTRAN,
including embedded complex arithmetic.
This puts programmers concerned with
their professional status and
corresponding job security ill at
ease.
Ummm, not that concerned at all, really. One of the key selling points that Bill Cave uses to try to sell VSE is the democratization of programming so that business people don't need to indenture themselves to programmers who use crazy, arcane tools for the sole purpose of job security. He leverages this irrational fear to sell his tool. (And it works-- the federal gov't is his biggest customer). I counted 17 uses of the phrase "job security" in the document. Examples:
... and fit only for those desiring artificial job security.
More false job security?
Is job security dependent upon ensuring the other guy can't figure out what was done?
Is job security dependent upon complex code...?
One of the strongest forces affecting the acceptance of new technology is the perception of one's job security.
He uses this paranoia to drive wedge between the managers holding the purse strings and the technical people who have the knowledge to recognize VSE for the turd that it is. This is how he squeezes it into companies-- "Your technical people are only saying it sucks because they're afraid it will make them obsolete!"
A few additional choice quotes from the overview documentation:
Another consequence of this approach
is that data is mapped into memory
on a "What You See Is What You Get"
basis, and maintained throughout.
This allows users to move a complete
structure as a string of characters
into a template that descrives each
individual field. Multiple templates
can be redefined for a given storage
area. Unlike C and other languages,
substructures can be moved without the problems of misalignment due to
word boundary alignment standards.
Now, I don't know about you, but I know that a WYSIWYG approach to memory layout is at the top of my priority list when it comes to language choice! Basically, they ignore alignment issues because only old languages that were designed in the '60's and '70's care about word alignment. Or something like that. The reasoning is bogus. It made so little sense to me that I proceeded to forget it almost immediately.
There are no user-defined types in VSE. This is a far-reaching
decision that greatly simplifies the
language. The gain from a practical
point of view is also great. VSE
allows the designer and programmer to
organize a program along the same
lines as a physical system being
modeled. VSE allows structures to be
built in an easy-to-read, logical
attribute hierarchy.
Awesome! User-defined types are lame. Why would I want something like an InputMessage object when I can have:
LINKS_IN_USE INTEGER
INPUT_MESSAGE
1 ORIGIN INTEGER
1 DESTINATION INTEGER
1 MESSAGE
2 MESSAGE_HEADER CHAR 10
2 MESSAGE_BODY CHAR 24
2 MESSAGE_TRAILER CHAR 10
1 ARRIVAL_TIME INTEGER
1 DURATION INTEGER
1 TYPE CHAR 5
OUTPUT_MESSAGE CHARACTER 50
You might look at that and think, "Oh, that's pretty nicely formatted, if a bit old-school." Old-school is right. Whitespace is significant-- very significant. And redundant! The 1's must be in column 3. The 1 indicates that it's at the first level of the hierarchy. The Symbol name must be in column 5. You hierarchies are limited to a depth of 9.
Well, ok, but is that so awful? Just wait:
It is well known that for reading
text, use of conventional upper/lower
case is more readable. VSE uses all
upper case (except for comments). Why?
The literature in psychology is based
on prose. Programs, simply, are not
prose. Programs are more like math,
accounting, tables. Program fonts
(usually Courier) are almost
universally fixed-pitch, and for good
reason – vertical alignment among
related lines of code. Programs in
upper case are nicely readable, and,
after a time, much better in our
opinion
Nothing like enforcing your opinion at the language level! That's right, you cannot use any lower case in VSE unless it's in a comment. Just keep your CAPSLOCK on, it's gonna be stuck there for a while.
VSE subprocedures are called processes. This code sample contains three processes:
PROCESS_MUSIC
EXECUTE INITIALIZE_THE_SCENE
EXECUTE PROCESS_PANEL_WIDGET
INITIALIZE_THE_SCENE
SET TEST_BUTTON PANEL_BUTTON_STATUS TO ON
MOVE ' ' TO TEST_INPUT PANEL_INPUT_TEXT
DISPLAY PANEL PANEL_MUSIC
PROCESS_PANEL_WIDGET
ACCEPT PANEL PANEL_MUSIC
*** CHECK FOR BUTTON CLICK
IF RTG_PANEL_WIDGET_NAME IS EQUAL TO 'TEST_BUTTON'
MOVE 'I LIKE THE BEATLES!' TO TEST_INPUT PANEL_INPUT_TEXT.
DISPLAY PANEL PANEL_MUSIC
All caps as expected. After all, that's easier to read. Note the whitespace. It's significant again. All process names must start in column 0. The initial level of instructions must start on column 4. Deeper levels must be indented exactly 3 spaces. This isn't a big deal, though, because you aren't allowed to do things like nest conditionals. You want a nested conditional? Well just make another process and call it. And note the delicious COBOL-esque syntax!
You want loops? Easy:
EXECUTE NEXT_CALL
EXECUTE NEXT_CALL 5 TIMES
EXECUTE NEXT_CALL TOTAL CALL TIMES
EXECUTE NEXT_CALL UNTIL NO LINES ARE AVAILABLE
EXECUTE NEXT_CALL UNTIL CALLS_ANSWERED ARE EQUAL TO CALLS_WAITING
EXECUTE READ_MESSAGE UNTIL LEAD_CHARACTER IS A DELIMITER
Ugh.
Here is the contribution to my own question:
Origin LabTalk
My all-time favourite in this regard is Origin LabTalk.
In LabTalk the maximum length of a string variable identifier is one character.
That is, there are only 26 string variables at all. Even worse, some of them are used by Origin itself, and it is not clear which ones.
From the manual:
LabTalk uses the % notation to define
a string variable. A legal string
variable name must be a % character
followed by a single alphabetic
character (a letter from A to Z).
String variable names are
caseinsensitive. Of all the 26 string
variables that exist, Origin itself
uses 14.
Doors DXL
For me the second worst in my opinion is Doors DXL.
Programming languages can be divided into two groups:
Those with manual memory management (e.g. delete, free) and those with a garbage collector.
Some languages offer both, but DXL is probably the only language in the world that
supports neither. OK, to be honest this is only true for strings, but hey, strings aren't exactly
the most rarely used data type in requirements engineering software.
The consequence is that memory used by a string can never be reclaimed and
DOORS DXL leaks like sieve.
There are countless other quirks in DXL, just to name a few:
DXL function syntax
DXL arrays
Cold Fusion
I guess it's good for designers but as a programmer I always felt like one hand was tied behind my back.
The worst two languages I've worked with were APL, which is relatively well known for languages of its age, and TECO, the language in which the original Emacs was written. Both are notable for their terse, inscrutable syntax.
APL is an array processing language; it's extremely powerful, but nearly impossible to read, since every character is an operator, and many don't appear on standard keyboards.
TECO had a similar look, and for a similar reason. Most characters are operators, and this special purpose language was devoted to editing text files. It was a little better, since it used the standard character set. And it did have the ability to define functions, which was what gave life to emacs--people wrote macros, and only invoked those after a while. But figuring out what a program did or writing a new one was quite a challenge.
LOLCODE:
HAI
CAN HAS STDIO?
VISIBLE "HAI WORLD!"
KTHXBYE
Seriously, the worst programming language ever is that of Makefiles. Totally incomprehensible, tabs have a syntactic meaning and not even a debugger to find out what's going on.
I'm not sure if you meant to include scripting languages, but I've seen TCL (which is also annoying), but... the mIRC scripting language annoys me to no end.
Because of some oversight in the parsing, it's whitespace significant when it's not supposed to be. Conditional statements will sometimes be executed when they're supposed to be skipped because of this. Opening a block statement cannot be done on a separate line, etc.
Other than that it's just full of messy, inconsistent syntax that was probably designed that way to make very basic stuff easy, but at the same time makes anything a little more complex barely readable.
I lost most of my mIRC scripts, or I could have probably found some good examples of what a horrible mess it forces you to create :(
Regular expressions
It's a write only language, and it's hard to verify if it works correctly for the right inputs.
Visual Foxpro
I can't belive nobody has said this one:
LotusScript
I thinks is far worst than php at least.
Is not about the language itself which follows a syntax similar to Visual Basic, is the fact that it seem to have a lot of functions for extremely unuseful things that you will never (or one in a million times) use, but lack support for things you will use everyday.
I don't remember any concrete example but they were like:
"Ok, I have an event to check whether the mouse pointer is in the upper corner of the form and I don't have an double click event for the Form!!?? WTF??"
Twice I've had to work in 'languages' where you drag-n-dropped modules onto the page and linked them together with lines to show data flow. (One claimed to be a RDBMs, and the other a general purpose data acquisition and number crunching language.)
Just thinking of it makes me what to throttle someone. Or puke. Or both.
Worse, neither exposed a text language that you could hack directly.
I find myself avoid having to use VBScript/Visual Basic 6 the most.
I use primarily C++, javascript, Java for most tasks and dabble in ruby, scala, erlang, python, assembler, perl when the need arises.
I, like most other reasonably minded polyglots/programmers, strongly feel that you have to use the right tool for the job - this requires you to understand your domain and to understand your tools.
My issue with VBscript and VB6 is when I use them to script windows or office applications (the right domain for them) - i find myself struggling with the language (they fall short of being the right tool).
VBScript's lack of easy to use native data structures (such as associative containers/maps) and other quirks (such as set for assignment to objects) is a needless and frustrating annoyance, especially for a scripting language. Contrast it with Javascript (which i now use to program wscript/cscript windows and do activex automation scripts) which is far more expressive. While there are certain things that work better with vbscript (such as passing arrays back and forth from COM objects is slightly easier, although it is easier to pass event handlers into COM components with jscript), I am still surprised by the amount of coders that still use vbscript to script windows - I bet if they wrote the same program in both languages they would find that jscript works with you much more than vbscript, because of jscript's native hash data types and closures.
Vb6/VBA, though a little better than vbscript in general, still has many similar issues where (for their domain) they require much more boiler plate to do simple tasks than what I would like and have seen in other scripting languages.
In 25+ years of computer programming, by far the worst thing I've ever experienced was a derivative of MUMPS called Meditech Magic. It's much more evil than PHP could ever hope to be.
It doesn't even use '=' for assignment! 100^b assigns a value of 100 to b and is read as "100 goes to b". Basically, this language invented its own syntax from top to bottom. So no matter how many programming languages you know, Magic will be a complete mystery to you.
Here is 100 bottles of beer on the wall written in this abomination of a language:
BEERv1.1,
100^b,T("")^#,DO{b'<1 NN(b,"bottle"_IF{b=1 " ";"s "}_"of beer on the wall")^#,
N(b,"bottle"_IF{b=1 " ";"s "}_"of beer!")^#,
N("You take one down, pass it around,")^#,b-1^b,
N(b,"bottle"_IF{b=1 " ";"s "}_"of beer on the wall!")^#},
END;
TCL. It only compiles code right before it executes, so it's possible that if your code never went down branch A while testing, and one day, in the field it goes down branch A, it could have a SYNTAX ERROR!

Resources