What's the real difference between 3GLs and 4GLs?

I can't find anything useful for determining whether a language is third generation or fourth. All I find are vague statements like "higher level" and "closer to English"; some sources say 4GLs are domain-specific languages like SQL, while others say they can be general purpose. I'm really confused.
If 2GLs are the Assembly languages and 5GLs are the inference languages like Prolog, how do you determine if a programming language is a 3GL or a 4GL?

Most use of the terms was pure marketing -- "Oh, you're still using a third generation language? That's so last week!"
Behind that, there was a tiny bit of technical meaning though (at least for a little while, though many "4GLs" ignored it). The basic difference was (supposed to be) that third generation languages allowed you to manipulate only individual data items, whereas fourth generation languages allowed you to manipulate groups of items as a group rather than individually.
Two obvious examples of this are SQL and APL. In SQL, you mostly work with sets. The result of a query is a set (not exactly a mathematical set, but at least somewhat similar). You can use and manipulate that set as a whole, merge it with other sets, etc. Until or unless you're exposing it to the outside world (e.g., with a cursor) you don't have to deal with the individual records/rows/tuples that make up that set.
In APL you get somewhat the same idea, except you're working with arrays instead of sets. To get an idea of what this means, let's assume you wanted to "rotate" an array so the currently-first element was moved to the end, and each other element was shifted ahead a spot. To do that in a typical 3GL (Fortran, Pascal, C, etc.) you'd write a loop that worked with the individual elements in the array. In APL, however, you have a single operator that will do that to the array as a whole, all in one operation. Even operations that work with individual items are generally trivial to apply to an entire array at once with the / operator, so (for example) the sum of all the elements in an array named a could be computed with +/a (or maybe that was /+a -- it's been a long time since I wrote any APL).
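If it helps to see the contrast outside APL, here is a rough Python rendering of the same idea (Python standing in for both camps, so this is only an illustration, not APL syntax): the first function shuffles individual elements the 3GL way, while the second expresses the rotation as a single whole-array expression, with the built-in sum playing the role of +/a.
# 3GL style: operate on individual elements with an explicit loop.
def rotate_3gl(a):
    if not a:
        return []
    rotated = []
    for i in range(1, len(a)):
        rotated.append(a[i])
    rotated.append(a[0])
    return rotated

# 4GL/APL style: treat the array as a single value.
def rotate_4gl(a):
    return a[1:] + a[:1]          # one whole-array expression, no visible loop

print(rotate_4gl([1, 2, 3, 4]))   # [2, 3, 4, 1]
print(sum([1, 2, 3, 4]))          # plays the role of APL's +/a reduction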
There are some pretty serious problems with the general idea of the distinction, though. One is that it placed a heavy emphasis on syntax: obviously the actions involved required things like loops internally, so the distinction amounted to syntax for an implicit loop. Another problem was that in quite a few cases you ended up with something of a blend of the two generations, e.g., most BASICs being able to treat a string as a single thing, but requiring loops for similar operations on arrays/matrices. Finally, there was a bit of a problem with relevance: in a few special cases (like SQL) being able to work with a group/set/array/whatever of data as a whole really made a big difference, but for the most part it did little to let people think and work with higher level abstractions (as was at least apparently the original intent).
That problem was compounded by a move toward languages that blurred the distinction between what was built in and what was part of a library (or whatever). In C++, most ML-family languages, etc., it's trivial to write a function with arbitrary actions (including but not limited to loops) and attach it to an operator that's essentially indistinguishable from one built into the language.
It was a catchy phrase with a meaning most couldn't explain and even fewer cared about -- a prime candidate for being turned into pure marketspeak, usually translated roughly as: "you should pay me a lot for my slow, ugly, buggy CRUD application generator."

"Language generations" were a hot buzzword in the 1980s and early 1990s. They were always ill-defined, and little used in actual academic discourse.
The terms had little meaning at the time, and none now.

Related

Failing to see the difference between these two lines of thought in dynamic programming

There are always multiple ways people describe the differences between tabulation and memoization in dynamic programming, but I will summarize what is normally said.
Memoization is where we add caching to a function so that repeated recursive calls take less computation. It is typically used on recursive functions for a top-down solution that starts with the initial problem and then recursively calls itself on smaller subproblems.
Tabulation uses a table to keep track of subproblem results and works in a bottom-up manner, solving the smallest subproblems before larger ones iteratively.
Well, my question is: what's the difference? Sometimes I look at different situations and the line is super blurred. Also, memoization working in a "top-down" fashion really just refers to its use of the call stack; in that sense it still goes down to the base case (the "bottom") and then uses those results to build up to the final result, so how is that really different from tabulation going bottom-up until it's done? Or is it simply that tabulation approaches don't involve recursion, and the fact that a dynamic programming solution uses recursion is what differentiates the two methods? If someone knowledgeable could offer their thoughts, it would be much appreciated.
You're right that they're just two implementation methods for the same computation. A recursive formulation with memoization will fill in the memo cache with the same entries that an explicit tabular formulation will put in its table.
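As a concrete (if toy) illustration, take Fibonacci as a stand-in for whatever recurrence you actually have; the two versions below end up computing exactly the same set of subproblem values, one into a memo cache and one into an explicit table:
from functools import lru_cache

# Top-down: recursion plus a memo cache.
@lru_cache(maxsize=None)
def fib_memo(n):
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

# Bottom-up: an explicit table filled from the base cases upward.
def fib_table(n):
    table = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

assert fib_memo(30) == fib_table(30)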
Explicit tabular formulations are strictly less useful, however. This is because they need more information about the problem in advance. They start by enumerating all possibly useful base cases and putting those in the table. (So what's "possibly useful?" That's the rub!) Then they enumerate the new "layer" of all possible problem versions that can be solved with the base cases. Then a layer of others that can be solved with those, etc., etc. This continues until the "top level" problem turns up in a layer.
For the kinds of problems typically seen in textbooks and coding interviews, determining all useful base cases is deliberately easy. The problem parameters are 2 or 3 "dense" natural numbers, so the table of solutions can be a 2d or 3d array with all elements containing useful values. In many of these, you can prove that the current layer only depends on a few (possibly one) previous layer, so all the rest can be discarded, which saves memory.
Practical problems aren't often so nice. The parameter sets aren't small or aren't natural numbers, or - even when they are - they're sparse so that filling in all entries of an array would be a waste.
In these cases, memoization is the only reasonable choice. The top-down recursion determines the sub-problems (on down to the base cases) that need solving as they occur. Sparseness doesn't matter because the memo cache can store parameter sets as explicit keys. When the current layer doesn't need more than K previous ones, various strategies can still be applied to discard the others.
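A sketch of what that looks like in practice (word segmentation here is just an example I'm picking for illustration, not something from your question); note that the memo is keyed by the parameter itself, so only the subproblems that actually arise get stored:
# Word segmentation: can `text` be split into dictionary words?
# The memo key is the remaining suffix itself (a string, not an index),
# so only the suffixes that actually arise during the recursion get stored.
def can_segment(text, words, memo=None):
    if memo is None:
        memo = {}
    if text == "":
        return True
    if text in memo:
        return memo[text]
    result = any(text.startswith(w) and can_segment(text[len(w):], words, memo)
                 for w in words)
    memo[text] = result
    return result

print(can_segment("catsanddog", {"cat", "cats", "and", "sand", "dog"}))  # True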

Randomized (as in QuickCheck) versus deterministic (as in SmallCheck) property checking

I know of two approaches to property checking:
The randomized approach (as in QuickCheck) makes you define a generator of random values for your types, then verifies your invariants on each of a large number of randomly generated cases. For example, in the case of a vector space ℤⁿ (defined, say, as [Word]), it would generate vectors of any length and direction.
The deterministic approach (as in SmallCheck) works the same way, except that it generates values deterministically, from simple and small to more complex and extensive, covering (as I understand it) a small part of the domain tightly. For example, in the same case as above, it would generate the zero vector, then all vectors of length <= 1, then all vectors of length <= 2, and so on, eventually covering all interesting values (such as unit vectors), for which there may conceivably be unforeseen corner cases that break an invariant.
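To check my mental model, here is how I picture the two strategies in plain, library-free Python (just a sketch of the idea, not either library's actual API):
import itertools, random

def prop_reverse_involution(xs):
    # The property under test: reversing a list twice gives back the original.
    return list(reversed(list(reversed(xs)))) == xs

# QuickCheck-style: many random cases of arbitrary size and content.
def check_random(prop, trials=100):
    for _ in range(trials):
        xs = [random.randint(0, 2**16) for _ in range(random.randint(0, 50))]
        if not prop(xs):
            return xs                     # counterexample found
    return None

# SmallCheck-style: every list over a small value range, up to a depth bound.
def check_exhaustive(prop, depth=3):
    for n in range(depth + 1):
        for xs in itertools.product(range(depth + 1), repeat=n):
            if not prop(list(xs)):
                return list(xs)           # smallest counterexample comes first
    return None

print(check_random(prop_reverse_involution), check_exhaustive(prop_reverse_involution))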
Did I get it correct? What are the benefits and downsides of either approach? What are the preferred use cases for either? Maybe, for best results, both approaches should be combined, say, covering some moderately sized "base" with a deterministic checker and then a random sample with a randomized one?
I'm looking for some solid, motivated best practice to use daily and advance from now on.
P.S. This question is not intended to provoke discussion. I am asking specifically what benefits either approach to property checking has matter-of-factly, not which is nicer, easier to use or better by any other measure that can be considered a matter of taste. An answer would provide either a critique of the problem posed, a mindful, mathematical observation of the nature of either method of property checking, or a case / experience report that justifies one of the approaches as preferable for a certain class of situations. In the end, I intend this question to generate a set of criteria that help determine which approach to testing will catch more programming mistakes. Put this way, I don't see how this question could be considered subjective in any way other than good, as outlined in the rules. There is a companion reddit thread that offers a channel for more liberal communication.

Generic graphing and charting solutions

I'm looking for a generic charting solution (ideally not a hosted one) that provides the following features:
Charting a tuple of values where the values are:
1) A service identifier (e.g. CPU usage)
2) A client identifier within that service (e.g. server IP)
3) A value
4) A timestamp with millisecond/second resolution.
Optional:
I'd also like to extend the concept of a client identifier: taking the above example further, I'd like to store statistics for each core separately, so another identifier would be Core 1/Core 2, and so on.
Now, to make sure I'm clearly stating my problem: I don't want a utility that collects these statistics. I'd like something that stores them, but this is also not mandatory; I can always store them in MySQL or the like.
What I'm looking for is something that takes values such as these and charts them nicely, in a multitude of ways (timelines, motion, and the usual ones [pie, bar..]). Essentially, a nice visualization package that allows me to make use of all this data. I'd be collecting data from multiple services and multiple applications, and the datapoints will be of varying resolution. Some of the data will include multiple layers of nesting, some none. (For example, CPU would go down to Server IP, CPU#, whereas memory would only be Server IP but would include a different identifier, i.e. free/used/cached, as the "secondary" identifier. Something like average request latency might not have a secondary identifier at all, in the case of ping.) What I'm trying to get across is that having multiple layers of identifiers would be great. To add one final example of where multiple identifiers would be great: adding an extra identifier on top of ip/cpu#, namely, process name. I think the advantages of that are obvious.
For some applications, we might collect data at a very narrow scope, focusing on every aspect, in other cases, it might be a more general statistic. When stuff goes wrong, both come in useful, the first to quickly say "something just went wrong", and the second to say "why?".
Further, it would be nice if the charting application threw out "bad" values; that is, if for some reason our monitoring program started reporting 300% CPU used on a single core for 10 seconds, it would be nice if the charts themselves didn't reflect it in the long run. Some sort of smoothing, maybe? This could obviously be done at the data layer, though, so it's not a requirement at all.
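Just as a rough sketch, something like this at the data layer is what I have in mind (the cap and window values are arbitrary):
def clean_series(values, cap=100.0, window=5):
    # Drop impossible readings (e.g. >100% for a single core), then smooth
    # the rest with a simple trailing moving average.
    clipped = [v for v in values if 0.0 <= v <= cap]
    smoothed = []
    for i in range(len(clipped)):
        chunk = clipped[max(0, i - window + 1):i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed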
Finally, comparing two points in time, or comparing two different client identifiers of the same service etc without too much effort would be great.
I'm not partial to any specific language, although I'd prefer something in (one of the following) PHP, Python, C/C++, C#, as these are languages I'm familiar with. It doesn't have to be open source, it doesn't have to be a library, I'm open to using whatever fits my purpose the best.
More of a P.S than a requirement: I'd like to have pretty charts that are easy for non-technical people to understand, and act upon too (and like looking at!).
I'm open to clarifying, and, in advance, thanks for your time!
I am pretty sure that Protovis meets all your requirements, but it has a bit of a learning curve. You are meant to learn by example, and there are plenty of examples to work from. It makes some pretty nice graphs by default. Every value can be a function, so you can do things like get rid of your "bad" values.

I have a list of names, some of them are fake, I need to use NLP and Python 3.1 to keep the real names and throw out the fake names

I have no clue where to start on this. I've never done any NLP and have only programmed in Python 3.1, which I have to use. I'm looking at the site http://www.linkedin.com; I have to gather all of the public profiles, and some of them have very fake names, like 'aaaaaa k dudujjek'. I've been told I can use NLP to find the real names. Where would I even start?
This is a difficult problem to solve, and one which starts with acquiring valid given name & surname lists.
How large is the set of names that you're evaluating, and where do they come from? These are both important things for you to consider. If you're evaluating a small set of "American" names, your valid name lists will differ greatly from lists of Japanese or Indian names, for instance.
Your idea of scraping LinkedIn is on the right track, but you were right to catch the fake profile/name flaw. A better website would probably be something like IMDB (perhaps scraping names by iterating over different birth years), or Wikipedia's lists of most popular given names and most common surnames.
When it comes down to it, this is a precision vs. recall problem: in order to miss fewer fakes, you're inevitably going to throw out some real names. If you loosen up your restrictions, you'll get more fakes, but you'll also throw out fewer real names.
Several possibilities here, but the most obvious seems to be with HMMs, i.e. Hidden Markov Models. The NLTK kit includes [at least] one module for HMMs, although I must admit I never used it.
Another possible snag is that, AFAIK, NLTK has not yet been ported to Python 3.0.
This said, and while I'm quite keen on using NLP techniques where applicable, I think that a process using several paradigms, including some NLP tricks, may be a better solution for this particular problem. For example, storing even a reduced dictionary of common family names (and first names) in a traditional database may offer both a more reliable and a more computationally efficient way of filtering a significant portion of the input data, leaving precious CPU resources to be spent on the less obvious cases.
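As a rough sketch of the kind of dictionary filter I mean (the file names and the whole scheme are placeholders, not a specific recommendation):
# Hypothetical file names; any flat lists of known names would do.
def load_names(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

first_names = load_names("given_names.txt")
surnames = load_names("surnames.txt")

def looks_real(full_name):
    tokens = full_name.lower().split()
    # Accept the name if any token is a known given name or surname;
    # everything else falls through to a more expensive NLP/HMM pass.
    return any(t in first_names or t in surnames for t in tokens)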
I am afraid this problem is not solvable if your list is even minimally 'open'. If the names are, e.g., customers from a small, tradition-bound population, you might end up with a few hundred names for thousands of people. But in general you can hardly predict what is a real name and what is not, however unusual an Arabic, Chinese, or Bantu name may look in a sample of, say, south English rural neighborhood names. I mean, 'Ng' is a common Cantonese surname, and 'O' is common in Korea, so assumptions may fail. There is a place in Austria called 'Fucking', so even looking out for four-letter words is no guarantee of success.
What you could do is work through a sufficiently big sample of such names and sort them out manually. Then use all kinds of text-processing tools and collect metrics. Maybe you can derive a certain likelihood for a name to be recognized as fake; maybe it will not be viable. You will never get beyond likelihoods here, though.
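For example, the metrics might look something like this (made-up features and threshold; you would tune them against your manually sorted sample):
import re

def fake_score(name):
    # A few crude features; a higher score means more suspicious.
    score = 0.0
    tokens = name.lower().split()
    for t in tokens:
        if re.search(r"(.)\1{2,}", t):               # runs like "aaaa"
            score += 1.0
        vowels = sum(c in "aeiouy" for c in t)
        if len(t) > 3 and vowels / len(t) < 0.2:     # almost no vowels
            score += 0.5
    if len(tokens) < 2:                              # single-token "names"
        score += 0.5
    return score

names = ["aaaaaa k dudujjek", "John Smith"]
print([n for n in names if fake_score(n) >= 1.0])    # ['aaaaaa k dudujjek']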
As an aside, we used to use Google Maps and the telephone directory to validate customer data years ago. If Google Maps could find the place, we called the address validated. It is clear that under stricter requirements, true validation must go much further. Let's not forget that validating such data is much more a social question than a linguistic one.

Determining what a word "is" - categorizing a token

I'm writing a bridge between the user and a search engine, not a search engine. Part of my value added will be inferring the intent of a query. The intent of a tracking number, stock symbol, or address is fairly obvious. If I can categorise a query, then I can decide if the user even needs to see search results. Of course, if I cannot, then they will see search results. I am currently designing this inference engine.
I'm writing a parser; it should take any given token and assign it a category. Here are some theoretical English examples:
"denver" is a USCITY and a PLACENAME
"aapl" is a NASDAQSYMBOL and a STOCKTICKERSYMBOL
"555 555 5555" is a USPHONENUMBER
I know that each of these cases will most likely require specific handling, however I'm not sure where to start.
Ideally I'd end up with something simple like:
queryCategory = magicCategoryFinder( query )
>print queryCategory
>"SOMECATEGORY or a list"
Natural language parsing is a complicated topic. One of the problems here is that determining what a word is depends on context and implied knowledge. Also, you're not so much interested in words as you are in groups of words. Consider: "New York City" is a place, but it's three words, two of which (new and city) have other meanings.
You also have to consider ambiguity, which is once again where context and implied knowledge come in. For example, JAVA is (or was) a stock symbol for Sun Microsystems. It's also a programming language, a place, and a word associated with coffee. How do you classify it? You'd need to know the context in which it was used.
And if you can solve that problem reliably you can make yourself very wealthy.
What's all this in aid of anyway?
To learn about "tagging" (the term of art for what you're trying to do), I suggest playing around with NLTK's tag module. More generally, NLTK, the Natural Language ToolKit, is an excellent toolkit (based on the Python programming language) for experimentation and learning in the field of Natural Language Processing (whether it's suitable for a given production application may be a different issue, esp. if said application requires very high speed processing on large volumes of data -- but, you have to walk before you can run!-).
You're bumping up against one of the hardest problems in computer science today: determining semantics from English context. This is the classic text-mining problem and gets into some very advanced topics. I think I would suggest thinking more about your problem and seeing whether you can a) go without categorization, or b) use structural info such as document position to give you a hint (it's either a city or a place name, or undetermined), along with some lookup tables to help. E.g., stock symbols are easy enough to build a fairly complete lookup table for. You might consider downloading the CIA World Factbook for a lookup of cities, etc.
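To sketch the lookup-table half of that suggestion (the tables below are tiny placeholders standing in for real data sources, and the function name just echoes the magicCategoryFinder from your question):
import re

US_CITIES = {"denver", "boston", "austin"}           # placeholder lookup data
NASDAQ_SYMBOLS = {"aapl", "msft", "goog"}             # placeholder lookup data
PHONE_RE = re.compile(r"^\d{3}[ \-.]?\d{3}[ \-.]?\d{4}$")

def magic_category_finder(query):
    q = query.strip().lower()
    categories = []
    if q in US_CITIES:
        categories += ["USCITY", "PLACENAME"]
    if q in NASDAQ_SYMBOLS:
        categories += ["NASDAQSYMBOL", "STOCKTICKERSYMBOL"]
    if PHONE_RE.match(q):
        categories.append("USPHONENUMBER")
    return categories or ["UNDETERMINED"]

print(magic_category_finder("555 555 5555"))          # ['USPHONENUMBER']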
As others have already pointed out, this is an exceptionally difficult task. The classic test is a pair of sentences: "Time flies like an arrow. Fruit flies like a banana."
In the first sentence, "flies" is a verb. In the second, it's part of a noun. In the first, "like" is an adverb, but in the second it's a verb. The context doesn't make this particularly easy to sort out either: there's no obvious difference between "Time" and "Fruit" (both normally nouns). Likewise, "arrow" and "banana" are both normally nouns.
It can be done -- but it really is decidedly non-trivial.
Although it might not help you much with disambiguation, you could use Cyc. It's a huge database of what things are that's intended to be used in AI applications (though I haven't heard any success stories).
