About 10 years ago, I was using a computer language called INFO on an ancient Prime. It was a funky, but oddly useful language, and I'd like to find out more about it. For instance, can I get a copy that would run on a PC? However, if I google for "INFO" and "Computer Language," I'm not going to find it. I've tried. I think the company that created it was in the UK.
It was so easy to learn that I was programming the first day, and yet it was capable of true oddness. The programs had line numbers, and if a line number started with an odd digit, the commands were SQL-like and worked on the entire set. If it started with an even digit, the commands formed a block that dealt with each item in the set one at a time.
So, a simple type of program could be something like this (I don't recall the exact syntax either):
100 select * from user_table
110 join with address_table
120 exclude duplicate addresses
200 for each record in set
205 print firstname, lastname
210 print addressLine1
220 print City, State, Zip
300 display "address labels printed"
It was capable of quick and impressive applications that were almost simple to understand, and yet it had a sneaky power that I was only beginning to grasp after two years at the job.
How do I find it? Can I get a copy that will work on a PC?
I believe that's the Pr1me Information language; it's now hiding out in IBM UniVerse. It was a competitor to Pick.
I think the correct name was "Prime Information". Searching for that phrase might get you some slightly more relevant results.
You can find some info here.
Help by editing my question title and tags is greatly appreciated!
Sometimes one participant in my corpus of "conversations" will refer to another participant using a nickname, usually an abbreviation or misspelling; hereafter I'll just say "nicknames". Let's say I'm willing to manually tell my software whether or not I think various possible nicknames are in fact nicknames, but I want the software to come up with a list of possible matches between the handles that identify people and the potential nicknames. How would I go about doing that?
Background on me and my corpus: I have no experience doing natural language processing, but I'm a competent data analyst with R. My data is produced by 70 teams, each forecasting the likelihood of 100 distinct events occurring some time in the future. The result is that I have 70 x 100 = 7000 text files containing the stream of forecasts participants make and the comments they include with their forecasts. I'll paste a very short snippet of one of these text files below; this one had to do with whether the Malian government would enter talks with the MNLA:
02/12/2013 20:10: past_returns answered Yes: (50%)
I hadn't done a lot of research when I put in my previous
placeholder... I'm bumping up a lot due to DougL's forecast
02/12/2013 19:31: DougL answered Yes: (60%)
Weak President Traore wants talks if MNLA drops territorial claims.
Mali's military may not want talks. France wants talks. MNLA sugggests
it just needs autonomy. But in 7 weeks?
02/12/2013 10:59: past_returns answered No: (75%)
placeholder forecast...
http://www.irinnews.org/Report/97456/What-s-the-way-forward-for-Mali
My initial thoughts: Obviously I can start by providing the names I'm looking to match against; in the above example they would be past_returns and DougL (though there is no use of nicknames in the above). I wouldn't think it would be that hard to get a computer to guess at minor misspellings (though I wouldn't personally know where to start). I can imagine other tricks could be used too: a string is more likely to be a nickname if it is used much more by one team than by other teams; a nickname is more likely to refer to someone who spoke recently than to someone who spoke long ago, or not at all, on this question; and it should be used in sentences in a manner similar to the way the full name/screen name is typically used in the corpus. But I'm interested to hear about simple approaches as well as ones that use more sophisticated techniques.
This could get about as complicated as you want to make it. From the semi-linguistic side of things, research topics would include Levenshtein Distance (for detecting minor misspellings of known names/nicknames) and Named Entity Recognition (for the task of detecting names/nicknames in the first place). Actually, NER's worth reading about, but existing systems might not help you much in your domain of forum handles and nicknames.
The first rough idea that comes to mind is that you could run a tokenized version of your corpus against an English dictionary (perhaps a dataset compiled from Wiktionary or something like WordNet) to find words that are candidates for names, then filter those through some heuristics (do they start with the same letters as known full names? Do they have a low Levenshtein distance from known names? Are they used more than once?).
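A rough sketch of that pipeline in Python, assuming you already have a set of ordinary English words and a list of known handles; the example sentence and the toy dictionary below are made up for illustration:

    import re

    KNOWN_HANDLES = ["past_returns", "DougL"]

    def levenshtein(a, b):
        """Plain dynamic-programming edit distance."""
        if len(a) < len(b):
            a, b = b, a
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            current = [i]
            for j, cb in enumerate(b, 1):
                current.append(min(previous[j] + 1,                 # deletion
                                   current[j - 1] + 1,              # insertion
                                   previous[j - 1] + (ca != cb)))   # substitution
            previous = current
        return previous[-1]

    def nickname_candidates(text, english_words, max_dist=2):
        """Tokens that are not ordinary words and sit close to a known handle."""
        tokens = set(re.findall(r"[A-Za-z_]{3,}", text))
        candidates = {}
        for tok in tokens:
            if tok.lower() in english_words:
                continue                                 # ordinary word, skip it
            for handle in KNOWN_HANDLES:
                close = levenshtein(tok.lower(), handle.lower()) <= max_dist
                prefix = handle.lower().startswith(tok.lower())
                if close or prefix:
                    candidates.setdefault(handle, set()).add(tok)
        return candidates

    toy_dictionary = {"bumping", "due", "and", "forecast", "forecasts"}
    print(nickname_candidates("Bumping up due to Doug's and DougLs forecasts", toy_dictionary))
    # -> {'DougL': {'Doug', 'DougLs'}}

The other heuristics you mentioned (frequency per team, recency of the referred-to speaker) would just become extra filters on the candidate sets this produces.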
You could also try some clustering or supervised ML algorithms against the non-word tokens. That might reveal some non-"word" tokens that often occur in the same threads as a given username; again, heuristics could help rule out some false positives.
Good luck; sounds like a fun problem - hope I mentioned at least one thing you hadn't already thought of.
I have to write some code working with locales. Is there a good introduction to the subject to get me started?
First posted at Everything you need to know about Locales
A long time ago when I was a senior developer in the Windows group at Microsoft, I was sent to the Far East to help get the F.E. version of Windows 3.1 shipped. That was my introduction to localizing software – basically being pushed in to the deep end of the pool and told to learn how to swim. This is where I learned that localization is a lot more than translation.
Note: One interesting thing we hit - the infamous Blue Screen of Death switched the screen into text mode. You can't display Asian languages in text mode. So we (and by we I mean me) came up with a system where we put the screen in VGA mode, stored the 12 pt. Courier bitmap, at that resolution, for just the characters used in BSoD messages, and rendered it that way. You kids today have it so easy :)
So keep in mind that taking locale into account can lead to some very unexpected work.
The Locale
OK, so fast-forward to today. What is a locale and what do you need to know? A locale is fundamentally the language and country a program is running under. (A variant can also be added to the country, but its use is extremely rare.) The two parts can occur in any combination: for example, a Spanish national in Germany would set es_DE so that the user interface is in Spanish (es) but the country settings follow German (DE) conventions. Do not assume location based on language, or vice versa.
The language part of the locale is very simple - that's what language you want to display the text in your app in. If the user is a Spanish speaker, you want to display all text in Spanish. But which dialect of Spanish? It is quite different between Spain and Mexico (just as in America we spell color while in England it's colour). So the country can affect the language used, depending on the combination.
All languages that support locale specific resources (which is pretty much all of them today) use a fall-back system. They will first look for a resource for the language_country combination. While es_DE has probably never been done, there often is an es_MX and es_ES. So for a locale set to es_MX it will first look for the es_MX resource. If that is not found, it then looks for the es resource. This is the resource for that language, but not specific to any country. Generally this is copied from the largest country (economically) for that language. If that is not found, it then goes to the "general" resource which is almost always the native language the program was written in.
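In code, the fall-back amounts to a chain of lookups. Here is a minimal sketch in Python (the resource dictionaries and the strings in them are hypothetical; real frameworks load resources from files but search in the same order):

    # Hypothetical resource tables keyed by locale tag.
    RESOURCES = {
        "es_MX": {"greeting": "¿Qué onda?"},
        "es":    {"greeting": "Hola"},
        "":      {"greeting": "Hello"},   # the "general" (native-language) resource
    }

    def lookup(key, locale):
        """Try language_COUNTRY, then language, then the general resource."""
        lang = locale.split("_")[0]
        for candidate in (locale, lang, ""):
            table = RESOURCES.get(candidate)
            if table and key in table:
                return table[key]
        raise KeyError(key)

    print(lookup("greeting", "es_MX"))  # -> "¿Qué onda?"
    print(lookup("greeting", "es_DE"))  # falls back to "es" -> "Hola"
    print(lookup("greeting", "fr_FR"))  # falls back to the general resource -> "Hello"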
The theory behind this fallback is that you only have to define separate resources where the more specific locale actually differs - and that is very useful. But even more importantly, when new parts of the UI are made and you want to ship beta copies, or you release before you can get everything translated, then the translated parts display localized while the untranslated parts still display - just in English. This annoys the snot out of users in other countries, but it does get them the program sooner. (Note: We use Sisulizer for translating our resources - good product.)
The second half is the country. This is used primarily for number and date/time settings. This spans the gamut from what the decimal and thousand separator symbols are (12,345.67 in the U.S. is 12 345,67 in Russia) to what calendar is in use. The way to handle this is by using the run-time classes available for all operations on these elements when interacting with a user. Classes exist for both parsing user entered values as well as displaying them.
Keep a clear distinction between values the user enters or that are displayed to the user, and values stored internally as data. A number may be a string in an XML file, but in that XML file it will be "12345.67" (unless someone did something very stupid). Keep your data strongly typed and only do the locale-specific conversions when displaying text to the user or parsing text from the user. Storing data in a locale-specific format will bite you in the ass sooner or later.
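Here is a short sketch of that boundary in Python, using Babel as one of many locale-aware formatting libraries; the values are just examples, and the same pattern exists in every modern run-time:

    from datetime import date
    from decimal import Decimal

    from babel.dates import format_date
    from babel.numbers import format_decimal, parse_decimal

    value = Decimal("12345.67")                    # canonical, locale-independent form
    print(format_decimal(value, locale="en_US"))   # 12,345.67
    print(format_decimal(value, locale="ru_RU"))   # 12 345,67 (group separator is a non-breaking space)

    # Parsing locale-formatted user input back into the canonical form:
    print(parse_decimal("12.345,67", locale="de_DE"))  # Decimal('12345.67')

    # Dates follow the same pattern:
    print(format_date(date(2013, 2, 12), format="long", locale="de_DE"))  # 12. Februar 2013

The canonical Decimal is what you store and compute with; the locale-specific strings only ever exist at the edge where the user sees or types them.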
Chinese
Chinese does not have an alphabet but instead has a set of glyphs. The People's Republic of China several decades ago significantly revised how to draw the glyphs and this is called simplified. The Chinese glyphs used elsewhere continued with the original and that is called traditional. It is the exact same set of characters, but they are drawn differently. It is akin to our having both a text A and a script A - they both mean the same thing but are drawn quite differently.
This is more of a font issue than a translation issue, except that wording and usage has diverged a bit, in part due to the differences in approach between traditional and simplified Chinese. The end result is that you generally do want to have two Chinese language resources, one zh_CN (PRC) and one zh_TW (Taiwan). As to which should be the zh resource - that is a major geopolitical question and you're on your own (but keep in mind PRC has nukes - and you don't).
Strings with substituted values
So you need to display a message: Display("The operation had the error: " + msg); No, no, no! In another language the proper word order could be Display("The error: " + msg + " was caused by the operation"); Every modern run-time library has a construct where you can have a string resource "The operation had the error: {0}" and it will then substitute your msg in at {0}. (Some use a syntax other than {0}, {1}, …)
You store these strings in a resource file that can be localized. Then when you need to display the message, you load it from the resources, substitute in the variables, and display it. The combination of this, plus the number & date/time formatters make it easy to build up these strings. And once you get used to them, you'll find it easier than the old approach. (If you are using Visual Studio - download and install ResourceRefactoringTool to make this trivial.)
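Here's a rough sketch of the idea in Python (just one language choice). The message catalog below is a made-up stand-in for real resource files, and the Spanish string is only an illustrative translation:

    # Hypothetical per-locale string resources; in a real app these live in
    # resource files (.resx, gettext catalogs, etc.), not in source code.
    MESSAGES = {
        "en": {"op_error": "The operation had the error: {0}"},
        "es": {"op_error": "El error: {0} fue causado por la operación"},  # illustrative only
    }

    def display_error(msg, locale):
        lang = locale.split("_")[0]                        # reuse the fallback idea from above
        template = MESSAGES.get(lang, MESSAGES["en"])["op_error"]
        print(template.format(msg))                        # {0} is substituted here

    display_error("file not found", "en_US")
    display_error("file not found", "es_MX")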
Arabic, Hebrew, and complex scripts.
Arabic & Hebrew are called b-directional because parts of it are right to left while other parts are left to right. The text in Arabic/Hebrew are written and read right to left. But when you get to Latin text or numbers, you then jump to the left-most part and read that left to right, then jump back to where that started and read right to left again. And then there is punctuation and other non-letter characters where the rules depend on where they are used.
Here's the bottom line - it is incredibly complex and there is no way you are going to learn how it works unless you take this on as a full-time job. But not to worry, the run-time libraries for most languages again have classes to handle this. The key is that the text for a line is stored in the order you read the characters, not the order you display them - so in memory it is in logical (reading) order. In this way everything works normally except when you display the text and when you handle caret movement.
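A tiny illustration of logical order, using Python's standard unicodedata module: the string below mixes Hebrew, Latin, and digits, is stored in the order you read it, and each character carries a bidi class that the display layer (not your code) uses to reorder runs on screen:

    import unicodedata

    text = "שלום ABC 123"   # stored in logical (reading) order, not display order
    for ch in text:
        print(repr(ch), unicodedata.bidirectional(ch))
    # Hebrew letters report 'R' (right-to-left), Latin letters 'L', digits 'EN';
    # the rendering engine reorders the runs for display, memory order stays logical.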
Complex scripts like Indic scripts have a different problem. While they are read left to right, you can have cases where some combinations of letters are placed one above the other, so the string is no wider on the screen when the second letter is added. This tends to require a bit of care with caret movement but nothing more.
We even have cases like this in English where ae is sometimes rendered as a single æ character. (When the human race invented languages, they were not thinking computer friendly.)
Don't Over-Stress it
It seems like a lot but it's actually quite simple. In most cases you need to display text based on the closest resource you have. And you use the number & date/time classes for all locales, including your native one. No matter where you live, most computer users are in another country speaking another language - so localizing well significantly increases your potential market.
And if you're a small company, consider offering a free copy for people who translate your product. When I created Page 2 Stage I offered a free copy (list price $79.95) for translating it - and got 28 translations. I also met some very nice people online in the process. For an enterprise level product, many times a VAR in another country will translate it for you at a reduced rate or even free if they see a good market potential. But in these cases, do the first translation in-house to get the kinks worked out.
One resource I find very useful is the Microsoft Language Portal where you can put in text in English and if that text is in any of the Microsoft products, it will give you the translation Microsoft used for a given language. This can give you a fast high-quality translation for up to 80% of your program in many cases.
Удачи! (Good Luck)
I'm developing a shopping comparison website, and the project is in a very advanced stage. We index 50 million products daily using merchant feeds from various affiliate networks. Most of the problems I had is already solved, including the majority of the performance bottlenecks.
What is my problem? First of all, please note that we are using Apache Solr with Drupal, BUT this problem IS NOT specific to Drupal or Solr; if you have no knowledge of them, it doesn't matter.
We receive product feeds from over 2000 different merchants, and those feeds are a mess. They have no specific pattern; each merchant sends the feeds the way they want. We have already solved many problems regarding this, but one remains: normalizing the taxonomy terms for the faceted browsing functionality.
Suppose I have a "Narrow by Brands" browsing facet on my website. Now suppose that 100 merchants offer products from Microsoft. Here comes the problem. Some merchants put "Microsoft" in the "Brands" column of the data feed, others "Microsoft, Inc.", others "Microsoft Corporation", others "Products from Microsoft", etc. There is no specific pattern between merchants and, worse, some individual merchants are so sloppy that they have different strings for the same brand IN THE SAME DATA FEED.
We do not want all those different brands appearing in the navigation. We have a manual solution where we map the imported brands to a "good" brands table ("Microsoft Corporation" -> "Microsoft", "Products from Microsoft" -> "Microsoft", etc.). We have something like 10,000 brands in the database and this is doable. The problem is when it comes to bigger things like "Authors". When we import books into the system, there are over 800,000 authors, we have the same problem, and this is not doable by hand mapping. The problem is the same: "Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M.", etc.
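To make the manual mapping concrete, it amounts to something like the following (a simplified sketch with made-up entries and a naive cleanup fallback, not our actual code):

    import re

    # Hand-maintained mapping from raw feed strings to "good" brands (examples only).
    BRAND_MAP = {
        "microsoft, inc.": "Microsoft",
        "microsoft corporation": "Microsoft",
        "products from microsoft": "Microsoft",
    }

    def normalize_brand(raw):
        key = raw.strip().lower()
        if key in BRAND_MAP:
            return BRAND_MAP[key]
        # Fallback: strip common corporate prefixes/suffixes before giving up.
        cleaned = re.sub(r"^(products from)\s+", "", key)
        cleaned = re.sub(r"[,.]?\s*(inc|corp|corporation|ltd|llc)\.?$", "", cleaned).strip(" ,.")
        return cleaned.title()

    for raw in ["Microsoft", "Microsoft, Inc.", "Products from Microsoft", "Sony Corporation"]:
        print(raw, "->", normalize_brand(raw))

This works at the 10,000-brand scale, but clearly not for 800,000 authors, which is the part I need help with.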
Does anybody know a good way to solve this problem automatically with an acceptable degree of accuracy (85%-95%)?
Thank you for the help!
An idea that comes to mind, although it's just a loose thought (a rough code sketch follows after the list):
Convert names to initials (in your example: TMA). Treat '-' as a space, so e.g. Antoine de Saint-Exupéry would be ADSE. The problem here is how to treat ','; however, its common usage is to put the surname before the forename, so just swapping positions should work (so A,TM becomes TM,A; get rid of the comma - TMA).
Filter the authors in the database by those initials.
For each initial, if you have the whole name (Tom, Apostol), check whether it matches; otherwise (M.) consider it a match automatically.
If you want some tolerance, you can compare names with Levenshtein distance and tolerate some differences (here you have an Oracle implementation).
Names that match you treat as the same author. To find the whole name, for each initial (T, M, A) you look up your filtered authors (after step 2) and try to find one with not just an initial (M.) but the whole name (Mike); if you can't find one, use the initial. Therefore, each of the examples you gave would be converted to the same value, which would be the full name (Tom Mike Apostol).
Things that are worth thinking about:
Include mappings for name synonyms (this would most likely be at most a few hundred records, like Thomas <-> Tom).
With this approach it is crucial to have valid initials (no M instead of N, etc.).
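A rough Python sketch of the steps above; the names are your examples, and the choice to prefer the longest spelling per slot (and where a Levenshtein check would slot in) is only illustrative:

    import re
    from collections import defaultdict

    def name_parts(name):
        """Split a name into words; handle 'Surname, Forename(s)' and hyphens."""
        name = name.replace("-", " ")
        if "," in name:                        # 'Apostol, Tom M.' -> 'Tom M. Apostol'
            surname, rest = name.split(",", 1)
            name = rest.strip() + " " + surname.strip()
        return [p for p in re.split(r"[\s.]+", name) if p]

    def initials_key(name):
        return "".join(p[0].upper() for p in name_parts(name))   # 'TMA'

    def canonicalize(names):
        """Group variants by initials key, then pick the fullest spelling per slot."""
        groups = defaultdict(list)
        for n in names:
            groups[initials_key(n)].append(n)
        canonical = {}
        for key, variants in groups.items():
            # In a real system you would also compare the spelled-out parts here
            # (e.g. with Levenshtein distance) before merging the variants.
            slots = [name_parts(v) for v in variants]
            best = []
            for i in range(len(key)):
                options = [s[i] for s in slots if len(s) > i]
                best.append(max(options, key=len))     # prefer 'Mike' over 'M'
            for v in variants:
                canonical[v] = " ".join(best)
        return canonical

    names = ["Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M."]
    print(canonicalize(names))
    # every variant maps to 'Tom Mike Apostol'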
edit: I coded such a thing some time ago, when I had to identify a person by their signature, ignoring scanning problems. People sometimes sign as Name S. Surname, or N.S., or just Name Surname (which is another thing you should maybe consider in the solution - allowing the algorithm to ignore the second name - although in your situation it would be rather rare to omit someone's second name, I guess).
I have no clue where to start on this. I've never done any NLP and have only programmed in Python 3.1, which I have to use. I'm looking at the site http://www.linkedin.com; I have to gather all of the public profiles, and some of them have very fake names, like 'aaaaaa k dudujjek'. I've been told I can use NLP to find the real names. Where would I even start?
This is a difficult problem to solve, and one which starts with acquiring valid given name & surname lists.
How large is the set of names that you're evaluating, and where do they come from? These are both important things for you to consider. If you're evaluating a small set of "American" names, your valid name lists will differ greatly from lists of Japanese or Indian names, for instance.
Your idea of scraping LinkedIn is on the right track, but you were right to catch the fake profile/name flaw. A better website would probably be something like IMDB (perhaps scraping names by iterating over different birth years), or Wikipedia's lists of most popular given names and most common surnames.
When it comes down to it, this is a precision vs. recall problem: in order to miss fewer fakes, you're inevitably going to throw out some real names. If you loosen up your restrictions, you'll get more fakes, but you'll also throw out fewer real names.
Several possibilities here, but the most obvious seems to be HMMs, i.e. Hidden Markov Models. NLTK includes [at least] one module for HMMs, although I must admit I have never used it.
Another possible snag is that, AFAIK, NLTK has not yet been ported to Python 3.0.
This said, and while I'm quite keen on using NLP techniques where applicable, I think that a process which uses several paradigms, including some NLP tricks, may be a better solution for this particular problem. For example, storing even a reduced dictionary of common family names (and first names) in a traditional database may offer both a more reliable and a more computationally efficient way of filtering a significant portion of the input data, leaving precious CPU resources to be spent on less obvious cases.
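For instance, a rough sketch of such a dictionary filter in Python; the name lists below are tiny placeholders for real lists built from census data, Wikipedia name lists, etc., and the heuristic thresholds are arbitrary:

    GIVEN_NAMES = {"john", "maria", "wei", "tom", "fatima"}
    SURNAMES = {"smith", "garcia", "ng", "apostol", "khan", "o"}

    def looks_like_real_name(full_name):
        """Very rough heuristic: at least one token appears in a known-name list,
        and no token is a long run of a single character (keyboard mashing)."""
        tokens = [t.lower() for t in full_name.split()]
        if not tokens:
            return False
        if any(len(set(t)) == 1 and len(t) > 3 for t in tokens):  # 'aaaaaa'
            return False
        known = sum(t in GIVEN_NAMES or t in SURNAMES for t in tokens)
        return known >= 1

    print(looks_like_real_name("Tom Apostol"))        # True
    print(looks_like_real_name("aaaaaa k dudujjek"))  # False

Raising the `known` threshold rejects more fakes but also more legitimate unusual names, which is exactly the precision/recall trade-off mentioned in another answer.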
I am afraid this problem is not solvable if your list is even only minimally 'open' - if the names are, e.g., customers from a small, traditionally acting population, you might end up with a few hundred names for thousands of people. But generally you can hardly predict what is a real name and what is not, however unusual an Arabic, Chinese, or Bantu name may look in a sample of, say, southern English rural neighborhood names. I mean, 'Ng' is a common Cantonese surname, and 'O' is common in Korea, so assumptions may fail. There is a place in Austria called 'Fucking', so even looking out for four-letter words is no guarantee of success.
What you could do is work through a sufficiently big sample of such names and sort them out manually. Then, use all kinds of text-processing tools and collect metrics. Maybe you can derive a certain likelihood for a name to be recognized as fake; maybe it will not be viable. You will never get beyond likelihoods here, though.
As an aside, we used to use Google Maps and the telephone directory for validating customer data years ago. If Google Maps could find the place, we called the address validated. It is clear that under stricter requirements, true validation must go much further. Let's not forget that validating such data is much more a social question than a linguistic one.
The company I work for is in the business of sending press releases. We want to make it possible for interested parties to search for press releases based on a number of criteria, the most important being location. For example, someone might search for all news sent to New York City, Massachusetts, or ZIP code 89134, sent from a governmental institution, under the topic of "traffic". Or whatever.
The problem is, we've sent, literally, hundreds of thousands of press releases. Searching is slow and complex. For example, a press release sent to Queens, NY should show up in the search I mentioned above even though it wasn't specifically sent to New York City, because Queens is a subset of New York City. We may also want to add "and", "or", negation, and text search to the query to create complex searches. These searches also have to be fast enough to function as dynamic RSS feeds.
I really don't know anything about search theory, or how it's properly done. The way we are getting by right now is using a data mart to store the locations the releases were sent to in a single table. However, because of the subset issue mentioned above, the data mart is gigantic, with millions of rows. And we haven't even implemented cities yet; there are about 50,000 cities in the United States, which will increase the size of the data mart by so much that I'm afraid it just won't work anymore.
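To make the subset problem concrete, here is a toy sketch (hypothetical location data) of indexing each release under its own location plus every parent location - essentially what our data mart does, and why the row count multiplies:

    # Toy location hierarchy: child -> parent (illustrative, not our real data).
    PARENT = {
        "Queens": "New York City",
        "New York City": "New York State",
        "New York State": "USA",
    }

    def ancestors(location):
        """The location itself plus every containing location."""
        chain = [location]
        while location in PARENT:
            location = PARENT[location]
            chain.append(location)
        return chain

    index = {}   # location -> set of release ids

    def add_release(release_id, sent_to):
        for loc in ancestors(sent_to):
            index.setdefault(loc, set()).add(release_id)

    add_release("PR-1", "Queens")
    print(index["New York City"])   # {'PR-1'} - the Queens release matches an NYC search
    print(ancestors("Queens"))      # one release becomes four index rows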
Anyway, I realize this is not a simple question and there won't be a "do this" answer. However, I'm hoping one of you can point me in the right direction so I can learn how massive searches are done, because I really know nothing about it, and such a search engine is turning out to be incredibly difficult to build. Thanks! I know there must be a way, because if Google can search the entire internet, we must be able to search our own database :-)
Google can search the entire internet, and your data via a Google Appliance!