Algorithm to shorten city names to human readable codes - string

I am using the region and city in URLs for this project. Many regions and cities have very long names, and combinations of shortened region/city information might lead to ambiguity.
Is there an easy approach to automatically shorten words so that they still make sense and are readable, without just cutting off the end?
Like turning Bremerhaven into Brmhvn, or New Haven, Connecticut into NewHvn-Cnctct?

There are 2797245 cities in the list of world cities freely available from http://www.maxmind.com/app/worldcities
I would design the URL pattern similar to
{Country TLD}/{Abbreviation for state, province or prefecture}/{Trim of county or district}/{Trim of city}
Some examples,
www.example.com/US/NY/-/NewYork
www.example.com/US/NY/Westcheste/MtVernon (Mount Vernon in Westchester County, New York)
- county trimmed to its first 10 characters; common words in city names are also abbreviated
www.example.com/DE/Bavaria/Munich
www.example.com/JP/Tohoku/Miyagi/Sendai (Sendai)
For the region, you may want to consider using the ISO 3166 code. So, the above examples for Munich and Sendai would look like
www.example.com/JP/Tohoku/JP-04/Sendai
www.example.com/DE/DE-BY/Munich
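A minimal sketch of the pattern above, in Python. The function names, the COMMON_WORDS table and the 10-character limit are only illustrative assumptions taken from the examples; a real table of abbreviations would have to be much larger.

```python
# Hypothetical abbreviation table for common words in city names.
COMMON_WORDS = {"Mount": "Mt", "Saint": "St", "Fort": "Ft"}


def trim_city(city):
    """Abbreviate common words and join the remaining words without spaces."""
    return "".join(COMMON_WORDS.get(word, word) for word in city.split())


def build_path(country, region, county, city, max_len=10):
    """Build {country}/{region}/{county}/{city}; a missing county becomes '-'."""
    county_part = county.replace(" ", "")[:max_len] if county else "-"
    return "/".join([country, region, county_part, trim_city(city)])


print(build_path("US", "NY", "Westchester", "Mount Vernon"))
print(build_path("US", "NY", None, "New York"))
```

The two calls reproduce the US/NY/Westcheste/MtVernon and US/NY/-/NewYork examples. Trimming alone does not guarantee uniqueness, so the generated paths would still need a collision check.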
Other leads
HASC (Hierarchical administrative subdivision codes), which are used to represent names of country subdivisions, such as states, provinces, and regions.
http://en.wikipedia.org/wiki/ISO_3166-2:US
http://www.statoids.com/uus.html
Hope it helps.

You're probably better advised to adopt an existing list of codes than to make up your own. For example you could use IATA codes or zip/postal codes, even telephone dialling codes (find your own link for these).

If you want the shorter version to make sense to humans, I think this is incredibly complex, as it's not at all obvious what will represent the full name properly.
Example: my own city of Helsingborg. Given just this name, I would break it down to one letter per syllable: Hel-Sing-Borg -> HSB.
But I have never once seen anyone use this acronym. Everyone I know uses HBG.
In short, I would say it's fairly easy to make a function that produces a logical acronym for any given name, but very hard to make one that is recognizable to a human.
If you just want to crop out some letters from a name, that would probably be a lot easier, but you'd probably want to talk to an English professor to understand which parts of names you can cut out without affecting readability. It is possible, though, and there are most likely many publicly available studies on how we read words that you could reference.
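As a rough illustration of the "crop out some letters" approach, here is a Python sketch that keeps the first letter of each word and drops the remaining vowels. This is only one assumed rule, not an established scheme, and the function names are made up. Note that it gives Brmrhvn rather than the hand-made Brmhvn from the question, and that different names can collapse to the same code.

```python
import re
from collections import defaultdict


def drop_inner_vowels(word):
    """Keep the first letter of a word and remove all later vowels."""
    if not word:
        return word
    return word[0] + re.sub(r"[aeiouAEIOU]", "", word[1:])


def abbreviate(name):
    """Abbreviate each word of a name and join the pieces."""
    parts = re.split(r"[\s\-]+", name.strip())
    return "".join(drop_inner_vowels(part) for part in parts if part)


def find_collisions(names):
    """Return the codes that are shared by more than one name."""
    codes = defaultdict(list)
    for name in names:
        codes[abbreviate(name)].append(name)
    return {code: group for code, group in codes.items() if len(group) > 1}


print(abbreviate("Bremerhaven"))
print(abbreviate("New Haven"))
print(find_collisions(["Bern", "Born", "Bremerhaven"]))
```

The collision check matters because the question is explicitly worried about ambiguity: Bern and Born both become Brn under this rule.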

Related

How to efficiently extract street from a string

I need to extract the street from an apartment description. I don't necessarily need it with the number (not every listing has it anyway), but it would be appreciated.
MY 'SOLUTIONS'
1. Use a regular expression. But after reading many descriptions I realized that people often omit characteristic markers, such as writing the 'street' equivalent before the actual street name.
2. Create a list containing possible street names. Because I know which city the apartments will be in, I can download every street name for this city, load it, and then try to match any word from the description against the streets stored in the list (probably some kind of hash table). But descriptions can be quite long and street names can consist of several words. Then probably I need to use this solution. Also, the address in a description can have 1-2 characters mistyped.
3. Hand this work over to a third-party map engine. Cut a few words from the string and send them to a map engine (e.g. Google Maps); if they don't make sense to it, try the next few words, and so on.
Is there any better solution?
I feel like I have to assume things like address will always be typed correctly etc., otherwise complexity will be unacceptably high.
EDIT:
Programming language - I asked it as a generic question, not specific to language. But I'm using Python
String format - there is no exact string format, because everyone writes the description however they want. Usually the location is somewhere near the beginning, but that's not always the case.
Make different format - I can't, because I'm scraping this data from different sites.
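Option 2 (matching against a known street list while tolerating 1-2 mistyped characters) can be sketched in Python with the standard difflib module. The street list, the 3-word window and the 0.85 cutoff are all assumptions for illustration; the cutoff in particular would need tuning on real listings.

```python
import difflib
import re


def find_street(description, streets, max_words=3, cutoff=0.85):
    """Return the known street that best matches any 1..max_words window, or None."""
    tokens = re.findall(r"\w+", description.lower())
    lookup = {street.lower(): street for street in streets}
    candidates = list(lookup)
    best = None  # (similarity, street name)
    for size in range(max_words, 0, -1):
        for start in range(len(tokens) - size + 1):
            window = " ".join(tokens[start:start + size])
            match = difflib.get_close_matches(window, candidates, n=1, cutoff=cutoff)
            if not match:
                continue
            score = difflib.SequenceMatcher(None, window, match[0]).ratio()
            if best is None or score > best[0]:
                best = (score, lookup[match[0]])
    return best[1] if best else None


# Hypothetical street list and listing text.
streets = ["Baker Street", "Abbey Road", "Downing Street"]
print(find_street("Cozy 2-room flat on Bakr Street close to the park", streets))
```

This compares every window against every street, which is fine for one city but slow for long descriptions. Indexing streets by their first word, or checking exact hash-table hits before falling back to fuzzy matching, would cut the work down.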

Heuristic to predict Name or Company

Problem
We are receiving strings that may represent either a company name or a person's name. We need a heuristic to determine which.
Initial thoughts
Use an XML doc with either a <Commercial>String</Commercial> or a <Personal>String</Personal> node and score matching strings +1 (sorry, I don't know how to format XML on SO)
Can't just check for proper nouns, e.g. Bob's Company is a company whereas Bob Compton is a name
Need to return a confidence level in some format. I can't think of how to do it as a percentage; all I can think of is to use an integer when it finds a match
Possible Commercial (all will be converted to lower case): co, co., inc, inc., etc (verbose versions of each)
I can get an English name list online
Question
Has anyone run into this kind of domain problem before? What methods did you use? Any flashy way of solving this?
Thank You.
I haven't done this before, but some other thoughts:
Check for non-proper nouns (e.g. "and", "the", "piping"). In fact, if you have an English dictionary and a names list, any word that is not a name could be a good pointer to a company name.
A big problem is that some companies are just named after a person(s). "Fred Meyer", "J.C. Penney", and "Lockheed Martin" are examples of companies that look just like human names. There's likely no really good way around this (probably nothing easy anyway). If you can categorize first and last names, a double last name or last name only might be a good reason to lower the certainty.
I would agree with your integer idea. Unless you can do some very broad and very thorough testing, your percentages would probably be meaningless. I would probably run all the tests (returning name, company, or unknown) and compare the results, adding up an integer based on consistency in results.
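A small Python sketch of the integer-score idea. The marker list, the tiny name set and the weights (+2 and -1) are assumptions chosen only to make the example work; real lists and weights would have to come from testing against your own data.

```python
import re

# Hypothetical word lists; real ones would be far larger.
COMMERCIAL_MARKERS = {"co", "inc", "ltd", "llc", "corp",
                      "company", "corporation", "incorporated"}
GIVEN_NAMES = {"bob", "fred", "mary", "tom"}


def classify(text):
    """Return (label, score): positive scores suggest a company, negative a person."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = 0
    for token in tokens:
        if token in COMMERCIAL_MARKERS:
            score += 2  # commercial suffixes are strong evidence
        elif token in GIVEN_NAMES:
            score -= 1  # a given name is weaker evidence for a person
    if score > 0:
        return ("company", score)
    if score < 0:
        return ("person", score)
    return ("unknown", score)


for sample in ["Bob's Company", "Bob Compton", "Acme Piping"]:
    print(sample, classify(sample))
```

Companies named after people, such as "Fred Meyer", will still score as a person here, which is why returning "unknown" and the raw score is more honest than a percentage.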
Can you compare to a database of known company names?
E.g. in the UK: http://wck2.companieshouse.gov.uk
Of course, this doesn't help if it's actually someone's name, but there's a company with the same name.

Accurate algorithm for normalizing taxonomy terms?

I'm developing a shopping comparison website, and the project is at a very advanced stage. We index 50 million products daily using merchant feeds from various affiliate networks. Most of the problems I had are already solved, including the majority of the performance bottlenecks.
What is my problem: Please, first of all, we are using apache solr with drupal BUT, this problem IS NOT specific to drupal or solr, if you do not have knowledge of them, it doesn't matter.
We receive product feeds from over 2000 different merchants, and those feeds are a mess. They have no specific pattern; each merchant sends the feeds the way they want. We have already solved many problems regarding this, but one remains: normalizing the taxonomy terms for the faceted browsing functionality.
Suppose that I have a "Narrow by Brands" browsing facet on my website, and that 100 merchants offer products from Microsoft. Now comes the problem. Some merchants put "Microsoft" in the "Brands" column of the data feed, others "Microsoft, Inc.", others "Microsoft Corporation", others "Products from Microsoft", etc. There is no specific pattern between merchants and, worse, some individual merchants are so sloppy that they have different strings for the same brand IN THE SAME DATA FEED.
We do not want all those different brands appearing in the navigation. We have a manual solution to the problem where we manually map the imported brands to the "good" brands table ("Microsoft Corporation" -> "Microsoft", "Products from Microsoft" -> "Microsoft", etc..). We have something like 10,000 brands in the database and this is doable. The problem is when it comes with bigger things like "Authors". When we import books into the system, there are over 800,000 authors and we have the same problem and this is not doable by hand mapping. The problem is the same: "Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M.", etc...
Does anybody know a good way to automatically solve this problem with an acceptable degree of accuracy (85%-95% accuracy)?
Thank you for the help!
Some ideas that come to mind, although this is just a loose thought:
1. Convert names to initials (in your example: TMA). Treat '-' as a space, so e.g. Antoine de Saint-Exupéry would be ADSE. The problem here is how to treat ","; its common usage is to put the surname before the forename, so just swapping positions should work (A,TM becomes TM,A; drop the comma to get TMA).
2. Filter the authors in the database by those initials.
3. For each initial, if you have the whole name (Tom, Apostol), check whether it matches; otherwise (M.) consider it a match automatically.
4. If you want some tolerance, compare names using Levenshtein distance and tolerate small differences (there is an Oracle implementation).
5. Treat names that match as the same author. To find the whole name, for each initial (T, M, A) look through your filtered authors (after step 2) and try to find one that has the whole name (Mike) rather than just the initial (M.); if you can't find one, use the initial. Each of the examples you gave would then be converted to the same value, the full name (Tom Mike Apostol).
Things worth thinking about:
Include mappings for name synonyms (probably a few hundred records at most, like Thomas <-> Tom).
With this approach it is crucial to have valid initials (no M instead of N, etc.).
edit: I coded something like this some time ago, when I had to identify a person by their signature. Ignoring scanning problems, people sometimes sign as Name S. Surname, or N.S., or just Name Surname (which is another thing you should maybe consider in the solution: allowing the algorithm to ignore the second name, although in your situation I guess it would be rather rare to omit someone's second name).
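The initials idea can be sketched in Python roughly as follows. The function names are invented for this example, and the sketch covers only the initials key and the token-by-token comparison; the Levenshtein tolerance and the synonym table (Thomas <-> Tom) are left out.

```python
import re


def name_tokens(name):
    """Reorder 'Surname, Forenames' to forename-first and split into tokens."""
    if "," in name:
        surname, _, forenames = name.partition(",")
        name = forenames + " " + surname
    # Spaces, hyphens and dots all act as separators.
    return [token for token in re.split(r"[\s\-.]+", name.strip()) if token]


def initials(name):
    """Key used to pre-filter candidate authors, e.g. 'TMA'."""
    return "".join(token[0].upper() for token in name_tokens(name))


def same_author(a, b):
    """Full tokens must be equal; a bare initial matches any name starting with it."""
    tokens_a, tokens_b = name_tokens(a), name_tokens(b)
    if len(tokens_a) != len(tokens_b):
        return False
    for x, y in zip(tokens_a, tokens_b):
        x, y = x.lower(), y.lower()
        if len(x) == 1 or len(y) == 1:
            if x[0] != y[0]:
                return False
        elif x != y:
            return False
    return True


for variant in ["Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M."]:
    print(variant, initials(variant))
print(same_author("Tom Mike Apostol", "Apostol, Tom M."))
```

In a database the initials key would be stored in an indexed column, so that the more expensive comparison only runs on authors sharing the same key.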

I have a list of names, some of them are fake, I need to use NLP and Python 3.1 to keep the real names and throw out the fake names

I have no clue of where to start on this. I've never done any NLP and only programmed in Python 3.1, which I have to use. I'm looking at the site http://www.linkedin.com and I have to gather all of the public profiles and some of them have very fake names, like 'aaaaaa k dudujjek' and I've been told I can use NLP to find the real names, where would I even start?
This is a difficult problem to solve, and one which starts with acquiring valid given name & surname lists.
How large is the set of names that you're evaluating, and where do they come from? These are both important things for you to consider. If you're evaluating a small set of "American" names, your valid name lists will differ greatly from lists of Japanese or Indian names, for instance.
Your idea of scraping LinkedIn is on the right track, but you were right to catch the fake profile/name flaw. A better website would probably be something like IMDB (perhaps scraping names by iterating over different birth years), or Wikipedia's lists of most popular given names and most common surnames.
When it comes down to it, this is a precision vs. recall problem: in order to miss fewer fakes, you're inevitably going to throw out some real names. If you loosen up your restrictions, you'll get more fakes, but you'll also throw out fewer real names.
Several possibilities here, but the most obvious seems to be with HMMs, i.e. Hidden Markov Models. The NLTK kit includes [at least] one module for HMMs, although I must admit I never used it.
Another possible snag is that, AFAIK, NLTK has not yet been ported to Python 3.0.
This said, and while I'm quite keen on using NLP techniques where applicable, I think that a process which would use several paradigms, including some NLP tricks may be a better solution for this particular problem. For example, storing even a reduced dictionary of common family names (and first names) in a traditional database may offer both a more reliable and more computationally efficient way of filtering a significant portion of the input data, leaving precious CPU resources to be spent on less obvious cases.
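A minimal Python sketch of the dictionary pre-filter described above. KNOWN_NAMES stands in for a real given-name and surname list (here a hypothetical handful of entries), and the rule "accept only if every token is known" is an assumption; names that fail it are not declared fake, only set aside for the more expensive checks.

```python
import re

# Hypothetical sample; a real list would hold many thousands of names.
KNOWN_NAMES = {"john", "maria", "smith", "garcia", "ng", "o"}


def name_score(full_name):
    """Fraction of the alphabetic tokens that appear in the known-name list."""
    tokens = re.findall(r"[^\W\d_]+", full_name.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in KNOWN_NAMES)
    return known / len(tokens)


def triage(names, accept=1.0):
    """Split names into those accepted outright and those needing a closer look."""
    accepted, uncertain = [], []
    for name in names:
        (accepted if name_score(name) >= accept else uncertain).append(name)
    return accepted, uncertain


print(triage(["John Smith", "Maria Ng", "aaaaaa k dudujjek"]))
```

Lowering the accept threshold trades precision for recall, which is exactly the trade-off mentioned in the other answer.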
i am afraid this problem is not solvable if your list is even only minimally ‘open’. if the names are eg customers from a small traditionally acting population, you might end up with a few hundred names for thousands of people. but generally you can hardly predict what is a real name and what is not, however unusual an arabic, chinese, or bantu name may look in a sample of, say, south english rural neighborhood names. i mean, ‘Ng’ is a common cantonese surname, and ‘O’ is common in korea, so assumptions may fail. there is this place in austria called ‘fucking’, so even looking out for four letter words is no guarantee for success.
what you could do is work through a sufficiently big sample of such names and sort them out manually. then, use all kinds of text processing tools and collect metrics. maybe you can derive a certain likelihood for a name to be recognized as fake, maybe it will not be viable. you will never go beyond likelihoods here, though.
as an aside, we used to use google maps and the telephone directory for validating customer data years ago. if google maps could find the place, we called the address validated. it is clear that under stricter requirements, true validation must go much further. let’s not forget the validation of such data is much more a social question than a linguistic one.

Algorithms for splitting personal names in parts

I'm looking for references on separating a name such as "John A. Doe" into parts: first=John, middle=A., last=Doe. In Mexico we have paternal and maternal surnames plus first and second given names, and they can be written in different permutations, so the problem is quite complex.
As it depends on data, we are working with matching software that calculates a score for every word so we can take decisions (it is based on a big database). The input data is not clean, it is imported from some government web pages and is human filtered so it could have junk that has to be recognized as well. Any suggestions?
[Edit]
Examples:
name:
Javier Abdul Córdoba Gándara
common permutations (or as it may appear in gvt data referring to same person):
Córdoba Gándara Javier Abdul
Javier A. Córdoba Gándara
Javier Abdul Córdoba G.
paternal=Córdoba
maternal=Gándara
first given:Javier
second given:Abdul
name: María de la Luz Sánchez Martínez
paternal:Sánchez
maternal: Martínez
first given: María de la Luz
name: Paloma Viridiana Alin Arias Medina
paternal: Arias
maternal: Medina
first given: Paloma
second given: Viridiana Alin
As I said, the meaning of each word depends on the score. One has no way of knowing that Viridiana and Alin are given names other than from the score.
We have a very strong database (80 million records or so) so we can get some use of the scoring system. I am designing some algorithm that uses that but looking for other references.
Unfortunately - and having done quite a bit of this work myself - your ideal algorithm will be very data specific, and you will need to work this out for your particular situation.
Of the total time and effort to develop this algorithm, I'd say the time will be split roughly as follows:
10% for general string manipulation
30% for the specific nature of the data (Mexican name formats, data input quirks)
60% to cater for data quality / lack of quality
And I believe that's quite generous towards the general string manipulation. Of course it depends whether you need quality results for all records, or only the 'clean' records etc, and if you are able to ignore the 'difficult' records it makes it a lot simpler.
Some general tips
If they are not required, remove non alphanumeric / whitespace characters
Split on spaces
Use hyphens / punctuation to identify surnames or family names
Initials (which are generally single letters) are not surnames; i.e. they must be first / middle
Determine the level of confidence that you have programmatically identified each name (and test this thoroughly). You may find there are subsets of data that contain similar patterns that need to be catered for individually (they may come from different sources etc)
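A rough Python sketch of the "split on spaces" tip, adapted to the Mexican examples in the question. It assumes the given-names-first order, attaches particles such as "de la" to the word that follows them, and treats the last two units as paternal and maternal surnames. The PARTICLES set and the has_initial flag are illustrative assumptions; the surname-first permutations would still need the score-based approach.

```python
# Hypothetical list of particles that belong to the following word.
PARTICLES = {"de", "del", "la", "las", "los"}


def group_units(name):
    """Split on spaces, gluing particles onto the word that follows them."""
    units, pending = [], []
    for token in name.split():
        if token.lower() in PARTICLES:
            pending.append(token)
        else:
            units.append(" ".join(pending + [token]))
            pending = []
    if pending:  # trailing particles with nothing after them
        units.append(" ".join(pending))
    return units


def split_name(name):
    """Positional split; returns None when there are too few units to decide."""
    units = group_units(name)
    if len(units) < 3:
        return None
    return {
        "given": " ".join(units[:-2]),
        "paternal": units[-2],
        "maternal": units[-1],
        # An initial anywhere lowers confidence in the positional guess.
        "has_initial": any(len(unit.rstrip(".")) == 1 for unit in units),
    }


print(split_name("María de la Luz Sánchez Martínez"))
print(split_name("Javier A. Córdoba Gándara"))
```

Records where has_initial is set, or where split_name returns None, are candidates for the separate handling of "difficult" records mentioned above.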
You may need to add some natural language processing or machine learning as a check. The problem of identifying author names (e.g. in scientific papers) is difficult, as they can be reported with differing orders, degrees of abbreviation, elisions etc. If your database is dirty you will end up with ambiguity whatever you do.

Resources