How to efficiently extract street from a string - string

I need to extract the street name from an apartment description. I don't necessarily need the house number (not every listing has one anyway), but it would be nice to have.
MY 'SOLUTIONS'
1. Use a regular expression. But after reading many descriptions I realized that people often omit the characteristic markers, such as the word for 'street' before the actual street name.
2. Create a list of possible street names. Because I know which city the apartments are in, I can download every street name for that city, load the list, and try to match words from the description against it (probably with some kind of hash table). But descriptions can be quite long and street names can consist of several words. Still, I will probably need to use this solution. Also, the address in a description can have 1-2 mistyped characters.
3. Hand this work over to a third-party map engine. Cut a few words from the string and send them to a map engine (e.g. Google Maps); if they don't make sense to it, try the next few words, and so on.
Is there any better solution?
I feel like I have to assume things such as the address always being typed correctly; otherwise the complexity will be unacceptably high.
EDIT:
Programming language - I asked this as a generic question, not specific to a language, but I'm using Python.
String format - there is no exact string format, because everyone writes the description however they want. Usually the location is somewhere near the beginning, but that's not always the case.
Different format - I can't change the format, because I'm scraping this data from different sites.
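A minimal sketch of option 2 in Python, assuming the city's street names are already loaded into a list. The function name `find_street`, the window size `max_words` and the similarity `cutoff` are hypothetical choices for illustration, not tuned values. Exact hits use a dictionary lookup; windows with a typo or two fall back to the standard library's `difflib`.

```python
import difflib
import re


def find_street(description, streets, max_words=3, cutoff=0.85):
    """Return the best-matching known street name in description, or None."""
    # Normalise the street list once: lowercase key -> original spelling.
    known = {s.lower(): s for s in streets}
    keys = list(known)
    words = re.findall(r"\w+", description.lower())
    best, best_score = None, 0.0
    # Try longer word windows first so multi-word streets win over their parts.
    for size in range(max_words, 0, -1):
        for i in range(len(words) - size + 1):
            candidate = " ".join(words[i:i + size])
            if candidate in known:  # exact hit: dictionary lookup
                return known[candidate]
            # Fuzzy fallback that tolerates a few mistyped characters.
            close = difflib.get_close_matches(candidate, keys, n=1, cutoff=cutoff)
            if close:
                score = difflib.SequenceMatcher(None, candidate, close[0]).ratio()
                if score > best_score:
                    best, best_score = known[close[0]], score
    return best


streets = ["Long Street", "Market Square", "Oak Avenue"]
print(find_street("Sunny flat near Markte Square, 2 rooms", streets))
```

The fuzzy pass compares every window against every street, so for a full city list it would likely need a cheaper pre-filter (for example, only comparing streets that share a first letter or a word with the window).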

Related

Excel - Extracting parts of a string with an undefined amount of characters?

Let's say I have 4 columns named "Street number", "Street name", "Suburb" and "State", and three plaintext sentences: "10 magic road sunshine VIC", "105 calder street taylors lakes VIC" and "3 new road airdale QLD".
Right now, I add special characters between the sections and extract the text between them, but that's pretty inefficient. Is there any way to sort through the data and extract the required bits for each column without modifying the data?
No, there isn't. You need human brain power to identify whether the words belong in the street name or the suburb, and even humans might need help doing that if they are unfamiliar with the locations. There is no logic that can be applied that magically makes that distinction.
What you are describing is generally referred to as "address breakdown". Companies get paid good money for developing algorithms that automate this process, and those algorithms usually require separators between the different address parts and are based heavily on official address products, such as the UK Postcode Address File (PAF).
However, a possible alternative, provided you don't have too many addresses and don't need to do this operation very often, might be to use the Google Geocoding API.
You feed it an address and the output is XML with qualified address parts: Example
There are many ways you could use that, but I quite like to couple it with the Google Sheets ImportXML function, like so:

Methods for extracting locations from text?

What are the recommended methods for extracting locations from free text?
What I can think of is to use regex rules like "words ... in location". But are there better approaches than this?
I can also think of keeping a lookup hash table with the names of countries and cities and then comparing every token extracted from the text against that table.
Does anybody know of better approaches?
Edit: I'm trying to extract locations from tweet text, so the high volume of tweets might also affect my choice of method.
All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)
This problem is called Named Entity Recognition. Location is one of the 3 most studied classes (with Person and Organization). Stanford NLP has an open source Java implementation that is extremely powerful: http://nlp.stanford.edu/software/CRF-NER.shtml
You can easily find implementations in other programming languages.
Put all of your valid locations into a sorted list. If you are planning on comparing case-insensitive, make sure the case of your list already is normalized.
Then all you have to do is loop over individual "words" in your input text and at the start of each new word, start a new binary search in your location list. As soon as you find a no-match, you can skip the entire word and proceed with the next.
Possible problem: multi-word locations such as "New York", "3rd Street", "People's Republic of China". Perhaps all it takes, though, is to save the position of the first new word, if you find your bsearch leads you to a (possible!) multi-word result. Then, if the full comparison fails -- possibly several words later -- all you have to do is revert to this 'next' word, in relation to the previous one where you started.
As to what a "word" is: while you are preparing your location list, make a list of all characters that may appear inside locations. Only phrases that contain characters from this list can be considered a valid 'word'.
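The sorted-list idea above can be sketched in Python with the standard `bisect` module. This is one possible reading of the description, not a definitive implementation: the function name `find_locations` is hypothetical, and the tokenising regex (letters, digits and apostrophes) is an assumption standing in for the "characters that may appear inside locations" list.

```python
import bisect
import re


def find_locations(text, locations):
    """Scan text word by word, matching against a sorted list via binary search."""
    names = sorted(loc.lower() for loc in locations)  # case-normalised, sorted
    words = re.findall(r"[\w']+", text.lower())
    found, i = [], 0
    while i < len(words):
        phrase, match_end, j = "", None, i
        while j < len(words):
            phrase = words[j] if not phrase else phrase + " " + words[j]
            pos = bisect.bisect_left(names, phrase)
            # Stop extending once no entry in the list starts with the phrase.
            if pos == len(names) or not names[pos].startswith(phrase):
                break
            if names[pos] == phrase:
                match_end = j  # remember the longest full match so far
            j += 1
        if match_end is None:
            i += 1  # no match: go back to the word after the starting one
        else:
            found.append(" ".join(words[i:match_end + 1]))
            i = match_end + 1
    return found


places = ["New York", "3rd Street", "Newark", "People's Republic of China"]
print(find_locations("Flew from New York to the People's Republic of China.", places))
```

Each word costs one binary search, and a multi-word candidate is only extended while some list entry still starts with the phrase built so far, which is the "possible multi-word result" case described above.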
How fast are the tweets coming in? That is, is it the full Twitter firehose or some filtered queries?
A somewhat more sophisticated approach, similar to what you described, is to use an NLP tool that is integrated with a gazetteer.
Very few NLP tools will keep up with Twitter rates, and very few do well with Twitter text because of all the leet speak. The NLP can be tuned for precision or recall depending on your needs, to limit the number of lookups performed in the gazetteer.
I recommend looking at Rosoka (also Rosoka Cloud through Amazon AWS) and GeoGravy.

Algorithm to shorten city names to human readable codes

I am using the region and city in urls for this project. Now many regions and cities might have very long names while also combinations of shortened region/city information might lead to ambiguity.
Is there an easy approach to automatically shorten words so that they still make sense and are readable, without just cutting the end off?
Like turning Bremerhaven into Brmhvn or New Haven, Connecticut to NewHvn-Cnctct?
There are 2797245 cities in the list of world cities freely available from http://www.maxmind.com/app/worldcities
I would design the URL pattern similar to
{Country TLD}/{Abbreviation for state, province or prefecture}/{Trim of county or district}/{Trim of city}
Some examples,
www.example.com/US/NY/-/NewYork
www.example.com/US/NY/Westcheste/MtVernon (Mount Vernon in Westchester County, New York)
- county trimmed to 10 first characters. Also common words in city names abbreviated
www.example.com/DE/Bavaria/Munich
www.example.com/JP/Tohoku/Miyagi/Sendai (Sendai)
For the region, you may want to consider using the ISO 3166 code. So, the above examples for Munich and Sendai would look like
www.example.com/JP/Tohoku/JP-04/Sendai
www.example.com/DE/DE-BY/Munich
Other leads
HASC (Hierarchical administrative subdivision codes), which is used to represent names of country subdivisions, such as states, provinces and regions.
http://en.wikipedia.org/wiki/ISO_3166-2:US
http://www.statoids.com/uus.html
Hope it helps.
You're probably better advised to adopt an existing list of codes than to make up your own. For example you could use IATA codes or zip/postal codes, even telephone dialling codes (find your own link for these).
If you want the shorter version to make sense to humans, I think this is incredibly complex, as it's not very obvious what will represent the full name properly.
Example: my own city of Helsingborg. Given just this name, I would split it down to one letter per syllable. Hel-Sing-Borg -> HSB.
But I have never once seen anyone use this acronym. Everyone I know uses HBG.
In short, I would say it's fairly easy to make a function that produces a logical acronym for any given name, but very hard to make one that is recognizable to a human.
If you just want to crop some letters out of a name, that would probably be a lot easier, but you'd probably want to talk to an English professor to understand which parts of names you can cut out without affecting readability. But it is possible, and there are most likely many publicly available studies on how we read words that you could reference.
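As a rough illustration of "cropping out some letters", here is a minimal Python sketch that keeps the first letter of each word and drops the remaining vowels. The function name `squeeze` and the vowel-dropping rule are assumptions chosen for illustration; nothing here guarantees that the result is readable or unique.

```python
import re


def squeeze(name):
    """Keep each word's first letter and drop the remaining vowels."""
    words = re.findall(r"[^\W\d_]+", name)  # runs of letters only
    return "".join(
        w[0].upper() + re.sub(r"[aeiou]", "", w[1:].lower()) for w in words
    )


print(squeeze("Bremerhaven"))  # Brmrhvn
print(squeeze("New Haven"))    # NwHvn
```

Note that this gives Brmrhvn rather than the hand-made Brmhvn from the question, and different names can collapse to the same code, so a collision check against the full city list would still be needed.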

Heuristic to predict Name or Company

Problem
We are receiving strings that may represent either a company name or a person's name. We need a heuristic to determine which.
Initial thoughts
Use an XML doc with either a <Commercial>String</Commercial> or a <Personal>String</Personal> node and score matching strings +1 (sorry, I don't know how to format XML on SO)
Can't just check for proper nouns, e.g. Bob's Company is a company whereas Bob Compton is a name
Need to return confidence level in some format. I can't think of how to do it as a percentage, all I can think to do is if it finds a match use an integer
Possible Commercial (all will be converted to lower case): co, co., inc, inc., etc (verbose versions of each)
I can get an English name list online
Question
Has anyone run into this kind of domain problem before? What methods did you use? Any flashy way of solving this?
Thank You.
I haven't done this before, but some other thoughts:
Check for non-proper nouns (e.g. "and", "the", "piping"). In fact, if you have an English dictionary and a names list, any word that is not a name could be a good pointer to a company name.
A big problem is that some companies are just named after a person(s). "Fred Meyer", "J.C. Penney", and "Lockheed Martin" are examples of companies that look just like human names. There's likely no really good way around this (probably nothing easy anyway). If you can categorize first and last names, a double last name or last name only might be a good reason to lower the certainty.
I would agree with your integer idea. Unless you can do some very broad and very thorough testing, your percentages would probably be meaningless. I would probably run all the tests (returning name, company, or unknown) and compare the results, adding up an integer based on consistency in results.
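A minimal Python sketch of the integer-score idea, under stated assumptions: the three word sets below are tiny hypothetical stand-ins for a real list of company markers, an English dictionary and a names list, and the weights (+2, +1, -1) are arbitrary. A positive total leans towards "company", a negative one towards "person".

```python
import re

# Hypothetical sample data; real lists would be far larger.
COMPANY_MARKERS = {"co", "inc", "company", "incorporated"}
COMMON_WORDS = {"and", "the", "of", "piping"}
GIVEN_NAMES = {"bob", "fred", "john", "mary"}


def classify(text):
    """Return (label, score); the score is a plain integer, not a probability."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = 0
    for token in tokens:
        if token in COMPANY_MARKERS:
            score += 2  # explicit commercial marker
        elif token in COMMON_WORDS:
            score += 1  # non-name dictionary word hints at a company
        elif token in GIVEN_NAMES:
            score -= 1  # known given name hints at a person
    if score > 0:
        return "company", score
    if score < 0:
        return "person", score
    return "unknown", score


print(classify("Bob's Company"))
print(classify("Bob Compton"))
```

With these sample lists "Fred Meyer" is classified as a person, which is exactly the named-after-a-person weakness described above.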
Can you compare to a database of known company names?
E.g. in the UK: http://wck2.companieshouse.gov.uk
Of course, this doesn't help if it's actually someone's name, but there's a company with the same name.

Algorithms for splitting personal names in parts

I'm looking for references on separating a name such as "John A. Doe" into parts: first=John, middle=A., last=Doe. In Mexico we have paternal and maternal surnames plus first and second given names, and they can be written in different permutations, so the problem is quite complex.
As it depends on the data, we are working with matching software that calculates a score for every word so we can make decisions (it is based on a big database). The input data is not clean; it is imported from some government web pages and is filtered by humans, so it could contain junk that has to be recognized as well. Any suggestions?
[Edit]
Examples:
name: Javier Abdul Córdoba Gándara
common permutations (as they may appear in government data referring to the same person):
Córdoba Gándara Javier Abdul
Javier A. Córdoba Gándara
Javier Abdul Córdoba G.
paternal: Córdoba
maternal: Gándara
first given: Javier
second given: Abdul
name: María de la Luz Sánchez Martínez
paternal: Sánchez
maternal: Martínez
first given: María de la Luz
name: Paloma Viridiana Alin Arias Medina
paternal: Arias
maternal: Medina
first given: Paloma
second given: Viridiana Alin
As I said, the meaning of each word depends on the score. One has no way of knowing that Viridiana and Alin are given names other than from the score.
We have a very strong database (80 million records or so) so we can get some use of the scoring system. I am designing some algorithm that uses that but looking for other references.
Unfortunately - and having done quite a bit of this work myself - your ideal algorithm will be very data specific, and you will need to work this out for your particular situation.
Of the total time and effort to develop this algorithm, I'd say the time will be split roughly as follows:
10% for general string manipulation
30% for the specific nature of the data (Mexican name formats, data input quirks)
60% to cater for data quality / lack of quality
And I believe that's quite generous towards the general string manipulation. Of course it depends whether you need quality results for all records, or only the 'clean' records etc, and if you are able to ignore the 'difficult' records it makes it a lot simpler.
Some general tips
If they are not required, remove non alphanumeric / whitespace characters
Split on spaces
Use hyphens / punctuation to identify surnames or family names
Initials (which are generally single letters) are not surnames; i.e. they must be first / middle
Determine the level of confidence that you have programmatically identified each name (and test this thoroughly). You may find there are subsets of data that contain similar patterns that need to be catered for individually (they may come from different sources etc)
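The first few tips can be sketched in Python as a pre-processing pass. This is only a sketch: the function name `tag_tokens` and the three labels are hypothetical, and deciding which role each remaining word plays would still be left to a data-driven score. Note that the question's own data contains abbreviated surnames ("Córdoba G."), so the "initials are not surnames" rule may not hold for that data set; the code only flags initials and leaves their role open.

```python
import re
import unicodedata


def tag_tokens(raw_name):
    """Clean a raw name string and give each token a rough structural label."""
    # Compose accented letters so that "ó" is one character, not "o" + accent.
    text = unicodedata.normalize("NFC", raw_name)
    # Drop everything except letters, digits, whitespace, periods and hyphens.
    cleaned = re.sub(r"[^\w\s.\-]", " ", text)
    tagged = []
    for token in cleaned.split():  # split on whitespace
        bare = token.rstrip(".")
        if not bare:
            continue  # token was punctuation only
        if len(bare) == 1 and bare.isalpha():
            kind = "initial"
        elif "-" in bare:
            kind = "surname"  # hyphenated words are treated as family names
        else:
            kind = "unknown"  # left for the scoring step to decide
        tagged.append((bare, kind))
    return tagged


print(tag_tokens("Smith-Jones, J."))
```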
You may need to add some natural language or machine learning to check. The problem of identifying author names (e.g. in scientific papers) is difficult as they can be reported with differing orders, degrees of abbreviation, elisions etc. If your database is dirty you will end with ambiguity whatever you do.
