Let's say I have 4 columns named "Street number", "Street name", "Suburb" and "State", and three plaintext sentences: "10 magic road sunshine VIC", "105 calder street taylors lakes VIC" and "3 new road airdale QLD".
Right now, I add special characters between the sections and extract the text between them, but that's pretty inefficient. Is there any way to sort through the data and extract the required bits for each column without modifying the data?
No, there isn't. You need human brain power to identify whether the words belong in the street name or the suburb, and even humans might need help doing that if they are unfamiliar with the locations. There is no logic that can be applied that magically makes that distinction.
What you are describing is generally referred to as "address breakdown". Companies get paid good money for developing algorithms which automate this process, and those algorithms usually require separators between the different address parts and are based heavily on official address products, such as the UK Postcode Address File (PAF).
However, a possible alternative, provided you don't have too many addresses and don't need to do this operation very often, might be to use the Google Geocoding API.
You feed it an address and the output is XML, with qualified address parts.
There are many ways you could use that, but I quite like to couple it with the Google Sheets ImportXML function.
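As a rough Python counterpart to the ImportXML approach, the sketch below maps typed address components from a geocoding XML response onto the four columns from the question. The sample XML is hand-written for illustration and is only assumed to resemble the real response layout (address_component elements with long_name, short_name and type children); the names COLUMN_FOR_TYPE and address_columns are hypothetical, and no network call is made.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, assumed to resemble a geocoding XML response.
SAMPLE_XML = """
<GeocodeResponse>
  <status>OK</status>
  <result>
    <address_component>
      <long_name>10</long_name>
      <type>street_number</type>
    </address_component>
    <address_component>
      <long_name>Magic Road</long_name>
      <type>route</type>
    </address_component>
    <address_component>
      <long_name>Sunshine</long_name>
      <type>locality</type>
      <type>political</type>
    </address_component>
    <address_component>
      <long_name>Victoria</long_name>
      <short_name>VIC</short_name>
      <type>administrative_area_level_1</type>
      <type>political</type>
    </address_component>
  </result>
</GeocodeResponse>
"""

# Assumed mapping from component type to the column names in the question.
COLUMN_FOR_TYPE = {
    "street_number": "Street number",
    "route": "Street name",
    "locality": "Suburb",
    "administrative_area_level_1": "State",
}


def address_columns(xml_text):
    """Return a dict of column name -> value taken from the typed components."""
    root = ET.fromstring(xml_text.strip())
    columns = {}
    for component in root.iter("address_component"):
        # Prefer the short form (e.g. "VIC") when the component has one.
        value = component.findtext("short_name") or component.findtext("long_name")
        for type_element in component.findall("type"):
            column = COLUMN_FOR_TYPE.get(type_element.text)
            if column:
                columns[column] = value
    return columns
```

Calling address_columns(SAMPLE_XML) gives one value per column; a real response would first need its status checked and may contain several result elements.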
I need to extract the street from an apartment description. I don't necessarily need it with the number (not every listing has it anyway), but it would be appreciated.
MY 'SOLUTIONS'
1. Use a regular expression. But after reading many descriptions I realized that people often omit characteristic markers, such as the word for 'street' before the actual street name.
2. Create a list of possible street names. Because I know which city the apartments are in, I can download every street name for that city, load the list, and then try to match words from the description against it (probably with some kind of hash table). But descriptions can be quite long and street names can consist of several words. Then probably I need to use this solution. Also, the address in a description can have 1-2 characters mistyped.
3. Hand this work over to a third-party map engine. Cut a few words from the string, send them to some map engine (e.g. Google Maps), and if they don't make sense to it, try the next few words, and so on.
Is there any better solution?
I feel like I have to assume things such as the address always being typed correctly; otherwise the complexity will be unacceptably high.
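Option 2 can be sketched in Python with the standard difflib module, which tolerates a mistyped character or two. This is only an illustration under assumptions: find_street, max_words and the 0.85 cutoff are hypothetical choices, and comparing every phrase against every street would be slow for a full city list without some index in front of it.

```python
import difflib


def find_street(description, known_streets, max_words=3, cutoff=0.85):
    """Return the known street that best matches a phrase in the text, or None."""
    by_lowercase = {street.lower(): street for street in known_streets}
    candidates = list(by_lowercase)
    # Lowercase the words and drop surrounding punctuation.
    words = [w.strip(".,;:!?()").lower() for w in description.split()]
    words = [w for w in words if w]
    best_name, best_score = None, 0.0
    # Slide windows of 1..max_words words over the description.
    for size in range(1, max_words + 1):
        for start in range(len(words) - size + 1):
            phrase = " ".join(words[start:start + size])
            close = difflib.get_close_matches(phrase, candidates, n=1, cutoff=cutoff)
            if not close:
                continue
            score = difflib.SequenceMatcher(None, phrase, close[0]).ratio()
            if score > best_score:
                best_name, best_score = by_lowercase[close[0]], score
    return best_name
```

For example, with the made-up list ["Long Street", "Market Square", "Green Lane"], the description "Cozy flat near Markt Square, close to tram" resolves to "Market Square" despite the typo, while a description with no street in it returns None.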
EDIT:
Programming language - I asked it as a generic question, not specific to a language. But I'm using Python.
String format - there is no exact string format, because everyone writes the description as they want. Usually the location is somewhere near the beginning, but that's not always the case.
Make different format - I can't, because I'm scraping this data from different sites.
I have a column which is made up of addresses as shown below.
Address
1 Reid Street, Manchester, M1 2DF
12 Borough Road, London, E12,2FH
15 Jones Street, Newcastle, Tyne & Wear, NE1 3DN
etc .. etc....
I want to split this into different columns to import into my SQL database. I have been trying to use FINDSTRING to separate by the comma but am having trouble when some addresses have more "sections" than others. Any ideas what's the best way to go about this?
Many thanks
This is a requirements specification problem, not an implementation problem. The more you can afford to assume about the format of the addresses, the more detailed parsing you will be able to do; the other side of the same coin is that the less you will assume about the structure of the address, the fewer incorrect parses you will be blamed for.
It is crucial to determine whether you will only need to process UK postal addresses, or whether worldwide addresses may occur.
Based on your examples, certain parts of the address seem to be always present, but please check this resource to determine whether they are really required in all UK addresses.
If you find a match between the depth of parsing that you need, and the assumptions that you can safely make, you should be able to keep parsing by comma indexes (FINDSTRING); determine some components starting from the left, and some starting from the right of the string; and keep all that remains as an unparsed body.
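The left/right/remainder idea can be sketched in Python as below (the question uses FINDSTRING in SSIS; this only illustrates the logic). It assumes the first comma-separated part is the street line and the last is the postcode, which is exactly the kind of assumption that has to be checked against the real data; split_by_commas is a hypothetical name.

```python
def split_by_commas(raw_address):
    """Take the street from the left, the postcode from the right, keep the rest."""
    parts = [part.strip() for part in raw_address.split(",") if part.strip()]
    result = {"street": None, "postcode": None, "unparsed": []}
    if parts:
        result["street"] = parts.pop(0)   # assumed: first part is the street line
    if parts:
        result["postcode"] = parts.pop()  # assumed: last part is the postcode
    result["unparsed"] = parts            # whatever remains stays unparsed
    return result
```

For "15 Jones Street, Newcastle, Tyne & Wear, NE1 3DN" this leaves ["Newcastle", "Tyne & Wear"] as the unparsed body. A row like "12 Borough Road, London, E12,2FH" would come out with "2FH" as the postcode, which shows why the assumptions need checking before relying on the split.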
It may also well happen that you will find that your current task is a mission impossible, especially in connection with international postal addresses. This is why most websites and other data collectors require the entry of postal address in an already parsed form by the user.
Excellent points raised by Hanika. Some of your parsing will depend on what your target destination looks like. As an ignorant yank, based on Hanika's link, I'd think your output would look something like
Addressee
Organisation
BuildingName
BuildingAddress
Locality
PostTown
Postcode
BasicsMet (boolean indicating whether minimum criteria for a good address has been met.)
In the US, just because an address could not be properly CASSed doesn't mean it couldn't be delivered. Case in point: my grandparents-in-law live in a small enough town that specifying their name and city is sufficient for delivery, as local postal officials know who they are. For bulk mailings, though, their address would not qualify for the bulk mailing rate and would default to first class mailing. I assume a similar scenario exists for UK mail.
The general idea is for each row flowing through, you'll want to do your best to parse the data out into those buckets. The optimal solution for getting it "right" is to change the data entry method to validate and capture data into those discrete buckets. Since optimal never happens, it becomes your task to sort through the dross to find your gold.
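As an illustration of those buckets, here is a minimal Python sketch of a target record (the actual answer suggests .NET inside a Script Transformation). The field names follow the list above; the rule behind basics_met is purely an assumption for the example, not the official minimum for a UK address.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedAddress:
    """One parsed row; any bucket that could not be filled stays None."""
    addressee: Optional[str] = None
    organisation: Optional[str] = None
    building_name: Optional[str] = None
    building_address: Optional[str] = None
    locality: Optional[str] = None
    post_town: Optional[str] = None
    postcode: Optional[str] = None

    @property
    def basics_met(self) -> bool:
        # Assumed minimum: something identifying the building,
        # plus a post town and a postcode.
        has_building = bool(self.building_name or self.building_address)
        return has_building and bool(self.post_town) and bool(self.postcode)
```

Rows where basics_met is False are the ones worth routing to a person for review.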
Whilst you can write some fantastic expressions with FINDSTRING, I'd advise against it in this case, as maintenance alone will drive you mad. Instead, add a Script Transformation and build your parsing logic in .NET (VB or C#). There will then be a cycle of running data through your transformation and having someone eyeball the results. If you find a new scenario, you go back and adjust your business rules. It's ugly, it's iterative and it's prone to producing results that a human wouldn't have.
Alternatives to rolling your own address standardisation logic
Buy it. Eventually your business needs outpace your ability to cope with constantly changing business rules. There are plenty of vendors out there but I'm only familiar with US based ones.
Upgrade to SQL Server 2012 to use DQS (Data Quality Services). You'll probably still need to buy a product to build out your knowledge base but you could offload the business rule making task to a domain expert ("Hey you, you make peanuts an hour. Make sure all the addresses coming out of this look like addresses" was how they covered this in the beginning of one of my jobs).
I am using the region and city in urls for this project. Now many regions and cities might have very long names while also combinations of shortened region/city information might lead to ambiguity.
Is there an easy approach to automatically shorten words so that they still make sense and are readable, without just cutting the end off?
Like turning Bremerhaven into Brmhvn or New Haven, Connecticut to NewHvn-Cnctct?
There are 2797245 cities in the list of world cities freely available from http://www.maxmind.com/app/worldcities
I would design the URL pattern similar to
{Country TLD}/{Abbreviation for state, province or prefecture}/{Trim of county or district}/{Trim of city}
Some examples,
www.example.com/US/NY/-/NewYork
www.example.com/US/NY/Westcheste/MtVernon (Mount Vernon in Westchester County, New York)
- county trimmed to its first 10 characters; common words in city names are also abbreviated
www.example.com/DE/Bavaria/Munich
www.example.com/JP/Tohoku/Miyagi/Sendai (Sendai)
For the region, you may want to consider using the ISO 3166 code. So, the above examples for Munich and Sendai would look like
www.example.com/JP/Tohoku/JP-04/Sendai
www.example.com/DE/DE-BY/Munich
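The pattern above could be assembled with a small Python helper like this. It is a sketch only: location_path and county_limit are hypothetical names, the "-" placeholder for a missing county follows the New York example, and abbreviating words such as "Mount" to "Mt" is left to the caller.

```python
from urllib.parse import quote


def location_path(country, region, county, city, county_limit=10):
    """Build '{country}/{region}/{county}/{city}' with a trimmed county."""
    # Remove spaces, then keep only the first county_limit characters.
    county_part = county.replace(" ", "")[:county_limit] if county else "-"
    city_part = city.replace(" ", "")
    segments = [country, region, county_part, city_part]
    # Percent-encode each segment so non-ASCII names stay valid in a URL.
    return "/".join(quote(segment, safe="") for segment in segments)
```

For instance, location_path("US", "NY", "Westchester", "Mt Vernon") gives "US/NY/Westcheste/MtVernon".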
Other leads
HASC (Hierarchical administrative subdivision codes), which is used to represent names of country subdivisions, such as states, provinces and regions.
http://en.wikipedia.org/wiki/ISO_3166-2:US
http://www.statoids.com/uus.html
Hope it helps.
You're probably better advised to adopt an existing list of codes than to make up your own. For example you could use IATA codes or zip/postal codes, even telephone dialling codes (find your own link for these).
If you want the shorter version to make sense for humans, I think this is incredibly complex, as it's not very obvious what will represent the full name properly.
Example: my own city of Helsingborg. Given just this name I would split it down to one letter per syllable: Hel-Sing-Borg -> HSB.
But I have never once seen anyone use this acronym. Everyone I know uses HBG.
In short, I would say it's fairly easy to make a function that makes a logical acronym for any given name, but very hard to make one that is recognizable to a human.
If you just want to crop out some letters from a name, that would probably be a lot easier, but you'd probably want to talk to an English professor to understand which parts of names you can cut out without affecting readability. But it is possible, and there are most likely many publicly available studies on how we read words that you could reference.
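A minimal Python sketch of the "crop out some letters" idea: keep the first letter of each word and drop the remaining vowels. The rule and the name shorten_place are assumptions for illustration; the result is not guaranteed to be recognizable, and different names can collapse to the same short form.

```python
VOWELS = set("aeiou")


def drop_inner_vowels(word):
    """Keep the first letter, then drop every later vowel."""
    if not word:
        return word
    return word[0] + "".join(c for c in word[1:] if c.lower() not in VOWELS)


def shorten_place(name):
    """Shorten each whitespace-separated word and join the results."""
    return "".join(drop_inner_vowels(word) for word in name.split())
```

With this rule "Bremerhaven" becomes "Brmrhvn" and "New Haven" becomes "NwHvn", which is close to, but not the same as, the hand-made examples in the question.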
I'm developing a shopping comparison website, and the project is at a very advanced stage. We index 50 million products daily using merchant feeds from various affiliate networks. Most of the problems I had are already solved, including the majority of the performance bottlenecks.
My problem: first of all, we are using Apache Solr with Drupal, BUT this problem IS NOT specific to Drupal or Solr; if you have no knowledge of them, it doesn't matter.
We receive product feeds from over 2000 different merchants, and those feeds are a mess. They have no specific pattern; each merchant sends the feeds the way they want. We have already solved many problems regarding this, but one remains: normalizing the taxonomy terms for the faceted browsing functionality.
Suppose that I have a "Narrow by Brands" browsing facet on my website. Now suppose that 100 merchants offer products from Microsoft. Now comes the problem. Some merchants put "Microsoft" in the "Brands" column of the data feed, others "Microsoft, Inc.", others "Microsoft Corporation", others "Products from Microsoft", etc. There is no specific pattern between merchants and, worse, some individual merchants are so sloppy that they have different strings for the same brand IN THE SAME DATA FEED.
We do not want all those different brands appearing in the navigation. We have a manual solution to the problem where we manually map the imported brands to the "good" brands table ("Microsoft Corporation" -> "Microsoft", "Products from Microsoft" -> "Microsoft", etc.). We have something like 10,000 brands in the database and this is doable. The problem comes with bigger things like "Authors". When we import books into the system, there are over 800,000 authors and we have the same problem, and this is not doable by hand mapping. The problem is the same: "Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M.", etc.
Does anybody know a good way to automatically solve this problem with an acceptable degree of accuracy (85%-95% accuracy)?
Thank you for the help!
An idea that comes to my mind, although it's just a loose thought:
1. Convert names to initials (in your example: TMA). Treat '-' as a space, so e.g. Antoine de Saint-Exupéry would be ADSE. The problem here is how to treat ",", although its common usage is to put the surname before the forename, so just swapping positions should work (A,TM becomes TM,A; drop the comma to get TMA).
2. Filter the authors in the database by those initials.
3. For each initial, if you have the whole name (Tom, Apostol), check whether it matches; otherwise (M.) consider it a match automatically.
4. If you want some tolerance, you can compare names with Levenshtein distance and tolerate some differences (here you have an Oracle implementation).
5. Treat names that match as the same author. To find the whole name, for each initial (T, M, A) look through your filtered authors (after step 2) and try to find one with the whole name (Mike) rather than just the initial (M.); if you can't find one, use the initial. Each of the examples you gave would therefore be converted to the same value, the full name (Tom Mike Apostol).
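The initials comparison described above can be sketched in Python roughly as follows. It assumes a comma always means "Surname, Forenames" and that both names have the same number of parts; name_tokens, initials and same_author are hypothetical helper names, and the database filtering step is left out.

```python
def name_tokens(raw):
    """Lowercased name parts in forename-first order, without trailing dots."""
    if "," in raw:
        # Assumed convention: "Surname, Forenames" -> "Forenames Surname".
        surname, _, forenames = raw.partition(",")
        raw = forenames + " " + surname
    raw = raw.replace("-", " ")  # treat hyphens as spaces
    return [t.strip(".").lower() for t in raw.split() if t.strip(".")]


def initials(raw):
    """Initials key used to pre-filter candidates, e.g. 'TMA'."""
    return "".join(token[0] for token in name_tokens(raw)).upper()


def same_author(a, b):
    """Compare part by part; a bare initial matches any name starting with it."""
    tokens_a, tokens_b = name_tokens(a), name_tokens(b)
    if len(tokens_a) != len(tokens_b):
        return False
    for x, y in zip(tokens_a, tokens_b):
        if len(x) == 1 or len(y) == 1:
            if x[0] != y[0]:
                return False
        elif x != y:
            return False
    return True
```

Here "Tom Mike Apostol", "Tom M. Apostol" and "Apostol, Tom M." all share the key "TMA" and compare as the same author, while "Tom N. Apostol" does not.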
Things that are worth to think about:
Include mappings for name synonyms (most likely hundreds of records at most, like Thomas <-> Tom).
With this approach it is crucial to have valid initials (no M instead of N, etc.).
edit: I coded something like this some time ago, when I had to identify a person by their signature, ignoring scanning problems. People sometimes sign as Name S. Surname, or N.S., or just Name Surname (which is another thing you should perhaps consider in the solution: allowing the algorithm to ignore the second name, although in your situation I guess it would be rather rare for someone's second name to be omitted).
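For the Levenshtein tolerance mentioned in this answer, a plain Python version of the standard dynamic-programming edit distance is sketched below (the answer itself points to an Oracle implementation). How many edits to tolerate is a judgment call that depends on the data.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]
```

A check such as levenshtein("apostol", "apostal") <= 1 would then accept a single mistyped letter in a surname.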
I'm looking for references on separating a name: "John A. Doe" in parts, first=John, middle=A., last=Doe. In Mexico we have paternal, maternal, first and second given names, and can be written in different permutations, so the problem is quite complex.
As it depends on data, we are working with matching software that calculates a score for every word so we can take decisions (it is based on a big database). The input data is not clean, it is imported from some government web pages and is human filtered so it could have junk that has to be recognized as well. Any suggestions?
[Edit]
Examples:
name:
Javier Abdul Córdoba Gándara
common permutations (or as it may appear in gvt data referring to same person):
Córdoba Gándara Javier Abdul
Javier A. Córdoba Gándara
Javier Abdul Córdoba G.
paternal=Córdoba
maternal=Gándara
first given:Javier
second given:Abdul
name: María de la Luz Sánchez Martínez
paternal:Sánchez
maternal: Martínez
first given: María de la Luz
name: Paloma Viridiana Alin Arias Medina
paternal: Arias
maternal: Medina
first given: Paloma
second given: Viridiana Alin
As I said, the meaning of each word depends on the score. One has no way of knowing that Viridiana and Alin are given names if not from the score.
We have a very strong database (80 million records or so) so we can get some use of the scoring system. I am designing some algorithm that uses that but looking for other references.
Unfortunately - and having done quite a bit of this work myself - your ideal algorithm will be very data specific, and you will need to work this out for your particular situation.
Of the total time and effort to develop this algorithm, I'd say the time will be split roughly as follows:
10% for general string manipulation
30% for the specific nature of the data (Mexican name formats, data input quirks)
60% to cater for data quality / lack of quality
And I believe that's quite generous towards the general string manipulation. Of course it depends whether you need quality results for all records, or only the 'clean' records etc, and if you are able to ignore the 'difficult' records it makes it a lot simpler.
Some general tips
If they are not required, remove non alphanumeric / whitespace characters
Split on spaces
Use hyphens / punctuation to identify surnames or family names
Initials (which are generally single letters) are not surnames; i.e. they must be first / middle
Determine the level of confidence with which you have programmatically identified each name (and test this thoroughly). You may find there are subsets of data that contain similar patterns that need to be catered for individually (they may come from different sources etc)
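These tips can be combined into a first-pass Python sketch like the one below. It assumes the order "given names, paternal, maternal" and uses the initials rule only to lower confidence; split_name and its output keys are hypothetical, and it does not replace the score database described in the question.

```python
import re


def is_initial(token):
    """True for single letters such as 'A' or 'G.'."""
    return len(token.rstrip(".")) == 1


def split_name(raw):
    """Assume 'given names, paternal, maternal' order and flag doubtful cases."""
    # Replace anything that is not a letter, digit, whitespace or dot.
    tokens = re.sub(r"[^\w\s.]", " ", raw).split()
    if len(tokens) < 3:
        return {"given": tokens, "paternal": None, "maternal": None,
                "confident": False}
    given, paternal, maternal = tokens[:-2], tokens[-2], tokens[-1]
    # An initial in a surname position means the assumed order is doubtful.
    confident = not (is_initial(paternal) or is_initial(maternal))
    return {"given": given, "paternal": paternal, "maternal": maternal,
            "confident": confident}
```

"Javier Abdul Córdoba Gándara" splits as expected, while "Javier Abdul Córdoba G." is flagged as not confident because a surname position holds an initial. A surname-first permutation such as "Córdoba Gándara Javier Abdul" would be split wrongly without any flag, which is where the word scores have to come in.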
You may need to add some natural language or machine learning to check. The problem of identifying author names (e.g. in scientific papers) is difficult as they can be reported with differing orders, degrees of abbreviation, elisions etc. If your database is dirty you will end with ambiguity whatever you do.