Geocoding street addresses with no suffixes - python-3.x

Situation:
I have been tasked with geocoding and plotting addresses to a map of a city for a friend of the family.
I have a list of over 400 addresses. Included in those addresses are PO Boxes and addresses with Street Number, Direction, Street Name, Street Suffix (some do not have this), City, and Zip Code.
I tried to geocode all of the addresses with Geopy and Nominatim.
Issue:
I noticed that those addresses without street suffixes and PO Boxes could not be geocoded.
What I have done:
I have read most posts dealing with addresses, read the Geopy notes and google searched until the cows came home.
I ended up stumbling across a geocoding best-practices page which explained that PO Boxes cannot be mapped and that the street suffix is required for mapping:
http://www.gis.harvard.edu/services/blog/geocoding-best-practices
Question:
Is there a way to search for the street suffix of each street that is missing a street suffix?
Is there another free service or library, other than Nominatim and Geopy, that can work with the information I have and not require me to look up each individual street suffix in Google Maps?
Please advise!

I found that using Geopy with Google's API can find the correct addresses that services like Nominatim, OpenCage and OpenMapquest will not find.
There is one downside: autocomplete can make it hard to determine whether the returned address is actually the one you asked for.
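A minimal sketch of that setup, assuming geopy's GoogleV3 geocoder and a placeholder API key. The token_overlap helper is my own illustrative guard against the autocomplete drift described above, not part of geopy:

```python
def token_overlap(query, result_address):
    """Fraction of the query's tokens that reappear in the geocoder's result.

    A low score suggests autocomplete drifted to a different address than
    the one asked about.
    """
    q = {t.strip(",.").lower() for t in query.split()}
    r = {t.strip(",.").lower() for t in result_address.split()}
    return len(q & r) / len(q) if q else 0.0

def geocode_checked(address, api_key, min_overlap=0.5):
    """Geocode via Google and sanity-check the result against the query."""
    from geopy.geocoders import GoogleV3   # pip install geopy
    location = GoogleV3(api_key=api_key).geocode(address)
    if location is None or token_overlap(address, location.address) < min_overlap:
        return None          # no hit, or autocomplete likely wandered off
    return location
```

With a real key, geocode_checked would return None rather than a confidently wrong match when the completed address shares too few tokens with the input.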

First, speaking to the need to find an address that is missing a street suffix, you need to use address completion from an address validation service. Services that do address validation/verification use postal service data (and other data) and match address search inputs to real addresses. If the search input is not sufficiently specific, address validation services may return a handful of potential matches. Here is an example of a non-specific address (missing the State, zip code, and the street suffix) that returns two real addresses that match the search input. SmartyStreets can normally fill in the missing street suffix.
Second, speaking to the PO Box problem: some address services can give you geocode information, as well as other information that you may believe isn't available. For instance, this search shows the SmartyStreets service matching a PO Box number (that I just made up) to the local post office. The latitude and longitude in the response JSON corresponds to the post office when I search it on Google Maps.
Third, speaking to the problem of having a list of addresses: there are various address services that allow batch processes. For instance, it's a fairly common feature to allow a user to upload a spreadsheet of addresses. Here is the information page for SmartyStreets' tool.
There are multiple address services that can help you do all or some of these things. Depending on the service, they will provide some free functionality or have free tiers if you don't do very many searches. I am not aware of a service that does everything you need for free. You could probably use a few services together, like the Google Maps API to Geopy, etc, but it would take effort to code up a script to put them all together.
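A sketch of that glue script, under the assumption that each service is wrapped as a geopy geocoder instance (Nominatim, OpenCage, GoogleV3, etc.); geocode_with_fallback is a hypothetical helper name:

```python
def geocode_with_fallback(address, geocoders):
    """Try each geocoder in turn and return the first hit.

    `geocoders` is a list of geopy geocoder instances, e.g.
    [Nominatim(user_agent="my-app"), OpenCage(api_key="...")].
    Returns (geocoder_name, location) or (None, None) if all fail.
    """
    for geocoder in geocoders:
        try:
            location = geocoder.geocode(address, timeout=10)
        except Exception:        # network errors, rate limits, bad responses
            continue
        if location is not None:
            return geocoder.__class__.__name__, location
    return None, None
```

Ordering the list cheapest-first (free tiers before paid APIs) keeps most of the 400 addresses off the metered services.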
Full disclosure: I worked for SmartyStreets.

Related

Is there a way to fix internationalization inconsistencies in Google Places API calls?

I'm using the Google Places API for my project and I found that different places will have different localized names in their address components.
For instance:
some places in Lisbon (Portugal) will have a locality name of Lisbon while others will have Lisboa.
some places in Barcelona will have an administrative_area_level_1 name of Catalonia, while others will have Catalunya.
My questions are:
Is there a way to get consistent results using a same reference language?
Is there a way to help Google fix this inconsistent behavior?
PS: my purpose is to be able to perform text-based search on Google Places API data, and these localization differences are not helping.
This may be working as intended, or an inconsistency in how some places have their address registered on Google Maps.
It is the intended behavior that the local language (Lisboa, Cataluña) is used for street-level addresses, while the user's preferred language is used for the other places (postal codes and political entities). Reverse geocoding while preferring a non-local language shows this:
https://maps.googleapis.com/maps/api/geocode/json?&latlng=38.716322,-9.149895&language=en
street_address: R. Cecílio de Sousa 84, 1200-009 Lisboa, Portugal
locality: Lisbon, Portugal
https://maps.googleapis.com/maps/api/geocode/json?&latlng=41.738528,1.851196&language=en
street_address: C-16C, 2, 08243 Manresa, Barcelona (Cataluña), Spain
postal_code: 08243 Manresa, Barcelona (Catalunya), Spain
(Cataluña/Catalunya/Catalonia is not usually in formatted_address)
However, there may be addresses that were registered without correctly linking to the appropriate political entities, e.g. using "Catalonia" as a hard-coded address component instead of as a reference to the administrative_area_level_1 itself. These would appear with inconsistent names, even for street-level addresses. Such cases should be rare, but please consider filing a bug report when you find one.
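For completeness, a small sketch of how the language parameter would be pinned on every reverse-geocoding call; reverse_geocode_url is an illustrative helper, and current versions of the API also require a key:

```python
from urllib.parse import urlencode

def reverse_geocode_url(lat, lng, language="en", api_key=None):
    """Build a Google reverse-geocoding URL that pins the response language.

    Passing the same `language` on every call is what keeps locality and
    admin-area names consistent (Lisbon vs. Lisboa) across places.
    """
    params = {"latlng": f"{lat},{lng}", "language": language}
    if api_key:
        params["key"] = api_key
    return "https://maps.googleapis.com/maps/api/geocode/json?" + urlencode(params)
```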

IP to world region

Is there an easy way to detect user region using IP address? I'd assume that each region will have a specific IPv4 range assigned.
I don't want to rely on any 3rd party service (except initial data import)
Precision does not need to be 100%, but should be reasonably high (at least ~80%)
I only want to guess users region (Europe, Asia, Africa, ...). No need for city/country.
Not sure what language you're using, but you can do it with GeoIP or similar modules. If you don't want any 3rd-party libraries at all, you can build a lookup function from the IANA allocation data at https://www.iana.org/numbers. Look especially under "IPv4 Address Space".
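A sketch of the IANA-based approach in Python. The RIR_BLOCKS table below is an illustrative four-entry excerpt; a real implementation would generate the full table from the "IPv4 Address Space" registry file:

```python
import ipaddress

# Illustrative excerpt only: a few /8 blocks and the regional internet
# registry (RIR) they are allocated to, per IANA's IPv4 address space
# registry. The RIR gives you the continent-level region for free.
RIR_BLOCKS = {
    "1.0.0.0/8": "APNIC",       # Asia/Pacific
    "2.0.0.0/8": "RIPE NCC",    # Europe/Middle East
    "41.0.0.0/8": "AFRINIC",    # Africa
    "177.0.0.0/8": "LACNIC",    # Latin America/Caribbean
}

def region_for_ip(ip):
    """Return the RIR for an IPv4 address, or None if it's not in the table."""
    addr = ipaddress.ip_address(ip)
    for block, rir in RIR_BLOCKS.items():
        if addr in ipaddress.ip_network(block):
            return rir
    return None
```

Since allocations are made in whole /8 (and smaller) blocks, precision well above the ~80% target is achievable once the full registry is loaded, with no runtime dependency on a 3rd-party service.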

Parsing addresses with ambiguous data

I have data of phone numbers and village names collected from the villagers via forms. Because of various reasons the data is inaccurate or incomplete.
The idea is to validate these two data points before adding them to the database/store.
The phone numbers are being formatted programmatically and validated via an external API. (That gives me the service provider and province information).
The problem is with the addresses.
No standardized address line. Tons of ambiguity.
Numeric street names and door numbers exist.
Input string will sometimes contain an addressee.
Possible solutions I can think of
Reverse geocoding helps, but it is not very accurate in the Indian context. The Google TOS also prohibits automated queries. (Correct me if I'm wrong here.)
Soundexing. Again, not very accurate with Indian data.
I understand it's difficult to parse such highly unstructured data, but I'm looking for ways to achieve at least enough accuracy to map addresses to the nearest point of interest.
Queries
Given a village name from a villager who might misspell or abbreviate it, how do I get the correct official name of the village and its location?
Any possible ways to sanitize bad location/addresses or decode complex/poorly formed addresses?
Are there any machine learning solutions that can help so I can learn from every computation?(I have 0 knowledge on ML, do correct me if I'm wrong here.)
What you want is a geolocation system that works with informal text input. I have previously used a text-based geolocation model trained on Twitter data.
To solve your problem, you need training data in the form of:
informal_text village_name
If you have access to such data (e.g. using the addresses which can be geolocated) then you can train a text-based classifier that given a new informal address can predict where on the map it points to. In your case every village becomes a class label. You can use scikit-learn to train the classifier.
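Alongside (or before) a trained classifier, a fuzzy match against the official village list already absorbs many misspellings. A stdlib-only baseline sketch, with made-up village names:

```python
import difflib

def canonical_village(raw_name, official_names, cutoff=0.6):
    """Map a possibly misspelled village name to its official spelling.

    Returns the closest official name, or None when nothing is close
    enough -- those rows can be routed to manual review instead.
    """
    matches = difflib.get_close_matches(
        raw_name.strip().title(), official_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Every record this baseline resolves also becomes labeled training data (informal_text, village_name) for the scikit-learn classifier described above.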

SSIS Split String address

I have a column which is made up of addresses as shown below.
Address
1 Reid Street, Manchester, M1 2DF
12 Borough Road, London, E12,2FH
15 Jones Street, Newcastle, Tyne & Wear, NE1 3DN
etc .. etc....
I want to split this into different columns to import into my SQL database. I have been trying to use FINDSTRING to separate on the comma but am having trouble when some addresses have more "sections" than others. Any ideas what's the best way to go about this?
Many thanks
This is a requirements specification problem, not an implementation problem. The more you can afford to assume about the format of the addresses, the more detailed parsing you will be able to do; the other side of the same coin is that the less you will assume about the structure of the address, the fewer incorrect parses you will be blamed for.
It is crucial to determine whether you will only need to process UK postal addresses, or whether worldwide addresses may occur.
Based on your examples, certain parts of the address seem to be always present, but please check this resource to determine whether they are really required in all UK postal addresses.
If you find a match between the depth of parsing that you need, and the assumptions that you can safely make, you should be able to keep parsing by comma indexes (FINDSTRING); determine some components starting from the left, and some starting from the right of the string; and keep all that remains as an unparsed body.
It may also well happen that you will find that your current task is a mission impossible, especially in connection with international postal addresses. This is why most websites and other data collectors require the entry of postal address in an already parsed form by the user.
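That outside-in strategy can be sketched as follows (Python here for brevity; the same logic ports to an SSIS Script Transformation, and the postcode regex is a loose illustrative pattern, not full validation):

```python
import re

# Loose UK postcode shape (outward code + inward code); good enough for
# routing parts of the string, not for validating real postcodes.
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$", re.I)

def split_uk_address(address):
    """Parse from the outside in: street from the left, postcode from the
    right, and keep whatever remains in the middle unparsed."""
    parts = [p.strip() for p in address.split(",") if p.strip()]
    street, rest = parts[0], parts[1:]
    postcode = None
    if rest and POSTCODE_RE.match(rest[-1]):
        postcode = rest.pop().upper()
    elif len(rest) >= 2 and POSTCODE_RE.match(rest[-2] + " " + rest[-1]):
        # Handles inputs like "E12,2FH" where the postcode itself was
        # split by a stray comma instead of a space.
        postcode = (rest[-2] + " " + rest[-1]).upper()
        rest = rest[:-2]
    return {"street": street, "middle": rest, "postcode": postcode}
```

Everything left in "middle" (town, county, locality) stays unparsed, per the "keep all that remains as an unparsed body" advice above.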
Excellent points raised by Hanika. Some of your parsing will depend on what your target destination looks like. As an ignorant yank, based on Hanika's link, I'd think your output would look something like
Addressee
Organisation
BuildingName
BuildingAddress
Locality
PostTown
Postcode
BasicsMet (boolean indicating whether minimum criteria for a good address has been met.)
In the US, just because an address could not be properly CASSed doesn't mean it couldn't be delivered. Case in point: my grandparents-in-law live in a small enough town that specifying their name and city is sufficient for delivery, as the local postal officials know who they are. For bulk mailings, though, their address would not qualify for the bulk mailing rate and would default to first-class mailing. I assume a similar scenario exists for UK mail.
The general idea is for each row flowing through, you'll want to do your best to parse the data out into those buckets. The optimal solution for getting it "right" is to change the data entry method to validate and capture data into those discrete buckets. Since optimal never happens, it becomes your task to sort through the dross to find your gold.
Whilst you can write some fantastic expressions with FINDSTRING, I'd advise against it in this case as maintenance alone will drive you mad. Instead, add a Script Transformation and build your parsing logic in .NET (vb or c#). There will then be a cycle of running data through your transformation and having someone eyeball the results. If you find a new scenario, you go back and adjust your business rules. It's ugly, it's iterative and it's prone to producing results that a human wouldn't have.
Alternatives to rolling your own address standardisation logic
buy it. Eventually your business needs outpace your ability to cope with constantly changing business rules. There are plenty of vendors out there but I'm only familiar with US based ones
upgrade to SQL Server 2012 to use DQS (Data Quality Services). You'll probably still need to buy a product to build out your knowledge base but you could offload the business rule making task to a domain expert ("Hey you, you make peanuts an hour. Make sure all the addresses coming out of this look like addresses" was how they covered this in the beginning of one of my jobs).

Smart ways to determine if a search request is an address or a business for use in Google APIS?

Here is the problem, I have an app with a search bar, the user can input something like "18th Street" or "Starbucks" and it uses the Google Geocoding and Local Search APIs respectively to get results.
I'm wondering is there a smart way to determine whether or not a given input string is an address that needs to be Geocoded, or a business name that needs to use Local Search.
Obviously I could try and handroll something, but I'm wondering if someone has already done this or Google provides such functionality themselves.
Cheers.
The first thing that comes to mind would be a regular expression that looks for a street address, but the important question is how your system would qualify an address.
It's reasonable enough to match something that is going to be fairly consistent in format like a fully qualified street address, but when it's something like "18th Street" how do you know they don't actually want a restaurant called "18th Street"? What you might consider is a regular expression that loosely attempts to match a street address and, if it finds one, call the Geocoding. In the event no results are returned by Geocoding, then default to a Local Search.
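A sketch of such a loose matcher; the street-word list and patterns are illustrative, not exhaustive:

```python
import re

# Loose heuristic: "starts with a house number" or "ends in a street-type
# word" -> try Geocoding first; otherwise go straight to Local Search.
STREET_WORDS = r"(street|st|avenue|ave|road|rd|boulevard|blvd|drive|dr|lane|ln)"
ADDRESS_RE = re.compile(r"^\d+\s+\S|\b" + STREET_WORDS + r"\b\.?$", re.I)

def looks_like_address(query):
    """True when the query loosely resembles a street address."""
    return bool(ADDRESS_RE.search(query))
```

Per the answer above, a True here only decides which API to try first; an empty Geocoding result should still fall through to Local Search, which is how "18th Street" the restaurant would eventually be found.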
It turns out Local Search does this by default and processes both geocodes and business searches. There is some coarseness to it, but I guess that is to be expected.
You can change this behaviour by specifying the mrt parameter:
mrt - This optional argument specifies which type of listing the user is interested in. Valid values include:
* blended - request KML, Local Business Listings, and Geocode results
* kmlonly - request KML and Geocode results
* localonly - request Local Business Listings and Geocode results
