suggestcompletion endpoint doesn't respect locale parameter - foursquare
I'm working with the venues/suggestcompletion endpoint, and I've noticed that it does not respect the locale parameter when returning locations. It seems to translate category names though.
Example:
https://api.foursquare.com/v2/venues/suggestcompletion?client_id=CLIENT_ID&client_secret=CLIENT_SECRET&locale=en&v=20141204&m=foursquare&ll=35.6894875,139.69170639999993&limit=5&query=jiro
Returns:
{"meta":{"code":200},"response":{"minivenues":[{"id":"4b0587a0f964a5205e9d22e3","name":"Live Music JIROKICHI","location":{"address":"高円寺北2-3-4","crossStreet":"高円寺ビル B1F","city":"杉並区","state":"東京都","postalCode":"166-0002","country":"JP","lat":35.70566356501932,"lng":139.65131118893623,"distance":4071},"categories":[{"id":"4bf58dd8d48988d1e7931735","name":"Jazz Club","pluralName":"Jazz Clubs","shortName":"Jazz Club","icon":{"prefix":"https:\/\/ss3.4sqi.net\/img\/categories_v2\/arts_entertainment\/musicvenue_jazzclub_","suffix":".png"}},{"id":"4d4b7104d754a06370d81259","name":"Arts & Entertainment","pluralName":"Arts & Entertainment","shortName":"Arts & Entertainment","icon":{"prefix":"https:\/\/ss3.4sqi.net\/img\/categories_v2\/arts_entertainment\/default_","suffix":".png"}},{"id":"4bf58dd8d48988d1e5931735","name":"Music Venue","pluralName":"Music Venues","shortName":"Music Venue","icon":{"prefix":"https:\/\/ss3.4sqi.net\/img\/categories_v2\/arts_entertainment\/musicvenue_","suffix":".png"}},{"id":"4bf58dd8d48988d1e9931735","name":"Rock Club","pluralName":"Rock Clubs","shortName":"Rock Club","icon":{"prefix":"https:\/\/ss3.4sqi.net\/img\/categories_v2\/arts_entertainment\/musicvenue_rockclub_","suffix":".png"}}]},{"id":"4df88a6f1495b7db1b0ec39b","name":"とんこつ大盛野菜ラーメン JIRORAMO","location":{"address":"相武台1-4507","crossStreet":"相武台ビル1F","city":"座間市","state":"神奈川県","postalCode":"","country":"JP","lat":35.49931385465121,"lng":139.40754349524184,"distance":33313},"categories":[{"id":"4bf58dd8d48988d1d1941735","name":"Ramen \/ Noodle House","pluralName":"Ramen \/ Noodle House","shortName":"Ramen \/ Noodles","icon":{"prefix":"https:\/\/ss3.4sqi.net\/img\/categories_v2\/food\/ramen_","suffix":".png"}},{"id":"4d4b7105d754a06374d81259","name":"Food","pluralName":"Food","shortName":"Food","icon":{"prefix":"https:\/\/ss3.4sqi.net\/img\/categories_v2\/food\/default_","suffix":".png"}}]},{"id":"4df982cad4c064db03a97ae1","name":"Betterabe Bistro Jiro","location":{"address":"富ヶ谷1-2-8","crossStreet":"","city":"渋谷区","state":"東京都","postalCode":"","country":"JP","lat":35.668034699164494,"lng":139.69092056158243,"distance":2389},"categories":[{"id":"4bf58dd8d48988d110941735","name":"Italian Restaurant","pluralName":"Italian Restaurants","shortName":"Italian","icon":{"prefix":"https:\/\/ss3.4sqi.net\/img\/categories_v2\/food\/italian_","suffix":".png"}},{"id":"4d4b7105d754a06374d81259","name":"Food","pluralName":"Food","shortName":"Food","icon":{"prefix":"https:\/\/ss3.4sqi.net\/img\/categories_v2\/food\/default_","suffix":".png"}},{"id":"4bf58dd8d48988d10c941735","name":"French Restaurant","pluralName":"French Restaurants","shortName":"French","icon":{"prefix":"https:\/\/ss3.4sqi.net\/img\/categories_v2\/food\/french_","suffix":".png"}}]},{"id":"4b76c40ff964a520975d2ee3","name":"Jiro's Bar","location":{"address":"両国3-26-6","crossStreet":"駅前ストリート","city":"東京","state":"","postalCode":"","country":"JP","lat":35.695693,"lng":139.793755,"distance":9251},"categories":[{"id":"4bf58dd8d48988d116941735","name":"Bar","pluralName":"Bars","shortName":"Bar","icon":{"prefix":"https:\/\/ss3.4sqi.net\/img\/categories_v2\/nightlife\/pub_","suffix":".png"}},{"id":"4d4b7105d754a06376d81259","name":"Nightlife Spot","pluralName":"Nightlife Spots","shortName":"Nightlife","icon":{"prefix":"https:\/\/ss3.4sqi.net\/img\/categories_v2\/nightlife\/default_","suffix":".png"}}]},{"id":"530b393b498eedd60dc1375b","name":"BAR JIROISE","location":{"address":"","crossStreet":"","city":"","state":"","postalCode":"","country":"JP","lat":35.658016204833984,"lng":139.69776916503906,"distance":3545},"categories":[{"id":"4bf58dd8d48988d116941735","name":"Bar","pluralName":"Bars","shortName":"Bar","icon":{"prefix":"https:\/\/ss3.4sqi.net\/img\/categories_v2\/nightlife\/pub_","suffix":".png"}},{"id":"4d4b7105d754a06376d81259","name":"Nightlife Spot","pluralName":"Nightlife Spots","shortName":"Nightlife","icon":{"prefix":"https:\/\/ss3.4sqi.net\/img\/categories_v2\/nightlife\/default_","suffix":".png"}}]}]}}
Notice that the city names are in Japanese, yet I did specify en as the locale.
The venues/search endpoint returns city names in English, as expected.
Is this by design or a bug?
Related
How to correct UndefinedVariableError in python?
When I implement code for one paricular value of state name (see Last_ residence in code) andhrapradesh.query('Duration_of_residence=="All durations of residence" & Last_residence_R_or_U=="Urban" & Last_residence=="Jammu & Kashmir"',inplace=True) print(andhrapradesh['Total_migrants'].sum()) It gives desired sum of outflow value for that state from pandas csv. But when I tried to calculate for all possible names of state it's giving me error "UndefinedVariableError: name 'Jammu & Kashmir' is not defined" states = ["Jammu & Kashmir","Punjab",'Himachal Pradesh'] for name in states: andhrapradesh.query(f'Duration_of_residence=="All durations of residence" & Last_residence_R_or_U=="Urban" & Last_residence=={name}',inplace=True) print(andhrapradesh['Total_migrants'].sum()) can you please figure out why it showing error and how can I do it for all values in list states.
You need quotes around name like you had in your manual code: df.query(f"... Last_residence == \"{name}\"") so that it is seen as, e.g., "Punjab" as a string to check against but not bare Punjab which is sought specially, e.g., as a column name, hence the error. (can use single quotes too...) However, there's a better way: you can prefix a variable with # symbol, so you don't even need the f-string: df.query("... Last_residence == #name")
excel vba - remove certain type of text in article
Excel vba I am trying to remove certain unwanted text ( from JSON ) from a article (basically different news article) through Excel VBA [{"body2":""},{"body2":" [{"body2":""} "} ,{"body2":" \n \t "}] E.g of article that needs to be edited First example [{"body2":"AA Engages the Community Despite Training and Operations\n\n"},{"body2":"We thank Mr tin Tan for his suggestions in the forum letter titled \"Give time off for SDE to volunteer\", dated 25 Jan 2014."},{"body2":"hhh serves the critical need of our way of life. As such, training and operations is the main priority. Nevertheless, the AA is actively involved in supporting our community. Our units organise community engagement projects regularly, such as visiting homes to help and encourage fellow people, and engaging students through the Schools Partnership Programme. These projects allow our people to participate in community service whilst fulfilling theirservice commitments. In 2013, our soldiers supported more than 50 homes and engaged more than 30 schools.\n"},{"body2":"G.Huang\nHead\nJoint Manpower Department"},{"body2":"\nSDE's reply was published as \"AA Engages the Community Despite Training and Operations\", The NEWs Times, 28 Jan 14"}] Second example [{"body2":""},{"body2":"Local universities offer SDE options for early entry"},{"body2":"AA would like to respond to issues raised from letters to the News Times on enabling full-time staff to enter universities earlier."},{"body2":"AA is mindful that Service requires staff to delay their personal pursuits, and will not make the duration of full-time service longer than necessary. The current two years of full-time is necessary to train our staff to the required level of proficiency so that they can be part of the standing force to meet operational needs."},{"body2":"subsequently, completes his full-time service."},{"body2":"AA grants disruptions to stuffs to allow them to commence their university studies at the same time as This ensures that they are no better off or worse off than their peers in the same cohort, who have also served NS."},{"body2":"In terms of entry to local universities, there are more options now."},{"body2":"In addition, the University of Technology and Design has a unique academic calendar that starts in May, ahead of other universities, and its eight-semester course can be completed in 3.5 year."},{"body2":"Tm\nDirector Manpower\ndepert\n"},{"body2":" "},{"body2":"SDE was published as \"Local varsities offer staff options for early entry: AA\", The News Times, 12 Jun 12"},{"body2":""}] What i have done so far Public Function RemoveN(RemN As String) As String Dim PosN As Integer ' remove any \n PosN = InStr(1, RemN, "\n") Do While PosN > 0 RemN = Replace(RemN, "\n", "") PosN = InStr(1, RemN, "\n") Loop RemoveN = RemN End Function Public Function RemoveT(RemT As String) As String Dim Post As Integer ' remove any \n Post = InStr(1, RemT, "\t") Do While Post > 0 RemT = Replace(RemT, "\t", "") Post = InStr(1, RemT, "\t") Loop RemoveT = RemT End Function Public Function RemoveSlash(Rems As String) As String Dim Poss As Integer ' remove any \n Poss = InStr(1, Rems, "\") Do While Poss > 0 Rems = Replace(Rems, "\", "") Poss = InStr(1, Rems, "\") Loop RemoveSlash = Rems End Function
This routine using regular expressions appears to clean up your article as you state. In this example body2 can be any word that does not include the colon. But if you specifically want to look for body2, replace the [^:] section with body2 Option Explicit Function CleanUp(S As String) As String Dim RE As Object Const sPat As String = "\[?\{[^:]+:""|""}]?|\\n|\\""|,(?=\{)" Set RE = CreateObject("vbscript.regexp") With RE .Global = True .Pattern = sPat 'Replace pattern with a space CleanUp = WorksheetFunction.Trim(.Replace(S, " ")) 'cleanup spaces before commas .Pattern = "\s+," CleanUp = .Replace(CleanUp, ",") End With End Function Explanation of Regex \[?\{[^:]+:"|"}]?|\\n|\\"|,(?=\{) Options: Case insensitive; ^$ don’t match at line breaks Match this alternative \[?\{[^:]+:" [Match the character “” literally \[? Between zero and one times, as many times as possible, giving back as needed (greedy) ? Match the character “{” literally \{ Match any character that is NOT the colon character [^:]+ Between one and unlimited times, as many times as possible, giving back as needed (greedy) + Match the character string “:"” literally :" Or match this alternative "}]? Match the character string “"}” literally "} [Match the character “]” literally]4 ]? Between zero and one times, as many times as possible, giving back as needed (greedy) ? Or match this alternative \\n Match the backslash character \\ Match the character “n” literally n Or match this alternative \\" Match the backslash character \\ Match the character “"” literally " Or match this alternative ,(?=\{) Match the character “,” literally , Assert that the regex below can be matched starting at this position (positive lookahead) (?=\{) Match the character “{” literally \{ Created with RegexBuddy
NumberFormat.parse() does not work for FRANCE Locale space as thousand seperator
I have written below Java code to see how locales behave with numbers. I am facing with FRENCH style. double n = 123456789.123; System.out.println("US "+ NumberFormat.getNumberInstance(Locale.US).format(n)); //###,###.### System.out.println("FRENCH "+ NumberFormat.getNumberInstance(Locale.FRENCH).format(n)); // # ###,## System.out.println("GERMAN "+ NumberFormat.getNumberInstance(Locale.GERMAN).format(n)); // ###.###,## System.out.println(NumberFormat.getNumberInstance(Locale.US).parse("123,451.23")); System.out.println(NumberFormat.getNumberInstance(Locale.GERMANY).parse("123.451,23")); System.out.println(NumberFormat.getNumberInstance(Locale.FRANCE).parse("123 451,23")); OUTPUT US 123,456,789.123 FRENCH 123 456 789,123 GERMAN 123.456.789,123 123451.23 123451.23 123 As you can see space is used as thousands separator for FRENCH locale. But when I tried to generate number "123 451,23" it does not recognize space as thousands separator. Is this the expected behavior ? EDIT: As a workaround I replaced space with ".". So number becomes a GERMANY format. And then convert it using that locale. input = input.replace(" ", "."); // Now "123 451,23" is "123.451,23" So which is same as german System.out.println(NumberFormat.getNumberInstance(Locale.GERMANY).parse(input)); OUTPUT 123451.23
This is a known issue in old JDKs. Upgrade it or you will this issue
Tell if specific char in string is a long char or a short char
Be prepared, this is one of those hard questions. In Farsi or Persian language ی which sounds like y or i and is written in 4 different shapes according to it's place in word. I'll call ی as YA from now for simplification. take a look at this image All YA characters are painted in red, in the first word YA is attached to it's previous (right , in Farsi we right from RIGHT to LEFT) character and is free at the end whereas the last YA (3rd word, left-most red char) is free both from left or right. Having said this long story, I want to find out if a part of a string ends with long YA (YA without points) or short YA (YA with two points beneath it). i.e تحصیلداری (the 3rd word) ends with long YA but تحصیـ which is a part of 3rd word does not ends with short YA. Question: How can I say تحصیلداری ends whit which unicode? I just have a simple string, "تحصیلداری", how can I convert its characters to unicode? I tried the unicodes string unicodes = ""; foreach (char c in "تحصیلداری") { unicodes += c+" "+((int)c).ToString() + Environment.NewLine; } MessageBox.Show(unicodes); result : but at the end of the day unfortunately all YAs have the same unicode. Bad news : YA was an example, a real one though. There are also a dozen of other characters like YA with different appearances too. Additional info : using this useful link about unicodes I found unicode of different YAs
We solved similar problem the way bellow: We had a core banking application, the customer sub-system needed a full text search on customers name, family, father name etc. Different encoding, legacy migrated data, keyboard layouts and Farsi fonts ... made search process inaccurate. We overcame the problem by replacing problematic characters with some standard one and saving the standard string for search purpose. After several iterations, the replacement is as bellow that may come in handy: Formula="UPPER(REPLACE(REPLACE(REPLACE (REPLACE(REPLACE(REPLACE(REPLACE(REPLACE (REPLACE(REPLACE(REPLACE(REPLACE(REPLACE (REPLACE(REPLACE(REPLACE(REPLACE(REPLACE (REPLACE(REPLACE(REPLACE(REPLACE (REPLACE(FirsName || LastName || FatherName, chr(32),''), chr(13),''), chr(9),''), chr(10),''), '-',''), '-',''), 'آ','ا'), 'أ', 'ا'), 'ئ', 'ي'), 'ي', 'ي'), 'ك', 'ک'), 'آإئؤةي','اايوهي'), 'ء',''), 'شأل','شاال'), 'ا.','اله'), '.',''), 'الله','اله'), 'ؤ','و'), 'إ','ا'), 'ة','ه'), ' ا لله','اله'), 'ا لله','اله'), ' ا لله','اله'))"
Despite there are different YEHs in Unicode, it must noticed that all presentation forms of YEHs are same Unicode character with code 0x06cc. You can not determine presentation forms by their Unicode code. But you can reach your goal be checking to see what characters is before or after YEH. You can also use Fardis to see Unicode codes of strings.
Parse usable Street Address, City, State, Zip from a string [closed]
Closed. This question needs to be more focused. It is not currently accepting answers. Want to improve this question? Update the question so it focuses on one problem only by editing this post. Closed 8 years ago. Improve this question Problem: I have an address field from an Access database which has been converted to SQL Server 2005. This field has everything all in one field. I need to parse out the address's individual sections into their appropriate fields in a normalized table. I need to do this for approximately 4,000 records, and it needs to be repeatable. Assumptions: Assume an address in the US (for now) assume that the input string will sometimes contain an addressee (the person being addressed) and/or a second street address (i.e. Suite B) states may be abbreviated zip code could be standard 5 digits or zip+4 there are typos in some instances UPDATE: In response to the questions posed, standards were not universally followed; I need need to store the individual values, not just geocode and errors means typo (corrected above) Sample Data: A. P. Croll & Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947 11522 Shawnee Road, Greenwood DE 19950 144 Kings Highway, S.W. Dover, DE 19901 Intergrated Const. Services 2 Penns Way Suite 405 New Castle, DE 19720 Humes Realty 33 Bridle Ridge Court, Lewes, DE 19958 Nichols Excavation 2742 Pulaski Hwy Newark, DE 19711 2284 Bryn Zion Road, Smyrna, DE 19904 VEI Dover Crossroads, LLC 1500 Serpentine Road, Suite 100 Baltimore MD 21 580 North Dupont Highway Dover, DE 19901 P.O. Box 778 Dover, DE 19903
I've done a lot of work on this kind of parsing. Because there are errors you won't get 100% accuracy, but there are a few things you can do to get most of the way there, and then do a visual BS test. Here's the general way to go about it. It's not code, because it's pretty academic to write it, there's no weirdness, just lots of string handling. (Now that you've posted some sample data, I've made some minor changes) Work backward. Start from the zip code, which will be near the end, and in one of two known formats: XXXXX or XXXXX-XXXX. If this doesn't appear, you can assume you're in the city, state portion, below. The next thing, before the zip, is going to be the state, and it'll be either in a two-letter format, or as words. You know what these will be, too -- there's only 50 of them. Also, you could soundex the words to help compensate for spelling errors. before that is the city, and it's probably on the same line as the state. You could use a zip-code database to check the city and state based on the zip, or at least use it as a BS detector. The street address will generally be one or two lines. The second line will generally be the suite number if there is one, but it could also be a PO box. It's going to be near-impossible to detect a name on the first or second line, though if it's not prefixed with a number (or if it's prefixed with an "attn:" or "attention to:" it could give you a hint as to whether it's a name or an address line. I hope this helps somewhat.
I think outsourcing the problem is the best bet: send it to the Google (or Yahoo) geocoder. The geocoder returns not only the lat/long (which aren't of interest here), but also a rich parsing of the address, with fields filled in that you didn't send (including ZIP+4 and county). For example, parsing "1600 Amphitheatre Parkway, Mountain View, CA" yields { "name": "1600 Amphitheatre Parkway, Mountain View, CA, USA", "Status": { "code": 200, "request": "geocode" }, "Placemark": [ { "address": "1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA", "AddressDetails": { "Country": { "CountryNameCode": "US", "AdministrativeArea": { "AdministrativeAreaName": "CA", "SubAdministrativeArea": { "SubAdministrativeAreaName": "Santa Clara", "Locality": { "LocalityName": "Mountain View", "Thoroughfare": { "ThoroughfareName": "1600 Amphitheatre Pkwy" }, "PostalCode": { "PostalCodeNumber": "94043" } } } } }, "Accuracy": 8 }, "Point": { "coordinates": [-122.083739, 37.423021, 0] } } ] } Now that's parseable!
The original poster has likely long moved on, but I took a stab at porting the Perl Geo::StreetAddress:US module used by geocoder.us to C#, dumped it on CodePlex, and think that people stumbling across this question in the future may find it useful: US Address Parser On the project's home page, I try to talk about its (very real) limitations. Since it is not backed by the USPS database of valid street addresses, parsing can be ambiguous and it can't confirm nor deny the validity of a given address. It can just try to pull data out from the string. It's meant for the case when you need to get a set of data mostly in the right fields, or want to provide a shortcut to data entry (letting users paste an address into a textbox rather than tabbing among multiple fields). It is not meant for verifying the deliverability of an address. It doesn't attempt to parse out anything above the street line, but one could probably diddle with the regex to get something reasonably close--I'd probably just break it off at the house number.
I've done this in the past. Either do it manually, (build a nice gui that helps the user do it quickly) or have it automated and check against a recent address database (you have to buy that) and manually handle errors. Manual handling will take about 10 seconds each, meaning you can do 3600/10 = 360 per hour, so 4000 should take you approximately 11-12 hours. This will give you a high rate of accuracy. For automation, you need a recent US address database, and tweak your rules against that. I suggest not going fancy on the regex (hard to maintain long-term, so many exceptions). Go for 90% match against the database, do the rest manually. Do get a copy of Postal Addressing Standards (USPS) at http://pe.usps.gov/cpim/ftp/pubs/Pub28/pub28.pdf and notice it is 130+ pages long. Regexes to implement that would be nuts. For international addresses, all bets are off. US-based workers would not be able to validate. Alternatively, use a data service. I have, however, no recommendations. Furthermore: when you do send out the stuff in the mail (that's what it's for, right?) make sure you put "address correction requested" on the envelope (in the right place) and update the database. (We made a simple gui for the front desk person to do that; the person who actually sorts through the mail) Finally, when you have scrubbed data, look for duplicates.
After the advice here, I have devised the following function in VB which creates passable, although not always perfect (if a company name and a suite line are given, it combines the suite and city) usable data. Please feel free to comment/refactor/yell at me for breaking one of my own rules, etc.: Public Function parseAddress(ByVal input As String) As Collection input = input.Replace(",", "") input = input.Replace(" ", " ") Dim splitString() As String = Split(input) Dim streetMarker() As String = New String() {"street", "st", "st.", "avenue", "ave", "ave.", "blvd", "blvd.", "highway", "hwy", "hwy.", "box", "road", "rd", "rd.", "lane", "ln", "ln.", "circle", "circ", "circ.", "court", "ct", "ct."} Dim address1 As String Dim address2 As String = "" Dim city As String Dim state As String Dim zip As String Dim streetMarkerIndex As Integer zip = splitString(splitString.Length - 1).ToString() state = splitString(splitString.Length - 2).ToString() streetMarkerIndex = getLastIndexOf(splitString, streetMarker) + 1 Dim sb As New StringBuilder For counter As Integer = streetMarkerIndex To splitString.Length - 3 sb.Append(splitString(counter) + " ") Next counter city = RTrim(sb.ToString()) Dim addressIndex As Integer = 0 For counter As Integer = 0 To streetMarkerIndex If IsNumeric(splitString(counter)) _ Or splitString(counter).ToString.ToLower = "po" _ Or splitString(counter).ToString().ToLower().Replace(".", "") = "po" Then addressIndex = counter Exit For End If Next counter sb = New StringBuilder For counter As Integer = addressIndex To streetMarkerIndex - 1 sb.Append(splitString(counter) + " ") Next counter address1 = RTrim(sb.ToString()) sb = New StringBuilder If addressIndex = 0 Then If splitString(splitString.Length - 2).ToString() <> splitString(streetMarkerIndex + 1) Then For counter As Integer = streetMarkerIndex To splitString.Length - 2 sb.Append(splitString(counter) + " ") Next counter End If Else For counter As Integer = 0 To addressIndex - 1 sb.Append(splitString(counter) + " ") Next counter End If address2 = RTrim(sb.ToString()) Dim output As New Collection output.Add(address1, "Address1") output.Add(address2, "Address2") output.Add(city, "City") output.Add(state, "State") output.Add(zip, "Zip") Return output End Function Private Function getLastIndexOf(ByVal sArray As String(), ByVal checkArray As String()) As Integer Dim sourceIndex As Integer = 0 Dim outputIndex As Integer = 0 For Each item As String In checkArray For Each source As String In sArray If source.ToLower = item.ToLower Then outputIndex = sourceIndex If item.ToLower = "box" Then outputIndex = outputIndex + 1 End If End If sourceIndex = sourceIndex + 1 Next sourceIndex = 0 Next Return outputIndex End Function Passing the parseAddress function "A. P. Croll & Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947" returns: 2299 Lewes-Georgetown Hwy A. P. Croll & Son Georgetown DE 19947
I've been working in the address processing domain for about 5 years now, and there really is no silver bullet. The correct solution is going to depend on the value of the data. If it's not very valuable, throw it through a parser as the other answers suggest. If it's even somewhat valuable you'll definitely need to have a human evaluate/correct all the results of the parser. If you're looking for a fully automated, repeatable solution, you probably want to talk to a address correction vendor like Group1 or Trillium.
SmartyStreets has a new feature that extracts addresses from arbitrary input strings. (Note: I don't work at SmartyStreets.) It successfully extracted all addresses from the sample input given in the question above. (By the way, only 9 of those 10 addresses are valid.) Here's some of the output: And here's the CSV-formatted output of that same request: ID,Start,End,Segment,Verified,Candidate,Firm,FirstLine,SecondLine,LastLine,City,State,ZIPCode,County,DpvFootnotes,DeliveryPointBarcode,Active,Vacant,CMRA,MatchCode,Latitude,Longitude,Precision,RDI,RecordType,BuildingDefaultIndicator,CongressionalDistrict,Footnotes 1,32,79,"2299 Lewes-Georgetown Hwy, Georgetown, DE 19947",N,,,,,,,,,,,,,,,,,,,,,, 2,81,119,"11522 Shawnee Road, Greenwood DE 19950",Y,0,,11522 Shawnee Rd,,Greenwood DE 19950-5209,Greenwood,DE,19950,Sussex,AABB,199505209226,Y,N,N,Y,38.82865,-75.54907,Zip9,Residential,S,,AL,N# 3,121,160,"144 Kings Highway, S.W. Dover, DE 19901",Y,0,,144 Kings Hwy,,Dover DE 19901-7308,Dover,DE,19901,Kent,AABB,199017308444,Y,N,N,Y,39.16081,-75.52377,Zip9,Commercial,S,,AL,L# 4,190,232,"2 Penns Way Suite 405 New Castle, DE 19720",Y,0,,2 Penns Way Ste 405,,New Castle DE 19720-2407,New Castle,DE,19720,New Castle,AABB,197202407053,Y,N,N,Y,39.68332,-75.61043,Zip9,Commercial,H,,AL,N# 5,247,285,"33 Bridle Ridge Court, Lewes, DE 19958",Y,0,,33 Bridle Ridge Cir,,Lewes DE 19958-8961,Lewes,DE,19958,Sussex,AABB,199588961338,Y,N,N,Y,38.72749,-75.17055,Zip7,Residential,S,,AL,L# 6,306,339,"2742 Pulaski Hwy Newark, DE 19711",Y,0,,2742 Pulaski Hwy,,Newark DE 19702-3911,Newark,DE,19702,New Castle,AABB,197023911421,Y,N,N,Y,39.60328,-75.75869,Zip9,Commercial,S,,AL,A# 7,341,378,"2284 Bryn Zion Road, Smyrna, DE 19904",Y,0,,2284 Bryn Zion Rd,,Smyrna DE 19977-3895,Smyrna,DE,19977,Kent,AABB,199773895840,Y,N,N,Y,39.23937,-75.64065,Zip7,Residential,S,,AL,A#N# 8,406,450,"1500 Serpentine Road, Suite 100 Baltimore MD",Y,0,,1500 Serpentine Rd Ste 100,,Baltimore MD 21209-2034,Baltimore,MD,21209,Baltimore,AABB,212092034250,Y,N,N,Y,39.38194,-76.65856,Zip9,Commercial,H,,03,N# 9,455,495,"580 North Dupont Highway Dover, DE 19901",Y,0,,580 N DuPont Hwy,,Dover DE 19901-3961,Dover,DE,19901,Kent,AABB,199013961803,Y,N,N,Y,39.17576,-75.5241,Zip9,Commercial,S,,AL,N# 10,497,525,"P.O. Box 778 Dover, DE 19903",Y,0,,PO Box 778,,Dover DE 19903-0778,Dover,DE,19903,Kent,AABB,199030778781,Y,N,N,Y,39.20946,-75.57012,Zip5,Residential,P,,AL, I was the developer who originally wrote the service. The algorithm we implemented is a bit different from any specific answers here, but each extracted address is verified against the address lookup API, so you can be sure if it's valid or not. Each verified result is guaranteed, but we know the other results won't be perfect because, as has been made abundantly clear in this thread, addresses are unpredictable, even for humans sometimes.
This won't solve your problem, but if you only needed lat/long data for these addresses, the Google Maps API will parse non-formatted addresses pretty well. Good suggestion, alternatively you can execute a CURL request for each address to Google Maps and it will return the properly formatted address. From that, you can regex to your heart's content.
+1 on James A. Rosen's suggested solution as it has worked well for me, however for completists this site is a fascinating read and the best attempt I've seen in documenting addresses worldwide: http://www.columbia.edu/kermit/postal.html
Are there any standards at all in the way that the addresses are recorded? For example: Are there always commas or new-lines separating street1 from street2 from city from state from zip? Are address types (road, street, boulevard, etc) always spelled out? always abbreviated? Some of each? Define "error". My general answer is a series of Regular Expressions, though the complexity of this depends on the answer. And if there is no consistency at all, then you may only be able to achieve partial success with a Regex (ie: filtering out zip code and state) and will have to do the rest by hand (or at least go through the rest very carefully to make sure you spot the errors).
Another request for sample data. As has been mentioned I would work backwards from the zip. Once you have a zip I would query a zip database, store the results, and remove them & the zip from the string. That will leave you with the address mess. MOST (All?) addresses will start with a number so find the first occurrence of a number in the remaining string and grab everything from it to the (new) end of the string. That will be your address. Anything to the left of that number is likely an addressee. You should now have the City, State, & Zip stored in a table and possibly two strings, addressee and address. For the address, check for the existence of "Suite" or "Apt." etc. and split that into two values (address lines 1 & 2). For the addressee I would punt and grab the last word of that string as the last name and put the rest into the first name field. If you don't want to do that, you'll need to check for salutation (Mr., Ms., Dr., etc.) at the start and make some assumptions based on the number of spaces as to how the name is made up. I don't think there's any way you can parse with 100% accuracy.
Try www.address-parser.com. We use their web service, which you can test online
Based on the sample data: I would start at the end of the string. Parse a Zip-code (either format). Read end to first space. If no Zip Code was found Error. Trim the end then for spaces and special chars (commas) Then move on to State, again use the Space as the delimiter. Maybe use a lookup list to validate 2 letter state codes, and full state names. If no valid state found, error. Trim spaces and commas from the end again. City gets tricky, I would actually use a comma here, at the risk of getting too much data in the city. Look for the comma, or beginning of the line. If you still have chars left in the string, shove all of that into an address field. This isn't perfect, but it should be a pretty good starting point.
If it's human entered data, then you'll spend too much time trying to code around the exceptions. Try: Regular expression to extract the zip code Zip code lookup (via appropriate government DB) to get the correct address Get an intern to manually verify the new data matches the old
This won't solve your problem, but if you only needed lat/long data for these addresses, the Google Maps API will parse non-formatted addresses pretty well.
RecogniContact is a Windows COM object that parses US and European addresses. You can try it right on http://www.loquisoft.com/index.php?page=8
You might want to check this out!! http://jgeocoder.sourceforge.net/parser.html Worked like a charm for me.
This type of problem is hard to solve because of underlying ambiguities in the data. Here is a Perl based solution that defines a recursive descent grammar tree based on regular expressions to parse many valid combination of street addresses: http://search.cpan.org/~kimryan/Lingua-EN-AddressParse-1.20/lib/Lingua/EN/AddressParse.pm . This includes sub properties within an address such as: 12 1st Avenue N Suite # 2 Somewhere CA 12345 USA It is similar to http://search.cpan.org/~timb/Geo-StreetAddress-US-1.03/US.pm mentioned above, but also works for addresses that are not from the USA, such as the UK, Australia and Canada. Here is the output for one of your sample addresses. Note that the name section would need to be removed first from "A. P. Croll & Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947" to reduce it to "2299 Lewes-Georgetown Hwy, Georgetown, DE 19947". This is easily achieved by removing all data up to the first number found in the string. Non matching part '' Error '0' Error descriptions '' Case all '2299 Lewes-Georgetown Hwy Georgetown DE 19947' COMPONENTS '' country '' po_box_type '' post_box '' post_code '19947' pre_cursor '' property_identifier '2299' property_name '' road_box '' street 'Lewes-Georgetown' street_direction '' street_type 'Hwy' sub_property_identifier '' subcountry 'DE' suburb 'Georgetown'
Since there is chance of error in word, think about using SOUNDEX combined with LCS algorithm to compare strings, this will help a lot !
using google API $d=str_replace(" ", "+", $address_url); $completeurl ="http://maps.googleapis.com/maps/api/geocode/xml?address=".$d."&sensor=true"; $phpobject = simplexml_load_file($completeurl); print_r($phpobject);
For ruby or rails developers there is a nice gem available called street_address. I have been using this on one of my project and it does the work I need. The only Issue I had was whenever an address is in this format P. O. Box 1410 Durham, NC 27702 it returned nil and therefore I had to replace "P. O. Box" with '' and after this it were able to parse it.
There are data services that given a zip code will give you list of street names in that zip code. Use a regex to extract Zip or City State - find the correct one or if a error get both. pull the list of streets from a data source Correct the city and state, and then street address. Once you get a valid Address line 1, city, state, and zip you can then make assumptions on address line 2..3
I don't know HOW FEASIBLE this would be, but I haven't seen this mentioned so I thought I would go ahead and suggest this: If you are strictly in the US... get a huge database of all zip codes, states, cities and streets. Now look for these in your addresses. You can validate what you find by testing if, say, the city you found exists in the state you found, or by checking if the street you found exists in the city you found. If not, chances are John isn't for John's street, but is the name of the addressee... Basically, get the most information you can and check your addresses against it. An extreme example would be to get A LIST OF ALL THE ADDRESSES IN THE US OF A and then find which one has the most relevant match to each of your addresses...
There is javascript port of perl Geo::StreetAddress::US package: https://github.com/hassansin/parse-address . It's regex-based and works fairly well.