I have address data. The address data is separated by commas and I need to display the city corresponding to this address in the next column on the sheet. The city itself is indicated in the address. Tell me how to select this city from the address, provided that in some addresses it is indicated at the beginning, in some at the end. thanks a lot
There's a ton of variability when it comes to addresses. How you'd split them out depends on the country or countries involved, whether any of them are military addresses or P.O. boxes, and so on. It also depends on how your data is formatted.
If your address data has some amount of standardization, like consistent comma separation of address parts, you may just need to use something like =SubField([Address], ',', 2), given an address like 1234 Fake Street, Chicago, IL 60601. The SubField() function splits your string on a delimiter (a comma in this case) and returns the part you ask for. You can use that function in both the Data Load script and in chart expressions.
There are many formats that won't work that cleanly, though:
Address | =SubField(Address, ',', 2) | Correctly Parsed?
1234 Fake Street, Chicago, IL 60601 | Chicago | Yes
1234 Garbage ST., Nonsense, VT, USA | Nonsense | Yes
12 1ST ST NW, HAMPTON, IA 50441-1902 | HAMPTON | Yes
1010 CLEAR ST, OTTAWA ON K1A 0B1, CANADA | OTTAWA ON K1A 0B1 | No
4321 MAPLE ST, OAKTON MD 12345-6789 | OAKTON MD 12345-6789 | No
You can see several examples of address formats you may encounter on the USPS site here.
For more advanced parsing, you'll have to contend with the fact that Qlik doesn't have a Regex-like function that could otherwise prove to be useful in this case (unless you happen to have access to Qlik Web Connectors - see more here). You may be able to get clever with the Data Load script by using the SubField() function without the third parameter, like =SubField([Address], ','), using a comma or space as a delimiter (depends on your data) and then using some conditional logic, WHERE clauses, and aggregations to check for city-specific formatting from there.
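Outside Qlik, the same split-then-check idea can be sketched in a few lines of Python. This is only an illustration under a big assumption: that you have a reference list of city names to check against. The KNOWN_CITIES set and the find_city name are hypothetical and not part of any Qlik feature.

```python
# Hypothetical lookup set; in practice this would come from a reference table of cities.
KNOWN_CITIES = {"chicago", "hampton", "ottawa", "oakton"}


def find_city(address, known_cities=KNOWN_CITIES):
    """Return the first known city found in a comma-separated address, or None."""
    for part in address.split(","):
        words = part.strip().split()
        # Try every run of consecutive words, longest first, so that
        # "OTTAWA ON K1A 0B1" still yields "OTTAWA".
        for size in range(len(words), 0, -1):
            for start in range(len(words) - size + 1):
                candidate = " ".join(words[start:start + size])
                if candidate.lower() in known_cities:
                    return candidate
    return None
```

Because every comma-separated part is checked, it does not matter whether the city sits at the beginning or the end of the address. A street that happens to be named after a city would also match, so the result still needs a sanity check.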
Related
I have a list of data that I'm trying to analyse. It hasn't been tagged, so I am trying to set up a way of doing that. Let's say that my list is ice cream shops, the data is something like this:
Jack's Akron Ice Creamery
Gerry and Benn's Soft Serve
Macco's Strathfield Ice Creams and Treats
Auburn Ice Cream
Macco's Paddington Ice Creams and Treats
Patterson Ice Creamery
Cold Food Soft Serve of London
Jacks Cleveland Ice Creamery
Mrs Whipper Ice Creams Frankston
Mrs Whipper Frozen Treats Cranbourne
What I'd like to do is figure out which of these is the most prevalent in this data set (assuming I can regex and scrub away the punctuation differences).
From the above set, I would hope to surface:
Jack's
Macco's
Mrs Whipper
Soft Serve
Ice Creamery
Ice Creams and Treats
(note, this is a made up list, but I have a similarly random, but grouped list)
I have tried using the =MODE() and =INDEX() and =isnumber() functions with =sumifs() and others like these to no avail. I'd assume that the Google Sheets function of =SPLIT() would be of value, but the trouble is that I can't use Sheets due to our company policy.
Your last sentence makes me think you don’t have the newest version of Excel, in which case this answer is pointless, but perhaps it will be useful for someone else. I haven’t come across a great workaround for the inability to use COUNTIF in LET, so it’s a little messy, but it works.
=LET(x, TEXTJOIN(" ", 1, $A$2:$A$11),
y, FILTERXML("<t><s>"&SUBSTITUTE(x, " ", "</s><s>")&"</s></t>", "//s"),
z, UNIQUE(y),
p, CONCAT(y),
mycount, (LEN(p)-LEN(SUBSTITUTE(p,z,"")))/LEN(z),
mylist, SORT(IF(SEQUENCE(1,2)<2, z, mycount),2,-1),
IF(SEQUENCE(COUNTA(mylist),1)<=5, mylist, ""))
The steps are below:
Use LET to avoid helper columns and make the formula more readable.
Join all the cells into a string with TEXTJOIN and separate with a space.
Use FILTERXML to break up the string into individual cells at every instance of a space, and call this vector “y”.
Get the unique cells in “y” and call the new vector “z”.
Count all the matches of “z” in “y” by getting the difference in length of the whole string and the string minus instances of the word, divided by the length of the word (this is the workaround to COUNTIF)
Append these matches to the unique vector using SEQUENCE (thanks @P.b.!)
Sort by the number of matches, and finally elect to show only the top-5 words with the most matches
This gets it in one go, but from a practicality and speed standpoint, I would probably just use steps 2 and 3, paste the unique values, then use COUNTIF and delete the unneeded data.
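For anyone who can step outside Excel, the same join, split, and count steps can be sketched in Python. The count_phrases name and the size parameter (the number of consecutive words treated as one phrase) are my own additions for illustration. The formula above only counts single words.

```python
import re
from collections import Counter


def count_phrases(names, size=1):
    """Count every run of `size` consecutive words across all names."""
    counts = Counter()
    for name in names:
        # Lower-case and drop punctuation so "Jack's" and "Jacks" count together.
        words = re.sub(r"[^\w\s]", "", name.lower()).split()
        for start in range(len(words) - size + 1):
            counts[" ".join(words[start:start + size])] += 1
    return counts
```

Calling count_phrases(names).most_common(5) gives the top five single words, and size=2 or size=3 surfaces multi-word groups such as "soft serve". Unlike the length-difference trick, this counts whole words only, so a short word is never counted inside a longer one.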
There is an issue when one retrieves an address from a web service search: you get multiple results for the same actual place. For example, the "Reverse Geocoding API" by Google gives this example in its documentation:
"277 Bedford Avenue, Brooklyn, NY 11211, USA"
"Grand St/Bedford Av, Brooklyn, NY 11211, USA"
"Grand St/Bedford Av, Brooklyn, NY 11249, USA"
"Bedford Av/Grand St, Brooklyn, NY 11211, USA"
"Brooklyn, NY 11211, USA"
"Williamsburg, Brooklyn, NY, USA"
Suppose I need to choose only one, the most detailed one; a naive solution is to return the one with the most characters.
But before that, I want to verify that all the options actually describe the same place. The relevant CS topic is string metrics. How can I apply these algorithms to this task? Some reasons why most of the metrics are not applicable in this situation:
The order of the words is not the same.
Not all of the words necessarily appear, for example the descriptor "St." etc.
Thanks,
I would not simply compare strings here. Try analysing the address and identifying the components. For example, in
277 Bedford Avenue, Brooklyn, NY 11211, USA
You can see that:
Items separated by commas represent different entities, although items not separated might also be different concepts.
Earlier items represent smaller areas, later items are larger. You have a specific location on a street, the street, the city, the state, the country. The last item won't always be the country, but you can check it against a list of countries and only if it fails that consider other options. Similarly a list of state codes allows you to identify the NY.
A long sequence of digits close to the end is probably a zip code.
A short(ish) number (always watch out for suffixes like 'th' and 'st') at the beginning is probably a street number.
And so on in between. Then you have a semantic representation. It's safe to say that most addresses are written in this way. Forms asking you for your address generally have the same fields.
(Actually in the case of Google you don't even have to figure this out for yourself, they tell you what the components are. They also tell you what the most specific thing is.)
For the next one, similar things apply, but it's more complicated:
Grand St/Bedford Av, Brooklyn, NY 11211, USA
'Av' and 'St' need to be transformed into 'Avenue' and 'Street'. The meaning of the slash is not clear. We can treat it like a comma and consider "Grand St" and "Bedford Av" to be two different pieces of information. But from their position and the words "Street" and "Avenue", we know that they both represent the same kind of thing. So let's just say this place has two streets, and leave the exact meaning of that open. Perhaps it's a corner, perhaps the same street has two names.
Now when you compare the first two entities, you know that they have the same country, zip code, state, and city, so that's a good start but that's not very specific. The street of the first one is mentioned in the second one so that's good. The fact that the second one mentions an extra street is not really a problem. A problem would be two places with the streets (A, B) and (B, C). The street number is not there but that just means that the second location is less specific, so it's like the first is contained within the second.
You can safely conclude that the second, third, and fourth addresses are all the same. Only the zip code differs and that happens sometimes (zip codes are weird), there is too much that is the same elsewhere to dismiss a match. Also the zip codes are numerically close. If the country or state was different then they shouldn't match, but maybe create an alert so that a human is notified and can see if something is wrong. Also make sure that you have a proper dictionary normalising different names for the same place, e.g. NY == New York. For the fourth address, we know how to recognise it as having two streets, and we can disregard order (treat the streets as a set).
The fifth address is again just less information for smaller areas, so it contains the previous addresses. Note that if you only compare the third and fifth addresses they do not match. This shows that when you match the first two addresses you should 'merge' them and note that the two zip codes may be considered equivalent. Then later it will even be possible to say that "Brooklyn, NY 11211, USA" and "Brooklyn, NY 11249, USA" match.
The last address does not match any of the others. However this is only considering the plain string form. Google does actually mention Williamsburg for the first address.
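The component-based comparison described above could be sketched roughly as follows. This is a simplified illustration, not a full parser. The lookup tables are tiny placeholders, the function names are made up, and neighbourhoods such as Williamsburg are not modelled at all. Zip codes are parsed but deliberately left out of the comparison.

```python
import re

# Tiny illustrative lookup tables (assumptions); real data needs much larger ones.
COUNTRIES = {"USA"}
STATES = {"NY"}
STREET_WORDS = {"st": "street", "street": "street",
                "av": "avenue", "ave": "avenue", "avenue": "avenue"}


def parse_street(text):
    """Return (house_number, street_name), or None if text is not a street."""
    words = text.lower().replace(".", "").split()
    if not words or words[-1] not in STREET_WORDS:
        return None
    words[-1] = STREET_WORDS[words[-1]]          # 'Av' -> 'avenue'
    number = words.pop(0) if words[0].isdigit() else None
    return number, " ".join(words)


def parse(address):
    """Split a comma-separated address into labelled components."""
    parts = [p.strip() for p in address.split(",")]
    info = {"country": None, "state": None, "zip": None,
            "number": None, "streets": set(), "city": None}
    if parts[-1] in COUNTRIES:                   # largest area comes last
        info["country"] = parts.pop()
    match = re.fullmatch(r"([A-Z]{2})(?: (\d{5}))?", parts[-1]) if parts else None
    if match and match.group(1) in STATES:
        info["state"], info["zip"] = match.groups()
        parts.pop()
    for part in parts:
        streets = [parse_street(piece) for piece in part.split("/")]
        if all(streets):                         # every piece looks like a street
            for number, name in streets:
                info["number"] = info["number"] or number
                info["streets"].add(name)
        else:
            info["city"] = part                  # last non-street part wins
    return info


def same_place(a, b):
    """True when neither parsed address contradicts the other."""
    for key in ("country", "state", "city", "number"):
        if a[key] and b[key] and a[key] != b[key]:
            return False
    # Zip codes are not compared: they may differ for one place.
    # One street set must contain the other: (A, B) and (B, C) do not match.
    return a["streets"] <= b["streets"] or b["streets"] <= a["streets"]
```

Treating the streets as a set makes "Grand St/Bedford Av" and "Bedford Av/Grand St" identical, and a missing street number simply means there is nothing to contradict.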
I have 2002 addresses which have all been compiled into a single cell during the download process from my server; in most cases, the hash (#) symbol is used to separate fields (such as Line 1, Line 2, City, Postcode).
I have spent a lot of time trying combinations of LEFT, MID and other functions, but to no avail; the problem is that as there are so many addresses, and not all of them have the same number of characters for each field (such as Postcode - some will have 6 characters (including blank space), where some others will have five or more/fewer), there doesn't appear to be a one-size-fits-all solution that I can enter once and then use Excel's auto-fill handle/feature to complete the process for all records.
Here is a sample of my data (which has been anonymised):
44A THE ADDRESS#EALING#LONDON#W1 1WW#
541 PARSON PLACE#HENDON#LONDON#NW4 4WN#
SOMEBODY PRACTICE CHALKHILL PCC THE WELFORD CTR#11B CHALKHILL AVENUE#WIMBLEDONE MIDDX#HH9 9HH#
THE SEBELMONT MEDICAL CLINIC 18 EASTERN ROAD#SOUTHALL#MIDDLESEX#UN1 1NU#
130 FINGOVER COURT#REDBUS STREET#CAMBERWELL#SE5 5ES#
KING'S ELBOW MEDICAL CENTRE 17F STAGLAND LANE#KINGSBURY#MIDDX#NW9 9WN#
10 LADYFOOT ROAD RUISLIP#MIDDLESEX#HA4 4AH#
I want to be able to extract everything between the hash symbols (excluding/omitting the hash symbols themselves) and I am dedicating four columns to store this data: Address Line 1, AL2, AL3, Postcode.
Going by the first example (44A THE ADDRESS#EALING#LONDON#W1 1WW#) which resides in a single cell, I hope to achieve something like the following outcome:
AL1 AL2 AL3 POSTCODE
44A THE ADDRESS EALING LONDON W1 1WW
It doesn't matter if some of the address sections appear under the wrong column - I can very easily rectify this and can even add another column; I simply want to be able to extract the data from the single cell.
If you import the data as a text file, you can normally select the delimiter.
File->open
<select the file from the dialogue box>
The import dialogue box should appear. After clicking Next, you can select a hash as the delimiter. Instant self data sorting!
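If the data ever needs the same treatment outside Excel, splitting on the hash is a one-liner in most languages. Here is a minimal Python sketch. The function names are made up for illustration, and to_columns assumes the postcode is always the last field, as it is in the sample rows.

```python
def split_fields(record, delimiter="#"):
    """Split one record on the delimiter, dropping empty fields."""
    return [field.strip() for field in record.split(delimiter) if field.strip()]


def to_columns(record, width=4):
    """Pad the address lines so the postcode always lands in the last column."""
    fields = split_fields(record)
    if not fields:
        return [""] * width
    *lines, postcode = fields
    lines += [""] * (width - 1 - len(lines))     # no-op when already wide enough
    return lines + [postcode]
```

Records with five fields, like the CAMBERWELL row, simply come back with five columns, which matches the idea of adding another column when needed.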
I have a list of addresses which are individual strings in an Excel spreadsheet:
123 Sesame St New York, NY 00000
123 Sesame Ct Atlanta, GA 11111
100 Sesame Way, 400 Jacksonville, FL 22222
As you can see above the third address is different. It has a suite number of 400 on what would normally be the street line 2 line. I am having trouble coming up with a formula that will parse the addresses above into its individual cells: Street 1 (with street 2 or suite information in this line), City, State and Zip.
My thought is to start from the right and extract information based on a space delimiter, but I am not sure how to do this. How would I go about this?
I guess you can do a combination of MID & FIND to extract parts of the address,
e.g.
=IF(IFERROR(MID(A1,1,FIND(",",A1,FIND(",",A1)+1)),1)=1,MID(A1, 1, FIND(",",A1)-1),MID(A1,1,FIND(",",A1,FIND(",",A1)+1)-1))
will extract the address from cell A1, depending on the number of commas it finds (1 or > 1).
ZIP and state won't be too difficult following the above mentioned pattern. I think the problem is extracting the city as you don't know where to set the limit between the city name and the street unless you have a finite set of street types, e.g. ct, st, way etc.
You can use a slightly shorter formula using SUBSTITUTE() and LEN() in addition to FIND() and LEFT():
=LEFT(A1,FIND("#",SUBSTITUTE(A1,",","#",LEN(A1)-LEN(SUBSTITUTE(A1,",",""))))-1)
The first part which gets executed is:
LEN(A1)-LEN(SUBSTITUTE(A1,",",""))
Which basically calculates the number of commas in the input string. This then goes into the next formula:
SUBSTITUTE(A1,",","#",[1])
Which substitutes the last occurrence of comma by # (if addresses have this, use another character which you won't find in the address).
=LEFT(A1,FIND("#",[2])-1)
And the last part takes the characters up to the # we just inserted.
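For comparison, the "work from the right" idea looks like this in Python. It is a rough sketch with a made-up function name, assuming the last comma always precedes a two-letter state and a ZIP code. Like the formulas, it does not solve the city/street boundary.

```python
import re


def split_address(address):
    """Split 'street city, ST 12345' into (street_and_city, state, zip_code)."""
    # Everything before the last comma, like the LEFT/FIND/SUBSTITUTE formula.
    head, _, tail = address.rpartition(",")
    match = re.fullmatch(r"\s*([A-Z]{2})\s+(\d{5}(?:-\d{4})?)\s*", tail)
    if not head or not match:
        return None
    return head.strip(), match.group(1), match.group(2)
```

For the third sample address this returns "100 Sesame Way, 400 Jacksonville" as the first element, so the suite number stays on the street part.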
I need to write an algorithm that returns the closest match for a contact based on the name and address entered by the user. Both of these are troubling, since there are so many ways to enter a company name and address, for instance:
Company A, 123 Any Street Suite 200, Anytown, AK 99012
Comp. A, 123 Any St., Suite 200, Anytown, AK 99012
CA, 123 Any Street Ste 200, Anytown, AK 99012
I have looked at doing a Levenshtein distance on the Name, but that doesn't seem a great tool, since they could abbreviate the name. I am looking for something that matches on the most information possible.
My initial attempt was to limit the results first by the first 5 digits of the postal code and then try to filter down to one based on other information, but there must be a more standard approach to getting this done. I am working in .NET but will look at any code you can provide to get an idea on how to accomplish this.
I don't exactly know how this is accomplished, but all major delivery companies (FedEx, USPS, UPS) seem to have a way of matching an address you input against their database and transforming it to a normalized form. As I've seen this happen on multiple websites (Amazon comes to mind), I am assuming that there is an API for this functionality, but I don't know where to look for it or whether it is suitable for your purposes.
Just a thought though.
EDIT: I found the USPS API
I have solved this problem with a combination of address normalization, Metaphone, and Levenshtein distance. You will need to separate the name from the address since they have different characteristics. Here are the steps you need to do:
1) Narrow down your list of matches by using the (first six characters of the) zip code. Basically you will need to calculate the Levenshtein distance of the two strings and select the ones that have a distance of 1 or 2 at the most. You can potentially precompute a table of zip codes and their "Levenshtein neighbors" if you really need to speed up the search.
http://en.wikipedia.org/wiki/Levenshtein_distance
2) Convert all the address abbreviations to a standard format using the list of official prefix and suffix abbreviations from the USPS. This will help make sure your results for the next step are more uniform:
https://www.usps.com/send/official-abbreviations.htm
3) Convert the address to a short code using the Metaphone algorithm. This will get rid of most common spelling mistakes. Just make sure that your implementation can eliminate all non-word characters, pass numbers intact and handle multiple words (make sure each word is separated by a single space):
http://en.wikipedia.org/wiki/Metaphone
4) Once you have the Metaphone result of each address, compare the address strings using the Levenshtein distance. Calculate a percentage-of-change score by dividing the result by the number of characters in the longer string.
5) Repeat steps 3 and 4 but now use the names instead of the addresses.
6) Compute the score for each entry using this formula: (Weight for address * Address score) + (Weight for name * Name score). Pick your weights based on what is more important. I would start with .9 for the address (since the address is more specific) and .1 for the name, but the weights may depend on your application. Pick the entry with the lowest score. If the score is too high (say, over .15), you may declare that there are no matches.
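Steps 2, 4 and 6 can be sketched in Python as below. This is an illustration only: the Metaphone step is left out entirely, since it needs a separate implementation. The abbreviation table holds just a few sample entries rather than the USPS list, and all function names are made up.

```python
import re

# A few entries for illustration only; the USPS list is much longer.
ABBREVIATIONS = {"street": "st", "suite": "ste", "avenue": "ave"}


def normalise(text):
    """Lower-case, strip punctuation, and standardise common abbreviations."""
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def change_score(a, b):
    """Edit distance divided by the length of the longer string (0 = identical)."""
    a, b = normalise(a), normalise(b)
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def contact_score(query, candidate, address_weight=0.9, name_weight=0.1):
    """Weighted score for (name, address) pairs; lower means a closer match."""
    return (address_weight * change_score(query[1], candidate[1])
            + name_weight * change_score(query[0], candidate[0]))
```

With the suggested weights, an abbreviated name such as "Comp. A" barely moves the score as long as the normalised addresses agree, which is the point of weighting the address so heavily.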
I think filtering based on zip code first would be the easiest, as finding it is fairly unambiguous. From there you can probably extract the city and street. I'm not sure how you would go about finding the name, but it seems matching it against the address if you already have a database of (name, address) pairs is feasible.
Dun & Bradstreet do this. They charge money because it's really hard. There's no "standard" solution. It's mostly a painful choice between a service like D&B or roll your own.
As a start, I'd probably do a word-indexed search. That would mean two stages:
Offline stage: Generate an index of all the addresses by their keywords. For example, "Company", "A" and "123" would all become keywords for the address you provided above. You could do some stemming, which would mean for words like "street" you'd also add a word "st" into its index.
Online stage: The user gives you a search query. Break down the search query into all its keywords, and find all possible matches of each keyword in the database. Tally the number of matched keywords on each address. Then sort the results by the number of matched keywords. This should be able to be done quite quickly if there aren't too many matches, as it's just a few sorted list merges and increments, followed finally by a sort.
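The two stages might look something like this in Python, as an in-memory sketch rather than the SQL-backed version. The SYNONYMS table is a crude stand-in for stemming, and all names here are illustrative.

```python
import re
from collections import Counter, defaultdict

# Illustrative stand-in for stemming: map a long form to its short form.
SYNONYMS = {"street": "st", "suite": "ste"}


def keywords(text):
    """Lower-case words with punctuation removed and simple synonyms applied."""
    words = re.findall(r"\w+", text.lower())
    return {SYNONYMS.get(word, word) for word in words}


def build_index(addresses):
    """Offline stage: map each keyword to the set of address ids containing it."""
    index = defaultdict(set)
    for address_id, text in enumerate(addresses):
        for word in keywords(text):
            index[word].add(address_id)
    return index


def search(index, query):
    """Online stage: rank address ids by the number of matching keywords."""
    tally = Counter()
    for word in keywords(query):
        for address_id in index.get(word, ()):
            tally[address_id] += 1
    return tally.most_common()
```

The result is a list of (address id, matched keyword count) pairs, best match first. A zip code filter could be applied before or after the tally.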
Given that you know the domain of your problem, you could specialise the algorithm to use knowledge about the domain - for example the zip code filtering mentioned before.
Also just to enable me to provide you with a better answer, are you using an SQL database at all? I ask because the way I would do it is I'd store the keyword index in the SQL database, and then the SQL query to search by keyword becomes quite easy, since the database does all the work.
Maybe instead of using Levenshtein for the name only, it could be useful when used with the entire string representation of a contact. For instance, the distance of your first example to the second is 7 and to the third 9. Considering the strings have lengths 54, 50 and 45, this seems to be a relatively useful and quite simple similarity measure.
This is what I would do. I am not aware of algorithms, so I just use what makes sense.
I am assuming that the person would provide name, street address, city name, state name, and zipcode.
If the zipcode is provided in 9 numbers, or has a hyphen, I would strip it down to 5 numbers. I would search the database for all of the addresses that have that zipcode.[query 1]
Then I would compare the state letter with the one from the database. If it's not a match, then I would tell that to the user. Same goes for the city name.
From what I understand, a street name is not made of numbers; only the house on a street has a number. Furthermore, the house number is usually at the beginning, unless it is a house or suite number.
So I would use a regex to search for the number and the space or comma that follows it. Then find the position of the first word that does not have a period (.) or end in a comma. That gives me part of the street name, so I could compare it against the rows fetched earlier, or I could change the query to match the street name with LIKE %streetName%.
I am guessing the database has a beginning number and ending number of the house on a block. I would check against that street row to see if the provided street number is on that street.
By now you would know the correct data to show, and could look up in a different table which name is associated with that house number. I am not sure why you want to compare the name. The only use for name comparison would be to find people whose address was not provided. For ways of comparing strings, see Similar String algorithm.
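The zip code and house number steps from this answer are easy to sketch in Python. The function names are made up, and the house number pattern assumes a purely numeric number at the start of the street line, so something like "44A" is not handled.

```python
import re


def base_zip(zip_code):
    """Reduce a 5-digit or 9-digit (ZIP+4) code to its first five digits."""
    digits = re.sub(r"\D", "", zip_code)
    return digits[:5] if len(digits) in (5, 9) else None


def house_number(street_line):
    """Return the leading numeric house number, or None if there is none."""
    match = re.match(r"\s*(\d+)\b", street_line)
    return match.group(1) if match else None
```

The five-digit result is what would go into the first query, and the house number can then be checked against the candidate street rows.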
If you can reliably figure out the general structure of each address (perhaps by the suggestions in the other answers), your best bet would be to run the data through a USPS-certified (meaning: the results are reliable, accurate, and conform to federal standards) address verification service.
@RyanDelucchi, it is a fun problem, but only once you've solved it. So, @SteveBering, I would recommend submitting your list of contacts to a list processing service which will flag duplicates based on the address, according to USPS guidelines.
Since I work in the address verification field, I would suggest SmartyStreets (which I work for) since it will deliver the most value to your specific need -- however, there are a few CASS-Certified vendors who will do basically similar things.