I was just wondering if there are some "standard" examples that everyone uses as a basis for explaining the nature of problems that require a hash table. What are some well-known real-world problems that see great benefits from using a hash table?
*EDIT: also, a little background or explanation as to why the problem's nature benefits from a hash table would help! Thanks
A real world example: Suppose I stay in a hotel for a few days, because I am attending a congress on hashing. At the end of each day, when I return to the hotel, I ask the desk clerk whether there are any messages for me. Behind him is a dovecote-like cupboard with 26 slots, labeled A to Z. Because he knows my last name, he goes to the slot labeled W and takes out three letters. One is for Robby Williams, one is for Jimmy Webb, and one is for me.
The clerk only had to inspect three letters. How many letters would he have had to inspect if there had been only one letter box?
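The clerk's scheme translates directly into code. A minimal sketch, with the slot letter playing the role of the hash value (the names and messages are made up):

    from collections import defaultdict

    mailboxes = defaultdict(list)  # the cupboard: one slot per letter, created on demand

    def deliver(last_name, message):
        mailboxes[last_name[0].upper()].append((last_name, message))

    def collect(last_name):
        # The clerk inspects only the letters in one slot, not the whole pile.
        return [msg for name, msg in mailboxes[last_name[0].upper()]
                if name == last_name]

    deliver("Williams", "Meet at 9")
    deliver("Webb", "Call your agent")
    deliver("Wirth", "Hashing congress dinner at 7")
    print(collect("Wirth"))  # ['Hashing congress dinner at 7']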
When I want a user record in memory, searchable by ID.
An alternative would be a list, but then I would have to loop through it on every lookup to find the user. A hash table gives me the user object in just one call.
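A tiny sketch of the difference (hypothetical user records):

    users = [{"id": 7, "name": "Ann"}, {"id": 42, "name": "Bob"}]

    # List: a linear scan on every lookup.
    user = next(u for u in users if u["id"] == 42)

    # Hash table (dict): build the index once, then look up by key in one call.
    users_by_id = {u["id"]: u for u in users}
    user = users_by_id[42]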
When you go ice skating and you swap your shoes for ice skates: they take your shoes, put them in the skate box for your size, and give you the ice skates plus a token that carries the size (the hash) and the pair number (the element within that hash bucket).
A cache: if new data comes in, we overwrite the existing record using the key. So the cache always stores the most recent state.
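A Python dict already behaves this way; writing to an existing key replaces the old record:

    cache = {}
    cache["order:1001"] = {"status": "pending"}
    cache["order:1001"] = {"status": "shipped"}  # same key: overwrites the old record
    print(cache["order:1001"])                   # {'status': 'shipped'}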
Anytime you have a key (or attribute)-value list, hash tables (AKA: associative arrays) should spring to your mind:
foo['bar']="baz";
surname['joe']="shmoe";
Hashtables generalize the concept of 1-D arrays (where keys are sequential integers and the hash function is the identity) to the case where key values can be anything. The hash function itself is... well, these days it is something you rarely get to see, as most languages hide the gory details of hashing from your eyes behind syntax similar to the one above.
Related
I have a python app with a database of businesses and I want to be able to search for businesses by name (for autocomplete purposes).
For example, consider the names "best buy", "mcdonalds", "sony" and "apple".
I would like "app" to return "apple", as well as "appel" and "ple".
"Mc'donalds" should return "mcdonalds".
"bst b" and "best-buy" should both return "best buy".
Which algorithm am I looking for, and does it have a python implementation?
Thanks!
The Levenshtein distance should do.
Look around - there are implementations in many languages.
Levenshtein distance will do this.
Note: this is a distance; you have to calculate it against every string in your database, which can be a big problem if you have a lot of entries.
If you hit that problem, record all the typos users make (a typo being any query with no direct match) and build, offline, a correction database containing all the typo -> fix mappings. Some companies do this even more cleverly; e.g., Google watches how users correct their own typos and learns the mappings from that.
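Since the question asks for Python: a minimal Levenshtein sketch (the standard dynamic programming, nothing library-specific), together with the brute-force scan over all names that the note above warns about:

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    names = ["best buy", "mcdonalds", "sony", "apple"]
    print(min(names, key=lambda n: levenshtein("appel", n)))  # 'apple'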
Soundex or Metaphone might work.
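For illustration, a simplified Soundex sketch in Python (it ignores the h/w separator rule of the full algorithm, so treat it as an approximation): names that sound alike collapse to the same 4-character code.

    CODES = {c: str(d) for d, letters in enumerate(
        ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), 1) for c in letters}

    def soundex(name):
        name = name.lower()
        digits = [CODES.get(c, "0") for c in name]
        code = name[0].upper()
        for i in range(1, len(name)):
            # Skip vowels (coded "0") and collapse adjacent duplicate codes.
            if digits[i] != "0" and digits[i] != digits[i - 1]:
                code += digits[i]
        return (code + "000")[:4]

    print(soundex("mcdonalds"), soundex("macdonald"))  # both 'M235'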
I think what you are looking for is the huge field of Data Quality and Data Cleansing. I am not sure you will find a ready-made Python implementation for this, as it has to be something that cleanses a considerable amount of data in the DB, which tends to be of business value.
Levenshtein distance goes in the right direction, but only half the way. There are several tricks to get it to use half matches as well.
One would be to use subsequence dynamic time warping (DTW is actually a generalization of Levenshtein distance). For this you relax the start and end cases when calculating the cost matrix. If you relax only one of the conditions, you get autocompletion with spell checking. I am not sure whether a Python implementation is available, but if you want to implement it yourself it should not be more than 10-20 LOC.
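A hedged sketch of the relaxed-end-condition variant described above: the usual Levenshtein DP, but the final score is the minimum over the last row, i.e. the query is matched against the best-matching prefix of each candidate, which is what gives autocompletion with spell checking.

    def prefix_distance(query, candidate):
        prev = list(range(len(candidate) + 1))
        for i, cq in enumerate(query, 1):
            curr = [i]
            for j, cc in enumerate(candidate, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (cq != cc)))
            prev = curr
        return min(prev)  # relaxed end: best prefix of the candidate wins

    names = ["best buy", "mcdonalds", "sony", "apple"]
    print(min(names, key=lambda n: prefix_distance("app", n)))    # 'apple'
    print(min(names, key=lambda n: prefix_distance("bst b", n)))  # 'best buy'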
The other idea would be to use a trie for speedup, which can run DTW/Levenshtein on multiple results simultaneously (a huge speedup if your database is large). There is an IEEE paper on Levenshtein over tries, so you can find the algorithm there. Again, for this you would need to relax the final boundary condition so you get partial matches. However, since you step down the trie, you just need to check when you have fully consumed the input and then return all leaves.
Check this one: http://docs.python.org/library/difflib.html - it should help you.
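In particular, difflib's built-in fuzzy matcher covers the "appel" -> "apple" case directly:

    import difflib

    names = ["best buy", "mcdonalds", "sony", "apple"]
    print(difflib.get_close_matches("appel", names, n=3, cutoff=0.6))  # ['apple']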
I need a function that, given a salt integer and a value integer will return a small hash string. Calling the function with 1 and 56 might return "1AF3". Calling it with 2 and 56 might return "C2FA".
Background info:
I have a web app (written in C# if that matters) that stores employee Id values as integers. Users need to be able to see a consistent representation of that Id, but no user should see the actual Id, or the same representation of that Id as seen by another user.
For example, suppose there is an Employee with the Id of 56.
When User 1 logs in, wherever he sees that employee, he sees "1AF3" or something. He might see this employee on different pages in the app, and its Id should always be 1AF3 so he knows it's the same guy.
When User 2 logs in, should he encounter that same employee, he would always see "C2FA", or something. Same goes for him: wherever he is in the system, he would see that one employee represented by that same string.
Should User 2 look over the shoulder of User 1 while User 1 is logged in, User 2 should not be able to recognize any of his employees on User 1's screen, because this hash should be irreversible.
Does this make sense?
One additional requirement is that since the users will be discussing these employees in email, on the phone, and in faxes, the hash would need to be of a minimum size and not contain non-alphanumeric characters. 10 characters or fewer would be ideal.
Maybe there is a way to "collapse" a SHA-256 result into fewer characters since the whole alphabet could be used? I have no idea.
Update: Another walk-through
Thanks everyone for giving this a shot but it seems like I am doing a bad job explaining it or something.
Let's pretend you and me are both users of this system. You're Fred and I'm Chris. Your UserId is 2 and my UserId is 1. Let's also assume there are 5 Employees in the system. Employees are not users. You can think of them as products, or whatever you want. I'm just talking about 5 generic entities that you, Fred, and I, Chris, each deal with.
Fred, every time you log in, you need to be able to uniquely identify each employee. Every time I, Chris, log in, I also need to work with employees and I too will need to be able to identify them uniquely. But should I ever look over your shoulder while you are managing employees, I should not be able to figure out which ones you are managing.
So, while in the database the employee IDs are 1, 2, 3, 4, and 5, you and I do not see them that way in our interface. I might see A, B, C, D, and E, and you might see F, G, H, I, and J. So while E and J both represent the same employee, I can't look at your screen while you are working with your Employee "J" and know that you are working with Employee 5, because for me that employee is Employee "E".
So, Fred and Chris can each work with the same set of employees, but if they were to see each other's work, or discussion in an email, they would not be able to know what employees the other guy was talking about.
I was thinking I could achieve this "real-time user-dependent EmployeeID" by taking the real employee ID and hashing it using the user ID as the salt.
Since Fred and Chris each need to discuss employees over email and the telephone with their clients and customers, I'd like the IDs that they use in these discussions to be as simple as we can get them.
Conceptually, here is what you want:
You have a set of employee IDs which you can represent as elements of a given space S. You also have some users, and you want each user to see a permutation of space S which is specific to that user, and such that the details of that permutation cannot be guessed by any other user.
This calls for symmetric encryption. Namely, each employee ID is a numerical value (e.g. a 32-bit integer), and a user 'A' sees employee x as Ek(x), where k is a secret key which is specific to 'A' and which 'B' cannot guess. So you need two things:
a block cipher which can work with short values (e.g. 32-bit words);
a method which turns user ID into the user-specific key.
For the block cipher, the trouble is that short blocks are a security issue for the normal usage of a block cipher (i.e. encrypting long messages). So all published, secure block ciphers use large blocks (64 bits or more). 64 bits can be represented over 11 characters by using uppercase and lowercase letters and digits (62^11 is somewhat greater than 2^64). If that's good enough for you, then use 3DES. If you want something smaller, you will have to design your own cipher, something which is not recommended at all. You may want to try KeeLoq: see this paper for pointers (KeeLoq is cryptographically "broken", but not too badly, given your context). There is a generic method for building block ciphers with arbitrary block sizes, given a seekable stream cipher, but it is mostly theoretical (implementation requires wading through high-precision floating-point values, which can be done but is very slow).
For the user-specific key: you want something that the Web application can compute, but not users. This means that the Web application knows a secret key K; then, the user-specific encryption key can be the result of HMAC (with a good hash function, such as SHA-256) applied over the user ID, and using key K. You then truncate the HMAC output to the length you need for the user-specific key (for instance, 3DES needs a 24-byte key).
C# has TripleDES and HMAC/SHA-256 implementations (in System.Security.Cryptography namespace).
(There is no generally accepted secure standard for a block cipher with 32-bit blocks. This is still an open research area.)
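Not part of the answer above, but a sketch of the same construction in Python for comparison (assuming the pycryptodome package for 3DES; MASTER_KEY is a made-up placeholder, and DES3.new would reject the astronomically rare degenerate key):

    import hmac, hashlib, struct
    from Crypto.Cipher import DES3  # pycryptodome

    MASTER_KEY = b"app-wide secret known only to the server"  # hypothetical K

    def user_key(user_id):
        # User-specific key: HMAC-SHA256(K, user_id), truncated to 24 bytes for 3DES.
        return hmac.new(MASTER_KEY, str(user_id).encode(),
                        hashlib.sha256).digest()[:24]

    def visible_id(user_id, employee_id):
        # One 64-bit block per employee ID: a per-user permutation of the ID space.
        block = struct.pack(">Q", employee_id)
        cipher = DES3.new(user_key(user_id), DES3.MODE_ECB)
        return cipher.encrypt(block).hex().upper()  # 16 hex chars; base62 would give 11

    print(visible_id(1, 56))  # stable string for user 1
    print(visible_id(2, 56))  # different, stable string for user 2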
There might be problems with this approach but you could do it like this:
Make an array holding all your symbols (say a 25-element array)
Hash your string using whatever hash function
Pick a number of octets out of the resulting hash (4 octets if you want 4 symbols in your resulting string) from predefined positions
For each octet compute index = octet % array_size. The index gives the position of each of your symbols
Again, I have almost zero experience with cryptography, hash functions and the like so you may want to take this with a grain of salt.
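A sketch of those steps (the symbol array and octet positions here are arbitrary choices):

    import hashlib

    SYMBOLS = "ABCDEFGHIJKLMNPQRSTUVWXYZ"  # step 1: 25 symbols ('O' dropped)
    POSITIONS = (0, 7, 13, 21)             # step 3: predefined octet positions

    def short_hash(s):
        digest = hashlib.sha256(s.encode()).digest()  # step 2
        # Step 4: each chosen octet indexes into the symbol array via modulo.
        return "".join(SYMBOLS[digest[p] % len(SYMBOLS)] for p in POSITIONS)

    print(short_hash("56:salt1"))  # a 4-symbol string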
There are many ways to "de-anonymize" information. It would help if you could be more specific about the context and what "assets" you are really trying to protect here, and against whom. See our faq.
E.g., might one user know the number of another user? They could probably find it out quickly if they discovered through other means the correspondence between 1AF3 and C2FA.
But specifically for your narrower question, a good hash will already be well-mixed, so I'd think you could just truncate, e.g., a SHA-256 hash value. But Thomas will probably know the definitive answer there.
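The truncation idea as a hedged sketch: a keyed hash (keyed with a server-side secret, so users cannot recompute it themselves), cut down to a short alphanumeric tag. Truncation makes collisions possible, so new tags would need checking against the employee table:

    import hmac, hashlib

    SECRET = b"server-side key"  # hypothetical

    def short_tag(user_id, employee_id, length=8):
        msg = f"{user_id}:{employee_id}".encode()
        return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:length].upper()

    print(short_tag(1, 56))  # a stable 8-character tag for user 1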
Here are my thoughts getting to the point of it (I figure since you talked out your question, I'll talk out my answer; I'm guessing you'll find that helpful):
All hail Thomas, because he has clearly established his dominance.
0-9, A-F is just a representation of the data. You can use a bigger alphabet, e.g. upper- and lowercase letters plus digits, excluding some easily confused characters, and represent nearly six bits per character.
You can basically say that all hashes have collisions. If you approach saturation, you'll end up with two people who have the same hash. Hashes are also one-way; you would need a mapping that allows reversal. And if you have a reverse mapping anyway, why not fill it with random strings that don't collide?
You are obfuscating a limited set of data. With a large and secret salt, you can prevent reversal. That said, you're trading one ID for another. The ID is still unique and constant, so I wonder how this enhances security.
I have some clients where, if I were to see something like this, I'd put money on the employee ID being an SSN. I hope you're not doing that.
Employee ID and Employee Alternate ID are what you are coming up with. Since they have to be reversible to you but not the public, you need to store that in a two way pairing and keep it secret. Since there's risk of collision with a hash and you have to have a reverse map anyway, the alternate id might as well be a random string. An ID should be arbitrary anyway, and I would really like to know the perceived security benefit of your approach with two ids for one employee; it makes me think of Mission Impossible and the NOC list.
Just an idea for an approach based on the extra information you have added. The security of this idea is very, very light and I would not recommend it if you think people are going to attempt to crack it, but it's worth throwing in the pot.
You could create a personal hash by bit-shifting the employee Id by your own employee Id, then applying whatever extra obfuscation you need to the resulting number, such as converting it to hex. E.g.
string hashedEmployeeId = (employeeIdToHash << myEmployeeId).ToString("X");
This will generate hashed employee Ids based on your own Id, but you may run into problems when the employee Ids get large (especially your own!)
Just to reiterate, this on its own isn't really very secure, but it might help you on your way.
Using 4 characters you would have a total of 36^4 = 1,679,616 combinations.
You could permute all possibilities of employees together.
If you calculate the square root, you get 1296.
You could then generate an ordered table with all the possibilities in the first column and then randomly distribute ids from 1 to 1296 into the other columns. You would get something like this:
key a b
AAAA 386 67
AAAB 86 945
...
With this solution you would have a lookup table scalable up to 1296 employees. However, if you consider adding an extra character to your key, you get a lot more possibilities: (36^5)^0.5 ≈ 7776.
With this solution, guessing a key would give you a one in 1296 (or 7776) chance of seeing information about an employee.
Performance might be an issue, but I think you can manage it using a cache, or maybe even by keeping all the data loaded in memory and using a kind of tree map to find the corresponding key for two given ids.
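A hedged sketch of how such a table could be generated (it holds all 36^4 keys in memory, which is workable at this size; persistence is left out):

    import itertools, random, string

    ALPHABET = string.ascii_uppercase + string.digits  # 36 symbols
    keys = ("".join(k) for k in itertools.product(ALPHABET, repeat=4))   # 36^4 keys
    pairs = list(itertools.product(range(1, 1297), repeat=2))            # 1296^2 = 36^4 pairs
    random.shuffle(pairs)
    table = dict(zip(keys, pairs))  # key -> (a, b), e.g. 'AAAA' -> (386, 67)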
I'm developing a shopping comparison website, and the project is in a very advanced stage. We index 50 million products daily using merchant feeds from various affiliate networks. Most of the problems I had is already solved, including the majority of the performance bottlenecks.
What is my problem: first of all, we are using Apache Solr with Drupal, BUT this problem IS NOT specific to Drupal or Solr; if you have no knowledge of them, it doesn't matter.
We receive product feeds from over 2000 different merchants, and those feeds are a mess. They have no specific pattern; each merchant sends the feeds the way they want. We have already solved many problems regarding this, but one remains: normalizing the taxonomy terms for the faceted browsing functionality.
Suppose that I have a "Narrow by Brands" browsing facet on my website. Now suppose that 100 merchants offer products from Microsoft. Here comes the problem. Some merchants put "Microsoft" in the "Brands" column of the data feed, others "Microsoft, Inc.", others "Microsoft Corporation", others "Products from Microsoft", etc... there is no specific pattern between merchants and, worse, some individual merchants are so sloppy that they have different strings for the same brand IN THE SAME DATA FEED.
We do not want all those different brands appearing in the navigation. We have a manual solution to the problem where we manually map the imported brands to the "good" brands table ("Microsoft Corporation" -> "Microsoft", "Products from Microsoft" -> "Microsoft", etc..). We have something like 10,000 brands in the database and this is doable. The problem is when it comes with bigger things like "Authors". When we import books into the system, there are over 800,000 authors and we have the same problem and this is not doable by hand mapping. The problem is the same: "Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M.", etc...
Does anybody know a good way to automatically solve this problem with an acceptable degree of accuracy (85%-95% accuracy)?
Thank you for the help!
An idea that comes to mind, although it's just a loose thought:
Convert names to initials (in your example: TMA). Treat '-' as a space, so e.g. Antoine de Saint-Exupéry would be ADSE. The problem here is how to treat ','; however, its common usage is to put the surname before the forename, so just swapping positions should work ('A, TM' becomes 'TM, A'; get rid of the comma: TMA).
Filter the authors in your database by those initials.
For each initial, if you have the whole name (Tom, Apostol), check whether it matches; otherwise (M.) consider it a match automatically.
If you want some tolerance, you can compare names with Levenshtein distance and tolerate some differences (here you have an Oracle implementation).
Names that match you treat as the same author. To find the whole name, for each initial (T, M, A) you look up your filtered authors (after step 2) and try to find one with not just an initial (M.) but a whole name (Mike); if you can't find one, use the initial. This way, each of the examples you gave would be converted to the same value, the full name (Tom Mike Apostol).
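A rough sketch of steps 1 and 2 (plain string handling, no external libraries; the matching of steps 3-5 would be layered on top of these buckets):

    from collections import defaultdict

    def initials(name):
        # Swap "Surname, Forenames" and treat '-' and '.' as spaces.
        if "," in name:
            last, rest = name.split(",", 1)
            name = rest + " " + last
        parts = name.replace("-", " ").replace(".", " ").split()
        return "".join(p[0].upper() for p in parts)

    buckets = defaultdict(list)
    for author in ["Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M."]:
        buckets[initials(author)].append(author)

    print(dict(buckets))  # all three names land in the 'TMA' bucket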
Things worth thinking about:
Include mappings for name synonyms (most likely no more than a few hundred records, like Thomas <-> Tom).
For this approach it is crucial to have valid initials (no M instead of N, etc.).
Edit: I coded something like this some time ago, when I had to identify a person by their signature, ignoring scanning problems. People sometimes sign as 'Name S. Surname', or 'N.S.', or just 'Name Surname' (which is another case you should perhaps allow for in the solution, letting the algorithm ignore the second name, although in your situation it would be rather rare to omit someone's second name, I guess).
I am interested in hearing about enterprise solutions for SSN handling. (I looked pretty hard for any pre-existing post on SO, including reviewing the terrific SO automated "Related Questions" list, and did not find anything, so hopefully this is not a repeat.)
First, I think it is important to enumerate the reasons systems/databases use SSNs (note: these are reasons for the de facto current state; I understand that many of them are not good reasons):
Required for Interaction with External Entities. This is the most valid case—where external entities your system interfaces with require an SSN. This would typically be government, tax and financial.
SSN is used to ensure system-wide uniqueness.
SSN has become the default foreign key used internally within the enterprise, to perform cross-system joins.
SSN is used for user authentication (e.g., log-on)
The enterprise solution that seems optimum to me is to create a single SSN repository that is accessed by all applications needing to look up SSN info. This repository substitutes a globally unique, random 9-digit number (ASN) for the true SSN. I see many benefits to this approach. First of all, it is obviously highly backwards-compatible—all your systems "just" have to go through a major, synchronized, one-time data-cleansing exercise, where they replace the real SSN with the alternate ASN. Also, it is centralized, so it minimizes the scope for inspection and compliance. (Obviously, as a negative, it also creates a single point of failure.)
This approach would solve issues 2 and 3, without ever requiring lookups to get the real SSN.
For issue #1, authorized systems could provide an ASN, and be returned the real SSN. This would of course be done over secure connections, and the requesting systems would never persist the full SSN. Also, if the requesting system only needs the last 4 digits of the SSN, then that is all that would ever be passed.
Issue #4 could be handled the same way as issue #1, though obviously the best thing would be to move away from having users supply an SSN for log-on.
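A hedged, in-memory sketch of the repository idea (real tables and real access control would stand in for the dicts and the boolean flag):

    import secrets

    ssn_to_asn, asn_to_ssn = {}, {}  # stand-in for the repository's tables

    def asn_for(ssn):
        # Issue a globally unique, random 9-digit ASN the first time an SSN is seen.
        if ssn not in ssn_to_asn:
            while True:
                asn = "".join(str(secrets.randbelow(10)) for _ in range(9))
                if asn not in asn_to_ssn:
                    break
            ssn_to_asn[ssn], asn_to_ssn[asn] = asn, ssn
        return ssn_to_asn[ssn]

    def ssn_for(asn, caller_is_authorized):
        # Issue #1: only authorized systems may resolve an ASN back to the real SSN.
        if not caller_is_authorized:
            raise PermissionError("caller may not resolve ASNs")
        return asn_to_ssn[asn]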
There are a couple of papers on this:
UC Berkeley
Oracle Vault
I have found a trove of great information at the Securosis site/blog. In particular, this white paper does a great job of summarizing, comparing and contrasting database encryption and tokenization. It is more focused on the credit card (PCI) industry, but it is also helpful for my SSN purpose.
It should be noted that SSNs are PII, but are not private. SSNs are public information that can be easily acquired from numerous sources, even online. That said, if SSNs are the basis of your DB primary key, you have a severe security problem in your logic. If this problem is evident at a large enterprise, then I would stop what you are doing and recommend a massive data migration RIGHT NOW.
As far as protection goes, SSNs are PII that is both unique and small in payload, so I would protect that form of data no differently than a password used for one-time authentication. The last four digits of an SSN are frequently used for verification or non-unique identification, as they are highly distinguishing when coupled with another data attribute and are not PII on their own. That said, the last four digits of an SSN can be replicated in your DB for open alternative use.
I have come across a company, Voltage, that supplies a product which performs "format-preserving encryption" (FPE). This substitutes an arbitrary, reversibly encrypted 9-digit number for the real SSN (in the SSN example). I am just in the early stages of looking into their technical marketing collateral...
I want to analyze answers to a web survey (Git User's Survey 2008, if anyone is interested). Some of the questions were free-form questions, like "How did you hear about Git?". With more than 3,000 replies, analyzing them entirely by hand is out of the question (especially as there are quite a few free-form questions in this survey).
How can I group those replies (probably based on the key words used in the response) into categories, at least semi-automatically (i.e. the program can ask for confirmation), and later, how do I tabulate (count the number of entries in each category) those free-form replies (answers)? One answer can belong to more than one category, although for simplicity one can assume that the categories are orthogonal/exclusive.
What I'd like to know is at least keyword to search for, or an algorithm (a method) to use. I would prefer solutions in Perl (or C).
Possible solution No. 1 (partial): Bayesian categorization
(added 2009-05-21)
One solution I thought about would be to use something like the algorithm (and the mathematics behind it) of Bayesian spam filtering, only instead of one or two categories ("spam" and "ham") there would be more, and the categories themselves would be created adaptively/interactively.
Text::Ngrams + Algorithm::Cluster
Generate some vector representation for each answer (e.g. word count) using Text::Ngrams.
Cluster the vectors using Algorithm::Cluster to determine the groupings and also the keywords which correspond to the groups.
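The answer names Perl modules; a hedged equivalent in Python, with scikit-learn standing in for Text::Ngrams (the vectors) and Algorithm::Cluster (the clustering). On this toy input, the "blog" answers and the "friend" answers should typically separate:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    answers = ["read about it on a blog", "a blog post mentioned git",
               "a friend told me", "my friend recommended it"]

    # Vector representation per answer (word and bigram weights).
    vectors = TfidfVectorizer(ngram_range=(1, 2), stop_words="english").fit_transform(answers)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

    for answer, label in zip(answers, labels):
        print(label, answer)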
You are not going to like this. But: If you do a survey and you include lots of free-form questions, you better be prepared to categorize them manually. If that is out of the question, why did you have those questions in the first place?
I've brute-forced stuff like this in the past with quite large corpuses: Lingua::EN::Tagger, Lingua::Stem::En. Also, the Net::Calais API is pretty useful for extracting named entities from text (unfortunately, as Thomson Reuters are not exactly open-source friendly). Of course, once you've cleaned up the raw data with this stuff, the actual data munging is up to you. I'd be inclined to suspect that frequency counts and a bit of mechanical turk cross-validation of the output would be sufficient for your needs.
Look for common words as keywords, but throw away meaningless ones like "the", "a", etc. After that you get into natural language stuff that is beyond me.
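That keyword-count idea as a sketch (the stop-word list here is a made-up minimal one; a real one would be much larger):

    import re
    from collections import Counter

    STOP = {"the", "a", "an", "i", "it", "on", "to", "of", "me", "my", "about"}

    replies = ["I read about it on a blog", "a friend told me about git"]
    words = [w for r in replies for w in re.findall(r"[a-z']+", r.lower())
             if w not in STOP]
    print(Counter(words).most_common(5))  # [('read', 1), ('blog', 1), ...]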
It just dawned on me that the perfect solution for this is AAI (Artificial Artificial Intelligence). Use Amazon's Mechanical Turk. The Perl bindings are Net::Amazon::MechanicalTurk. At one penny per reply with a decent overlap (say three humans per reply) that would come to about $90 USD.