Scrambling Social Security Number - security

I'm working with a data set that deals with personal data (i.e. data that deals with people, not [necessarily] private data). This data changes over time, and the format is imposed by the client. I need something to use as a primary key, and unfortunately the only field that uniquely identifies a person and doesn't change unpredictably is the SSN. The ID number (primary key) is going to be public facing, so I can't publish the SSN itself, but I'm hoping to obscure it.
The result must be numeric.
The result may be up to 25 digits long.
The result must be unique.
The result should be as difficult as possible to reverse without a key, given the constraints above.
Is there a numeric cipher that would fit this?
Am I crazy for trying this?

Format-preserving encryption sounds like a solution to your problem. Use it on the SSN and you get a random-looking 9-digit number that you can pad out to the 25-digit ID you need. If you do the padding right, you can even invert it (if you have the key). The point is that after running it through format-preserving encryption, your data is not sensitive.
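To make the idea concrete, here is a minimal sketch of a keyed permutation over the nine-digit SSN space, built from a small Feistel network with cycle-walking. This is an illustration only, not NIST FF1/FF3 and not a vetted cipher; the names `ssn_to_public_id` and `public_id_to_ssn`, the round count, and the use of HMAC-SHA256 as the round function are all my own assumptions. For real data, use a reviewed format-preserving encryption library.

```python
import hashlib
import hmac

DOMAIN = 10**9               # nine-digit SSNs: 000000000..999999999
HALF_BITS = 15               # two 15-bit halves = 30 bits, and 2**30 >= 10**9
MASK = (1 << HALF_BITS) - 1
ROUNDS = 8                   # illustrative round count, not a vetted parameter


def _round(key: bytes, rnd: int, half: int) -> int:
    """Keyed round function: HMAC-SHA256 truncated to HALF_BITS bits."""
    msg = bytes([rnd]) + half.to_bytes(2, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "big") & MASK


def _permute(x: int, key: bytes, forward: bool) -> int:
    """One pass of the Feistel network over the 30-bit space (or its inverse)."""
    left, right = x >> HALF_BITS, x & MASK
    rounds = range(ROUNDS) if forward else reversed(range(ROUNDS))
    for rnd in rounds:
        if forward:
            left, right = right, left ^ _round(key, rnd, right)
        else:
            left, right = right ^ _round(key, rnd, left), left
    return (left << HALF_BITS) | right


def _walk(x: int, key: bytes, forward: bool) -> int:
    """Cycle-walk: repeat the permutation until the value lands below 10**9."""
    if not 0 <= x < DOMAIN:
        raise ValueError("value must be a nine-digit number")
    y = _permute(x, key, forward)
    while y >= DOMAIN:
        y = _permute(y, key, forward)
    return y


def ssn_to_public_id(ssn: int, key: bytes) -> str:
    """Encrypt the SSN and left-pad the result with zeros to 25 digits."""
    return f"{_walk(ssn, key, True):025d}"


def public_id_to_ssn(public_id: str, key: bytes) -> int:
    """Invert ssn_to_public_id; only possible with the same key."""
    return _walk(int(public_id), key, False)
```

Because the mapping is a permutation of the nine-digit space, distinct SSNs always produce distinct IDs. Zero padding is the simplest invertible padding; anything fancier would have to be stripped before decrypting.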

A social security number is nine digits long, which means there are only 10^9 = 1,000,000,000 possible SSNs. Most operations you perform on an SSN can be brute-forced, so I suggest you just assign a unique random 25-digit number to each SSN. The random 25-digit number is your public ID, and the relationship between each pair is totally private.
The random key is not dependent upon the data it is assigned to, so there is no way to retrieve the input from the output (if you think of it as a function).
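A minimal sketch of that lookup-table approach, assuming an in-memory registry (the class name `PublicIdRegistry` is hypothetical; in practice the mapping would live in a private database table with a unique constraint on the ID column):

```python
import secrets


class PublicIdRegistry:
    """Maps each SSN to a random 25-digit public ID, kept private."""

    def __init__(self):
        self._by_ssn = {}      # ssn -> public id
        self._issued = set()   # every public id handed out so far

    def public_id(self, ssn: str) -> str:
        """Return the existing ID for this SSN, or assign a fresh random one."""
        existing = self._by_ssn.get(ssn)
        if existing is not None:
            return existing
        while True:
            # Cryptographically strong random number, zero-padded to 25 digits.
            candidate = f"{secrets.randbelow(10**25):025d}"
            if candidate not in self._issued:  # collisions are very unlikely
                break
        self._issued.add(candidate)
        self._by_ssn[ssn] = candidate
        return candidate
```

The ID carries no information about the SSN, so there is nothing to brute-force; the cost is that the table itself becomes the secret you have to protect and back up.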

Related

Create Random number sequence without duplicates in excel for almost million rows or records

I want to create a random sequence of numbers in 11-digit format, running from 10000000000 to 999999999999, where each value is unique. I would like to populate almost 20-50 million records in Excel without having to keep dragging the fill handle (the + cursor) all the way down to the bottom of the column.
I tried using RANDBETWEEN, but it produces duplicates and I have to keep dragging, which is time consuming. Is there a better way to accomplish this?
=RANDBETWEEN(10000000000,999999999999)
For that many unique numbers I suggest using an encryption, where the output is guaranteed unique for unique inputs.
Simply encrypt the numbers 0, 1, 2, ... for different unique inputs. You will need to use the same encryption key and other inputs (IV, nonce etc.) to guarantee unique outputs.
You will need to do some processing on the outputs to get them into the required range. Have a look at Format Preserving Encryption for some help with this.
As @BigBen pointed out, Excel is probably the wrong tool for this.

Randomize Account Numbers, but make an account number that appears twice that same random number

I'm trying to find out how I would go about randomizing account numbers in a file, and where I have the same account number making sure that number has the same random number.
I'm exporting a file to some consultants and obviously don't want them to have secure information, but I want them to be able to count the number of times an account number has appeared for reporting purposes.
For the sake of an answer, as mentioned in a comment:
Create a table that maps actual uniques to dummy random, then lookup the substitution in that table.
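A rough sketch of that substitution table, assuming the account numbers have already been read from the file into a list (the function name `pseudonymize` and the 10-digit dummy length are arbitrary choices of mine):

```python
import secrets


def pseudonymize(account_numbers, digits=10):
    """Replace each account number with a random dummy, reusing the same
    dummy every time the same account number appears."""
    mapping = {}   # real account number -> dummy number (keep this private)
    used = set()   # dummies already assigned, to avoid collisions
    masked = []
    for account in account_numbers:
        if account not in mapping:
            dummy = f"{secrets.randbelow(10**digits):0{digits}d}"
            while dummy in used:
                dummy = f"{secrets.randbelow(10**digits):0{digits}d}"
            used.add(dummy)
            mapping[account] = dummy
        masked.append(mapping[account])
    return masked, mapping
```

Only the masked column goes into the export. Because repeats map to the same dummy, the consultants can still count occurrences per account.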

Auto generation of a sequence number in c# or SQL

I have a situation where the user enters a 9-digit number. But some people will only enter three digits, like 901. When they enter three digits in the text box, I have to auto-generate the remaining 6 digits and insert the result into the database. I should check the database table for the last auto-generated number starting with 901 and insert the next value after it.
I just need a suggestion, not the complete solution: should I do this in C# or SQL? Which is the best way to do it?
Thanks
A few things. As Tim mentioned, it seems the system should provide the user with the number, not the user providing the system with the number. At any rate you'll want to use a database, which would at the least contain a series of primary keys [http://en.wikipedia.org/wiki/Unique_key] (numbers not to be duplicated). Set the key to auto increment. This will ensure no other user has the same number/id.
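For the prefix case in the question, a plain auto-increment column is not quite enough, because the counter has to restart per prefix. Here is one possible sketch using Python's built-in sqlite3 module purely for illustration; the table name `issued`, the column name `full_number`, and the helper names are assumptions of mine, and the same idea carries over to C# with any SQL database. The primary key is what actually guarantees uniqueness.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the table if needed; the primary key forbids duplicate numbers."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS issued (full_number INTEGER PRIMARY KEY)"
    )
    return conn


def next_number(conn, prefix: int) -> int:
    """Issue the next 9-digit number that starts with the 3-digit prefix."""
    low = prefix * 1_000_000
    high = low + 999_999
    with conn:  # commits on success, rolls back if an exception is raised
        (current,) = conn.execute(
            "SELECT MAX(full_number) FROM issued "
            "WHERE full_number BETWEEN ? AND ?",
            (low, high),
        ).fetchone()
        candidate = low if current is None else current + 1
        if candidate > high:
            raise ValueError("no numbers left for this prefix")
        conn.execute("INSERT INTO issued (full_number) VALUES (?)", (candidate,))
    return candidate
```

If two clients race for the same number, the second insert fails with an integrity error instead of silently creating a duplicate, so the caller can simply retry.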

Generating unique combinations of text

I'm building a string of text from different parts. Group A + GROUP B + GROUP C + GROUP D.
The text is put together in this exact order. Each sentence is unique.
I randomly take one sentence from each group and put them together so the total combination of unique text would be A*B*C*D where A,B,C,D are the number of sentences in their respective group.
My problem is: how do I make sure I don't generate duplicates this way, and how do I know when I have used up all possible combinations?
Storing all possible combinations somewhere seems a rather inefficient way to do this. So what options do I have?
As random strings of text are pulled from each group, simply store the starting position of the sentence within the group along with the length into a container like a dictionary or perhaps HashSet. This would act as the key to the container. If the number of sentences in each group is small enough, you might be able to pack the data into a single integer or long value, otherwise define a structure or class for it. The code should look in the container to see if the random combination generated has already been used. If it has been used, then loop until a unique one has been found. If the total number of combinations is small enough such that the user might go through them all, then pre-calculate the total-count and check to see if the container reaches that count, in which case some sort of exit processing should be performed.
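A small sketch of that bookkeeping, using each sentence's index within its group as the key instead of a start position and length (a simplification on my part; the function name `unique_combinations` is hypothetical):

```python
import random


def unique_combinations(groups, rng=None):
    """Yield random, never-repeating combinations of one sentence per group,
    and stop once every possible combination has been produced."""
    rng = rng or random.Random()
    total = 1
    for group in groups:
        total *= len(group)          # A * B * C * D possible combinations
    seen = set()                     # index tuples already used
    while len(seen) < total:
        combo = tuple(rng.randrange(len(group)) for group in groups)
        if combo in seen:
            continue                 # already used, draw again
        seen.add(combo)
        yield " ".join(group[i] for group, i in zip(groups, combo))
```

Only the combinations actually used are stored, not all possible ones. The retry loop does get slower as the set fills up, so if you expect to exhaust most combinations, shuffling a list of all index tuples up front may be the better trade.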

Weighted search algorithm to find like contacts

I need to write an algorithm that returns the closest match for a contact based on the name and address entered by the user. Both of these are troubling, since there are so many ways to enter a company name and address, for instance:
Company A, 123 Any Street Suite 200, Anytown, AK 99012
Comp. A, 123 Any St., Suite 200, Anytown, AK 99012
CA, 123 Any Street Ste 200, Anytown, AK 99012
I have looked at doing a Levenshtein distance on the Name, but that doesn't seem a great tool, since they could abbreviate the name. I am looking for something that matches on the most information possible.
My initial attempt was to limit the results first by the first 5 digits of the postal code and then try to filter down to one based on other information, but there must be a more standard approach to getting this done. I am working in .NET but will look at any code you can provide to get an idea on how to accomplish this.
I don't know exactly how this is accomplished, but all major delivery companies (FedEx, USPS, UPS) seem to have a way of matching an address you input against their database and transforming it to a normalized form. As I've seen this happen on multiple websites (Amazon comes to mind), I am assuming that there is an API for this functionality, but I don't know where to look for it or whether it is suitable for your purposes.
Just a thought though.
EDIT: I found the USPS API
I have solved this problem with a combination of address normalization, Metaphone, and Levenshtein distance. You will need to separate the name from the address since they have different characteristics. Here are the steps you need to do:
1) Narrow down your list of matches by using the (first six characters of the) zip code. Basically you will need to calculate the Levenshtein distance between the entered zip code and each stored one, and select the ones that have a distance of 1 or 2 at the most. You can potentially precompute a table of zip codes and their "Levenshtein neighbors" if you really need to speed up the search.
http://en.wikipedia.org/wiki/Levenshtein_distance
2) Convert all the address abbreviations to a standard format using the list of official prefix and suffix abbreviations from the USPS. This will help make sure your results for the next step are more uniform:
https://www.usps.com/send/official-abbreviations.htm
3) Convert the address to a short code using the Metaphone algorithm. This will get rid of most common spelling mistakes. Just make sure that your implementation can eliminate all non-word characters, pass numbers through intact, and handle multiple words (make sure each word is separated by a single space):
http://en.wikipedia.org/wiki/Metaphone
4) Once you have the Metaphone results, compare the address strings using the Levenshtein distance. Calculate a percentage-of-change score by dividing the distance by the number of characters in the longer string.
5) Repeat steps 3 and 4 but now use the names instead of the addresses.
6) Compute the score for each entry using this formula: (Weight for address * Address score) + (Weight for name * Name score). Pick your weights based on what is more important. I would start with .9 for the address (since the address is more specific) and .1 for the name, but the weights may depend on your application. Pick the entry with the lowest score. If the score is too high (say over .15), you may declare that there are no matches.
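The scoring part of steps 4 to 6 can be sketched as follows. This is a simplified illustration: it assumes the names and addresses have already been normalized (abbreviations and Metaphone from steps 2 and 3 are left out, since Metaphone is not in the Python standard library), and the function names are my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def change_score(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string (step 4)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def combined_score(query, candidate, w_addr=0.9, w_name=0.1):
    """Weighted sum from step 6; both arguments are (name, address) pairs."""
    return (w_addr * change_score(query[1], candidate[1])
            + w_name * change_score(query[0], candidate[0]))


def best_match(query, candidates, threshold=0.15):
    """Return (candidate, score) for the lowest score, or None if too high."""
    if not candidates:
        return None
    best = min(candidates, key=lambda c: combined_score(query, c))
    score = combined_score(query, best)
    return None if score > threshold else (best, score)
```

Lower scores mean closer matches, so an exact match scores 0. The .9/.1 weights and the .15 cutoff are just the starting values suggested above and will likely need tuning against real data.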
I think filtering based on zip code first would be the easiest, as finding it is fairly unambiguous. From there you can probably extract the city and street. I'm not sure how you would go about finding the name, but it seems matching it against the address if you already have a database of (name, address) pairs is feasible.
Dun & Bradstreet do this. They charge money because it's really hard. There's no "standard" solution. It's mostly a painful choice between a service like D&B or roll your own.
As a start, I'd probably do a word-indexed search. That would mean two stages:
Offline stage: Generate an index of all the addresses by their keywords. For example, "Company", "A" and "123" would all become keywords for the address you provided above. You could do some stemming, which would mean that for words like "street" you'd also add the word "st" to the index.
Online stage: The user gives you a search query. Break down the search query into all its keywords, and find all possible matches of each keyword in the database. Tally the number of matched keywords on each address. Then sort the results by the number of matched keywords. This can be done quite quickly if there aren't too many matches, as it's just a few sorted list merges and increments, followed finally by a sort.
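As a rough in-memory sketch of those two stages (the abbreviation table `ABBREVIATIONS` is a tiny made-up sample, and the function names are hypothetical):

```python
import re
from collections import Counter, defaultdict

# Tiny sample table; a real one would come from the USPS abbreviation list.
ABBREVIATIONS = {"street": "st", "suite": "ste", "avenue": "ave"}


def keywords(text):
    """Lowercased alphanumeric tokens, plus the short form of known words."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    words.update(ABBREVIATIONS[w] for w in list(words) if w in ABBREVIATIONS)
    return words


def build_index(addresses):
    """Offline stage: map every keyword to the addresses that contain it."""
    index = defaultdict(set)
    for position, address in enumerate(addresses):
        for word in keywords(address):
            index[word].add(position)
    return index


def search(query, index):
    """Online stage: tally keyword hits per address, best match first."""
    tally = Counter()
    for word in keywords(query):
        for position in index.get(word, ()):
            tally[position] += 1
    return tally.most_common()  # list of (address position, hit count)
```

With an index built over the stored addresses, `search("Comp. A, 123 Any St., Suite 200, Anytown, AK 99012", index)` ranks an address stored as "Company A, 123 Any Street Suite 200, ..." at the top, because "st" was indexed alongside "street".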
Given that you know the domain of your problem, you could specialise the algorithm to use knowledge about the domain - for example the zip code filtering mentioned before.
Also just to enable me to provide you with a better answer, are you using an SQL database at all? I ask because the way I would do it is I'd store the keyword index in the SQL database, and then the SQL query to search by keyword becomes quite easy, since the database does all the work.
Maybe instead of using Levenshtein for the name only, it could be useful when used with the entire string representation of a contact. For instance, the distance of your first example to the second is 7 and to the third 9. Considering the strings have lengths 54, 50 and 45, this seems to be a relatively useful and quite simple similarity measure.
This is what I would do. I am not aware of algorithms, so I just use what makes sense.
I am assuming that the person would provide name, street address, city name, state name, and zipcode.
If the zipcode is provided as 9 digits, or has a hyphen, I would strip it down to 5 digits. I would search the database for all of the addresses that have that zipcode. [query 1]
Then I would compare the state letter with the one from the database. If it's not a match, then I would tell that to the user. Same goes for the city name.
From what I understand, a street name does not contain numbers; only the house on a street has a number. Furthermore, the house number is usually at the beginning, unless it is a unit or suite number.
So I would use a regex to find the number and the space or comma that follows it. Then I would find the position of the first word that does not contain a period (.) or end in a comma. That gives me part of the street name, so I could compare it against the rows fetched earlier, or change the query to match the street name with LIKE %streetName%.
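A quick sketch of those two parsing steps, under the assumption that the house number leads the street line (the helper names `normalize_zip` and `parse_street` are made up for illustration, and real addresses will have plenty of cases this misses):

```python
import re


def normalize_zip(raw):
    """Keep only the first 5 digits of a ZIP or ZIP+4; None if too short."""
    digits = re.sub(r"\D", "", raw)
    return digits[:5] if len(digits) >= 5 else None


def parse_street(line):
    """Return (house_number, first_street_word), or None if no leading number.

    The street word is the first word after the number that neither
    contains a period nor ends in a comma.
    """
    match = re.match(r"\s*(\d+)[\s,]+(.*)", line)
    if not match:
        return None
    number, rest = match.groups()
    for word in rest.split():
        if "." not in word and not word.endswith(","):
            return number, word
    return number, None
```

The street word can then be plugged into the LIKE query as a bound parameter rather than by string concatenation.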
I am guessing the database has a beginning number and ending number of the house on a block. I would check against that street row to see if the provided street number is on that street.
By now you would know the correct data to show, and could look up in a different table which name is associated with that house number. I am not sure why you want to compare names. The only use for name comparison would be to find people whose address was not provided. For ways of comparing strings, see Similar String algorithm.
If you can reliably figure out general structure of each address (perhaps by the suggestions in the other answers), your best bet would be to run the data through a USPS-certified (meaning: the results are reliable, accurate, and conform to federal standards) address verification service.
@RyanDelucchi, it is a fun problem, but only once you've solved it. So, @SteveBering, I would recommend submitting your list of contacts to a list processing service which will flag duplicates based on the address, according to USPS guidelines.
Since I work in the address verification field, I would suggest SmartyStreets (which I work for) since it will deliver the most value to your specific need -- however, there are a few CASS-Certified vendors who will do basically similar things.

Resources