How do I hash an integer into a very small string? - security

I need a function that, given a salt integer and a value integer will return a small hash string. Calling the function with 1 and 56 might return "1AF3". Calling it with 2 and 56 might return "C2FA".
Background info:
I have a web app (written in C# if that matters) that stores employee Id values as integers. Users need to be able to see a consistent representation of that Id, but no user should see the actual Id, or the same representation of that Id as seen by another user.
For example, suppose there is an Employee with the Id of 56.
When User 1 logs in, wherever he sees that employee, he sees "1AF3" or something. He might see this employee on different pages in the app, and its Id should always be 1AF3 so he knows it's the same guy.
When User 2 logs in, should he encounter that same employee, he would always see "C2FA", or something. Same goes for User 2: wherever he is in the system, he would see that one employee represented by that same string.
Should User 2 look over the shoulder of User 1 while User 1 is logged in, User 2 should not be able to recognize any of his employees on User 1's screen, because this hash should be irreversible.
Does this make sense?
One additional requirement is that since the users will be discussing these employees in email, on the phone, and in faxes, the hash would need to be of a minimum size and not contain non-alphanumeric characters. 10 characters or fewer would be ideal.
Maybe there is a way to "collapse" a SHA-256 result into fewer characters since the whole alphabet could be used? I have no idea.
Update: Another walk-through
Thanks everyone for giving this a shot but it seems like I am doing a bad job explaining it or something.
Let's pretend you and me are both users of this system. You're Fred and I'm Chris. Your UserId is 2 and my UserId is 1. Let's also assume there are 5 Employees in the system. Employees are not users. You can think of them as products, or whatever you want. I'm just talking about 5 generic entities that you, Fred, and I, Chris, each deal with.
Fred, every time you log in, you need to be able to uniquely identify each employee. Every time I, Chris, log in, I also need to work with employees and I too will need to be able to identify them uniquely. But should I ever look over your shoulder while you are managing employees, I should not be able to figure out which ones you are managing.
In the database, the employee IDs are 1, 2, 3, 4, and 5, but you and I do not see them that way in our interface. I might see A, B, C, D, and E, and you might see F, G, H, I, and J. So while E and J both represent the same employee, I can't look at your screen while you are working with your Employee "J" and know that you are working with Employee 5, because for me that employee is called Employee "E".
So, Fred and Chris can each work with the same set of employees, but if they were to see each other's work, or discussion in an email, they would not be able to know what employees the other guy was talking about.
I was thinking I could achieve this "real-time user-dependent EmployeeID" by taking the real employee ID and hashing it using the user ID as the salt.
Since Fred and Chris each need to discuss employees over email and the telephone with their clients and customers, I'd like the IDs that they use in these discussions to be as simple as we can get them.

Conceptually, here is what you want:
You have a set of employee IDs which you can represent as elements of a given space S. You also have some users, and you want each user to see a permutation of space S which is specific to that user, and such that the details of that permutation cannot be guessed by any other user.
This calls for symmetric encryption. Namely, each employee ID is a numerical value (e.g. a 32-bit integer), and a user 'A' sees employee x as E_k(x), where k is a secret key which is specific to 'A' and that 'B' cannot guess. So you need two things:
a block cipher which can work with short values (e.g. 32-bit words);
a method which turns user ID into the user-specific key.
For the block cipher, the trouble is that short blocks are a security issue for the normal usage of a block cipher (i.e. to encrypt long messages). So all published, secure block ciphers use large blocks (64 bits or more). 64 bits can be represented over 11 characters by using uppercase and lowercase letters, and digits (62^11 is somewhat greater than 2^64). If that's good enough for you, then use 3DES. If you want something smaller, you will have to design your own cipher, something which is not recommended at all. You may want to try KeeLoq: see this paper for pointers (KeeLoq is cryptographically "broken" but not too much, given your context). There is a generic method for building block ciphers with arbitrary block sizes, given a seekable stream cipher, but this is mostly theoretical (implementation requires wading through high-precision floating point values, which can be done but is very slow).
For the user-specific key: you want something that the Web application can compute, but not users. This means that the Web application knows a secret key K; then, the user-specific encryption key can be the result of HMAC (with a good hash function, such as SHA-256) applied over the user ID, and using key K. You then truncate the HMAC output to the length you need for the user-specific key (for instance, 3DES needs a 24-byte key).
C# has TripleDES and HMAC/SHA-256 implementations (in System.Security.Cryptography namespace).
(There is no generally accepted secure standard for a block cipher with 32-bit blocks. This is still an open research area.)
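A rough sketch of the two non-cipher pieces described above: deriving the per-user key with HMAC/SHA-256 and writing a 64-bit block as 11 alphanumeric characters. It is in Python purely for illustration (the app in question is C#), and the 3DES step itself is left out because Python's standard library has no block cipher. The names `derive_user_key`, `to_base62`, `from_base62` and the `master_key` parameter are made up for this example.

```python
import hashlib
import hmac

# 62 symbols: digits, uppercase, lowercase (the ordering is an arbitrary choice).
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def derive_user_key(master_key: bytes, user_id: int, length: int = 24) -> bytes:
    """HMAC-SHA256 of the user ID under the server secret, truncated to the key length."""
    message = str(user_id).encode("ascii")
    return hmac.new(master_key, message, hashlib.sha256).digest()[:length]


def to_base62(value: int, width: int = 11) -> str:
    """Encode one 64-bit block as a fixed-width alphanumeric string."""
    if not 0 <= value < 2 ** 64:
        raise ValueError("value must fit in 64 bits")
    chars = []
    for _ in range(width):
        value, digit = divmod(value, 62)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def from_base62(text: str) -> int:
    """Inverse of to_base62."""
    value = 0
    for ch in text:
        value = value * 62 + ALPHABET.index(ch)
    return value
```

The 24-byte default matches the 3DES key size mentioned above; the encrypted 8-byte block would be converted to an integer and passed to `to_base62` before being shown to the user.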

There might be problems with this approach but you could do it like this:
Make an array holding all your symbols (say a 25 element array)
Hash your string using whatever hash function
Pick a number of octets out of the resulting hash (4 octets if you want 4 symbols in your resulting string) from predefined positions
For each octet compute index = octet % array_size. The index gives the position for each of your symbols
Again, I have almost zero experience with cryptography, hash functions and the like so you may want to take this with a grain of salt.
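The steps above could look roughly like this in Python. The choices here are assumptions for the sake of the example: SHA-256 as the hash, the first octets as the "predefined positions", and a 32-symbol alphabet (chosen so that 256 is a multiple of the array size and the modulo step does not favour some symbols). Note that this gives short codes, not unique ones: two different employees can end up with the same code.

```python
import hashlib

# 32 symbols: letters without I and O, digits without 0 and 1 (easier to read aloud).
SYMBOLS = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"


def short_code(salt: int, value: int, length: int = 4) -> str:
    """Hash salt and value together, then map the first `length` octets to symbols."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("ascii")).digest()
    return "".join(SYMBOLS[octet % len(SYMBOLS)] for octet in digest[:length])
```

Calling `short_code(1, 56)` always returns the same 4-character string for user 1, and `short_code(2, 56)` will almost certainly return a different one.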

There are many ways to "de-anonymize" information. It would help if you could be more specific about the context and what "assets" you are really trying to protect here, against who. See our faq.
E.g., might one user know the number of another user? They could probably find it out quickly if they discovered through other means the correspondence between 1AF3 and C2FA.
But specifically for your narrower question, a good hash will already be well-mixed, so I'd think you could just truncate, e.g., a SHA-256 hash value. But Thomas will probably know the definitive answer there.

Here are my thoughts getting to the point of it (I figure if you talked out your question, I'll talk out my answer. I'm guessing you'll find that helpful):
All hail Thomas, because he has clearly established his dominance.
0-9, A-F is a representation of the data. You can make it A-Z, 0-9, exclude some uncommon letters, and represent six bits per character.
You can basically say that all hashes have collisions. If you approach saturation, you'll end up with two people who have the same hash. Hashes are also one-way. You would need a mapping that allows reversal. If you have a reverse mapping, why not fill it with random strings which don't collide?
You are obfuscating a limited set of data. With a large and secret salt, you can prevent reversal. That said, you're trading one ID for another. The ID is still unique and constant, so I wonder how this enhances security.
I have some clients where if I were to see something like this, I'd put money that the employee ID was a SSN. I hope you're not doing that.
Employee ID and Employee Alternate ID are what you are coming up with. Since they have to be reversible to you but not the public, you need to store that in a two way pairing and keep it secret. Since there's risk of collision with a hash and you have to have a reverse map anyway, the alternate id might as well be a random string. An ID should be arbitrary anyway, and I would really like to know the perceived security benefit of your approach with two ids for one employee; it makes me think of Mission Impossible and the NOC list.
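A minimal sketch of that "alternate ID" idea, with one random alias per (user, employee) pair so that it also covers the per-user requirement from the question. The in-memory dictionaries stand in for a secret database table, and the class name `AliasTable` and the 6-character length are invented for the example.

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits


class AliasTable:
    """Two-way map between (user_id, employee_id) pairs and random aliases."""

    def __init__(self, length: int = 6):
        self.length = length
        self._by_pair = {}
        self._by_alias = {}

    def alias_for(self, user_id: int, employee_id: int) -> str:
        """Return the stored alias for this pair, creating one on first use."""
        pair = (user_id, employee_id)
        if pair not in self._by_pair:
            while True:
                alias = "".join(secrets.choice(ALPHABET) for _ in range(self.length))
                if alias not in self._by_alias:  # retry on the rare collision
                    break
            self._by_pair[pair] = alias
            self._by_alias[alias] = pair
        return self._by_pair[pair]

    def resolve(self, alias: str):
        """Reverse lookup; returns None for an unknown alias."""
        return self._by_alias.get(alias)
```

Because the aliases are random rather than computed, there is nothing to reverse: everything depends on keeping the table itself secret.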

Just an idea for an approach based on the extra information you have added. The security of this idea is very, very light and I would not recommend it if you think people are going to attempt to crack it, but it's worth throwing in the pot.
You could create a personal hash by bit-shifting the employee Id based on your own employee Id. Then by adding whatever extra obfuscation code you need to the resulting number, such as converting it to hex. E.g.
string hashedEmployeeId = (employeeIdToHash << myEmployeeId).ToString("X");
This will generate hashed employee Ids based on your own Id, but you may run into problems when the employee Ids get large (especially your own!)
Just to reiterate, this on its own isn't really very secure but it might help you on your way.

Using 4 characters you would have a total of: 36^4 = 1679616.
You could permute all possibilities of employees together.
If you calculate the square root you get 1296.
You could then generate an ordered table with all the possibilities in the first column and then randomly distribute ids from 1 to 1296 into the other columns. You would get something like this:
key a b
AAAA 386 67
AAAB 86 945
...
With this solution you would have a lookup table scalable up to 1296 employees. However, if you consider adding an extra character to your key you would get a lot more possibilities: (36^5)^0.5 = 7776.
With this solution, guessing a key would give you one chance in 1296 or 7776 of seeing information about an employee.
Performance may be an issue, but I think you can manage this using a cache, or maybe even by keeping all the data loaded in memory and using a kind of tree map to find the corresponding key for two given ids.

Related

What are some common examples of a Hash Table?

I was just wondering if there were some "standard" examples that everyone uses as a basis for explaining the nature of problem that requires a Hash table. What are some well-known problems in the real world that can see great benefits from using a Hash table?
*EDIT: also, a little background or explanation as to why the problem's nature benefits with a Hash Table would be of help! Thanks
A real world example: Suppose I stay in a hotel for a few days, because I attend a congress on hashing. At the end of the day, when I return to the hotel, I ask the desk clerk if there are any messages for me. Behind his back is a dovecot-like cupboard, with 26 entries, labeled A to Z. Because he knows my last name, he goes to the slot labeled W, and takes out three letters. One is for Robby Williams, one is for Jimmy Webb, and one is for me.
The clerk only had to inspect three letters. How many letters would he have to inspect if there would have been only one letter box?
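The cupboard can be sketched in a few lines of Python. The first letter of the last name plays the role of the hash function; the class and method names, and the guest name "Wilson", are made up for the example.

```python
from collections import defaultdict


def slot_for(last_name: str) -> str:
    """The 'hash function': map a name to one of 26 slots by its first letter."""
    return last_name[0].upper()


class PigeonholeCupboard:
    def __init__(self):
        self._slots = defaultdict(list)

    def deliver(self, last_name: str, letter: str) -> None:
        """Drop a letter into the slot chosen by the hash function."""
        self._slots[slot_for(last_name)].append((last_name, letter))

    def letters_to_inspect(self, last_name: str) -> list:
        """Everything in the guest's slot: the only letters the clerk looks at."""
        return list(self._slots[slot_for(last_name)])

    def collect(self, last_name: str) -> list:
        """The letters in that slot that are actually addressed to the guest."""
        return [letter for name, letter in self.letters_to_inspect(last_name)
                if name == last_name]
```

With letters delivered for Williams, Webb, Wilson, Adams and Baker, `letters_to_inspect("Wilson")` returns three entries instead of all five, which is the saving the clerk gets from the slots.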
When I want a user record in memory searchable by ID.
An alternative will be a list. But every time I would have to loop to find the User. Hash table will give me a user object in just one call.
When you go ice skating and you swap your shoes for ice skates. They take your shoes, put them in the ice skates box that includes your size, and give you the ice skates and a token which has the size (hash) and the shoe pair number (element in the hash box).
A cache, wherein if new data comes in we overwrite the existing record using the key. So basically the cache would be used to store the most recent state.
Anytime you have a key (or attribute)-value list, hash tables (AKA: associative arrays) should spring to your mind:
foo['bar']="baz";
surname['joe']="shmoe";
Hashtables generalize the concept of 1D arrays (where keys are sequential integers, and the hash function is the identity) to the case where key values can be anything and the hash function is... well, these days it is something you do not get to see often, as most languages will hide the gory details of hashing from your eyes with syntax similar to the one above.

Best Practices / Patterns for Enterprise Protection/Remediation of SSNs (Social Security Numbers)

I am interested in hearing about enterprise solutions for SSN handling. (I looked pretty hard for any pre-existing post on SO, including reviewing the terrific SO automated "Related Questions" list, and did not find anything, so hopefully this is not a repeat.)
First, I think it is important to enumerate the reasons systems/databases use SSNs: (note—these are reasons for de facto current state—I understand that many of them are not good reasons)
Required for Interaction with External Entities. This is the most valid case—where external entities your system interfaces with require an SSN. This would typically be government, tax and financial.
SSN is used to ensure system-wide uniqueness.
SSN has become the default foreign key used internally within the enterprise, to perform cross-system joins.
SSN is used for user authentication (e.g., log-on)
The enterprise solution that seems optimum to me is to create a single SSN repository that is accessed by all applications needing to look up SSN info. This repository substitutes a globally unique, random 9-digit number (ASN) for the true SSN. I see many benefits to this approach. First of all, it is obviously highly backwards-compatible—all your systems "just" have to go through a major, synchronized, one-time data-cleansing exercise, where they replace the real SSN with the alternate ASN. Also, it is centralized, so it minimizes the scope for inspection and compliance. (Obviously, as a negative, it also creates a single point of failure.)
This approach would solve issues 2 and 3, without ever requiring lookups to get the real SSN.
For issue #1, authorized systems could provide an ASN, and be returned the real SSN. This would of course be done over secure connections, and the requesting systems would never persist the full SSN. Also, if the requesting system only needs the last 4 digits of the SSN, then that is all that would ever be passed.
Issue #4 could be handled the same way as issue #1, though obviously the best thing would be to move away from having users supply an SSN for log-on.
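To make the repository idea a bit more concrete, here is a toy in-memory version in Python. Everything about it is an assumption for illustration: the class name `SsnVault`, the method names, and the boolean `caller_is_authorized` flag standing in for whatever real authentication the repository would use. A real one would sit behind a secured service and an encrypted store.

```python
import secrets


class SsnVault:
    """In-memory stand-in for the central SSN repository."""

    def __init__(self):
        self._ssn_by_asn = {}
        self._asn_by_ssn = {}

    def asn_for(self, ssn: str) -> str:
        """Return the existing ASN for this SSN, or issue a new random 9-digit one."""
        if ssn not in self._asn_by_ssn:
            while True:
                asn = f"{secrets.randbelow(10 ** 9):09d}"
                if asn not in self._ssn_by_asn:  # keep ASNs globally unique
                    break
            self._asn_by_ssn[ssn] = asn
            self._ssn_by_asn[asn] = ssn
        return self._asn_by_ssn[ssn]

    def last_four(self, asn: str) -> str:
        """For systems that only ever need the last 4 digits."""
        return self._ssn_by_asn[asn][-4:]

    def full_ssn(self, asn: str, caller_is_authorized: bool) -> str:
        """Issue #1: only authorized systems get the real SSN back."""
        if not caller_is_authorized:
            raise PermissionError("caller may not see the full SSN")
        return self._ssn_by_asn[asn]
```

Downstream systems would store and join on the ASN only, which is what lets issues 2 and 3 be handled without any lookup of the real number.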
There are a couple of papers on this:
UC Berkeley
Oracle Vault
I have found a trove of great information at the Securosis site/blog. In particular, this white paper does a great job of summarizing, comparing and contrasting database encryption and tokenization. It is more focused on the credit card (PCI) industry, but it is also helpful for my SSN purpose.
It should be noted that SSNs are PII, but are not private. SSNs are public information that can be easily acquired from numerous sources, even online. That said, if SSNs are the basis of your DB primary key you have a severe security problem in your logic. If this problem is evident at a large enterprise then I would stop what you are doing and recommend a massive data migration RIGHT NOW.
As far as protection goes, SSNs are PII that is both unique and small in payload, so I would protect that form of data no differently than a password for one-time authentication. The last four digits of an SSN are frequently used for verification or non-unique identification, as they are highly unique when coupled with another data attribute and are not PII on their own. That said, the last four digits of an SSN can be replicated in your DB for open alternative use.
I have come across a company, Voltage, that supplies a product which performs "format preserving encryption" (FPE). This substitutes an arbitrary, reversibly encrypted 9-digit number for the real SSN (in the example of SSN). Just in the early stages of looking into their technical marketing collateral...

Password complexity strategies - any evidence for them?

On more than one occasion I've been asked to implement rules for password selection for software I'm developing. Typical suggestions include things like:
Passwords must be at least N characters long;
Passwords must include lowercase, uppercase and numbers;
No reuse of the last M passwords (or passwords used within P days).
And so on.
Something has always bugged me about putting any restrictions on passwords though - by restricting the available passwords, you reduce the size of the space of all allowable passwords. Doesn't this make passwords easier to guess?
Equally, by making users create complex, frequently-changing passwords, the temptation to write them down increases, also reducing security.
Is there any quantitative evidence that password restriction rules make systems more secure?
If there is, what are the 'most secure' password restriction strategies to use?
Edit Ólafur Waage has kindly pointed out a Coding Horror article on dictionary attacks which has a lot of useful analysis in it, but it strikes me that dictionary attacks can be massively reduced (as Jeff suggests) by simply adding a delay following a failed authentication attempt.
With this in mind, what evidence is there that forced-complex passwords are more secure?
Something has always bugged me about putting any restrictions on passwords though - by restricting the available passwords, you reduce the size of the space of all allowable passwords. Doesn't this make passwords easier to guess?
In theory, yes. In practice, the "weak" passwords you disallow represent a tiny subset of all possible passwords that is disproportionately often chosen when there are no restrictions, and which attackers know to attack first.
Equally, by making users create complex, frequently-changing passwords, the temptation to write them down increases, also reducing security.
Correct. Forcing users to change passwords every month is a very, very bad idea, except perhaps in extreme high-security environments where everyone really understands the need for security.
Those kinds of rules definitely help because they stop stupid users from using passwords like "mypassword", which unfortunately happens quite often.
So actually, you are forcing the users into an extremely large set of potential passwords. It doesn't matter that you are excluding the set of all passwords with only lowercase letters, because the remaining set is still orders of magnitude larger.
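A quick way to check that claim is to count. The sketch below assumes 8-character passwords over a-z, A-Z and 0-9 only (no special characters) and a rule demanding at least one lowercase letter, one uppercase letter and one digit; the function names are made up for the example.

```python
def count_strings(alphabet_size: int, length: int) -> int:
    """Number of strings of the given length over an alphabet of the given size."""
    return alphabet_size ** length


def count_with_all_three_classes(length: int) -> int:
    """Strings over a-z, A-Z, 0-9 with at least one of each class (inclusion-exclusion)."""
    total = count_strings(62, length)
    # Missing one class: no uppercase (36), no lowercase (36), no digits (52).
    missing_one = 2 * count_strings(36, length) + count_strings(52, length)
    # Missing two classes: only lowercase (26), only uppercase (26), only digits (10).
    missing_two = 2 * count_strings(26, length) + count_strings(10, length)
    return total - missing_one + missing_two


LENGTH = 8
lowercase_only = count_strings(26, LENGTH)
compliant = count_with_all_three_classes(LENGTH)
print(f"all-lowercase passwords: {lowercase_only:.3e}")
print(f"compliant passwords:     {compliant:.3e}")
print(f"ratio:                   {compliant / lowercase_only:.0f}x")
```

Under these assumptions the rule still leaves well over two thirds of all 62^8 strings available, and that set is several hundred times larger than the all-lowercase set it excludes.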
BUT my big pet peeves are password restrictions I've encountered on major sites, like
No special characters
Maximum length
Why would anyone do this? W.H.Y.????
A nice read up on this is Jeff's article on Dictionary Attacks.
1. Never prevent the user from doing what they really want, unless there is a technical limitation from doing so.
2. You may nag the hell out of the user for doing stupid things like using a dictionary word or a 3-character password, or only using numbers, but see #1 above.
3. There is no good technical reason to require only alphanumerics, or at least one capital letter, or at least one number; see #1 above.
I forget which website had this advice regarding passwords: "Pick a password that is very easy for you to remember, but very hard for someone else to guess." But then they proceeded to require at least one capital letter and one number.
The problem with passwords is that they are so ubiquitous that it is essentially impossible for any person without a photographic memory to actually remember them without writing them down, and therefore leaving a serious security hole should someone gain access to this list of written-down passwords.
The only way I am able to manage this for myself is to split most of my passwords -- and I just checked my list, I'm up to 130 so far! -- into two parts, one which is the same in all cases, and the other which is unique but simple. (I break this rule for sites requiring high-security like bank accounts.)
The problem with requiring "complexity", defined as multiple types of characters all present, is that it forces people into a disparate set of conventions for different sites, which makes it harder to remember the password in question.
The only reason I will acknowledge for sites limiting the set of allowable password characters is that the password needs to be typeable on a keyboard. If you have to assume the account needs to be accessed from multiple countries, then keyboards may not always support the same characters as the user's home keyboard.
One of these days I'll have to make a blog posting on the subject. :(
My old limit theorem:
As the security of the password approaches adequate, the probability that it will be on a sticky note attached to the computer or monitor approaches one.
One also might point out the recent fiasco over at Twitter, where one of their admins' passwords turned out to be "happiness", which fell to a dictionary attack.
For questions like this, I ask myself what Bruce Schneier would do - the linked article is about how to choose passwords which are hard to guess with typical attacks.
Also note that if you add a delay after a failed attempt, you might also want to add a delay after a successful attempt, otherwise the delay is simply a signal that the attack has failed and another attempt should be launched.
Whilst this does not directly answer your question, I personally find the most aggravating rule I have encountered to be one whereby you could not reuse any password previously used. After working at the same place for a number of years, and having to change your password every 2/3 months, the ability to use a password I chose over a year ago would not seem to be particularly unsafe or insecure. If I have used "safe" passwords in the past (alphanumeric with changes in case), surely reusing them after a period of say a year or 2 (depending on how regularly you have to change your password) would seem to be acceptable to me. It also means I am less likely to use "easier" passwords, which might happen if I can't think of anything easy to remember and difficult to guess!
First let me say that details such as minimum length, case sensitivity and required special characters should depend on who has access and what the password allows them to do. If it's a code to launch a nuclear missile, it should be more strict than a password to log in to play your paid online edition of Angry Birds.
But I've got a SPECIFIC beef with case sensitivity.
For starters, users hate it. The human brain thinks "A=a". Of course, developers' brains aren't usually typical. ;-) But developers are also inconvenienced by case sensitivity.
Second, the CapsLock key is too easy to hit by mistake. It's right between Tab and Shift keys, but it SHOULD be up above the Esc key. Its location was established long ago in the days of typewriters, which had no alternate font available. In those days it was useful to have it there.
All passwords have risk... You're balancing risk with ease-of-use, and yes, usability matters.
MY ARGUMENT:
Yes, case sensitivity is more secure for a given password length. But unless someone is making me do otherwise, I opt for a longer minimum password length. Even if we assume only letters and digits are allowed, each added character multiplies the number of possible passwords by 36.
Someone who's less lazy than me with math could tell you the difference in number of combinations between, say a minimum 8-character case-sensitive password, and a 12-character case-insensitive password. I think most users would prefer the latter.
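For what it's worth, the math is short. This assumes letters and digits only, i.e. 62 symbols when case matters and 36 when it does not, and compares exactly 8 characters against exactly 12.

```python
def password_space(alphabet_size: int, length: int) -> int:
    """Number of possible passwords of exactly this length."""
    return alphabet_size ** length


case_sensitive_8 = password_space(62, 8)      # a-z, A-Z, 0-9
case_insensitive_12 = password_space(36, 12)  # a-z, 0-9
ratio = case_insensitive_12 / case_sensitive_8

print(f"8 chars, case-sensitive:    {case_sensitive_8:.3e}")
print(f"12 chars, case-insensitive: {case_insensitive_12:.3e}")
print(f"the longer one is about {ratio:,.0f} times larger")
```

So the 12-character case-insensitive space comes out on the order of 2 × 10^4 times larger than the 8-character case-sensitive one.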
Also, not all apps expose usernames to others, so there are potentially two fields the hacker may have to find.
I also prefer to allow spaces in passwords as long as the majority of the password isn't spaces.
In the project I'm developing now, my management screen allows the administrator to change password requirements, which apply to all future passwords. He can also force all users to update passwords (to new requirements) at any time after next logon. I do this because I feel my stuff doesn't need case-sensitivity, but the administrator (who probably paid me for the software) may disagree so I let that person decide.
The PIN for my bank card is only four digits. Since it's only numbers it's not case sensitive. And heck, it's my MONEY! If you consider nothing else, this sounds pretty insecure, were it not for the fact that the hacker has to steal my card to get my money. (And have his photo taken.)
One other beef: Developers who come onto StackOverflow and regurgitate hard-and-fast rules that they read in an article somewhere. "Never hard code anything." (As if that's possible.) "All queries must be parameterized" (not if the user doesn't contribute to the query.) etc.
Please excuse the rant. ;-) I promise I respect disagreement.
Personally, for this particular problem I tend to give passwords a 'score' based on characteristics of the entered text, and refuse passwords that don't meet the score.
For example:
Contains Lower Case Letter +1
Contains different Lower Case Letter +1
Contains Upper Case Letter +1
Contains different Upper Case Letter +1
Contains Non-Alphanumeric character: +1
Contains different Non-Alphanumeric character: +1
Contains Number: +1
Contains Non Consecutive or repeated Second Number: +1
Length less than 8: -10
Length Greater than 12: +1
Contains Dictionary word: -4
Then only allow passwords with a score greater than 4 (and give the user feedback as they create their password via JavaScript).
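One possible reading of that scoring table, written out in Python. Several things are guesses: "different" is taken to mean a second distinct character of the same class, the "non consecutive or repeated second number" rule is read as "two digits whose values differ by more than 1", and the three-word `DICTIONARY` is a placeholder for a real word list.

```python
import string

DICTIONARY = {"password", "happiness", "letmein"}  # hypothetical tiny word list


def _class_points(chars) -> int:
    """+1 for one character of a class, +1 more for a second distinct one."""
    return min(len(set(chars)), 2)


def password_score(password: str) -> int:
    lowers = [c for c in password if c in string.ascii_lowercase]
    uppers = [c for c in password if c in string.ascii_uppercase]
    digits = [int(c) for c in password if c in string.digits]
    others = [c for c in password if not c.isalnum()]

    score = _class_points(lowers) + _class_points(uppers) + _class_points(others)
    if digits:
        score += 1
        if max(digits) - min(digits) > 1:  # a second, non-adjacent digit value
            score += 1
    if len(password) < 8:
        score -= 10
    if len(password) > 12:
        score += 1
    lowered = password.lower()
    if any(word in lowered for word in DICTIONARY):
        score -= 4
    return score


def is_acceptable(password: str) -> bool:
    """Accept only scores greater than 4, as in the rule above."""
    return password_score(password) > 4
```

With these rules "mypassword" scores -2 and is rejected, while something like "Tr1ck-y7Pass!" scores 9 and passes.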

Pin Generation

I am looking to develop a system in which I need to assign every user a unique PIN code for security. The user will only enter this PIN code as a means of identifying himself. Thus I don't want the user to be able to guess another user's PIN code. Assuming the maximum number of users I will have is 100000, how long should this PIN code be?
e.g. 1234 4532 3423
Should I generate this code via some sort of algorithm? Or should I randomly generate it?
Basically I don't want people to be able to guess other people's PIN codes, and it should support enough users.
I am sorry if my question sounds a bit confusing, but I would gladly clarify any doubts.
Thank you very much.
UPDATE
After reading all the posts below, I would like to add some more detail.
What I am trying to achieve is something very similar to a scratch card.
A user is given a card, which he/she must scratch to find the PIN code.
Now using this PIN code the user must be able to access my system.
I cannot add extra security (e.g. username and password), as then it will deter the user from using the scratch card. I want to make it as difficult as possible to guess the PIN code within the limitations.
Thank you all for your amazing replies again.
4 random digits should be plenty if you append it to unique known userid (could still be number) [as recommended by starblue]
Pseudo random number generator should also be fine. You can store these in the DB using reversible encryption (AES) or one-way hashing
The main concern you have is how many times a person can incorrectly input the pin before they are locked out. This should be low, say around three...This will stop people guessing other peoples numbers.
Any longer than 6 digits and people will be forgetting them, or worse, writing them on a post-it note on their monitor.
Assuming an account locks with 3 incorrect attempts, then having a 4 digit pin plus a user ID component UserId (999999) + Pin (1234) gives you a 3/10000 chance of someone guessing. Is this acceptable? If not make the pin length 5 and get 3/100000
May I suggest an alternative approach? Take a look at Perfect Paper Passwords, and the derivatives it prompted.
You could use this "as is" to generate one-time PINs, or simply to generate a single PIN per user.
Bear in mind, too, that duplicate PINs are not of themselves an issue: any attack would then simply have to try multiple user-ids.
(Mileage warning: I am definitely not a security expert.)
Here's a second answer: from re-reading, I assume you don't want a user-id as such - you're just validating a set of issued scratch cards. I also assume you don't want to use alphabetic PINs.
You need to choose a PIN length such that the probability of guessing a valid PIN is less than 1/(The number of attempts you can protect against). So, for example, if you have 1 million valid PINs, and you want to protect against 10000 guesses, you'll need a 10-digit PIN.
If you use John Graham-Cumming's version of the Perfect Paper Passwords system, you can:
Configure this for (say) 10-digit decimal pins
Choose a secret IV/key phrase
Generate (say) the first million passwords(/PINs)
I suspect this is a generic procedure that could, for example, be used to generate 25-alphanumeric product ids, too.
Sorry for doing it by successive approximation; I hope that comes a bit nearer to what you're looking for.
If we assume 100,000 users maximum then they can have unique PINs with 0-99,999 ie. 5 digits.
However, this would make it easier to guess the PINs with the maximum number of users.
If you can restrict the number of attempts on the PIN then you can have a shorter PIN.
eg. maximum of 10 failed attempts per IP per day.
It also depends on the value of what you are protecting and how catastrophic it would be if the odd one did get out.
I'd go for 9 digits if you want to keep it short or 12 digits if you want a bit more security from automated guessing.
To generate the PINs, I would take a high resolution version of the time along with some salt and maybe a pseudo-random number, generate a hash and use the first 9 or 12 digits. Make sure there is a reasonable and random delay between new PIN generations so don't generate them in a loop, and if possible make them user initiated.
eg. Left(Sha1(DateTime + Salt + PseudoRandom),9)
Lots of great answers so far: simple, effective, and elegant!
I'm guessing the application is somewhat lottery-like, in that each user gets a scratch card and uses it to ask your application if "he's already won!" So, from that perspective, a few new issues come to mind:
War-dialing, or its Internet equivalent: Can a rogue user hit your app repeatedly, say guessing every 10-digit number in succession? If that's a possibility, consider limiting the number of attempts from a particular location. An effective way might be simply to refuse to answer more than, say, one attempt every 5 seconds from the same IP address. This makes machine-driven attacks inefficient and avoids the lockout problem.
Lockout problem: If you lock an account permanently after any number of failed attempts, you're prone to denial of service attacks. The attacker above could effectively lock out every user unless you reactivate the accounts after a period of time. But this is a problem only if your PINs consist of an obvious concatenation of User ID + Key, because an attacker could try every key for a given User ID. That technique also reduces your key space drastically because only a few of the PIN digits are truly random. On the other hand, if the PIN is simply a sequence of random digits, lockout need only be applied to the source IP address. (If an attempt fails, no valid account is affected, so what would you "lock"?)
Data storage: if you really are building some sort of lottery-like system you only need to store the winning PINs! When a user enters a PIN, you can search a relatively small list of PINs/prizes (or your equivalent). You can treat "losing" and invalid PINs identically with a "Sorry, better luck next time" message or a "default" prize if the economics are right.
Good luck!
The question should be, "how many guesses are necessary on average to find a valid PIN code, compared with how many guesses attackers are making?"
If you generate 100 000 5-digit codes, then obviously it takes 1 guess. This is unlikely to be good enough.
If you generate 100 000 n-digit codes, then it takes 10^(n-5) guesses on average. To work out whether this is good enough, you need to consider how your system responds to a wrong guess.
If an attacker (or, all attackers combined) can make 1000 guesses per second, then clearly n has to be pretty large to stop a determined attacker. If you permanently lock out their IP address after 3 incorrect guesses, then since a given attacker is unlikely to have access to more than, say, 1000 IP addresses, n=9 would be sufficient to thwart almost all attackers. Obviously if you will face distributed attacks, or attacks from a botnet, then 1000 IP addresses per attacker is no longer a safe assumption.
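The arithmetic behind those figures, as a small Python sketch. With 100 000 codes issued out of the 10^n possible n-digit strings, roughly one guess in 10^(n-5) hits a valid code; the function names here are invented for the example.

```python
def average_guesses_per_hit(pin_digits: int, issued_codes: int) -> float:
    """Possible PINs divided by valid PINs: guesses needed per hit, on average."""
    return 10 ** pin_digits / issued_codes


def attacker_budget(ip_addresses: int, guesses_per_ip: int) -> int:
    """Total guesses available if each IP is locked out after a fixed number of tries."""
    return ip_addresses * guesses_per_ip


needed = average_guesses_per_hit(pin_digits=9, issued_codes=100_000)
budget = attacker_budget(ip_addresses=1000, guesses_per_ip=3)
print(f"average guesses per valid PIN: {needed:,.0f}")
print(f"guesses available to attacker: {budget:,}")
```

For n=9 this gives 10 000 guesses per hit on average against a budget of 3 000, which is where the n=9 figure above comes from; a botnet with more addresses changes the budget side directly.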
If in future you need to issue further codes (more than 100 000), then obviously you make it easier to guess a valid code. So it's probably worth spending some time now making sure of your future scaling needs before fixing on a size.
Given your scratch-card use case, if users are going to use the system for a long time, I would recommend allowing them (or forcing them) to "upgrade" their PIN code to a username and password of their choice after the first use of the system. Then you gain the usual advantages of username/password, without discarding the ease of first use of just typing the number off the card.
As for how to generate the number: presumably you'll store each one you generate, in which case I'd say generate them randomly and discard duplicates. If you generate them using any kind of algorithm, and someone figures out the algorithm, then they can figure out valid PIN codes. If you select an algorithm such that it's not possible for someone to figure it out, then it is almost a pseudo-random number generator already (the other property of PRNGs being that their output is evenly distributed, which helps here too since it makes codes harder to guess), in which case you might as well just generate them randomly.
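A minimal sketch of "generate randomly and discard duplicates", assuming Python and its standard `secrets` module as the random source. The function name and parameters are illustrative, not part of any existing system.

```python
import secrets


def generate_unique_codes(count: int, num_digits: int) -> set[str]:
    """Return `count` distinct random codes, each `num_digits` digits long."""
    if count > 10 ** num_digits:
        raise ValueError("not enough distinct codes of this length")
    codes: set[str] = set()
    while len(codes) < count:
        # randbelow() draws from the OS cryptographic RNG; zero-padding keeps
        # every code the same length. Adding to a set silently drops duplicates.
        codes.add(f"{secrets.randbelow(10 ** num_digits):0{num_digits}d}")
    return codes


if __name__ == "__main__":
    for code in sorted(generate_unique_codes(5, 9)):
        print(code)
```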
If you generate the PIN as an integer with a random number generator, you will never get a PIN like "00038384882" that starts with zeros, because an integer is never written with leading zeros; every PIN would start with a digit from 1 to 9.
I have seen many PINs that begin with one or more zeros, so by generating integers you eliminate a large part of the number space (you would have to count the permutations to work out exactly how many).
I think you should put the digits 0-9 in a hash (or array), pick from it at random once per position, and build the PIN as a string.
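A small Python sketch of the difference, assuming the standard `secrets` module; `random_pin` is a made-up helper name. Treating the PIN as text keeps leading zeros, while converting it to an integer loses them.

```python
import secrets

DIGITS = "0123456789"


def random_pin(num_digits: int) -> str:
    """Build the PIN one character at a time so a leading 0 is as likely as any digit."""
    return "".join(secrets.choice(DIGITS) for _ in range(num_digits))


if __name__ == "__main__":
    print(int("00038384882"))  # the integer form drops the leading zeros
    print(random_pin(11))      # the string form always has 11 characters
```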
If you want to generate scratch-card type pin codes, then you must use large numbers, about 13 digits long; and also, they must be similar to credit card numbers, having a checksum or verification digit embedded in the number itself. You must have an algorithm to generate a pin based on some initial data, which can be a sequence of numbers. The resulting pin must be unique for each number in the sequence, so that if you generate 100,000 pin codes they must all be different.
This way you will be able to validate a number not only by checking it against a database but you can verify it first.
I once wrote something for that purpose, I can't give you the code but the general idea is this:
Prepare a space of 12 digits
Format the number as five digits (00000 to 99999) and spread it along the space in a certain way. For example, the number 12345 can be spread as __3_5_2_4__1. You can vary the way you spread the number depending on whether it's an even or odd number, or a multiple of 3, etc.
Based on the value of certain digits, generate more digits (for example, if the third digit is even, then create an odd number and put it in the first open space, otherwise create an even number and put it in the second open space, e.g. _83_5_2_4__1).
Once you have generated 6 digits, you will have only one open space. You should always leave the same open space (for example the next-to-last space). You will place the verification digit in that place.
To generate the verification digit you must perform some arithmetic operations on the number you have generated, for example adding all the digits in the odd positions and multiplying them by some other number, then subtracting all the digits in the even positions, and finally adding all the digits together (you must vary the algorithm a little based on the value of certain digits). In the end you have a verification digit which you include in the generated pin code.
So now you can validate your generated pin codes. For a given pin code, you generate the verification digit and check it against the one included in the pin. If it's OK then you can extract the original number by performing the reverse operations.
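The author's own arithmetic is not given, so the sketch below swaps in the well-known Luhn checksum (the one used on credit card numbers) purely to illustrate the verification-digit idea. It appends the check digit at the end rather than in the next-to-last position, and it does not do the digit spreading described above. Function names are illustrative.

```python
def luhn_check_digit(payload: str) -> str:
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk right to left, doubling every second digit starting with the rightmost.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def add_check_digit(payload: str) -> str:
    """Return the payload with its check digit appended."""
    return payload + luhn_check_digit(payload)


def is_valid(pin: str) -> bool:
    """Verify a PIN offline, before any database lookup."""
    return len(pin) > 1 and pin.isdigit() and luhn_check_digit(pin[:-1]) == pin[-1]


if __name__ == "__main__":
    pin = add_check_digit("7992739871")
    print(pin, is_valid(pin))
```

A plain checksum like this only catches typos and lets you reject most random guesses cheaply; it does not make valid codes secret, so the database check is still needed.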
It doesn't sound so good because it looks like security through obscurity but it's the only way you can use this. It's not impossible for someone to guess a pin code but being a 12-digit code with a verification digit, it will be very hard since you have to try 1,000,000,000,000 combinations and you just have 100,000 valid pin codes, so for every valid pin code there are 10,000,000 invalid ones.
I should mention that this is useful for disposable pin codes; a person uses one of these codes only once, for example to charge a prepaid phone. It's not a good idea to use these pins as authentication tokens, especially if it's the only way to authenticate someone (you should never EVER authenticate someone only through a single piece of data; the very minimum is username+password)
It seems you want to use the pin code as the sole means of identification for users.
A workable solution would be to use the first five digits to identify the user,
and append four digits as a PIN code.
If you don't want to store PINs they can be computed by applying a cryptographically secure hash (SHA1 or better)
to the user number plus a system-wide secret code.
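A hedged Python sketch of computing the PIN instead of storing it. It uses HMAC-SHA256 keyed with the system-wide secret rather than hashing a plain concatenation, which is a substitution on my part; the function name, the secret value and the 4-digit default are all illustrative.

```python
import hashlib
import hmac


def derive_pin(user_number: str, system_secret: bytes, digits: int = 4) -> str:
    """Derive a fixed-length numeric PIN from the user number and a secret key."""
    mac = hmac.new(system_secret, user_number.encode("ascii"), hashlib.sha256).digest()
    # Reduce the 256-bit MAC to the requested number of decimal digits.
    value = int.from_bytes(mac, "big") % 10 ** digits
    return f"{value:0{digits}d}"


if __name__ == "__main__":
    secret = b"replace-with-a-long-random-secret"  # placeholder, not a real key
    user = "12345"
    print(user + derive_pin(user, secret))  # five-digit user number + four-digit PIN
```

Anyone who learns the secret can compute every PIN, so it has to be protected as carefully as a stored PIN table would be.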
Should I generate this code via some
sort of algorithm?
No. It will be predictable.
Or should I randomly generate it?
Yes. Use a cryptographic random generator, or let the user pick their own PIN.
In theory 4 digits will be plenty as ATM card issuers manage to support a very large community with just that (and obviously, they can't be and do not need to be unique). However in that case you should limit the number of attempts at entering the PIN and lock them out after that many attempts as the banks do. And you should also get the user to supply a user ID (in the ATM case, that's effectively on the card).
If you don't want to limit them in that way, it may be best to ditch the PIN idea and use a standard password (which is essentially what your PIN is, just with a very short length and limited character set). If you absolutely must restrict it to numerics (because you have a PIN pad or something) then consider making 4 a (configurable) minimum length rather than the fixed length.
You shouldn't store the PIN in the clear anywhere (salt and hash it like a password instead). However, given the short length and limited character set, it is always going to be vulnerable to a brute-force search by anyone who has an easy way to verify a guess.
There are various other schemes that can be used as well, if you can tell us more about your requirements (is this a web app? embedded system? etc).
There's a difference between guessing the PIN of a target user and guessing that of any valid user. From your use case, it seems that the PIN is used to gain access to a certain resource, and it is that resource that attackers may be after, not the identities of particular users. If that's indeed the case, you will need to make valid PINs sufficiently sparse among all possible numbers with the same number of digits.
As mentioned in some answers, you need to make your PIN sufficiently random, regardless of whether you generate it with an algorithm. The randomness is usually measured by the entropy of the PIN.
Now, let's say your PIN has N bits of entropy and there are 2^M users in your system (M < N). The probability that a random guess will yield a valid PIN is then 2^{M-N}. (Sorry for the LaTeX notation, I hope it's intuitive enough.) From there you can determine whether that probability is low enough given N and M, or compute the required N from the desired probability and M.
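The same calculation as a short Python sketch. The function names are made up, and it assumes valid PINs are spread uniformly over the PIN space.

```python
import math


def hit_probability(pin_entropy_bits: float, users: int) -> float:
    """Chance that one random guess is a valid PIN: users / 2^N."""
    return users / 2 ** pin_entropy_bits


def required_entropy_bits(users: int, target_probability: float) -> float:
    """Entropy N needed so that a random guess succeeds with the target probability."""
    return math.log2(users / target_probability)


if __name__ == "__main__":
    users = 100_000
    print(hit_probability(30, users))
    # A decimal digit carries log2(10), roughly 3.32, bits of entropy.
    bits = required_entropy_bits(users, 1e-6)
    print(bits, math.ceil(bits / math.log2(10)), "decimal digits")
```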
There are various ways to generate the PINs so that you won't have to remember every PIN you generated. But you will need a very long PIN to make it secure. This is probably not what you want.
I've done this before with PHP and a MySQL database. I had a permutations function that would first ensure that the number of required codes - $n, at length $l, with the number of characters, $c - was able to be created before starting the generation process.
Then I'd store each new code in the database and let it tell me, via UNIQUE KEY errors, that there was a collision (duplicate). I kept going until I had $n successfully created codes. You could of course do this in memory, but I wanted to keep the codes for use in an MS Word mail merge, so I then exported them as a CSV file.
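The original was PHP with MySQL and is not shown here, so the following is only a rough Python equivalent that uses the standard `sqlite3` module as a stand-in database. The table name, function name and default alphabet are assumptions for illustration.

```python
import secrets
import sqlite3
import string


def generate_codes(conn: sqlite3.Connection, n: int, length: int,
                   alphabet: str = string.ascii_uppercase + string.digits) -> int:
    """Insert n distinct random codes, letting the database reject duplicates."""
    # Feasibility check first: there are len(alphabet) ** length possible codes.
    if n > len(alphabet) ** length:
        raise ValueError("cannot create that many distinct codes at this length")
    conn.execute("CREATE TABLE IF NOT EXISTS codes (code TEXT PRIMARY KEY)")
    created = 0
    while created < n:
        code = "".join(secrets.choice(alphabet) for _ in range(length))
        try:
            conn.execute("INSERT INTO codes (code) VALUES (?)", (code,))
            created += 1
        except sqlite3.IntegrityError:
            # Primary-key violation means a duplicate: draw again.
            continue
    conn.commit()
    return created


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    generate_codes(db, 10, 8)
    for (code,) in db.execute("SELECT code FROM codes"):
        print(code)
```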

How easily can you guess a GUID that might be generated?

GUIDs get used a lot in creating session keys for web applications. I've always wondered about the safety of this practice. Since the GUID is generated based on information from the machine and the time, along with a few other factors, how hard is it to guess likely GUIDs that will come up in the future? Let's say you started 1000 or 10000 new sessions to get a good dataset of the GUIDs being generated. Would this make it any easier to generate a GUID that might be used for another session? You wouldn't even have to guess a specific GUID, just keep trying GUIDs that might be generated in a specific period of time.
Here is some stuff from Wikipedia (original source):
V1 GUIDs which contain a MAC address
and time can be identified by the
digit "1" in the first position of the
third group of digits, for example
{2f1e4fc0-81fd-11da-9156-00036a0f876a}.
In my understanding, they don't really hide it.
V4 GUIDs use the later algorithm,
which is a pseudo-random number. These
have a "4" in the same position, for
example
{38a52be4-9352-453e-af97-5c3b448652f0}.
More specifically, the 'data3' bit
pattern would be 0001xxxxxxxxxxxx in
the first case, and 0100xxxxxxxxxxxx
in the second. Cryptanalysis of the
WinAPI GUID generator shows that,
since the sequence of V4 GUIDs is
pseudo-random, given the initial state
one can predict up to next 250 000
GUIDs returned by the function
UuidCreate. This is why GUIDs
should not be used in cryptography,
e.g., as random keys.
GUIDs are guaranteed to be unique and that's about it. They are not guaranteed to be random or difficult to guess.
To answer your question: at least for the V1 GUID generation algorithm, if you know the algorithm, the MAC address and the time of creation, you could probably generate a set of GUIDs, one of which would be the one that was actually generated. And if it's a V1 GUID, the MAC address can be determined from sample GUIDs from the same machine.
Additional tidbit from Wikipedia:
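To see how much a V1 GUID gives away, here is a short Python sketch using the standard `uuid` module on the two example values quoted above. The helper name `describe` is made up; the epoch constant is the number of 100-nanosecond intervals between 1582-10-15 (the UUID epoch) and 1970-01-01.

```python
import uuid
from datetime import datetime, timezone

# 100 ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def describe(guid: str) -> dict:
    """Extract what a GUID reveals: its version and, for V1, the MAC and timestamp."""
    u = uuid.UUID(guid)
    info = {"version": u.version}
    if u.version == 1:
        info["mac"] = f"{u.node:012x}"
        seconds = (u.time - UUID_EPOCH_OFFSET) / 1e7
        info["created"] = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return info


if __name__ == "__main__":
    print(describe("{2f1e4fc0-81fd-11da-9156-00036a0f876a}"))
    print(describe("{38a52be4-9352-453e-af97-5c3b448652f0}"))
```

For the V1 example, the last group of digits comes back unchanged as the MAC address and the timestamp decodes to a creation time, which is exactly the information an attacker needs to narrow down neighbouring V1 GUIDs.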
The OSF-specified algorithm for
generating new GUIDs has been widely
criticized. In these (V1) GUIDs, the
user's network card MAC address is
used as a base for the last group of
GUID digits, which means, for example,
that a document can be tracked back to
the computer that created it. This
privacy hole was used when locating
the creator of the Melissa worm. Most
of the other digits are based on the
time while generating the GUID.
.NET web applications call Guid.NewGuid() to create a GUID, which in turn ends up calling the CoCreateGuid() COM function a couple of frames deeper in the stack.
From the MSDN Library:
The CoCreateGuid function calls the
RPC function UuidCreate, which creates
a GUID, a globally unique 128-bit
integer. Use the CoCreateGuid function
when you need an absolutely unique
number that you will use as a
persistent identifier in a distributed
environment. To a very high degree of
certainty, this function returns a
unique value – no other invocation, on
the same or any other system
(networked or not), should return the
same value.
And if you check the page on UuidCreate:
The UuidCreate function generates a
UUID that cannot be traced to the
ethernet/token ring address of the
computer on which it was generated. It
also cannot be associated with other
UUIDs created on the same computer.
The last sentence contains the answer to your question. So I would say it is pretty hard to guess unless there is a bug in Microsoft's implementation.
If someone kept hitting a server with a continuous stream of GUIDs it would be more of a denial of service attack than anything else.
The possibility of someone guessing a GUID is next to nil.
Depends. It is hard if the GUIDs are set up sensibly, e.g. using salted secure hashes and you have plenty of bits. It is weak if the GUIDs are short and obvious.
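If the goal is a session key that is hard to guess, one option is to skip GUIDs and draw the token straight from the operating system's cryptographic random source. A minimal Python sketch, assuming 32 random bytes counts as "plenty of bits" for your application; the function name is illustrative.

```python
import secrets


def new_session_token(num_bytes: int = 32) -> str:
    """Return a URL-safe token built from num_bytes of cryptographic randomness."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    print(new_session_token())
```

Unlike a GUID, such a token carries no version bits, timestamp or MAC address, so every bit of it is unpredictable to an observer.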
You may well want to take steps to stop someone from creating 10000 new sessions anyway, due to the server load this might create.
"GUIDs are guaranteed to be unique and that's about it". GUIDs are not guaranteed to be unique, at least not the ones generated by CoCreateGuid: "To a very high degree of certainty, this function returns a unique value – no other invocation, on the same or any other system (networked or not), should return the same value."

Resources