Related
I have always wondered how websites generate "share with others" links.
Some websites allow you to share a piece of data through a link, so that the people you send the link to can see the data or even edit it.
For example Google Drive, OneDrive, etc. They give you a (pretty short) link, but what guarantees that it is not possible for someone to find this link "by luck" and access my data?
Like, what if an attacker tried all the possible links, https://link.share.me/xxxxxxx, until he found some working ones?
Is there a certain length which almost guarantees that no one will find a link this way? For example, if a site generated 1000 links, and the endpoints are composed of 10 characters from [A-Za-z0-9] (~8e17 possibilities), do we just assume that it is secure enough? If yes, at what probability, or at what ratio between links and possibilities, do we consider this kind of system secure?
Is there a certain cryptographic or mathematical way of generating those links which assures us that a link cannot be found by anyone?
Thank you very much.
Probably the most important thing (besides entropy, which we will come back to in a second) is where you get your randomness from. For this purpose you should use a cryptographic pseudo-random number generator (crypto PRNG). (As a side note, you could also use real randomness, but a real random source is very hard to come by, and if you generate many links you will likely run out of available random bits, so a crypto PRNG is probably good enough for your purpose; few applications actually need real random numbers.) Most languages and/or frameworks have a facility for this: in Ruby it is SecureRandom, in Java it's java.security.SecureRandom, in Python it could be os.urandom, and so on.
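As an illustration (not tied to any particular site's implementation), here is a minimal Python sketch using the standard secrets module, which draws from the same OS facility as os.urandom; the endpoint name is just the one from the question:

    import secrets

    # 16 bytes of CSPRNG output ~ 128 bits of entropy, URL-safe base64 encoded.
    token = secrets.token_urlsafe(16)
    share_url = "https://link.share.me/" + token
    print(share_url)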
OK, so how long should it be? That somewhat depends on your other, non-security requirements as well; for example, sometimes these codes need to be easy to say over the phone, easy to type, or something similar. Apart from that, what you should consider is entropy. Your idea of counting the number of all possible codes is a great start; let's just say that the entropy in the code is log2 (base-2 logarithm) of that number. So for a case-sensitive, alphanumeric code that is 10 characters long, the entropy is log2((26+26+10)^10) = 59.5 bits. You can compute the entropy for any other length and character set the same way.
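A quick sketch of that calculation, so you can plug in your own character set and length:

    import math

    def entropy_bits(alphabet_size, length):
        # Entropy of a uniformly random code: length * log2(alphabet_size).
        return length * math.log2(alphabet_size)

    print(entropy_bits(62, 10))   # ~59.5 bits for 10 chars of [A-Za-z0-9]
    print(entropy_bits(62, 16))   # ~95.3 bits
    print(entropy_bits(16, 32))   # 128 bits for a 32-char hex code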
That might well be enough, what you should consider is your attacker. Will they be able to perform online attacks only (a lot slower), or offline too (can be very-very fast, especially with specialized hardware)? Also what is the impact if they find one, is it like financial data, or just a random funny picture, or the personal data of somebody, for which you are legally responsible in multiple jurisdictions (see GDPR in EU, or the California privacy laws)?
In general, you could say that 64 bits of entropy is probably good enough for many purposes, and 128 bits is a lot (except maybe for cryptographic keys and very high security applications). As the 59 bits above is, well, almost 64, for lower-security applications that could be a reasonable tradeoff for better usability.
So in short, there is no definitive answer, it depends on how you want to model this, and what security requirements you want to meet.
Two more things to consider are the validity of these codes, and how many will be issued (how dense will the space be).
I think the usual variables here are the character set for the code, and its length. Validity is more like a business requirement, and the density of codes will depend on your usage and also the length (which defines the size of your code space).
As an example, let's say you have 64 bits of entropy, you issued 10 million codes already, and your attacker can only perform online attacks by sending a request to your server, at a rate of say 100/second. These are likely huge overstatements towards the secure side.
That would mean there is a 0.17% chance that somebody could find a valid code in a year. But would your attacker put that much effort into finding one single (random) valid code? Whether that is acceptable depends on your specific case; only you can tell. If not, you can for example increase the length of the code.
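The arithmetic behind that 0.17% figure, using the numbers from the paragraph above:

    bits             = 64
    codes_issued     = 10_000_000            # valid codes in circulation
    guesses_per_sec  = 100                   # online guessing rate
    seconds_per_year = 365 * 24 * 3600

    space   = 2 ** bits
    guesses = guesses_per_sec * seconds_per_year
    p_hit   = guesses * codes_issued / space
    print(f"{p_hit:.2%}")                    # ~0.17%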
I do not use OneDrive, but I can say from Google Drive that:
The links are not that short. I have just counted one and its length is 32.
More than for security, they probably made the links long so that they do not run out of combinations, as thousands of Drive files are shared every day. For security, Drive allows you to choose the users that can access the file. If you select "Everyone", then you should be sure that you don't mind anyone seeing the content behind the link. Even if the link cannot be found "by chance", there is still the possibility that someone else obtains the link from your friend and then shares it, or that links end up cached by proxies. Long links should be just complementary to other security measures.
Answering your questions:
Links of any length can be found, but longer links require more time to find. If you use all alphanumeric characters, a length of 30 is probably enough, but as I said, length should not be the only security measure in your system.
Just make them random and long, and draw the characters from a wide range.
What are the most secure sources of entropy to seed a random number generator? This question is language and platform independent and applies to any machine on a network. Ideally I'm looking for sources available to a machine in a cloud environment or server provided by a hosting company.
There are two important weaknesses to keep in mind. The use of time for seeding a random number generator is a violation of CWE-337. The use of a small seed space would be a violation of CWE-339.
Here are a few thoughts. If you are impatient, skip to the conclusion, at the end.
1. What is a secure seed?
Security is defined only relative to an attack model. We want here a sequence of n bits that has n bits of entropy with regard to the attacker: in plain words, any of the possible 2^n values for that sequence is equally probable from the attacker's point of view.
This is a model which relates to the information available to the attacker. The application which generates and uses the seed (normally in a PRNG) knows the exact seed; whether the seed is "secure" is not an absolute property of the seed or even of the seed generation process. What matters is the amount of information that the attacker has about the generation process. This level of information varies widely depending on the situation; e.g. on a multi-user system (say Unix-like, with hardware-enforced separation of applications), precise timing of memory accesses can reveal information on how a nominally protected process reads memory. Even a remote attacker can obtain such information; this has been demonstrated (in lab conditions) on AES encryption (typical AES implementations use internal tables, with access patterns which depend on the key; the attacker forces cache misses and detects them through precise timing of responses of the server).
The seed lifetime must be taken into account. The seed is secure as long as it remains unknown to the attacker; this property must hold true afterwards. In particular, it shall not be possible to recover the seed from excerpts of the subsequent PRNG output. Ideally, even obtaining the complete PRNG state at some point should offer no clue as to whatever bits the PRNG produced beforehand.
The point I want to make here is that a seed is "secure" only if it is used in a context where it can remain secure, which more or less implies a cryptographically secure PRNG and some tamper-resistant storage. If such storage is available, then the most secure seed is the one that was generated once, a long time ago, and used in a secure PRNG hosted by tamper-resistant hardware.
Unfortunately, such hardware is expensive (it is called a HSM and costs a few hundreds or thousands of dollars), and that cost usually proves difficult to justify (a bad seed will not prevent a system from operating; this is the usual problem of untestability of security). Hence it is customary to go for "mostly software" solutions. Since software is not good at providing long-term confidential storage, the seed lifetime is arbitrarily shortened: a new seed is periodically obtained. In Fortuna, such reseeding is supposed to happen at least once every megabyte of generated pseudo-random data.
To sum up, in a setup without a HSM, a secure seed is one that can be obtained relatively readily (since we will do it quite often) using data that cannot be gathered by the attacker.
2. Mixing
Random data sources do not produce nice uniform bits (each bit having value 1 with probability exactly 0.5, and bit values independent of each other). Instead, random sources produce values in source-specific sets. These values can be encoded as sequences of bits, but you do not get your money's worth: to have n bits of entropy you must have values which, when encoded, use much more than n bits.
The cryptographic tool to use here is a PRF which accepts an input of arbitrary length, and produces an n-bit output. A cryptographically secure PRF of that kind is modeled as a random oracle: in short terms, it is not computationally feasible to predict anything about the oracle output on a given input without trying it.
Right now, we have hash functions. Hash functions must fulfill a few security properties, namely resistance to preimages, second preimages, and collisions. We usually analyze hash functions by trying to see how they depart from the random oracle model. There is an important point here: a PRF which follows the random oracle model will be a good hash function, but there can be good hash functions (in the sense of resistance to preimages and collisions) which nonetheless are easy to distinguish from a random oracle. In particular, the SHA-2 functions (SHA-256, SHA-512...) are considered to be secure, but depart from the random oracle model due to the "length extension attack" (given h(m), it is possible to compute h(m || m') for a partially constrained message m' without knowing m). The length extension attack does not seem to provide any shortcut into the creation of preimages or collisions, but it shows that those hash functions are not random oracles. For the SHA-3 competition, NIST stated that candidates should not allow such "length extension".
Hence, the mixing step is not easy. Your best bet is still, right now, to use SHA-256 or SHA-512, and switch to SHA-3 when it is chosen (this should happen around mid-2012).
3. Sources
A computer is a deterministic machine. To get some randomness, you have to mix in the result of some measures of the physical world.
A philosophical note: at some point you have to trust some smart guys, the kind who may wear lab coats or get paid to do fundamental research. When you use a hash function such as SHA-256, you are actually trusting a bunch of cryptographers when they tell you: we looked for flaws, real hard, and for several years, and found none. When you use a decaying bit of radioactive matter with a Geiger counter, you are trusting some physicists who say: we looked real hard for ways to predict when the next atom kernel will go off, but we found none. Note that, in that specific case, the "physicists" include people like Becquerel, Rutherford, Bohr or Einstein, and "real hard" means "more than a century of accumulated research", so you are not exactly in untrodden territory here. Yet there is still a bit of faith in security.
Some computers already include hardware which generates random data (i.e. which uses and measures a physical process which, as far as physicists can tell, is random enough). The VIA C3 (a line of x86-compatible CPUs) has such hardware. Strangely enough, the Commodore 64, a home computer from 30 years ago, also had a hardware RNG (or so says Wikipedia, at least).
Barring special hardware, you have to use whatever physical events you may get. Typically, you would use keystrokes, incoming ethernet packets, mouse movements, harddisk accesses... every event comes with some data, and occurs at a measurable instant (modern processors have very accurate clocks, thanks to cycle counters). Those instants, and the event data contents, can be accumulated as entropy sources. This is much easier for the operating system itself (which has direct access to the hardware) than for applications, so the normal way of collecting a seed is to ask the operating system (on Linux, this is called /dev/random or /dev/urandom [both have advantages and problems, choose your poison]; on Windows, call CryptGenRandom()).
An extreme case is pre-1.2 Java applets, before the addition of java.security.SecureRandom; since Java is very effective at isolating the application code from the hardware, obtaining a random seed was a tough challenge. The usual solution was to have two or three threads running concurrently and thread-switching madly, so that the number of thread switches per second was somewhat random (in effect, this tries to extract randomness through the timing of the OS scheduler actions, which depend on what also occurs on the machine, including hardware-related events). This was quite unsatisfactory.
A problem with time-related measures is that the attacker also knows what the current time is. If the attacker has applicative access to the machine, then he can read the cycle counter as well.
Some people have proposed using audio cards as sources of "white noise" by setting the amplifier to its max (even servers have audio nowadays). Others argue for powering up webcams (we know that webcam videos are "noisy" and that's good for randomness, even if the webcam is facing a wall); but servers with webcams are not common. You can also ping an external network server (e.g. www.google.com) and see how much time it takes to come back (but this could be observed by an attacker spying on the network).
The beauty of the mixing step, with a hash function, is that entropy can only accumulate; there is no harm in adding data, even if that data is not that random. Just stuff as much as possible through the hash function. Hash functions are quite fast (a good SHA-512 implementation will process more than 150 MB/s on a typical PC, using a single core) and seeding does not happen that often.
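A minimal Python sketch of that "stuff everything through the hash" approach; the particular measurements below are just examples of cheap, machine-local data an application can reach, not a complete collector:

    import hashlib, os, time

    def collect_seed():
        h = hashlib.sha512()
        # None of these inputs need to be uniformly random; with a hash as the
        # mixing function, entropy can only accumulate.
        h.update(os.urandom(64))                        # whatever the OS pool already has
        h.update(str(time.perf_counter_ns()).encode())  # high-resolution counter ("cycle counter")
        h.update(str(os.getpid()).encode())             # process id
        h.update(repr(os.times()).encode())             # CPU times so far
        return h.digest()                               # 64-byte seed for a CSPRNG

    seed = collect_seed()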
4. Conclusion
Use a HSM. They cost a few hundred or a few thousand dollars, but aren't your secrets worth much more than that? A HSM includes RNG hardware, runs the PRNG algorithm, and stores the seed with tamper resistance. Also, most HSMs are already certified with regards to various national regulations (e.g. FIPS 140 in the US, and the EAL levels in Europe).
If you are so cheap that you will not buy a HSM, or if you want to protect data which is actually not very worthwhile, then build up a cryptographically secure PRNG using a seed obtained by hashing lots of physical measures. Anything which comes from some hardware should be hashed, along with the instant (read "cycle counter") at which that data was obtained. You should hash data by the megabyte here. Or, better yet, do not do it: simply use the facilities offered by your operating system, which already includes such code.
The most secure seed is the one which has the highest entropy (i.e. the most bits that cannot be predicted). Time is generally a bad seed because it has low entropy (if you know when the transaction took place, you can guess the timestamp to within a few bits). Hardware entropy sources (e.g. from decay processes) are very good because they yield one bit of entropy for every bit of seed.
Usually a hardware source is impractical for most needs, so this leads you to rely on mixing a number of low quality entropy sources to produce a higher one. Typically this is done by estimating the number of bits of entropy for each sample and then gathering enough samples so that the search space for the entropy source is large enough that it is impractical for an attacker to search (128 bits is a good rule of thumb).
Some sources which you can use are: the current time in microseconds (typically very low entropy, around half a bit, depending on resolution and how easy it is for an attacker to guess), inter-arrival time of UI events, etc.
Operating system sources such as /dev/random and the Windows CAPI random number generator often provide a pre-mixed source of these low-entropy sources; for example, the Windows generator CryptGenRandom includes:
The current process ID (GetCurrentProcessID).
The current thread ID (GetCurrentThreadID).
The tick count since boot time (GetTickCount).
The current time (GetLocalTime).
Various high-precision performance counters (QueryPerformanceCounter).
An MD4 hash of the user's environment block, which includes username, computer name, and search path. [...]
High-precision internal CPU counters, such as RDTSC, RDMSR, RDPMC.
Some PRNGs have built-in strategies to allow the mixing of entropy from low quality sources to produce high quality results. One very good generator is the Fortuna generator. It specifically uses strategies which limit the risk if any of the entropy sources are compromised.
The most secure seed is a truly random one, which you can approximate in practical computing systems of today by using, listed in decreasing degrees of confidence:
Special hardware
Facilities provided by your operating system that try to capture chaotic events like disk reads and mouse movements (/dev/random). Another option along this "capture unpredictable events" line is to use an independent process or machine that captures what happens to it as an entropy pool, instead of the OS-provided 'secure' random number generator; for an example, see EntropyPool.
Using a bad seed (i.e. time) and combining it with other data known only to you (for instance, hashing the time with a secret and some other criteria, such as PIDs or internal state of the application/OS, so that it doesn't simply increase and decrease with time); a minimal sketch of this follows below.
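A sketch of that third option; the secret and the exact inputs are assumptions for illustration, and this remains the weakest of the three approaches listed:

    import hashlib, os, time

    # Hypothetical long-term secret known only to the application.
    APP_SECRET = b"keep-this-out-of-source-control"

    def time_plus_secret_seed():
        material = b"|".join([
            APP_SECRET,
            str(time.time_ns()).encode(),   # the "bad", guessable time-based part
            str(os.getpid()).encode(),      # process id
            str(id(object())).encode(),     # a sliver of interpreter-internal state
        ])
        return hashlib.sha256(material).digest()

    seed = time_plus_secret_seed()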
As an interesting take on one-time pads, whenever I'm engaged in espionage I have a system whereby I need only communicate a few letters. For example, the last time I was selling secret plans to build toasters to the Duchy of Grand Fenwick, I only needed to whisper:
enonH
to my confederate. She knew to get http://is.gd/enonH- (this is a "safe" expander URL which takes you to the is.gd expansion page which in turn points to a completely SFW image of a frog). This gave us 409k bits of one-time pad or - if I wink while whispering "enonH" - she knows to take the hash of the image and use that as a decoding key for my next transmission.
Because of the compression in JPEG images they tend to be relatively good sources of entropy as reported by ent:
$ ent frog.jpg
Entropy = 7.955028 bits per byte.
Optimum compression would reduce the size of this 51092 byte file by 0 percent.
Chi square distribution for 51092 samples is 4409.15, and randomly would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 129.0884 (127.5 = random).
Monte Carlo value for Pi is 3.053435115 (error 2.81 percent).
Serial correlation coefficient is 0.052738 (totally uncorrelated = 0.0).
Combine that with the nearly impossible to guess image that I directed her to and my secret toaster plans are safe from The Man.
The answer is /dev/random on a Linux machine. This is very close to a "real" random number generator, whereas /dev/urandom can be generated by a PRNG if the entropy pool runs dry. The following quote is taken from the Linux kernel's random.c. The entire file is a beautiful read, with plenty of comments. The code itself was adapted from PGP. Its beauty is not bounded by the constraints of C, which is marked by global structs wrapped by accessors. It is a simply awe-inspiring design.
This routine gathers environmental noise from device drivers, etc., and returns good random numbers, suitable for cryptographic use. Besides the obvious cryptographic uses, these numbers are also good for seeding TCP sequence numbers, and other places where it is desirable to have numbers which are not only random, but hard to predict by an attacker.

Theory of operation

Computers are very predictable devices. Hence it is extremely hard to produce truly random numbers on a computer --- as opposed to pseudo-random numbers, which can easily be generated by using an algorithm. Unfortunately, it is very easy for attackers to guess the sequence of pseudo-random number generators, and for some applications this is not acceptable. So instead, we must try to gather "environmental noise" from the computer's environment, which must be hard for outside attackers to observe, and use that to generate random numbers. In a Unix environment, this is best done from inside the kernel.

Sources of randomness from the environment include inter-keyboard timings, inter-interrupt timings from some interrupts, and other events which are both (a) non-deterministic and (b) hard for an outside observer to measure. Randomness from these sources is added to an "entropy pool", which is mixed using a CRC-like function. This is not cryptographically strong, but it is adequate assuming the randomness is not chosen maliciously, and it is fast enough that the overhead of doing it on every interrupt is very reasonable. As random bytes are mixed into the entropy pool, the routines keep an estimate of how many bits of randomness have been stored into the random number generator's internal state.

When random bytes are desired, they are obtained by taking the SHA hash of the contents of the "entropy pool". The SHA hash avoids exposing the internal state of the entropy pool. It is believed to be computationally infeasible to derive any useful information about the input of SHA from its output. Even if it is possible to analyze SHA in some clever way, as long as the amount of data returned from the generator is less than the inherent entropy in the pool, the output data is totally unpredictable. For this reason, the routine decreases its internal estimate of how many bits of "true randomness" are contained in the entropy pool as it outputs random numbers.

If this estimate goes to zero, the routine can still generate random numbers; however, an attacker may (at least in theory) be able to infer the future output of the generator from prior outputs. This requires successful cryptanalysis of SHA, which is not believed to be feasible, but there is a remote possibility. Nonetheless, these numbers should be useful for the vast majority of purposes.
...
Write an Internet radio client, use a random sample from the broadcast. Have a pool of several stations to choose from and/or fall back to.
James is correct. In addition, there is hardware that you can purchase that will give you random data. Not sure where I saw it, but I think I read that some sound cards come with such hardware.
You can also use a site like http://www.random.org/
If you read into crypto-theory, it becomes apparent that the most secure seed would be one generated by a chaotic event. Throughout recent history, covert operations have made use of what is known as a "One-time pad" which is proven impossible to crack. Normally these are generated through an assortment of atmospheric listening posts scattered about the middle of nowhere. Atmospheric noise is sufficiently chaotic to be considered random. The main problem with this method is that the logistics for a one time pad are considerable.
My suggestion to you is to find a sufficiently chaotic event to somehow extract data from.
4 - chosen by very random dice roll. :-)
OK, assuming that the client needs a strong seed and you are using cloud computing, here is a solution. For some hardware random number generators, you can look here:
http://en.wikipedia.org/wiki/Hardware_random_number_generator
So, this assumes that each client has a public/private key pair, where the server knows the public key for each client.
To generate a key you can use something similar to what was done with PGP, in the beginning, where you take the difference in time between key strokes as someone types, as that won't be guessable.
So, the client submits a request for a random number.
The server uses a hardware generator, encrypts it with the public key, and signs this with the server's private key.
The client then can verify where it came from and then decrypt it.
This will ensure that you can generate a random number and pass it back in a secure fashion.
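A rough sketch of that exchange, assuming RSA key pairs on both sides and the pyca/cryptography package; key distribution and the hardware RNG itself are out of scope (os.urandom stands in for the hardware generator here):

    import os
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa

    # For the sketch, both key pairs are generated locally; in reality each side
    # holds only its own private key and the other's public key.
    server_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    client_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

    oaep = padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                        algorithm=hashes.SHA256(), label=None)
    pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                      salt_length=padding.PSS.MAX_LENGTH)

    # Server: draw randomness, encrypt it for the client, sign the ciphertext.
    random_value = os.urandom(32)
    ciphertext = client_key.public_key().encrypt(random_value, oaep)
    signature = server_key.sign(ciphertext, pss, hashes.SHA256())

    # Client: verify where it came from, then decrypt.
    server_key.public_key().verify(signature, ciphertext, pss, hashes.SHA256())
    recovered = client_key.decrypt(ciphertext, oaep)
    assert recovered == random_value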
UPDATE:
Your best bet is to look in The Art of Computer Programming or any of the numerical methods books, or look at what Bruce Schneier has written, such as these links:
http://www.schneier.com/blog/archives/2006/06/random_number_g.html
http://www.cryptosys.net/rng_algorithms.html
Suggestions for Random Number Generation in Software, ftp://ftp.rsasecurity.com/pub/pdfs/bull-1.pdf
You can also look at having Crypto++ do the generation, or at least look at how Wei Dai did it, http://www.cryptopp.com/
Random.org offers a true random number generator web service, "seeded" by atmospheric noise.
You get 200,000 random bits for free each day, up to a cap of 1 million random bits; after that you should top up your account. It gets as cheap as 4 million bits per dollar.
A simple solution if no additional random hardware is available:
use the milliseconds, mouseX and mouseY to generate a seed.
The consensus is that cryptographically strong random numbers must be derived from hardware. Some processors have this functionality (Intel chips, amongst others). Sound cards can also be used for this, by measuring the low-bit fluctuations in the A/D converter.
But due to the hardware requirement, there is no language- and platform-independent answer.
Pretty much any larger OS will have support for secure random numbers. It is also tricky to implement a good random number generator with good output, since you will have to track the remaining entropy in the pool.
So the first step is to determine what language(s) you will be using.
Some do have strong random number support - if this is not the case you would have to abstract the generation to call platform-dependent random sources.
Depending on your security needs, be wary of "online" sources, since a man-in-the-middle can be a serious threat for unauthenticated online sources.
Your most secure methods will come from nature; that is to say, something that happens outside of your computer system and beyond our ability to predict its patterns.
For instance, many researchers into cryptographically secure PRNGs will use radioactive decay as a model; others might look into fractals, and so forth. There are existing means of creating true RNGs.
One of my favorite ways of implementing a PRNG is from user interaction with a computer. For instance, this post was not something that could be pre-determined by forward-engineering from my past series of posts. Where I left my mouse on my screen is very random, and the trail it made is also random. Seeding from user interactions is random, too. Abuse, in the form of providing specific input so that specific numbers are generated, could be mitigated by using a 'swarm' of user inputs and calculating its 'vector'; as long as not every user in your system is an Eve, you should be fine. This is not suitable for many applications, as your pool of numbers is directly proportional to user input. Implementing this may have its own issues.
People interested in RNG have already done things such as:
Use a web cam, whatever the random blips in the screen hash out to, when that truck passes by, that's all random data.
As mentioned already, radiation
Atmosphere
User interaction (as mentioned)
What's going on within the system's EGD (entropy gathering daemon).
Secure seeds come from nature.
edit:
Based on what you're looking at doing, I might suggest using an aggregation of your cloud servers' EGDs.
First you need to define the actual use/purpose of the random number generator, and why you think it has to pass such a high security standard. The reason I ask is that you mentioned picking it from the cloud - if you are indeed using it for security purposes, then securing the source and the channel used to send it around is much more important than anyone's academic nit-picking.
The second element is the size of the actual random numbers you need - a big seed is good, but only if the number generated is also big - otherwise you'll just be reading a small part of the generated number, and that will increase your risk.
Look into reconfigurable ciphers, rather than things like SHA or AES. Here are 2 research papers if you want to read and verify how and why they work:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.6594&rep=rep1&type=pdf
http://www.springerlink.com/index/q29t6v1p45515186.pdf
Or grab any reconfigurable GOST cipher source code you find on the net. You can then either feed it just about any basic seed (like a concatenated "ticker", plus a web server node ID if it's in a web farm, plus a part of the response from any internet news site that changes its top news all the time), or you can feed it a tightly controlled initial seed (which you can make on your own) and use a light pseudo-random sequence for selecting further cipher configurations. Even the NSA can't break that one :-) since it's always a different cipher. For actual crypto purposes one virtually has to use a very controlled initial seed, just to be able to replicate the sequence for validation. That's where we go back to the first item - securing the source and distribution.
Use random.org; they claim to offer true random numbers to anyone on the Internet, and they also have an HTTP API which you can use. They offer both free and paid services.
Disclaimer: I am not in any way affiliated with random.org.
You can get random numbers generated by radioactive decay. That sounds a little strange at first, but you get real random numbers out of it.
Radioactive Decay
Another Article
THIS IS A GUESS! Crypto geeks please correct if I've got it wrong
The official algorithm for UUIDs/GUIDs at this point returns a result that is run through a cryptographic hash function - it takes known information, such as the time, the MAC address, and a counter to form a UUID/GUID, and then runs this through a cryptographic hash to ensure that the MAC address cannot be extracted.
I believe you can XOR this down to the number of bits you require for a seed, with a reasonably good guarantee that the resulting value is evenly distributed over the number space defined by your desired bit count. Note that I am not claiming this is secure, only that this should produce a value that distributes evenly across the bit space over time.
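A sketch of that XOR-folding idea (again, this says nothing about secrecy, only about spreading the value over the smaller bit space):

    import uuid

    def fold_guid(bits=32):
        # XOR-fold the 128-bit GUID down to the requested number of bits.
        value = uuid.uuid4().int
        mask = (1 << bits) - 1
        folded = 0
        while value:
            folded ^= value & mask
            value >>= bits
        return folded

    seed = fold_guid(32)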
(((PI X current thread ID) X current process ID) / tick count) x pi
Many websites have a password strength checking tool, which tells you how strong your password is.
Let's say I have
st4cK0v3rFl0W
which is always considered super strong, but when I do
st4cK0v3rFl0Wst4cK0v3rFl0W
it is suddenly super weak. I've also heard that when password have just small repeating sequence, it is much weaker.
But how possibly can the second one be weaker than the first one, when it is twice as long?
Sounds like the password strength checker is flawed. It's not a big issue, I suppose, but a repeated strong password is not weaker than the original password.
My guess is that it's simply trivial for someone attacking your password to check for this: trying each password doubled and tripled as well is only double or triple the work. However, including more possible characters in a password, such as punctuation marks, raises the complexity of brute-forcing your password much more.
However, in practice, nearly every non-obvious (read: impervious to dictionary attacks [yes, that includes 1337ifying a dictionary word]) password with 8 or more characters can be considered reasonably secure. It's usually much less work to social engineer it from you in some way or just use a keylogger.
I guess it is because you need to type your password twice on the keyboard, so someone standing in front of you has a better chance of noticing it.
The algorithm is broken.
Either it uses doublet detection and immediately writes the password off as bad, or it calculates a strength that is in some way relative to the string length, and the repeated string is weaker than a comparable, totally random string of equal length.
It might be a flaw in the password strength checker - it recognises a pattern... A pattern is not good for a password, but in this case it is a pattern on top of a complex string... Another reason could be the one pointed out in the answer from Wael Dalloul: someone can see the repeated text when you type it. Any spy has two chances of seeing what you type...
The best reason that I could think of comes from the Electronic Authentication Guideline, published by NIST. It gives a general rule of thumb on how to estimate the entropy in a password.
Length is just one criterion for entropy. The password's character set is also involved, but these are not the only criteria. If you read Shannon's research on user-selected passwords closely, you'll notice that higher entropy is assigned to the initial bits, and less entropy to the later ones, since it is quite possible to infer the next bits of the password from the previous ones.
This is not to say that longer passwords are bad, just that long passwords with a poor selection of characters are just as likely to be weak as shorter passwords.
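For reference, here is a sketch of the estimation rule from that NIST guideline (Appendix A of SP 800-63); the optional bonus for dictionary checks is left out, so treat the numbers as illustrative only:

    def nist_password_entropy(password, composition_rule=False):
        # Rough entropy estimate for user-chosen passwords, per NIST SP 800-63.
        bits = 0.0
        for position in range(1, len(password) + 1):
            if position == 1:
                bits += 4.0      # first character
            elif position <= 8:
                bits += 2.0      # characters 2 through 8
            elif position <= 20:
                bits += 1.5      # characters 9 through 20
            else:
                bits += 1.0      # characters 21 and up
        if composition_rule:
            bits += 6.0          # bonus for requiring upper case and non-alphabetic characters
        return bits

    print(nist_password_entropy("st4cK0v3rFl0W"))               # 25.5 bits
    print(nist_password_entropy("st4cK0v3rFl0Wst4cK0v3rFl0W"))  # 42.0 bits

Note how the per-character contribution falls off after the first few characters; that is the declining-entropy effect described above.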
My guess is it can generate a more "obvious" hash.
For example, abba -> a737y4gs, but abbaabba -> 1y3k1y3k. Granted, this is a silly example, but the idea is that repeating patterns in the key would make the hash appear less "random".
Whilst in practice the longer one is probably stronger, I think there may be potential weaknesses when you get into the nitty gritty of how encryption and ciphers work... possibly...
Other than that, I'd echo the other responses that the strength-checker you're using isn't taking all aspects into account very accurately.
Just a thought...
On more than one occasion I've been asked to implement rules for password selection for software I'm developing. Typical suggestions include things like:
Passwords must be at least N characters long;
Passwords must include lowercase, uppercase and numbers;
No reuse of the last M passwords (or passwords used within P days).
And so on.
Something has always bugged me about putting any restrictions on passwords though - by restricting the available passwords, you reduce the size of the space of all allowable passwords. Doesn't this make passwords easier to guess?
Equally, by making users create complex, frequently-changing passwords, the temptation to write them down increases, also reducing security.
Is there any quantitative evidence that password restriction rules make systems more secure?
If there is, what are the 'most secure' password restriction strategies to use?
Edit Ólafur Waage has kindly pointed out a Coding Horror article on dictionary attacks which has a lot of useful analysis in it, but it strikes me that dictionary attacks can be massively reduced (as Jeff suggests) by simply adding a delay following a failed authentication attempt.
With this in mind, what evidence is there that forced-complex passwords are more secure?
Something has always bugged me about putting any restrictions on passwords though - by restricting the available passwords, you reduce the size of the space of all allowable passwords. Doesn't this make passwords easier to guess?
In theory, yes. In practice, the "weak" passwords you disallow represent a tiny subset of all possible passwords that is disproportionately often chosen when there are no restrictions, and which attackers know to attack first.
Equally, by making users create complex, frequently-changing passwords, the temptation to write them down increases, also reducing security.
Correct. Forcing users to change passwords every month is a very, very bad idea, except perhaps in extreme high-security environments where everyone really understands the need for security.
Those kind of rules definitely help because it stops stupid users from using passwords like "mypassword", which unfortunately happens quite often.
So actually, you are forcing the users into an extremely large set of potential passwords. It doesn't matter that you are excluding the set of all passwords with only lowercase letters, because the remaining set is still orders of magnitude larger.
BUT my big pet peeves are password restrictions I've encountered on major sites, like
No special characters
Maximum length
Why would anyone do this? W.H.Y.????
A nice read up on this is Jeff's article on Dictionary Attacks.
Never prevent the user from doing what they really want, unless there is a technical limitation from doing so.
You may nag the hell out of the user for doing stupid things like using a dictionary word or a 3-character password, or only using numbers, but see #1 above.
There is no good technical reason to require only alphanumerics, or at least one capital letter, or at least one number; see #1 above.
I forget which website had this advice regarding passwords: "Pick a password that is very easy for you to remember, but very hard for someone else to guess." But then they proceeded to require at least one capital letter and one number.
The problem with passwords is that they are so ubiquitous that it is essentially impossible for any person without a photographic memory to actually remember them without writing them down, and therefore leaving a serious security hole should someone gain access to this list of written-down passwords.
The only way I am able to manage this for myself is to split most of my passwords -- and I just checked my list, I'm up to 130 so far! -- into two parts, one which is the same in all cases, and the other which is unique but simple. (I break this rule for sites requiring high-security like bank accounts.)
The problem with requiring "complexity", defined as multiple types of characters all being present, is that it forces people into a disparate set of conventions for different sites, which makes it harder to remember the password in question.
The only reason I will acknowledge for sites limiting the set of allowable password characters, is that it needs to be typeable on a keyboard. If you have to assume the account needs to be accessed from multiple countries, then keyboards may not always support the same characters on the user's home keyboard.
One of these days I'll have to make a blog posting on the subject. :(
My old limit theorem:
As the security of the password approaches adequate, the probability that it will be on a sticky note attached to the computer or monitor approaches one.
One also might point out the recent fiasco over at twitter where one of their admin's password turned out to be "happiness", which fell to a dictionary attack.
For questions like this, I ask myself what Bruce Schneier would do - the linked article is about how to choose passwords which are hard to guess with typical attacks.
Also note that if you add a delay after a failed attempt, you might also want to add a delay after a successful attempt; otherwise the delay is simply a signal that the attack has failed and another attempt should be launched.
Whilst this does not directly answer your question, I personally find the most aggravating rule I have encountered to be one whereby you could not reuse any previously used password. After working at the same place for a number of years, and having to change your password every 2-3 months, the ability to use a password I chose over a year ago would not seem particularly unsafe or insecure. If I have used "safe" passwords in the past (alphanumeric with changes in case), surely reusing them after a period of, say, a year or two (depending on how regularly you have to change your password) would seem acceptable to me. It also means I am less likely to use "easier" passwords, which might happen if I can't think of anything easy to remember and difficult to guess!
First let me say that details such as minimum length, case sensitivity and required special characters should depend on who has access and what the password allows them to do. If it's a code to launch a nuclear missile, it should be more strict than a password to log in to play your paid online edition of Angry Birds.
But I've got a SPECIFIC beef with case sensitivity.
For starters, users hate it. The human brain thinks "A = a". Of course, developers' brains aren't usually typical. ;-) But developers are also inconvenienced by case sensitivity.
Second, the CapsLock key is too easy to hit by mistake. It's right between the Tab and Shift keys, but it SHOULD be up above the Esc key. Its location was established long ago in the days of typewriters, which had no alternate font available. In those days it was useful to have it there.
All passwords have risk... You're balancing risk with ease-of-use, and yes, usability matters.
MY ARGUMENT:
Yes, case sensitivity is more secure for a given password length. But unless someone is making me do otherwise, I opt for a longer minimum password length. Even if we assume only letters and digits are allowed, each added character multiplies the number of possible passwords by 36.
Someone who's less lazy than me with math could tell you the difference in the number of combinations between, say, a minimum 8-character case-sensitive password and a 12-character case-insensitive one. I think most users would prefer the latter.
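Doing that arithmetic (pure combination counting, ignoring how people actually choose passwords):

    case_sensitive_8 = 62 ** 8      # upper + lower + digits, 8 characters
    case_insensitive_12 = 36 ** 12  # one case + digits, 12 characters

    print(f"{case_sensitive_8:.2e}")     # ~2.18e14 combinations
    print(f"{case_insensitive_12:.2e}")  # ~4.74e18 combinations, roughly 20,000x more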
Also, not all apps expose usernames to others, so there are potentially two fields the hacker may have to find.
I also prefer to allow spaces in passwords as long as the majority of the password isn't spaces.
In the project I'm developing now, my management screen allows the administrator to change password requirements, which apply to all future passwords. He can also force all users to update passwords (to new requirements) at any time after next logon. I do this because I feel my stuff doesn't need case-sensitivity, but the administrator (who probably paid me for the software) may disagree so I let that person decide.
The PIN for my bank card is only four digits. Since it's only numbers it's not case sensitive. And heck, it's my MONEY! If you consider nothing else, this sounds pretty insecure, were it not for the fact that the hacker has to steal my card to get my money. (And have his photo taken.)
One other beef: developers who come onto StackOverflow and regurgitate hard-and-fast rules that they read in an article somewhere. "Never hard-code anything." (As if that's possible.) "All queries must be parameterized." (Not if the user doesn't contribute to the query.) Etc.
Please excuse the rant. ;-) I promise I respect disagreement.
Personally, for this particular problem I tend to give passwords a 'score' based on characteristics of the entered text, and refuse passwords that don't meet the score.
For example:
Contains Lower Case Letter +1
Contains different Lower Case Letter +1
Contains Upper Case Letter +1
Contains different Upper Case Letter +1
Contains Non-Alphanumeric character: +1
Contains different Non-Alphanumeric character: +1
Contains Number: +1
Contains Non Consecutive or repeated Second Number: +1
Length less than 8: -10
Length Greater than 12: +1
Contains Dictionary word: -4
Then I only allow passwords with a score greater than 4 (and provide the user feedback as they create their password, via JavaScript).
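A server-side sketch of that scoring scheme; the word list and the exact rule for the second number are simplified stand-ins for what a real check would use:

    import re

    COMMON_WORDS = {"password", "letmein", "qwerty", "monkey"}  # stand-in dictionary

    def password_score(pw):
        score = 0
        lowers  = set(re.findall(r"[a-z]", pw))
        uppers  = set(re.findall(r"[A-Z]", pw))
        digits  = re.findall(r"[0-9]", pw)
        symbols = set(re.findall(r"[^A-Za-z0-9]", pw))

        score += min(len(lowers), 2)    # lower case letter, +1 more for a different one
        score += min(len(uppers), 2)    # upper case letter, +1 more for a different one
        score += min(len(symbols), 2)   # non-alphanumeric, +1 more for a different one
        if digits:
            score += 1                  # contains a number
        if len(set(digits)) >= 2:
            score += 1                  # a second, different number (simplification)
        if len(pw) < 8:
            score -= 10
        elif len(pw) > 12:
            score += 1
        if any(word in pw.lower() for word in COMMON_WORDS):
            score -= 4                  # contains a dictionary word
        return score

    def acceptable(pw):
        return password_score(pw) > 4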
GUIDs get used a lot for creating session keys in web applications. I've always wondered about the safety of this practice. Since a GUID is generated based on information from the machine and the time, along with a few other factors, how hard is it to guess likely GUIDs that will come up in the future? Let's say you started 1000, or 10000, new sessions to get a good dataset of the GUIDs being generated. Would this make it any easier to generate a GUID that might be used for another session? You wouldn't even have to guess a specific GUID, just keep trying GUIDs that might be generated at a specific period of time.
Here is some stuff from Wikipedia (original source):
V1 GUIDs which contain a MAC address and time can be identified by the digit "1" in the first position of the third group of digits, for example {2f1e4fc0-81fd-11da-9156-00036a0f876a}.
In my understanding, they don't really hide it.
V4 GUIDs use the later algorithm, which is a pseudo-random number. These have a "4" in the same position, for example {38a52be4-9352-453e-af97-5c3b448652f0}. More specifically, the 'data3' bit pattern would be 0001xxxxxxxxxxxx in the first case, and 0100xxxxxxxxxxxx in the second. Cryptanalysis of the WinAPI GUID generator shows that, since the sequence of V4 GUIDs is pseudo-random, given the initial state one can predict up to the next 250,000 GUIDs returned by the function UuidCreate. This is why GUIDs should not be used in cryptography, e.g. as random keys.
GUIDs are guaranteed to be unique and that's about it. They are not guaranteed to be random or difficult to guess.
To answer your question: at least for the V1 GUID generation algorithm, if you know the algorithm, the MAC address, and the time of creation, you could probably generate a set of GUIDs, one of which would be the one that was actually generated. And the MAC address, if it's a V1 GUID, can be determined from sample GUIDs from the same machine.
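To make that concrete, here is a sketch of how candidate V1 GUIDs could be enumerated from a known MAC address and a guessed creation time; the field layout follows RFC 4122, the node and clock-sequence values are taken from the example GUID quoted above, and the timestamp window is an arbitrary illustration:

    import uuid

    # 100-nanosecond intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    UUID_EPOCH_OFFSET = 0x01B21DD213814000

    def candidate_v1(unix_time_ns, node, clock_seq):
        ts = unix_time_ns // 100 + UUID_EPOCH_OFFSET            # 60-bit V1 timestamp
        time_low = ts & 0xFFFFFFFF
        time_mid = (ts >> 32) & 0xFFFF
        time_hi_version = ((ts >> 48) & 0x0FFF) | 0x1000        # version 1
        clock_seq_hi = 0x80 | ((clock_seq >> 8) & 0x3F)         # RFC 4122 variant bits
        clock_seq_low = clock_seq & 0xFF
        return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                                 clock_seq_hi, clock_seq_low, node))

    node = 0x00036A0F876A       # MAC address from the example GUID above
    clock_seq = 0x1156          # clock sequence recovered from the same example GUID
    guess_ns = 1_300_000_000_000_000_000   # guessed creation time (Unix ns), illustrative
    # Try every 100 ns tick in a one-millisecond window around the guessed time.
    candidates = [candidate_v1(guess_ns + i * 100, node, clock_seq)
                  for i in range(10_000)]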
Additional tidbit from Wikipedia:
The OSF-specified algorithm for generating new GUIDs has been widely criticized. In these (V1) GUIDs, the user's network card MAC address is used as a base for the last group of GUID digits, which means, for example, that a document can be tracked back to the computer that created it. This privacy hole was used when locating the creator of the Melissa worm. Most of the other digits are based on the time while generating the GUID.
.NET Web Applications call Guid.NewGuid() to create a GUID, which in turn ends up calling the CoCreateGuid() COM function a couple of frames deeper in the stack.
From the MSDN Library:
The CoCreateGuid function calls the RPC function UuidCreate, which creates a GUID, a globally unique 128-bit integer. Use the CoCreateGuid function when you need an absolutely unique number that you will use as a persistent identifier in a distributed environment. To a very high degree of certainty, this function returns a unique value – no other invocation, on the same or any other system (networked or not), should return the same value.
And if you check the page on UuidCreate:
The UuidCreate function generates a UUID that cannot be traced to the ethernet/token ring address of the computer on which it was generated. It also cannot be associated with other UUIDs created on the same computer.
That last sentence is the answer to your question. So I would say it is pretty hard to guess, unless there is a bug in Microsoft's implementation.
If someone kept hitting a server with a continuous stream of GUIDs it would be more of a denial of service attack than anything else.
The possibility of someone guessing a GUID is next to nil.
Depends. It is hard if the GUIDs are set up sensibly, e.g. using salted secure hashes and you have plenty of bits. It is weak if the GUIDs are short and obvious.
You may well want to be taking steps to stop someone creating 10,000 new sessions anyway, due to the server load this might create.
"GUIDs are guaranteed to be unique and that's about it". GUIDs are not guaranteed to be unique. At least the ones generated by CoCreateGuid: "To a very high degree of certainty, this function returns a unique value – no other invocation, on the same or any other system (networked or not), should return the same value."