What is the space complexity of a hash table? - hashmap

What is the size of a hash table with 32-bit keys and 32-bit pointers to values stored separately?
Is it going to be 2^32 slots * (4 bytes (key) + 4 bytes (pointer to value))
= 4 * 10^9 * (4 + 4) = 32 GB?
I am trying to understand the space complexity of hash tables.

I think you are asking the wrong question. The space complexity of a data structure indicates how much space it occupies in relation to the number of elements it holds. For example, a space complexity of O(1) would mean that the data structure always consumes constant space no matter how many elements you put in it. O(n) would mean that the space consumption grows linearly with the number of elements in it.
A hash table typically has a space complexity of O(n).
So to answer your question: it depends on the number of elements it currently stores and, in the real world, also on the actual implementation.
A lower bound for the memory consumption of your hash table is: (Number of Values to Store) * (SizeOf a Value). So if you want to store 1 million values in the hash table and each occupies 4 bytes, then it will consume at least 4 million bytes (roughly 4 MB). Real-world implementations usually use a bit more memory for infrastructure, but again: this depends heavily on the actual implementation, and there is no way to find out for sure other than to measure it.

Hash tables do not allocate one slot per possible hash value. The hash value is reduced modulo the size of a reference vector (the bucket array) that is much smaller than the range of the hash function. Because the range of the hash function is fixed, it is not considered in the space complexity computation.
Consequently, the space complexity of every reasonable hash table is O(n).
In general, this works out quite well. While the key space may be large, the number of values to store is usually quite easily predictable. Certainly, the amount of memory that is functionally acceptable for data structure overhead is typically obvious.
This is why hash tables are so ubiquitous. They often provide the best data structure for a given task, mixing strictly bounded memory overhead with better than log2 n time complexity. I love binary trees but they don't usually beat hash tables.
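To make the modulo step concrete, here is a minimal sketch in Go of a chained hash table. Everything in it is an assumption made for illustration: the names `table`, `entry`, `put`, `get` and `grow`, the multiplier used as a hash function, and the 3/4 load factor. It only shows that the bucket array stays within a constant factor of the number of stored elements, regardless of the 2^32 possible keys.

```go
package main

import "fmt"

// entry is one key/value pair in a bucket's linked list.
type entry struct {
	key, value uint32
	next       *entry
}

// table is a toy chained hash table (hypothetical, for illustration only).
type table struct {
	buckets []*entry
	size    int
}

func newTable() *table {
	return &table{buckets: make([]*entry, 8)}
}

// index reduces the 32-bit hash modulo the (much smaller) bucket array length.
func (t *table) index(key uint32) int {
	return int((key * 2654435761) % uint32(len(t.buckets)))
}

func (t *table) get(key uint32) (uint32, bool) {
	for e := t.buckets[t.index(key)]; e != nil; e = e.next {
		if e.key == key {
			return e.value, true
		}
	}
	return 0, false
}

func (t *table) put(key, value uint32) {
	for e := t.buckets[t.index(key)]; e != nil; e = e.next {
		if e.key == key {
			e.value = value
			return
		}
	}
	// Assumed load factor of 3/4: double the bucket array when it is reached.
	if t.size >= len(t.buckets)*3/4 {
		t.grow()
	}
	i := t.index(key)
	t.buckets[i] = &entry{key: key, value: value, next: t.buckets[i]}
	t.size++
}

// grow doubles the bucket array and rehashes every entry into it.
func (t *table) grow() {
	old := t.buckets
	t.buckets = make([]*entry, 2*len(old))
	for _, e := range old {
		for e != nil {
			next := e.next
			i := t.index(e.key)
			e.next = t.buckets[i]
			t.buckets[i] = e
			e = next
		}
	}
}

func main() {
	ht := newTable()
	for k := uint32(0); k < 1000; k++ {
		ht.put(k*7919, k)
	}
	fmt.Println(ht.size, len(ht.buckets)) // prints: 1000 2048
}
```

With 1000 elements the sketch ends up with 2048 buckets, so the memory used is proportional to the element count, not to the key range.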

Let's pretend we have a naive hash table where the number of buckets is equal to double the number of elements. That is O(2n) buckets for n elements, which is O(n).
When the number of elements exceeds half of the number of available buckets, you need to create a new array of buckets, double the size and rehash all the elements to their new locations in the new array of buckets.
public V put(K key, V value) {
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key.hashCode());
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }

    modCount++;
    addEntry(hash, key, value, i);
    return null;
}

void addEntry(int hash, K key, V value, int bucketIndex) {
    Entry<K,V> e = table[bucketIndex];
    table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
    if (size++ >= threshold)
        resize(2 * table.length);
}

void resize(int newCapacity) {
    Entry[] oldTable = table;
    int oldCapacity = oldTable.length;
    if (oldCapacity == MAXIMUM_CAPACITY) {
        threshold = Integer.MAX_VALUE;
        return;
    }

    Entry[] newTable = new Entry[newCapacity];
    transfer(newTable);
    table = newTable;
    threshold = (int)(newCapacity * loadFactor);
}

void transfer(Entry[] newTable) {
    Entry[] src = table;
    int newCapacity = newTable.length;
    for (int j = 0; j < src.length; j++) {
        Entry<K,V> e = src[j];
        if (e != null) {
            src[j] = null;
            do {
                Entry<K,V> next = e.next;
                int i = indexFor(e.hash, newCapacity);
                e.next = newTable[i];
                newTable[i] = e;
                e = next;
            } while (e != null);
        }
    }
}
References:
HashMap.put
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java#HashMap.put%28java.lang.Object%2Cjava.lang.Object%29
Grepcode is down; you can take a look at the OpenJDK repository here as a better reference:
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java

There is still no perfect answer to the question, and I am not sure about the space occupied.
As I understand it, the size is dynamic and varies with the size of the input.
That is, we start with an arbitrary hash table size that is much smaller than the range of the hash function. Then we insert the input. As collisions start occurring, we dynamically double the hash table size.
This, I think, is the reason for the O(n) complexity. Kindly correct me if I am wrong.

Related

Maximum repeating substring of size n

Find the substring of length n that repeats a maximum number of times in a given string.
Input: abbbabbbb# 2
Output: bb
My solution:
import java.util.Arrays;

public static String mrs(String s, int m) {
    int n = s.length();
    // Collect every substring of length m.
    String[] suffixes = new String[n - m + 1];
    for (int i = 0; i < n - m + 1; i++) {
        suffixes[i] = s.substring(i, i + m);
    }
    // Sorting places equal substrings next to each other.
    Arrays.sort(suffixes);
    String ans = "";
    int cnt = 0, max = 0;
    for (int i = 0; i < suffixes.length; i++) {
        // Extend the current run of equal substrings, or start a new run of length 1.
        cnt = (i > 0 && suffixes[i].equals(suffixes[i - 1])) ? cnt + 1 : 1;
        if (cnt > max) {
            max = cnt;
            ans = suffixes[i];
        }
    }
    return ans;
}
Can it be done better than the above O(nm) time and O(n) space solution?
For a string of length L and a given length k (to avoid confusion with n and m, which the question uses interchangeably at times), we can compute polynomial hashes of all substrings of length k in O(L) (see Wikipedia for some elaboration on this subproblem).
Now, if we map the hash values to the number of times they occur, we get the value which occurs most frequently in O(L) (with a HashMap with high probability, or in O(L log L) with a TreeMap).
After that, just take the substring which got the most frequent hash as the answer.
This solution does not take hash collisions into account.
The idea is to just reduce the probability of collisions enough for the application (if it's too high, use multiple hashes, for example).
If the application demands that we absolutely never give a wrong answer, we can check the answer in O(L) with another algorithm (KMP, for example), and re-run the whole solution with a different hash function as long as the answer turns out to be wrong.
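A minimal sketch of the hashing approach in Go follows. The function name `mostFrequent`, the multiplier `base`, and the use of wrap-around `uint64` arithmetic as the modulus are assumptions chosen for illustration, and, as described above, the sketch does not guard against hash collisions.

```go
package main

import "fmt"

// mostFrequent returns a length-k substring of s whose rolling hash occurs
// most often. Collisions are ignored, so the answer is only correct with
// high probability (see the verification idea described in the text).
func mostFrequent(s string, k int) string {
	if k <= 0 || k > len(s) {
		return ""
	}
	const base uint64 = 1000003 // arbitrary multiplier (assumption)

	// pow = base^(k-1), used to remove the outgoing byte from the window.
	pow := uint64(1)
	for i := 1; i < k; i++ {
		pow *= base
	}

	// Hash of the first window s[0:k]; uint64 overflow acts as "mod 2^64".
	var h uint64
	for i := 0; i < k; i++ {
		h = h*base + uint64(s[i])
	}

	counts := map[uint64]int{h: 1}
	bestStart, bestCount := 0, 1
	for i := 1; i+k <= len(s); i++ {
		// Slide the window: drop s[i-1], append s[i+k-1].
		h = (h-uint64(s[i-1])*pow)*base + uint64(s[i+k-1])
		counts[h]++
		if counts[h] > bestCount {
			bestStart, bestCount = i, counts[h]
		}
	}
	return s[bestStart : bestStart+k]
}

func main() {
	fmt.Println(mostFrequent("abbbabbbb", 2)) // prints: bb
}
```

Each window is hashed in constant time, so the whole pass is O(L) expected time with the map, at the cost of the small collision risk discussed above.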

Node.js - How to generate random numbers in specific range using crypto.randomBytes

How can I generate random numbers in a specific range using crypto.randomBytes?
I want to be able to generate a random number like this:
console.log(random(55, 956)); // where 55 is minimum and 956 is maximum
and I'm limited to use crypto.randomBytes only inside random function to generate random number for this range.
I know how to convert generated bytes from randomBytes to hex or decimal but I can't figure out how to get a random number in a specific range from random bytes mathematically.
To generate random number in a certain range you can use the following equation
Math.random() * (high - low) + low
But you want to use crypto.randomBytes instead of Math.random().
This function returns a buffer of randomly generated bytes, so you need to convert its result from bytes to a decimal number. This can be done using the biguint-format package. To install this package, simply use the following command:
npm install biguint-format --save
Now you need to convert the result of crypto.randomBytes to decimal, which you can do as follows:
var x = crypto.randomBytes(1);
return format(x, 'dec');
Now you can create your random function as follows:
var crypto = require('crypto'),
    format = require('biguint-format');

// Returns qty random bytes as a decimal string.
function randomC(qty) {
    var x = crypto.randomBytes(qty);
    return format(x, 'dec');
}

// Scales a 4-byte random value from [0, 2^32) to [low, high).
function random(low, high) {
    return randomC(4) / Math.pow(2, 4 * 8) * (high - low) + low;
}

console.log(random(50, 1000));
Thanks to the answer from @Mustafamg and huge help from @CodesInChaos, I managed to resolve this issue. I made some tweaks and increased the range to a maximum of 256^6-1, or 281,474,976,710,655. The range can be increased further, but you then need an additional library for big integers, because 256^7-1 exceeds Number.MAX_SAFE_INTEGER.
If anyone has the same problem, feel free to use it.
var crypto = require('crypto');
/*
Generating random numbers in specific range using crypto.randomBytes from crypto library
Maximum available range is 281474976710655 or 256^6-1
Maximum number for range must be equal or less than Number.MAX_SAFE_INTEGER (usually 9007199254740991)
Usage examples:
cryptoRandomNumber(0, 350);
cryptoRandomNumber(556, 1250425);
cryptoRandomNumber(0, 281474976710655);
cryptoRandomNumber((Number.MAX_SAFE_INTEGER-281474976710655), Number.MAX_SAFE_INTEGER);
Tested and working on 64bit Windows and Unix operation systems.
*/
function cryptoRandomNumber(minimum, maximum){
var distance = maximum-minimum;
if(minimum>=maximum){
console.log('Minimum number should be less than maximum');
return false;
} else if(distance>281474976710655){
console.log('You can not get all possible random numbers if range is greater than 256^6-1');
return false;
} else if(maximum>Number.MAX_SAFE_INTEGER){
console.log('Maximum number should be safe integer limit');
return false;
} else {
var maxBytes = 6;
var maxDec = 281474976710656;
// To avoid huge mathematical operations and increase function performance for small ranges, you can uncomment following script
/*
if(distance<256){
maxBytes = 1;
maxDec = 256;
} else if(distance<65536){
maxBytes = 2;
maxDec = 65536;
} else if(distance<16777216){
maxBytes = 3;
maxDec = 16777216;
} else if(distance<4294967296){
maxBytes = 4;
maxDec = 4294967296;
} else if(distance<1099511627776){
maxBytes = 5;
maxDec = 1099511627776;
}
*/
var randbytes = parseInt(crypto.randomBytes(maxBytes).toString('hex'), 16);
var result = Math.floor(randbytes/maxDec*(maximum-minimum+1)+minimum);
if(result>maximum){
result = maximum;
}
return result;
}
}
So far it works fine and you can use it as a good random number generator, but I strongly recommend against using this function for any cryptographic purposes. If you do, use it at your own risk.
All comments, recommendations and critics are welcome!
To generate numbers in the range [55 .. 956], you first generate a random number in the range [0 .. 901] where 901 = 956 - 55. Then add 55 to the number you just generated.
To generate a number in the range [0 .. 901], pick off two random bytes and mask off 6 bits. That will give you a 10 bit random number in the range [0 .. 1023]. If that number is <= 901 then you are finished. If it is bigger than 901, discard it and get two more random bytes. Do not attempt to use MOD, to get the number into the right range, that will distort the output making it non-random.
ETA: To reduce the chance of having to discard a generated number.
Since we are taking two bytes from the RNG, we get a number in the range [0 .. 65535], that is, 65536 possible values. Now 65536 MOD 902 is 592, so the largest multiple of 902 that fits is 65536 - 592 = 64944. Hence, if our two-byte random number is less than 64944, we can safely use the MOD operator, since each number in the range [0 .. 901] is now equally likely. Any two-byte number >= 64944 will still have to be thrown away, as using it would distort the output away from random. Before, the chances of having to reject a number were (1024 - 902) / 1024 = 12%. Now the chances of a rejection are (65536 - 64944) / 65536 = 0.9%. We are far less likely to have to reject the randomly generated number.
running <- true
while running
    num <- two byte random
    if (num < 64944)
        result <- num MOD 902
        running <- false
    endif
endwhile
return result + 55
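For comparison, the same rejection idea can be sketched in Go (not Node.js) for an arbitrary small range. The function name `randomInRange` and the restriction to ranges of at most 65536 values are assumptions made to keep the example short; the threshold is computed instead of hard-coding 64944.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// randomInRange returns a uniformly distributed integer in [min, max] by
// rejection sampling on two random bytes. This sketch assumes that the
// range holds between 1 and 65536 values.
func randomInRange(min, max int) (int, error) {
	span := max - min + 1
	if span <= 0 || span > 65536 {
		return 0, fmt.Errorf("unsupported range [%d, %d]", min, max)
	}
	// Largest multiple of span that fits into the 65536 two-byte values;
	// for span = 902 this is 64944.
	limit := 65536 - 65536%span
	var buf [2]byte
	for {
		if _, err := rand.Read(buf[:]); err != nil {
			return 0, err
		}
		num := int(binary.BigEndian.Uint16(buf[:]))
		if num < limit {
			// Every remainder is now equally likely.
			return num%span + min, nil
		}
		// Otherwise discard the value and draw two new bytes.
	}
}

func main() {
	n, err := randomInRange(55, 956)
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // some value between 55 and 956 inclusive
}
```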
The crypto package now has a randomInt() function. It was added in v14.10.0 and v12.19.0.
console.log(crypto.randomInt(55, 957)); // where 55 is minimum and 956 is maximum
The upper bound is exclusive.
Here is the (abridged) implementation:
// Largest integer we can read from a buffer.
// e.g.: Buffer.from("ff".repeat(6), "hex").readUIntBE(0, 6);
const RAND_MAX = 0xFFFF_FFFF_FFFF;
const range = max - min;
const excess = RAND_MAX % range;
const randLimit = RAND_MAX - excess;
while (true) {
const x = randomBytes(6).readUIntBE(0, 6);
// If x > (maxVal - (maxVal % range)), we will get "modulo bias"
if (x > randLimit) {
// Try again
continue;
}
const n = (x % range) + min;
return n;
}
See the full source and the official docs for more information.
So the issue with most other solutions is that they distort the distribution (which you probably would like to be uniform).
The pseudocode from @rossum lacks generalization. (But he proposed the right solution in the text.)
// Generates a random integer in the range [min, max]
function randomRange(min, max) {
    const diff = max - min + 1;
    // Minimum number of bits required to represent the values below diff
    const numberBit = Math.ceil(Math.log2(diff));
    // We can only draw whole bytes (8 bits each), so round up to the minimum number of bytes
    const numberBytes = Math.ceil(numberBit / 8);
    // We might draw more bits than required, so we look only at what we need (discard the rest).
    // The 32-bit bitwise operators require diff to stay at or below 2^31.
    const mask = (1 << numberBit) - 1;
    let randomNumber;
    do {
        randomNumber = crypto.randomBytes(numberBytes).readUIntBE(0, numberBytes);
        randomNumber = randomNumber & mask;
        // numberBit bits can represent numbers bigger than diff; in that case try again
    } while (randomNumber >= diff);
    return randomNumber + min;
}
As for performance, the number falls in the right range between 50% and 100% of the time (depending on the parameters). That is, in the worst case the loop is executed more than 7 times with less than 1% probability, and in practice the loop is usually executed once or twice.
The random-js library acknowledges that most solutions out there don't provide random numbers with a uniform distribution, and it provides a more complete solution.

How to generate a random string of a fixed length in Go?

I want a random string of characters only (uppercase or lowercase), no numbers, in Go. What is the fastest and simplest way to do this?
Paul's solution provides a simple, general solution.
The question asks for the "the fastest and simplest way". Let's address the fastest part too. We'll arrive at our final, fastest code in an iterative manner. Benchmarking each iteration can be found at the end of the answer.
All the solutions and the benchmarking code can be found on the Go Playground. The code on the Playground is a test file, not an executable. You have to save it into a file named XX_test.go and run it with
go test -bench . -benchmem
Foreword:
The fastest solution is not a go-to solution if you just need a random string. For that, Paul's solution is perfect. This is if performance does matter. Although the first 2 steps (Bytes and Remainder) might be an acceptable compromise: they do improve performance by like 50% (see exact numbers in the II. Benchmark section), and they don't increase complexity significantly.
Having said that, even if you don't need the fastest solution, reading through this answer might be adventurous and educational.
I. Improvements
1. Genesis (Runes)
As a reminder, the original, general solution we're improving is this:
func init() {
rand.Seed(time.Now().UnixNano())
}
var letterRunes = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
func RandStringRunes(n int) string {
b := make([]rune, n)
for i := range b {
b[i] = letterRunes[rand.Intn(len(letterRunes))]
}
return string(b)
}
2. Bytes
If the characters to choose from and assemble the random string contains only the uppercase and lowercase letters of the English alphabet, we can work with bytes only because the English alphabet letters map to bytes 1-to-1 in the UTF-8 encoding (which is how Go stores strings).
So instead of:
var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
we can use:
var letters = []byte("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
Or even better:
const letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
Now this is already a big improvement: we could make it a const (there are string constants but there are no slice constants). As an extra gain, the expression len(letters) will also be a const! (The expression len(s) is constant if s is a string constant.)
And at what cost? Nothing at all. strings can be indexed which indexes its bytes, perfect, exactly what we want.
Our next destination looks like this:
const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
func RandStringBytes(n int) string {
b := make([]byte, n)
for i := range b {
b[i] = letterBytes[rand.Intn(len(letterBytes))]
}
return string(b)
}
3. Remainder
Previous solutions get a random number to designate a random letter by calling rand.Intn() which delegates to Rand.Intn() which delegates to Rand.Int31n().
This is much slower compared to rand.Int63() which produces a random number with 63 random bits.
So we could simply call rand.Int63() and use the remainder after dividing by len(letterBytes):
func RandStringBytesRmndr(n int) string {
b := make([]byte, n)
for i := range b {
b[i] = letterBytes[rand.Int63() % int64(len(letterBytes))]
}
return string(b)
}
This works and is significantly faster, the disadvantage is that the probability of all the letters will not be exactly the same (assuming rand.Int63() produces all 63-bit numbers with equal probability). Although the distortion is extremely small as the number of letters 52 is much-much smaller than 1<<63 - 1, so in practice this is perfectly fine.
To make this easier to understand: let's say you want a random number in the range of 0..5. Using 3 random bits, this would produce the numbers 0..1 with double the probability of those in the range 2..5. Using 5 random bits, numbers in the range 0..1 would occur with 6/32 probability and numbers in the range 2..5 with 5/32 probability, which is now closer to the desired distribution. Increasing the number of bits makes this less significant; when reaching 63 bits, it is negligible.
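The 3-bit and 5-bit cases can be checked by brute force. The following sketch (the helper name `remainderCounts` is made up for this illustration) enumerates every value with the given number of bits and counts how often each remainder occurs:

```go
package main

import "fmt"

// remainderCounts reports how often each remainder 0..m-1 occurs when every
// value representable in the given number of bits is reduced modulo m.
func remainderCounts(bits uint, m int) []int {
	counts := make([]int, m)
	for v := 0; v < 1<<bits; v++ {
		counts[v%m]++
	}
	return counts
}

func main() {
	fmt.Println(remainderCounts(3, 6)) // prints: [2 2 1 1 1 1]
	fmt.Println(remainderCounts(5, 6)) // prints: [6 6 5 5 5 5]
}
```

The relative difference between the most and least frequent remainders shrinks as the number of bits grows, which is why the bias is negligible with 63 bits.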
4. Masking
Building on the previous solution, we can maintain the equal distribution of letters by using only as many of the lowest bits of the random number as many is required to represent the number of letters. So for example if we have 52 letters, it requires 6 bits to represent it: 52 = 110100b. So we will only use the lowest 6 bits of the number returned by rand.Int63(). And to maintain equal distribution of letters, we only "accept" the number if it falls in the range 0..len(letterBytes)-1. If the lowest bits are greater, we discard it and query a new random number.
Note that the chance of the lowest bits being greater than or equal to len(letterBytes) is less than 0.5 in general (0.25 on average), which means that even if this is the case, repeating this "rare" case decreases the chance of not finding a good number. After n repetitions, the chance that we still don't have a good index is much less than pow(0.5, n), and this is just an upper estimate. In the case of 52 letters, the chance that the 6 lowest bits are not good is only (64-52)/64 = 0.19, which means, for example, that the chance of not having a good number after 10 repetitions is about 5e-8.
So here is the solution:
const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
const (
letterIdxBits = 6 // 6 bits to represent a letter index
letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
)
func RandStringBytesMask(n int) string {
b := make([]byte, n)
for i := 0; i < n; {
if idx := int(rand.Int63() & letterIdxMask); idx < len(letterBytes) {
b[i] = letterBytes[idx]
i++
}
}
return string(b)
}
5. Masking Improved
The previous solution only uses the lowest 6 bits of the 63 random bits returned by rand.Int63(). This is a waste as getting the random bits is the slowest part of our algorithm.
If we have 52 letters, that means 6 bits code a letter index. So 63 random bits can designate 63/6 = 10 different letter indices. Let's use all those 10:
const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
const (
letterIdxBits = 6 // 6 bits to represent a letter index
letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
letterIdxMax = 63 / letterIdxBits // # of letter indices fitting in 63 bits
)
func RandStringBytesMaskImpr(n int) string {
b := make([]byte, n)
// A rand.Int63() generates 63 random bits, enough for letterIdxMax letters!
for i, cache, remain := n-1, rand.Int63(), letterIdxMax; i >= 0; {
if remain == 0 {
cache, remain = rand.Int63(), letterIdxMax
}
if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
b[i] = letterBytes[idx]
i--
}
cache >>= letterIdxBits
remain--
}
return string(b)
}
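To see how one 63-bit value is consumed six bits at a time, here is a small deterministic sketch. The helper `splitIndices` is hypothetical and exists only for this illustration; it reuses the constants from the solution above and applies the same mask-then-shift step to a fixed input instead of rand.Int63().

```go
package main

import "fmt"

const (
	letterBytes   = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
	letterIdxBits = 6                    // 6 bits to represent a letter index
	letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
	letterIdxMax  = 63 / letterIdxBits   // # of letter indices fitting in 63 bits
)

// splitIndices returns the letterIdxMax 6-bit groups of v, lowest bits first.
func splitIndices(v int64) []int {
	out := make([]int, 0, letterIdxMax)
	for i := 0; i < letterIdxMax; i++ {
		out = append(out, int(v&letterIdxMask))
		v >>= letterIdxBits
	}
	return out
}

func main() {
	// Lowest group is 1, the next group is 52, all remaining groups are 0.
	for _, idx := range splitIndices(52<<6 | 1) {
		if idx < len(letterBytes) {
			fmt.Printf("%d -> %c\n", idx, letterBytes[idx])
		} else {
			fmt.Printf("%d -> rejected\n", idx)
		}
	}
}
```

Index 52 is rejected because only indices 0 through 51 map to a letter, which is the same acceptance test the loop above performs before writing into the result.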
6. Source
The Masking Improved is pretty good, not much we can improve on it. We could, but not worth the complexity.
Now let's find something else to improve. The source of random numbers.
There is a crypto/rand package which provides a Read(b []byte) function, so we could use that to get as many bytes with a single call as many we need. This wouldn't help in terms of performance as crypto/rand implements a cryptographically secure pseudorandom number generator so it's much slower.
So let's stick to the math/rand package. The rand.Rand uses a rand.Source as the source of random bits. rand.Source is an interface which specifies a Int63() int64 method: exactly and the only thing we needed and used in our latest solution.
So we don't really need a rand.Rand (either explicit or the global, shared one of the rand package), a rand.Source is perfectly enough for us:
var src = rand.NewSource(time.Now().UnixNano())
func RandStringBytesMaskImprSrc(n int) string {
b := make([]byte, n)
// A src.Int63() generates 63 random bits, enough for letterIdxMax characters!
for i, cache, remain := n-1, src.Int63(), letterIdxMax; i >= 0; {
if remain == 0 {
cache, remain = src.Int63(), letterIdxMax
}
if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
b[i] = letterBytes[idx]
i--
}
cache >>= letterIdxBits
remain--
}
return string(b)
}
Also note that this last solution doesn't require you to initialize (seed) the global Rand of the math/rand package as that is not used (and our rand.Source is properly initialized / seeded).
One more thing to note here: package doc of math/rand states:
The default Source is safe for concurrent use by multiple goroutines.
So the default source is slower than a Source that may be obtained by rand.NewSource(), because the default source has to provide safety under concurrent access / use, while rand.NewSource() does not offer this (and thus the Source returned by it is more likely to be faster).
7. Utilizing strings.Builder
All previous solutions return a string whose content is first built in a slice ([]rune in Genesis, and []byte in subsequent solutions), and then converted to string. This final conversion has to make a copy of the slice's content, because string values are immutable, and if the conversion would not make a copy, it could not be guaranteed that the string's content is not modified via its original slice. For details, see How to convert utf8 string to []byte? and golang: []byte(string) vs []byte(*string).
Go 1.10 introduced strings.Builder. strings.Builder is a new type we can use to build contents of a string similar to bytes.Buffer. Internally it uses a []byte to build the content, and when we're done, we can obtain the final string value using its Builder.String() method. But what's cool in it is that it does this without performing the copy we just talked about above. It dares to do so because the byte slice used to build the string's content is not exposed, so it is guaranteed that no one can modify it unintentionally or maliciously to alter the produced "immutable" string.
So our next idea is to not build the random string in a slice, but with the help of a strings.Builder, so once we're done, we can obtain and return the result without having to make a copy of it. This may help in terms of speed, and it will definitely help in terms of memory usage and allocations.
func RandStringBytesMaskImprSrcSB(n int) string {
sb := strings.Builder{}
sb.Grow(n)
// A src.Int63() generates 63 random bits, enough for letterIdxMax characters!
for i, cache, remain := n-1, src.Int63(), letterIdxMax; i >= 0; {
if remain == 0 {
cache, remain = src.Int63(), letterIdxMax
}
if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
sb.WriteByte(letterBytes[idx])
i--
}
cache >>= letterIdxBits
remain--
}
return sb.String()
}
Do note that after creating a new strings.Builder, we called its Builder.Grow() method, making sure it allocates a big-enough internal slice (to avoid reallocations as we add the random letters).
8. "Mimicking" strings.Builder with package unsafe
strings.Builder builds the string in an internal []byte, the same as we did ourselves. So basically doing it via a strings.Builder has some overhead, the only thing we switched to strings.Builder for is to avoid the final copying of the slice.
strings.Builder avoids the final copy by using package unsafe:
// String returns the accumulated string.
func (b *Builder) String() string {
return *(*string)(unsafe.Pointer(&b.buf))
}
The thing is, we can also do this ourselves, too. So the idea here is to switch back to building the random string in a []byte, but when we're done, don't convert it to string to return, but do an unsafe conversion: obtain a string which points to our byte slice as the string data.
This is how it can be done:
func RandStringBytesMaskImprSrcUnsafe(n int) string {
b := make([]byte, n)
// A src.Int63() generates 63 random bits, enough for letterIdxMax characters!
for i, cache, remain := n-1, src.Int63(), letterIdxMax; i >= 0; {
if remain == 0 {
cache, remain = src.Int63(), letterIdxMax
}
if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
b[i] = letterBytes[idx]
i--
}
cache >>= letterIdxBits
remain--
}
return *(*string)(unsafe.Pointer(&b))
}
(9. Using rand.Read())
Go 1.7 added a rand.Read() function and a Rand.Read() method. We might be tempted to use these to read as many bytes as we need in one step, in order to achieve better performance.
There is one small "problem" with this: how many bytes do we need? We could say: as many as the number of output letters. We would think this is an upper estimation, as a letter index uses less than 8 bits (1 byte). But at this point we are already doing worse (as getting the random bits is the "hard part"), and we're getting more than needed.
Also note that to maintain equal distribution of all letter indices, there might be some "garbage" random data that we won't be able to use, so we would end up skipping some data, and thus end up short when we go through all the byte slice. We would need to further get more random bytes, "recursively". And now we're even losing the "single call to rand package" advantage...
We could "somewhat" optimize the usage of the random data we acquire from rand.Read(). We may estimate how many bytes (bits) we'll need. 1 letter requires letterIdxBits bits, and we need n letters, so we need n * letterIdxBits / 8.0 bytes, rounding up. We can calculate the probability of a random index not being usable (see above), so we could request more that will "more likely" be enough (if it turns out it's not, we repeat the process). We can process the byte slice as a "bit stream", for example, for which we have a nice 3rd party lib: github.com/icza/bitio (disclosure: I'm the author).
But Benchmark code still shows we're not winning. Why is it so?
The answer to the last question is that rand.Read() uses a loop and keeps calling Source.Int63() until it fills the passed slice. That is exactly what the RandStringBytesMaskImprSrc() solution does, without the intermediate buffer and without the added complexity. That's why RandStringBytesMaskImprSrc() remains on the throne. Yes, RandStringBytesMaskImprSrc() uses an unsynchronized rand.Source, unlike rand.Read(). But the reasoning still applies, which is confirmed if we use Rand.Read() instead of rand.Read() (the former is also unsynchronized).
II. Benchmark
All right, it's time for benchmarking the different solutions.
Moment of truth:
BenchmarkRunes-4                     2000000    723 ns/op   96 B/op   2 allocs/op
BenchmarkBytes-4                     3000000    550 ns/op   32 B/op   2 allocs/op
BenchmarkBytesRmndr-4                3000000    438 ns/op   32 B/op   2 allocs/op
BenchmarkBytesMask-4                 3000000    534 ns/op   32 B/op   2 allocs/op
BenchmarkBytesMaskImpr-4            10000000    176 ns/op   32 B/op   2 allocs/op
BenchmarkBytesMaskImprSrc-4         10000000    139 ns/op   32 B/op   2 allocs/op
BenchmarkBytesMaskImprSrcSB-4       10000000    134 ns/op   16 B/op   1 allocs/op
BenchmarkBytesMaskImprSrcUnsafe-4   10000000    115 ns/op   16 B/op   1 allocs/op
Just by switching from runes to bytes, we immediately have 24% performance gain, and memory requirement drops to one third.
Getting rid of rand.Intn() and using rand.Int63() instead gives another 20% boost.
Masking (and repeating in case of big indices) slows down a little (due to repetition calls): -22%...
But when we make use of all (or most) of the 63 random bits (10 indices from one rand.Int63() call): that speeds up big time: 3 times.
If we settle with a (non-default, new) rand.Source instead of rand.Rand, we again gain 21%.
If we utilize strings.Builder, we gain a tiny 3.5% in speed, but we also achieved 50% reduction in memory usage and allocations! That's nice!
Finally if we dare to use package unsafe instead of strings.Builder, we again gain a nice 14%.
Comparing the final to the initial solution: RandStringBytesMaskImprSrcUnsafe() is 6.3 times faster than RandStringRunes(), uses one sixth of the memory and half as many allocations. Mission accomplished.
You can just write code for it. This code can be a little simpler if you want to rely on the letters all being single bytes when encoded in UTF-8.
package main
import (
"fmt"
"time"
"math/rand"
)
func init() {
rand.Seed(time.Now().UnixNano())
}
var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
func randSeq(n int) string {
b := make([]rune, n)
for i := range b {
b[i] = letters[rand.Intn(len(letters))]
}
return string(b)
}
func main() {
fmt.Println(randSeq(10))
}
Use package uniuri, which generates cryptographically secure uniform (unbiased) strings.
Disclaimer: I'm the author of the package
A simple solution with few duplicate results (note that it produces hexadecimal characters rather than letters only):
import (
"fmt"
"math/rand"
"time"
)
func randomString(length int) string {
rand.Seed(time.Now().UnixNano())
b := make([]byte, length+2)
rand.Read(b)
return fmt.Sprintf("%x", b)[2 : length+2]
}
Check it out in the PlayGround
Two possible options (there might be more of course):
You can use the crypto/rand package that supports reading random byte arrays (from /dev/urandom) and is geared towards cryptographic random generation. see http://golang.org/pkg/crypto/rand/#example_Read . It might be slower than normal pseudo-random number generation though.
Take a random number and hash it using md5 or something like this.
If you want cryptographically secure random numbers, and the exact charset is flexible (say, base64 is fine), you can calculate exactly what the length of random characters you need from the desired output size.
Base 64 text is 1/3 longer than base 256. (2^8 vs 2^6; 8bits/6bits = 1.333 ratio)
import (
"crypto/rand"
"encoding/base64"
"math"
)
func randomBase64String(l int) string {
buff := make([]byte, int(math.Ceil(float64(l)/float64(1.33333333333))))
rand.Read(buff)
str := base64.RawURLEncoding.EncodeToString(buff)
return str[:l] // strip 1 extra character we get from odd length results
}
Note: you can also use RawStdEncoding if you prefer + and / characters to - and _
If you want hex, base 16 is 2x longer than base 256. (2^8 vs 2^4; 8bits/4bits = 2x ratio)
import (
"crypto/rand"
"encoding/hex"
"math"
)
func randomBase16String(l int) string {
buff := make([]byte, int(math.Ceil(float64(l)/2)))
rand.Read(buff)
str := hex.EncodeToString(buff)
return str[:l] // strip 1 extra character we get from odd length results
}
However, you could extend this to any arbitrary character set if you have a base256 to baseN encoder for your character set. You can do the same size calculation with how many bits are needed to represent your character set. The ratio calculation for any arbitrary charset is: ratio = 8 / log2(len(charset)).
Both of these solutions are secure and simple, should be fast, and don't waste your crypto entropy pool.
Here's the playground showing it works for any size. https://play.golang.org/p/_yF_xxXer0Z
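The ratio formula above can be turned into a small size calculation. This is only a sketch: randomBytesNeeded is a hypothetical helper name, and the result is the information-theoretic minimum number of bytes, so a real baseN encoder for a non-power-of-two charset may need a little more.

```go
package main

import (
	"fmt"
	"math"
)

// randomBytesNeeded (hypothetical helper) returns how many random bytes carry
// at least enough bits to encode outLen characters drawn from an alphabet of
// charsetLen symbols, using ratio = 8 / log2(charsetLen).
func randomBytesNeeded(charsetLen, outLen int) int {
	if charsetLen < 2 || outLen <= 0 {
		return 0
	}
	bitsPerChar := math.Log2(float64(charsetLen))
	return int(math.Ceil(float64(outLen) * bitsPerChar / 8))
}

func main() {
	fmt.Println(randomBytesNeeded(64, 16)) // base64: 6 bits per character
	fmt.Println(randomBytesNeeded(16, 16)) // hex: 4 bits per character
	fmt.Println(randomBytesNeeded(62, 16)) // alphanumeric: just under 6 bits per character
}
```

For base64 and hex this agrees with the byte counts computed in the two functions above.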
Another version, inspired by generating a password with JavaScript crypto:
package main
import (
"crypto/rand"
"fmt"
)
var chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-"
func shortID(length int) string {
ll := len(chars)
b := make([]byte, length)
rand.Read(b) // generates len(b) random bytes
for i := 0; i < length; i++ {
b[i] = chars[int(b[i])%ll]
}
return string(b)
}
func main() {
fmt.Println(shortID(18))
fmt.Println(shortID(18))
fmt.Println(shortID(18))
}
Following icza's wonderfully explained solution, here is a modification of it that uses crypto/rand instead of math/rand.
const (
letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" // 52 possibilities
letterIdxBits = 6 // 6 bits to represent 64 possibilities / indexes
letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
)
func SecureRandomAlphaString(length int) string {
result := make([]byte, length)
bufferSize := int(float64(length)*1.3)
for i, j, randomBytes := 0, 0, []byte{}; i < length; j++ {
if j%bufferSize == 0 {
randomBytes = SecureRandomBytes(bufferSize)
}
// Index by bufferSize (not length) so that no random byte is used twice
if idx := int(randomBytes[j%bufferSize] & letterIdxMask); idx < len(letterBytes) {
result[i] = letterBytes[idx]
i++
}
}
return string(result)
}
// SecureRandomBytes returns the requested number of bytes using crypto/rand
func SecureRandomBytes(length int) []byte {
var randomBytes = make([]byte, length)
_, err := rand.Read(randomBytes)
if err != nil {
log.Fatal("Unable to generate random bytes")
}
return randomBytes
}
If you want a more generic solution, that allows you to pass in the slice of character bytes to create the string out of, you can try using this:
// SecureRandomString returns a string of the requested length,
// made from the byte characters provided (only ASCII allowed).
// Uses crypto/rand for security. Will panic if len(availableCharBytes) > 256.
func SecureRandomString(availableCharBytes string, length int) string {
// Compute bitMask
availableCharLength := len(availableCharBytes)
if availableCharLength == 0 || availableCharLength > 256 {
panic("availableCharBytes length must be greater than 0 and less than or equal to 256")
}
var bitLength byte
var bitMask byte
for bits := availableCharLength - 1; bits != 0; {
bits = bits >> 1
bitLength++
}
bitMask = 1<<bitLength - 1
// Compute bufferSize
bufferSize := length + length / 3
// Create random string
result := make([]byte, length)
for i, j, randomBytes := 0, 0, []byte{}; i < length; j++ {
if j%bufferSize == 0 {
// Random byte buffer is empty, get a new one
randomBytes = SecureRandomBytes(bufferSize)
}
// Mask bytes to get an index into the character slice
// (index by bufferSize, not length, so that no random byte is used twice)
if idx := int(randomBytes[j%bufferSize] & bitMask); idx < availableCharLength {
result[i] = availableCharBytes[idx]
i++
}
}
return string(result)
}
If you want to pass in your own source of randomness, it would be trivial to modify the above to accept an io.Reader instead of using crypto/rand.
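One way that io.Reader variant might look is sketched below. The names randomStringFrom and secureRandomString are hypothetical, and the code is a simplified rewrite of the masking approach above rather than a drop-in replacement: it returns an error instead of panicking and refills the whole buffer whenever it runs out.

```go
package main

import (
	"crypto/rand"
	"errors"
	"fmt"
	"io"
)

// randomStringFrom (hypothetical name) builds a string of the given length
// from the ASCII characters in charset, drawing randomness from r.
func randomStringFrom(r io.Reader, charset string, length int) (string, error) {
	n := len(charset)
	if n == 0 || n > 256 || length < 0 {
		return "", errors.New("charset must hold 1 to 256 bytes and length must not be negative")
	}
	// Smallest all-ones bit mask that covers every index in [0, n).
	mask := 1
	for mask < n {
		mask <<= 1
	}
	mask--

	result := make([]byte, 0, length)
	buf := make([]byte, length+length/3+1)
	for len(result) < length {
		// Refill the whole buffer; io.ReadFull reports short reads as errors.
		if _, err := io.ReadFull(r, buf); err != nil {
			return "", err
		}
		for _, b := range buf {
			// Keep a byte only if its masked value is a valid index.
			if idx := int(b) & mask; idx < n {
				result = append(result, charset[idx])
				if len(result) == length {
					break
				}
			}
		}
	}
	return string(result), nil
}

// secureRandomString (hypothetical name) is the crypto/rand convenience wrapper.
func secureRandomString(charset string, length int) (string, error) {
	return randomStringFrom(rand.Reader, charset, length)
}

func main() {
	s, err := secureRandomString("abcdefghijklmnopqrstuvwxyz", 12)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Passing a deterministic reader to randomStringFrom makes the function easy to unit test, which is the main practical benefit of accepting an io.Reader.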
Here is my way. Use math/rand or crypto/rand as you wish.
func randStr(n int) string {
// n is used instead of len to avoid shadowing the builtin
buff := make([]byte, n)
rand.Read(buff)
str := base64.StdEncoding.EncodeToString(buff)
// The base64 text can be longer than n, so truncate it
return str[:n]
}
Here is a simple and performant solution for a cryptographically secure random string.
package main
import (
"crypto/rand"
"unsafe"
"fmt"
)
var alphabet = []byte("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
func main() {
fmt.Println(generate(16))
}
func generate(size int) string {
b := make([]byte, size)
rand.Read(b)
for i := 0; i < size; i++ {
b[i] = alphabet[b[i] % byte(len(alphabet))]
}
return *(*string)(unsafe.Pointer(&b))
}
Benchmark
Benchmark 95.2 ns/op 16 B/op 1 allocs/op
func Rand(n int) (str string) {
b := make([]byte, n)
rand.Read(b)
str = fmt.Sprintf("%x", b)
return
}
I usually do it like this if it takes an option to capitalize or not
func randomString(length int, upperCase bool) string {
rand.Seed(time.Now().UnixNano())
var alphabet string
if upperCase {
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
} else {
alphabet = "abcdefghijklmnopqrstuvwxyz"
}
var sb strings.Builder
l := len(alphabet)
for i := 0; i < length; i++ {
c := alphabet[rand.Intn(l)]
sb.WriteByte(c)
}
return sb.String()
}
and like this if you don't need capital letters
func randomString(length int) string {
rand.Seed(time.Now().UnixNano())
var alphabet string = "abcdefghijklmnopqrstuvwxyz"
var sb strings.Builder
l := len(alphabet)
for i := 0; i < length; i++ {
c := alphabet[rand.Intn(l)]
sb.WriteByte(c)
}
return sb.String()
}
If you are willing to add a few characters to your pool of allowed characters, you can make the code work with anything that provides random bytes through an io.Reader. Here we are using crypto/rand.
// len(encodeURL) == 64. Since 64 divides 256 evenly, x % 64 has an even
// distribution for a random byte x (0 <= x <= 255).
const encodeURL = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
// RandAsciiBytes is a helper function that creates and fills a slice of length n
// with characters from a-zA-Z0-9_-. It panics if there are any problems getting random bytes.
func RandAsciiBytes(n int) []byte {
output := make([]byte, n)
// We will take n bytes, one byte for each character of output.
randomness := make([]byte, n)
// read all random
_, err := rand.Read(randomness)
if err != nil {
panic(err)
}
// fill output
for pos := range output {
// get random item
random := uint8(randomness[pos])
// random % 64
randomPos := random % uint8(len(encodeURL))
// put into output
output[pos] = encodeURL[randomPos]
}
return output
}
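The comment about x % 64 having an even distribution can be checked by counting how the 256 possible byte values spread over the indices. The sketch below uses hypothetical helper names (moduloCounts, minMax) and compares the 64-character alphabet with 62 and 52 characters, sizes that appear in other answers here.

```go
package main

import "fmt"

// moduloCounts reports how many of the 256 possible byte values map to each
// index when reduced modulo n. Equal counts mean an unbiased mapping.
// n must be positive.
func moduloCounts(n int) []int {
	counts := make([]int, n)
	for b := 0; b < 256; b++ {
		counts[b%n]++
	}
	return counts
}

// minMax returns the smallest and largest value in a non-empty slice.
func minMax(counts []int) (lo, hi int) {
	lo, hi = counts[0], counts[0]
	for _, c := range counts[1:] {
		if c < lo {
			lo = c
		}
		if c > hi {
			hi = c
		}
	}
	return lo, hi
}

func main() {
	for _, n := range []int{64, 62, 52} {
		lo, hi := minMax(moduloCounts(n))
		fmt.Printf("alphabet size %d: each index is hit between %d and %d times\n", n, lo, hi)
	}
}
```

With 64 characters every index is hit exactly 4 times. With 62 or 52 characters some indices are hit 5 times and others 4, so a plain modulo slightly favours the first characters of the alphabet.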
This is sample code that I used to generate a certificate number in my app.
func GenerateCertificateNumber() string {
CertificateLength := 7
t := time.Now().String()
CertificateHash, err := bcrypt.GenerateFromPassword([]byte(t), bcrypt.DefaultCost)
if err != nil {
fmt.Println(err)
}
// Make a Regex we only want letters and numbers
reg, err := regexp.Compile("[^a-zA-Z0-9]+")
if err != nil {
log.Fatal(err)
}
processedString := reg.ReplaceAllString(string(CertificateHash), "")
fmt.Println(string(processedString))
CertificateNumber := strings.ToUpper(string(processedString[len(processedString)-CertificateLength:]))
fmt.Println(CertificateNumber)
return CertificateNumber
}
/*
korzhao
*/
package rand
import (
crand "crypto/rand"
"math/rand"
"sync"
"time"
"unsafe"
)
// Doesn't share the rand library globally, reducing lock contention
type Rand struct {
Seed int64
Pool *sync.Pool
}
var (
MRand = NewRand()
randlist = []byte("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890")
)
// init random number generator
func NewRand() *Rand {
p := &sync.Pool{New: func() interface{} {
return rand.New(rand.NewSource(getSeed()))
},
}
mrand := &Rand{
Pool: p,
}
return mrand
}
// get the seed
func getSeed() int64 {
return time.Now().UnixNano()
}
func (s *Rand) getrand() *rand.Rand {
return s.Pool.Get().(*rand.Rand)
}
func (s *Rand) putrand(r *rand.Rand) {
s.Pool.Put(r)
}
// get a random number
func (s *Rand) Intn(n int) int {
r := s.getrand()
defer s.putrand(r)
return r.Intn(n)
}
// bulk get random numbers
func (s *Rand) Read(p []byte) (int, error) {
r := s.getrand()
defer s.putrand(r)
return r.Read(p)
}
func CreateRandomString(len int) string {
b := make([]byte, len)
_, err := MRand.Read(b)
if err != nil {
return ""
}
for i := 0; i < len; i++ {
b[i] = randlist[b[i]%(62)]
}
return *(*string)(unsafe.Pointer(&b))
}
24.0 ns/op 16 B/op 1 allocs/op
As a follow-up to icza's brilliant solution, below I am using rand.Reader
func RandStringBytesMaskImprRandReaderUnsafe(length uint) (string, error) {
const (
charset = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
charIdxBits = 6 // 6 bits to represent a letter index
charIdxMask = 1<<charIdxBits - 1 // All 1-bits, as many as charIdxBits
charIdxMax = 63 / charIdxBits // # of letter indices fitting in 63 bits
)
buffer := make([]byte, length)
charsetLength := len(charset)
max := big.NewInt(int64(1 << uint64(charsetLength)))
limit, err := rand.Int(rand.Reader, max)
if err != nil {
return "", err
}
for index, cache, remain := int(length-1), limit.Int64(), charIdxMax; index >= 0; {
if remain == 0 {
limit, err = rand.Int(rand.Reader, max)
if err != nil {
return "", err
}
cache, remain = limit.Int64(), charIdxMax
}
if idx := int(cache & charIdxMask); idx < charsetLength {
buffer[index] = charset[idx]
index--
}
cache >>= charIdxBits
remain--
}
return *(*string)(unsafe.Pointer(&buffer)), nil
}
func BenchmarkBytesMaskImprRandReaderUnsafe(b *testing.B) {
b.ReportAllocs()
b.ResetTimer()
const length = 16
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
RandStringBytesMaskImprRandReaderUnsafe(length)
}
})
}
package main
import (
"encoding/base64"
"fmt"
"math/rand"
"time"
)
// customEncodeURL is like `base64.encodeURL`
// except it's made up entirely of uppercase characters:
const customEncodeURL = "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKL"
// Random generates a random string.
// It is not cryptographically secure.
func Random(n int) string {
b := make([]byte, n)
rand.Seed(time.Now().UnixNano())
_, _ = rand.Read(b) // docs say that it always returns a nil error.
customEncoding := base64.NewEncoding(customEncodeURL).WithPadding(base64.NoPadding)
return customEncoding.EncodeToString(b)
}
func main() {
fmt.Println(Random(16))
}
const (
chars = "0123456789_abcdefghijkl-mnopqrstuvwxyz" //ABCDEFGHIJKLMNOPQRSTUVWXYZ
charsLen = len(chars)
mask = 1<<6 - 1
)
var rng = rand.NewSource(time.Now().UnixNano())
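// RandStr returns a random string of the specified length
func RandStr(ln int) string {
/* chars holds 38 characters.
* rng.Int63() yields 63 random bits per call; we consume 6 bits at a time (2^6 = 64), so each value can be used 10 times.
*/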
buf := make([]byte, ln)
for idx, cache, remain := ln-1, rng.Int63(), 10; idx >= 0; {
if remain == 0 {
cache, remain = rng.Int63(), 10
}
buf[idx] = chars[int(cache&mask)%charsLen]
cache >>= 6
remain--
idx--
}
return *(*string)(unsafe.Pointer(&buf))
}
BenchmarkRandStr16-8 20000000 68.1 ns/op 16 B/op 1 allocs/op

Performance difference in toString.map and toString.toArray.map

While coding Euler problems, I ran across what I think is bizarre:
The method toString.map is slower than toString.toArray.map.
Here's an example:
def main(args: Array[String])
{
def toDigit(num : Int) = num.toString.map(_ - 48) //2137 ms
def toDigitFast(num : Int) = num.toString.toArray.map(_ - 48) //592 ms
val startTime = System.currentTimeMillis;
(1 to 1200000).map(toDigit)
println(System.currentTimeMillis - startTime)
}
Shouldn't the method map on String fall back to a map over the array? Why is there such a noticeable difference? (Note that increasing the number even causes a stack overflow in the non-array case.)
Original
Could be because toString.map uses the WrappedString implicit, while toString.toArray.map uses the WrappedArray implicit to resolve map.
Let's see map, as defined in TraversableLike:
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
val b = bf(repr)
b.sizeHint(this)
for (x <- this) b += f(x)
b.result
}
WrappedString uses a StringBuilder as builder:
def +=(x: Char): this.type = { append(x); this }
def append(x: Any): StringBuilder = {
underlying append String.valueOf(x)
this
}
The String.valueOf call for Any uses Java Object.toString on the Char instances, possibly getting boxed first. These extra ops might be the cause of speed difference, versus the supposedly shorter code paths of the Array builder.
This is a guess though, would have to measure.
Edit
After revising, the general point still stands, but I referred to the wrong implicits, since the toDigit methods return an Int sequence (or similar), not a translated string as I misread.
toDigit uses LowPriorityImplicits.fallbackStringCanBuildFrom[T]: CanBuildFrom[String, T, immutable.IndexedSeq[T]], with T = Int, which just defers to a general IndexedSeq builder.
toDigitFast uses a direct Array implicit of type CanBuildFrom[Array[_], T, Array[T]], which is unarguably faster.
Passing the following CBF for toDigit explicitly makes the two methods on par:
object FastStringToArrayBuild {
def canBuildFrom[T : ClassManifest] = new CanBuildFrom[String, T, Array[T]] {
private def newBuilder = scala.collection.mutable.ArrayBuilder.make[T]()
def apply(from: String) = newBuilder
def apply() = newBuilder
}
}
You're being fooled by running out of memory. The toDigit version does create more intermediate objects, but if you have plenty of memory then the GC won't be heavily impacted (and it'll all run faster). For example, if instead of creating 1.2 million numbers, I create 12k 100x in a row, I get approximately equal times for the two methods. If I create 1.2k 5-digit numbers 1000x in a row, I find that toDigit is about 5% faster.
Given that the toDigit method produces an immutable collection, which is better when all else is equal since it is easier to reason about, and given that all else is equal for all but highly demanding tasks, I think the library is as it should be.
When trying to improve performance, of course one needs to keep all sorts of tricks in mind; one of these is that arrays have better memory characteristics for collections of known length than do the fancy collections in the Scala library. Also, one needs to know that map isn't the fastest way to get things done; if you really wanted this to be fast you should write something like:
final def toDigitReallyFast(num: Int, accum: Long = 0L, iter: Int = 0): Array[Byte] = {
if (num==0) {
val ans = new Array[Byte](math.max(1,iter))
var i = 0
var ac = accum
while (i < ans.length) {
ans(ans.length-i-1) = (ac & 0xF).toByte
ac >>= 4
i += 1
}
ans
}
else {
val next = num/10
toDigitReallyFast(next, (accum << 4) | (num-10*next), iter+1)
}
}
which on my machine is 4x faster than either of the others. And you can get almost 3x faster yet again if you leave everything in a Long and pack the results in an array instead of using 1 to N:
final def toDigitExtremelyFast(num: Int, accum: Long = 0L, iter: Int = 0): Long = {
if (num==0) accum | (iter.toLong << 48)
else {
val next = num/10
toDigitExtremelyFast(next, accum | ((num-10*next).toLong<<(4*iter)), iter+1)
}
}
// loop, instead of 1 to N map, for the 1.2k number case
{
var i = 10000
val a = new Array[Long](1201)
while (i<=11200) {
a(i-10000) = toDigitExtremelyFast(i)
i += 1
}
a
}
As with many things, performance tuning is highly dependent on exactly what you want to do. In contrast, library design has to balance many different concerns. I do think it's worth noticing where the library is sub-optimal with respect to performance, but this isn't really one of those cases IMO; the flexibility is worth it for the common use cases.

Checksum Algorithm Producing Unpredictable Results

I'm working on a checksum algorithm, and I'm having some issues. The kicker is, when I hand craft a "fake" message, that is substantially smaller than the "real" data I'm receiving, I get a correct checksum. However, against the real data - the checksum does not work properly.
Here's some information on the incoming data/environment:
This is a Groovy project (see code below)
All bytes are to be treated as unsigned integers for the purpose of checksum calculation
You'll notice some finagling with shorts and longs in order to make that work.
The size of the real data is 491 bytes.
The size of my sample data (which appears to add correctly) is 26 bytes
None of my hex-to-decimal conversions are producing a negative number, as best I can tell
Some bytes in the file are not added to the checksum. I've verified that the switch for these is working properly, and when it is supposed to - so that's not the issue.
My calculated checksum, and the checksum packaged with the real transmission always differ by the same amount.
I have manually verified that the checksum packaged with the real data is correct.
Here is the code:
// add bytes to checksum
public void addToChecksum( byte[] bytes) {
//if the checksum isn't enabled, don't add
if(!checksumEnabled) {
return;
}
long previouschecksum = this.checksum;
for(int i = 0; i < bytes.length; i++) {
byte[] tmpBytes = new byte[2];
tmpBytes[0] = 0x00;
tmpBytes[1] = bytes[i];
ByteBuffer tmpBuf = ByteBuffer.wrap(tmpBytes);
long computedBytes = tmpBuf.getShort();
logger.info(getHex(bytes[i]) + " = " + computedBytes);
this.checksum += computedBytes;
}
if(this.checksum < previouschecksum) {
logger.error("Checksum DECREASED: " + this.checksum);
}
//logger.info("Checksum: " + this.checksum);
}
If anyone can find anything in this algorithm that could be causing drift from the expected result, I would greatly appreciate your help in tracking this down.
I don't see a line in your code where you reset your this.checksum.
This way, you should always get a this.checksum > previouschecksum, right? Is this intended?
Otherwise I can't find a flaw in your above code. Maybe your 'this.checksum' is of the wrong type (short, for instance). This could roll over so that you get negative values.
Here is an example of such behaviour:
import java.nio.ByteBuffer
short checksum = 0
byte[] bytes = new byte[491]
def count = 260
for (def i=0;i<count;i++) {
bytes[i]=255
}
bytes.each { b ->
byte[] tmpBytes = new byte[2];
tmpBytes[0] = 0x00;
tmpBytes[1] = b;
ByteBuffer tmpBuf = ByteBuffer.wrap(tmpBytes);
long computedBytes = tmpBuf.getShort();
checksum += computedBytes
println "${b} : ${computedBytes}"
}
println checksum +"!=" + 255*count
Just play around with the value of the 'count' variable, which roughly corresponds to the length of your input.
Your checksum will keep incrementing until it rolls over to being negative (as it is a signed long integer)
You can also shorten your method to:
public void addToChecksum( byte[] bytes) {
//if the checksum isn't enabled, don't add
if(!checksumEnabled) {
return;
}
long previouschecksum = this.checksum;
this.checksum += bytes.inject( 0L ) { tot, it -> tot += it & 0xFF }
if(this.checksum < previouschecksum) {
logger.error("Checksum DECREASED: " + this.checksum);
}
//logger.info("Checksum: " + this.checksum);
}
But that won't stop it rolling over to being negative. For the sake of saving 12 bytes per item that you are generating a hash for, I would still suggest that something like MD5, which is known to work, is probably better than rolling your own. However, I understand that sometimes there are crazy requirements you have to stick to.

Resources