How to generate a random alphanumeric string with Erlang?

I'm trying to generate a random alphanumeric ID with Erlang.
I naively tried crypto:strong_rand_bytes(Bytes) to generate a random binary, and then used that binary as if it had been created with <<"my_unique_random_id">>. That didn't work, because random bits are not necessarily a valid UTF-8 string, right?
Well, I looked for other options in the Erlang docs and elsewhere, but I didn't find anything. Could someone point me to a solution?

It might depend on the randomness you need. Erlang's crypto module produces cryptographically stronger random data than the legacy random module (see also [erlang-questions] Yaws security alert - Yaws 1.93 and this question). If you want to use strong_rand_bytes to generate an ID, taking the base64 of it might be enough:
> base64:encode(crypto:strong_rand_bytes(Bytes)).
Note that base64 output is not strictly alphanumeric: it can contain +, / and = padding, so strip or replace those characters if you need only letters and digits. You can turn the binary into a list with binary_to_list/1 if needed.

According to Generating random strings in Erlang, it only takes a few lines of Erlang to generate a string of a specified length from a given set of characters:
get_random_string(Length, AllowedChars) ->
    lists:foldl(fun(_, Acc) ->
                        %% prepend one random character from AllowedChars
                        %% (rand:uniform/1 replaces the removed random module)
                        [lists:nth(rand:uniform(length(AllowedChars)),
                                   AllowedChars)]
                        ++ Acc
                end, [], lists:seq(1, Length)).
The blog post has a line-by-line explanation of the code. Look to the comments for a couple of optimization tips.

I have prepared a small module to do this.
It uses crypto:rand_uniform/2 rather than the obsolete random:uniform/1 (note that crypto:rand_uniform/2 has itself since been deprecated; on recent OTP releases prefer rand or crypto:strong_rand_bytes/1).
-module(cloud_rnd).
-export([rnd_chars/1, rnd_numbers/1, rnd_chars_numbers/1]).

rnd_chars(L) -> get_rnd(L, chars).
rnd_numbers(L) -> get_rnd(L, numbers).
rnd_chars_numbers(L) -> get_rnd(L, chars_numbers).

get_rnd(L, chars) -> gen_rnd(L, "abcdefghijklmnopqrstuvwxyz");
get_rnd(L, numbers) -> gen_rnd(L, "1234567890");
get_rnd(L, chars_numbers) -> gen_rnd(L, "abcdefghijklmnopqrstuvwxyz1234567890").

gen_rnd(Length, AllowedChars) ->
    MaxLength = length(AllowedChars),
    %% crypto:rand_uniform(Lo, Hi) returns N with Lo =< N < Hi, so the
    %% upper bound must be MaxLength + 1, or the last character in
    %% AllowedChars would never be chosen.
    lists:foldl(
        fun(_, Acc) -> [lists:nth(crypto:rand_uniform(1, MaxLength + 1), AllowedChars)] ++ Acc end,
        [], lists:seq(1, Length)
    ).

The problem with responses to the various "I need random strings" questions (in whatever language) is that almost every solution uses a flawed specification, namely string length. The questions themselves rarely reveal why the random strings are needed, but I will boldly assume they are to be used as identifiers which need to be unique.
There are two leading ways to get strictly unique strings: deterministically (which is not random) and store/compare (which is onerous). What to do? Give up the ghost. Go with probabilistic uniqueness instead. That is, accept that there is some (however small) risk that your strings won't be unique. This is where understanding collision probability and entropy is helpful.
So I'll rephrase my bold assumption: you need some number of identifiers with a small risk of repeat. As a concrete example, let's say you need 5 million IDs with a less than 1 in a trillion risk of repeat. So what length of string do you need? Well, that question is underspecified, as it depends on the characters used. But more importantly, it's misguided. What you need is a specification of the entropy of the strings, not their length.
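As a rough sanity check, here is a minimal sketch of the underlying math (the standard birthday-bound approximation, not code from any particular library): for N strings drawn uniformly from 2^B possibilities, the probability of at least one repeat is approximately N^2 / 2^(B+1), so a repeat risk of 1 in R requires about
B = log2(N^2 * R / 2) = log2((5.0e6)^2 * 1.0e12 / 2) ≈ 83.37 bits
which matches the value computed below.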
This is where EntropyString can help.
Bits = entropy_string:bits(5.0e6, 1.0e12).
83.37013046707142
entropy_string:random_string(Bits).
<<"QDrjGQFGgGjJ4t9r2">>
There are other predefined character sets, and you can specify your own characters as well (though for efficiency reasons only sets whose size is a power of 2 are supported). And best of all, the risk of repeat across the specified number of strings is explicit. No more guessing at string length.

%% Generate a random lowercase string of length N.
%% rand:uniform(26) yields 1..26; adding 96 maps that to $a..$z (97..122).
randchar(N) ->
    randchar(N, []).

randchar(0, Acc) ->
    Acc;
randchar(N, Acc) ->
    randchar(N - 1, [rand:uniform(26) + 96 | Acc]).

You may use the function uef_bin:random_latin_binary/2 from here:
https://github.com/DOBRO/uef-lib#uef_binrandom_latin_binary2
Bin = uef_bin:random_latin_binary(Length, any)
And then, if you need a string() type:
String = erlang:binary_to_list(Bin)

Related

Make a model to identify a string

I have a string like this
ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8
The first part of the string is a random 18-digit number in base64 format, the second is a Unix timestamp in base64 too, and the last is an HMAC.
I want to make a model to recognize a string like this.
How may I do it?
I did not necessarily think deeply about it, but this is what comes to my mind first.
You certainly don't need machine learning for this. In fact, machine learning would not only be inefficient for problems like this but may even be worse, depending on a given approach.
Here, an exact solution can be achieved, simply by understanding the problem.
One way people often go about matching strings with a certain structure is with so-called regular expressions (regex).
Regular expressions allow you to match string patterns of varying complexity.
To give a simple example in Python:
import re
your_string = "ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"
regexp_pattern = r"(.+)\.(.+)\.(.+)"
re.findall(regexp_pattern, your_string)
>>> [('ODQ1OTc3MzY0MDcyNDk3MTUy', 'YKoz0Q', 'wlST3vVZ3IN8nTtVX1tz8Vvq5O8')]
Now one problem with this is: how do you know where your string starts and stops? Most of the time there are certain anchors, especially in strings that were created programmatically. For instance, if we knew that prior to each string you wanted to match there is the word Token: , you could include that in your regex pattern r"Token: (.+)\.(.+)\.(.+)".
Another way to avoid mismatches is to define the pattern requirements more clearly. Right now we simply match a pattern with any number of characters and two literal dots separating them into three sequences.
If you knew which base64 implementation you were using, you could limit the alphabet of potential characters from . (thus any) to the alphabet used by that implementation. If, hypothetically, that alphabet were abcdefgh1234, the pattern could be refined to r"([abcdefgh1234]+)\.([abcdefgh1234]+)\.(.+)".
The same applies to the HMAC code.
Furthermore, you could specify the allowed length of each substring.
For instance, you said you have an 18-digit random number. This likely means each digit is encoded as 1 byte, which translates to 18*8 = 144 bits; in base64 that translates to 24 tokens (each token encodes a sextet, i.e. 6 bits of information). The same can be done with the timestamp: assuming a 32-bit timestamp, this likely requires 6 base64 tokens (representing 36 bits; 36 because 32 cannot be divided evenly into sextets).
With this information, you could further refine the pattern
r"([abcdefgh1234]{24})\.([abcdefgh1234]{6})\.(.+)"`
In addition, the same could be applied to the HMAC code.
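To make that concrete, here is a small runnable sketch. The 24/6 length constraints come from the reasoning above; the standard base64 alphabet [A-Za-z0-9+/] is an assumption on my part, since the question does not say which variant was used:
import re

token = "ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"

# 24 base64 chars, a literal dot, 6 base64 chars, a dot, then the HMAC part
pattern = r"([A-Za-z0-9+/]{24})\.([A-Za-z0-9+/]{6})\.([A-Za-z0-9+/_-]+)"

match = re.fullmatch(pattern, token)
if match:
    random_id, timestamp, hmac = match.groups()
    print(random_id, timestamp, hmac)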
I leave it to you to read a bit more about regular expressions, but I'd guess they are the easiest solution here and certainly more appropriate than any kind of machine learning.

Any algorithm that mangles/hashes a string but can be matched against?

Usage case: a client needs to send a huge string over HTTP, and the server replies whether the string contains some substring. However, the huge string is huge, so this system is really inefficient. Moreover, the huge string contains some sensitive info, so this is really insecure.
Is there some pseudo-hashing mechanism that somehow summarizes a big string into some number, which all substrings of this big string would hash to the same number, but non-substrings will with high probability not hash to this big string?
Is there some pseudo-hashing mechanism that somehow summarizes a big string into some number, which all substrings of this big string would hash to the same number, but non-substrings will with high probability not hash to this big string?
No.
Let f be such a hash. Consider a string s and non-substring t. Note that s and t are substrings of s + t. Therefore, s and t have the same hash (i.e., f(s) = f(t) = f(s + t)). This is contrary to the requirement that f(s) != f(t) with high probability.
In particular, with s = "", we see that all strings t have f(s) = f(t), so that f is constant and equal to f("").
Is there some pseudo-hashing mechanism that somehow summarizes a big string into some number, which all substrings of this big string would hash to the same number, but non-substrings will with high probability not hash to this big string?
I guess I'll have to explain why this won't happen:
String string = "the quick brown fox jumps over the lazy dog";
According to your request, every single letter of this string must hash to the same value as the whole string. Hashing algorithms are deterministic. So in this example, t -> 5, h -> 5, e -> 5... and so on. But if you have some other string:
String string2 = "hello there";
then you want h to hash to something different, and you want e to hash to something different: given the exact same input, you want a different value. This defeats the definition of a mathematical function.
What does this mean?
Without determinism in your function, there is no repeatable mapping between a hash value and the letter being hashed, meaning your data is meaningless.
If you have a constant length for the substrings, you could do what many file-sharing programs do and use a list of hashes, or something like the Tiger tree hash.
List of hashes: Make a hash for every chunk of the file of some pre-set length (say 64kB), then transmit a list of these hashes so these chunks can be verified.
Tiger-Tree hash: http://en.wikipedia.org/wiki/Merkle_tree#Tiger_tree_hash
Basically build a binary tree of hashes with the leaves being hashes of chunks like in a list of hashes.
If you need to match to every possible substring instead of just pre-defined chunks this isn't going to work though.
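A minimal sketch of the list-of-hashes idea in Python (the 64 kB chunk size follows the example above; the function names and the choice of SHA-256 are my own, not any particular file-sharing protocol):
import hashlib

CHUNK_SIZE = 64 * 1024  # 64 kB, as in the example above

def hash_list(data: bytes) -> list:
    # One hash per fixed-size chunk; the receiver can verify each
    # chunk independently against this list.
    return [hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

def verify_chunk(chunk: bytes, index: int, hashes: list) -> bool:
    return hashlib.sha256(chunk).hexdigest() == hashes[index]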
Matching all possible substrings doesn't sound viable, but I imagine you have some constraints on your substrings that you haven't yet told us about.
If your substrings are block-aligned or whitespace-aligned or something, you might look into using a Bloom filter, e.g. https://pypi.python.org/pypi/drs-bloom-filter/1.01 . Bloom filters can store members of a set and be used for testing set membership, at times with as little as one bit per element. They do sometimes give false positives, but with a user-adjustable probability of a false positive.
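For illustration, here is a toy Bloom filter in Python (a from-scratch sketch, not the API of the package linked above; the size and hash count are arbitrary assumptions):
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a big integer used as a bit array

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: bytes):
        # May give false positives, never false negatives.
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
bf.add(b"some aligned substring")
print(b"some aligned substring" in bf)  # True
print(b"something else" in bf)          # almost certainly False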

Reversing a string in OCaml

I have this function for reversing strings in OCaml; however, it says that I have my types wrong. I am unsure as to why, or what I can do :(
Any tips on debugging would also be greatly appreciated!
let reverse s =
  let rec helper i =
    if i >= String.length s then "" else (helper (i + 1)) ^ (s.[i])
  in
  helper 0
Error: This expression has type char but an expression was expected of type
string
Thank you
Your implementation does not have the expected (linear) time and space complexity: it is quadratic in both time and space, so it is hardly a correct implementation of the requested feature.
String concatenation sa ^ sb allocates a new string of size length sa + length sb and fills it with the two strings; this means that both its time and space complexity are linear in the sum of the lengths. When you iterate this operation once per character, you get an algorithm of quadratic complexity (the total size of memory allocated, and the total number of copies, will be 1 + 2 + 3 + ... + n).
To correctly implement this algorithm, you could either:
allocate a string of the expected size, and mutate it in place with the content of the input string, reversed
create a string list made of reversed size-one strings, then use String.concat to concatenate all of them at once (which allocates the result and copies the strings only once)
use the Buffer module which is meant to accumulate characters or strings iteratively without exhibiting a quadratic behavior (it uses a dynamic resizing policy that makes addition of a char amortized constant time)
The first approach is both the simplest and the fastest, but the other two will get more interesting in more complex application where you want to concatenate strings, but it's less straightforward to know in one step what the final result will be.
The error message is pretty clear, I think. The expression s.[i] represents a character (the ith character of the string). But the ^ operator requires strings as its arguments.
To get past the problem you can use String.make 1 s.[i]. This expression gives a 1-character string containing the single character s.[i].
Handling strings recursively in OCaml isn't as nice as it could be, because there's no nice way to destructure a string (break it into parts). The equivalent code to reverse a list looks a lot prettier. For what it's worth :-)
You can also use 3rd party libraries to do so. http://batteries.forge.ocamlcore.org/ already implements a function for reversing strings.

Encoding name strings into a unique number

I have a large set of names (millions in number). Each of them has a first name, an optional middle name, and a last name. I need to encode these names into a number that uniquely represents the name. The encoding should be one-to-one: a name should be associated with only one number, and a number should be associated with only one name.
What is a smart way of encoding this? I know it is easy to tag each letter of the name according to its position in the alphabet (a -> 1, b -> 2, and so on), so a name like Deepa would become 455161, but then I cannot make out whether the '16' is really 16 or a combination of 1 and 6.
So, I am looking for a smart way of encoding the names.
Furthermore, the number of digits in the output numeral should be the same for any name, i.e., independent of the length of the name. Is this possible?
Thanks
Abhishek S
To get the same width numbers, can't you just zero-pad on the left?
Some options:
1. Sort them. Count them. The 10th name is number 10.
2. Treat each character as a digit in a base-26 (case-insensitive, no digits), base-52 (case-sensitive, no digits), base-36 (case-insensitive with digits) or base-62 (case-sensitive with digits) number. Compute the value in an int. E.g., for a name of "abc", you'd have 0*26^2 + 1*26^1 + 2*26^0. Sometimes Chinese names may use digits to indicate tonality.
3. Use a "perfect hashing" scheme: http://en.wikipedia.org/wiki/Perfect_hash_function
4. This one's mostly suggested in fun: use Goedel numbering :). Mapping a -> 1, b -> 2, c -> 3, "abc" would be 2^1 * 3^2 * 5^3, a product of powers of primes. Factoring the number gives you back the characters. The numbers could get quite large though.
5. Convert to ASCII, if you aren't already using it. Then treat each ordinal of a character as a digit in a base-256 numbering system. So "abc" is 97*256^2 + 98*256^1 + 99*256^0 (see the sketch after this answer).
If you need to be able to update your list of names and numbers from time to time, #2, #4 and #5 should work; #1 and #3 would have problems. #5 is probably the most future-proofed, though you may find you need Unicode at some point.
I believe you could do Unicode as a variant of #5, using powers of 2^32 instead of 2^8 == 256.
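A minimal sketch of option #5 in Python (the function names are mine, for illustration; Python's arbitrary-precision int absorbs the large values):
def name_to_number(name: str) -> int:
    # Interpret the ASCII bytes as digits of a base-256 number,
    # most significant byte first: "abc" -> 97*256**2 + 98*256 + 99.
    return int.from_bytes(name.encode("ascii"), "big")

def number_to_name(number: int) -> str:
    length = (number.bit_length() + 7) // 8
    return number.to_bytes(length, "big").decode("ascii")

print(name_to_number("abc"))    # 6382179
print(number_to_name(6382179))  # abc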
What you are trying to do there is actually hashing (at least if you have a fixed number of digits). There are some good hashing algorithms with few collisions. Try out SHA-1, for example; it is well tested and available in modern languages (see http://en.wikipedia.org/wiki/Sha1) -- it seems to be good enough for git, so it might work for you.
There is of course a small possibility of identical hash values for two different names, but that's always the case with hashing and can be taken care of. With SHA-1 and the like you won't have any obvious connection between names and IDs, which can be a good or a bad thing, depending on your problem.
If you really want guaranteed-unique IDs, you will need to do something like NealB suggested: create the IDs yourself and connect names and IDs in a database (you could create them randomly and check for collisions, or increment them, starting at 0000000000001 or so).
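A quick sketch of the hashing route in Python, using the standard hashlib module (the field separator and helper name are my own illustration):
import hashlib

def name_id(first: str, middle: str, last: str) -> str:
    # Join with a separator so ("ab", "c") and ("a", "bc") hash differently.
    full_name = "\x1f".join([first, middle, last])
    return hashlib.sha1(full_name.encode("utf-8")).hexdigest()

print(name_id("John", "Q", "Public"))  # 40 hex digits, fixed width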
(improved answer after giving it some thought and reading the first comments)
You can use the BigInteger for encoding arbitrary strings like this:
BigInteger bi = new BigInteger("some string".getBytes(StandardCharsets.UTF_8));
And for getting the string back use:
String str = new String(bi.toByteArray(), StandardCharsets.UTF_8);
(Passing an explicit charset makes encoding and decoding agree across platforms. Note that leading zero bytes do not survive the round trip, and an input whose first byte is >= 0x80 produces a negative BigInteger, though it still round-trips.)
I've been looking for a solution to a problem very similar to the one you proposed and this is what I came up with:
def hash_string(value):
    # Earlier characters carry full weight; each later character
    # contributes 1/256 as much as the one before it.
    score = 0
    depth = 1
    for char in value:
        score += ord(char) * depth
        depth /= 256.0
    return score
If you are unfamiliar with Python, here's what it does.
The score is initially 0 and the depth is set to 1.
For every character, add its ord value times the current depth.
The ord function returns the character's ordinal value (0-255 for ASCII and byte data; arbitrary Unicode code points can be larger, so strictly speaking the scheme works per byte).
Finally, the depth is divided by 256.
Essentially, the way it works is that the initial characters add more to the score while later characters contribute less and less. For example, hash_string("abc") is 97 + 98/256 + 99/65536 ≈ 97.3843. If you need an integer, multiply the end score by 2**64. Otherwise you will have a decimal value between 0 and 256. This encoding scheme works for binary data as well, since there are only 256 possible values in a byte.
This method works great for smaller string values; however, for longer strings the decimal value requires more precision than a regular double (64-bit) can provide. In Java you can use BigDecimal, and in Python the decimal module, for added precision. A bonus of this method is that the returned values preserve the lexicographic order of the inputs, so they can be searched 'efficiently'.
Take a look at https://en.wikipedia.org/wiki/Huffman_coding. That is the standard approach.
You can translate it, if every character (plus blank, at least) occupies a position. With 2*26 letters plus blank you get a 53-symbol alphabet, so ABC, which is 1,2,3, translates to
1*53^2 + 2*53 + 3
This way you can encode arbitrary strings, but if the length of the input isn't limited (and how should it be?), you aren't guaranteed an upper limit on the size of the number.
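A small sketch of this positional scheme in Python (the blank -> 0, A -> 1 ... Z -> 26, a -> 27 ... z -> 52 mapping is my own choice for illustration):
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET)  # 53

def encode(name: str) -> int:
    # Horner's rule: "ABC" -> (1*53 + 2)*53 + 3 = 1*53**2 + 2*53 + 3
    n = 0
    for ch in name:
        n = n * BASE + ALPHABET.index(ch)
    return n

def decode(n: int) -> str:
    # Note: leading blanks (digit 0) vanish, just like leading zeros.
    chars = []
    while n:
        n, digit = divmod(n, BASE)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))

print(encode("ABC"))  # 2918
print(decode(2918))   # ABC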

String recurring subsequences and compression

I'd like to do some kind of "search and replace" algorithm which will, in an efficient manner if possible, identify a substring of a string which occurs more than once and replace all occurrences of that substring with a token.
For example, given a string "AbcAdAefgAbijkAblmnAbAb", notice that "A" recurs, so reduce in pass one to "#1bc#1d#1efg#1bijk#1blmn#1b#1b" where #_ is an indexed pattern (we note the patterns in an indexed table), then notice that "#1b" recurs so reduce to "#2c#1d#1efg#2ijk#2lmn#2#2". No more patterns occur in the string so we're done.
I have found some information on "longest common subsequences" and compression algorithms, but nothing that seems to do this. They are either for comparing two strings or for producing some kind of storage-optimal result.
My objective, on the other hand, is to reduce the genome to its "words" instead of "letters". I.e., instead of gatcatcgatc I want to see 2c1c2c. I could do some regex afterwards to find things like "#42*#42"; it would be cool to see recurring brackets in DNA.
If I could just find this online I would skip doing it myself, but I can't see this question answered before in terms I could uncover. Many thanks to anyone who can point me in the right direction.
The byte pair encoding does something pretty close to what you want.
Rather than searching directly for the longest repeated string (top-down),
each pass of byte pair encoding searches for repeated byte pairs (bottom-up).
But eventually it discovers the longest repeated string(*).
gatcatcgatc
1=at     g1c1cg1c
2=atc    g22g2
3=gatc   323
As you can see, it has found the longest repeated string "gatc".
(*) byte pair encoding either eventually finds the longest repeated string,
or else it stops early after making (2^8 - uniquechars(source) ) substitutions.
I suspect it may be possible to tweak byte pair encoding so that the early-stop condition is relaxed a little -- perhaps (2^9 - uniquechars(source) ) or 2^12 or 2^16.
Even if that hurts compression performance, perhaps it will give interesting results for applications like yours.
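Here is a compact sketch of the pass loop in Python (tokens are kept as a list of strings rather than single bytes, so the 2^8 limit above does not apply; this is an illustration, not an optimized implementation):
from collections import Counter

def byte_pair_encode(text):
    tokens = list(text)
    rules = []  # (new_symbol, pair) substitutions, in order
    while True:
        # Count adjacent pairs in the current token stream.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats any more
        symbol = "#%d" % (len(rules) + 1)
        rules.append((symbol, pair))
        # Replace non-overlapping occurrences of the pair, left to right.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(symbol)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, rules

tokens, rules = byte_pair_encode("gatcatcgatc")
print("".join(tokens))  # #3#2#3, with #1=at, #2=#1c, #3=g#2
print(rules)
On the gatcatcgatc example this reproduces the passes shown above.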
Wikipedia: byte pair encoding
Stack Overflow: optimizing byte-pair encoding
