KDB: generate random string - string

How can one generate a random string of a given length in KDB? The string should be composed of both upper and lower case alphabet characters as well as digits, and the first character can not be a digit.
Example:
"i0J2Jx3qa" / OK
"30J2Jx3qa" / bad
Thank you very much for your help!

stringLength: 13
randomString: (1 ? .Q.A,.Q.a) , ((stringLength-1) ? .Q.nA,.Q.a)

If you prefer without the repetitions:
raze(1,stringLength-1)?'10 0_\:.Q.nA,.Q.a

For the purposes of creating random data you can also use ?/deal with a number of characters up to 8 as a symbol(which you could string). This doesn't include numbers though so just an alternative approach to your own answer.
1?`8
,`bghgobnj

There's already a fine answer above which has been accepted. Just wanted somewhere to note that if this is to generate truly random data you need to consider randomising your seed. This can be done in Linux by using $RANDOM in bash or reading up to four bytes from /dev/random (relatively recent versions of kdb can read directly from FIFOs).
Otherwise the seed is set to digits from pi: 314159

Related

Make a model to identify a string

I have a string like this
ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8
The first part of the string is a random 18 digit number in base64 format and the second is a unix timestamp in base64 too, while the last is an hmac.
I want to make a model to recognize a string like this.
How may i do it?
While I did not necessarily think deeply about it, this would be what comes to my mind first.
You certainly don't need machine learning for this. In fact, machine learning would not only be inefficient for problems like this but may even be worse, depending on a given approach.
Here, an exact solution can be achieved, simply by understanding the problem.
One way people often go about matching strings with a certain structure is with so called regular expressions or RegExp.
Regular expressions allow you to match string patterns of varying complexity.
To give a simple example in Python:
import re
your_string = "ODQ1OTc3MzY0MDcyNDk3MTUy.YKoz0Q.wlST3vVZ3IN8nTtVX1tz8Vvq5O8"
regexp_pattern = r"(.+)\.(.+)\.(.+)"
re.findall(regexp_pattern, your_string)
>>> [('ODQ1OTc3MzY0MDcyNDk3MTUy', 'YKoz0Q', 'wlST3vVZ3IN8nTtVX1tz8Vvq5O8')]
Now one problem with this is how do you know where your string starts and stops. Most of the times there are certain anchors, especially in strings that were created programmatically. For instance, if we knew that prior to each string you wanted to match there is the word Token: , you could include that in your RegExp pattern r"Token: (.+)\.(.+)\.(.+)".
Other ways to avoid mismatches would be to clearer define the pattern requirements. Right now we simply match a pattern with any amount of characters and two . separating them into three sequences.
If you would know which implementation of base64 you were using, you could limit the alphabet of potential characters from . (thus any) to the alphabet used in your base64 implementation [abcdefgh1234]. In this example it would be abcdefgh1234, so the pattern could be refined like this r"([abcdefgh1234]+).([abcdefgh1234]+).(.+)"`.
The same applies to the HMAC code.
Furthermore, you could specify the allowed length of each substring.
For instance, you said you have 18 random digits. This would likely mean each is encoded as 1 byte, which would translate to 18*8 = 144 bits, which in base64, would translate to 24 tokens (where each encodes a sextet, thus 6 bits of information). The same could be done with the timestamp, assuming a 32 bit timestamp, this would likely necessitate 6 base64 tokens (representing 36 bits, 36 because you could not divide 32 into sextets).
With this information, you could further refine the pattern
r"([abcdefgh1234]{24})\.([abcdefgh1234]{6})\.(.+)"`
In addition, the same could be applied to the HMAC code.
I leave it to you to read a bit about RegExp but I'd guess it is the easiest solution and certainly more appropriate than any kind of machine learning.

How to get pieces value from .torrent file

I am trying to build a .torrent file interpreter. The problem is that I can't seem to understand how to go about interpreting the pieces value. I am aware that the pieces key contains a concatenation of the SHA-1 hashes for each piece and that SHA-1 contains 20 bytes. A result of this is that the final output should be a multiple of 20 bytes. However, after counting the bytes from the pieces value as a string or in hexadecimal form it still does not satisfy this. How should I interpret the pieces key?
Here we use bencode and bdecode, and the pieces value can get easily. I think you need to firstly read BEP for more details. What's more, you can see this and use it as an example.
From looking at a real torrent file, I found that the SHA-1 hashes had to be taken from its hexadecimal string format, but I previously thought that it was wrong because the byte length of the hash was not a multiple of 20. Turns out I forgot to add a trailing 0 to hexadecimals that were only 1 character (e.g. a had to be changed to 0a)

Most efficient way to check if a stringified number is above MAX_UINT64?

Suppose I have a script which is executed by a 64-bit Perl and which is taking one parameter which actually is a number, but of course is a string in the first place (because all command line parameters are strings).
Now, if that parameter's value fits into a 64 bit unsigned int, the script should do something with the parameter; otherwise, it should abort with an appropriate error message.
What would be the most efficient way to check if that parameter (as a string, i.e. before using it in mathematical operations) fits into a 64-bit unsigned integer?
What I already have thought of:
I could do a string comparison
I don't want to do that because in that case I had to cope with collations, and the documentation for Unicode::Collate looks a bit oversized for my small problem.
But this is just a feeling, so I'd be grateful for comments or other opinions.
Side note: I have tried this, and it worked like expected. But this was just a quick test; I did not play around with locales, so on other systems it might not work (although I doubt that there is a collation which puts "2" before "1", but you never know).
Converting to numbers before comparing won't work:
root#spock:/root/test# perl -e '$i="18446744073709551615"+0; $j="18446744073709551616"+0; print "$i $j\n"; print(($i < $j) ? "less\n" : "greater or equal\n")'
18446744073709551615 1.84467440737096e+19
greater or equal
Note how Perl prints the second number. This is the smallest unsigned integer which does not fit into 64 bits, so Perl converts it to a double. When it then compares $i and $j numerically, it has to convert $i to a double as well; due to the loss of precision involved herein, $i is converted to the same value as $j, so the comparison goes wrong.
I could do use bigint;. I have tried this, and it behaved as expected.
But that probably would lead to a dramatic loss of performance. As far as I have understood, use bigint; implies the use of various heavy libraries.
But this is just a feeling as well, so if this is the way to go, please let me know.
Another idea (not tried yet): Could I use pack() to generate a byte sequence from the stringified number somehow? Then I could check the length of that byte sequence. If it is less or equal to 8 bytes, the stringified number fits into a 64-bit unsigned integer.
How would you solve this problem?
use constant MAX_UINT64 = '18446744073709551615';
my $larger_than_max =
length($s) > length(MAX_UINT64)
|| length($s) == length(MAX_UINT64) && $s gt MAX_UINT64;
Assumes input matches /^(?:0|[1-9][0-9]*)\z/. Adjust to liking (e.g. to handle leading zeros or signs).
You can use a simple shortcut that should eliminate most numbers. Any number that has 19 or fewer digits in the decimal representation can fit in a 64 bit integer, so if the length of the string containing the integer is less than 20, it is good.
Any string with length greater than or equal to 21 is bad.
UINT64_MAX is 18446744073709551615. So, there are some numbers with 20 decimal digits can fit a 64 bit unsigned integer. Some can't.
At this point, simple string comparison using ge will be enough because the ordering of Arabic digits is the same regardless of locale.
$ perl -E "say 'yes' if $ARGV[1] ge $ARGV[0]" 18446744073709551615 18446744073709551616
yes
I'll assume the input is a string of digits for clarity.
You ask for the most efficient way. This can't be determined without understanding the distribution of inputs. For example if the inputs are uniform in 128 bit integers, the most efficient is to start with something like:
if (length(#ARGV[0]) > 20) {die "Number too large.\n"}
This deals with over 99.9999999999 % of cases. In fact if the inputs were uniform in 256 bit integers you might be forgiven for simply writing:
warn "Number too large.\n";
As to repeatedly and consistently testing in a reasonable amount of time you could consider something like this regex from Damian Conway's Regexp::Number (for signed 64 bit numbers but the principle is valid). Notice, being real code, it deals with leading zeros.
'0*(?:(?:9(?:[0-1][0-9]{17}' .
'|2(?:[0-1][0-9]{16}' .
'|2(?:[0-2][0-9]{15}' .
'|3(?:[0-2][0-9]{14}' .
'|3(?:[0-6][0-9]{13}' .
'|7(?:[0-1][0-9]{12}' .
'|20(?:[0-2][0-9]{10}' .
'|3(?:[0-5][0-9]{9}' .
'|6(?:[0-7][0-9]{8}' .
'|8(?:[0-4][0-9]{7}' .
'|5(?:[0-3][0-9]{6}' .
'|4(?:[0-6][0-9]{5}' .
'|7(?:[0-6][0-9]{4}' .
'|7(?:[0-4][0-9]{3}' .
'|5(?:[0-7][0-9]{2}' .
'|80(?:[0-6])))))))))))))))))' .
'|[1-8]?[0-9]{0,18})'
This should be blindingly fast compared with perl run-up time for example, or even a keystroke.
As to bigint, it executes very quickly and includes some cool optimization features, but unless you are testing many numbers in code the above should suffice.
If you really want to burn rubber, though, take a look at perl guts, and use something that exposes the macro SvIOK(SV*). (See https://metacpan.org/pod/release/KRISHPL/pod2texi-0.1/perlguts.pod#What-is-an-%22IV%22? for more details.)

Encoding name strings into an unique number

I have a large set of names (millions in number). Each of them has a first name, an optional middle name, and a lastname. I need to encode these names into a number that uniquely represents the names. The encoding should be one-one, that is a name should be associated with only one number, and a number should be associated with only one name.
What is a smart way of encoding this? I know it is easy to tag each alphabet of the name according to its position in the alphabet set (a-> 1, b->2.. and so on) and so a name like Deepa would get -> 455161, but again here I cannot make out if the '16' is really 16 or a combination of 1 and 6.
So, I am looking for a smart way of encoding the names.
Furthermore, the encoding should be such that the number of digits in the output numeral for any name should have fixed number of digits, i.e., it should be independent of the length. Is this possible?
Thanks
Abhishek S
To get the same width numbers, can't you just zero-pad on the left?
Some options:
Sort them. Count them. The 10th name is number 10.
Treat each character as a digit in a base 26 (case insensitive, no
digits) or 52 (case significant, no digits) or 36 (case insensitive
with digits) or 62 (case sensitive with digits) number. Compute the
value in an int. EG, for a name of "abc", you'd have 0 * 26^2 + 1 *
26^1 + 2 * 20^0. Sometimes Chinese names may use digits to indicate tonality.
Use a "perfect hashing" scheme: http://en.wikipedia.org/wiki/Perfect_hash_function
This one's mostly suggested in fun: use goedel numbering :). So
"abc" would be 2^0 * 3^1 * 5^2 - it's a product of powers of primes.
Factoring the number gives you back the characters. The numbers
could get quite large though.
Convert to ASCII, if you aren't already using it. Then treat each
ordinal of a character as a digit in a base-256 numbering system.
So "abc" is 0*256^2 + 1*256^1 + 2*256^0.
If you need to be able to update your list of names and numbers from time to time, #2, #4 and #5 should work. #1 and #3 would have problems. #5 is probably the most future-proofed, though you may find you need unicode at some point.
I believe you could do unicode as a variant of #5, using powers of 2^32 instead of 2^8 == 256.
What you are trying to do there is actually hashing (at least if you have a fixed number of digits). There are some good hashing algorithms with few collisions. Try out sha1 for example, that one is well tested and available for modern languages (see http://en.wikipedia.org/wiki/Sha1) -- it seems to be good enough for git, so it might work for you.
There is of course a small possibility for identical hash values for two different names, but that's always the case with hashing and can be taken care of. With sha1 and such you won't have any obvious connection between names and IDs, which can be a good or a bad thing, depending on your problem.
If you really want unique ids for sure, you will need to do something like NealB suggested, create IDs yourself and connect names and IDs in a Database (you could create them randomly and check for collisions or increment them, starting at 0000000000001 or so).
(improved answer after giving it some thought and reading the first comments)
You can use the BigInteger for encoding arbitrary strings like this:
BigInteger bi = new BigInteger("some string".getBytes());
And for getting the string back use:
String str = new String(bi.toByteArray());
I've been looking for a solution to a problem very similar to the one you proposed and this is what I came up with:
def hash_string(value):
score = 0
depth = 1
for char in value:
score += (ord(char)) * depth
depth /= 256.
return score
If you are unfamiliar with Python, here's what it does.
The score is initially 0 and the depth are set to 1
For every character add the ord value * the depth
The ord function returns the UTF-8 value (0-255) for each character
Then it's multiplied by the 'depth'.
Finally the depth is divided by 256.
Essentially, the way that it works is that the initial characters add more to the score while later characters contribute less and less. If you need an integer, multiply the end score by 2**64. Otherwise you will have a decimal value between 0-256. This encoding scheme works for binary data as well as there are only 256 possible values in a byte/char.
This method works great for smaller string values, however, for longer strings you will notice that the decimal value requires more precision than a regular double (64-bit) can provide. In Java, you can use the 'BigDecimal' and in Python use the 'decimal' module for added precision. A bonus to using this method is that the values returned are in sorted order so they can be searched 'efficiently'.
Take a look at https://en.wikipedia.org/wiki/Huffman_coding. That is the standard approach.
You can translate it, if every character (plus blank, at least) will occupy a position.
Therefore ABC, which is 1,2,3 has to be translated to
1*(2*26+1)² + 2*(53) + 3
This way, you could encode arbitrary strings, but if the length of the input isn't limited (and how should it?), you aren't guaranteed to have an upper limit for the length.

What is a good method for obfuscating a base 64 string?

Base64 encoding is often used to obfuscate plaintext, I am wondering if there are any quick/easy ways of obfuscating a base 64 string, so that it is not easily recognizeable as such. To do so the method should obfuscate the padding characters (='s) such that they become some other symbol and are more dispersed.
Does anyone know of an easy (and easily reversible) way to do this?
You could use a shift cipher, but I am looking for something that's a little more comprehensive, for example if my shift cipher mapped = to a, someone might notice a string that frequently ends in a's.
The purpose is not to add security, it is actually simply to make base64 unrecognizeable as base 64. It also does not need to pass a security proffesional, just an individual that knows what base64 is and what it looks like. Ex (='s at the end etc.)
The method I describe would probably add non base 64 characters, like ^%$##!, to help obfuscate the reader.
Most of the replies seem to be on the topic of WHY I would want to do this, and the basic answer is that the operation would be completed numerous times (So I want something inexpensive), and done in a way where no password can be remembered (Why I don't XOR). Also the data isn't highly sensitive, and is just to be used as a method against the casual user, who might have knowledge of what a base 64 string is.
A couple of suggestions:
Strip any ending = (according to Wikipedia they are no needed) and then bitwise negate each byte. This will transform the text into mostly non-readable characters.
Loop over the data and xor each character with it's position, modulo 256. This will eliminate any simple statistical analysis since the mapping of each character depends on the position in the string.
In contrast to one of the points in Anders Abel's best answer, the = signs in the base64 strings seem to matter:
$ echo -n foobar | base64
Zm9vYmFy
$ echo -n foobar1 | base64
Zm9vYmFyMQ==
$ echo -n Zm9vYmFyMQ | base64 -D
foobar$ echo -n Zm9vYmFyMQ= | base64 -D
foobar$ echo -n Zm9vYmFyMQ== | base64 -D
foobar1$
What you are asking for is called "security by obscurity" and generally is a bad idea.
Base64 encoding was never designed or intended to be used to obfuscate text or data. Its used to encode binary data which needs to travel trough some communication channel which allows only ASCII characters - like email messages, or be part of XML, etc.
Better use real encryption if you want to hide the data. In any case, even after encrypting the data, you need to pass it as XML, etc., you may end up again encode it in Base64 for transport purposes.
I suppose you could generate a small amount of random data, and then use that to encode the Base64 characters. Prepend the random data to the re-encoded Base64 data.
A very simple example: given an input string "Hello", generate a random number in the range 1-9 and use that as the offset to apply to each input character. Suppose you generate "5", then the re-encoded string would be "5Mjqqt". Or encode the offset as a letter rather than as a number (a=1, b=2, ...) Then the "=" padding will be translated to a different character each time.
Or you could just drop the padding; according to the Wikipedia article, it's not really necessary.
(But consider whether this is really a necessary and sufficient thing to be doing in the first place. It's not clear from your question why you want to obfuscate base 64 data.)
agreed with the responses suggesting use of encryption if your requirements are to actually keep someone who is determined to decode the data from reversing the process.
otherwise, the answer somewhat depends on other constraints of your system, but a few ideas came to mind. if you're just concerned about the delimiter characters, and you have control over the process that generates the Base64 to begin with, you could choose some method of padding the data prior to conversion, thus eliminating the '=' characters from the output.
along this same vein, you could use one of the variants like 'base64url' encoding (see http://en.wikipedia.org/wiki/Base64 for lots of good info on the variants) that does not use the pad character.
after eliminating the '=' by one of these methods, you could perhaps do some sort of char-swapping on the generated Base64, just swapping every other character, just leaving any final character in place. you could also perhaps do some sort of substitution of the upper- or lowercase letters into some other characters to make it look less like Base64 to a quick glance.
however, whatever idea you choose, just remember that it will not be a substitute for a real encryption scheme if you require real protection of that data.
Base64 usually used when you want your data goes through some channel that can distort non-alpha-numeric symbols - for example in XML. If it is your task too - your code will be similar to Base64 no matter how you try :)
If your channel handles binary data well - then just get source text (decode Base64 back), get binary representation for it and use some sort of xor. For example make xor 37 with every byte in source bytes. The same operation will restore your text back.
But it still easily recognizable by anyone who has basic knowledge of cryptanalysis. If it is a problem - use real encryption.

Resources