Generating numbers sequence containing leading zeros - linux

I'm looking for a way to create a sequence of numbers with a number of leading zeros.
For example, starting from 00000001 to 1000000000.
I tried many variations of seq, but I was never able to produce a sequence that starts at the padded number shown and runs to the desired end number.
Generating a billion numbers with seq would likely take 20 minutes on my machine, so printf, which is a ton slower, is probably not an option. For starters, I'd like to achieve this with seq only, if possible.
Any ideas?

To generate a sequence with a specified format, use the -f option like this:
seq -f "%08g" 1000000000
You didn't say how you intend to use such a large sequence, but parallelizing the generation would likely make the final solution more efficient.
Edit:
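Version without the shortest representation (%g switches to exponent notation, e.g. 1e+06, once a value reaches one million, so a fixed-point format is used instead):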
seq -f "%08.0f" 10000000
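For comparison, the same padding rule can be sketched in Python. This is only an illustration of the format behaviour, not a replacement for seq; the function name zero_padded and its width parameter are made up for this example. It also shows why a %g format breaks for large values while a fixed-point format does not.

```python
def zero_padded(start, stop, width=8):
    """Yield start..stop (inclusive), left-padded with zeros to at least `width` digits."""
    for n in range(start, stop + 1):
        # Numbers wider than `width` are printed in full, never truncated.
        yield f"{n:0{width}d}"


print(list(zero_padded(1, 3)))   # ['00000001', '00000002', '00000003']
print("%08g" % 1000000)          # 0001e+06  (shortest representation kicks in)
print("%08.0f" % 1000000)        # 01000000  (fixed-point keeps every digit)
```

Because the function is a generator, it never holds the whole sequence in memory, which matters when the end value is around a billion.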


KDB: generate random string

How can one generate a random string of a given length in KDB? The string should be composed of upper and lower case letters as well as digits, and the first character cannot be a digit.
Example:
"i0J2Jx3qa" / OK
"30J2Jx3qa" / bad
Thank you very much for your help!
stringLength: 13
randomString: (1 ? .Q.A,.Q.a) , ((stringLength-1) ? .Q.nA,.Q.a)
If you prefer without the repetitions:
raze(1,stringLength-1)?'10 0_\:.Q.nA,.Q.a
For the purposes of creating random data, you can also use ? (deal) to generate a random symbol of up to 8 characters, which you could then string. This doesn't include digits, though, so it is just an alternative approach to your own answer.
1?`8
,`bghgobnj
There's already a fine answer above which has been accepted. Just wanted somewhere to note that if this is to generate truly random data you need to consider randomising your seed. This can be done in Linux by using $RANDOM in bash or reading up to four bytes from /dev/random (relatively recent versions of kdb can read directly from FIFOs).
Otherwise the seed is set to digits from pi: 314159
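To make the constraint and the seeding remark concrete, here is a minimal sketch of the same idea in Python rather than q. The function name random_identifier is hypothetical. It draws the first character from letters only and the rest from letters plus digits, and by default it uses random.SystemRandom, which takes its randomness from the operating system instead of a fixed seed.

```python
import random
import string


def random_identifier(length, rng=None):
    """Return a random alphanumeric string whose first character is a letter."""
    if length < 1:
        raise ValueError("length must be at least 1")
    # Assumption for this sketch: OS-provided entropy unless a generator is passed in.
    rng = rng or random.SystemRandom()
    letters = string.ascii_letters
    alphanumeric = letters + string.digits
    first = rng.choice(letters)  # never a digit
    rest = "".join(rng.choice(alphanumeric) for _ in range(length - 1))
    return first + rest


print(random_identifier(13))
```

Passing random.Random(some_seed) as rng gives reproducible output, which mirrors the fixed default seed described above.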

Most efficient way to check if a stringified number is above MAX_UINT64?

Suppose I have a script which is executed by a 64-bit Perl and which is taking one parameter which actually is a number, but of course is a string in the first place (because all command line parameters are strings).
Now, if that parameter's value fits into a 64 bit unsigned int, the script should do something with the parameter; otherwise, it should abort with an appropriate error message.
What would be the most efficient way to check if that parameter (as a string, i.e. before using it in mathematical operations) fits into a 64-bit unsigned integer?
What I already have thought of:
I could do a string comparison
I don't want to do that because I would then have to cope with collations, and the documentation for Unicode::Collate looks a bit oversized for my small problem.
But this is just a feeling, so I'd be grateful for comments or other opinions.
Side note: I have tried this, and it worked as expected. But this was just a quick test; I did not play around with locales, so on other systems it might not work (although I doubt that there is a collation which puts "2" before "1", but you never know).
Converting to numbers before comparing won't work:
root@spock:/root/test# perl -e '$i="18446744073709551615"+0; $j="18446744073709551616"+0; print "$i $j\n"; print(($i < $j) ? "less\n" : "greater or equal\n")'
18446744073709551615 1.84467440737096e+19
greater or equal
Note how Perl prints the second number. This is the smallest unsigned integer which does not fit into 64 bits, so Perl converts it to a double. When it then compares $i and $j numerically, it has to convert $i to a double as well; due to the loss of precision involved herein, $i is converted to the same value as $j, so the comparison goes wrong.
I could use the bigint pragma (use bigint;). I have tried this, and it behaved as expected.
But that probably would lead to a dramatic loss of performance. As far as I have understood, use bigint; implies the use of various heavy libraries.
But this is just a feeling as well, so if this is the way to go, please let me know.
Another idea (not tried yet): Could I use pack() to generate a byte sequence from the stringified number somehow? Then I could check the length of that byte sequence. If it is less than or equal to 8 bytes, the stringified number fits into a 64-bit unsigned integer.
How would you solve this problem?
use constant MAX_UINT64 => '18446744073709551615';
my $larger_than_max =
length($s) > length(MAX_UINT64)
|| length($s) == length(MAX_UINT64) && $s gt MAX_UINT64;
Assumes input matches /^(?:0|[1-9][0-9]*)\z/. Adjust to liking (e.g. to handle leading zeros or signs).
You can use a simple shortcut that should eliminate most numbers. Any number that has 19 or fewer digits in the decimal representation can fit in a 64 bit integer, so if the length of the string containing the integer is less than 20, it is good.
Any string with length greater than or equal to 21 is bad.
UINT64_MAX is 18446744073709551615. So some numbers with 20 decimal digits fit into a 64-bit unsigned integer, and some don't.
At this point, simple string comparison using ge will be enough because the ordering of Arabic digits is the same regardless of locale.
$ perl -E 'say "yes" if $ARGV[1] ge $ARGV[0]' 18446744073709551615 18446744073709551616
yes
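The length-then-compare check from the two answers above can be sketched as follows. This is written in Python only for illustration; the name fits_uint64 is made up, and the input rule (plain ASCII digits, no sign, no leading zeros) is the same assumption the first answer states. Python compares strings by code point, so the comparison does not depend on any locale.

```python
import re

MAX_UINT64 = "18446744073709551615"
# Assumed input format: "0" or a digit string without leading zeros.
_DECIMAL = re.compile(r"0|[1-9][0-9]*")


def fits_uint64(text):
    """Return True if the decimal string `text` is at most 2**64 - 1."""
    if not _DECIMAL.fullmatch(text):
        raise ValueError(f"not a plain decimal number: {text!r}")
    if len(text) != len(MAX_UINT64):
        # Fewer digits always fit, more digits never do.
        return len(text) < len(MAX_UINT64)
    # Same number of digits: lexicographic order equals numeric order.
    return text <= MAX_UINT64


print(fits_uint64("18446744073709551615"))  # True
print(fits_uint64("18446744073709551616"))  # False
```

No numeric conversion takes place, so the precision loss shown in the question cannot occur.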
I'll assume the input is a string of digits for clarity.
You ask for the most efficient way. This can't be determined without understanding the distribution of inputs. For example if the inputs are uniform in 128 bit integers, the most efficient is to start with something like:
if (length($ARGV[0]) > 20) {die "Number too large.\n"}
This deals with over 99.9999999999 % of cases. In fact if the inputs were uniform in 256 bit integers you might be forgiven for simply writing:
warn "Number too large.\n";
As to repeatedly and consistently testing in a reasonable amount of time you could consider something like this regex from Damian Conway's Regexp::Number (for signed 64 bit numbers but the principle is valid). Notice, being real code, it deals with leading zeros.
'0*(?:(?:9(?:[0-1][0-9]{17}' .
'|2(?:[0-1][0-9]{16}' .
'|2(?:[0-2][0-9]{15}' .
'|3(?:[0-2][0-9]{14}' .
'|3(?:[0-6][0-9]{13}' .
'|7(?:[0-1][0-9]{12}' .
'|20(?:[0-2][0-9]{10}' .
'|3(?:[0-5][0-9]{9}' .
'|6(?:[0-7][0-9]{8}' .
'|8(?:[0-4][0-9]{7}' .
'|5(?:[0-3][0-9]{6}' .
'|4(?:[0-6][0-9]{5}' .
'|7(?:[0-6][0-9]{4}' .
'|7(?:[0-4][0-9]{3}' .
'|5(?:[0-7][0-9]{2}' .
'|80(?:[0-6])))))))))))))))))' .
'|[1-8]?[0-9]{0,18})'
This should be blindingly fast compared with perl run-up time for example, or even a keystroke.
As to bigint, it executes very quickly and includes some cool optimization features, but unless you are testing many numbers in code the above should suffice.
If you really want to burn rubber, though, take a look at perl guts, and use something that exposes the macro SvIOK(SV*). (See https://metacpan.org/pod/release/KRISHPL/pod2texi-0.1/perlguts.pod#What-is-an-%22IV%22? for more details.)

Sorting Algorithms for strings of equal length C++

I need to sort around 100000 strings ASCIIbetically and by length. I handle the lengths by putting each string into a 2D vector indexed by its length, and then I sort each bucket with quicksort (for the ASCIIbetical order). But is there a faster sort for strings of equal length? I've heard radix sort is great, but I find it difficult to understand. What would be the best way to sort equal-length strings without using the sort() function? If you need the code, I can post it.
I think building a trie and then retrieving the keys in the trie by means of pre-order traversal is about as efficient as it gets for string sorting, and is actually a form of radix sort. Here is a detailed academic paper discussing this method. In 2006 at least this was the fastest string sorting method out there.
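A minimal sketch of the trie idea, in Python for brevity: insert every string into a trie, then walk it in pre-order, visiting children in character order. The name trie_sort and the dict-based node layout are choices made for this example, not taken from the paper mentioned above. A C++ version would more likely use a fixed array of 256 child slots per node, which removes the need to order the children at all.

```python
def trie_sort(strings):
    """Sort strings by inserting them into a trie and reading it back in pre-order."""
    END = ""  # marker key; it can never collide with a one-character child key
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = node.get(END, 0) + 1  # count duplicates

    result = []

    def walk(node, prefix):
        # Pre-order: emit the string ending here before any longer string.
        result.extend(["".join(prefix)] * node.get(END, 0))
        for ch in sorted(k for k in node if k != END):
            prefix.append(ch)
            walk(node[ch], prefix)
            prefix.pop()

    walk(root, [])
    return result


print(trie_sort(["pear", "apple", "app", "pear", "Zed"]))
# ['Zed', 'app', 'apple', 'pear', 'pear']
```

The recursion depth equals the longest string length, so this sketch is only suitable for reasonably short keys.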
For strings of between 8 and 15 characters, your comparison function for quick sort could do the first 8 characters in a single 64-bit chunk. And so on for 16 to 31, etc. So, you end up with as many comparison functions as you feel makes a difference. Unless you have a very large number of strings with long common sub strings, just using what you know about the string lengths may do the trick, straightforwardly.
For completeness, you need to worry about alignment and byte order. So, fetching 8 bytes at a time into a uint64_t:
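#include <cstdint>
#include <cstring>
uint64_t u;
memcpy(&u, pv, 8);  // pv points at the next 8 bytes of the string; safe for unaligned data
// ...convert u to big-endian if required...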
will do the trick. I can tell you that with gcc and -O2 on an x86_64 the memcpy() compiles to a single instruction, as if it was u = *(uint64_t*)pv :-) For processors with alignment issues, I would hope that the compiler does something suitable.
Sadly, memcmp(foo, bar, 8) does not get the same treatment (at least on gcc 4.8, not even with -O3) :-(
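The reason for the big-endian conversion can be shown with a small Python sketch (the helper name chunk_key is hypothetical). Interpreting each 8-byte chunk as a big-endian integer preserves byte-wise order, while a little-endian interpretation does not. The sketch assumes all strings have the same length, as in the question.

```python
def chunk_key(s: bytes, chunk: int = 8) -> tuple:
    """Split s into 8-byte chunks and read each one as a big-endian integer."""
    return tuple(
        int.from_bytes(s[i:i + chunk], "big") for i in range(0, len(s), chunk)
    )


words = [b"abcdefghij", b"abcdefghia", b"zbcdefghia", b"bacdefghij"]
# Comparing the integer chunks gives the same order as comparing byte by byte.
print(sorted(words, key=chunk_key) == sorted(words))  # True

# Little-endian puts the first byte in the least significant position,
# so the integer order no longer matches the string order.
a, b = b"ab000000", b"ba000000"
print(a < b)                                                     # True
print(int.from_bytes(a, "little") < int.from_bytes(b, "little"))  # False
```

On a little-endian machine such as x86_64, the value loaded by memcpy corresponds to the "little" case, which is why a byte swap is needed before comparing.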

Vectorize string concatenation in matlab

I am doing a project where I would like to vectorize (if possible), these lines of code in Matlab:
for j=1:length(image_feature(i,:))
string1b=strcat(num2str(j),':',num2str(image_feature(i,j)));
write_file1b=[write_file1b string1b ' '];
end
Basically, what I want to get is an output string in the following way:
1:Value1 2:Value2 3:Value3 ....
Note that ValueX is a number so a real example would be an output like this:
1:23.2 2:34.3 3:110.8
Is this possible? I was thinking about doing something like creating another vector with values from 1 to j, and another j-long vector with only ":", and a num2str(image_feature(i,:)), and then hopefully there is a function f (like a vectorized strcat) that if I do:
f(num2str(1:j),colon_vector,num2str(image_feature(i,:)))
will give me the output I mention above.
I'm not sure I understood your question, but perhaps this might help:
val=[23.2 34.3 110.8]
output = [1:length(val); val]
sprintf('%i: %f ',output)
As output I get
1: 23.200000 2: 34.300000 3: 110.800000
You can vectorize all your array operations to create an array or matrix of numbers very efficiently, but by the very nature of strings in MATLAB, you cannot vectorize the creation of your output string. Theoretically, if you have a fixed length string like in C++, you could concurrently write to different memory locations within the string, but that is not something that is supported by MATLAB. Even if it were, it looks like you have variable-length numbers, so even that would be difficult (unless you were to allocate a specific amount of space per number pair, leading to variable-length spaces between number pairs. It doesn't look like you want to do that, since your examples have exactly one space between all number pairs).
If you'd be interested in efficiently creating a vector, the answer provided by twerdster would accomplish that, but even in that code, the sprintf statement is not concurrent. His code does get rid of the for-loop, which improves efficiency, so I prefer his code to yours.
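For comparison, the index:value pairing that the question asks for can be sketched in Python, where the loop is replaced by a single join over the formatted pairs. This is not MATLAB code; the function name format_features is made up, and the :g format (six significant digits, no trailing zeros) is an assumption about how the values should be printed.

```python
def format_features(values):
    """Return '1:v1 2:v2 ...' with exactly one space between pairs."""
    return " ".join(f"{j}:{v:g}" for j, v in enumerate(values, start=1))


print(format_features([23.2, 34.3, 110.8]))  # 1:23.2 2:34.3 3:110.8
```

As in the sprintf answer, the string itself is still built sequentially; only the explicit loop and the repeated concatenation are gone.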

radix sort on binary strings with arbitrary length

I googled around and saw lots of discussion about radix sort on binary strings, but they all assume strings of the same length. How about binary strings of arbitrary length?
Say I have {"001", "10101", "011010", "10", "111"}. How do I do a radix sort on them? Thanks!
Find the max length and pad them all to that length. Should still perform well provided there's some upper bound on the length of the longest string.
You could pad them all to be the same length, but there's no real reason to run a sorting algorithm to determine that a length 5 number in binary is larger than a length 2 one. You would likely get better performance by grouping the numbers by length and running your radix sort within each group. Of course, that's dependent upon how you group them and then on how you sort your groups.
An example of how you might do this would be to run through all the items once and throw them all into a hash table (length --> numbers of that length). This takes linear time, and then let's say nlogn time to access them in order. A radix sort runs in O(nk) time where n is the number of items and k is their average length. If you've got a large k, then the difference between O(nk) and O(nlogn) would be acceptable.
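The grouping approach can be sketched as follows, in Python for illustration. The function names are made up. One assumption goes beyond the answer above: leading zeros are stripped before grouping, so that "001" counts as a length-1 number, which is needed for the question's sample data. Within each group, a least-significant-digit radix sort does a stable split on one bit position per pass.

```python
from collections import defaultdict


def radix_sort_group(pairs, width):
    """LSD radix sort of (stripped, original) pairs whose stripped part has `width` bits."""
    for pos in range(width - 1, -1, -1):
        zeros = [p for p in pairs if p[0][pos] == "0"]
        ones = [p for p in pairs if p[0][pos] == "1"]
        pairs = zeros + ones  # stable: order from earlier passes is preserved
    return pairs


def sort_binary_strings(strings):
    """Sort binary strings by numeric value: group by length, radix sort each group."""
    groups = defaultdict(list)
    for s in strings:
        stripped = s.lstrip("0") or "0"  # assumption: leading zeros carry no value
        groups[len(stripped)].append((stripped, s))
    result = []
    for width in sorted(groups):  # shorter numbers are always smaller
        result.extend(orig for _, orig in radix_sort_group(groups[width], width))
    return result


print(sort_binary_strings(["001", "10101", "011010", "10", "111"]))
# ['001', '10', '111', '10101', '011010']
```

Ordering the groups here uses the built-in sorted on the distinct lengths; since lengths are small integers, a counting pass over 1..max length would work equally well.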
If creating a ton of new string instances leaves a nasty taste, write the comparison yourself.
Compare what the lengths of the strings would be without the leading 0's (i.e. find the index of the first "1"); the longer string is larger.
If both are the same length, just continue comparing them, character-by-character, until you find two characters that differ - the string with the "1" is the larger.
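A minimal sketch of such a comparison, in Python, with no padded or stripped copies of the strings created. The function name compare_binary is hypothetical; str.find plays the role of the first-index lookup described above, and a string with no "1" at all is treated as zero.

```python
from functools import cmp_to_key


def compare_binary(a, b):
    """Return -1, 0 or 1 comparing two binary strings by numeric value."""
    ia, ib = a.find("1"), b.find("1")
    # Effective length: number of characters from the first "1" onwards.
    la = len(a) - ia if ia != -1 else 0
    lb = len(b) - ib if ib != -1 else 0
    if la != lb:
        return -1 if la < lb else 1
    # Same effective length: the first differing character decides.
    for offset in range(la):
        ca, cb = a[ia + offset], b[ib + offset]
        if ca != cb:
            return 1 if ca == "1" else -1
    return 0


data = ["001", "10101", "011010", "10", "111"]
print(sorted(data, key=cmp_to_key(compare_binary)))
# ['001', '10', '111', '10101', '011010']
```

The comparison function can be plugged into any comparison-based sort; Python's built-in sorted is used here only to demonstrate it.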
