How to have a bigint hash for a string


We have an alphanumeric string (up to 32 characters) and want to transform it into an integer (bigint). We're looking for an algorithm to do that. Collisions aren't a problem (which is why we use a bigint, to make them less likely). What matters is that the calculated integers are uniformly distributed over the bigint range and that the calculated integer is always the same for a given string.

This page has a few. You'll need to port them to 64-bit, but that should be trivial. A C# port of the SDBM hash is here. Another page of hash functions is here.

Most programming languages come with a built-in construct or a standard library call to do this. Without knowing the language, I don't think anyone can help you.

Yes, "hash" is the right description for my problem. I know there is CRC32, but it only provides a 32-bit int (in PHP), and these 32-bit integers are at most 10 digits long, so a huge part of the bigint range is unused!?
Mostly, we have a short string like "PX38IEK" or a 36-character UUID like "24868d36-a150-11df-8882-d8d385ffc39c", so the strings are arbitrary, yes.
It doesn't have to be reversible (so collisions aren't a problem). It also doesn't matter which int a string is converted to; my only wish is that the full bigint range is used as well as possible.
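
As a minimal sketch of the kind of hash discussed above (in Python rather than PHP, purely for illustration): 64-bit FNV-1a is one commonly used non-cryptographic hash that is deterministic and spreads its output over the whole 64-bit range. The function names `fnv1a_64` and `to_signed64` are made up for this example; the second one only reinterprets the value so that it fits a signed bigint.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3              # standard 64-bit FNV prime
MASK64 = 0xFFFFFFFFFFFFFFFF              # keeps results within 64 bits


def fnv1a_64(text: str) -> int:
    """Return the unsigned 64-bit FNV-1a hash of the UTF-8 bytes of text."""
    h = FNV64_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte                       # FNV-1a: XOR the byte in first ...
        h = (h * FNV64_PRIME) & MASK64  # ... then multiply, wrapping at 2**64
    return h


def to_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as signed (two's complement)."""
    return value - (1 << 64) if value >= (1 << 63) else value


for s in ("PX38IEK", "24868d36-a150-11df-8882-d8d385ffc39c"):
    print(s, to_signed64(fnv1a_64(s)))
```

The same string always yields the same integer, and collisions remain possible, which matches the requirements stated in the question. If a cryptographic hash is preferred, truncating its digest to 8 bytes would be another option.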

Related

How to convert big hex value to integer in nodejs?

There is a big hex value:
var Hex = "ad6eb61316ff805e9c94667ab04aa45aa3203eef71ba8c12afb353a5c7f11657e43f5ce4483d4e6eca46af6b3bde4981499014730d3b233420bf3ecd3287a2768da8bd401f0abd7a5a137d700f0c9d0574ef7ba91328e9a6b055820d03c98d56943139075d";
How can I convert it to a big integer in Node.js? I searched, but all I found was
var integer = parseInt(Hex, 16);
But I think it doesn't work for a big hex value.
The result is:
1.1564501846672726e+243
How can I get an exact big integer? I want to use this value as the modulus in RSA encryption. Actually, I don't know whether I have to convert it at all.

You need exact integers to do modular arithmetic for RSA, but the largest integer that a JavaScript Number can hold without losing precision is 9007199254740991 (2^53 - 1). Larger integers cannot, in general, be represented exactly as a Number. You would need to devise a way to do modular arithmetic on chunks of the large integer, or simply use one of the available libraries, such as the big-number arithmetic in JSBN, which also provides a full implementation of RSA including PKCS#1 v1.5 padding.
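
For illustration only, this is what exact conversion and a modular operation look like in a language with built-in arbitrary-precision integers (Python, not Node.js). The hex string is just the first 32 digits of the value from the question, and the message value is made up; in Node.js the same role would be played by a big-number library such as the one mentioned above.

```python
# First 32 hex digits of the value in the question; a real RSA modulus is longer.
hex_modulus = "ad6eb61316ff805e9c94667ab04aa45a"

# int(..., 16) parses the whole string exactly, with no rounding.
n = int(hex_modulus, 16)

# Converting to a double keeps only about 53 significant bits,
# so the round trip no longer yields the same integer.
lossy = int(float(n))
print(n == lossy)  # False

# Modular exponentiation, the core operation of RSA, needs exact integers.
# 65537 is a commonly used public exponent; the message value is made up.
message = 42
cipher = pow(message, 65537, n)
print(0 <= cipher < n)  # True
```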

how to code a string to a unique number and decode it

Is there any way to code a long string to a unique number (integer) and then decode this number to original string? (I mean to reduce size of long string)

The simple answer is no.
The complex answer is maybe.
What you are looking for is compression. Compression can reduce the size of the string, but there is no guarantee as to how small it can make it. In particular, you can never guarantee that the result fits into an integer of a certain size.
There are concepts like "hashing" which may help you do what you want depending on exactly what you are trying to do with this number.
Alternatively if you use the same string in a lot of different places then you can store it once and pass references/pointers to that single instance of the String around.
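
A small sketch of the compression idea, assuming Python and its standard `zlib` module (the helper names `encode_to_int` and `decode_from_int` are invented here). It is reversible, but the resulting integer has no fixed size and is generally far wider than 64 bits.

```python
import zlib


def encode_to_int(text: str) -> int:
    """Compress text and reinterpret the compressed bytes as one big integer."""
    compressed = zlib.compress(text.encode("utf-8"))
    # A zlib stream starts with a non-zero header byte, so no leading
    # zero bytes are lost when the bytes are turned into an integer.
    return int.from_bytes(compressed, "big")


def decode_from_int(number: int) -> str:
    """Invert encode_to_int: integer -> compressed bytes -> original text."""
    length = (number.bit_length() + 7) // 8
    compressed = number.to_bytes(length, "big")
    return zlib.decompress(compressed).decode("utf-8")


text = "the same phrase again and again, " * 20
number = encode_to_int(text)
print(decode_from_int(number) == text)            # True
print(len(text), (number.bit_length() + 7) // 8)  # original vs encoded size in bytes
```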

First, hash the string, e.g. with MD5. Then convert the characters of the hash string into numbers according to their position in the alphabet.
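
One possible reading of this suggestion, sketched in Python with the standard `hashlib` module: treat the hexadecimal MD5 digest as a base-16 number. That interpretation, and the helper names, are assumptions made for this example. Note that this direction cannot be decoded back to the original string.

```python
import hashlib


def md5_to_int(text: str) -> int:
    """Hash text with MD5 and read the 32 hex digits as a 128-bit integer."""
    hex_digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return int(hex_digest, 16)


def md5_to_int64(text: str) -> int:
    """Keep only the low 64 bits, e.g. to fit an unsigned 64-bit value."""
    return md5_to_int(text) & 0xFFFFFFFFFFFFFFFF


print(md5_to_int("a long string"))
print(md5_to_int64("a long string"))
```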

Is there a difference between datatypes on different bit-size OSes?

I have a C program that I know works on 32-bit systems. On 64-bit systems (at least mine) it works up to a point and then stops. From reading some forums, it seems the program may not be 64-bit safe. I assume it has to do with differences in data types between 32-bit and 64-bit systems.
Is a char the same on both? What about int or long or their unsigned variants? Is there any other way a 32-bit program wouldn't be 64-bit safe? If I wanted to verify that the application is 64-bit safe, what steps should I take?

Regular data types in C have minimum ranges of values rather than specific bit widths. For example, a short has to be able to represent, at a minimum, -32767 through 32767 inclusive.
So, yes, if your code depends on values wrapping around at 32768, it's unlikely to behave well if the short is some big honking 128-bit behemoth.
If you want specific-width data types, look into stdint.h for things like int64_t and so on. There is a wide variety to choose from: specific widths, "at-least" widths, and so on. The exact-width types also mandate two's complement, unlike the "regular" integral types:
integer types having certain exact widths;
integer types having at least certain specified widths;
fastest integer types having at least certain specified widths;
integer types wide enough to hold pointers to objects;
integer types having greatest width.
For example, from C11 7.20.1.1 Exact-width integer types:
The typedef name intN_t designates a signed integer type with width N, no padding
bits, and a two’s complement representation. Thus, int8_t denotes such a signed
integer type with a width of exactly 8 bits.
Provided you have followed the rules (things like not casting pointers to integers), your code should compile and run on any implementation, and any architecture.
If it doesn't, you'll just have to start debugging, then post the detailed information and the code that seems to be causing the problem on a forum site dedicated to such things. Now where have I seen one of those recently? :-)
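
To see how the "regular" types differ between platforms while the stdint.h types stay fixed, one quick check is to print their sizes. The sketch below uses Python's standard `ctypes` module only as a convenient way to query the C types of the platform it runs on; the dictionary name `sizes` is just for this example.

```python
import ctypes

# Sizes in bytes of some C types as seen by the C compiler that built this
# Python interpreter; the values for long and pointers differ between platforms.
sizes = {
    "char": ctypes.sizeof(ctypes.c_char),
    "short": ctypes.sizeof(ctypes.c_short),
    "int": ctypes.sizeof(ctypes.c_int),
    "long": ctypes.sizeof(ctypes.c_long),
    "long long": ctypes.sizeof(ctypes.c_longlong),
    "void *": ctypes.sizeof(ctypes.c_void_p),
    "int32_t": ctypes.sizeof(ctypes.c_int32),
    "int64_t": ctypes.sizeof(ctypes.c_int64),
}

for name, size in sizes.items():
    print(f"{name:10s} {size} bytes")
```

Running this on a 32-bit and a 64-bit system typically shows different sizes for long and for pointers, which is where code that stores pointers in int or long tends to break.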

Strategies for parallel implementation of Lua numbers and a 64bit integer

Lua by default uses a double-precision floating-point (double) type as its only numeric type. That's nice and useful. However, I'm working on software that expects to see 64-bit integers, so I can't get around using actual 64-bit integers one way or another.
The place where the integer type becomes relevant is for file sizes. Although I don't truly expect to see file sizes beyond what Lua can represent with full "integer" precision using a double, I want to be prepared.
What strategies can you recommend when using a 64-bit integer type in parallel with the default numeric type of Lua? I don't really want to throw the default implementation overboard (and I'm not worried about its performance compared to integer arithmetic), but I need some way of representing 64-bit integers up to their full precision without too much of a performance penalty.
My problem is that I'm unsure where to modify the behavior. Should I modify the syntax and extend the parser (numbers with appended LL or ULL come to mind, which to my knowledge don't exist in default Lua), or should I instead write my own C module and define a userdata type that represents the 64-bit integer, along with library functions able to manipulate the values? ...
Note: yes, I am embedding Lua, so I am free to extend it whichever way I please.

As part of LuaJIT's port to ARM CPUs (which often have poor floating-point), LuaJIT implemented a "Dual-number VM", which allows it to switch between integers and floats dynamically as needed. You could use this yourself, just switch between 64-bit integers and doubles instead of 32-bit integers and floats.
It's currently live in builds, so you may want to consider using LuaJIT as your Lua "interpreter." Or you could use it as a way to learn how to do this sort of thing.
However, I do agree with Marcelo; the 53-bit mantissa should be plenty. You shouldn't really need this for a good 10 years or so.
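
A quick way to see where the 53-bit limit sits, sketched in Python, whose float is an IEEE 754 double (assumed here to match the double Lua uses by default on common platforms):

```python
LIMIT = 2 ** 53  # doubles have a 53-bit significand

# Integers up to 2**53 are exactly representable; two spot checks:
print(int(float(LIMIT - 1)) == LIMIT - 1)  # True
print(int(float(LIMIT)) == LIMIT)          # True

# The next integer is not representable and gets rounded.
print(int(float(LIMIT + 1)) == LIMIT + 1)  # False

# For scale: 2**53 bytes expressed in pebibytes (2**50 bytes each).
print(LIMIT / 2 ** 50)  # 8.0
```

So a file size would have to exceed 8 PiB before a double could no longer hold it exactly.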

I'd suggest storing your data outside of Lua and using some type of reference to retrieve it when calling your other libraries. You can then push various results onto the Lua stack for the user to see, and you can even retrieve the value as a string to be precise, but I would avoid modifying the values in Lua and relying on the Lua values when calling your external library.

If you're not going to need floating-point precision at any point in the program, you can just redefine LUA_NUMBER to __int64 (or whatever 64-bit int may be in your environment) in luaconf.h.
Otherwise, you can just bring in another library to handle your integers; for arbitrary precision, you can use a bignum library such as lhf's lbn.

Efficient String Implementation in Haskell

I'm currently teaching myself Haskell, and I'm wondering what the best practices are when working with strings in Haskell.
The default string implementation in Haskell is a list of Char. This is inefficient for file input-output, according to Real World Haskell, since each character is separately allocated (I assume that this means that a String is basically a linked list in Haskell, but I'm not sure.)
But if the default string implementation is inefficient for file i/o, is it also inefficient for working with Strings in memory? Why or why not? C uses an array of char to represent a String, and I assumed that this would be the default way of doing things in most languages.
As I see it, the list implementation of String will take up more memory, since each character carries overhead, and also more time to iterate over, because a pointer dereference is required to get to the next char. But I've liked playing with Haskell so far, so I want to believe that the default implementation is efficient.

Apart from String/ByteString there is now the Text library which combines the best of both worlds—it works with Unicode while being ByteString-based internally, so you get fast, correct strings.

Best practices for working with strings performantly in Haskell are basically: Use Data.ByteString/Data.ByteString.Lazy.
http://hackage.haskell.org/packages/archive/bytestring/latest/doc/html/
As far as the efficiency of the default string implementation in Haskell goes, it is not efficient. Each Char represents a Unicode code point, which means it needs at least 21 bits per Char.
Since a String is just [Char], that is, a linked list of Char, Strings have poor locality of reference and are fairly large in memory: at a minimum N * (21 bits + M bits), where N is the length of the string and M is the size of a pointer (32, 64, what have you). And unlike many other places where Haskell uses lists where other languages might use different structures (I'm thinking specifically of control flow here), Strings are much less likely to be optimized into loops, etc. by the compiler.
And while a Char corresponds to a code point, the Haskell 98 report doesn't specify anything about the encoding used when doing file IO, not even a default, much less a way to change it. In practice GHC provides extensions to do e.g. binary IO, but you're going off the reservation at that point anyway.
Even with operations like prepending to front of the string it's unlikely that a String will beat a ByteString in practice.

The answer is a bit more complex than just "use lazy bytestrings".
Byte strings only store 8 bits per value, whereas String holds real Unicode characters. So if you want to work with Unicode, you have to convert to and from UTF-8 or UTF-16 all the time, which is more expensive than just using strings. Don't make the mistake of assuming that your program will only need ASCII. Unless it's just throwaway code, one day someone will need to put in a Euro symbol (U+20AC) or accented characters, and your nice fast bytestring implementation will be irretrievably broken.
Byte strings make some things, like prepending to the start of a string, more expensive.
That said, if you need performance and you can represent your data purely in bytestrings, then do so.
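
The difference between bytes and characters mentioned above can be shown in a few lines. This sketch is in Python rather than Haskell and only illustrates the encoding issue, not the performance of either Haskell type; the sample text is made up.

```python
text = "price: 10 \u20ac"  # ends with the Euro sign, U+20AC

utf8 = text.encode("utf-8")
print(len(text))  # 11 code points
print(len(utf8))  # 13 bytes: the Euro sign takes three bytes in UTF-8

# Cutting the byte string at an arbitrary offset can split a character,
# which is the kind of breakage byte-oriented code runs into.
truncated = utf8[:-1]
try:
    truncated.decode("utf-8")
except UnicodeDecodeError:
    print("truncated bytes are no longer valid UTF-8")
```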

The basic answer given, use ByteString, is correct. That said, all of the three answers before mine have inaccuracies.
Regarding UTF-8: whether this will be an issue or not depends entirely on what sort of processing you do with your strings. If you're simply treating them as single chunks of data (which includes operations such as concatenation, though not splitting), or doing certain limited byte-based operations (e.g., finding the length of the string in bytes, rather than the length in characters), you won't have any issues. If you are using I18N, there are enough other issues that simply using String rather than ByteString will start to fix only a very few of the problems you'll encounter.
Prepending single bytes to the front of a ByteString is probably more expensive than doing the same for a String. However, if you're doing a lot of this, it's probably possible to find ways of dealing with your particular problem that are cheaper.
But the end result would be, for the poster of the original question: yes, Strings are inefficient in Haskell, though rather handy. If you're worried about efficiency, use ByteStrings, and view them as either arrays of Char8 or Word8, depending on your purpose (ASCII/ISO-8859-1 vs Unicode of some sort, or just arbitrary binary data). Generally, use Lazy ByteStrings (where prepending to the start of a string is actually a very fast operation) unless you know why you want non-lazy ones (which is usually wrapped up in an appreciation of the performance aspects of lazy evaluation).
For what it's worth, I am building an automated trading system entirely in Haskell, and one of the things we need to do is very quickly parse a market data feed we receive over a network connection. I can handle reading and parsing 300 messages per second with a negligible amount of CPU; as far as handling this data goes, GHC-compiled Haskell performs close enough to C that it's nowhere near entering my list of notable issues.
