I am using:
CassandraUtil::uuid1();
This is what I get:
ämªðÏBà=0£Ï‰
I thought it would output an int.
What is going on? Is it normal?
Also should I use uuid1 or 2 or 3 or 4 or ...?
Thanks in advance!
There are a few parts to UUIDs in phpcassa. First, how to generate one. The following functions are useful for this:
$my_uuid_string = phpcassa\UUID::uuid1();
$my_uuid_string = phpcassa\UUID::uuid4();
uuid1() generates a v1 UUID, which has a timestamp component, and is called TimeUUIDType in Cassandra. uuid4() generates a totally random UUID, and is called LexicalUUIDType in Cassandra. (The other uuidX() functions aren't generally that useful.) What this function gives you back is a byte array representation of the UUID -- basically a 16 byte string. This is what your "ämªðÏBà=0£Ï‰" string is. When you are trying to insert a UUID into Cassandra, this is what you want to use.
It's possible to make a UUID object which has useful methods and attributes from this byte array:
$my_uuid = phpcassa\UUID::import($my_uuid_string);
With $my_uuid, you can get a pretty string representation like 'd881bf7c-cf8f-11e0-85e5-00234d21610a' by getting $my_uuid->string. You can get back the byte representation with $my_uuid->bytes. Any UUID data that you get back from Cassandra will be in the byte array format, so you need to use UUID::import() on it if you want a UUID object.
UUID::import() also works on the pretty string representation (the one that looks like 'd881bf7c-cf8f-11e0-85e5-00234d21610a').
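The byte/string split isn't phpcassa-specific; every language has both forms. As a language-neutral illustration, here is a minimal Java sketch of the same two representations, roughly what ->bytes and ->string give you in phpcassa:
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidForms {
    public static void main(String[] args) {
        UUID u = UUID.randomUUID();

        // The raw 16-byte form: the compact representation that gets
        // stored in Cassandra (phpcassa's $my_uuid->bytes).
        byte[] bytes = ByteBuffer.allocate(16)
                .putLong(u.getMostSignificantBits())
                .putLong(u.getLeastSignificantBits())
                .array();

        // The pretty dashed form (phpcassa's $my_uuid->string).
        System.out.println(u);            // e.g. d881bf7c-cf8f-11e0-85e5-00234d21610a
        System.out.println(bytes.length); // 16
    }
}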
Last, don't forget about the documentation for the UUID class.
EDIT: updated links and class names to match the latest phpcassa API
uuid1() generates a UUID based on the current time and the MAC address of the machine.
Pros: Useful if you want to be able to sort your UUIDs by creation time.
Cons: Potential privacy leakage since it reveals which computer it was generated on and at what time.
Collisions possible: If two UUIDs are generated at the exact same time (within 100 ns) on the same machine. (Or a few other unlikely marginal cases.)
uuid2() doesn't seem to be used anymore.
uuid3() generates a UUID by taking an MD5 hash of an arbitrary name that you choose within some namespace (e.g. URL, domain name, etc).
Pros: Provides a nice way of assigning blocks of UUIDs to different namespaces. Easy to reproduce the UUID from the name.
Cons: If you have a unique name already, why do you need a UUID?
Collisions possible: If you reuse a name within a namespace, or if there is a hash collision.
uuid4() generates a completely random UUID.
Pros: No privacy concerns. Don't have to generate unique names.
Cons: No structure to UUIDs.
Collisions possible: If you use a bad random number generator, reuse a random seed, or are very, very unlucky.
uuid5() is the same as uuid3(), except using a SHA-1 hash instead of MD5. Officially preferred over uuid3().
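To make the versions concrete, here is a minimal sketch (in Java, purely for illustration; java.util.UUID only ships generators for v4 and v3 out of the box, so v1 and v5 would need a library):
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class UuidVersions {
    public static void main(String[] args) {
        // v4: completely random.
        UUID v4 = UUID.randomUUID();
        // v3: MD5 hash of a name. Note that java.util.UUID has no namespace
        // handling built in; a strictly RFC-compliant v3 call would prepend
        // the namespace UUID's bytes to the name.
        UUID v3 = UUID.nameUUIDFromBytes("example.com".getBytes(StandardCharsets.UTF_8));

        System.out.println(v4 + " version=" + v4.version()); // version=4
        System.out.println(v3 + " version=" + v3.version()); // version=3
    }
}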
Problem we are trying to solve:
Given a list of keys, what is the best way to get the values from an IMap when the number of entries is around 500K?
Also, we need to filter the values based on fields.
Here is the example map we are trying to read from.
Given IMap[String, Object]
We are using protobuf to serialize the object
The object can be, say, a protobuf message like:
message Test {
  required mac_address eth_mac = 1;
  … // size can be around 300 bytes
}
You can use IMap.getAll(keySet) if you know the keys beforehand. It's much better than single gets, since a bulk operation needs far fewer network round trips.
For filtering, you can use predicates with IMap.values(predicate), IMap.entrySet(predicate) or IMap.keySet(predicate), depending on what you want back.
See more: http://docs.hazelcast.org/docs/latest/manual/html-single/index.html#distributed-query
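A minimal sketch against the Hazelcast 3.x Java API (the map name "devices" and the Device class with its ethMac field are made up for illustration; note that a predicate can only filter on a field if Hazelcast can deserialize the value, which raw protobuf bytes won't allow without extra work):
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.query.Predicates;
import java.io.Serializable;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BulkReadExample {
    // Illustrative value class; in the question this would be the protobuf message.
    public static class Device implements Serializable {
        private final String ethMac;
        public Device(String ethMac) { this.ethMac = ethMac; }
        public String getEthMac() { return ethMac; }
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, Device> map = hz.getMap("devices");
        map.put("k1", new Device("00:11:22:33:44:55"));

        // Bulk fetch: one batched operation instead of 500K single gets.
        Set<String> keys = new HashSet<>(Arrays.asList("k1", "k2"));
        Map<String, Device> byKey = map.getAll(keys);

        // Server-side filtering on a value field via a predicate.
        Collection<Device> matching =
                map.values(Predicates.equal("ethMac", "00:11:22:33:44:55"));

        System.out.println(byKey.size() + " fetched, " + matching.size() + " matched");
        hz.shutdown();
    }
}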
My client side code generates UUIDs and sends them to the server.
For example, '6ea140caa83b485f9c98ebaacfb536ce' would be a valid uuid4 to send back.
Is there any way to detect or prevent a user sending back a valid but "user generated" uuid4 like 'babebabebabe4abebabebabebabebabe'?
For example, one way to prevent a certain class of these would be looking at the number of occurrences of 0's and 1's in the binary representation of the number. This could work for a string like '00000000000040000000000000000000' but not for all strings.
It depends a little ...
There is no way to be entirely sure, but depending on the UUID version/variant you are using, there MIGHT be a way to detect at least some irregular values:
https://www.rfc-editor.org/rfc/rfc4122#section-4.1 defines the original version 1 of UUIDs and the layout of the UUID fields.
You could, for example, check whether the version and variant fields are valid.
If your UUID generation actually uses version 1, you could, in addition to the version and variant test, check whether the timestamp is in a plausible range ... for example, it is rather unlikely that the UUID in question was generated in the year 1600, or in the future.
Tests like these can check whether a value actually makes sense or is complete gibberish. They cannot protect you against someone thinking: OK, let's analyze this and provide a manually chosen value that satisfies all the conditions.
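A minimal sketch of such checks using java.util.UUID (it expects the dashed textual form, so re-insert the dashes if you receive the compact hex form; the assumption that only v1 and v4 are legitimate, and the ten-year cutoff, are arbitrary choices for illustration):
import java.util.UUID;

public class UuidSanityCheck {
    // 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
    private static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;

    static boolean looksPlausible(String s) {
        UUID u;
        try {
            u = UUID.fromString(s);          // expects the dashed form
        } catch (IllegalArgumentException e) {
            return false;                    // not even a well-formed UUID
        }
        if (u.variant() != 2) return false;  // must be the RFC 4122 variant
        if (u.version() == 1) {
            // v1 carries a timestamp: reject values from the future,
            // or implausibly far in the past (10 years is arbitrary).
            long unixMillis = (u.timestamp() - UUID_EPOCH_OFFSET) / 10_000;
            long now = System.currentTimeMillis();
            return unixMillis <= now
                && unixMillis > now - 10L * 365 * 24 * 60 * 60 * 1000;
        }
        return u.version() == 4;             // otherwise accept only v4 here
    }

    public static void main(String[] args) {
        System.out.println(looksPlausible("6ea140ca-a83b-485f-9c98-ebaacfb536ce")); // true
    }
}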
No, there is no way to reliably distinguish user-generated UUIDs from randomly generated UUIDs.
To start with, a user-generated UUID may well be partially random itself. But let's assume that it is not.
In that case you want to detect a pattern. However, although you give an example of a pattern, a pattern can be almost anything. For instance, the following byte array looks completely random, right?
40 09 21 fb 54 44 2d 18
But actually it is a nothing-up-my-sleeve number commonly used within the cryptographic community: it's simply an encoding of Pi (in this case as a 64-bit floating point, as I was somewhat lazy).
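You can verify that yourself; in Java, for example:
public class PiBits {
    public static void main(String[] args) {
        // Prints "400921fb54442d18", the IEEE 754 double encoding of Pi,
        // i.e. exactly the byte sequence shown above.
        System.out.println(Long.toHexString(Double.doubleToLongBits(Math.PI)));
    }
}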
There are certainly randomness tests, for instance the FIPS random number tests. Those require a very large amount of input to decide whether something passes or fails. Even then, they only show that certain statistical properties have indeed been attained by a random number generator; the encoding of Pi might very well pass.
And annoyingly, it is perfectly possible for a random number generator to produce bit strings that do not look random at all, just by chance. The smaller the bit string, the higher the chance that the generator produces something that doesn't look random, and UUIDs are not all that big.
So yes, of course you can do some tests, but you can never be sure: you will get both false positives and false negatives.
I'm currently using uuid npm package to generate unique IDs for the elements of my graph database in my node.js app.
It generates RFC-compliant 128-bit long IDs, like
6e228580-1cb5-11e8-8271-891867c15336
I'm currently thinking of switching to the shortid npm package, which does a similar job but generates short 7-character IDs:
PPBqWA9
My database requests are already long and I want to shorten them, so I'm considering the switch from uuid to shortid.
However, the question: I understand that the 128-bit compliant UUID generator practically guarantees uniqueness. What about the 7-character one? I understand it can provide about 78364164096 unique possibilities, which is not bad, but I already have about 50M unique objects in my DB, each of which has a unique index, so I'm just curious whether that algorithm would really be able to keep generating unique IDs, considering that 78364164096 is only about 1,500 times more than 50,000,000.
Any ideas? Should I use a 7-character identifier or a 128-bit one?
I will assume the shorter ids provided by the shortid package are a full 56 bits long. More likely they occupy a 56-bit space (7 bytes, one per character) but carry only something like 42 bits of actual payload (7 characters over a 64-symbol alphabet).
Both 56-bit and 128-bit ids are subject to collisions. The difference is the probability of a collision. I think the 56-bit one REQUIRES you to be able to handle collisions, so you'll end up with more complex code. The 128-bit one is so unlikely to produce collisions that they are generally not even considered.
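To put numbers on this, here is a sketch of the usual birthday-bound approximation p ≈ 1 - e^(-n^2/(2N)) for 50 million ids drawn from id spaces of different sizes (122 bits being the number of random bits in a v4 UUID):
public class CollisionOdds {
    // Birthday-bound approximation: p ≈ 1 - e^(-n^2 / 2N)
    // for n random ids drawn uniformly from a space of 2^bits values.
    static double collisionProbability(double n, int bits) {
        double space = Math.pow(2, bits);
        return -Math.expm1(-n * n / (2 * space));
    }

    public static void main(String[] args) {
        double n = 50_000_000;                             // ids already in the DB
        System.out.println(collisionProbability(n, 42));   // ~1.0: collision virtually certain
        System.out.println(collisionProbability(n, 56));   // ~0.017: about a 2% chance
        System.out.println(collisionProbability(n, 122));  // ~2e-22: negligible
    }
}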
For the sake of simplicity and safety, I would choose the time proven 128-bit.
During my most recent job interview for a software engineer position, I was asked this question: what are the differences between a hashtable and a hashmap? I asked the interviewer whether he meant specifically Java, since in Java Hashtable is synchronized and HashMap is not (and there are actually tons of articles comparing Hashtable vs HashMap in Java after googling, so that's not the answer I am looking for), but he said no and wanted me to explain the difference between the two in general.
I was really puzzled and shocked (actually, still puzzled now) by this question. IMO, hashtable vs hashmap is simply a matter of terminology. Actually, only Java has both terms; other languages like C++ don't even have the term hashtable. During the interview, I just explained the principle of hashing and said that a hashmap and a hashtable should both be implemented based on this principle, and that I didn't know of any difference between the two. The interviewer was definitely not convinced and was looking for another answer, and of course I was rejected after that round.
So, back to the topic: what could the differences between a hashmap and a hashtable possibly be in general (not specific to Java), if there are any?
In computer science there is a difference due to the wording.
A HashTable is some kind of lookup table that uses key hashes to look up the corresponding value in a table-like data structure. That's only one kind of key-value mapping. There are different implementations, as you are probably aware: different hash functions, hash collision strategies, table growing strategies and more under the hood. It's only interesting if you need to build your own hash table for whatever reason.
A HashMap is some kind of mapping of key-value pairs with a hashed key. Mapping is abstract as such and it may not be a table. Balanced trees or tries or other data structures/mappings are possible too.
You could simplify and say that a HashTable is the underlying data structure and the HashMap may be utilizing a HashTable.
A Dictionary is yet another abstraction level, since it may not use hashes at all -- for example, it could be backed by binary-search lookups over sorted keys or other comparison-based schemes. This is all you can get out of the words without considering specific programming languages.
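Java happens to illustrate these layers nicely: Map is the abstract mapping, HashMap implements it on top of a hash table, and TreeMap implements the very same contract with a red-black tree and no hashing at all. A small illustration:
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MappingAbstractions {
    public static void main(String[] args) {
        // Same abstract "mapping" contract...
        Map<String, Integer> hashed = new HashMap<>(); // backed by a hash table
        Map<String, Integer> sorted = new TreeMap<>(); // backed by a red-black tree, no hashing

        hashed.put("one", 1);
        sorted.put("one", 1);
        System.out.println(hashed.get("one") + " " + sorted.get("one"));
    }
}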
--
Before thinking too much about it: can you say, with certainty, that your interviewer had a clue what he/she was talking about? Did you discuss technical details, or did they just listen/ask and sometimes comment? Sometimes interviewers come up with the most ridiculous answers to problems they don't really understand in the first place.
Like you wrote yourself, in general it's just terminology. Software developers often use the terms interchangeably, except maybe in languages where there really is a difference, as in Java.
The interviewer may have been looking for the insight that...
a hash table is a lower-level concept that doesn't imply or necessarily support any distinction or separation of keys and values (i.e. you can implement a hash set of values using a hash table), while
a hash map must support distinct keys and values, as there's to be a mapping/association from keys to values; the two are distinct, even if in some implementations they're always stored side by side in memory, e.g. members of the same structure / std::pair<>.
Example: a (bad) hash table implementation preventing use as a hash map.
Consider:
template <typename T>
class Hash_Table
{
...
bool insert(const T& t)
{
// work out which bucket t hashes to...
size_t bucket = hash_bytes((void*)&t, sizeof t) % num_buckets_;
// see if t is already stored in the bucket...
if (memcmp((void*)&t, (void*)&buckets_[bucket], sizeof t) == 0)
...
... handle collisions etc. ...
}
...
};
Above, the hard-coded calls to a hash function that treats the value being inserted as a binary blob, and the memcmp of the entire t, mean you can't make T, say, a std::pair<int, std::string> and use the hash table as a hash map from ints to strings. So, it's an example of a hash table that's not usable as a hash map.
You might or might not also consider a hash table that merely lacks convenience features for key/value use to not be a hash map. For example, if the API was designed as if dealing only in values - h.insert(t); h.erase(t); auto i = h.find(t); - but it allowed the caller to specify arbitrary custom comparison and hashing functions that restrict their operations to only the key part of t, then the hash table could be (ab)used as a functional hash map.
To clarify how this relates to makadev's existing answer, I disagree with:
"A HashTable [uses] key hashes to lookup the corresponding value"; wrong because it assumes a key->value mapping.
"A HashMap [...]. Mapping is abstract as such and it may not be a table. Balanced trees or tries or other data structures/mappings are possible too."; wrong because the primary mechanism of a hash map is still hashing of the key to a bucket (index) in the table/array: some hash tables/maps may use other data structures (arrays, linked lists, trees...) to store elements that collide at the same bucket, but that's a different issue and not part of the difference between hash tables and hash maps.
Actually, Hashtable has become obsolete and HashMap is the better choice, because Hashtable is synchronized. If a thread-safe implementation is not needed, it is recommended to use HashMap in place of Hashtable. If a thread-safe, highly concurrent implementation is desired, then it is recommended to use java.util.concurrent.ConcurrentHashMap in place of Hashtable.
A second difference is that HashMap implements the Map interface, whereas Hashtable extends the legacy Dictionary class.
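As a minimal sketch, the three choices side by side (all satisfy the same Map contract):
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MapChoices {
    public static void main(String[] args) {
        Map<String, Integer> legacy = new Hashtable<>();              // legacy, fully synchronized
        Map<String, Integer> plain = new HashMap<>();                 // not thread-safe, fastest single-threaded
        Map<String, Integer> concurrent = new ConcurrentHashMap<>();  // thread-safe without a global lock

        legacy.put("a", 1);
        plain.put("b", 2);
        concurrent.put("c", 3);
        System.out.println(legacy + " " + plain + " " + concurrent);
    }
}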
I would like to know how I can reconstruct a hash value such as 558f68181d2b0c9d57d41ce7aa36b71d9 back to its original value (734).
I have used code in MATLAB which provided me with a hash output, but I tried to reverse the operation to obtain the original value, with no luck. I also tried converting from hex to binary, but no use.
Are there any built-in functions that can help me obtain the original value?
I have used this code:
http://uk.mathworks.com/matlabcentral/fileexchange/31272-datahash
In general this is impossible. The whole idea of cryptographic hashes (like the SHA-1 used above) is to be as unpredictable as possible. The hash of a given piece of data is always the same, of course, but it should be really hard to work out which data produced a certain hash. On top of that, a hash maps arbitrarily large inputs to a fixed-size output, so many different inputs share each hash value and there is no unique "original" to reconstruct.
If you have a limited set of possible values, you could create a lookup table (hash -> data that produced it), but this is actually the exact opposite of how hashes are supposed to be used.
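For example, if you knew the original values were small integers, you could brute-force such a table. A sketch in Java, using SHA-1 over the decimal string purely for illustration (MATLAB's DataHash hashes its own serialization of the input, so a real table would have to be built with exactly the same hashing routine and options you used):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class HashLookupTable {
    public static void main(String[] args) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        Map<String, Integer> table = new HashMap<>();

        // Precompute hash -> value for every candidate input.
        for (int v = 0; v < 100_000; v++) {
            byte[] d = sha1.digest(String.valueOf(v).getBytes(StandardCharsets.UTF_8));
            table.put(toHex(d), v);
        }

        // "Reversing" is then just a lookup (null if the hash isn't in the table).
        System.out.println(table.get(toHex(
                sha1.digest("734".getBytes(StandardCharsets.UTF_8))))); // 734
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}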
If you need to get the original value back, you don't really want a hash at all: I think you want your own encoding for this problem, where the data is embedded in the output in some recoverable way.