I am wondering about the format in which CouchDB represents UUIDs by default. While RFC 4122 describes UUIDs like 550e8400-e29b-11d4-a716-446655440000, CouchDB uses a continuous run of characters like 3069197232055d39bc5bc39348a36417. I've spent some time searching both their wiki and their documentation to find out what this actually is, but without any result.
Do you know whether this is a non-RFC-conformant format that simply omits every -, or a completely different representation of the 128 bits?
The background is that I'm using Java UUIDs which are formatted as noted in the RFC. I see the advantage that the CouchDB-style is probably more handy for building internal trees, but I want to be sure to use a consistent implementation.
Technically we don't use the RFC standard for UUIDs, as you've noticed. Version 4 UUIDs reserve six bits: four to specify the version and two for the variant. We also don't format them with the hyphens that are generally seen in other implementations.
CouchDB uuids are 16 random bytes formatted as hex. Roughly speaking that's a v4 uuid but not rfc compliant.
Regardless of the specifics, there's really not much of an issue in practice. You generally shouldn't try to interpret a uuid unless you're trying to do some sort of out-of-band analysis. CouchDB will never interpret uuids; we only rely on the properties of randomness involved therein.
Bottom line would be to not worry about it and just treat them as strings after generation.
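For the Java-style UUIDs mentioned in the question, switching between the two textual forms is only a matter of adding or dropping hyphens. A minimal Python sketch, assuming the ID really is 32 hex characters (the helper names to_hyphenated and to_compact are made up for illustration):

```python
import uuid

def to_hyphenated(couch_id: str) -> str:
    """Render a 32-character hex ID in the 8-4-4-4-12 layout."""
    return str(uuid.UUID(hex=couch_id))

def to_compact(rfc_uuid: str) -> str:
    """Drop the hyphens from an RFC 4122 string, giving 32 hex characters."""
    return uuid.UUID(rfc_uuid).hex

print(to_hyphenated("3069197232055d39bc5bc39348a36417"))
# 30691972-3205-5d39-bc5b-c39348a36417
print(to_compact("550e8400-e29b-11d4-a716-446655440000"))
# 550e8400e29b11d4a716446655440000
```

Note that this only changes the spelling. A CouchDB-generated ID put into hyphenated form still carries random bits where an RFC 4122 UUID would carry its version and variant.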
I can provide a 2019 reference from the doc site: "it's in any case preferable to provide one's own uuids" -- https://docs.couchdb.org/en/latest/best-practices/documents.html?highlight=uuid
I ran slap bang into this because the (hobby) db I'm attempting as my first programming project deals with an application that does generate and use RFC 4122-compliant uuids, and I was chewing my nails worrying about stripping the "-" bits out and putting them back on retrieval.
Then it hit me that the uuid CouchDB uses as the doc _id is a string, not a number... doh. So I use the uuid the app generates when it creates an object as the document's _id. No random duplicated uuids.
I'm thinking about implementing a couple of GATT services for a custom app, but I'm stuck right now in the research. I know that the service uuids are not random: some parts are well defined, others are still confusing me.
E.g. the Device Information Service seems to be advertised as 0000180a-xxx. The "180a" I can find in https://www.bluetooth.com/specifications/gatt/services/, but the subfield "Model Number String", for example, has the id 00002a24-xxx, a totally different prefix which is not defined in the same list, but instead here: https://www.bluetooth.com/specifications/gatt/characteristics/. How can I get a list of all those prefixes?
The part I marked with xxx seems to be the same for a random device my mobile found. Is there an overview somewhere of how the uuid should be built and which "safe" prefixes I can use for my own GATT services?
In Bluetooth, attribute types, GATT service types, characteristic types, descriptor types, and other constants are identified through UUIDs.
UUIDs are no more than identifiers, 128-bit identifiers. One given 128-bit value designates one given thing. Using 128-bit random values gives minimal chances of collisions for two parties generating identifiers on their own, without the need for a central registry.
UUIDs in Bluetooth
Bluetooth standard-defined UUIDs receive special treatment as they are commonly used throughout the various protocols of the Specification. They are grouped around the Bluetooth Base UUID (xxxxxxxx-0000-1000-8000-00805F9B34FB) and share 96 common bits. (See core specification, 3.B.2.5.1)
In various protocol parts, standard UUIDs can be transmitted in a short form, skipping the common bits, thus only using 16 or 32 bits on the air. This is an implementation detail, specific to some protocols in the whole stack. Because of this, standard-defined UUIDs are often referred to as Short UUIDs.
Apart from standard UUIDs, any implementor is free to generate its own UUIDs from 128 random bits and use them anywhere needed. Custom UUIDs must not use the Bluetooth Base UUID (and cannot be encoded in a short form, but you should not care about this).
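To make the relation between short and full forms concrete, here is a small Python sketch that expands a 16- or 32-bit value onto the Bluetooth Base UUID quoted above. The function name expand_short_uuid is made up for illustration; the two sample values are the ones from the question.

```python
import uuid

# Bluetooth Base UUID with the variable top 32 bits set to zero.
BLUETOOTH_BASE = uuid.UUID("00000000-0000-1000-8000-00805f9b34fb")

def expand_short_uuid(short: int) -> uuid.UUID:
    """Expand a 16- or 32-bit Bluetooth UUID to its full 128-bit form."""
    if not 0 <= short <= 0xFFFFFFFF:
        raise ValueError("short UUID must fit in 32 bits")
    # The short value occupies the top 32 bits; the other 96 bits are shared.
    return uuid.UUID(int=BLUETOOTH_BASE.int | (short << 96))

print(expand_short_uuid(0x180A))  # 0000180a-0000-1000-8000-00805f9b34fb
print(expand_short_uuid(0x2A24))  # 00002a24-0000-1000-8000-00805f9b34fb
```

So the "xxx" part in the question is simply the 96 shared bits, and "180a" or "2a24" is the assigned number, not a prefix scheme.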
Implications for custom services
Use standard UUIDs when implementing standard services and characteristics.
Never use Bluetooth Base-UUID based UUIDs for custom purposes.
When developing your custom service and attributes, reusing existing UUIDs (or part of UUIDs) from code you may find in example code, vendor SDKs, or anywhere else is not a good idea. You should really regenerate yours.
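Generating your own is a one-liner in most languages. A hedged Python sketch, with a guard against landing on the Bluetooth Base UUID (the guard is redundant in practice, since a version 4 UUID cannot match the base's fixed bits, but it makes the rule explicit; new_custom_uuid is a made-up name):

```python
import uuid

LOW_96_MASK = (1 << 96) - 1
# Low 96 bits shared by all standard Bluetooth UUIDs.
BASE_LOW_96 = uuid.UUID("00000000-0000-1000-8000-00805f9b34fb").int & LOW_96_MASK

def is_base_derived(u: uuid.UUID) -> bool:
    """True if the UUID sits on the Bluetooth Base UUID."""
    return (u.int & LOW_96_MASK) == BASE_LOW_96

def new_custom_uuid() -> uuid.UUID:
    """Return a fresh random UUID that is not based on the Bluetooth Base UUID."""
    while True:
        candidate = uuid.uuid4()
        if not is_base_derived(candidate):
            return candidate

print(new_custom_uuid())
```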
Side notes
The specification does not define any grouping scheme (what you call prefixes) for custom UUIDs, but some vendors do encourage grouping custom UUIDs: they generate one 96-bit custom base UUID and issue incrementing values from it, as Bluetooth did. This is not standard and brings no protocol optimization.
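For illustration only, the vendor-style grouping can be sketched like this in Python. The layout (index in the top 32 bits, as with the Bluetooth Base UUID) and the names make_custom_base and derive are assumptions; actual vendor SDKs may place the variable bits elsewhere.

```python
import uuid

LOW_96_MASK = (1 << 96) - 1

def make_custom_base() -> uuid.UUID:
    """Random UUID with the top 32 bits cleared, to serve as a private base."""
    return uuid.UUID(int=uuid.uuid4().int & LOW_96_MASK)

def derive(base: uuid.UUID, index: int) -> uuid.UUID:
    """Place a 32-bit index in the top bits of the private base."""
    if not 0 <= index <= 0xFFFFFFFF:
        raise ValueError("index must fit in 32 bits")
    return uuid.UUID(int=(base.int & LOW_96_MASK) | (index << 96))

base = make_custom_base()
service = derive(base, 1)
characteristic = derive(base, 2)
print(service, characteristic)
```

The derived UUIDs are still full 128-bit values on the air; the grouping only helps humans reading logs.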
I often use file paths to provide some sort of unique id for a software system. Is there any way to take a file path and turn it into a unique integer in a relatively quick (computationally) way?
I am ok with larger integers. This would have to be a pretty nifty algorithm as far as I can tell, but would be very useful in some cases.
Anybody know if such a thing exists?
You could try the inode number:
const fs = require('fs');
fs.statSync(filename).ino
@djones's suggestion of the inode number is good if the program is only running on one machine and you don't care about a new file duplicating the id of an old, deleted one. Inode numbers are re-used.
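The same lookup in Python, as a minimal sketch (inode_id is a made-up helper name; on non-POSIX platforms st_ino may have a different meaning):

```python
import os

def inode_id(path: str) -> int:
    """Return the inode number of an existing file or directory."""
    return os.stat(path).st_ino

# Use the current directory as an example of a path that exists.
print(inode_id("."))
```

Two hard links to the same file share an inode, so they would also share this id.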
Another simple approach is hashing the path to a big integer space. E.g. using a 128 bit murmurhash (in Java I'd use the Guava Hashing class; there are several js ports), the chance of any collision among a billion paths is roughly 1 in 2^69 by the birthday bound, which is still negligible. If you're really paranoid, keep a set of the hash values you've already used and rehash on collision.
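A sketch of the hashing approach in Python. It swaps in BLAKE2b truncated to 128 bits from the standard library rather than murmurhash, purely to avoid a third-party dependency; path_to_int and collision_probability are made-up names.

```python
import hashlib

def path_to_int(path: str) -> int:
    """Map a path to a 128-bit integer using a 16-byte BLAKE2b digest."""
    digest = hashlib.blake2b(path.encode("utf-8"), digest_size=16).digest()
    return int.from_bytes(digest, "big")

def collision_probability(n: int, bits: int = 128) -> float:
    """Birthday-bound approximation: n*(n-1)/2 pairs over 2**bits values."""
    return n * (n - 1) / 2 / 2 ** bits

print(path_to_int("/var/log/syslog"))
print(collision_probability(10 ** 9))  # on the order of 1e-21
```

The id is stable across machines and runs, but it depends on the exact spelling of the path, so normalise paths (for example with os.path.abspath) before hashing if that matters.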
This is just my comment turned to an answer.
If you run it in memory, you can use one of the standard hashmaps in your language of choice. This works not just for file names but for any similar situation. Normally, hashmaps in different programming languages resolve collisions with buckets, so the hash number and the corresponding bucket number together provide a unique id.
Btw, it is not hard to write your own hashmap, so that you have control over the underlying structure (e.g. to retrieve the number etc.).
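Python's dict does not expose its bucket positions, so the sketch below uses a simpler variant of the same idea: an in-memory map that hands out the next free integer the first time it sees a key. IdRegistry is a made-up name, and the ids are only meaningful within one process.

```python
class IdRegistry:
    """Assign small, unique integers to keys on first sight."""

    def __init__(self):
        self._ids = {}

    def id_for(self, key: str) -> int:
        # A new key gets the next unused integer; a known key keeps its id.
        if key not in self._ids:
            self._ids[key] = len(self._ids)
        return self._ids[key]

registry = IdRegistry()
print(registry.id_for("/tmp/a.txt"))  # 0
print(registry.id_for("/tmp/b.txt"))  # 1
print(registry.id_for("/tmp/a.txt"))  # 0
```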
I need to do large-scale anonymisation of database log-files.
Part of this will involve obscuring various field names (strings), as well as IP addresses.
1. Field Names
For example, we might have the string BusinessLogic.Categorisation.ExternalDeals. In the anonymised version, we would want it to be something like Jerrycan.Doorway.Fodmap (or something gibberish, but still "pronounceable")
The purpose is simply to obscure the original strings. However, we still want to be able to match up occurrences of those strings across different logfiles.
The requirements of the hash are:
Repeatable - that is, the same inputs passed in each time would always produce the same outputs. We need to be able to match-up fields between different logfiles (all we're trying to prevent is somebody deriving the original string).
One-way - there is no way of reversing the outputs to produce the inputs.
Low chance of collision - it will mess up our analysis if two fields are mapped to the same output.
Human readable (or pronounceable) - somebody scanning through logfiles by hand should be able to make out fields, and visually match them up. Or if need be, read them over the phone.
Short strings - I do understand there's a tradeoff between this and available entropy, however, ideally a string like HumanReadable should map to something like LizzyNasbeth.
I had a look around, and I found https://github.com/zacharyvoase/humanhash (output hash is a bit longer than what I want) and https://www.rfc-editor.org/rfc/rfc1751 (not really "pronounceable" - ideally, we'd want something that looks like an English-language word, but isn't actually one - and, once again, a bit long).
What algorithms or approaches are there to this problem? Or any libraries or implementations you could recommend?
2. IP Addresses
For the IP addresses, we need a way to mask them (i.e. not possible for an outside observer to derive the original IP address), but still have it be repeatable across different logfiles (i.e. the same input always produces the same output).
Ideally, the output would still "look" like an IP address. For example, maybe 192.168.1.55 would map to 33.41.22.44 (or we can use alphabetical codes as well, if that's easier).
Any thoughts on how to do this?
You could use codenamize :
$ codenamize BusinessLogic -j "" -c
AbsorbedUpper
You can use this from command line or as a Python library.
(Disclaimer, I wrote it).
I was discussing with a colleague, and he suggested one approach.
Take the field name - and pass it through a standard one-way hash (e.g. MD5).
Use the resulting digest as an index into a dictionary of English words (e.g. using mod).
That solves the issue of it always being repeatable - the same word hashed each time will always map to the same English word (assuming your dictionary list does not change).
If individual companies were worried about dictionary attacks (i.e. the field name "firstname" would always map to, say, "Paris"), then we could also use a company-specific keyfile to salt the hash. This means that it would be repeatable for anonymised logfiles from them (i.e. "firstname" might always map to "Toulouse" for them), but it would not be the same as for other companies who use other keyfiles.
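The approach above can be sketched in Python as follows. This is only an illustration under a few assumptions: it uses HMAC-SHA256 with the company key instead of a salted MD5, the eight-entry WORDS list is a stand-in for a real dictionary (with so few words, collisions are guaranteed), and the function names are made up. The same keyed digest is also rendered as a dotted quad for the IP case from the question.

```python
import hashlib
import hmac

# Hypothetical word list; a real one needs thousands of entries to keep collisions rare.
WORDS = ["Jerrycan", "Doorway", "Fodmap", "Lizzy", "Nasbeth", "Paris", "Toulouse", "Absorbed"]

def _digest(key: bytes, value: str) -> bytes:
    """Keyed one-way digest: same key and value always give the same bytes."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()

def anonymise_field(name: str, key: bytes, words=WORDS) -> str:
    """Replace each dot-separated part of a field name with a word picked by mod."""
    parts = []
    for part in name.split("."):
        index = int.from_bytes(_digest(key, part), "big") % len(words)
        parts.append(words[index])
    return ".".join(parts)

def anonymise_ip(ip: str, key: bytes) -> str:
    """Render the first four digest bytes as something that looks like an IPv4 address."""
    return ".".join(str(b) for b in _digest(key, ip)[:4])

key = b"company-specific secret"  # hypothetical key material
print(anonymise_field("BusinessLogic.Categorisation.ExternalDeals", key))
print(anonymise_ip("192.168.1.55", key))
```

Two caveats: the number of distinct outputs is bounded by the word list size (or by 32 bits for the IP form), so collisions are possible and should be checked for, and anyone holding the key can rebuild the mapping by hashing guesses.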
I'm still very keen to see what other people can suggest, or whether they might have any thoughts on the above.
If you bind an api call to the object's id, could one simply brute force this api to get all objects? If you think of MySQL, this would be totally possible with incremental integer ids. But what about MongoDB? Are the ids guessable? For example, if you know one id, is it easy to guess other (next, previous) ids?
Thanks!
Update Jan 2019: As mentioned in the comments, the information below is true up until version 3.2. Version 3.4+ changed the spec so that machine ID and process ID were merged into a single random 5 byte value instead. That might make it harder to figure out where a document came from, but it also simplifies the generation and reduces the likelihood of collisions.
Original Answer:
+1 for Sergio's answer, in terms of answering whether they could be guessed or not, they are not hashes, they are predictable, so they can be "brute forced" given enough time. The likelihood depends on how the ObjectIDs were generated and how you go about guessing. To explain, first, read the spec here:
Object ID Spec
Let us then break it down piece by piece:
TimeStamp - completely predictable as long as you have a general idea of when the data was generated
Machine - this is an MD5 hash of one of several options, some of which are more easily determined than others, but highly dependent on the environment
PID - again, not a huge number of values here, and could be sleuthed for data generated from a known source
Increment - if this is a random number rather than an increment (both are allowed), then it is less predictable
To expand a bit on the sources. ObjectIDs can be generated by:
MongoDB itself (but can be migrated, moved, updated)
The driver (on any machine that inserts or updates data)
Your Application (you can manually insert your own ObjectID if you wish)
So, there are things you can do to make them harder to guess individually, but without a lot of forethought and safeguards, for a normal data set, the ranges of valid ObjectIDs should be fairly easy to work out since they are all prefixed with a timestamp (unless you are manipulating this in some way).
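To see how much an ObjectID gives away, here is a small Python sketch that splits one into its parts without any MongoDB driver. It assumes the standard 12-byte layout: a 4-byte big-endian timestamp, 5 middle bytes (machine plus PID in the older spec, a random value in the newer one) and a 3-byte counter. The sample id and the name split_object_id are illustrative.

```python
from datetime import datetime, timezone

def split_object_id(oid_hex: str) -> dict:
    """Split a 24-character hex ObjectID into timestamp, middle bytes and counter."""
    raw = bytes.fromhex(oid_hex)
    if len(raw) != 12:
        raise ValueError("an ObjectID is exactly 12 bytes")
    seconds = int.from_bytes(raw[0:4], "big")
    return {
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "middle": raw[4:9].hex(),
        "counter": int.from_bytes(raw[9:12], "big"),
    }

parts = split_object_id("507f1f77bcf86cd799439011")
print(parts["timestamp"].isoformat(), parts["middle"], parts["counter"])
```

Once one id is known, ids created on the same process around the same time differ only in the timestamp and the low counter bits, which is a small space to search.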
Mongo's ObjectId were never meant to be a protection from brute force attack (or any attack, for that matter). They simply offer global uniqueness. You should not assume that some object can't be accessed by a user because this user should not know its id.
For an actual protection of your resources, employ other techniques.
If you defend against an unauthorized access, place some authorization logic in your app (allow access to legitimate users, deny for everyone else).
If you want to hinder dumping all objects, use some kind of rate limiting. Combine with authorization if applicable.
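Both measures can be sketched in a few lines of Python. Everything here is hypothetical (the DOCUMENTS store, the owner field, the fetch function, and the fixed-window limiter); a real application would use its framework's authorization and throttling facilities.

```python
import time

class RateLimiter:
    """Fixed-window limiter: at most `limit` calls per `window` seconds per user."""

    def __init__(self, limit: int, window: float, clock=time.monotonic):
        self._limit = limit
        self._window = window
        self._clock = clock
        self._state = {}  # user -> (window start, calls made in this window)

    def allow(self, user: str) -> bool:
        now = self._clock()
        start, count = self._state.get(user, (now, 0))
        if now - start >= self._window:
            start, count = now, 0  # the old window expired, start a new one
        if count >= self._limit:
            self._state[user] = (start, count)
            return False
        self._state[user] = (start, count + 1)
        return True

# Hypothetical in-memory store keyed by document id.
DOCUMENTS = {"507f1f77bcf86cd799439011": {"owner": "alice", "body": "hello"}}

def fetch(user: str, doc_id: str, limiter: RateLimiter) -> dict:
    """Return a document only to its owner, and only within the rate limit."""
    if not limiter.allow(user):
        raise PermissionError("rate limit exceeded")
    doc = DOCUMENTS.get(doc_id)
    # Same error for "missing" and "not yours", so ids cannot be probed.
    if doc is None or doc["owner"] != user:
        raise PermissionError("not found or not permitted")
    return doc
```

With this in place, knowing or guessing an id is not enough: the caller also has to be the owner, and guessing in bulk runs into the limiter.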
Optional reading: Eric Lippert on GUIDs.
I've been wondering about this for some time. In CouchDB we have some fairly long IDs, e.g.:
"000ab56cb24aef9b817ac98d55695c6a"
Now, if we're searching for this item and going through the tree structure created by the view, it seems a simple integer as an id would be much faster. If we used 64bit integers it would be a simple CMP followed by a JMP (assuming that the Erlang code was using JIT, but you get my point).
For strings, I assume we generate a hash off the ID or something, but at some point we have to do a character compare on all 32 characters...won't that affect performance?
The short answer is, yes, of course it will affect performance, because the key length will directly impact the time it takes to walk down the tree.
It also affects storage, as longer keys take more space, space takes time.
However, the nuance you are missing is that while Couch CAN (and does) allocate new IDs for you, it is not required to. It will be more than happy to accept your own IDs rather than generate its own. So, if the key length bothers you, you are free to use shorter keys.
However, given the "json" nature of couch, it's pretty much a "text" based database. There isn't a lot of binary data stored in a normal Couch instance (attachments notwithstanding, but even those I think are stored in BASE64, I may be wrong).
So, while, yes, a 64-bit integer would be the most efficient, the simple fact is that Couch is designed to work for any key, and "any key" is most readily expressed in text.
Finally, truth be told, the cost of the key compare is dwarfed by the disk I/O fetch times, and the JSON marshaling of data (especially on writes). Any real gain achieved by converting to such a system would likely have no "real world" impact on overall performance.
If you really want to speed up the Couch key system, code the key routine to block the key into 64-bit longs, and compare those (like you said). 8 bytes of text is the same as a 64 bit "long int". That would give you, in theory, an 8x performance boost on key compares. Whether Erlang can create such code, I can't say.
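The blocking idea can be illustrated in Python (not Erlang, and not anything CouchDB does): a 32-character ASCII key is unpacked into four big-endian unsigned 64-bit integers, and comparing those tuples gives the same order as comparing the strings character by character. The names as_words and compare_keys are made up, and the sketch shows the equivalence only, not the speed-up.

```python
import struct

def as_words(key: str) -> tuple:
    """Pack a 32-character ASCII key into four big-endian unsigned 64-bit integers."""
    raw = key.encode("ascii")
    if len(raw) != 32:
        raise ValueError("expected a 32-character key")
    return struct.unpack(">4Q", raw)

def compare_keys(a: str, b: str) -> int:
    """Return -1, 0 or 1, using at most four integer comparisons."""
    wa, wb = as_words(a), as_words(b)
    return (wa > wb) - (wa < wb)

print(compare_keys("000ab56cb24aef9b817ac98d55695c6a",
                   "000ab56cb24aef9b817ac98d55695c6b"))  # -1
```

Big-endian packing is what makes this work: the first character ends up in the most significant byte, so integer order matches lexicographic order.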
From the book CouchDB: The Definitive Guide:
I need to draw a picture of this at some point, but the reason is if you think of the idealized btree, when you use UUID’s you might be hitting any number of root nodes in that tree, so with the append only nature you have to write each of those nodes and everything above it in the tree. but if you use monotonically increasing id’s then you’re invalidating the same path down the right hand side of the tree thus minimizing the number of nodes that need to be rewritten. would be just the same for monotonically decreasing as well. and it should technically work if you’re updates can be guaranteed to hit one or two nodes in the inside of the tree, though that’s much harder to prove.
So sequential IDs offer a performance benefit. However, you must remember that this isn't maintainable when you have more than one database, as the IDs will collide.
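One common compromise, sketched here in Python as an assumption rather than anything the book prescribes, is a roughly increasing id: a timestamp prefix keeps new ids on the right-hand side of the tree, while a random suffix keeps ids from different databases from colliding. The name time_ordered_id and the 14 + 18 hex character split are illustrative choices.

```python
import os
import time

def time_ordered_id(now_us=None) -> str:
    """Return a 32-character hex id: 14 hex digits of microseconds, then 18 random ones."""
    if now_us is None:
        now_us = time.time_ns() // 1000  # microseconds since the Unix epoch
    prefix = format(now_us, "014x")      # sorts by creation time
    suffix = os.urandom(9).hex()         # 72 random bits against collisions
    return prefix + suffix

print(time_ordered_id())
```

Ids created in the same microsecond, or on machines with skewed clocks, are not strictly ordered, so this gives mostly sequential inserts rather than a guarantee.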