How to access entries stored in MapDB BTreeMap in reverse order - mapdb

I want to know if there is any way by which I can access entries stored in MapDB BTreeMap in reverse order. I know I can use descendingMap() but it is very slow and it involves a lot of CPU operations. Is there any other faster way? The Key Value pairs are non primitive java types.

Got following reply from Jan Kotek, creator of MapDB,
There is bug open for
BTreeMap.descendingMap() iteration
performance. It will be fixed in MapDB 2.

Related

Turning unique filepath into unique integer

I often times use filepaths to provide some sort of unique id for some software system. Is there any way to take a filepath and turn it into a unique integer in relatively quick (computationally) way?
I am ok with larger integers. This would have to be a pretty nifty algorithm as far as I can tell, but would be very useful in some cases.
Anybody know if such a thing exists?
You could try the inode number:
fs.statSync(filename).ino
#djones's suggestion of the inode number is good if the program is only running on one machine and you don't care about a new file duplicating the id of an old, deleted one. Inode numbers are re-used.
Another simple approach is hashing the path to a big integer space. E.g. using a 128 bit murmurhash (in Java I'd use the Guava Hashing class; there are several js ports), the chance of a collision among a billion paths is still 1/2^96. If you're really paranoid, keep a set of the hash values you've already used and rehash on collision.
This is just my comment turned to an answer.
If you run it in the memory, you can use one of standard hashmaps in your corresponding language. Not just for file names, but for any similar situation. Normally, hashmaps in different programming languages are satisfying collisions by buckets, so the hash number and the corresponding bucket number will provide a unique id.
Btw, it is not hard to write your own hashmap, such that you have control on the underlying structure (e.g. to retrieve the number etc).

Keep list of hashs

I am working on a small project to keep my skills from completely rusting
I am generating a lot of hashes(in his case md5) and I need to check if I've seen that hash before so I wanted to keep it in a list
whats the best way to list them that I can look if they exist in pior to doing calculations
The hash itself is already a key of sorts. Your best bet is a hash table. In a properly implemented hash table, you can check for the existence of a key in constant time. Common hash table implementations with this feature are C# Dictionaries, Python's dict type, PHP array (which are actually Maps, not arrays), Perl's hashes % and Ruby's Hash. If you included details of what language you're working in, an example wouldn't be too hard to lookup.

Are MongoDB ids guessable?

If you bind an api call to the object's id, could one simply brute force this api to get all objects? If you think of MySQL, this would be totally possible with incremental integer ids. But what about MongoDB? Are the ids guessable? For example, if you know one id, is it easy to guess other (next, previous) ids?
Thanks!
Update Jan 2019: As mentioned in the comments, the information below is true up until version 3.2. Version 3.4+ changed the spec so that machine ID and process ID were merged into a single random 5 byte value instead. That might make it harder to figure out where a document came from, but it also simplifies the generation and reduces the likelihood of collisions.
Original Answer:
+1 for Sergio's answer, in terms of answering whether they could be guessed or not, they are not hashes, they are predictable, so they can be "brute forced" given enough time. The likelihood depends on how the ObjectIDs were generated and how you go about guessing. To explain, first, read the spec here:
Object ID Spec
Let us then break it down piece by piece:
TimeStamp - completely predictable as long as you have a general idea of when the data was generated
Machine - this is an MD5 hash of one of several options, some of which are more easily determined than others, but highly dependent on the environment
PID - again, not a huge number of values here, and could be sleuthed for data generated from a known source
Increment - if this is a random number rather than an increment (both are allowed), then it is less predictable
To expand a bit on the sources. ObjectIDs can be generated by:
MongoDB itself (but can be migrated, moved, updated)
The driver (on any machine that inserts or updates data)
Your Application (you can manually insert your own ObjectID if you wish)
So, there are things you can do to make them harder to guess individually, but without a lot of forethought and safeguards, for a normal data set, the ranges of valid ObjectIDs should be fairly easy to work out since they are all prefixed with a timestamp (unless you are manipulating this in some way).
Mongo's ObjectId were never meant to be a protection from brute force attack (or any attack, for that matter). They simply offer global uniqueness. You should not assume that some object can't be accessed by a user because this user should not know its id.
For an actual protection of your resources, employ other techniques.
If you defend against an unauthorized access, place some authorization logic in your app (allow access to legitimate users, deny for everyone else).
If you want to hinder dumping all objects, use some kind of rate limiting. Combine with authorization if applicable.
Optional reading: Eric Lippert on GUIDs.

Fastest Search and Insertion

I ve got a million objects . Which is the fastest way to lookup a particular object with name as key also the fastest way to perfrom insertion ? would hashing be sufficient?
Probably a hash table, assuming you don't need anything other than key based access. Make sure that the hashing of the key is good enough (as to minimise collisions) and the table is large enough (for the same reason).
It will depend on how often your need to do a lookup and how often you need to insert elements.
If you often have to insert elements then a linked list would perform better.
If you often have to search for elements, an hash table is more efficient. Perhaps, you can have both - your main data as a linked list, and an hash table which will serve as an index to the list.
You can also use a binary search tree. BST has the advantage of fast search and fast insertion too. Use the key to route your way in the tree and build the tree node to have the value.
Use BST in favor of hash tables if you are not sure about the balance of the operation (ie: looking up a key and value pairs, insertion, etc) and if you (based on your analysis) know that keys may collide frequently in the hash table (which will cause bad performance for the hash table).
Several Structures exist here that you can use. Each has it's advantage and disadvantage.
A HashTable will have a great lookup time and insertion time, provided you have a table that minimizes collision. If not, then lookup/insertion can lead to a lot more time.
A Binary Search Tree has ln(n) insertion and lookup, provided that it's balanced. Sometimes the balancing can cause the insertion to take a bit longer then ln(n), depending on the BST you go with.
Can go with B+ tree, it guarantees lesser search complexity ( since you reach leaf nodes fast, height = log n to base k, k = degree of nodes). The databases have similar requirement and they use B+ trees to maintain and retrieve data.

Performance of Long IDs

I've been wondering about this for some time. In CouchDB we have some fairly log IDs...eg:
"000ab56cb24aef9b817ac98d55695c6a"
Now if we're searching for this item and going through the tree structure created by the view. It seems a simple integer as an id would be much faster. If we used 64bit integers it would be a simple CMP followed by a JMP (assuming that the Erlang code was using JIT, but you get my point).
For strings, I assume we generate a hash off the ID or something, but at some point we have to do a character compare on all 33 characters...won't that affect performance?
The short answer is, yes, of course it will affect performance, because the key length will directly impact the time it takes to walk down the tree.
It also affects storage, as longer keys take more space, space takes time.
However, the nuance you are missing is that while Couch CAN (and does) allocated new IDs for you, it is not required to. It will be more than happy to accept your own IDs rather than generate it's own. So, if the key length bothers you, you are free to use shorter keys.
However, given the "json" nature of couch, it's pretty much a "text" based database. There's isn't a lot of binary data stored in a normal Couch instance (attachments not withstanding, but even those I think are stored in BASE64, I may be wrong).
So, while, yes an 64-bit would be the most efficient, the simple fact is that Couch is designed to work for any key, and "any key" is most readily expressed in text.
Finally, truth be told, the cost of the key compare is dwarfed by the disk I/O fetch times, and the JSON marshaling of data (especially on writes). Any real gain achieved by converting to such a system would likely have no "real world" impact on overall performance.
If you want to really speed up the Couch key system, code the key routine to block the key in to 64Bit longs, and comapre those (like you said). 8 bytes of text is the same as a 64 bit "long int". That would give you, in theory, an 8x performance boost on key compares. Whether erlang can create such code, I can't say.
From the CouchDB: The definitive guide book:
I need to draw a picture of this at
some point, but the reason is if you
think of the idealized btree, when you
use UUID’s you might be hitting any
number of root nodes in that tree, so
with the append only nature you have
to write each of those nodes and
everything above it in the tree. but
if you use monotonically increasing
id’s then you’re invalidating the same
path down the right hand side of the
tree thus minimizing the number of
nodes that need to be rewritten. would
be just the same for monotonically
decreasing as well. and it should
technically work if you’re updates can
be guaranteed to hit one or two nodes
in the inside of the tree, though
that’s much harder to prove.
So sequential IDs offer a performance benefit, however, you must remember this isn't maintainable when you have more than one database, as the IDs will collide.

Resources