Sorting an LMDB file for sequential access according to key order - lmdb

I have LMDB files (usually about 20GB, but possibly larger) with several thousand key-value pairs each. The keys were not inserted in lexicographic order, and I would like to know whether there is a simple command to reorder an LMDB file by the lexicographic order of its keys, so that reading the data in that order translates into sequential read access.
Thanks a lot!

LMDB internally stores keys in lexicographic order, irrespective of the order in which they are inserted.
If you don't want keys sorted lexicographically, you can supply a custom comparison function with mdb_set_compare().
Key sorting and the mdb_set_compare() function are documented at the link below.
mdb_set_compare() function documentation
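Since LMDB already keeps keys sorted, a plain cursor walk visits every pair in key order, which is also the sequential layout on disk. A minimal sketch with the py-lmdb binding (the "./mydb" path is a placeholder):

    import lmdb

    # Open an existing environment read-only; "./mydb" is a placeholder path.
    env = lmdb.open("./mydb", readonly=True, lock=False)

    with env.begin() as txn:
        # A cursor yields (key, value) pairs in sorted key order, so a full
        # scan is already a sequential walk of the B+tree.
        for key, value in txn.cursor():
            print(key)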

Related

Speeding up Python load times for a read-only dictionary?

I am currently working on a bioinformatics project that involves a dictionary of about 10 million unique keys, each of which returns a subset of categorical strings.
I currently unpickle a dictionary object, but my main issue is that unpickling takes a very long time. I also need to iterate through a file, generating a set of keys (~200) for each row, look up those keys, append each result list to a list of lists, and then flatten that list to build a Counter of value frequencies for the row. I have heard that an SQL-like database structure would end up trading load time for lookup time.
The file containing the keys typically has about 100k rows, so this was my best solution; however, even on faster PCs with more RAM, more cores, and NVMe storage, the time spent loading the database is extremely slow.
I was wondering which direction (a different database structure, alternatives to pickle such as shelve or marshal, or parallelizing the code with multiprocessing) would provide an overall speed-up for my code, either through faster load times, faster lookups, or both.
Specifically: I need to create databases of the form key (a DNA sub-sequence) -> value [A, B, C, Y, Z], on the order of 1e6-1e7 entries.
When used, the database is loaded, and then, given a query file (1e6 DNA sequences to query), all the sub-sequences in each sequence are looked up as follows (a sketch of this loop is shown after the list):
For each query:
slice the sequence into subsequences;
look up each subsequence and return the list of categoricals for it;
aggregate the lists using collections.Counter.
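A minimal sketch of that per-query loop, assuming a fixed subsequence length k and a toy in-memory dict (all names here are illustrative):

    from collections import Counter

    # Toy database: subsequence -> list of categorical labels.
    db = {"ACG": ["A", "B"], "CGT": ["B", "C"], "GTA": ["C"]}

    def query(sequence, k=3):
        # Slice the sequence into overlapping k-mers.
        subseqs = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
        counts = Counter()
        for s in subseqs:
            # Each hit contributes its whole list of categoricals.
            counts.update(db.get(s, ()))
        return counts

    print(query("ACGTA"))  # Counter({'B': 2, 'C': 2, 'A': 1})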
I was wondering how to either:
speed up the loading time of the database, through a better data structure or some other optimization, or
generally improve the speed of the run itself (querying subsequences).
I'm not sure there is a single right answer here, since there are trade-offs, but two options come to mind.
1st: consider using pandas.DataFrame for the data structure.
It allows serialization/deserialization to many formats (I believe CSV should be the fastest, but SQL is worth a try). As for query time, it should be much faster than a dict for complex queries.
2nd: a key-value store, such as MongoDB, which has map-reduce and other advanced query capabilities; in this case the data is always available, with no load times.
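A minimal sketch of the DataFrame option, assuming the mapping is flattened to one row per (subsequence, label) pair so lookups become indexed .loc calls (file name and column names are illustrative):

    import pandas as pd

    # Flatten the dict into one row per (subsequence, label) pair.
    pairs = [("ACG", "A"), ("ACG", "B"), ("CGT", "B"), ("CGT", "C")]
    df = pd.DataFrame(pairs, columns=["subseq", "label"]).set_index("subseq")

    # Serialize once, reload on later runs instead of re-unpickling a dict.
    df.to_csv("db.csv")
    df = pd.read_csv("db.csv").set_index("subseq")

    # Indexed lookup of all labels for one subsequence.
    print(df.loc["ACG", "label"].tolist())  # ['A', 'B']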

How to have objects in a Redis sorted set?

There are sorted sets available in Redis. How can I have objects in sorted sets? What I need is a sorted set of objects.
I need to store JSON structures and edit individual properties inside them. What I found is Redis hashes; however, I can't do in-order searches with them the way I can with sorted sets.
Redis does not support this feature by default. I had the same issue some time back, and we came up with a hybrid, simple data structure that allows sorted objects using Redis hashes and Redis sorted sets.
What we did was store the objects in Redis hashes and keep a list of all the hash keys in a sorted set. This allows us to get all the objects that come after some key, or between two keys. It also allows us to search under topics.
Implementation details: http://www.malinga.me/redis-sorted-object-set-sorted-hashes/
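A minimal sketch of that hybrid with redis-py; the key names, and the use of equal scores so ZRANGEBYLEX can range over ids, are assumptions about one way to realize it:

    import redis

    r = redis.Redis()

    def put(obj_id, fields):
        # Store the object's properties in a hash, editable field by field.
        r.hset(f"obj:{obj_id}", mapping=fields)
        # Index the id in a sorted set; with equal scores, members sort
        # lexicographically and can be ranged with ZRANGEBYLEX.
        r.zadd("obj:index", {obj_id: 0})

    def between(lo, hi):
        # All objects whose ids sort between lo and hi (inclusive).
        ids = r.zrangebylex("obj:index", f"[{lo}", f"[{hi}")
        return [r.hgetall(f"obj:{i.decode()}") for i in ids]

    put("apple", {"color": "red", "price": "1.2"})
    put("banana", {"color": "yellow", "price": "0.5"})
    print(between("a", "c"))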

Composite partition key (Cassandra) vs. interleaved indexes (Accumulo, BigTable) for time-spatial series

I'm working on a project in which we import 50k - 100k datapoints every day, located both temporally (YYYYMMDDHHmm) and spatially (lon, lat), which we then dynamically render onto maps according to the query parameters set by our users. We do use pre-computed clusters below a given zoom level.
Within this context and given the fact that we're in the process of selecting a database engine for our storage layer, I'm currently evaluating Cassandra and BigTable's variants.
Specifically, I'm trying to understand the difference between using composite partition keys in Cassandra vs. interleaved index keys in BigTable, such as the one GeoMesa uses.
As far as I understand, both these approaches can leverage COTS hardware and can be tuned to reduce hotspotting and maximize space-filling.
What are the logical steps I should follow in order to discriminate between the two? Even though I am planning to test both approaches in the near future, I'd first like to hear a more reasoned and educated perspective.
GeoMesa actually supports both BigTable clones (like Accumulo) and Cassandra. The Cassandra support is, at the time of writing, in an early phase. The README has a description of the indexing scheme.
Both implementations use Z2 or Z3 interleaved indexes (depending on whether the index is spatial only or spatio-temporal). The BigTable-clone indexing puts the full-resolution Z3 into the primary key, and queries are just range scans over the sorted keys.
Cassandra requires that partition keys be explicitly enumerated (unless you're doing full table scans). Because of that fact, GeoMesa's Cassandra indexing uses composite keys to spread the information across both the partition key and the range key. The partition key is a coarse spatio-temporal key that buckets the world into NxN cells; the range key is the full-resolution Z3 interleaved index. Queries are decomposed into an enumeration of the overlapping buckets (partition key) and the Z3 ranges within each bucket (range key).
Having to enumerate the partition keys can cause a lot of network chattiness when satisfying a query, so choosing the bucket resolution well is key to reducing that chattiness.
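An illustrative Python sketch of the spatial (Z2) half of this scheme; it is not GeoMesa's code, and the grid and bucket resolutions are assumptions:

    def interleave2(x, y, bits=16):
        # Bit-interleave two integers into one Z-order (Morton) value:
        # x fills the even bits, y the odd bits.
        z = 0
        for i in range(bits):
            z |= ((x >> i) & 1) << (2 * i)
            z |= ((y >> i) & 1) << (2 * i + 1)
        return z

    def keys(lon, lat, n=16, bits=16):
        # Normalize lon/lat onto an integer grid at full resolution.
        gx = int((lon + 180) / 360 * ((1 << bits) - 1))
        gy = int((lat + 90) / 180 * ((1 << bits) - 1))
        # Partition key: coarse NxN bucket; range key: full-resolution Z2.
        partition = (gx * n // (1 << bits), gy * n // (1 << bits))
        return partition, interleave2(gx, gy, bits)

    pk, z = keys(-0.1276, 51.5072)
    print(pk, z)  # bucket (7, 12) plus the full-resolution Z value

A spatio-temporal Z3 key would interleave a time bucket as a third dimension in the same way.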

Is intensive updating of a Map-type column in Cassandra an anti-pattern?

Friends,
I am modeling a table in Cassandra that contains a Map column. The Map holds dynamic values and will be updated very frequently for a given row (I update by primary key).
Is this an anti-pattern? Which other options should I consider?
What you're trying to do is possibly what I described here.
The first big limitations that come to mind are the ones given by the specification:
64KB is the max size of an item in a collection
65536 is the max number of queryable elements inside a collection
Beyond those, there are the problems described in another post:
you cannot retrieve part of a collection: even though each map entry is internally stored as a column, you can only retrieve the whole collection (this can lead to very poor performance)
you have to choose whether to create an index on keys or on values; indexing both simultaneously is not supported
since maps are typed, you can't put mixed values inside: you have to represent everything as strings or bytes and then transform your data client-side
I personally consider this approach an anti-pattern for all these reasons: it provides a schema-less solution but reduces performance and introduces many limitations, such as those around secondary indexes and typing.
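One common alternative, sketched below with the DataStax Python driver, is to promote the map entries to clustering columns so that each entry is its own row and can be read or updated individually (the keyspace and table names are illustrative, and the "demo" keyspace is assumed to exist):

    import uuid
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo")  # assumes keyspace exists

    # Each map entry becomes its own row: reads and updates can target a
    # single (id, map_key) pair instead of rewriting a whole collection.
    session.execute("""
        CREATE TABLE IF NOT EXISTS attributes (
            id uuid,
            map_key text,
            map_value text,
            PRIMARY KEY (id, map_key)
        )
    """)

    some_id = uuid.uuid4()
    session.execute(
        "UPDATE attributes SET map_value = %s WHERE id = %s AND map_key = %s",
        ("red", some_id, "color"),
    )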
HTH, Carlo

How to access entries stored in MapDB BTreeMap in reverse order

I want to know if there is any way to access entries stored in a MapDB BTreeMap in reverse order. I know I can use descendingMap(), but it is very slow and involves a lot of CPU operations. Is there a faster way? The key-value pairs are non-primitive Java types.
I got the following reply from Jan Kotek, the creator of MapDB:
There is a bug open for BTreeMap.descendingMap() iteration performance. It will be fixed in MapDB 2.
