MapDB: is there any limitation on the number of maps?

I'm using MapDB 1.0.7 and my question is: is there any limit on the number of maps I can create with a single DB instance, or is it theoretically unlimited?
I've tested creating 1,000,000 maps with 25 entries in each of them, and this works fine. But is there any limit?
Kind regards
Thorsten

Just got the answer in the MapDB Google group:
There is no limit, but each opened map consumes just under 1 KB of heap. So I would recommend keeping the count in the thousands, or at most the low millions (a million open maps already costs on the order of 1 GB of heap).
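For reference, a minimal sketch of what such a test could look like against the MapDB 1.0 API; the file name, map names, and counts are illustrative only:

import java.io.File;
import java.util.Map;
import org.mapdb.DB;
import org.mapdb.DBMaker;

public class ManyMapsTest {
    public static void main(String[] args) {
        // A single file-backed DB instance holding many named maps.
        DB db = DBMaker.newFileDB(new File("manymaps.db"))
                .closeOnJvmShutdown()
                .make();

        // Each opened map costs a little heap (roughly 1 KB) while the DB is open.
        for (int i = 0; i < 1_000_000; i++) {
            Map<Integer, String> map = db.getTreeMap("map_" + i);
            for (int j = 0; j < 25; j++) {
                map.put(j, "value_" + j);
            }
        }

        db.commit(); // persist all pending changes
        db.close();
    }
}

At roughly 1 KB per open map, a run like this needs on the order of 1 GB of heap just for the map objects, which matches the advice above.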

Related

How to speed up a search on a large collection of text files (1 TB)

I have a collection of text files containing anonymised medical data (age, country, symptoms, diagnosis, etc.). This data goes back at least 30 years, so as you can imagine I have quite a large data set. In total I have around 20,000 text files totalling approx. 1 TB.
Periodically I will need to search these files for occurrences of a particular string (not a regex). What is the quickest way to search through this data?
I have tried using grep and recursively searching through the directory as follows:
LC_ALL=C fgrep -r -i "searchTerm" /Folder/Containing/Files
The only problem with doing the above is that it takes hours (sometimes half a day!) to search through this data.
Is there a quicker way to search through this data? At the moment I am open to different approaches such as databases, Elasticsearch, etc. If I do go down the database route, I will have approx. 1 billion records.
My only requirements are:
1) The search will be happening on my local computer (dual-core CPU and 8 GB RAM)
2) I will be searching for strings (not regex).
3) I will need to see all occurrences of the search string and the file it was within.
There are a lot of answers already, I just wanted to add my two cents:
Having this much data (1 TB) with just 8 GB of memory will not be good enough for any approach, whether you use Lucene, Elasticsearch (which internally uses Lucene), or some grep command, if you want faster search. The reason is simple: all these systems hold data in the fastest memory available in order to serve queries quickly, and out of 8 GB (of which you should reserve ~25% for the OS and another 25-50% for other applications), you are left with very few GB of RAM.
Upgrading to an SSD and increasing the RAM of your system will help, but it is quite cumbersome, and if you hit performance issues again it will be difficult to keep scaling your system vertically.
Suggestion
I know you already mentioned that you want to do this on your own system, but as I said it wouldn't give any real benefit and you might end up wasting a lot of time (on infrastructure and code, given the many approaches mentioned in the various answers). I would therefore suggest the top-down approach described in my other answer for determining the right capacity; it would help you quickly identify the capacity required by whatever approach you choose.
Implementation-wise, I would suggest doing it with Elasticsearch (ES), as it is very easy to set up and scale. You can even use AWS Elasticsearch, which is available in the free tier, and scale it up quickly later on. Although I am not a big fan of AWS ES, it saves a lot of setup time, and you can get started quickly if you are already familiar with ES.
To make search faster, you can split each file into multiple fields (title, body, tags, author, etc.) and index only the important fields, which reduces the size of the inverted index. If you are looking only for exact string matches (no partial or full-text search), you can simply use the keyword field type, which is even faster to index and search.
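As an illustration, a minimal exact-match setup against the Elasticsearch REST API could look like the sketch below, using only the standard JDK HTTP client (Java 11+; text blocks need Java 15+). The index name and field names are hypothetical, and it assumes an Elasticsearch instance on localhost:9200:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ExactMatchSearch {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();
    private static final String ES = "http://localhost:9200";

    public static void main(String[] args) throws Exception {
        // Index only the important fields; "keyword" means exact match, no full-text analysis.
        String mapping = """
            { "mappings": { "properties": {
                "file":      { "type": "keyword" },
                "country":   { "type": "keyword" },
                "diagnosis": { "type": "keyword" } } } }""";
        System.out.println(send("PUT", "/medical_records", mapping));

        // Exact string match with a term query; return only the file name of each hit.
        String query = """
            { "query": { "term": { "diagnosis": "searchTerm" } }, "_source": ["file"] }""";
        System.out.println(send("POST", "/medical_records/_search", query));
    }

    private static String send(String method, String path, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(ES + path))
                .header("Content-Type", "application/json")
                .method(method, HttpRequest.BodyPublishers.ofString(json))
                .build();
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}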
I could go on about why Elasticsearch is good and how to optimize it, but that's not the crux. The bottom line is that any search will need a significant amount of memory, CPU, and disk, and any one of them becoming a bottleneck would hamper both your local searches and your other applications. Hence I advise you to seriously consider doing this on an external system, and Elasticsearch really stands out here, as it is meant for distributed systems and is the most popular open-source search system today.
You clearly need an index, as almost every answer has suggested. You could totally improve your hardware but since you have said that it is fixed, I won’t elaborate on that.
I have a few relevant pointers for you:
Index only the fields in which you want to find the search term, rather than indexing the entire dataset;
Create a multilevel index (i.e. an index over the index) so that your index searches are quicker. This will be especially relevant if your index grows to more than 8 GB;
I wanted to recommend caching your searches as an alternative, but this would still cause a new search to take half a day. So preprocessing your data to build an index is clearly better than processing the data as the queries come in; a minimal sketch of such a preprocessing pass follows below.
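To make the "preprocess into an index" idea concrete, here is a toy, in-memory sketch (plain Java, all names mine) of an inverted index mapping terms to the files that contain them. A real 1 TB corpus would not fit such an index in 8 GB of RAM, so in practice it would be persisted (Lucene, a database, etc.), but the lookup principle is the same:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Stream;

public class SimpleInvertedIndex {
    // term -> set of files that contain it
    private final Map<String, Set<Path>> index = new HashMap<>();

    // One-time preprocessing pass over every file under the root directory.
    public void build(Path root) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile).forEach(this::indexFile);
        }
    }

    private void indexFile(Path file) {
        try (Stream<String> lines = Files.lines(file)) {
            lines.flatMap(line -> Stream.of(line.toLowerCase().split("\\W+")))
                 .filter(term -> !term.isEmpty())
                 .forEach(term -> index.computeIfAbsent(term, t -> new HashSet<>()).add(file));
        } catch (IOException | UncheckedIOException e) {
            System.err.println("Skipping unreadable file: " + file);
        }
    }

    // A search is now a single map lookup instead of a scan over 1 TB of text.
    public Set<Path> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Set.of());
    }
}

The build pass is expensive but happens once; every later search is effectively instant.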
Minor update:
A lot of answers here suggest putting the data in the cloud. I'd highly recommend, even for anonymised medical data, that you confirm with the source (unless you scraped the data from the web) that it is OK to do so.
To speed up your searches you need an inverted index. To be able to add new documents without re-indexing all existing files, the index should be incremental.
One of the first open source projects that introduced incremental indexing is Apache Lucene. It is still the most widely used indexing and search engine, although other tools that extend its functionality are more popular nowadays. Elasticsearch and Solr are both based on Lucene. But as long as you don't need a web frontend, analytical querying, filtering, grouping, support for indexing non-text files, or the infrastructure for a cluster setup over multiple hosts, Lucene is still the best choice.
Apache Lucene is a Java library, but it ships with a fully functional, command-line-based demo application. This basic demo should already provide all the functionality that you need.
With some Java knowledge it would also be easy to adapt the application to your needs. You will be surprised how simple the source code of the demo application is. If Java isn't your language of choice, its Python wrapper, PyLucene, may also be an alternative. The indexing done by the demo application is already reduced nearly to the minimum. By default no advanced functionality is used, such as stemming or optimization for complex queries - features you most likely will not need for your use case, but which would increase the size of the index and the indexing time.
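For a sense of how small the moving parts are, here is a rough sketch of indexing and searching with the Lucene Java API. The index path, field names, sample content, and query term are mine, and minor details vary between Lucene versions:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(Paths.get("lucene-index"));

        // Incremental indexing: opening an existing index and appending to it.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("path", "/data/records/file-0001.txt", Field.Store.YES));
            doc.add(new TextField("contents", "age 42 country DE diagnosis ...", Field.Store.NO));
            writer.addDocument(doc);
        }

        // Search, printing the file each hit came from.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("contents", "diagnosis")), 50);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path"));
            }
        }
    }
}

Each file in the corpus would become one document with a stored path field, so every hit can report the file it was found in.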
I see 3 options for you.
1. You should really consider upgrading your hardware; an HDD -> SSD upgrade can multiply the speed of search several times over.
2. Increase the speed of searching the data in place (without an index).
You can refer to this question for various recommendations. The main idea of this method is to optimize the CPU load, but you will still be limited by your HDD speed. The maximum speed multiplier is the number of cores you have.
3. You can index your dataset.
Because you are working with texts, you would need a full-text search database. Elasticsearch and Postgres are good options.
This method requires more disk space (but usually less than 2x the space, depending on the data structure and the list of fields you want to index).
This method will be infinitely faster (searches take seconds).
If you decide to use this method, select the analyzer configuration carefully to match what is considered a single word for your task (here is an example for Elasticsearch).
It's worth covering the topic at two levels: the approach, and the specific software to use.
Approach:
Based on the way you describe the data, it looks like pre-indexing will provide significant help. Pre-indexing performs a one-time scan of the data and builds a compact index that makes it possible to perform quick searches and identify where specific terms appear in the repository.
Depending on the queries, the index will reduce or completely eliminate the need to search through the actual documents, even for complex queries like 'find all documents where AAA and BBB appear together'.
Specific Tool
The hardware that you describe is relatively basic. Running complex searches would benefit from large-memory, multi-core hardware. There are excellent solutions out there - Elasticsearch, Solr, and similar tools can do magic, given strong hardware to support them.
I believe you want to look into two options, depending on your skills and the data (it would help if the OP could share a sample of the data):
* Build your own index, using a light-weight database (SQLite, PostgreSQL) - a minimal sketch follows at the end of this answer - OR
* Use a light-weight search engine.
For the second approach, with the described hardware, I would recommend looking into 'glimpse' (and the supporting agrep utility). Glimpse provides a way to pre-index the data, which makes searches extremely fast. I've used it on big data repositories (a few GB, but never TB).
See: https://github.com/gvelez17/glimpse
Clearly, it is not as modern and feature-rich as Elasticsearch, but it is much easier to set up and it is server-less. The main benefit for the use case described by the OP is the ability to scan existing files without having to load the documents into an extra search-engine repository.
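For the first option (building your own index on top of a light-weight database), a rough sketch with SQLite over plain JDBC might look like this. It assumes the sqlite-jdbc driver is on the classpath; the table layout, file path, and term are illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SqliteTermIndex {
    public static void main(String[] args) throws Exception {
        // One row per (term, file) pair, filled in during a one-time preprocessing pass.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:term_index.db")) {
            conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS term_file (term TEXT, file TEXT)");
            conn.createStatement().execute(
                "CREATE INDEX IF NOT EXISTS idx_term ON term_file(term)");

            try (PreparedStatement insert =
                     conn.prepareStatement("INSERT INTO term_file(term, file) VALUES (?, ?)")) {
                insert.setString(1, "diagnosis");
                insert.setString(2, "/data/records/file-0001.txt");
                insert.executeUpdate();
            }

            // A later search is a single indexed lookup instead of a 1 TB scan.
            try (PreparedStatement query =
                     conn.prepareStatement("SELECT DISTINCT file FROM term_file WHERE term = ?")) {
                query.setString(1, "diagnosis");
                try (ResultSet rs = query.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("file"));
                    }
                }
            }
        }
    }
}

During preprocessing you would insert one (term, file) row per distinct term per file; afterwards each search is a single indexed SELECT.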
Could you think about ingesting all this data into Elasticsearch, if it has a consistent data structure format?
If yes, below are the quick steps:
1. Install Filebeat on your local computer.
2. Install Elasticsearch and Kibana as well.
3. Export the data by making Filebeat send all of it to Elasticsearch.
4. Start searching it easily from Kibana.
FSCrawler might help you index the data into Elasticsearch. After that, normal Elasticsearch queries can serve as your search engine.
I think caching the most recently searched medical data might help performance-wise, instead of going through the whole 1 TB each time. You could use Redis or Memcached for this.
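As an illustration only, a cache-aside sketch with the Jedis client could look like this; the key scheme, TTL, and the slowSearch placeholder are all hypothetical:

import redis.clients.jedis.Jedis;

public class CachedSearch {
    private static final int ONE_DAY_SECONDS = 86_400;

    // Hypothetical slow search over the 1 TB corpus (e.g. shelling out to grep or hitting an index).
    private static String slowSearch(String term) {
        return "results for " + term;
    }

    public static String search(String term) {
        try (Jedis redis = new Jedis("localhost", 6379)) {
            String key = "search:" + term;
            String cached = redis.get(key);    // cache hit: skip the expensive scan
            if (cached != null) {
                return cached;
            }
            String results = slowSearch(term); // cache miss: do the full search once
            redis.setex(key, ONE_DAY_SECONDS, results);
            return results;
        }
    }
}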

Spark broadcast variables: large maps

I am broadcasting a large Map (~6-10 GB). I am using sc.broadcast(prod_rdd) to do that. However, I am not sure whether broadcasting is meant only for small data/files and not for larger objects like the one I have. If it is the former, what is the recommended practice? One option is to use a NoSQL database and then do the lookup against that. One issue with that is that I might have to give up performance, since I will be going through a single node (a region server or whatever the equivalent of that is). If anyone has any insight into the performance impact of these design choices, it would be greatly appreciated.
I'm wondering if you could perhaps use mapPartitions and read the map once per partition rather than broadcasting it?
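A rough sketch of that idea with Spark's Java API, assuming the lookup table can be (re)loaded from some external store on each executor; loadLookupTable is a hypothetical placeholder:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;

public class PartitionLookup {

    // Hypothetical loader: reads the big lookup table from an external store
    // (HDFS, a key-value database, ...) instead of shipping it as a broadcast variable.
    private static Map<String, String> loadLookupTable() {
        return new HashMap<>();
    }

    public static JavaRDD<String> enrich(JavaRDD<String> keys) {
        return keys.mapPartitions((Iterator<String> partition) -> {
            // Loaded once per partition, not once per record, and never broadcast.
            Map<String, String> lookup = loadLookupTable();
            List<String> out = new ArrayList<>();
            while (partition.hasNext()) {
                String key = partition.next();
                out.add(key + "=" + lookup.getOrDefault(key, "unknown"));
            }
            return out.iterator();
        });
    }
}

The table is read once per partition instead of being serialized and shipped to every executor as a 6-10 GB broadcast variable.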

How to access entries stored in MapDB BTreeMap in reverse order

I want to know if there is any way to access the entries stored in a MapDB BTreeMap in reverse order. I know I can use descendingMap(), but it is very slow and involves a lot of CPU work. Is there any faster way? The key-value pairs are non-primitive Java types.
Got the following reply from Jan Kotek, creator of MapDB:
There is a bug open for BTreeMap.descendingMap() iteration performance. It will be fixed in MapDB 2.
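Until that fix lands, one possible workaround is to keep a second map created with a reversed key comparator, so that plain forward iteration already returns entries in descending key order. The sketch below relies on my recollection of the MapDB 1.0 maker API (createTreeMap(...).comparator(...).makeOrGet()), so treat the exact method names as an assumption and check them against your version; Long keys are used only for illustration:

import java.util.Collections;
import java.util.concurrent.ConcurrentNavigableMap;
import org.mapdb.DB;
import org.mapdb.DBMaker;

public class ReverseOrderWorkaround {
    public static void main(String[] args) {
        DB db = DBMaker.newMemoryDB().make();

        // A mirror map kept in descending key order, so forward iteration is enough
        // and descendingMap() is never needed. (Maker method names assumed from the 1.0 API.)
        ConcurrentNavigableMap<Long, String> byKeyDesc = db.createTreeMap("byKeyDesc")
                .comparator(Collections.reverseOrder())
                .makeOrGet();

        byKeyDesc.put(1L, "first");
        byKeyDesc.put(2L, "second");

        // Prints 2 -> second, then 1 -> first: forward iteration, reverse key order.
        byKeyDesc.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}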

How to keep an object unique

I have a static DataTable (with 80k records) in Common.DLL, and that Common.DLL is referenced by 10 Windows services. So, instead of having 10 copies of that DataTable in memory, I want to have a single copy with all the services pointing to that one data source. Is this approach possible?
Given that the services will at least be using different AppDomains, and quite possibly different processes, sharing the same data between all of them would be tricky.
I would personally suggest that you just don't worry about it - unless each record is actually pretty large, 80K records is still going to be fairly small.
You could potentially have an 11th service which is the only one to have the data, and then talk to that service from the other ones. But that's introducing a lot of complexity for very little benefit.
One way of potentially saving memory would be to use a List<T> for a custom type, instead of a DataTable - that may well be more efficient, and would almost certainly be more pleasant to use within the code. It doesn't help if you really need DataTable for whatever you're doing with it, but personally I try to avoid that...
You could create a WCF service that is hosted locally and reads the 80k records into memory.
You would then define an API on the WCF service that contains methods appropriate to whatever calls your 10 windows services need to make.
Doing this would add a level of complexity to your solution that may well not be needed though.
Use the Singleton pattern. Here's a detailed tutorial with C#.

Is there any way to skip rows when I retrieve from Azure table storage?

I believe in the past the answer to this question was no. However, has anything changed with the recent releases, or does anyone know of a way I can do this? I am using DataTables and would love to be able to do something like skip 50 and retrieve 50 rows, skip 100 and retrieve 50 rows, etc.
It is still not possible to skip rows. The only navigation construct supported is top. The Table Service REST API is the definitive way to access Windows Azure Storage, so its documentation is the go-to location for what is or is not possible.
What you're asking here is possible using continuation tokens. Scott Densmore blogged about this a while ago to explain how you can use continuation tokens for paging when you're displaying a table (like what you're asking here with DataTables): Paging with Windows Azure Table Storage. The blog post shows how to display pages of 3 items while using continuation tokens to move forward and back between pages.
Besides that there's also Steve's post that describes the same concept: Paging Over Data in Windows Azure Tables
Yes (kinda) and no. No, in the sense that the Skip operation is not directly supported at the REST head. You could of course do it in memory, but that would defeat the purpose.
However, you can of course actually do this pattern if you structure your data correctly. We do something like this ourselves. We align our partition key to the datetime and use the RowKey as a discriminator. This means we can always pinpoint the partition range we are interested in and then Take() some amount of data. So, for example, we can easily Take() the first 20 rows per hour by specifying a unique query (skipping over data we don't want). The partition key is simply aligned per hour and then we optionally discriminate further using the RowKey - finally, we just take data. When executed in parallel, this works just dandy.
Again, the more technically correct answer is NO. However, you can approximate it cleverly using the PK and RK.
