I have a Map that uses MapStore. In this way, some objects are not loaded into Memory. How I can search a required object if it isn't in memory?
Is 'read-through' feature working for queries?
You can query Hazelcast for data held in Hazelcast or for data external to Hazelcast using the same SQL,
SELECT * FROM etc..
For the latter, see documentation link.
Unfortunately, there is not currently an implementation for Mongo. So for now you are blocked, sorry.
Read-through (or query-through) would also require the remote store have the same format as the IMap, which is not otherwise required for MapStore.
If you can't host all your Mongo data in Hazelcast (which eliminates the need to query Mongo), then you could perhaps consider some sort of fly-weight design pattern and perhaps hold a projection.
Related
Can someone explain and provide the document that explains the behavior of
select * from <keyspace.table>
Let's assume I have 5 node cluster, how does Cassandra DataStax Driver behave when such queries are being issued?
(Fetchsize was set to 500)
Is this a proper way to pull data ? Does it cause any performance issues?
No, that's really a very bad way to pull data. Cassandra shines when it fetches the data by at least partition key (that identifies a server that holds the actual data). When you are doing the select * from table, request is sent to coordinating node, that will need to pull all data from all servers and send via that coordinating node, overloading it, and most probably lead to the timeout if you have enough data in the cluster.
If you really need to perform full fetch of the data from cluster, it's better to use something like Spark Cassandra Connector that read data by token ranges, fetching the data directly from nodes that are holding the data, and doing this in parallel. You can of course implement the token range scan in Java driver, something like this, but it will require more work on your side, comparing to use of Spark.
Can I inject a sharding algorithm to wither Cassandra or Couchbase?
Or do they decide where each document go to?
For instance if I want to pin data to shards by one of the data properties.
Couchbase hash the key of the document to decide in which shard(vBucket) the document should be associated with. The SDK also uses the same algorithm to find out in which shard the document is located when you want to retrieve the document by its key.
One of the problems of letting developers decide on the sharding algorithm is that sometimes they end up with an excessive number of documents in a single shard, and naturally, this shard becomes the bottleneck of the application.
One of the core concepts in Couchbase is that the documents are (almost) evenly distributed between all shards, so I am not familiar with any native support to insert your own algorithm there.
Cassandra decides where the data goes by the partition key. So if you use the data you want to use as the "pin" as the partition key then it will accomplish what your asking for I think. However, you don't pick the replicas explicitly and it can change as hosts are removed and added to the cluster.
Does NeDB use the disk to do find queries?
Or are find queries 100% using RAM based data structures.
I need to do intensive work using find queries and I don't want to put work on my hard disk.
(As a bonus track, do SQLite find queries also work 100% in memory?)
Depends, if you create indexes on fields, those fields will be held in memory for faster lookup and access. Unindexed field lookups will be hitting disk. This is true for SQLite, as well as most other persistent databases (e.g PostgreSQL, MySQL, MongoDB, and many more)
NeDB is an in-memory database, which means that all data will be held in memory (similar to Redis). That being said, you still have to index fields for faster lookup.
If you want to create indexes on fields in NeDB, they have documentation for that here.
For NeDB, the _id field is automatically indexed, so you don't have to create an index for that field and querying for _id will be substantially faster than querying for another unindexed field.
When we want to implement the MapStore interface we only have the method loadAll for initializing the map. Therefore you have to provide a set of keys to load into the map. How do you handle the situation when you have date / time as primary key. Intuitively one would define a key range where tst between a and b. But since we only can provide a Set we have to pre-fetch all the possible date-time values (via SQL or whatever). And the next time the IMap will start to hammer the database fetching every key one by one. Is this the best approach? Isn't there a more convenient way to do this?
My advice would be to stop thinking about the maps as if they were tables in a relational database. Try to think in terms that conform to the semantics of a Map (if you are using a Map as there are other distributed collections in Hazelcast). For example, you have to keep in mind that you can only make queries about objects that are available in memory as the semantics of query applies only to the case in which Hazelcast is used as a data grid and not as a cache. If semantics is the use of a cache, you should limit your access by key, as you would proceed with a traditional map in Java.
For example, when it comes to a data grid, you must think that access to the database will occur typically only to respond to disaster recovery scenarios. Therefore, the initial data loading from disk to memory may strongly hit the database, but that would only occur in cases of recovery, so that it is not such a major handicap. In the case of use of caching, yes it would be important to be very efficient when planning your persistence strategy since access to the database will be more frequent.
If you provide further information about what it's your particular use case, especially regarding eviction policies, maybe I might help you more.
Hope this helps.
I am looking for Cassandra/CQL's cousin of the common SQL idiom of INSERT INTO ... SELECT ... FROM ... and have been unable to find anything to do such an operation programmatically or in CQL. Is it just not supported?
My use case is to do a reasonably bulky copy from one table to another. I don't need any particular concurrent guarantees, but it's a lot of data so I'd like to avoid the additional network overhead of writing a client that retrieves data from one table, then issues batches of inserts into the other table. I understand that the changes will still need to be transported between nodes of the Cassandra cluster according to the replication set-up, but it seems reasonable for there to be an "internal" option to do a bulk operation from one table to another. Is there such a thing in CQL or elsewhere? I'm currently using Hector to talk to Cassandra.
Edit: it looks like sstableloader might be relevant, but is awfully low-level for something that I'd expect to be a fairly common use case. Taking just a subset of rows from one table to another also seems less than trivial in that framework.
Correct, this is not supported natively. (Another alternative would be a map/reduce job.) Cassandra's API focuses on short requests for applications at scale, not batch or analytical queries.