Defining a Hazelcast MapStore for Key-Ranges - hazelcast

When implementing the MapStore interface, the only method we have for initializing the map is loadAll, so you have to provide a set of keys to load into the map. How do you handle the situation where the primary key is a date/time? Intuitively one would define a key range where tst is between a and b, but since we can only provide a Set, we have to pre-fetch all the possible date-time values (via SQL or whatever). Then the IMap will start to hammer the database, fetching the value for every key one by one. Is this the best approach? Isn't there a more convenient way to do this?

My advice would be to stop thinking about the maps as if they were tables in a relational database. Try to think in terms of the semantics of a Map (assuming you are using a Map, as there are other distributed collections in Hazelcast). For example, keep in mind that you can only query objects that are available in memory: query semantics apply only when Hazelcast is used as a data grid, not as a cache. If you are using it as a cache, you should limit your access to lookups by key, just as you would with a traditional Java map.
For example, in the data-grid case, access to the database will typically occur only in disaster-recovery scenarios. The initial data load from disk to memory may hit the database hard, but since that only happens during recovery, it is not such a major handicap. In the caching case, it does become important to plan your persistence strategy carefully, since access to the database will be more frequent.
If you provide further information about your particular use case, especially regarding eviction policies, I may be able to help you more.
Hope this helps.
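For completeness, here is a minimal sketch of the key-range loading described in the question, assuming Hazelcast 4.x/5.x, a JDBC DataSource, and a hypothetical EVENTS(TST TIMESTAMP PRIMARY KEY, PAYLOAD VARCHAR) table (all names are illustrative). The loader exposes only the keys in a configured range via loadAllKeys and batches the value fetch in loadAll, so the initial population does not hit the database one key at a time:

import com.hazelcast.map.MapLoader;

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Read-only loader: only advertises keys inside [from, to] and fetches
// values in batches so the map does not call load(key) once per key.
public class EventRangeLoader implements MapLoader<Timestamp, String> {

    private final DataSource dataSource;
    private final Timestamp from;
    private final Timestamp to;

    public EventRangeLoader(DataSource dataSource, Timestamp from, Timestamp to) {
        this.dataSource = dataSource;
        this.from = from;
        this.to = to;
    }

    @Override
    public Iterable<Timestamp> loadAllKeys() {
        // Pre-fetch only the keys in the configured range; everything
        // outside the range stays on disk.
        List<Timestamp> keys = new ArrayList<>();
        try (Connection c = dataSource.getConnection();
             PreparedStatement ps = c.prepareStatement(
                     "SELECT TST FROM EVENTS WHERE TST BETWEEN ? AND ?")) {
            ps.setTimestamp(1, from);
            ps.setTimestamp(2, to);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    keys.add(rs.getTimestamp(1));
                }
            }
        } catch (SQLException e) {
            throw new IllegalStateException("loadAllKeys failed", e);
        }
        return keys;
    }

    @Override
    public Map<Timestamp, String> loadAll(Collection<Timestamp> keys) {
        // One round trip for the whole batch instead of one query per key
        // (a real implementation would chunk very large key sets).
        Map<Timestamp, String> result = new HashMap<>();
        String placeholders = String.join(",", Collections.nCopies(keys.size(), "?"));
        try (Connection c = dataSource.getConnection();
             PreparedStatement ps = c.prepareStatement(
                     "SELECT TST, PAYLOAD FROM EVENTS WHERE TST IN (" + placeholders + ")")) {
            int i = 1;
            for (Timestamp key : keys) {
                ps.setTimestamp(i++, key);
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    result.put(rs.getTimestamp(1), rs.getString(2));
                }
            }
        } catch (SQLException e) {
            throw new IllegalStateException("loadAll failed", e);
        }
        return result;
    }

    @Override
    public String load(Timestamp key) {
        return loadAll(Collections.singletonList(key)).get(key);
    }
}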

Related

Queries in Hazelcast

I have a Map that uses MapStore, so some objects are not loaded into memory. How can I search for an object if it isn't in memory?
Does the 'read-through' feature work for queries?
You can query Hazelcast for data held in Hazelcast or for data external to Hazelcast using the same SQL,
SELECT * FROM etc.
For the latter, see the documentation.
Unfortunately, there is not currently an implementation for Mongo. So for now you are blocked, sorry.
Read-through (or query-through) would also require the remote store to have the same format as the IMap, which is not otherwise required for MapStore.
If you can't host all your Mongo data in Hazelcast (which would eliminate the need to query Mongo), then you could perhaps consider some sort of flyweight design pattern and hold a projection.
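As a rough illustration (not Mongo-specific), this is what querying an IMap with Hazelcast SQL looks like, assuming Hazelcast 5.x with the SQL module available; the map name "events" and its Long/String entries are hypothetical. Note that the query only sees entries currently in memory, per the read-through limitation above:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.sql.SqlResult;
import com.hazelcast.sql.SqlRow;

public class ImapSqlQuery {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        try {
            // Expose the IMap "events" (Long -> String) to the SQL engine.
            hz.getSql().execute(
                "CREATE MAPPING IF NOT EXISTS events "
              + "TYPE IMap OPTIONS ('keyFormat' = 'bigint', 'valueFormat' = 'varchar')");

            hz.getMap("events").put(1L, "Alpha");
            hz.getMap("events").put(2L, "Beta");

            // Only entries already in memory are visible; there is no
            // query-through to the backing MapStore.
            try (SqlResult result = hz.getSql()
                    .execute("SELECT __key, this FROM events WHERE this LIKE ?", "A%")) {
                for (SqlRow row : result) {
                    System.out.println(row.getObject("__key") + " -> " + row.getObject("this"));
                }
            }
        } finally {
            hz.shutdown();
        }
    }
}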

What are the advantages of using "new style" Cassandra paging state over "old style" token functions?

I understand that there are two ways of iterating over a large result set in Cassandra:
Querying explicitly with tokens, as discussed in this article on "Displaying rows from an unordered partitioner with the TOKEN function". This appears to have been the only way of doing things prior to Cassandra 2.0.
Using "paging state".
Paging state appears to be the suggested way of doing things these days, but doing it the old token way still works.
Aside from it being the blessed way of doing things, which is of course an advantage in itself, I'd love to understand the particular advantages of the "new" method over the "old". Is there a reason I should not use token in this way?
Whether to use paging or tokens really depends on your requirements and technical abilities. From my point of view, paging is good for fetching data from a big partition, or when there is not much data in the table, so you can just use select * from table.
But if you have multiple servers in the cluster and large amounts of data, using token allows you to read data from specific servers (if you set the routing key correctly) and in parallel (the Spark Cassandra Connector uses token for exactly this reason). This is a big advantage over paging, where a single coordinator node has to go to the other nodes for the data it doesn't hold. For some people it's not that easy to implement, though, because you need to cover edge cases, such as a token range that doesn't start exactly at the minimum value. I have an example in Java of how to do it if you need it.
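A minimal sketch of that token-based full scan, assuming the DataStax Java driver 3.x and a hypothetical table ks.events(id, payload); unwrap() takes care of the edge-case range that wraps around the minimum token:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Metadata;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.TokenRange;

public class TokenRangeScan {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            Metadata metadata = cluster.getMetadata();
            PreparedStatement ps = session.prepare(
                    "SELECT id, payload FROM ks.events WHERE token(id) > ? AND token(id) <= ?");
            // One query per token range; ranges can be processed in parallel
            // (and, with token-aware routing, sent to the servers owning each slice).
            for (TokenRange range : metadata.getTokenRanges()) {
                // unwrap() splits the range that wraps around the minimum token.
                for (TokenRange subRange : range.unwrap()) {
                    ResultSet rs = session.execute(ps.bind()
                            .setToken(0, subRange.getStart())
                            .setToken(1, subRange.getEnd()));
                    for (Row row : rs) {
                        System.out.println(row.getObject("id") + " -> " + row.getString("payload"));
                    }
                }
            }
        }
    }
}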
I agree with Alex's answer. I will add that with the old-school approach (tokens) you keep your hands on the tokens themselves. This means that if you are dealing with a large amount of data you can, for example, save checkpoints, handle restarts after failure cleanly, pause your job, or launch multi-threaded jobs and give each worker its own slice of the data; the way Spark workers split up data is token-based too.
The driver handles paging automatically for you, so you don't have to fetch pages yourself, with all the advantages of a natively handled mechanism. Using tokens, on the other hand, gives you full control over how you paginate, with all the advantages that come with it (targeting a specific range, a specific server).
I hope this helps!

Inject Custom Sharding in Cassandra or Couchbase

Can I inject a sharding algorithm into either Cassandra or Couchbase?
Or do they decide where each document goes?
For instance, what if I want to pin data to shards by one of the data properties?
Couchbase hashes the key of the document to decide which shard (vBucket) the document should be associated with. The SDK also uses the same algorithm to find out which shard the document is located in when you want to retrieve the document by its key.
One of the problems of letting developers decide on the sharding algorithm is that sometimes they end up with an excessive number of documents in a single shard, and naturally, this shard becomes the bottleneck of the application.
One of the core concepts in Couchbase is that the documents are (almost) evenly distributed between all shards, so I am not familiar with any native support to insert your own algorithm there.
Cassandra decides where the data goes by the partition key. So if you use the data you want as the "pin" as the partition key, then it will accomplish what you're asking for, I think. However, you don't pick the replicas explicitly, and they can change as hosts are removed from and added to the cluster.
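As a small illustration of the Cassandra side, assuming the DataStax Java driver 3.x and a hypothetical shop.orders table: the property you want to pin by simply becomes the partition key, so all rows sharing that value land on the same set of replicas:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class PartitionKeyPinning {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS shop WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            // country is the partition key (the "pin"); order_id keeps rows unique.
            session.execute("CREATE TABLE IF NOT EXISTS shop.orders ("
                    + "country text, order_id uuid, total decimal, "
                    + "PRIMARY KEY (country, order_id))");
            // Both rows hash the same partition key value, so they are stored
            // on the same replicas (which you still don't pick explicitly).
            session.execute("INSERT INTO shop.orders (country, order_id, total) "
                    + "VALUES ('DE', uuid(), 10.00)");
            session.execute("INSERT INTO shop.orders (country, order_id, total) "
                    + "VALUES ('DE', uuid(), 25.50)");
        }
    }
}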

Pros and Cons of Cassandra User Defined Functions

I am using Apache Cassandra to store mostly time series data, and I am grouping the data and aggregating/counting it based on some conditions. At the moment I am doing this in a Java 8 application, but with the release of Cassandra 3.0 and User Defined Functions, I have been asking myself whether extracting the grouping and aggregation/counting logic to Cassandra is a good idea. To my understanding this functionality is something like stored procedures in SQL.
My concern is whether this will impact the computation performance and the overall performance of the database. I am also not sure if there are other issues with it, and whether this new feature is something like secondary indexes in Cassandra - you can use them, but it is not recommended at all.
Have you used user defined functions in Cassandra? Do you have any observations on the performance? What are the good and bad sides of this new functionality? Is it applicable in my use case?
You can compare it to using count() or avg() kinds of aggregations. They can save you a lot of network traffic and object creation/GC by having the coordinator send only the result, but it's easy to get carried away and make the coordinator do a lot of work. This extra work takes away from normal C* duties, and can just as likely increase GCs as reduce them.
If you're aggregating 100 rows in a partition it's probably fine, and if you're aggregating 10,000 it's probably not the end of the world if it's very rare. If you're calling it once a second, though, it's a problem. If you're aggregating over 1,000 rows I would be very careful.
If you absolutely need to do it, and it's a lot of data, often, you may want to create dedicated proxy coordinators (-Djoin_ring=false) to bear the brunt of the load without impacting normal C* reads/writes. At that point it's just as easy to create a dedicated workload DC for it or something (with RF=0 for your keyspace, and the application set to be part of that DC with DCAwareRoundRobinPolicy). This is also the point where using Spark is probably the right thing to do.
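For reference, a minimal sketch of what a user-defined aggregate looks like, assuming the DataStax Java driver 3.x, a keyspace ks, a hypothetical measurements table, and user-defined functions enabled in cassandra.yaml. It is the classic average-style aggregate: the state and final functions run on the coordinator for every row it pulls in, which is exactly the extra coordinator work discussed above:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class UdaSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("ks")) {

            // State function: called once per aggregated row on the coordinator.
            session.execute(
                "CREATE OR REPLACE FUNCTION avg_state(state tuple<int, bigint>, val int) "
              + "CALLED ON NULL INPUT RETURNS tuple<int, bigint> LANGUAGE java AS "
              + "'if (val != null) { state.setInt(0, state.getInt(0) + 1); "
              + "state.setLong(1, state.getLong(1) + val.intValue()); } return state;'");

            // Final function: turns the accumulated state into the result.
            session.execute(
                "CREATE OR REPLACE FUNCTION avg_final(state tuple<int, bigint>) "
              + "CALLED ON NULL INPUT RETURNS double LANGUAGE java AS "
              + "'if (state.getInt(0) == 0) return null; "
              + "return Double.valueOf((double) state.getLong(1) / state.getInt(0));'");

            session.execute(
                "CREATE OR REPLACE AGGREGATE average(int) "
              + "SFUNC avg_state STYPE tuple<int, bigint> FINALFUNC avg_final INITCOND (0, 0)");

            // Keep the number of aggregated rows per call modest, as advised above, e.g.:
            // SELECT average(reading) FROM measurements WHERE sensor_id = <some id>;
        }
    }
}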

Do I use Azure Table Storage or SQL Azure for our CQRS Read System?

We are about to implement the Read portion of our CQRS system in-house with the goal being to vastly improve our read performance. Currently our reads are conducted through a web service which runs a Linq-to-SQL query against normalised data, involving some degree of deserialization from an SQL Azure database.
The simplified structure of our data is:
User
Conversation (Grouping of Messages to the same recipients)
Message
Recipients (Set of Users)
I want to move this into a denormalized state, so that when a user requests to see a feed of messages it reads from EITHER:
A denormalized representation held in Azure Table Storage
UserID as the PartitionKey
ConversationID as the RowKey
Any volatile data prone to change stored as entities
The messages serialized as JSON in an entity
The recipients of said messages serialized as JSON in an entity
The main problem with this is the limited size of a row in Table Storage (960 KB)
Also, any queries on the "volatile data" columns will be slow, as they aren't part of the key
A normalized representation held in Azure Table Storage
Different table for Conversation details, Messages and Recipients
Partition keys for message and recipients stored on the Conversation table.
Other than that, this follows the same structure as above
Gets around the maximum row size issue
But will the normalized state reduce the performance gains of a denormalized table?
OR
A denormalized representation held in SQL Azure
UserID & ConversationID held as a composite primary key
Any volatile data prone to change stored in separate columns
The messages serialized as JSON in a column
The recipients of said messages serialized as JSON in a column
Greatest flexibility for indexing and the structure of the denormalized data
Much slower performance than Table Storage queries
What I'm asking is whether anyone has experience implementing a denormalized structure in Table Storage or SQL Azure, and which they would choose. Or is there a better approach I've missed?
My gut says the normalized (at least to some extent) data in Table Storage would be the way to go; however, I am worried that having to conduct 3 queries to grab all the data for a user will reduce the performance gains.
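To make the first option concrete, here is roughly what that denormalized entity could look like with the azure-data-tables Java SDK; the SDK choice, table name, property names, and connection string are all assumptions for illustration. Reading one conversation is a point lookup, and reading the whole feed is a query over PartitionKey = UserID:

import com.azure.data.tables.TableClient;
import com.azure.data.tables.TableClientBuilder;
import com.azure.data.tables.models.TableEntity;

public class UserFeedTableSketch {
    public static void main(String[] args) {
        TableClient client = new TableClientBuilder()
                .connectionString(System.getenv("STORAGE_CONNECTION_STRING"))
                .tableName("UserFeed")               // assumed to already exist
                .buildClient();

        // PartitionKey = UserID, RowKey = ConversationID, JSON blobs as properties.
        TableEntity conversation = new TableEntity("user-123", "conv-456")
                .addProperty("LastMessageUtc", java.time.Instant.now().toString())
                .addProperty("MessagesJson", "[{\"from\":\"user-789\",\"body\":\"hi\"}]")
                .addProperty("RecipientsJson", "[\"user-123\",\"user-789\"]");
        client.upsertEntity(conversation);

        // Single point lookup by PartitionKey/RowKey; no joins, no secondary indexes.
        TableEntity read = client.getEntity("user-123", "conv-456");
        System.out.println(read.getProperties().get("MessagesJson"));
    }
}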
Your primary driver for considering Azure Tables is to vastly improve read performance, and in your scenario using SQL Azure is "much slower" according to your last point under "A denormalized representation held in SQL Azure". I personally find this very surprising for a few reasons and would ask for detailed analysis on how this claim was made. My default position would be that under most instances, SQL Azure would be much faster.
Here are some reasons for my skepticism of the claim:
SQL Azure uses the native/efficient TDS protocol to return data; Azure Tables use JSON format, which is more verbose
Joins / Filters in SQL Azure will be very fast as long as you are using primary keys or have indexes in SQL Azure; Azure Tables do not have indexes and joins must be performed client side
Limitations on the number of records returned by Azure Tables (1,000 records at a time) mean you need to make multiple roundtrips to fetch many records
Although you can fake indexes in Azure Tables by creating additional tables that hold a custom-built index, you own the responsibility of maintaining that index, which will slow your operations and possibly create orphan scenarios if you are not careful.
Last but not least, using Azure Tables usually makes sense when you are trying to reduce your storage costs (it is cheaper than SQL Azure) and when you need more storage than SQL Azure can offer (although you can now use Federations to break the single-database maximum storage limitation). For example, if you need to store 1 billion customer records, using Azure Tables may make sense. But using Azure Tables for increased speed alone seems rather suspicious to me.
If I were in your shoes I would question that claim very hard and make sure you have expert SQL development skills on staff that can demonstrate you are reaching performance bottlenecks inherent of SQL Server/SQL Azure before changing your architecture entirely.
In addition, I would define what your performance objectives are. Are you looking at 100x faster access times? Did you consider caching instead? Are you using indexing properly in your database?
My 2 cents... :)
I won't try to argue about the exact definition of CQRS. As we are talking about Azure, I'll use its docs as a reference. From there we can find that:
CQRS doesn't necessarily require that you use a separate read storage.
For greater isolation, you can physically separate the read data from the write data.
"you can" doesn't mean "you must".
About denormalization and read optimization:
Although
The read model of a CQRS-based system provides materialized views of the data, typically as highly denormalized views
the key point is
the read database can use its own data schema that is optimized for queries
It can be a different schema, but it can still be normalized or at least not "highly denormalized". Again - you can, but that doesn't mean you must.
More than that, if your performance is poor due to write locks and not because of heavy SQL requests:
The read store can be a read-only replica of the write store
And when we talk about query optimization, it's better to talk about the queries themselves, and less about storage types.
About "it reads from either" [...]
The Materialized View pattern describes generating prepopulated views of data in environments where the source data isn't in a suitable format for querying, where generating a suitable query is difficult, or where query performance is poor due to the nature of the data or the data store.
Here the key point is that views are plural.
A materialized view can even be optimized for just a single query.
...
Materialized views tend to be specifically tailored to one, or a small number of queries
So your choice is not between those 3 options. It's actually much wider.
And again, you don't need another storage to create views. All can be done inside a single DB.
About
My gut says the normalized (at least to some extent) data in Table Storage would be the way to go; however, I am worried that having to conduct 3 queries to grab all the data for a user will reduce the performance gains.
Yes, of course, performance will suffer! (Also consider the matter of consistency.) But whether it will be OK or not, you can never be sure until you test it, with your data and your requests, because the delays in data transfers can actually be smaller than the time required for some elaborate SQL request.
So all boils down to:
What features do you need and which of them Table Storage and/or SQL Azure have?
And then, how much will it cost?
These you can only answer yourself. And these choices have little to do with performance. Because if there is a suitable index in either of those, I believe the performance will be virtually indistinguishable.
To sum up:
SQL Azure or Azure Table Storage?
For different requests and data you can, and probably should, use both. But there is too little information in the question to give you an exact answer (we would need an exact query for that). But I agree with @HerveRoggero - most probably you should stick with SQL Azure.
I am not sure if I can add any value to the other answers, but I want to draw your attention toward modeling the data storage based on your query paths. Are you going to query all the mentioned data bits together? Is the user going to ask for some of it as additional information after a click or something? I am assuming that you have thought about this question already, and you are positive that you want to query everything in one go, i.e., the API or something needs to return all this information at once.
In that case, nothing will beat querying a single object by key. If you are talking about Azure's Table Storage specifically, it says right there that it's a key-value store. I am curious whether you have considered a document database (e.g. Cosmos DB) instead? If you are implementing CQRS read models, you could generate a single document per user that has all the information a user sees on a feed. You query that document by user id, which would be the key. This approach would be the optimal CQRS implementation in my mind because, after all, you are aiming to implement read models. Unless I have misinterpreted something in your question, or you have strong reasons not to go with a document database.
