Lagom Persistence | Cassandra | thenPersistAll throwing batch too large - cassandra

I am using Lagom Persistence with Cassandra and persisting events via thenPersistAll. When the service runs on a DC/OS cluster, an InvalidQueryException - Batch too large exception is occasionally thrown.
What might be the issue and how can it be resolved? Thanks in advance.

How many events are you attempting to persist at once, and what is the size of those events? Cassandra's batch limit is, I believe, measured in KB, and I think it defaults to 50 KB. If your events are much bigger than that, then you should consider whether you're using events correctly; generally events should be small, e.g. no more than a few KB when serialized to JSON.
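As a rough way to check whether your events are anywhere near that threshold, you can measure their JSON-serialized size before passing them to thenPersistAll. A minimal sketch, assuming Jackson is on the classpath; the MyEvent class is a hypothetical stand-in for one of your event classes:

```java
import com.fasterxml.jackson.databind.ObjectMapper;

public class EventSizeCheck {

    // Hypothetical stand-in for one of your Lagom event classes.
    static class MyEvent {
        public String id = "order-42";
        public String payload = "some payload";
    }

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Size of a single event once serialized to JSON. The combined size of all
    // events passed to one thenPersistAll call should stay well below Cassandra's
    // batch_size_fail_threshold_in_kb (50 KB by default).
    static int serializedSizeInBytes(Object event) throws Exception {
        return MAPPER.writeValueAsBytes(event).length;
    }

    public static void main(String[] args) throws Exception {
        System.out.printf("event is ~%d bytes as JSON%n",
                serializedSizeInBytes(new MyEvent()));
    }
}
```

If the sum over all events in one call approaches the threshold, split them across multiple persist calls or slim down the event payloads.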

Related

JanusGraph warm-up

JanusGraph does some internal activity that causes latency spikes during the first 1-2 runs (even at 10 QPS), and after that it becomes stable. Once it's stable, no spikes are observed. Is this expected behaviour, i.e. does it need a warm-up run to become stable? The back-end is Cassandra CQL.
Cassandra can exhibit behavior like this on startup. If there are files in the commitlog directory, Cassandra will validate them and replay their data to disk. If there are a lot of them, there can be a bit of an increase in resource consumption.

Is infinispan cache usable as a cluster cache of small resource nodes?

Suppose I have a lot of nodes with small memory and CPU resources, maybe 5 or maybe 20.
These nodes are not really reliable; they may be switched off by the user.
They all use a database of read-only master data, which is delivered by a Kafka topic that each node consumes.
What I want to achieve is to use Infinispan as a distributed (replicated) cache on top of the database used by the nodes, so that every node at any point in time has the same "view" of the read-only database.
Can I get this working, especially with low resources, and if yes, is there a link to an example to gain some experience?
Thanks
I don't think you can get a definite answer here; you need to try it out. I wouldn't call 5 - 20 CPUs small resources; there's not much going on in the background when you're not actively reading/writing the cache, so there shouldn't be any 'constant' overhead - just JGroups' heartbeat messages and such.
When using off-heap memory, Infinispan can be started with pretty small JVM heaps (24 MB IIRC, just for the POC), so you might be fine. However, if you replicate the database on every node, it's going to occupy some memory.
If the nodes often come and go, that could cause some churn on CPU. In replicated mode, nodes leaving won't matter too much, but when a node joins it will receive all the data (from the other nodes).
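For reference, a minimal sketch of a programmatically configured replicated cache that each node could run in front of the read-only data; the cache name and key/value types are made up, and off-heap storage can additionally be enabled via the memory() configuration (the exact API depends on the Infinispan version):

```java
import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class ReplicatedMasterDataCache {
    public static void main(String[] args) throws Exception {
        // Clustered cache manager; node discovery and messaging are handled by JGroups.
        DefaultCacheManager manager = new DefaultCacheManager(
                GlobalConfigurationBuilder.defaultClusteredBuilder().build());

        // Replicated cache: every node keeps a full copy of the read-only data,
        // so each node has the same "view" at any point in time.
        ConfigurationBuilder cfg = new ConfigurationBuilder();
        cfg.clustering().cacheMode(CacheMode.REPL_SYNC);
        manager.defineConfiguration("master-data", cfg.build());

        Cache<String, String> cache = manager.getCache("master-data");
        cache.put("some-key", "some-value"); // e.g. written by the Kafka consumer
        System.out.println(cache.get("some-key"));

        manager.stop();
    }
}
```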

Bulk loading in Cassandra, issue of dirty reads, and its effect in cluster

Our use case is to load bulk data into our live production Cassandra cluster on a daily basis. We came across sstableloader. We have a few questions around it:
1: When we are loading bulk data into our live production cluster using sstableloader, is there a chance of dirty reads? (Basically, does sstableloader load all the data at once, or does it keep updating as data arrives?) Dirty reads are not acceptable in our production environment.
2: When we are loading bulk data into our live production cluster, does it affect cluster availability? (Basically, since we are loading a huge amount of data into a live production cluster, does it affect its performance? Do we need to add cluster nodes to keep it highly available during bulk loading?)
3: If there is a possibility of dirty reads in a live production cluster using sstableloader, please suggest an alternate tool that avoids this issue. We want all the bulk data to appear at once, not incrementally.
Thanks!
sstableloader loads the data incrementally. It will not load everything in at once.
It will most definitely have an impact. How severe this impact is depends on the size of the data that is streamed in as well as many other factors. You can throttle the throughput with sstableloader's --throttle option, which might help in that regard. Run this use case on a test cluster and see the impact sstableloader has with your dataset.
There is not really a way to make this work without at least a small timeframe where the data is 'dirty', unless you are willing to take downtime.
For example, for the more adventurous, you could add the SSTables directly into the data folders of all your nodes and run nodetool refresh. However, this will not be exactly simultaneous and is therefore prone to dirty reads or failed reads for a short period of time.

Spark: Writing to DynamoDB, limited write capacity

My use case is to write to DynamoDB from a Spark application. As I have limited write capacity for DynamoDB and do not want to increase it because of the cost implications, how can I limit the Spark application to write at a regulated speed?
Can this be achieved by reducing the partitions to 1 and then executing foreachPartition()?
I already have auto-scaling enabled but don't want to increase it any further.
Please suggest other ways of handling this.
EDIT: This needs to be achieved when the Spark application is running on a multi-node EMR cluster.
Bucket scheduler
The way I would do this is to create a token bucket scheduler in your Spark application. The token bucket pattern is a common design for ensuring an application does not breach API limits. I have used this design successfully in very similar situations. You may find someone has written a library you can use for this purpose.
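As a sketch of that idea, Guava's RateLimiter (a token-bucket-style limiter) can be created inside each partition so every executor throttles its own writes. The per-partition rate and the putItemToDynamo helper below are hypothetical placeholders you would tune and fill in:

```java
import com.google.common.util.concurrent.RateLimiter;
import org.apache.spark.api.java.JavaRDD;

public class ThrottledDynamoWriter {

    // Hypothetical per-partition budget; tune so that
    // (partitions * rate) stays under your table's provisioned WCU.
    private static final double WRITES_PER_SECOND_PER_PARTITION = 25.0;

    public static void writeThrottled(JavaRDD<String> records) {
        records.foreachPartition(items -> {
            // One token-bucket limiter per partition; acquire() blocks until a token is free.
            RateLimiter limiter = RateLimiter.create(WRITES_PER_SECOND_PER_PARTITION);
            while (items.hasNext()) {
                String item = items.next();
                limiter.acquire();
                putItemToDynamo(item); // placeholder for the actual DynamoDB put
            }
        });
    }

    private static void putItemToDynamo(String item) {
        // Actual AmazonDynamoDB.putItem(...) call would go here.
    }
}
```

With a per-partition limiter you do not need to collapse the data to a single partition; you just have to account for the number of concurrent partitions when choosing the rate.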
DynamoDB retry
Another (less attractive) option would be to increase the number of retries on your DynamoDB connection. When your write does not succeed because the provisioned throughput was exceeded, you can essentially instruct your DynamoDB SDK to keep retrying for as long as you like. Details in this answer. This option may appeal if you want a 'quick and dirty' solution.
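For instance, with the AWS SDK for Java v1 the retry ceiling can be raised through the client configuration; the value 20 below is just an illustrative assumption:

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.retry.PredefinedRetryPolicies;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;

public class RetryingDynamoClient {

    public static AmazonDynamoDB build() {
        // Keep DynamoDB's default backoff strategy but allow many more retries, so
        // ProvisionedThroughputExceededException is retried instead of failing the task.
        ClientConfiguration cfg = new ClientConfiguration()
                .withRetryPolicy(PredefinedRetryPolicies
                        .getDynamoDBDefaultRetryPolicyWithCustomMaxRetries(20));

        return AmazonDynamoDBClientBuilder.standard()
                .withClientConfiguration(cfg)
                .build();
    }
}
```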

OpsCenter graphs are slow to refresh, can I configure the refresh rate?

I am new to OpsCenter and trying to get a feel for the metric graphs. The graphs seem slow to refresh and I'm trying to determine if this is a configuration issue on my part or simply what to expect.
For example, I have a three node Cassandra test cluster created via CCM. OpsCenter and the node Agents were configured manually.
I have graphs on the dashboard for Read and Write Requests and Latency. I'm running a JMeter test that inserts 100k rows into a Cassandra table (via REST calls to my webapp) over the course of about 5 minutes.
I have both OpsCenter and VisualVM open. When the test kicks off, VisualVM graphs immediately start showing the change in load (via Heap and CPU/GC graphs), but the OpsCenter graphs lag behind and are slow to update. I realize I'm comparing different metrics (i.e. Heap vs Write Requests), but I would expect to see some immediate indication in OpsCenter that a load is being applied.
My environment is as follows:
Cassandra: dsc-cassandra-2.1.2
OpsCenter: opscenter-5.1.0
Agents: datastax-agent-5.1.0
OS: OSX 10.10.1
Currently, metrics are collected every 60 seconds, plus there's a (albeit very small) overhead from inserting them into C*, reading them back on the OpsCenter server side, and pushing them to the UI.
The OpsCenter team is working on both improving metrics collection in general and delivering realtime metrics, so stay tuned.
By the way, comparing VisualVM and OpsCenter in terms of latencies is not quite correct, since OpsCenter has to do a lot more work to both collect and aggregate those metrics due to its distributed nature (and also because VisualVM is so close to the meta^WJVM ;)
