Cassandra as an embedded service and with custom consistency level - cassandra

I am thinking of building an application that uses Cassandra as its data store, but has low latency requirements. I am aware of EmbeddedCassandraService from this blog post
Is the following implementation possible and what are known pitfalls (defects, functional limitations)?
1) Run Cassandra as an embedded service, persisting data to disk (durable).
2) Java application interacts with local embedded service via one of the following. What are the pros
TMemoryBuffer (or something more appropriate?)
StorageProxy (what are the pitfalls of using this API?)
Apache Avro? (see question #5 below)
3) Java application interacts with remote Cassandra service ("backup" nodes) via Thrift (or Avro?).
4) Write must always succeed to the local embedded Cassandra service in order to be successful, and at least one of the remote (non-embedded) Cassandra nodes. Is this possible? Is it possible to define a custom / complex consistency level?
5) Side-question: Cassandra: The Definitive Guide mentions in several places that Thrift will ultimately be replaced with Avro, but seems like that's not the case just yet?
As you might guess, I am new to Cassandra, so any direction to specific documentation pages (not the wiki homepage) or sample projects are appreciated.

Unless your entire database is sitting on the local machine (i.e. a single node), you gain nothing by this configuration. Cassandra will shard your data across the cluster, so (as mentioned in one of the comments) your writes will frequently be made to another node that owns the data. Presuming you write with a consistency level of at least one, your call will block until that other node acks the write. This negates any benefit of talking to the embedded instance since you have some network latency anyway.

Related

Limiting Cassandra query syntax for clients

We plan to use Cassandra 3.x and we want to allow our customers to connect to Cassandra directly for exporting the data into their data warehouses.
They will connect via ODBC from remote.
Is there any way to prevent that the customer executes huge or bad SELECT statements that will result in a high load for all nodes? We use an extra data center in our replication strategy where only customers can connect, so live system will not be affected. But we want to setup some workers that will run on this shadow system also. Most important thing is, that a connected remote client will not have any noticable impact on other remote connections or our local worker jobs. There is a materialized view already and I want to force customers to get data based on primary key only (i.e. disallow usage of ALLOW FILTERING). It would be great also, if one can limit the number of rows returned (e.g. 1 million) to prevent a pull of all data.
Is there a best practise for this use case?
I know of BlackRocks video related to multi-tenant strategy in C* which advises to use tenant_id in schema. That is what we're doing already, but how can I ensure security/isolation via ODBC connected tenants/customers? Or do I have to write an API on my own which handles security?
I would recommend to expose access via API, not via ODBC - at least you would have greater control on what is executed, and enforce tenant_id, and other checks, like limits, etc. You can try to utilize the Cassandra's CQL parser to decompose query, and put all required things back.
Theoretically, you can could utilize Apache Calcite, for example. It has implementation of JDBC driver that could be used, plus there is existing Cassandra adapter that you can modify to accomplish your task (mapping authentication into tenant_ids, etc.), but this will be quite a lot of work.

Are there any benefits of using hazel cast over Cassandra

I want to implement session store for my web application. Here's the profile of my application.
The information associated with a session does not change a lot, but
it does change sometimes.
Session reads(session.getAttribute()) are more frequent than writes(session.setAttribute()).
I don't want to deal with master node based architecture(like redis).
Data associated with a session is small, but the number of sessions could be large.
The lookup is always in the form of key value like in hash map.
I am OK with eventual consistency.
I want to be able to specify the replication factor. i.e. number of nodes that will hold data for a given session
I am only looking for open source solutions that wouldn't incur license cost for above features.
For now I want to store upto 10,000 sessions with 10kb data per session(on average), but eventually I want to scale to 100,000 sessions or more!
In my app hazelcast is already being used for some other functionality. But I don't want that to be the deciding factor. Cassandra seems to fulfill all my requirements and it seems to be quite popular. Any reason I should chose hazelcast over cassandra?
Disclaimer: Hazelcast employee
In general I would argue that if you can exchange Hazelcast with Cassandra OR Cassandra with Hazelcast, one of the tools is misused.
We have plenty of people using them as companions, meaning Cassandra as the storage layer and Hazelcast as the caching layer, Cassandra, however, is not a cache and Hazelcast is not a database.
If you want to persist your storages to disk, go for Cassandra (maybe add caching with Hazelcast), if you just want to distribute, go with Hazelcast. Latter case especially if it "doesn't really matter" if you loose sessions once in a while if you (for some reason or another) restart the cluster.
We use both of them in our project. We use Cassandra as persistent storage and Hazelcast for temporary and frequently changed data (e.g. distributed queues and synchronization primitives).
Any reason I should chose hazelcast over cassandra?
In my opinion, Hazelcast is easier from developer point of view and it does not require as much attention as Cassandra on production environment (deep configuration and tunning, repeair, restarting...), so Hazelcast is cheaper in support.

Need architecture hint: Data replication into the cloud + data cleansing

I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data I need to do some data analytics.
To achieve this goal, I'm searching for an open source framework or cloud solution I can use for. I took a look into Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint which frameworks you would use for such an task?
From my quick read on APEX it requires Hadoop underneath coupling to more dependencies than you probably want early on.
Kafka on the other hand is used for transmitting messages (it has other APIs such as streams and connect which im not as familiar with).
Im currently using Kafka to stream log files in real time from a client system. Out of the box Kafka really only provides fire and forget semantics. I have had to add a bit to make it an exactly once delivery semantic (Kafka 0.11.0 should solve this).
Overall, think of KAFKA being a more low level solution with logical message domains with queues and from what I skimmed over APEX being a more heavy packaged library with alot more things to explore.
Kafka would allow you to switch out the underlying analytical system of your choosing with their consumer api.
The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which on the cloud it can quickly build up. Of course, the size of data is also important.
These are a few things you should consider:
batch vs streaming: do the updates flow continuously, or the process is run on demand/periodically (sounds the latter rather than the former)
what's the latency required ? That is, what's the maximum time that it would take an update to propagate through the system ? Answer to this question influences question 1)
how much data are we talking about ? If you're up the Gbyte size, Tbyte or Pbyte ? Different tools have different 'maximum altitude'
and what format ? Do you have text files, or are you pulling from relational DBs ?
Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning on using to do that part ? Depending on question 3), data size, deduping usually requires a join by ID, which is done in constant time in a key value store, but requires a sort (generally O(nlogn)) in most other data systems (spark, hadoop, etc)
So, while you ponder all this questions, if you're not sure, I'd recommend you start your cloud work with an elastic solution, that is, pay as you go vs setting up entire clusters on the cloud, which could quickly become expensive.
One cloud solution that you could quickly fire up is amazon athena (https://aws.amazon.com/athena/). You can dump your data in S3, where it's read by Athena, and you just pay per query, so you don't pay when you're not using it. It is based on Apache Presto, so you could write the whole system using basically SQL.
Otherwise you could use Elastic Mapreduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html). Or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on what language/technology you're most comfortable with. Also, there are similar products from Google (BigData, etc) and Microsoft (Azure).
Yes, you can use Apache Apex for your use case. Apache Apex is supported with Apache Malhar which can help you build application quickly to load data using JDBC input operator and then either store it to your cloud storage ( may be S3 ) or you can do de-duplication before storing it to any sink. It also supports Dedup operator for such kind of operations. But as mentioned in previous reply, Apex do need Hadoop underneath to function.

Should users be directed to specific data nodes when using an eventually consistent datastore?

When running a web application in a farm that uses a distributed datastore that's eventually consistent (CouchDB in my case), should I be ensuring that a given user is always directed to same the datastore instance?
It seems to me that the alternate approach, where any web request can use any data store, adds significant complexity to deal with consistency issues (retries, checks, etc). On the other hand, if a user in a given session is always directed to the same couch node, won't my consistency issues revolve mostly around "shared" user data and thus be greatly simplified?
I'm also curious about strategies for directing users but maybe I'll keep that for another question (comments welcome).
According to the CAP Theorem, distributed systems can either have complete consistency (all nodes see the same data at the same time) or availability (every request receives a response). You'll have to trade one for the other during a partition or datastore instance failure.
Should I be ensuring that a given user is always directed to same the datastore instance?
Ideally, you should not! What will you do when the given instance fails? A major feature of a distributed datastore is to be available in spite of network or instance failures.
If a user in a given session is always directed to the same couch node, won't my consistency issues revolve mostly around "shared" user data and thus be greatly simplified?
You're right, the architecture would be a lot more simpler that way, but again, what would you do if that instance fails? A lot of engineering effort has gone into distributed systems to allow multiple instances to reply to a query. I am not sure about CouchDB, but Cassandra allows you to choose your consistency model, you'll have to tradeoff availability for higher degree of consistency. The client is configured to request servers in a round-robin fashion by default, which distributes the load.
I would recommend you read the Dynamo paper. The authors describe a lot of engineering details behind a distributed database.

What are the best papers for learning about algorithms for communicating updates in a distributed system?

I have a distributed system in mind (multiple nodes in a single datacenter) that I want to have the following properties:
nodes can enter and leave the system at any time.
There is no data replication between nodes.
Which node the client makes use of is up to the client (i.e. it could be consistent hashing, it could be something else)
no master (i.e. no central point of failure)
each node may receive a piece of information that needs to be forwarded to the rest of the nodes
What algorithms (links to papers are best) are suitable for this?
(I assume some of the answers will include P2P algorithms, but most of them that I've encountered in the past have acted more like distributed hash tables, where nodes enter and take over some part of the keyspace, etc. I also recognize that multicast with simple UDP messages might be appropriate here, but what existing work would help make the messaging reliable?)
How about trying to implement ADHOC nodes with JXTA? See the Practical JXTA II book available online at Scribd.

Resources