User Defined Functions and Aggregate in Cassandra


I am testing the UDF / UDA feature in Cassandra and it looks good, but I have a few questions about using it.
1) cassandra.yaml mentions that sandboxing is enabled to guard against evil code. Are we violating a rule by enabling this support (flag), and what are the consequences of doing so?
2) What are the advantages of using UDF / UDA in Cassandra compared to reading the data and writing the aggregation logic on the client side?
3) Apart from Java, is there language support for Node.js or Python for writing UDF / UDA?
Thanks,
Harry

Here are some comments:
1) Sandboxing prevents execution of "dangerous" code: working with files or sockets, starting threads, etc. This blog post provides some additional details about it.
2) There could be several: you don't move data from the coordinator node to your app, you offload calculations to the Cassandra cluster, etc.
3) Languages supporting JSR 223 "Scripting for Java" can be used: JavaScript, Groovy, JRuby, Jython, ... (with enable_scripted_user_defined_functions set to true in the Cassandra config). But Java should be the fastest.
Also look at this presentation about UDF/UDA from the author of this functionality (Robert Stupp) and this blog post with more details and examples.
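To make point 2 a bit more concrete: a UDA is built from a state function that Cassandra calls once per row and an optional final function that turns the last state into the result (in CQL these are wired together with CREATE AGGREGATE using SFUNC, STYPE, FINALFUNC and INITCOND). The sketch below is plain Java that only mimics that shape for an average. The class and method names are made up for illustration and are not Cassandra APIs.

```java
import java.util.List;

// Hypothetical illustration of the state-function / final-function shape of a
// user-defined aggregate. Nothing here is a Cassandra API.
final class AverageAggregateSketch {

    // Plays the role of STYPE: the state carried from row to row.
    static final class State {
        final long count;
        final double sum;

        State(long count, double sum) {
            this.count = count;
            this.sum = sum;
        }
    }

    // Plays the role of INITCOND: the state before any row is seen.
    static State initial() {
        return new State(0L, 0.0);
    }

    // Plays the role of SFUNC: called once per row with the current state.
    static State accumulate(State state, double value) {
        return new State(state.count + 1, state.sum + value);
    }

    // Plays the role of FINALFUNC: converts the last state into the result.
    static double finish(State state) {
        return state.count == 0 ? 0.0 : state.sum / state.count;
    }

    // Folds all values through the state function, then applies the final function.
    static double average(List<Double> values) {
        State state = initial();
        for (double value : values) {
            state = accumulate(state, value);
        }
        return finish(state);
    }
}
```

With a real UDA this fold runs inside the cluster, so only the final value travels back to the client instead of every row.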

Related

Difference between different ways of writing records to Cassandra using Flink

I went through lots of posts on SO and the official Flink documentation, but I couldn't find what I was looking for. I am looking for the difference between RichSinkFunction, RichAsyncFunction, AsyncIO and CassandraSink for writing records faster/multithreaded to Cassandra using Flink.
My understanding is as follows:
RichSinkFunction - If implemented properly, it will do the work for you, since it opens and closes the connection only once.
RichAsyncFunction - Implementation is the same as RichSinkFunction. It works in sync mode originally. I can use an executorService for multithreading purposes. Also, I read that if the capacity is chosen thoughtfully, it can give you higher throughput.
AsyncIO - Doesn't work multithreaded by default. Also, according to one of the SO answers, we can use an executorService, same as with RichAsyncFunction, to create separate threads, which is not mentioned in the documentation.
CassandraSink - Provided by Flink with various properties. Will using setMaxConcurrentRequests give me faster results?
Which of the mentioned classes would be the best to use for this purpose in a Flink program?

I think you can use CassandraSink, which uses DataStax's java-driver (https://github.com/datastax/java-driver) to access Cassandra. Flink uses the driver's executeAsync function to achieve better speed. As you mentioned, setMaxConcurrentRequests sets the maximum number of requests that can be sent from the same session.
RichSinkFunction is a fundamental function in Flink. We can implement our own Cassandra sink with RichSinkFunction, but it requires work like initialising the client, creating threads, etc.
RichAsyncFunction is mainly for the AsyncIO operator. We still have to initialise the Cassandra client. Apart from that, we only have to focus on implementing the asyncInvoke function and failure handling.
CassandraSink implements RichSinkFunction with the Cassandra client's async API. This is the easiest option, with the least code.
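To illustrate what a cap on concurrent requests buys you, here is a minimal stand-alone sketch of the general idea: each asynchronous write takes a permit from a semaphore and gives it back when it completes, so the producer blocks once the limit is reached. This is not Flink's or the driver's code. BoundedAsyncWriterSketch, fakeCluster and simulateWrite are hypothetical names, and the sleep stands in for a real executeAsync call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: bounds the number of in-flight async writes with a semaphore.
final class BoundedAsyncWriterSketch {
    private final Semaphore permits;
    // Stand-in for the database; a real sink would call the driver's async API.
    private final ExecutorService fakeCluster = Executors.newCachedThreadPool();
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peakInFlight = new AtomicInteger();
    private final AtomicInteger completed = new AtomicInteger();

    BoundedAsyncWriterSketch(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    // Blocks the caller while the limit is reached, then submits the write.
    CompletableFuture<Void> write(String record) {
        permits.acquireUninterruptibly();
        peakInFlight.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
        return CompletableFuture.runAsync(() -> simulateWrite(record), fakeCluster)
                .whenComplete((ok, err) -> {
                    // Update the counters first, then free the permit for the next write.
                    inFlight.decrementAndGet();
                    completed.incrementAndGet();
                    permits.release();
                });
    }

    // Assumption: a short sleep is enough to imitate network latency here.
    private static void simulateWrite(String record) {
        try {
            Thread.sleep(5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Writes recordCount records and returns {peak in-flight writes, completed writes}.
    static int[] run(int maxConcurrentRequests, int recordCount) {
        BoundedAsyncWriterSketch writer = new BoundedAsyncWriterSketch(maxConcurrentRequests);
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (int i = 0; i < recordCount; i++) {
            pending.add(writer.write("record-" + i));
        }
        for (CompletableFuture<Void> future : pending) {
            future.join();
        }
        writer.fakeCluster.shutdown();
        return new int[] { writer.peakInFlight.get(), writer.completed.get() };
    }
}
```

A higher limit lets more writes overlap, which usually helps throughput until the cluster itself becomes the bottleneck, so the value is worth tuning rather than maximising.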

Cassandra batch operation support with CrudRepository

I understand Cassandra does not support transaction management, but it does support batch operations, which can be used like transactions (either all operations succeed or none do).
Similarly, spring-data for Cassandra provides CassandraTemplate, which offers batchOps to support this batch feature.
I am wondering if something similar is available for use with CrudRepository (a high-level API which internally uses CassandraTemplate).

Batches are really meant for a very limited set of use cases - please check the documentation. It could be better to devise some other way to achieve the same goal, but this really depends on what you want to do - what operations to execute, table schemas, etc.

hbase vs cassandra for messaging

A few years ago, Facebook decided to use hbase instead of cassandra for its messaging system: http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
The main reason Facebook chose hbase was that reads are faster than writes, in comparison to cassandra. Is this still true? I am using cassandra 3.0, and when setting the read consistency level to ONE or TWO, reads are faster than when setting it to ALL.
Now my question is: if Facebook had to decide between cassandra and hbase in 2016, would its decision still be hbase?

Cassandra was originally designed and built for optimized write performance. As versions have been released, there has been a lot of work done to increase read performance so that it is much closer to write performance. There have been multiple benchmarks and studies of HBase versus Cassandra, and in general they tend to say that performance is about equal, with Cassandra being a bit better. However, I always take these performance benchmark studies with a grain of salt, as you can make anyone the winner depending on how you set up the test.
You will most certainly get faster reads and writes with CL=ONE than with ALL, because the coordinator only needs to wait for one of the replicas to respond instead of all of them. If you are in a multi-DC scenario, then LOCAL_ONE will increase the throughput even more.
As for whether or not FB would choose Cassandra over HBase, it is impossible to say, because there is so much more to making that decision than just simple performance metrics. I can say that messaging is a use case where Cassandra performs well. You can read their use cases here:
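As a rough illustration of why the consistency level matters, the sketch below computes how many replica responses the coordinator has to wait for at a given replication factor. It is only arithmetic, not driver code, and ConsistencyAcksSketch and its Level enum are hypothetical names. LOCAL_ONE is left out because it needs per-datacenter information.

```java
// Hypothetical helper: number of replica responses the coordinator waits for.
final class ConsistencyAcksSketch {

    enum Level { ONE, TWO, QUORUM, ALL }

    static int requiredAcks(Level level, int replicationFactor) {
        switch (level) {
            case ONE:
                return 1;
            case TWO:
                return 2;
            case QUORUM:
                // A majority of the replicas: floor(RF / 2) + 1.
                return replicationFactor / 2 + 1;
            case ALL:
                return replicationFactor;
            default:
                throw new IllegalArgumentException("Unknown level: " + level);
        }
    }
}
```

With a replication factor of 3, ONE waits for a single replica while ALL waits for all three, so the slowest replica sets the latency of every request at ALL.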
http://www.planetcassandra.org/blog/functional_use_cases/messaging/

Enabling Locking of Data in C

I just want to know whether there is any way to implement locking of data in Cassandra. I tried multithreading with Hector, but this didn't work out well. Can anyone suggest a method?

Cassandra does not support any type of locking. Of the 4 traditional database properties (ACID - Atomicity, Consistency, Isolation and Durability), it only fully supports D. How well it supports the other 3 is arguable, because it only supports each of them partially. You can read more here. IMHO the only way to accomplish your goal is to use some synchronization layer which will intercept all calls to Cassandra and perform all necessary locking inside of itself, before anything is sent to Cassandra. In this case, you're using Astyanax as such layer.
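A very small sketch of what such a synchronization layer could look like is below. It keeps one lock per row key and runs the database call while holding it. Note the big assumption: this only coordinates threads inside a single JVM, so several application instances would need an external lock service instead. KeyLockingLayerSketch, withLock and demoCounter are hypothetical names, and the Supplier stands in for the real Cassandra call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical single-process locking layer placed in front of database calls.
final class KeyLockingLayerSketch {
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Runs the given call while holding the lock that belongs to rowKey.
    <T> T withLock(String rowKey, Supplier<T> cassandraCall) {
        ReentrantLock lock = locks.computeIfAbsent(rowKey, key -> new ReentrantLock());
        lock.lock();
        try {
            return cassandraCall.get();
        } finally {
            lock.unlock();
        }
    }

    // Demo: several threads do an unsynchronized read-modify-write under the lock.
    static int demoCounter(int threads, int incrementsPerThread) {
        KeyLockingLayerSketch layer = new KeyLockingLayerSketch();
        int[] counter = {0}; // stands in for a value stored under one row key
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    layer.withLock("row-1", () -> ++counter[0]);
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter[0];
    }
}
```

The map of locks grows with the number of distinct keys, so a real layer would also need some eviction policy.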

For what reasons it is not preferable to use Thrift API for Cassandra access?

Please enumerate the reasons why it's not advisable to use the Thrift interface for Cassandra. What are the possible advantages and disadvantages?

If you use the raw Thrift apis the cons will be:
no connection pooling
no monitoring
no object oriented interface (not entirely true)
no failover support

To continue Schildmeijer's good start:
No batch interface.
No chunking of get_range_slices() or get_indexed_slices() so you can easily swamp Cassandra
Non-string types must be packed into binary strings yourself
You'll probably mess up timestamp precision
Exception messages are generally useless
Thrift is broken by default in some languages. See the PHP C extension, for example.
Because the code is generated, it's not intuitive, especially regarding super columns, SlicePredicates and batch_mutate().
Schema modification commands don't wait for schema agreement among all the nodes in the cluster
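Two of the points above (packing non-string types yourself and timestamp precision) can be shown with a short sketch. It assumes the usual conventions: a long value is stored as 8 big-endian bytes, and write timestamps are microseconds since the epoch, so a value taken from System.currentTimeMillis() has to be scaled. ThriftPackingSketch and its methods are hypothetical helper names, not part of any Thrift-generated code.

```java
import java.nio.ByteBuffer;

// Hypothetical helpers for the manual work the raw Thrift API leaves to the caller.
final class ThriftPackingSketch {

    // Packs a long into 8 bytes; ByteBuffer writes big-endian by default.
    static byte[] packLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Reverses packLong.
    static long unpackLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    // Converts epoch milliseconds to microseconds, the assumed timestamp unit.
    static long toMicros(long epochMillis) {
        return epochMillis * 1000L;
    }
}
```

Mixing millisecond and microsecond timestamps in the same column family is an easy mistake to make, because the microsecond values are always larger and silently win every conflict.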
