I have a use case where I insert a lot of data during a big calculation, and the data really doesn't have to be available in the cluster immediately (the cluster can synchronize as we go).
Currently I'm inserting batches using the putAll() operation, which blocks and takes time.
I've read a blog post about the efficiency of the set() operation, but there is no analogous setAll(). I also saw putAsync(), but there is no matching putAllAsync() (I'm not interested in the future object).
Am I overlooking something? How can I improve insertion performance?
EDIT: Feature request: https://github.com/hazelcast/hazelcast/issues/5337
I think you're right; they're missing. Could you create a feature request? Maybe you're also interested in helping to implement them via the Hazelcast Incubator.
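Until such methods exist, one possible workaround is to fire the entries individually with putAsync() and bound how many are outstanding; the returned futures are used only as a throttle, not for their results. A rough sketch against the 3.x IMap API (the class name, window size, and helper are made up for illustration):

```java
import com.hazelcast.core.IMap;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Future;

public class BulkLoader {
    // Arbitrary cap on outstanding async puts, to avoid flooding the member
    private static final int WINDOW = 1000;

    public static void load(IMap<String, Double> map, Map<String, Double> batch) throws Exception {
        List<Future<Double>> inFlight = new ArrayList<>();
        for (Map.Entry<String, Double> e : batch.entrySet()) {
            inFlight.add(map.putAsync(e.getKey(), e.getValue()));
            if (inFlight.size() >= WINDOW) {
                drain(inFlight); // wait for the current window before continuing
            }
        }
        drain(inFlight); // wait for the tail end
    }

    private static void drain(List<Future<Double>> futures) throws Exception {
        for (Future<Double> f : futures) {
            f.get(); // the old value is ignored; this only bounds concurrency
        }
        futures.clear();
    }
}
```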
I have gone through lots of posts on SO and the official Flink documentation, but I couldn't find what I was looking for. I am looking for the difference between RichSinkFunction, RichAsyncFunction, AsyncIO and CassandraSink for writing records faster / multithreaded to Cassandra from a Flink program.
My understanding is as follows:
RichSinkFunction - If implemented properly, it will do the work for you, since it opens and closes the connection only once.
RichAsyncFunction - Implemented similarly to RichSinkFunction. It works in sync mode by default. I can use an ExecutorService for multithreading purposes. Also, I read that if the capacity is chosen thoughtfully, it can give higher throughput.
AsyncIO - Doesn't work multithreaded by default. Also, according to one of the SO answers, we can use an ExecutorService, same as with RichAsyncFunction, to create separate threads, which is not mentioned in the documentation.
CassandraSink - Provided by Flink with various properties. Will using setMaxConcurrentRequests give me faster results?
Which of the mentioned classes would be best to use for my purpose in a Flink program?
I think you can use CassandraSink, which uses DataStax's java-driver (https://github.com/datastax/java-driver) to access Cassandra. Flink uses the driver's executeAsync function to achieve better speed. As you mentioned, setMaxConcurrentRequests sets the max number of requests that can be sent from the same session.
RichSinkFunction is a fundamental function in Flink. We can implement our own Cassandra Sink with RichSinkFunction, but it requires work like initialising the client, creating threads, etc.
RichAsyncFunction is mainly for the AsyncIO operator. We still have to initialise the Cassandra client ourselves; beyond that, we only have to focus on implementing the asyncInvoke function and on failure handling.
CassandraSink is built on RichSinkFunction and the Cassandra client's async API. It is the easiest option, with the least code to write.
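For illustration, wiring up CassandraSink for a tuple stream looks roughly like the sketch below; the keyspace, table, columns and contact point are hypothetical, and setMaxConcurrentRequests is only available in newer versions of the connector:

```java
import com.datastax.driver.core.Cluster;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.cassandra.CassandraSink;
import org.apache.flink.streaming.connectors.cassandra.ClusterBuilder;

public class CassandraSinkExample {
    // 'results' is assumed to be produced upstream in the same job
    static void attachSink(DataStream<Tuple2<String, Double>> results) throws Exception {
        CassandraSink.addSink(results)
            // The ? placeholders are filled from the tuple fields in order
            .setQuery("INSERT INTO my_keyspace.measurements (id, value) VALUES (?, ?);")
            .setClusterBuilder(new ClusterBuilder() {
                @Override
                protected Cluster buildCluster(Cluster.Builder builder) {
                    return builder.addContactPoint("127.0.0.1").build();
                }
            })
            // Upper bound on requests the sink keeps in flight via executeAsync
            .setMaxConcurrentRequests(500)
            .build();
    }
}
```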
I understand that there are two ways of iterating over a large result set in Cassandra:
Querying explicitly with tokens, as discussed in this article on "Displaying rows from an unordered partitioner with the TOKEN function". This appears to have been the only way of doing things prior to Cassandra 2.0.
Using "paging state".
Paging state appears to be the suggested way of doing things these days, but doing it the old token way still works.
Aside from it being the blessed way of doing things, which is of course a type of advantage, I'd love to understand the particular advantages of the "new" method over the "old". Is there a reason I should not use token in this way?
Whether to use paging or tokens really depends on your requirements and technical abilities. From my point of view, paging is good for fetching data from a big partition, or when you don't have much data in the table, so you can simply use select * from table.
But if you have multiple servers in the cluster and a big amount of data, using token allows you to read data from specific servers (if you set the routing key correctly) and in parallel (the Spark Cassandra Connector uses token for exactly this reason). This is a big advantage over paging, where a single coordinator node has to go to other nodes for the data it doesn't have. But for some people it's not easy to implement, because you need to cover edge cases, for example when a token range doesn't start exactly at the minimum value. I have an example in Java of how to do it if you need it.
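For illustration, a minimal token-range scan with the DataStax Java driver 3.x might look like the sketch below; keyspace, table and column names are hypothetical, and real code would hand each range to its own worker rather than scanning them sequentially:

```java
import com.datastax.driver.core.*;

import java.util.ArrayList;
import java.util.List;

public class TokenRangeScan {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {

            // token() restricts the scan to one token range at a time
            PreparedStatement stmt = session.prepare(
                "SELECT id, value FROM my_table WHERE token(id) > ? AND token(id) <= ?");

            // unwrap() splits the range that wraps around the minimum token,
            // which is exactly the edge case mentioned above
            List<TokenRange> ranges = new ArrayList<>();
            for (TokenRange range : cluster.getMetadata().getTokenRanges()) {
                ranges.addAll(range.unwrap());
            }

            for (TokenRange range : ranges) {
                for (Row row : session.execute(stmt.bind(range.getStart(), range.getEnd()))) {
                    // process the row
                }
            }
        }
    }
}
```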
I agree with Alex's answer. I will add that when you do it the old-school way (with tokens), you have full control over your tokens. This means that if you are dealing with a big amount of data, you can save checkpoints, handle restarts after failure cleanly, or simply pause your job; you can also launch multi-threaded jobs and split the data between workers (the way Spark workers deal with data is token-based too).
The driver handles the paging automatically for you, so you don't have to fetch pages yourself, with all the advantages of something handled natively. Using tokens, on the other hand, gives you full control over the way you paginate, with all the advantages that brings (targeting a specific range, a specific server).
I hope this helps!
Recently I have used the jdbc_streaming filter plugin of Logstash. It is a very helpful plugin that allows me to connect to my database on the fly and perform checks against my events.
But are there any drawbacks or pitfalls to using this filter?
Specifically, I have the following questions:
For example, I am firing a select query for each of my events.
Is it a good idea to query my database for each event? Say I am processing syslog events from a server that continuously sends me data; in that case I will be triggering a select query against my database for every event, so how will my database react in terms of load and response time?
What about the number of connections; how are they managed?
How will this behave if I join multiple tables?
I hope I have been able to convey my question.
I just want to understand how exactly it works in the back end, and whether querying my database at a massive rate will degrade its performance.
I am not sure whether this answer is correct or not.
But in my experience, Logstash works sequentially for the above plugin.
It creates only a single connection to RDS and queries the DB for each record.
So there is no connection overhead, but it still degrades performance many-fold.
This answer is based purely on my experience; it may well be completely wrong. Any edits or other answers are welcome.
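For reference, a typical jdbc_streaming configuration looks roughly like the sketch below; the driver, connection details, query and field names are made up for illustration. The plugin also exposes local-cache options (use_cache, cache_size, cache_expiration), which can reduce the per-event query load discussed above when the same lookup values repeat:

```
filter {
  jdbc_streaming {
    # Hypothetical connection details; adjust for your own database
    jdbc_driver_library => "/opt/jdbc/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    # One lookup per event, parameterised by a field from the event
    statement => "SELECT name, status FROM servers WHERE host = :host"
    parameters => { "host" => "hostname" }
    target => "server_info"
    # Serve repeated lookups from memory instead of hitting the database again
    use_cache => true
    cache_size => 500
    cache_expiration => 5.0
  }
}
```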
I am planning to ingest scientific measurement data into my 6-node Cassandra cluster using a Python script.
I have checked various posts and articles on bulk loading data into Cassandra, but unfortunately none of the state-of-the-art approaches discussed there fits my use case [1][2]. However, I found this post on Stack Overflow which seemed quite helpful.
Considering that post, and the fact that I have billions of records, I would like to know whether the combination of a PreparedStatement (instead of a SimpleStatement) and execute_async is good practice.
Yes, that should work, but you need to throttle the number of async requests running simultaneously. The driver allows only a certain number of in-flight requests, and if you submit more than that, it will fail.
Another thing to think about: if you can organize the data into small UNLOGGED batches in which all entries belong to the same partition, that could also improve the situation. See the documentation for examples of good and bad practices when using batches.
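As an illustration of the throttling described above, the Python driver ships a cassandra.concurrent helper that caps the number of in-flight requests for you; a minimal sketch (contact points, keyspace, table and column names are hypothetical):

```python
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

# Hypothetical contact points and schema
cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect("measurements")

# Prepare once, reuse for every insert
insert_stmt = session.prepare(
    "INSERT INTO readings (sensor_id, ts, value) VALUES (?, ?, ?)"
)

def ingest(rows):
    """rows is an iterable of (sensor_id, ts, value) tuples."""
    # concurrency bounds the number of simultaneous in-flight requests,
    # so the driver's limit is never exceeded
    results = execute_concurrent_with_args(
        session, insert_stmt, rows, concurrency=100
    )
    for success, result_or_exc in results:
        if not success:
            print("insert failed:", result_or_exc)  # real code would retry/log
```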
I am looking for Cassandra/CQL's cousin of the common SQL idiom of INSERT INTO ... SELECT ... FROM ... and have been unable to find anything to do such an operation programmatically or in CQL. Is it just not supported?
My use case is to do a reasonably bulky copy from one table to another. I don't need any particular concurrency guarantees, but it's a lot of data so I'd like to avoid the additional network overhead of writing a client that retrieves data from one table, then issues batches of inserts into the other table. I understand that the changes will still need to be transported between nodes of the Cassandra cluster according to the replication set-up, but it seems reasonable for there to be an "internal" option to do a bulk operation from one table to another. Is there such a thing in CQL or elsewhere? I'm currently using Hector to talk to Cassandra.
Edit: it looks like sstableloader might be relevant, but is awfully low-level for something that I'd expect to be a fairly common use case. Taking just a subset of rows from one table to another also seems less than trivial in that framework.
Correct, this is not supported natively. (Another alternative would be a map/reduce job.) Cassandra's API focuses on short requests for applications at scale, not batch or analytical queries.
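For completeness, the client-side copy this implies does not have to be much code; a rough sketch using the DataStax Java driver (rather than Hector, and with hypothetical keyspace, table and column names) might look like this:

```java
import com.datastax.driver.core.*;

public class TableCopy {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {

            PreparedStatement insert = session.prepare(
                "INSERT INTO target_table (id, payload) VALUES (?, ?)");

            // The driver pages through the source rows automatically
            Statement select = new SimpleStatement("SELECT id, payload FROM source_table")
                .setFetchSize(1000);

            for (Row row : session.execute(select)) {
                // executeAsync keeps writes from blocking the read loop;
                // a real job would throttle and check the returned futures
                session.executeAsync(insert.bind(row.getUUID("id"), row.getString("payload")));
            }
        }
    }
}
```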