I am planning to ingest scientific measurement data into my 6-node cassandra cluster using python script.
I have checked various posts and articles on bulk loading data into cassandra. But unfortunately, none of the state-of-the.-art as discussed fits my use-case [1][2]. However, I found this post on Stack Overflow which seemed quite helpful.
Considering that post and my billion of records data, I would like to know if the combination of using PreparedStatement (instead of Simple Statement) and execute_async is a good practice.
Yes, that should work - but you need to have some throttle on the number of async requests that are running simultaneously. Driver allows only some number of in-flight requests, and if you submit more than allowed, then it will fail.
Another thing to think about - if you can organize data into small batches (UNLOGGED) where all entries have the same partition could also improve situation. See documentation for examples of good & bad practices of using batches.
Related
So currently I have an application which uses cassandra. I have 3 cassandra nodes of which 1 is seed node. The app currently takes in around 100 write requests every 20 seconds and around 200 read requests every second. The app seems to be crashing a lot with the error - expected to return 1 entry, but received 0.
Just wondering what optimization steps I should take into an account, so the cassandra does not crash that much anymore?
I'm following the correct table structure, but I'm not using materialized views, instead I'm using ALLOW FILTERING - could that be the problem?
Are there any other suggestions?
ALLOW FILTERING is there strictly for testing and dev, occasional ops purposes. While some tools like spark have been optimized with it, it should never be used by your application. There is no way to make it work efficiently or optimizations for it, please do not use it. If a query requires it, you have a data modeling problem. Your tables should reflect your queries not your data.
How can I export data, over a period of time (like hourly or daily) or updated records from a Cassandra database? It seems like using an index with a date field might work, but I definitely get timeouts in my cqlsh when I try that by hand, so I'm concerned that it's not reliable to do that.
If that's not the right way, then how do people get their data out of Cassandra and into a traditional database (for analysis, querying with JOINs, etc..)? It's not a java shop, so using Spark is non-trivial (and we don't want to change our whole system to use Spark instead of cassandra directly). Do I have to read sstables and try to keep track of them that way? Is there a way to say "get me all records affected after point in time X" or "get me all changes after timestamp X" or something similar?
It looks like Cassandra is really awesome at rapidly reading and writing individual records, but beyond that Cassandra seems to not be the right tool if you want to pull its data into anything else for analysis or warehousing or querying...
Spark is the most typical to do exactly that (as you say). It does it efficiently and is used often so pretty reliable. Cassandra is not really designed for OLAP workloads but things like spark connector help bridge the gap. DataStax Enterprise might have some more options available to you but I am not sure their current offerings.
You can still just query and page through the whole data set with normal CQL queries, its just not as fast. You can even use ALLOW FILTERING just be wary as its very expensive and can impact your cluster (creating a separate dc for the workload and using LOCOL_CL queries against it helps). You will probably also in that scenario add a < token() and > token() to the where clause to split up the query and prevent too much work on any one coordinator. Organizing your data so that this query is more efficient would be strongly recommended (ie if doing time slices, put things in a partition bucketed by time and clustering key timeuuids so its sequential read for each part of time).
Kinda cheesy sounding but the CSV dump from cqlsh is actually fast and might work for you if your data set is small enough.
I would not recommend going to the sstables directly unless you are familiar with internals and using hadoop or spark.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Improve this question
we are looking for an opensource in memory database which can support indexes.
The use case is that we have lot of items that are going to grow in a big way.
Each item has a few fields on which we need to query.
Currently we store the data in application's memory. However with increasing data, we have to think about distributing/sharding the db.
We have looked at a few options
Redis cluster could be used, but it does not have the concept of
indexes or SQL like queries.
Apache Ignite is both in-memory, and distributed as well as provides
SQL queries. However, the problem is that ignite fires all
queries into all master nodes, so that the final result will be
slower than the slowest of those queries. It seems like a problem
because a non performing/slow node out of a number of nodes can
really slow down the application a lot. Further in ignite, reads are
done from the masters and slaves are not used, so that it is
difficult to scale the queries. Increasing the nodes will have
negative impact as the no of queries will increase and it will be
even slower.
Cassandra - The in-memory option in cassandra can be used, but it
seems that the max size of a table per node can be 1 GB. If
our table is more than 1 GB, we will have to resort to partitioning
which will inturn lead cassandra to make multiple queries(one per
node) and it is a problem(same as ignite). Not sure whether reads in
cassandra in-memory table can be scaled by increasing the number of
slaves.
We are open to other solutions but wondering whether the multi-query will be a problem everywhere(like hazelcast).
The ideal solution for our use case would be an in-memory database with indexes which could be read scaled by increasing the number of slaves. Making it distributed/sharded will lead to multiple queries and we are reluctant because one erring node could slow the whole system down.
Hazelcast supports indexes (sorted & unsorted) and what is important there is no Multi-Query problem with Hazelcast.
Hazelcast supports a PartitionPredicate that restricts the execution of a query to a node that is a primaryReplica of the key passed to the constructor of the PartitionPredicate. So if you know where the data resides you can just query this node. So no need to fix or implement anything to support it, you can use it right away.
It's probably not reasonable to use it all the time. Depends on your use-case.
For complex queries that scan a lot of data but return small results it's better to use OBJECT inMemoryFormat. You should get excellent execution times and low latencies.
Disclaimer: I am GridGain employee and Apache Ignite committer.
Several comments on your concerns:
1) Slow nodes will lead to problems in virtually any clustered environment, so I would not consider this as disadvantage. This is reality you should embrace and accept. It is necessary understand why it is slow and fix/upgrade it.
2) Ignite are able to perform reads from slaves both for regular cache operations [1] and for SQL queries executed over REPLICATED caches. In fact, using REPLICATED cache for reference data is one of the most important features allowing Ignite to scale smoothly.
3) As you correctly mentioned, currently query is broadcasted to all data nodes. We are going to improve it. First, we will let users to specify partitions to execute the query against [2]. Second, we are going to improve our optimizer so that it will try to calculate target data nodes in advance to avoid broadcast [3], [4]. Both improvements will be released very soon.
4) Last, but not least - persistent layer will be released in several months [5], meaning that Ignite will become distributed database with both in-memory and persistence capabilities.
[1] https://ignite.apache.org/releases/mobile/org/apache/ignite/configuration/CacheConfiguration.html#isReadFromBackup()
[2] https://issues.apache.org/jira/browse/IGNITE-4523
[3] https://issues.apache.org/jira/browse/IGNITE-4509
[4] https://issues.apache.org/jira/browse/IGNITE-4510
[5] http://apache-ignite-developers.2346864.n4.nabble.com/GridGain-Donates-Persistent-Distributed-Store-To-ASF-Apache-Ignite-tc16788.html
I can give opinions on cassandra. Max size of your table per node is configurable and tunable so it depends on the amount of the memory that you are willing to pay. Partitioning is built in into cassandra so basically cassandra manages it for you. It's relatively simple to do paritioning. Basically first part of the primary key syntax is partitioning key and it determines on which node in the cluster the data lives.
But I also guess you are aware of this since you are mentioning multiple query per node. I guess there is no nice way around it.
Just one slight remark there is no master slaves in cassandra. Every node is equal. Basically client asks any node in the cluster, this node then becomes coordinator nodes and since it gets partitioning key it knows which node to ask the data for and it gives it then to the client.
Other than that I guess you read upon cassandra enough (from what I can see in your question)
Basically it comes down to the access pattern, if you know how you are going to access your data then it's the way to go. But other databases are also pretty decent.
Indexing with cassandra usually hides some potential performance problems. Usually people avoid it because in cassandra index has to be build for every record there is on whole cluster and it's done per node. This doesn't really scale. Basically you always have to do query first no matter how ypu put it with cassandra.
Plus the in memory seems to be part of the DSE cassandra. Not the open source or community one. You have to take this into account also.
Background
We have recently started a "Big Data" project where we want to track what users are doing with our product - how often they are logging in, which features they are clicking on, etc - your basic user analytics stuff. We still don't know exactly what questions we will be asking, but most of it will be "how often did X occur over the last Y months?" type of thing, so we started storing the data sooner rather than later thinking we can always migrate, re-shape etc when we need to but if we don't store it it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?
Impact of adding more indexes:
This really depends on your data structure, distribution and how you access it; you were right before when you compared this process to RDMS. For Cassandra, it's best to define your queries first and then build the data model.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query where it is known exactly which node needs to be quired). So there's not just an impact on writes, but on read performance as well.
In terms of working out the performance on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that Datastax have built, to quickly generate profile yamls:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.
I have a use-case of inserting a lot of data during big calculation which really don't have to be available in the cluster immediately (so the cluster can synchronize as we go).
Currently I'm inserting batches using putAll() operation and it's blocking and taking time.
I've read a blog post about efficiency of set() operation but there is no analogous setAll(). I also saw putAsync() and didn't see matching putAllAsync() (I'm not interested in the future object).
Am I overlooking something? How can I improve insertion performance?
EDIT: Feature request: https://github.com/hazelcast/hazelcast/issues/5337
I think you're right, they're missing. Could you create a feature request, maybe you're also interested in helping to implement them using the Hazelcast Incubator?