I'm using dsbulk to try to extract some data from our cassandra cluster, and seeing some odd behavior. Trying to understand if this is expected.
If I perform an unload by specifying keyspace and table, I'm seeing different (fewer) results than if I perform a query unload specifying select * from table.
I assumed this might be a consistency issue within the cluster, but I've tried various consistency levels, and the results are the same at all levels between ONE and ALL.
Anyone know if this is expected behavior? The direct table extract is about 2x faster, so I would prefer that if at all possible.
You are certainly hitting DAT-295, a bug that has since been fixed. Please upgrade to the latest DSBulk version (1.2.0 at the moment; 1.3.0 is due in a few weeks).
As I understand it, a row in a Cassandra table is a set of key-value pairs (one per column).
I'm noticing a strange issue during inserts: values are not persisted in a couple of columns, though I am fairly confident the row has values for them before the insert.
It happens sporadically and succeeds if we retry later. We suspect some kind of race condition, a dropped DB connection, etc.
Is it possible that only a subset of the keys gets saved in a row of a Cassandra table? Does Cassandra guarantee all-or-nothing behavior when saving a row (row-level consistency)?
Cassandra Version : 2.1.8
Datastax cassandra-driver-core : 3.1.0
The concurrency guarantees at the row level are described pretty well in this answer; in short, writes to a single row (partition) are atomic and isolated:
Cassandra row level isolation
As for your problem: first check whether it is really Cassandra dropping mutations:
nodetool tpstats
If you see dropped mutations, it's likely you are running an underpowered setup and you simply have to throw more hardware at the problem you are facing.
There isn't much more I can tell from your question. Just as a precaution, please go into your code and check that you are actually creating a new bound statement every time, and that you are not reusing a previously created bound statement instance; a sketch of the safe pattern follows. A client once had exactly this issue, where inserts were lost under mysterious circumstances, and that was the cause. Hope this helps; if not, please post some of the code you have.
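For reference, a minimal sketch of that pattern with the DataStax Java driver, written in Scala (the contact point, keyspace, table, and values are made up for illustration): prepare once, then call bind() to get a fresh BoundStatement for every execution.

import com.datastax.driver.core.Cluster

// Sketch only: contact point, schema, and values are illustrative.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

// Prepare the statement once...
val prepared = session.prepare(
  "INSERT INTO myks.users (id, name, email) VALUES (?, ?, ?)")

// ...but obtain a NEW BoundStatement per row via bind(); do not mutate and
// re-execute one shared instance.
for ((id, name, email) <- Seq(("1", "Ana", "ana@example.com"),
                              ("2", "Bo", "bo@example.com"))) {
  session.execute(prepared.bind(id, name, email))
}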
There are consistency levels for reads and writes in Cassandra.
It looks like you are using consistency level ONE, so your reads and writes are not guaranteed to be consistent with each other. Try using QUORUM for both reads and writes and see if the problem resolves; a minimal sketch is below.
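A minimal sketch of per-statement QUORUM with the DataStax Java driver, in Scala (contact point, table, and values are illustrative):

import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

// Sketch only: contact point and schema are illustrative.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

// Write at QUORUM...
val write = new SimpleStatement("INSERT INTO myks.t (id, v) VALUES ('k', 'x')")
  .setConsistencyLevel(ConsistencyLevel.QUORUM)
session.execute(write)

// ...and read back at QUORUM, so read and write replica sets must overlap.
val read = new SimpleStatement("SELECT v FROM myks.t WHERE id = 'k'")
  .setConsistencyLevel(ConsistencyLevel.QUORUM)
println(session.execute(read).one().getString("v"))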
If this doesn't help, please provide an example query, your cluster size, and the replication factor.
Sorry if this duplicates an existing question, but none of the existing ones resolved my problem.
I've installed Cassandra as a single node. I don't have a large application right now, but I think that may change soon, and I will need more and more nodes.
Well, I'm saving data from a stream to Cassandra, and this was going well, but suddenly, when I tried to read data back, I started to receive this error:
"Not enough replica available for query at consistency ONE (1 required but only 0 alive)"
My keyspace was built using SimpleStrategy with replication_factor = 1. I'm saving data separated by a field called "catchId", so most of my queries are like "select * from data where catchId='xxx'". catchId is the partition key.
I'm using the cassandra-driver-core version 3.0.0-rc1.
The thing is that I don't have that much data right now, and I'm wondering whether it would be better to use an RDBMS for now and migrate to Cassandra only when I have a better infrastructure.
Thanks :)
It seems that your node is unable to respond when you try to perform your read (in general this error appears on clusters of more than one node). If you do not have lots of data, that is very strange, so this probably points to a design or setup problem. It can stem from several things, so you have to make a few investigations:
Study your logs! In particular system.log.
You can raise the read_request_timeout_in_ms parameter in cassandra.yaml. Although it's not a good idea in production, it will tell you whether this is just a temporary problem (your request succeeds after a little while) or a bigger one.
Study your CPU and memory behavior while you are making requests.
If you are very motivated, you can install OpsCenter, which will give you more valuable information.
How, and how many, write requests are you making? They can overwhelm Cassandra (even though it's designed for heavy writes). I recommend making async requests to avoid problems; a sketch follows.
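As an illustration, here is a minimal async-write sketch with the DataStax Java driver in Scala (contact point, schema, and data are made up); each insert is fired with executeAsync instead of blocking per request:

import com.datastax.driver.core.{Cluster, ResultSetFuture}

// Sketch only: contact point, schema, and data are illustrative.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()
val prepared = session.prepare(
  "INSERT INTO myks.data (catchId, value) VALUES (?, ?)")

val records = Seq(("catch-1", "a"), ("catch-2", "b"))
val futures: Seq[ResultSetFuture] =
  records.map { case (id, v) => session.executeAsync(prepared.bind(id, v)) }

// Wait for completion so failures surface instead of being silently dropped.
futures.foreach(_.getUninterruptibly())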
I'm evaluating the spark-cassandra-connector and I'm struggling to get a range query on a partition key to work.
According to the connector's documentation it is possible to do server-side filtering on the partition key using the equality or IN operators, but unfortunately my partition key is a timestamp, so I cannot use them.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
I can also confirm that there is data to be returned, since running the query in cqlsh (with the appropriate conversion using the 'token' function) DOES return data.
I'm using Spark 1.1.0 in standalone mode. Cassandra is 2.1.2 and the connector version is the 'b1.1' branch. The Cassandra driver is the DataStax 'master' branch.
The Cassandra cluster is overlaid on the Spark cluster, with 3 servers and a replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using the CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to run is not allowed in Cassandra, and that you have to load the whole table into a CassandraRDD and then apply a Spark filter operation over that RDD.
So your code (in Scala) should look something like this:
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
val start = fmt.parse("2013-01-01T00:00:00.000Z").getTime
val end = fmt.parse("2013-12-31T00:00:00.000Z").getTime
val cassRDD = sc.cassandraTable("keyspace name", "table name")
  .filter(row => { val t = row.getDate("timestamp").getTime; t >= start && t < end })
If you are interested in running this type of query, you might have to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small-to-medium result set; if you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using the IN operator. The problem is, as far as I know, you can't use the spark-cassandra-connector for that: you would have to use the plain Cassandra driver, which is not integrated with Spark, or have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
but, as I said before, I don't know whether it fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option you have, but the least efficient, is to bring the full data set to your Spark cluster and apply a filter on the RDD, as in the snippet shown in the previous answer.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!
I have a csv file with about 30 columns and 1 million rows (less than 1GB in size).
I am using a single machine/node on localhost and my keyspace has:
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 1};
The columns are mostly doubles, with a few strings.
I have tried two methods to load this into cassandra using the default cassandra.yaml:
1) using the COPY function directly from CQL
2) using the cqlengine python driver wrapped around CQL with multiple scripts and batched inserts on a set of broken up csv files
Both approaches seem to take over an hour with default Cassandra settings, on both Linux and Windows. Is this really the speed I should expect? I was expecting something on the order of minutes.
If not, what are the key options I should focus on, and how can I quickly diagnose the bottleneck? This seems like a trivial use case (admittedly not a focus of Cassandra), so I'm having trouble understanding why it should be so challenging.
I've tried disabling commit logs, and changing other options. I'm trying to understand the source of this performance hit.
You might find http://datastax.github.io/python-driver/performance.html useful. Switching COPY FROM from synchronous execution to callback chaining gave us a 10x increase in performance.
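The linked page shows this for the Python driver; purely as an illustration of the same callback-chaining idea, here is a sketch with the DataStax Java driver in Scala (contact point, schema, and data are made up). Each completed insert triggers the next one, keeping a fixed number of requests in flight instead of blocking per row:

import com.datastax.driver.core._
import com.google.common.util.concurrent.{FutureCallback, Futures}

// Sketch only: contact point, schema, and data are illustrative.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()
val prepared = session.prepare("INSERT INTO myks.t (id, v) VALUES (?, ?)")
val rows = Iterator.tabulate(1000000)(i => Seq[AnyRef](i.toString, "x"))

def next(): Unit = rows.synchronized {
  if (rows.hasNext) {
    val f = session.executeAsync(prepared.bind(rows.next(): _*))
    Futures.addCallback(f, new FutureCallback[ResultSet] {
      def onSuccess(r: ResultSet): Unit = next()          // chain the next insert
      def onFailure(t: Throwable): Unit = t.printStackTrace()
    })
  }
}
(1 to 128).foreach(_ => next())  // ~128 inserts in flight at any time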
I'm looking for a tool to load CSV into Cassandra. I was hoping to use RazorSQL for this but I've been told that it will be several months out.
What is a good tool?
Thanks
1) If you have all the data to be loaded in place, you can try the sstableloader utility (only for Cassandra 0.8.x onwards) to bulk load the data. For more details see: cassandra bulk loader
2) Cassandra has introduced BulkOutputFormat for bulk loading data into Cassandra with a Hadoop job in the latest versions, i.e. cassandra-1.1.x onwards.
For more details see: Bulkloading to Cassandra with Hadoop
I'm dubious that tool support would help a great deal with this, since a Cassandra schema needs to reflect the queries that you want to run, rather than just being a generic model of your domain.
The built-in bulk loading mechanism for cassandra is via BinaryMemtables: http://wiki.apache.org/cassandra/BinaryMemtable
However, whether you use this or the more usual Thrift interface, you still probably need to manually design a mapping from your CSV into Cassandra ColumnFamilies, taking into account the queries you need to run. A generic mapping from CSV-> Cassandra may not be appropriate since secondary indexes and denormalisation are commonly needed.
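To make that concrete, here is a rough sketch of a hand-rolled CSV-to-table mapping, written in Scala against the later DataStax Java driver (the answer's era would have used Thrift; the file name, schema, and column layout here are invented). The point is that each CSV column is mapped explicitly onto a table designed for the queries you need:

import scala.io.Source
import com.datastax.driver.core.Cluster

// Sketch only: file name, schema, and column layout are illustrative.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()
val insert = session.prepare(
  "INSERT INTO myks.events (user_id, day, payload) VALUES (?, ?, ?)")

for (line <- Source.fromFile("data.csv").getLines().drop(1)) { // skip header
  val Array(userId, day, payload) = line.split(",", -1) // assumes 3 columns
  session.execute(insert.bind(userId, day, payload))
}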
For Cassandra 1.1.3 and higher, there is the CQL COPY command for importing (or exporting) data to (or from) a table. According to the documentation, if you are importing roughly fewer than 2 million rows, this is a good option. It is much easier to use than the sstableloader and less error prone: the sstableloader requires you to create strictly formatted .db files, whereas the CQL COPY command accepts a delimited text file. Documentation here:
http://www.datastax.com/docs/1.1/references/cql/COPY
For larger data sets, you should use the sstableloader: http://www.datastax.com/docs/1.1/references/bulkloader. A working example is described here: http://www.datastax.com/dev/blog/bulk-loading.