I'm evaluating spark-cassandra-connector and i'm struggling trying to get a range query on partition key to work.
According to the connector's documentation it seems that's possible to make server-side filtering on partition key using equality or IN operator, but unfortunately, my partition key is a timestamp, so I can not use it.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also I can assure that there is data to be returned since running the query on cqlsh (doing the appropriate conversion using 'token' function) DOES return data.
I'm using spark 1.1.0 with standalone mode. Cassandra is 2.1.2 and connector version is 'b1.1' branch. Cassandra driver is DataStax 'master' branch.
Cassandra cluster is overlaid on spark cluster with 3 servers with replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
i think the CassandraRDD error is telling that the query that you are trying to do is not allowed in Cassandra and you have to load all the table in a CassandraRDD and then make a spark filter operation over this CassandraRDD.
So your code (in scala) should something like this:
val cassRDD= sc.cassandraTable("keyspace name", "table name").filter(row=> row.getDate("timestamp")>=DateFormat('2013-01-01T00:00:00.000Z')&&row.getDate("timestamp") < DateFormat('2013-12-31T00:00:00.000Z'))
If you are interested in making this type of queries you might have to take a look to others Cassandra connectors, like the one developed by Stratio
You have several options to get the solution you are looking for.
The most powerful one would be to use Lucene indexes integrated with Cassandra by Stratio, which allows you to search by any indexed field in the server side. Your writing time will be increased but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project so you can take all the advantages of the Lucene indexes in Cassandra through it. I would recommend you to use Lucene indexes when you are executing a restricted query that retrieves a small-medium result set, if you are going to retrieve a big piece of your data set, you should use the third option underneath.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look for it using an IN operator. The problem is, as far as I know, you can't use the spark-cassandra-connector for that, you should use the direct Cassandra driver which is not integrated with Spark, or you can have a look at the deep-spark project where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
, but, as I said before, I don't know if it fits to your needs since you might not be able to truncate your data and group it by date/time.
The last option you have, but the less efficient, is to bring the full data set to your spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate on contacting us if you need any help.
I hope it helps!
Related
I am trying to find out if we can combine the 'Select *' and 'Select count(*)' query together in CQL and if yes, then how about the performance, does it increase the execution performance of the query than running 2 separate queries on large dataset.
Short answer - no, you can't combine these two queries.
Longer answer - don't do that. On the big datasets your query most probably will timeout because Cassandra will need to go through all data on all machines. This will lead to increased load to coordinator node, and potentially crash it. If you need to fetch all data you need to use another approach:
use Spark Cassandra Connector that is heavily optimized for such tasks
if you want to offload data, or just count, you can use DSBulk utility for that
if you still need to do that from the code, you need to perform so-called token range scanning - technology that is used by Spark Cassandra Connector and DSBulk. You can either implement that yourself (here is an example for Java driver 3.x), or for Java use DSBulk's API - the jars are available via Maven (I don't have example for that)
I'm very new to the ETL world and I wish to implement Incremental Data Loading with Cassandra 3.7 and Spark. I'm aware that later versions of Cassandra do support CDC, but I can only use Cassandra 3.7. Is there a method through which I can track the changed records only and use spark to load them, thereby performing incremental data loading?
If it can't be done on the cassandra end, any other suggestions are also welcome on the Spark side :)
It's quite a broad topic, and efficient solution will depend on the amount of data in your tables, table structure, how data is inserted/updated, etc. Also, specific solution may depend on the version of Spark available. One downside of Spark-only method is you can't easily detect deletes of the data, without having a complete copy of previous state, so you can generate a diff between 2 states.
In all cases you'll need to perform full table scan to find changed entries, but if your table is organized specifically for this task, you can avoid reading of all data. For example, if you have a table with following structure:
create table test.tbl (
pk int,
ts timestamp,
v1 ...,
v2 ...,
primary key(pk, ts));
then if you do following query:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("tbl", "test").load()
val filtered = data.filter("""ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""")
then Spark Cassandra Connector will push this query down to the Cassandra, and will read only data where ts is in the given time range - you can check this by executing filtered.explain and checking that both time filters are marked with * symbol.
Another way to detect changes is to retrieve the write time from Cassandra, and filter out the changes based on that information. Fetching of writetime is supported in RDD API for all recent versions of SCC, and is supported in the Dataframe API since release of SCC 2.5.0 (requires at least Spark 2.4, although may work with 2.3 as well). After fetching this information, you can apply filters on the data & extract changes. But you need to keep in mind several things:
there is no way to detect deletes using this method
write time information exists only for regular & static columns, but not for columns of primary key
each column may have its own write time value, in case if there was a partial update of the row after insertion
in most versions of Cassandra, call of writetime function will generate error when it's done for collection column (list/map/set), and will/may return null for column with user-defined type
P.S. Even if you had CDC enabled, it's not a trivial task to use it correctly:
you need to de-duplicate changes - you have RF copies of the changes
some changes could be lost, for example, when node was down, and then propagated later, via hints or repairs
TTL isn't easy to handle
...
For CDC you may look for presentations from 2019th DataStax Accelerate conference - there were several talks on that topic.
I'm working on a project that involves reading data from RDBMS using JDBC and I succeeded reading the data. This is something I will be doing fairly constantly, weekly. So I've been trying to come up with a way to ensure that after the initial read, subsequent ones should only pull updated records instead of pulling the entire data from the table.
I can do this with sqoop incremental import by specifying the three parameters (--check-column, --incremental last-modified/append and --last-value). However, I dont want to use sqoop for this. Is there a way I can replicate same in Spark with Scala?
Secondly, some of the tables do not have unique column which can be used as partitionColumn, so I thought of using a row-number function to add a unique column to these table and then get the MIN and MAX of the unique column as lowerBound and upperBound respectively. My challenge now is how to dynamically parse these values into the read statement like below:
val queryNum = "select a1.*, row_number() over (order by sales) as row_nums from (select * from schema.table) a1"
val df = spark.read.format("jdbc").
option("driver", driver).
option("url",url ).
option("partitionColumn",row_nums).
option("lowerBound", min(row_nums)).
option("upperBound", max(row_nums)).
option("numPartitions", some value).
option("fetchsize",some value).
option("dbtable", queryNum).
option("user", user).
option("password",password).
load()
I know the above code is not right and might be missing a whole lot of processes but I guess it'll give a general overview of what I'm trying to achieve here.
It's surprisingly complicated to handle incremental JDBC reads in Spark. IMHO, it severely limits the ease of building many applications and may not be worth your trouble if Sqoop is doing the job.
However, it is doable. See this thread for an example using the dbtable option:
Apache Spark selects all rows
To keep this job idempotent, you'll need to read in the max row of your prior output either directly from loading all data files or via a log file that you write out each time. If your data files are massive you may need to use the log file, if smaller you could potentially load.
How can I export data, over a period of time (like hourly or daily) or updated records from a Cassandra database? It seems like using an index with a date field might work, but I definitely get timeouts in my cqlsh when I try that by hand, so I'm concerned that it's not reliable to do that.
If that's not the right way, then how do people get their data out of Cassandra and into a traditional database (for analysis, querying with JOINs, etc..)? It's not a java shop, so using Spark is non-trivial (and we don't want to change our whole system to use Spark instead of cassandra directly). Do I have to read sstables and try to keep track of them that way? Is there a way to say "get me all records affected after point in time X" or "get me all changes after timestamp X" or something similar?
It looks like Cassandra is really awesome at rapidly reading and writing individual records, but beyond that Cassandra seems to not be the right tool if you want to pull its data into anything else for analysis or warehousing or querying...
Spark is the most typical to do exactly that (as you say). It does it efficiently and is used often so pretty reliable. Cassandra is not really designed for OLAP workloads but things like spark connector help bridge the gap. DataStax Enterprise might have some more options available to you but I am not sure their current offerings.
You can still just query and page through the whole data set with normal CQL queries, its just not as fast. You can even use ALLOW FILTERING just be wary as its very expensive and can impact your cluster (creating a separate dc for the workload and using LOCOL_CL queries against it helps). You will probably also in that scenario add a < token() and > token() to the where clause to split up the query and prevent too much work on any one coordinator. Organizing your data so that this query is more efficient would be strongly recommended (ie if doing time slices, put things in a partition bucketed by time and clustering key timeuuids so its sequential read for each part of time).
Kinda cheesy sounding but the CSV dump from cqlsh is actually fast and might work for you if your data set is small enough.
I would not recommend going to the sstables directly unless you are familiar with internals and using hadoop or spark.
I am working with about a TB of data stored in Cassandra and trying to query it using Spark and R (could be Python).
My preference for querying the data would be to abstract the Cassandra table I'm querying from as a Spark RDD (using sparklyr and the spark-cassandra-connector with spark-sql) and simply doing an inner join on the column of interest (it is a partition key column). The company I'm working with says that this approach is a bad idea as it will translate into an IN clause in CQL and thus cause a big slow-down.
Instead I'm using their preferred method: write a closure that will extract the data for a single id in the partition key using a jdbc connection and then apply that closure 200k times for each id I'm interested in. I use spark_apply to apply that closure in parallel for each executor. I also set my spark.executor.cores to 1 so I get a lot of parellelization.
I'm having a lot of trouble with this approach and am wondering what the best practice is. Is it true that Spark SQL does not account for the slowdown associated with pulling multiple ids from a partition key column (IN operator)?
A few points here:
Working with Spark-SQL is not always the most performant option, the
optimized might not always as good of a job than a job you write
yourself
Check the logs carefully during your work, always check how your high-level queries are translated to CQL queries. In particular, make sure you avoid a full table scan if you can.
If you joining on the partition key, you should look into leveraging the methods: repartitionByCassandraReblica, and joinWithCassandraTable. Have a look at the official doc here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md and Tip4 of this blog post: https://www.instaclustr.com/cassandra-connector-for-spark-5-tips-for-success/
Finale note, it's quite common to have 2 Cassandra data center when using Spark. The first one serves regular read / write, the second one is used for running Spark. It's a separation of concern best practice (at the cost of an additional DC of course).
Hope it helps!