Having trouble querying by dates using the Java Cassandra Spark SQL Connector

I'm trying to use Spark SQL to query a table by a date range. For example, I'm trying to run an SQL statement like: SELECT * FROM trip WHERE utc_startdate >= '2015-01-01' AND utc_startdate <= '2015-12-31' AND deployment_id = 1 AND device_id = 1. The query runs without throwing an error, but it returns no rows even though I expect some. Without the date range, the same query does return results.
SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("SparkTest")
.set("spark.executor.memory", "1g")
.set("spark.cassandra.connection.host", "localhost")
.set("spark.cassandra.connection.native.port", "9042")
.set("spark.cassandra.connection.rpc.port", "9160");
JavaSparkContext context = new JavaSparkContext(sparkConf);
JavaCassandraSQLContext sqlContext = new JavaCassandraSQLContext(context);
sqlContext.sqlContext().setKeyspace("mykeyspace");
String sql = "SELECT * FROM trip WHERE utc_startdate >= '2015-01-01' AND utc_startdate < '2015-12-31' AND deployment_id = 1 AND device_id = 1";
JavaSchemaRDD rdd = sqlContext.sql(sql);
List<Row> rows = rdd.collect(); // rows.size() is zero when I would expect it to contain numerous rows.
Schema:
CREATE TABLE trip (
device_id bigint,
deployment_id bigint,
utc_startdate timestamp,
other columns....
PRIMARY KEY ((device_id, deployment_id), utc_startdate)
) WITH CLUSTERING ORDER BY (utc_startdate ASC);
Any help would be greatly appreciated.

What does your table schema (in particular, your PRIMARY KEY definition) look like? Even without seeing it, I am fairly certain that you are seeing this behavior because you are not qualifying your query with a partition key. Using the ALLOW FILTERING directive will filter the rows by date (assuming that is your clustering key), but that is not a good solution for a big cluster or large dataset.
Let's say that you are querying users in a certain geographic region. If you used region as a partition key, you could run this query, and it would work:
SELECT * FROM users
WHERE region='California'
AND date >= '2015-01-01' AND date <= '2015-12-31';
Give Patrick McFadin's article on Getting Started with Timeseries Data a read. That has some good examples that should help you.
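As a rough illustration of why the partition key matters, here is a toy Python model of the storage layout. It is not Cassandra itself: ToyTable and its methods are invented for this sketch, and the sample rows are made up. Rows are grouped by partition key and kept sorted by clustering key, so a date range inside one partition is a cheap slice, while a range without a partition key would have to visit every partition.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from datetime import datetime


class ToyTable:
    """Toy model: rows grouped by partition key, sorted by clustering key."""

    def __init__(self):
        # partition key -> list of (clustering key, row), kept sorted
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, row):
        rows = self._partitions[partition_key]
        keys = [k for k, _ in rows]
        rows.insert(bisect_left(keys, clustering_key), (clustering_key, row))

    def range_query(self, partition_key, start, end):
        """Return rows with start <= clustering key <= end in one partition."""
        rows = self._partitions.get(partition_key, [])
        keys = [k for k, _ in rows]
        lo = bisect_left(keys, start)
        hi = bisect_right(keys, end)
        return [row for _, row in rows[lo:hi]]


table = ToyTable()
# Partition key is (device_id, deployment_id); clustering key is utc_startdate.
table.insert((1, 1), datetime(2014, 12, 31), "trip-a")
table.insert((1, 1), datetime(2015, 6, 15), "trip-b")
table.insert((2, 1), datetime(2015, 6, 16), "trip-c")

result = table.range_query((1, 1), datetime(2015, 1, 1), datetime(2015, 12, 31))
print(result)
```

Only the rows of partition (1, 1) are touched, which is the access pattern the PRIMARY KEY ((device_id, deployment_id), utc_startdate) definition is designed for.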

Related

Reading guarantees for full table scan while updating the table?

Given schema:
CREATE TABLE keyspace.table (
key text,
ckey text,
value text,
PRIMARY KEY (key, ckey)
)
...and Spark pseudocode:
val sc: SparkContext = ...
val connector: CassandraConnector = ...
sc.cassandraTable("keyspace", "table")
.mapPartitions { partition =>
connector.withSessionDo { session =>
partition.foreach { row =>
val key = row.getString("key")
val ckey = Random.nextString(42)
val value = row.getString("value")
session.execute(s"INSERT INTO keyspace.table (key, ckey, value)" +
" VALUES ($key, $ckey, $value)")
}
}
}
Is it possible for a code like this to read an inserted value within a single application (Spark job) run? More generalized version of my question would be whether a token range scan CQL query can read newly inserted values while iterating over rows.
Yes, it is possible, exactly as Alex wrote, but I don't think it can happen with the code above.
Per the data model, rows within a partition are ordered by ckey in ascending order.
What matters is the page size and how many pages are prefetched. The page size defaults to 1000 rows (spark.cassandra.input.fetch.sizeInRows), so a problem could only occur if you used something bigger than 42 and/or the executor had not fetched the page yet.
Also, I think the nesting is unnecessary, so the code could be simplified (after all, cassandraTable already gives you an RDD).
(I assume you want to read each partition (note that a partition in your case is all rows under one partition key, "key") and, for every row in it (distinguished by ckey), generate a new row that duplicates the value under a new ckey. The use case for such code is a mystery to me, but I hope it makes sense :-))
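To make the paging argument concrete, here is a small simulation in plain Python. It is only a toy model under stated assumptions: one partition, keys sorted ascending, each page a snapshot taken at fetch time, and a short page meaning the scan is finished. The function name and the sample keys are invented for the sketch; real token-range scans involve replicas, consistency levels and prefetching that are not modelled here.

```python
from bisect import bisect_right, insort


def paged_scan_with_inserts(initial_keys, page_size, inserts):
    """Scan sorted clustering keys page by page while inserting new keys.

    `inserts` maps a key being read to the new key written while reading it.
    Returns the keys the scan actually observed, in order.
    """
    stored = sorted(initial_keys)
    seen = []
    last = None  # paging state: last clustering key returned so far
    while True:
        start = 0 if last is None else bisect_right(stored, last)
        page = stored[start:start + page_size]  # slicing copies: a snapshot
        for key in page:
            seen.append(key)
            new_key = inserts.get(key)
            if new_key is not None and new_key not in stored:
                insort(stored, new_key)
        if len(page) < page_size:
            return seen  # a short page means the scan is exhausted
        last = page[-1]


writes = {"a": "b5", "b": "a5", "c": "zz"}
small_pages = paged_scan_with_inserts(["a", "b", "c", "d"], 2, writes)
large_pages = paged_scan_with_inserts(["a", "b", "c", "d"], 1000, writes)
print(small_pages)
print(large_pages)
```

In this model, keys written ahead of the paging position ("b5", "zz") show up when pages are small, a key written behind it ("a5") never does, and with a page larger than the partition none of the new keys are observed.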

Data model in Cassandra and proper deletion Strategy

I have the following table in Cassandra:
CREATE TABLE article (
id text,
price int,
validFrom timestamp,
PRIMARY KEY (id, validFrom)
) WITH CLUSTERING ORDER BY (validFrom DESC);
The table holds articles with historical price information (validFrom is the timestamp at which a new price takes effect). Article prices change often. I want to query for:
Historic price for a particular article.
Last price for an article.
From my understanding, I can solve both problems with the following query:
select id, price from article where id = X and validFrom < Y limit 1;
This query restricts on the article id, so it uses the partition key. Since the clustering order is based on the validFrom timestamp in reverse order, Cassandra can perform this query efficiently.
Am I getting this right?
What is the best approach to deleting old data (housekeeping)? Let's assume I want to delete all rows with validFrom > 20150101 and validFrom < 20151231. Since such a query does not restrict the partition key, it would be inefficient, even if I use an index on validFrom, right? How can I achieve this?
You can use external tools for that:
Spark with the Spark Cassandra Connector (even in local mode). The code could look like the following (note that I'm using validfrom as the name, not validFrom, because it's not escaped in your schema):
import com.datastax.spark.connector._
val data = sc.cassandraTable("test", "article")
.where("validfrom >= '2020-07-28T11:50:00Z' AND validfrom < '2020-07-28T12:50:00Z'")
.select("id", "validfrom")
data.deleteFromCassandra("test", "article", keyColumns=SomeColumns("id", "validfrom"))
Use DSBulk to find the matching entries and write them to a file (output.csv in my case), and then delete them:
bin/dsbulk unload -url output.csv \
-query "SELECT id, validfrom FROM test.article WHERE token(id) > :start AND token(id) <= :end AND validFrom >= '2020-07-28T11:50:00Z' AND validFrom < '2020-07-28T12:50:00Z' ALLOW FILTERING"
bin/dsbulk load -query "DELETE from test.article WHERE id = :id and validfrom = :validfrom" \
-url output.csv
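Both options follow the same two-step pattern: scan for the primary keys that fall in the time window, then delete each row by its full primary key. A minimal sketch of that pattern in plain Python, using a dictionary as a stand-in for the table (the sample rows and the helper names find_keys and delete_keys are made up for illustration):

```python
from datetime import datetime

# Toy stand-in for the article table: (id, validfrom) -> price
articles = {
    ("a1", datetime(2020, 7, 28, 11, 0)): 100,
    ("a1", datetime(2020, 7, 28, 12, 0)): 110,
    ("a2", datetime(2020, 7, 28, 12, 30)): 200,
    ("a2", datetime(2020, 7, 28, 13, 0)): 210,
}


def find_keys(table, start, end):
    """Step 1: full scan collecting primary keys with start <= validfrom < end."""
    return [key for key in table if start <= key[1] < end]


def delete_keys(table, keys):
    """Step 2: delete each row by its full primary key (id, validfrom)."""
    for key in keys:
        del table[key]


to_delete = find_keys(
    articles,
    datetime(2020, 7, 28, 11, 50),
    datetime(2020, 7, 28, 12, 50),
)
delete_keys(articles, to_delete)
print(sorted(articles))
```

The scan is the expensive part, which is why it is handed to a tool that can parallelise it over token ranges; the deletes themselves are ordinary single-row operations.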
To add to Alex Ott's answer, this comment of yours is incorrect:
This query uses article id as restriction, query uses the partition key. Since the clustering order is based on price, cassandra can efficient perform this query.
The rows are not ordered by price. They are ordered by validFrom in reverse-chronological order. Cheers!

PySpark Pushing down timestamp filter

I'm using PySpark version 2.4 to read some tables using jdbc with a Postgres driver.
df = spark.read.jdbc(url=data_base_url, table="tablename", properties=properties)
One column is a timestamp column and I want to filter it like this:
df_new_data = df.where(df.ts > last_datetime )
This way the filter is pushed down as a SQL query, but the datetime format is not right. So I tried this approach:
df_new_data = df.where(df.ts > F.date_format( F.lit(last_datetime), "y-MM-dd'T'hh:mm:ss.SSS") )
but then the filter is not pushed down anymore.
Can someone clarify why this is the case?
When loading data from a database table, if you want to push a query down to the database and get back only a few result rows, you can pass a query instead of a table name and receive just its result as a DataFrame. This lets the database engine process the query and return only the results to Spark.
The table parameter identifies the JDBC table to read. You can use anything that is valid in a SQL FROM clause. Note that the subquery must be given an alias.
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
df.show()
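Applied to the timestamp filter from the question, the subquery could be built along these lines. This is only a sketch: build_pushdown_query and the new_rows alias are names made up here, the timestamp format is an assumption about what the Postgres column accepts, and the table and column names are interpolated as-is, so they must come from trusted code rather than user input.

```python
from datetime import datetime


def build_pushdown_query(table, ts_column, last_datetime, alias="new_rows"):
    """Build a parenthesised, aliased subquery for the `table` argument of spark.read.jdbc."""
    # Format the timestamp as a plain literal, using a 24-hour clock (%H).
    ts_literal = last_datetime.strftime("%Y-%m-%d %H:%M:%S.%f")
    return f"(SELECT * FROM {table} WHERE {ts_column} > '{ts_literal}') {alias}"


query = build_pushdown_query("tablename", "ts", datetime(2020, 1, 2, 15, 4, 5))
print(query)

# With a live SparkSession the string would be passed like this (not run here):
# df_new_data = spark.read.jdbc(url=data_base_url, table=query, properties=properties)
```

Because the filter is part of the query text sent to the database, it does not depend on Spark deciding whether a DataFrame expression can be pushed down.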

Cassandra Modelling for Date Range

Cassandra Newbie here. Cassandra v 3.9.
I'm modelling travellers' flight check-in data.
My main query is a search for travellers within a date range (a window of at most 7 days).
Here is what I've come up with, given my limited exposure to Cassandra:
create table IF NOT EXISTS travellers_checkin (checkinDay text, checkinTimestamp bigint, travellerName text, travellerPassportNo text, flightNumber text, from text, to text, bookingClass text, PRIMARY KEY (checkinDay, checkinTimestamp)) WITH CLUSTERING ORDER BY (checkinTimestamp DESC)
I'm expecting up to a million records per day, so each partition would hold up to a million records.
Now my users want a search in which the date window is mandatory (at most one week). In this case, should I use an IN clause that spans multiple partitions? Is this the correct way, or should I think about re-modelling the data? Alternatively, I'm also wondering whether issuing 7 queries (one per day) and merging the responses would be efficient.
Your data model seems good, but it will scale better if you can add another field to the partition key. You should also use a separate query per day, executed with executeAsync.
If you are using in clause, this means that you’re waiting on this single coordinator node to give you a response, it’s keeping all those queries and their responses in the heap, and if one of those queries fails, or the coordinator fails, you have to retry the whole thing
Source : https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
Instead of using an IN clause, issue a separate query for each day and execute each one with executeAsync.
Java example:
// Assumed for illustration: `session` is an open driver Session, `days` lists the
// checkinDay values in the requested window, and `fromTs`/`toTs` (long) bound checkinTimestamp.
PreparedStatement statement = session.prepare("SELECT * FROM travellers_checkin WHERE checkinDay = ? AND checkinTimestamp >= ? AND checkinTimestamp <= ?");
List<ResultSetFuture> futures = new ArrayList<>();
for (String day : days) {
    // One asynchronous query per day partition; bind all three markers
    futures.add(session.executeAsync(statement.bind(day, fromTs, toTs)));
}
for (ResultSetFuture future : futures) {
    ResultSet rows = future.getUninterruptibly();
    // You get the result set of each query; merge them here
}
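The same fan-out-and-merge idea, sketched in plain Python without a driver. Everything here is illustrative: fetch_day is a hypothetical stand-in for one per-partition query and returns fake rows, and the ISO date format for checkinDay is an assumption, since the question does not say how the day is encoded.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def day_keys(start, end):
    """List one partition key per day from start to end inclusive (ISO format assumed)."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def fetch_day(day):
    """Hypothetical stand-in for one per-partition query; returns fake rows."""
    return [f"{day}/row-{i}" for i in range(2)]


def fetch_window(start, end):
    """Run one query per day concurrently and merge the results in day order."""
    keys = day_keys(start, end)
    with ThreadPoolExecutor(max_workers=max(1, len(keys))) as pool:
        # map() preserves the order of keys, so results come back day by day
        per_day = list(pool.map(fetch_day, keys))
    return [row for rows in per_day for row in rows]


rows = fetch_window(date(2017, 1, 1), date(2017, 1, 7))
print(len(rows), rows[0], rows[-1])
```

Each day's query can fail and be retried on its own, which is the advantage over a single IN query handled by one coordinator.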

Apache Spark Query with HiveContext doesn't work

I use Spark 1.6.1. In my Spark Java program I connect to a Postgres database and register every table as a temporary table via JDBC. For example:
Map<String, String> optionsTable = new HashMap<String, String>();
optionsTable.put("url", "jdbc:postgresql://localhost/database?user=postgres&password=passwd");
optionsTable.put("dbtable", "table");
optionsTable.put("driver", "org.postgresql.Driver");
DataFrame table = sqlContext.read().format("jdbc").options(optionsTable).load();
table.registerTempTable("table");
This works without problems:
hiveContext.sql("select * from table").show();
Also this works:
DataFrame tmp = hiveContext.sql("select * from table where value=key");
tmp.registerTempTable("table");
And then I can see the contents of the table with:
hiveContext.sql("select * from table").show();
But now I have a problem. When I execute this:
hiveContext.sql("SELECT distinct id, timestamp FROM measure tble, measure_range w WHERE tble.timestamp >= w.left and tble.timestamp <= w.right").show();
Spark does nothing, although the same query works fine on the original Postgres database. So I modified the query slightly:
hiveContext.sql("SELECT id, timestamp FROM measure tble, measure_range w WHERE tble.timestamp >= w.left").show();
This query works and returns results, but the first one does not. What is the difference, and why does the second query work while the first does not?
The database is not very big either. For testing it is about 4 MB.
Since you're trying to select a distinct ID, you need to select timestamp as a part of an aggregate function and then group by ID. Otherwise, it doesn't know which time stamp to pair with the ID.
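To show the shape of the query this suggests, here is a small self-contained sketch. It uses Python's built-in sqlite3 purely as a stand-in SQL engine, not Spark or Postgres, and the sample data and the column names ts, lo and hi are made up (renamed from timestamp, left and right to avoid keyword clashes).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE measure (id INTEGER, ts INTEGER);
    CREATE TABLE measure_range (lo INTEGER, hi INTEGER);
    INSERT INTO measure VALUES (1, 10), (1, 20), (2, 15), (3, 99);
    INSERT INTO measure_range VALUES (5, 25);
""")

# One row per id: the timestamp goes through an aggregate, and id is the group key.
rows = conn.execute("""
    SELECT m.id, MAX(m.ts) AS latest_ts
    FROM measure m, measure_range w
    WHERE m.ts >= w.lo AND m.ts <= w.hi
    GROUP BY m.id
    ORDER BY m.id
""").fetchall()
print(rows)
```

Here MAX picks the latest timestamp per id; MIN or another aggregate would work the same way, depending on which timestamp should be paired with each id.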
