Unable to coerce to a formatted date - Cassandra timestamp type

I have values stored in a timestamp-type column of a Cassandra table in the format
2018-10-27 11:36:37.950000+0000 (GMT date).
I get Unable to coerce '2018-10-27 11:36:37.950000+0000' to a formatted date (long) when I run the below query to get the data.
select create_date from test_table where create_date='2018-10-27 11:36:37.950000+0000' allow filtering;
How can I get the query working if the data is already stored in the table (in the format 2018-10-27 11:36:37.950000+0000), and also perform range (>= or <=) operations on the create_date column?
I tried with create_date='2018-10-27 11:36:37.95Z' and
create_date='2018-10-27 11:36:37.95' too.
Is it possible to perform filtering on this kind of timestamp type data?
P.S. I'm using cqlsh to run the queries on the Cassandra table.

In the first case, the problem is that you specify the timestamp with microseconds, while Cassandra operates with milliseconds - try removing the last three digits: .950 instead of .950000 (see this document for details). Timestamps are stored inside Cassandra as a 64-bit number, and are then formatted when printing results using the format specified by the datetimeformat option of cqlshrc (see doc). Dates without an explicit timezone require that a default timezone is specified in cqlshrc.
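For example, here is a minimal sketch (assuming a hypothetical keyspace named test containing test_table, reachable on localhost, and using the Python cassandra-driver) that binds the timestamp as a datetime instead of a string literal, which sidesteps the literal-format issue entirely:
from datetime import datetime, timezone
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Hypothetical contact point and keyspace; adjust to your cluster.
session = Cluster(["127.0.0.1"]).connect("test")

# Bind a datetime instead of a string literal; the driver serializes it
# to Cassandra's 64-bit millisecond timestamp for us.
created = datetime(2018, 10, 27, 11, 36, 37, 950000, tzinfo=timezone.utc)
rows = session.execute(
    "SELECT create_date FROM test_table WHERE create_date = %s ALLOW FILTERING",
    (created,),
)
for row in rows:
    print(row.create_date)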
Regarding your question about filtering the data - this query will work only for small amounts of data; on bigger data sizes it will most probably time out, as it needs to scan all of the data in the cluster. Also, the data won't be sorted correctly, because sorting happens only inside a single partition.
If you want to perform such queries, then the Spark Cassandra Connector may be a better choice, as it can efficiently select the required data, and then you can perform sorting, etc., although this will require many more resources.
I recommend taking the DS220 course from DataStax Academy to understand how to model data for Cassandra.

This works for me:
var datetime = DateTime.UtcNow.ToString("yyyy-MM-dd HH:mm:ss"); // mm = minutes; MM would be the month
var query = $"SET updatedat = '{datetime}' WHERE ...

Related

Databricks query performance when filtering on a column correlated to the partition-column

Setting: Delta Lake, Databricks SQL compute used by Power BI.
I am wondering about the following scenario: we have a column timestamp and a derived column date (which is the date of timestamp), and we choose to partition by date. When we query, we filter on timestamp, not date.
My understanding is that Databricks a priori won't connect the timestamp and the date, and so seemingly won't get any advantage from the partitioning. But since the files are in fact partitioned by timestamp (implicitly), when Databricks looks at the min/max timestamps of all the files, it will find that it can skip most files after all. So it seems like we can get quite a benefit from partitioning even if it's on a column we don't explicitly use in the query.
Is this correct?
What is the performance cost (roughly) of having to filter away files this way versus using the partitioning directly?
Will Databricks have all the min/max information in memory, or does it have to go out and look at the files for each query?
Yes, Databricks will take implicit advantage of this partitioning through data skipping, because there are min/max statistics associated with the specific data files. The min/max information is loaded into memory from the transaction log, but Databricks still needs to decide on every query which files it needs to hit. Because everything is in memory, this shouldn't be a very big performance overhead until you have hundreds of thousands of files.
One thing that you may consider is using a generated column instead of an explicit date column. Declare it as date GENERATED ALWAYS AS (CAST(timestampColumn AS DATE)), and partition by it. The advantage is that when you query on timestampColumn, partition filtering on the date column should happen automatically.
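As a rough sketch of that suggestion (the table and column names below are made up for illustration), the generated-column variant could be declared from PySpark like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A Delta table partitioned by a date column generated from the timestamp.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        id BIGINT,
        eventTimestamp TIMESTAMP,
        eventDate DATE GENERATED ALWAYS AS (CAST(eventTimestamp AS DATE))
    )
    USING DELTA
    PARTITIONED BY (eventDate)
""")

# A filter on eventTimestamp alone should still prune eventDate partitions,
# because Delta derives the partition predicate from the generated column.
spark.sql("""
    SELECT count(*) FROM events
    WHERE eventTimestamp >= '2023-01-01 00:00:00'
""").show()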

Spark JDBC UpperBound

jdbc(String url,
String table,
String columnName,
long lowerBound,
long upperBound,
int numPartitions,
java.util.Properties connectionProperties)
Hello,
I want to import a few tables from Oracle into HDFS using Spark JDBC connectivity. To ensure parallelism, I want to choose the correct upperBound for each table. I am planning to use row_number as my partition column and the count of the table as the upperBound. Is there a better way to choose the upperBound, since I have to connect to the table a first time just to get the count? Please help.
Generally, the better way to use partitioning in Spark JDBC is:
Choose a numeric or date type column.
Set the upper bound to the maximum value of the column.
Set the lower bound to the minimum value of the column.
(If there is skew, there are other ways to handle it; generally this is good.)
Obviously the above requires some querying and handling:
Keep the mapping of table to partition column (probably in an external store).
Query and fetch the min and max (see the sketch below).
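A minimal PySpark sketch of that min/max approach (the Oracle URL, credentials, table, and column names below are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1"  # placeholder
props = {"user": "scott", "password": "tiger", "driver": "oracle.jdbc.OracleDriver"}

# One small query to fetch the bounds of the numeric/date partition column.
bounds = spark.read.jdbc(
    url=url,
    table="(SELECT MIN(id) AS lo, MAX(id) AS hi FROM my_schema.my_table) b",
    properties=props,
).first()
lower, upper = bounds[0], bounds[1]

# Parallel read of the full table using those bounds.
df = spark.read.jdbc(
    url=url,
    table="my_schema.my_table",
    column="id",
    lowerBound=lower,
    upperBound=upper,
    numPartitions=8,
    properties=props,
)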
Another tip for skipping that query: if you can find a date-based column, you can probably use today's date as the upper bound and some early date (say, in the 2000s) as the lower bound. But again, this is subject to your content; the values might not hold true.
From your question I believe you are looking for something generic that you can easily apply to all tables. I understand that is the desired state, but if it were as straightforward as using row_number in the database, Spark would already do that by default.
Such functions may technically work, but they will definitely be slower than the above steps, as well as put extra load on your database.

Cassandra time series table design for timestamp range queries

Our problem is a bit different from a usual time-series problem, as we do not have a natural partition key in our data. In our system we get no more than 5k messages/s, so following many publications (like this one) we came up with the following schema (it's more complex, but the part below matters most):
CREATE TABLE IF NOT EXISTS test.messages (
date TEXT,
hour INT,
createdAt TIMESTAMP,
uuid UUID,
data TEXT,
PRIMARY KEY ((date, hour), createdAt, uuid)
)
We mostly want to query the system based on the creation (event) time; other filtering will likely be done in different engines like Spark. The problem is that a query may span e.g. two months, so ideally we would have to put 60+ dates and 24 hours in the WHERE ... IN part of the query, which is cumbersome to say the least. Of course, we can execute queries like the one below:
SELECT * FROM messages WHERE createdat >= '2017-03-01 00:00:00' LIMIT 10 ALLOW FILTERING;
My understanding is that, while the above works, it will do a full scan, which will be expensive on a larger cluster. Or am I mistaken and C* knows which partitions to scan?
I was thinking of adding an index, but this likely falls into the high-cardinality antipattern, as I understand it.
EDIT: the question is not so much about the data model, though suggestions are welcome, but more about the feasibility of making queries with a createdat range instead of listing all the date and hour values required in the WHERE ... IN part of the query, so as to avoid full scans.
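For what it's worth, here is a small sketch (using the Python cassandra-driver against the test.messages schema above; the contact point is a placeholder) of enumerating the (date, hour) buckets for a range, so that each query stays restricted to a single partition instead of scanning the table:
from datetime import datetime, timedelta
from cassandra.cluster import Cluster  # pip install cassandra-driver

def hour_buckets(start, end):
    # Yield the (date, hour) partition keys covering [start, end].
    cur = start.replace(minute=0, second=0, microsecond=0)
    while cur <= end:
        yield cur.strftime("%Y-%m-%d"), cur.hour
        cur += timedelta(hours=1)

session = Cluster(["127.0.0.1"]).connect("test")
stmt = session.prepare(
    "SELECT * FROM messages WHERE date = ? AND hour = ? "
    "AND createdat >= ? AND createdat <= ?"
)

start, end = datetime(2017, 3, 1), datetime(2017, 4, 30, 23, 59, 59)
for d, h in hour_buckets(start, end):
    for row in session.execute(stmt, (d, h, start, end)):
        pass  # process the row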

Cassandra Query by Date

How do I update a column based on a greater-than or less-than date comparison in Cassandra?
Example:
update asset_by_file_path set received = true where file_path = '/file/path' and time_received = '2015-07-24 02:14:34-0600';
This works fine. But I would like to do it for all rows that match this file path where time_received is greater than 2015-07-24 02:14:34-0600.
time_received is a date and the clustering column.
file_path is a string and the partition key.
Cassandra's WHERE clause has many limitations, and if you have several clustering columns things may not work as you expect; in particular, there are limitations on the >, >=, <, <= operators. Here is a fairly fresh blog post from DataStax about WHERE clause nuances; it also covers some upcoming features.
I think UPDATE can only modify a single row at a time, so I don't see a way to update multiple rows on the server side in CQL.
A couple of possible programmatic approaches:
Do a range query to return all the rows you want to update, and then, on the client side, update each returned row. Since they would all be in the same partition, you could issue the updates as batched statements (see the sketch after this list).
If you have Spark available, you could read all the rows you want to update into an RDD using a range query, then do a transformation on the RDD to set the received value to true, and save the RDD back to Cassandra.
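Here is a minimal sketch of the first approach with the Python cassandra-driver (the contact point and keyspace name are placeholders); since every update targets the same partition, an unlogged batch is cheap:
from datetime import datetime
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # placeholders

select = session.prepare(
    "SELECT time_received FROM asset_by_file_path "
    "WHERE file_path = ? AND time_received > ?"
)
update = session.prepare(
    "UPDATE asset_by_file_path SET received = true "
    "WHERE file_path = ? AND time_received = ?"
)

path = "/file/path"
cutoff = datetime(2015, 7, 24, 8, 14, 34)  # 02:14:34-0600 expressed in UTC

# Read the matching rows, then update each one; all statements hit the
# same partition (file_path), so a single unlogged batch is fine.
batch = BatchStatement(batch_type=BatchType.UNLOGGED)
for row in session.execute(select, (path, cutoff)):
    batch.add(update, (path, row.time_received))
session.execute(batch)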

cassandra filtering on an indexed column isn't working

I'm using (the latest version of) the Cassandra NoSQL DBMS to model some data.
I'd like to get a count of the number of active customer accounts in the last month.
I've created the following table:
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
So because I want to filter by date, I create an index on the date column:
CREATE INDEX ON active_accounts (date);
When I insert some data, Cassandra automatically updates data on any existing primary key matches, so the following inserts only produce two records:
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer1', 'account1', 1418377413000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377414000);
insert into active_accounts (customer_name, account_name, date) Values ('customer2', 'account2', 1418377415000);
This is exactly what I'd like - I won't get a huge table of data, and each entry in the table represents a unique customer account - so no need for a select distinct.
The query I'd like to make is: how many distinct customer accounts were active within the last month, say:
Select count(*) from active_accounts where date >= 1418377411000 and date <= 1418397411000 ALLOW FILTERING;
In response to this query, I get the following error:
code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"
What am I missing; isn't this the purpose of the Index I created?
Table design in Cassandra is extremely important and it must match the kinds of queries that you are trying to perform. The reason Cassandra is trying to keep you from performing queries on the date column is that any query along that column will be extremely inefficient.
Table Design - Model your queries
One of the main reasons that Cassandra can be fast is that it partitions user data so that most (99%) of queries can be completed without contacting all of the nodes in the cluster. This means less network traffic, less disk access, and faster response times. Unfortunately, Cassandra isn't able to automatically determine the best way to partition data. The end user must design a schema which fits into the C* data model and allows the queries they want at high speed.
CREATE TABLE active_accounts
(
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY ((customer_name, account_name))
);
This schema will only be efficient for queries that look like
SELECT date FROM active_accounts WHERE customer_name = ? AND account_name = ?
This is because on the cluster the data is actually going to be stored like
node 1: [ ((Bob,1)->Monday), ((Tom,32)->Tuesday)]
node 2: [ ((Candice, 3) -> Friday), ((Sarah,1) -> Monday)]
The PRIMARY KEY for this table says that data should be placed on a node based on the hash of the combination of customer_name and account_name. This means we can only look up data quickly if we have both of those pieces of data. Anything outside of that scope becomes a batch job, since it requires hitting multiple nodes and filtering over all the data in the table.
To optimize for different queries you need to change the layout of your table or use a distributed analytics framework like Spark or Hadoop.
An example of a different table schema that might work for your purposes would be something like
CREATE TABLE active_accounts
(
start_month timestamp,
customer_name text,
account_name text,
date timestamp,
PRIMARY KEY (start_month, date, customer_name, account_name)
);
In this schema I would put the timestamp of the first day of the month as the partition key and date as the first clustering key. This means that multiple account creations that took place in the same month will end up in the same partition and on the same node. The data for a schema like this would look like
node 1: [ (May 1 1999) -> [(May 2 1999, Bob, 1), (May 15 1999, Tom, 32)] ]
This places the account dates in order within each partition, making it very fast to do range slices between particular dates. Unfortunately, you would have to add code on the application side to pull down the multiple months that a query might span. This schema takes a lot of (dev) work, so if these queries are very infrequent you should use a distributed analytics platform instead.
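For illustration, a rough client-side sketch of that "pull down multiple months" step with the Python cassandra-driver (the contact point and keyspace name are placeholders):
from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # placeholders

stmt = session.prepare(
    "SELECT customer_name, account_name, date FROM active_accounts "
    "WHERE start_month = ? AND date >= ? AND date <= ?"
)

def month_starts(start, end):
    # Yield the first-of-month timestamps (partition keys) covering [start, end].
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        yield datetime(y, m, 1)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)

start, end = datetime(2014, 11, 15), datetime(2014, 12, 15)
rows = []
for month in month_starts(start, end):
    rows.extend(session.execute(stmt, (month, start, end)))
print(len(rows), "active accounts in range")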
For more information on this kind of time-series modeling check out:
http://planetcassandra.org/getting-started-with-time-series-data-modeling/
Modeling in general:
http://www.slideshare.net/planetcassandra/cassandra-day-denver-2014-40328174
http://www.slideshare.net/johnny15676/introduction-to-cql-and-data-modeling
Spark and Cassandra:
http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
Don't use secondary indexes
ALLOW FILTERING was added to the CQL syntax to prevent users from accidentally designing queries that will not scale. Secondary indexes are really only for use by those doing analytics jobs, or those C* users who fully understand the implications. In Cassandra the secondary index lives on every node in your cluster. This means that any query that requires a secondary index will necessarily require contacting every node in the cluster. This will become less and less performant as the cluster grows, and it is definitely not something you want for a frequent query.
