My understanding was that references to streaming Delta Live Tables require the STREAM() function, with the table name supplied as an argument.
Below is a code snippet from one of the demo notebooks that Databricks provides. STREAM() is used in the FROM clause but not in the LEFT JOIN, even though that table is also a streaming table, and the query still works.
What exactly is the correct syntax here?
CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_cleaned(
CONSTRAINT valid_order_number EXPECT (order_number IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "The cleaned sales orders with valid order_number(s) and partitioned by order_datetime."
AS
SELECT f.customer_id, f.customer_name, f.number_of_line_items,
timestamp(from_unixtime((cast(f.order_datetime as long)))) as order_datetime,
date(from_unixtime((cast(f.order_datetime as long)))) as order_date,
f.order_number, f.ordered_products, c.state, c.city, c.lon, c.lat, c.units_purchased, c.loyalty_segment
FROM STREAM(LIVE.sales_orders_raw) f
LEFT JOIN LIVE.customers c
ON c.customer_id = f.customer_id
AND c.customer_name = f.customer_name
Just for reference, below are the other two tables that act as inputs to the above query:
CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_raw
COMMENT "The raw sales orders, ingested from /databricks-datasets."
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/sales_orders/", "json", map("cloudFiles.inferColumnTypes", "true"))
CREATE OR REFRESH STREAMING LIVE TABLE customers
COMMENT "The customers buying finished products, ingested from /databricks-datasets."
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/customers/", "csv");
There are different types of joins on Spark streams:
stream-static join. (doc) This is exactly your case: you have STREAM(LIVE.sales_orders_raw) for orders, but the customers table is treated as static (it's read on each micro-batch and represents the state at the moment of invocation). This is usually the right choice for your kind of functionality.
stream-stream join. In this case both streams may need to be aligned against each other because data may arrive late, so both sides use the STREAM(LIVE....) syntax. This may not be the best fit for you, because both streams have to wait for late data, and you will need to define a watermark on each stream. See the Spark documentation on stream-stream joins.
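To make the stream-stream case concrete, here is a minimal Structured Streaming sketch in Scala (plain Spark rather than DLT SQL); the table names, column names, watermark durations, and the join time range are assumptions for illustration only.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder.appName("StreamStreamJoinSketch").getOrCreate()

// Hypothetical streaming sources; both time columns are assumed to already be TIMESTAMP typed.
val orders = spark.readStream.table("sales_orders_raw")
  .withWatermark("order_datetime", "30 minutes")   // how late order events may arrive
  .as("o")

val customerUpdates = spark.readStream.table("customer_updates")
  .withWatermark("update_datetime", "2 hours")     // how late customer updates may arrive
  .as("c")

// A stream-stream join needs a time-range condition in addition to the equality keys,
// so Spark knows how long to keep each side buffered in state.
val joined = orders.join(
  customerUpdates,
  expr("""
    o.customer_id = c.customer_id AND
    o.order_datetime BETWEEN c.update_datetime - INTERVAL 1 HOUR
                         AND c.update_datetime + INTERVAL 1 HOUR
  """))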
Related
I have a real-time streaming solution with Kafka, Spark (as the aggregation engine), and Cassandra (as the store). The user defines the aggregates that are needed, and the engine creates the aggregates and writes them to the store. Here is an example of how an aggregate is created:
CREATE AGGR COUNT FROM input_data WHERE type,event,id
This creates a count aggregate for the 3 columns and writes to C*.
We have a requirement to process historical data as well. That means if an aggregate is created today, we need to go back and fix the history for it. To cater to this use case, we have created an hvalue column in Cassandra. Here is the schema for reference:
CREATE TABLE tbl (
key blob,
key2 blob,
key3 blob,
...
key15 blob,
column1 blob,
column2 blob,
...
column20 blob,
hvalue blob,
value blob,
PRIMARY KEY ((key, key2, key3 ... key15), column1 ... column20)
) WITH CLUSTERING ORDER BY (column1 ASC,column2 ASC .. column20 ASC)
value stores the facts that are computed during online processing, and hvalue stores the value from historical processing. When querying, both columns are retrieved, merged, and returned to the user.
We are using the DataStax connector's left join API to join with Cassandra.
RDD.leftJoinWithCassandraTable(keyspace, tableName)
  .on(SomeColumns(...))
  .map { case (ip, row) =>
    row match {
      case None => ip
      case Some(data) => CASSANDRA_MAP_SCHEMA(...)
    }
  }
  .saveToCassandra(keyspace, tableName)
In short, we create a schema for the RDD, and write the row to Cassandra.
Now, here is the problem. During the historical process, we need to create a row to write to Cassandra. This means that we need to provide some data to the "value" column. If it is a new row that is not present in Cassandra, we create a null object and write back. If the row is present, we take the existing value and write it back.
The online and historical processes will run at the same time. This means that between the historical process reading a row and writing it back, the online process may have updated the same row. This results in corrupt data, since the historical process may read stale data and overwrite the value that was written by the online process.
I am not sure how to resolve this problem. I'd appreciate any other solutions to prevent this.
I've tried to explain this as best I can; let me know if further clarification is needed and I'll add more detail.
Thanks in advance for the help.
There are a few ways to work around this, but none are really simple. Fundamentally, write-after-write problems are hard.
The first is to introduce a shared external locking mechanism: you obtain a lock for the row and either release it when the write is done or give it a short TTL. You can use something like Redis for this.
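As a rough illustration of the locking option, here is a sketch using the Jedis client; the key naming scheme, TTL, and retry handling are placeholders you would tune.
import redis.clients.jedis.Jedis
import redis.clients.jedis.params.SetParams

// Acquire a per-row lock before the read-modify-write; the TTL is a safety net in case
// the process dies before releasing the lock.
def withRowLock[T](jedis: Jedis, rowKey: String)(body: => T): Option[T] = {
  val lockKey = s"lock:$rowKey"                                   // hypothetical key naming scheme
  val acquired = jedis.set(lockKey, "1", SetParams.setParams().nx().ex(30))
  if (acquired == "OK") {
    try Some(body)                                                // read row, merge value/hvalue, write back
    finally jedis.del(lockKey)                                    // release the lock when done
  } else None                                                     // someone else holds the lock; retry or skip
}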
A second option is to funnel all changes to Cassandra through a Kafka topic so that only one source is allowed to write, though there is a chance that this will make your problem worse. If you are going to do this, make sure that you partition the topic by key so that the same key always routes to the same partition.
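For the Kafka option the important bit is keying the messages, so the default partitioner always routes a given row to the same partition. A sketch, where the topic name, key layout, and payload are assumptions:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)

// Use the row's Cassandra key as the Kafka message key so all updates for the same row
// land on the same partition and are applied in order by a single consumer.
val rowKey = "type|event|id"                                   // hypothetical composite key
val update = """{"value": 42}"""                               // hypothetical payload
producer.send(new ProducerRecord[String, String]("aggregate-updates", rowKey, update))
producer.close()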
A third option is to let each service operate only on data in a given time range. If your online process only works on data from the last day (or X hours), and your historical process only works on data older than that, there is virtually no chance of running into conflicts.
The fourth option is to accept that conflicts are possible but unlikely enough not to be an issue. If the datacenter where your code runs is very close to (ideally colocated with) your database, and you aren't doing significant processing on the row between read and write, this may be a reasonable option.
I have the following problem with PySpark Structured Streaming.
Every line in my stream data has a user ID and a timestamp. Now, for every line and for every user, I want to add a column with the difference from that user's previous timestamp.
For example, suppose the first line that I receive says: "User A, 08:00:00". If the second line says "User A, 08:00:10" then I want to add a column in the second line called "Interval" saying "10 seconds".
Does anyone know how to achieve this? I tried the window function examples in the Structured Streaming documentation, but they didn't help.
Thank you very much
Since we're talking about Structured Streaming and "every line and for every user", you should use a streaming query with some sort of streaming aggregation (groupBy or groupByKey).
For streaming aggregation you can only rely on micro-batch stream execution in Structured Streaming. That means records for a single user could end up in two different micro-batches, which means you need state.
All together, that means you need a stateful streaming aggregation.
With that, I think you want one of the Arbitrary Stateful Operations, i.e. KeyValueGroupedDataset.mapGroupsWithState or KeyValueGroupedDataset.flatMapGroupsWithState (see KeyValueGroupedDataset):
Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger.
Since Spark 2.2, this can be done using the operation mapGroupsWithState and the more powerful operation flatMapGroupsWithState. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state.
The state would be per user, holding the last record seen. That looks doable.
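Here is a minimal sketch of that in Scala with flatMapGroupsWithState; the source table, case classes, and column names are my own assumptions, and no state timeout is configured (which ties into the concerns below).
import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}

case class Event(userId: String, eventTime: Timestamp)
case class EventWithInterval(userId: String, eventTime: Timestamp, intervalSeconds: Option[Long])
case class LastSeen(eventTime: Timestamp)                    // per-user state: last timestamp seen

val spark = SparkSession.builder.appName("IntervalPerUser").getOrCreate()
import spark.implicits._

// Hypothetical source: a stream already parsed into Event rows.
val events: Dataset[Event] = spark.readStream.table("user_events").as[Event]

val withIntervals = events
  .groupByKey(_.userId)
  .flatMapGroupsWithState[LastSeen, EventWithInterval](
      OutputMode.Append(), GroupStateTimeout.NoTimeout()) { (userId, rows, state) =>
    var last: Option[Timestamp] = state.getOption.map(_.eventTime)
    // Rows inside a micro-batch are not guaranteed to be ordered, so sort by time first.
    val out = rows.toSeq.sortBy(_.eventTime.getTime).map { e =>
      val interval = last.map(prev => (e.eventTime.getTime - prev.getTime) / 1000)
      last = Some(e.eventTime)
      EventWithInterval(userId, e.eventTime, interval)
    }
    last.foreach(ts => state.update(LastSeen(ts)))           // carry the newest timestamp to the next micro-batch
    out.iterator
  }

withIntervals.writeStream.format("console").outputMode("append").start()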
My concerns would be:
How many users is this streaming query going to deal with? (the more users, the bigger the state)
When to clean up the state (of users that are no longer expected in the stream)? (which would keep the state at a reasonable size)
I have data that I would like to do a lot of analytic queries on and I'm trying to figure out if there is a mechanism I can use to store it so that Spark can efficiently do joins on it. I have a solution using RedShift, but would ideally prefer to have something that is based on files in S3 instead of having a whole RedShift cluster up 24/7.
Introduction to the data
This is a simplified example. We have 2 initial CSV files.
Person records
Event records
The two tables are linked via the person_id field. person_id is unique in the Person table. Events have a many-to-one relationship with person.
The goal
I'd like to understand how to set up the data so I can efficiently perform the following query. I will need to perform many queries like this (all queries are evaluated on a per person basis):
The query should produce a data frame with one row for every person and the following columns:
person_id - person_id for each person in the data set
age - "age" field from the person record
cost - The sum of the "cost" field for all event records for that person where "date" is during the month of 6/2013
All my current Spark solutions to this problem involve reshuffling all the data, which makes the process slow for large datasets (hundreds of millions of people). I am happy with a solution that requires me to reshuffle the data and write it to a different format once, if that can then speed up later queries.
The solution using RedShift
I can accomplish this solution using RedShift in a fairly straightforward way:
Both files are loaded as RedShift tables, with DISTKEY person_id and SORTKEY person_id. This distributes the data so that all the data for a person is on a single node. The following query will produce the desired data frame:
select person_id, age, e.cost from person
left join (select person_id, sum(cost) as cost from events
where date between '2013-06-01' and '2013-06-30'
group by person_id) as e using (person_id)
The solution using Spark/Parquet
I have thought of several potential ways to handle this in Spark, but none accomplishes what I need. My ideas and the issues are listed below:
Spark Dataset write 'bucketBy' - Read the CSV files and then rewrite them out as parquet files using "bucketBy". Queries on these parquet files could then be very fast. This would produce a data setup similar to RedShift, but bucketBy is only supported when writing a metastore table with saveAsTable, not plain parquet files.
Spark parquet partitioning - Parquet does support partitioning. Because parquet creates a separate set of files for each partition key, you have to create a computed column to partition on, using a hash of person_id as the partition key (see the sketch after this list). However, when you later join these tables in Spark based on "partition_key" and "person_id", the query plan still does a full hash partition. So this approach is no better than just reading the CSVs and shuffling every time.
Stored in some other data format besides parquet - I am open to this, but don't know of another data source that will work.
Using a compound record format - Parquet supports hierarchical data formats, so I can pre-join both tables into a hierarchical record (where a person record has an "events" field which is an array of struct elements) and then do processing on that. When you have a hierarchical record, there are two approaches to processing it:
Use explode to create separate records - Using this approach you explode array fields into full rows, then use standard data frame operations to do analytics, and then join them back to the main table. Unfortunately, I've been unable to get this approach to compile efficient queries.
Use UDFs to perform operations on subrecords - This preserves the structure and executes without shuffles, but is an awkward and verbose way to program. Also, it requires lots of UDFs, which aren't great for performance (although they beat large-scale shuffling of data).
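For reference, the hash-partitioning idea mentioned in the second approach above would look roughly like this; the paths, options, and bucket count are placeholders, and as noted it does not remove the shuffle at join time.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

// Read the raw CSVs once (paths and header options are placeholders).
val person = spark.read.option("header", "true").option("inferSchema", "true").csv("s3://bucket/person.csv")
val events = spark.read.option("header", "true").option("inferSchema", "true").csv("s3://bucket/event.csv")

// Derive a bounded partition key from a hash of person_id, then rewrite as partitioned parquet.
val numBuckets = 256
def withPartitionKey(df: DataFrame): DataFrame =
  df.withColumn("partition_key", pmod(hash(col("person_id")), lit(numBuckets)))

withPartitionKey(person).write.partitionBy("partition_key").parquet("s3://bucket/person_parquet")
withPartitionKey(events).write.partitionBy("partition_key").parquet("s3://bucket/event_parquet")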
For my use cases, Spark has advantages over RedShift which aren't obvious in this simple example, so I'd prefer to do this with Spark. Please let me know if I am missing something and there is a good approach to this.
Edited per comment.
Assumptions:
Using parquet
Here's what I would try:
val eventAgg = spark.sql("""select person_id, sum(cost) as cost
from events
where date between '2013-06-01' and '2013-06-30'
group by person_id""")
eventAgg.cache.count
val personDF = spark.sql("""SELECT person_id, age from person""")
personDF.cache.count // cache is less important here, so feel free to omit
eventAgg.join(personDF, Seq("person_id"), "left")
I just did this with some of my data and here's how it went (9-node/140-vCPU cluster, ~600 GB RAM):
27,000,000,000 "events" (aggregated to 14,331,487 "people")
64,000,000 "people" (~20 columns)
aggregated events building and caching took ~3 min
people caching took ~30 seconds (pulling from network, not parquet)
left joining took several seconds
Not caching the "people" made the join take a few seconds longer. Then forcing Spark to broadcast the couple hundred MB of aggregated events brought the join under 1 second.
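Roughly what the broadcast variant looks like; I've put the people on the left here so the small aggregated side can be broadcast in a left outer join, which also matches the one-row-per-person goal:
import org.apache.spark.sql.functions.broadcast

// The aggregated events are small, so hint Spark to ship them to every executor and
// avoid shuffling the large people table.
val result = personDF.join(broadcast(eventAgg), Seq("person_id"), "left")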
Let's say I currently have a table like this
create table comment_counters (
  contentid uuid,
  commentid uuid,
  ...
  liked counter,
  PRIMARY KEY (contentid, commentid)
);
The purpose of this table is to track comments and the number of times individual comments have been "liked".
What I would like to do is get the top comments (let's say the top 20) for each piece of content, ranked by their number of likes.
I know there's no way to order by counters, so what I would like to know is: are there other ways to do this in Cassandra, for instance by restructuring my tables or tracking more or different information, or am I left with no choice but to do this in an RDBMS?
Sorting in client is not really an option I would like to consider at this stage.
Unfortunately there's no way to do this type of aggregation using plain Cassandra queries. The best option for doing this kind of data analysis is to use an external tool such as Spark.
Using Spark you can run periodic jobs that read and aggregate all counters from the comment_counters table and then write the results (such as the top 20 comments per content) to a different table that you can query directly.
See here to get started with Cassandra and Spark.
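A minimal sketch of such a periodic job using the Spark Cassandra Connector's DataFrame support; the keyspace and output table names are assumptions, and the output table would be a plain (non-counter) table you create up front.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder.appName("TopCommentsJob").getOrCreate()

// Read all counters from Cassandra (keyspace name is a placeholder).
val counters = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "comment_counters"))
  .load()

// Rank comments per content by likes and keep the top 20.
val byLikes = Window.partitionBy("contentid").orderBy(col("liked").desc)
val top20 = counters
  .withColumn("rank", row_number().over(byLikes))
  .filter(col("rank") <= 20)

// Write the result to a regular table that the application can query directly.
top20.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "top_comments"))
  .mode("append")
  .save()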
How do I write subqueries/nested queries in Cassandra? Is this facility provided in CQL?
Example I tried:
cqlsh:testdb> select itemname from item where itemid = (select itemid from orders where customerid=1);
It just throws the following error:
Bad Request: line 1:87 no viable alternative at input ';'
Because of its distributed nature, Cassandra has no support for RDBMS style joins. You have a few options for when you want something like a join.
One option is to perform separate queries and then have your application join the data itself. This makes sense if the data is relatively small and you only have to perform a small number of queries. Based on the example you gave above, this would probably be a good solution for you.
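As a rough sketch of that client-side join with the DataStax Java driver (3.x API) from Scala; the contact point, keyspace, and column types are assumptions.
import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("testdb")

// First query: the item ids on the customer's orders.
val itemIds = session.execute("SELECT itemid FROM orders WHERE customerid = 1")
  .all().asScala.map(_.getInt("itemid"))

// Second round of queries: look up each item name, joining in the application.
// (In real code, use a prepared statement rather than string interpolation.)
val itemNames = itemIds.map { id =>
  session.execute(s"SELECT itemname FROM item WHERE itemid = $id").one().getString("itemname")
}

cluster.close()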
For more complicated joins, the usual strategy is to denormalize the data and store a materialized view of the join. The advantage is that fetching this data will be much faster than having to build the join in your application every time you need it. The cost is that you now have multiple places where you store the same data, and you will need to keep it all in sync. You can either update all your views when new data comes into the system, or you can have a periodic batch job that rebuilds them.
You might find this article useful: Do You Really Need SQL to Do It All in Cassandra? It's a bit old, but its principles still apply.