When using DLT, we can create a live table with either STREAMING LIVE TABLE or LIVE TABLE, as written in the docs:
CREATE OR REFRESH { STREAMING LIVE TABLE | LIVE TABLE } table_name
What is the difference between the two syntaxes?
It's described in the documentation, on the Concepts page.
A live table or view always reflects the results of the query that defines it, including when the query defining the table or view is updated, or an input data source is updated. Like a traditional materialized view, a live table or view may be entirely computed when possible to optimize computation resources and time.
A streaming live table or view processes data that has been added only since the last pipeline update. Streaming tables and views are stateful; if the defining query changes, new data will be processed based on the new query and existing data is not recomputed.
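As a minimal sketch (the table and column names below are made up, not from the question), the difference shows up in how the defining query reads its input: a LIVE TABLE is recomputed from its full input, while a STREAMING LIVE TABLE reads its source incrementally via STREAM():
CREATE OR REFRESH STREAMING LIVE TABLE orders_cleaned
COMMENT "Hypothetical incremental table: only newly arrived rows are processed."
AS SELECT * FROM STREAM(LIVE.orders_raw) WHERE amount IS NOT NULL;
CREATE OR REFRESH LIVE TABLE daily_totals
COMMENT "Hypothetical materialized table: recomputed from its full input on each update."
AS SELECT order_date, sum(amount) AS total
FROM LIVE.orders_cleaned
GROUP BY order_date;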
I get the below error when I try to do APPLY CHANGES from one delta live streaming table to another delta live streaming table. Is this scenario not supported?
pyspark.sql.utils.AnalysisException: rajib_db.employee_address_stream is a permanent view, which is not supported by streaming reading API such as `DataStreamReader.table` yet.
If you mean to perform APPLY CHANGES into one table and then stream from it into another table, that's not supported yet. Please reach out to someone from Databricks (a solution architect or customer success engineer) to provide feedback to the Databricks product team.
Also, if you run DESCRIBE EXTENDED on the table into which you do APPLY CHANGES, you will see that it's not a real table, but a view over another table that filters out some entries.
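For reference, a rough sketch of the first half of that pattern (the table and column names here are hypothetical; check the APPLY CHANGES documentation for the full set of options):
CREATE OR REFRESH STREAMING LIVE TABLE employee_address;
APPLY CHANGES INTO LIVE.employee_address
FROM STREAM(LIVE.employee_address_cdc)
KEYS (employee_id)
SEQUENCE BY change_ts;
-- DESCRIBE EXTENDED employee_address then shows a view over an internal table,
-- which is why streaming from it into another table is rejected.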
It was my understanding that references to streaming delta live tables require the use of the function STREAM(), supplying the table name as an argument.
Given below is a code snippet that I found in one of the demo notebooks that Databricks provides. Here, I see the use of STREAM() in the FROM clause, but it has not been used in the LEFT JOIN, even though that table is also a streaming table. This query still works.
What exactly is the correct syntax here?
CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_cleaned(
CONSTRAINT valid_order_number EXPECT (order_number IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "The cleaned sales orders with valid order_number(s) and partitioned by order_datetime."
AS
SELECT f.customer_id, f.customer_name, f.number_of_line_items,
timestamp(from_unixtime((cast(f.order_datetime as long)))) as order_datetime,
date(from_unixtime((cast(f.order_datetime as long)))) as order_date,
f.order_number, f.ordered_products, c.state, c.city, c.lon, c.lat, c.units_purchased, c.loyalty_segment
FROM STREAM(LIVE.sales_orders_raw) f
LEFT JOIN LIVE.customers c
ON c.customer_id = f.customer_id
AND c.customer_name = f.customer_name
Just for reference, given below are the other two tables that act as inputs to the above query:
CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_raw
COMMENT "The raw sales orders, ingested from /databricks-datasets."
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/sales_orders/", "json", map("cloudFiles.inferColumnTypes", "true"));
CREATE OR REFRESH STREAMING LIVE TABLE customers
COMMENT "The customers buying finished products, ingested from /databricks-datasets."
AS SELECT * FROM cloud_files("/databricks-datasets/retail-org/customers/", "csv");
There are different types of joins on Spark streams:
stream-static join (doc). This is exactly your case: you have STREAM(LIVE.sales_orders_raw) for the orders, but the customers table is treated as static (it's read on each microbatch and represents the state at the moment of invocation). This is usually the right choice for your kind of functionality.
stream-stream join. In this case both streams may need to be aligned against each other, because data can arrive late. Both streams then use the STREAM(LIVE....) syntax. It may not be the best fit for your case, because both streams have to wait for late data, and you will need to define a watermark on both streams. See the Spark documentation on stream-stream joins for details.
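For completeness, a rough sketch of what the stream-stream variant would look like in DLT SQL (the target table name is invented). Note that without watermarks on both sides the join state grows without bound, so consult the Spark structured streaming documentation before using this for real:
CREATE OR REFRESH STREAMING LIVE TABLE sales_orders_with_customers
AS SELECT f.*, c.state, c.city
FROM STREAM(LIVE.sales_orders_raw) f
JOIN STREAM(LIVE.customers) c
ON c.customer_id = f.customer_id
AND c.customer_name = f.customer_name;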
We're exploring the possibility of using temporary views in Spark and relating them to some actual file storage, or to other temporary views. We want to achieve something like:
1. Some user uploads data to some S3/HDFS file storage.
2. A (temporary) view is defined so that Spark SQL queries can be run against the data.
3. Some other temporary view (referring to some other data) is created.
4. A third temporary view is created that joins data from (2) and (3).
5. By selecting from the view in (4), the user gets a table that reflects the joined data from (2) and (3). This result can be further processed and stored in a new temporary view, and so on.
So we end up with a tree of temporary views, each querying its parent temporary views until they end up loading data from the filesystem. Basically we want to store transformation steps (selecting, joining, filtering, modifying, etc.) on the data, without storing new versions. The Spark SQL support and temporary views seem like a good fit.
We did some successful testing. The idea is to store the specification of these temporary views in our application and recreate them during startup (as temporary or global views).
Not sure if this is a viable solution. One problem is that we need to know how the temporary views are related (which one queries which). We create them like:
sparkSession.sql("select * from other_temp_view").createTempView(name)
So, when this is run, we have to make sure that other_temp_view is already created in the session. Not sure how this can be achieved. One idea is to store a timestamp and recreate them in the same order. This could be OK since our views will most likely have to be "immutable": we're not allowed to change a query that other queries rely on.
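To make the dependency problem concrete, here is a rough Spark SQL sketch of the kind of view chain we mean (the paths and view names are invented); each view only references its parents, so recreating them in dependency order is sufficient:
-- (2) a view over data uploaded to S3/HDFS
CREATE TEMPORARY VIEW uploads_a
AS SELECT * FROM parquet.`s3://some-bucket/uploads/a/`;
-- (3) a view over some other data
CREATE TEMPORARY VIEW uploads_b
AS SELECT * FROM parquet.`s3://some-bucket/uploads/b/`;
-- (4) a view joining (2) and (3); it can only be created after its parents exist
CREATE TEMPORARY VIEW joined_ab
AS SELECT a.id, a.value, b.label
FROM uploads_a a
JOIN uploads_b b ON a.id = b.id;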
Any thoughts would be most appreciated.
I would definitely go with the SessionCatalog object:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-SessionCatalog.html
You can access it with spark.sessionState.catalog
I am in the process of migrating data from a Cassandra table (having an old schema with faulty partition keys, etc.) that has a materialized view created on it, to another, redefined table that also has a materialized view.
So I redefined the schema and will insert the data into the new table.
What would be the faster, more efficient way to insert data into the new table, given these scenarios:
1. Create just the new table and do not create its MV until all data is inserted, i.e. create the MV at the end.
2. Create both at once and insert the data.
My perception is that option 1 would be faster, as the 2nd option would keep the MV updated (behind the scenes it creates a table that is updated on each insert).
NOTE: the question is more about the performance of migrating data with the MVs created before versus after.
If you can, follow the 1st variant: you may be able to load data faster, as a materialized view adds overhead to every operation. After the data is loaded, create the materialized view and check its status with nodetool viewbuildstatus.
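A rough CQL sketch of the 1st variant, with invented keyspace, table, and column names:
-- 1. Create only the redefined base table and bulk-load all data into it
CREATE TABLE ks.employee_new (
    dept text,
    id   uuid,
    name text,
    PRIMARY KEY ((dept), id)
);
-- ... load / copy the data here ...
-- 2. Only after the load has finished, create the materialized view
CREATE MATERIALIZED VIEW ks.employee_new_by_name AS
    SELECT * FROM ks.employee_new
    WHERE name IS NOT NULL AND dept IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY ((name), dept, id);
-- 3. Then watch the build from a shell: nodetool viewbuildstatus ks employee_new_by_name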
I have a table which is relatively big. I want to create a materialized view on it in Cassandra. While the view is being populated, if the base table gets updated, will the view also get updated with those changes? How does that work? Because in order to execute a batchlog on the view, the partition on the base table will be locked, so it cannot wait until the population has finished.
In my case, I will perform only inserts or deletes on the base table, which simplifies things, I guess. But what if I also performed updates? Would Cassandra check the timestamps to somehow detect which value is most recent?