I have a data pipeline, running through Databricks, that writes data daily to a new partition of a folder in a mounted data lake.
So, my folder structure looks something like this:
.
├── _my_dataset
│   ├── partition_1
│   │   └── my_data.snappy.parquet
│   └── partition_2
│       └── my_data.snappy.parquet
Now, I want to create a table on top of this data, so that each time my users query it, the most up-to-date data is returned. For example, when I query it today, I want to be able to view the data that was written today, in addition to the past data.
I know I can create a table as shown below:
CREATE TABLE my_table
USING PARQUET LOCATION 'dbfs:/path/to/my_table/*'
However, I won't be able to view the most recent data unless I execute a REFRESH TABLE command.
I believe that another option would be, of course, to create a view. But, for a very large dataset, this will certainly have a performance impact.
What is the best way to go about this?
I would also like to confirm something as a follow-up. If my files are written in Delta format instead of Parquet, and I am able to create a Delta table on top of this data, I will automatically be able to view the entirety of the data. Is this understanding correct?
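For reference, a minimal sketch of both variants from a notebook, using hypothetical mount paths in place of the real ones:

# Parquet variant: Spark caches the file listing for the table, so data
# written after table creation only shows up after an explicit refresh.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_table
    USING PARQUET
    LOCATION 'dbfs:/mnt/lake/_my_dataset'
""")
spark.sql("REFRESH TABLE my_table")  # needed after each day's write

# Delta variant: every query reads the latest snapshot recorded in the
# transaction log, so newly written data is visible without a refresh.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_delta_table
    USING DELTA
    LOCATION 'dbfs:/mnt/lake/_my_delta_dataset'
""")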
Related
When using DLT, we can create a live table with either STREAMING LIVE TABLE or LIVE TABLE, as written in the docs:
CREATE OR REFRESH { STREAMING LIVE TABLE | LIVE TABLE } table_name
What is the difference between the two syntaxes?
It's described in the documentation, on the Concepts page.
A live table or view always reflects the results of the query that defines it, including when the query defining the table or view is updated, or an input data source is updated. Like a traditional materialized view, a live table or view may be entirely computed when possible to optimize computation resources and time.
A streaming live table or view processes data that has been added only since the last pipeline update. Streaming tables and views are stateful; if the defining query changes, new data will be processed based on the new query and existing data is not recomputed.
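As a rough illustration of the same distinction in the DLT Python API (a minimal sketch; raw_events is a hypothetical upstream dataset, and the code only runs inside a DLT pipeline):

import dlt

# Live table: behaves like a materialized view; its defining query is
# recomputed over the input as needed on each pipeline update.
@dlt.table(name="events_live")
def events_live():
    return dlt.read("raw_events").where("event_type IS NOT NULL")

# Streaming live table: only processes data added since the last update,
# so the input is read as a stream and existing rows are not recomputed.
@dlt.table(name="events_streaming")
def events_streaming():
    return dlt.read_stream("raw_events").where("event_type IS NOT NULL")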
I have a very long task that creates a bunch of views using Spark SQL and I get the following error at some step: pyspark.sql.utils.AnalysisException: The depth of view 'foobar' exceeds the maximum view resolution depth (100).
I have been searching in Google and SO and couldn't find anyone with a similar error.
I have tried caching the view foobar, but that doesn't help. I'm thinking of creating temporary tables as a workaround, as I would like not to change the current Spark Configuration if possible, but I'm not sure if I'm missing something.
UPDATE:
I tried creating tables in parquet format to reference tables and not views, but I still get the same error. I applied that to all the input tables to the SQL query that causes the error.
If it makes a difference, I'm using ANSI SQL, not the python API.
Solution
Using parquet tables worked for me after all. I spotted that I was still missing one table to persist, which is why it wasn't working before.
So I changed my SQL statements from this:
CREATE OR REPLACE TEMPORARY VIEW `VIEW_NAME` AS
SELECT ...
To:
CREATE TABLE `TABLE_NAME` USING PARQUET AS
SELECT ...
This moves all the critical views to parquet tables under spark-warehouse/ (or whatever location you have configured).
Note:
This will write the tables to the master node's disk. Make sure you have enough disk space, or consider writing to an external data store such as S3. Read this as an alternative (and now preferred) solution using checkpoints; a rough sketch of that approach follows.
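A minimal sketch of the checkpoint-based alternative, assuming a placeholder checkpoint directory and a hypothetical upstream view name:

# Checkpointing materializes the DataFrame and truncates its logical plan,
# which stops the view-resolution depth from growing across many chained steps.
spark.sparkContext.setCheckpointDir("dbfs:/tmp/view_checkpoints")  # placeholder path

df = spark.sql("SELECT * FROM some_upstream_view")  # hypothetical upstream view
df = df.checkpoint(eager=True)                      # write to the checkpoint dir and cut lineage
df.createOrReplaceTempView("VIEW_NAME")             # downstream SQL keeps using the same name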
I have a Databricks notebook in which I currently create a view based off several Delta tables, then update some of those same Delta tables based on this view. However, I'm getting incorrect results, because as the Delta tables change, the data in the view changes too. What I effectively need is to take a snapshot of the data at the point the notebook starts to run, which I can then use throughout the notebook, akin to a SQL temporary table. Currently I'm working around this by persisting the data into a table and dropping the table at the end of the notebook, but I wondered if there was a better solution?
The section Pinned view of a continuously updating Delta table across multiple downstream jobs contains the following example code:
version = spark.sql("SELECT max(version) FROM (DESCRIBE HISTORY my_table)")\
.collect()
# Will use the latest version of the table for all operations below
data = spark.table("my_table@v%s" % version[0][0])
data.where("event_type = e1").write.jdbc("table1")
data.where("event_type = e2").write.jdbc("table2")
...
data.where("event_type = e10").write.jdbc("table10")
We're exploring the possibility to use temporary views in spark and relate this to some actual file storage - or to other temporary views. We want to achieve something like:
Some user uploads data to some S3/hdfs file storage.
A (temporary) view is defined so that spark sql queries can be run against the data.
Some other temporary view (referring to some other data) is created.
A third temporary view is created that joins data from (2) and (3).
By selecting from the view from (4) the user gets a table that reflects the joined data from (2) and (3). This result can be further processed and stored in a new temporary view and so on.
So we end up with a tree of temporary views, each querying its parent temporary views until the leaves load data from the filesystem. Basically we want to store transformation steps (selecting, joining, filtering, modifying, etc.) on the data, without storing new versions. The Spark SQL support and temporary views seem like a good fit.
We did some successful testing. The idea is to store the specification of these temporary views in our application and recreate them during startup (as temporary or global views).
I'm not sure whether this is a viable solution. One problem is that we need to know how the temporary views are related (which one queries which). We create them like:
sparkSession.sql("select * from other_temp_view").createTempView(name)
So, when this is run, we have to make sure that other_temp_view has already been created in the session. I'm not sure how this can be achieved. One idea is to store a timestamp and recreate them in the same order (a rough sketch of the recreate-at-startup idea follows). This could be OK since our views will most likely have to be "immutable": we're not allowed to change a query that other queries rely on.
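A minimal sketch of that idea, assuming a hypothetical list of stored view specifications in which each entry records the views it selects from (the paths, column names, and view names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stored specifications: name, defining SQL, and the names of
# the views (or source data) each one selects from.
view_specs = [
    {"name": "raw_view",   "sql": "SELECT * FROM parquet.`s3://bucket/uploads/`", "deps": []},
    {"name": "other_view", "sql": "SELECT * FROM parquet.`s3://bucket/other/`",   "deps": []},
    {"name": "joined_view",
     "sql": "SELECT a.*, b.extra FROM raw_view a JOIN other_view b ON a.id = b.id",
     "deps": ["raw_view", "other_view"]},
]

def recreate_views(specs):
    """Create each temp view only after all of its dependencies exist."""
    created = set()
    pending = list(specs)
    while pending:
        progressed = False
        for spec in list(pending):
            if all(dep in created for dep in spec["deps"]):
                spark.sql(spec["sql"]).createOrReplaceTempView(spec["name"])
                created.add(spec["name"])
                pending.remove(spec)
                progressed = True
        if not progressed:
            raise ValueError("Circular or missing dependency among views: "
                             + ", ".join(s["name"] for s in pending))

recreate_views(view_specs)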
Any thoughts would be most appreciated.
I would definitely go with the SessionCatalog object:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-SessionCatalog.html
You can access it with spark.sessionState.catalog
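If you're working from PySpark rather than Scala, the public Catalog API (not the internal SessionCatalog) at least lets you enumerate the temporary views registered in a session; a minimal sketch:

# Lists tables and views visible in the current session, flagging
# session-scoped temporary views via the isTemporary field.
for table in spark.catalog.listTables():
    if table.isTemporary:
        print(table.name, table.tableType)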
In several places it's advised to design our Cassandra tables according to the queries we are going to perform on them. In this article by DataScale they state this:
The truth is that having many similar tables with similar data is a good thing in Cassandra. Limit the primary key to exactly what you’ll be searching with. If you plan on searching the data with a similar, but different criteria, then make it a separate table. There is no drawback for having the same data stored differently. Duplication of data is your friend in Cassandra.
[...]
If you need to store the same piece of data in 14 different tables, then write it out 14 times. There isn’t a handicap against multiple writes.
I have understood this, and now my question is: provided that I have an existing table, say
CREATE TABLE invoices (
    id_invoice int PRIMARY KEY,
    year int,
    id_client int,
    type_invoice text
);
But I want to query by year and type instead, so I'd like to have something like
CREATE TABLE invoices_yr (
    id_invoice int,
    year int,
    id_client int,
    type_invoice text,
    PRIMARY KEY (type_invoice, year)
);
With type_invoice as the partition key and year as the clustering key, what's the preferred way to copy the data from one table to the other so I can run optimized queries later on?
My Cassandra version:
user#cqlsh> show version;
[cqlsh 5.0.1 | Cassandra 3.5.0 | CQL spec 3.4.0 | Native protocol v4]
You can use the cqlsh COPY command:
To copy your invoices data into a CSV file, use:
COPY invoices(id_invoice, year, id_client, type_invoice) TO 'invoices.csv';
And to copy back from the CSV file into the table, in your case invoices_yr, use:
COPY invoices_yr(id_invoice, year, id_client, type_invoice) FROM 'invoices.csv';
If you have a huge amount of data, you can use the SSTable writer to write and sstableloader to load the data faster.
http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
To echo what was said about the COPY command, it is a great solution for something like this.
However, I will disagree with what was said about the Bulk Loader, as it is infinitely harder to use. Specifically, because you need to run it on every node (whereas COPY needs to only be run on a single node).
To help COPY scale for larger data sets, you can use the PAGETIMEOUT and PAGESIZE parameters.
COPY invoices(id_invoice, year, id_client, type_invoice)
TO 'invoices.csv' WITH PAGETIMEOUT=40 AND PAGESIZE=20;
Using these parameters appropriately, I have used COPY to successfully export/import 370 million rows before.
For more info, check out this article titled: New options and better performance in cqlsh copy.
An alternative to using COPY command (see other answers for examples) or Spark to migrate data is to create a materialized view to do the denormalization for you.
CREATE MATERIALIZED VIEW invoices_yr AS
    SELECT * FROM invoices
    WHERE id_client IS NOT NULL AND type_invoice IS NOT NULL AND year IS NOT NULL
    PRIMARY KEY ((type_invoice), year, id_client)
    WITH CLUSTERING ORDER BY (year DESC);
Cassandra will then fill the view for you, so you won't have to migrate the data yourself. With 3.5, be aware that repairs don't work well with materialized views (see CASSANDRA-12888).
Note that materialized views are probably not the best idea to use; they have since been moved to "experimental" status.
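For the Spark route mentioned above, a minimal sketch using the spark-cassandra-connector (the keyspace name shop is a placeholder, and the connector must be on the classpath with spark.cassandra.connection.host configured):

# Read the source table and append its rows into the query-optimized table.
invoices = (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(table="invoices", keyspace="shop")
            .load())

(invoices.write
 .format("org.apache.spark.sql.cassandra")
 .options(table="invoices_yr", keyspace="shop")
 .mode("append")
 .save())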