We have ingested CDC data into an S3 raw layer. The CDC data arrives as JSON files and contains DML records (deletes, updates, etc.). We used Spark streaming with Delta Lake to de-duplicate the S3 raw-layer data and move it to a standard layer, partitioning the table on a certain column.
I have 2 questions:
Can we also use indexing on a Delta table (if supported) to index on the primary key in addition to partitioning, i.e. have the data inside each partition indexed on the primary key?
Which visualization tool and supporting infrastructure (Spark or Presto as the compute engine) can we use to analyze the Delta table in S3, and what would be the best approach given that the data volume is very high? Should we move the Delta table to an RDBMS and run visualization on top of that (although this will incur cost)?
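For reference, a minimal hedged sketch of the kind of streaming de-dup job described above, assuming a hypothetical raw path s3://bucket/raw/cdc/, an existing standard-layer Delta table at s3://bucket/standard/events partitioned by event_date, a key column id and an op column carrying the DML type (all names are illustrative, not from the original post):

import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("cdc-dedup").getOrCreate()

// Stream the raw CDC JSON files (schema simplified for the sketch).
val raw = spark.readStream
  .schema("id STRING, op STRING, event_date DATE, payload STRING, ts TIMESTAMP")
  .json("s3://bucket/raw/cdc/")

// Per micro-batch: keep only the latest record per key, then MERGE into the standard-layer Delta table.
def upsertBatch(batch: DataFrame, batchId: Long): Unit = {
  val latest = batch
    .withColumn("rn", row_number().over(Window.partitionBy("id").orderBy(col("ts").desc)))
    .filter(col("rn") === 1)
    .drop("rn")

  // Assumes the standard-layer table already exists at this path.
  DeltaTable.forPath(spark, "s3://bucket/standard/events").as("t")
    .merge(latest.as("s"), "t.id = s.id")
    .whenMatched("s.op = 'delete'").delete()
    .whenMatched().updateAll()
    .whenNotMatched("s.op <> 'delete'").insertAll()
    .execute()
}

raw.writeStream
  .foreachBatch(upsertBatch _)
  .option("checkpointLocation", "s3://bucket/checkpoints/events")
  .start()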
Recently I have been working with Spark stateful streaming (mapGroupsWithState and flatMapGroupsWithState). As I'm working on Databricks, I'm trying to write the results of the stateful streaming to a Delta table, but it is not possible in any of the output modes (complete, append, update). Ideally, I would like to keep all states in memory and save periodic snapshots to a Delta table. Do you have any idea how to achieve that?
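As an illustration of the setup being described, here is a minimal hedged sketch of a mapGroupsWithState pipeline whose update-mode output is appended to a Delta table from foreachBatch; the rate source, case classes and paths are made-up placeholders, and whether per-batch appends match the desired "periodic snapshot" semantics is a separate design question:

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(key: String, value: Long)
case class KeyState(key: String, count: Long)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Placeholder source; substitute the real input stream.
val events = spark.readStream.format("rate").load()
  .selectExpr("CAST(value % 10 AS STRING) AS key", "value")
  .as[Event]

// Keep a running count per key in state and emit the current state each micro-batch.
def updateState(key: String, rows: Iterator[Event], state: GroupState[KeyState]): KeyState = {
  val next = KeyState(key, state.getOption.map(_.count).getOrElse(0L) + rows.size)
  state.update(next)
  next
}

val emitted = events
  .groupByKey(_.key)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateState)

// Persist each micro-batch of emitted state to a Delta table (batchId is unused here).
def saveSnapshot(batch: Dataset[KeyState], batchId: Long): Unit =
  batch.write.format("delta").mode("append").save("/delta/state_snapshots")

emitted.writeStream
  .outputMode(OutputMode.Update())
  .foreachBatch(saveSnapshot _)
  .option("checkpointLocation", "/delta/state_snapshots/_checkpoint")
  .start()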
I am importing fact and dimension tables from SQL Server to Azure Data Lake Gen 2.
Should I save the data as "Parquet" or "Delta" if I am going to wrangle the tables to create a dataset useful for running ML models on Azure Databricks?
What is the difference between storing as parquet and delta ?
Delta stores the data as Parquet and just adds a layer over it with advanced features: a history of events (the transaction log) and more flexibility for changing the content, such as update, delete and merge capabilities. This delta link explains quite well how the files are organized.
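For example, a hedged sketch of the kind of in-place changes referred to above, using the Delta Lake Scala API (the path and column names are made up, and an existing SparkSession named spark is assumed):

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions._

val customers = DeltaTable.forPath(spark, "abfss://lake@account.dfs.core.windows.net/delta/customers")

// Update rows in place.
customers.update(col("country") === "UK", Map("country" -> lit("GB")))

// Delete rows.
customers.delete(col("is_test") === true)

// Every change is recorded in the transaction log and shows up as table history.
customers.history().show(false)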
One drawback is that it can get very fragmented after lots of updates, which could hurt performance. Since Azure Data Lake Storage Gen2 is not optimized for large IO anyway, this is not really a big problem, but some optimizations of the Parquet format will not be very effective this way.
I would use Delta, just for the advanced features. It is very handy in scenarios where the data is updated over time, not just appended. An especially nice feature is that you can read the Delta tables as of a given point in time at which they existed (see the SQL AS OF syntax).
This is useful for having consistent training sets (you always get the same training dataset without separating it into individual Parquet files). For the ML models, handling the Delta format as input may be problematic, as likely only a few frameworks can read it directly, so you may need to convert it during some pre-processing step.
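A hedged illustration of the time-travel read mentioned above, which is what makes reproducible training snapshots easy (the path, version number and timestamp are examples only):

// Read the table exactly as it looked at an earlier version or point in time.
val trainV12 = spark.read.format("delta").option("versionAsOf", "12").load("/mnt/lake/features")
val trainJan = spark.read.format("delta").option("timestampAsOf", "2023-01-01").load("/mnt/lake/features")

// Equivalent SQL: SELECT * FROM delta.`/mnt/lake/features` VERSION AS OF 12

// If the ML framework cannot read Delta directly, materialize the snapshot as plain Parquet first.
trainV12.write.mode("overwrite").parquet("/mnt/lake/features_train_v12")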
Delta Lake uses versioned Parquet files to store your data in your cloud storage. Apart from the versions, Delta Lake also stores a transaction log to keep track of all the commits made to the table or blob store directory to provide ACID transactions.
Reference : https://learn.microsoft.com/en-us/azure/databricks/delta/delta-faq
As per the other answers, Delta Lake is a feature layer over Parquet.
Consider: do you actually need the Delta features? If you are just reading the data and wrangling it elsewhere, Delta is just extra complexity for little additional benefit.
Also, Parquet is compatible with almost every data system out there; Delta is widely adopted, but not everything can work with Delta.
Consider using parquet if you don't need a transaction log.
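In practice the difference at write time is just the format, so switching later is cheap; a minimal hedged sketch, assuming an existing DataFrame df and illustrative paths:

// Plain Parquet: maximum compatibility, no transaction log.
df.write.mode("overwrite").parquet("abfss://lake@account.dfs.core.windows.net/curated/sales_parquet")

// Delta: the same Parquet files underneath, plus a _delta_log directory for ACID, history and MERGE.
df.write.format("delta").mode("overwrite").save("abfss://lake@account.dfs.core.windows.net/curated/sales_delta")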
We extract data daily and replace the Delta file with it. However, the write re-creates the same number of Parquet files every time even though there is only a minor change to the data.
I am planning to use Spark SQL (not PySpark) on top of data in Amazon S3, so I believe I need to create a Hive external table and then I can use Spark SQL. But the S3 data is partitioned, and I want the partitions reflected in the Hive external table as well.
What is the best way to manage the Hive table on a daily basis? Every day new partitions can be created or old partitions can be overwritten, so what should be done to keep the Hive external table up to date?
Create an intermediate (staging) table and load into your Hive table with INSERT OVERWRITE on the date partition.
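A hedged sketch of that pattern with Spark SQL (table, column and path names are made up): create the partitioned external table once, then refresh it daily, either by overwriting the affected date partitions from a staging table or by registering newly arrived partition folders.

// One-time setup: external, partitioned table over the S3 data (requires Hive support enabled).
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS events (
    id STRING,
    payload STRING
  )
  PARTITIONED BY (dt STRING)
  STORED AS PARQUET
  LOCATION 's3://bucket/standard/events/'
""")

// Daily refresh, option 1: load the new extract into a staging table, then overwrite only those partitions.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
  INSERT OVERWRITE TABLE events PARTITION (dt)
  SELECT id, payload, dt FROM events_staging
""")

// Daily refresh, option 2: if files land directly in new S3 folders, just register the new partitions.
spark.sql("MSCK REPAIR TABLE events")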
What is the purpose of Spark Delta tables? Are they meant to store data permanently, or do they only hold the processing data as long as the session lasts? How can I view them in a Spark cluster, and what database do they belong to?
What is the purpose of Spark Delta tables?
The primary goal is to enable single-table transactional writes in multi-cluster setups. This is achieved by keeping a transaction log (an idea very similar to append-only tables in typical database systems).
Are they meant to store data permanently, or do they only hold the processing data as long as the session lasts?
They are persistent and, by definition, available across sessions.
How can I view them in a Spark cluster, and what database do they belong to?
Same as any other table in Spark. They are not specific to any database and are written using the delta format.
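For instance, a hedged sketch of how such a table might be created and then inspected from a later session (names are illustrative, df is an existing DataFrame): it is stored permanently in the metastore like any other table.

// Save as a managed Delta table in the current database.
df.write.format("delta").mode("overwrite").saveAsTable("sales_delta")

// Any later session can query it like a normal table.
spark.sql("SELECT COUNT(*) FROM sales_delta").show()

// Inspect the format, location and other details, and list the tables of the current database.
spark.sql("DESCRIBE DETAIL sales_delta").show(false)
spark.sql("SHOW TABLES").show()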
My Cassandra schema contains a table with a partition key which is a timestamp, and a parameter column which is a clustering key.
Each partition contains 10k+ rows. This is logging data at a rate of 1 partition per second.
On the other hand, users can define "datasets", and I have another table which contains the "dataset name" as partition key and a timestamp clustering column referring to the other table (so a "dataset" is a list of partition keys).
Of course what I would like to do looks like an anti-pattern for Cassandra as I'd like to join two tables.
However using Spark SQL I can run such a query and perform the JOIN.
SELECT * from datasets JOIN data
WHERE data.timestamp = datasets.timestamp AND datasets.name = 'my_dataset'
Now the question is: is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?
Edit: fixed the answer with regard to join optimization.
is Spark SQL smart enough to read only the partitions of data which correspond to the timestamps defined in datasets?
No. In fact, since you provide the partition key for the datasets table, the Spark/Cassandra connector will perform predicate pushdown and execute the partition restriction directly in Cassandra with CQL. But there will be no predicate pushdown for the join operation itself unless you use the RDD API with joinWithCassandraTable().
See here for all possible predicate pushdown situations: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/BasicCassandraPredicatePushDown.scala
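For completeness, a hedged sketch of the RDD-side join mentioned above (keyspace, table and column names are illustrative, and an existing SparkContext sc configured for Cassandra is assumed); joinWithCassandraTable pushes the per-partition-key lookups down to Cassandra instead of scanning the whole data table:

import com.datastax.spark.connector._

// Read the single `datasets` partition for the given name (the restriction is executed in Cassandra as CQL).
val datasetKeys = sc.cassandraTable("ks", "datasets")
  .where("name = ?", "my_dataset")
  .select("timestamp")

// For each timestamp, fetch only the matching partition of `data` (joined on its partition key).
val joined = datasetKeys.joinWithCassandraTable("ks", "data")

joined.take(10).foreach(println)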