One of my Delta tables is being used by an external service. The main requirement here is good performance, and the tool performs really poorly when reading the Delta format: it ignores the Delta log and simply reads the Parquet files in a given directory, so it would otherwise pick up all existing versions of the table at once. What we currently do, therefore, is run the VACUUM command to keep only the latest version of the data in the table.
I would like to move away from using VACUUM here because of concurrency issues and the high cost it incurs on a table with a large number of partitions. Say my Delta table is currently partitioned on columns A and B. Is there a way to force Delta to write the Parquet files belonging to different versions of the table into separate directories,
so that I have a path which I know contains only the files belonging to the latest version of my Delta table?
I.e.:
delta_table/A/B/version_1/
-> new version created ->
delta_table/A/B/
    version_1/
    version_2/
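For context, the cleanup step described above is roughly the following (a sketch only; the table path and the zero-hour retention are assumptions, and retaining 0 hours requires disabling Delta's retention duration check):

// allow VACUUM with a retention below the default 7 days (use with care)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
// remove all files that no longer belong to the latest table version
spark.sql("VACUUM delta.`/mnt/data/delta_table` RETAIN 0 HOURS")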
Check out delta-rs.
You can install it with pip install deltalake.
Here's how to get all the latest files in the Delta table:
from deltalake import DeltaTable

dt = DeltaTable("resources/delta/1")
# files() returns the relative paths of the Parquet files in the latest table version
filenames = ["resources/delta/1/" + f for f in dt.files()]
delta-rs doesn't have a Spark dependency, so it's portable and lightweight.
Related
I am importing fact and dimension tables from SQL Server to Azure Data Lake Gen 2.
Should I save the data as "Parquet" or "Delta" if I am going to wrangle the tables to create a dataset useful for running ML models on Azure Databricks?
What is the difference between storing as Parquet and as Delta?
Delta stores the data as Parquet and just adds a layer on top of it with advanced features: a history of events (the transaction log) and more flexibility for changing the content, i.e. update, delete and merge capabilities. This delta link explains quite well how the files are organized.
One drawback is that it can get very fragmented with lots of updates, which could be harmful for performance. As Azure Data Lake Store Gen2 is not optimized for large IO anyway, this is not really a big problem, but some optimizations of the Parquet format will not be very effective this way.
I would use Delta, just for the advanced features. It is very handy in scenarios where the data is updated over time, not just appended. An especially nice feature is that you can read the Delta tables as they existed at a given point in time:
SQL as of syntax
This is useful for having consistent training sets (always the same training dataset, without splitting it off into separate Parquet files). For the ML models, handling the Delta format as input may be problematic, as likely only a few frameworks will be able to read it directly, so you will need to convert it during some pre-processing step.
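For illustration, reading such a point-in-time snapshot can look like this (a sketch; the path, table name and timestamp are placeholders):

// Delta time travel via the DataFrame reader
val snapshot = spark.read.format("delta")
  .option("timestampAsOf", "2021-01-01")
  .load("/mnt/delta/training_data")

// equivalent SQL "AS OF" form
val snapshotSql = spark.sql("SELECT * FROM training_data TIMESTAMP AS OF '2021-01-01'")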
Delta Lake uses versioned Parquet files to store your data in your cloud storage. Apart from the versions, Delta Lake also stores a transaction log to keep track of all the commits made to the table or blob store directory to provide ACID transactions.
Reference: https://learn.microsoft.com/en-us/azure/databricks/delta/delta-faq
As per the other answers, Delta Lake is a feature layer over Parquet.
Consider: do you need the Delta features? If you are just reading the data and wrangling it elsewhere, Delta is just extra complexity for little additional benefit.
Also, Parquet is compatible with almost every data system out there; Delta is widely adopted, but not everything can work with it.
Consider using Parquet if you don't need a transaction log.
We extract data daily and replace it with the Delta file. However, it re-creates the same number of Parquet files every time, even though there is only a minor change to the data.
In our data pipelines, we ingest CDC events from data sources and write these changes into an "incremental data" folder in AVRO format.
Then, periodically, we run Spark jobs to merge this "incremental data" with the current version of the "snapshot table" (ORC format) to get the latest version of the upstream snapshot.
During this merge logic:
1) we load the "incremental data" as a DataFrame df1
2) load the current "snapshot table" as a DataFrame df2
3) merge df1 and df2, de-duplicating IDs and taking the latest version of each row (using the update_timestamp column)
This logic loads the entire data of both the "incremental data" and the current "snapshot table" into Spark memory, which can be quite huge depending on the database.
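For reference, that merge logic can be sketched roughly as follows (the paths, the id column and the assumption of identical schemas are made up for illustration; this is not our exact code):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// 1) load the incremental CDC data and 2) the current snapshot
val df1 = spark.read.format("avro").load("/data/incremental")  // assumed path; needs the spark-avro module
val df2 = spark.read.orc("/data/snapshot")                     // assumed path

// 3) union both and keep only the latest row per id, based on update_timestamp
val latestPerId = Window.partitionBy("id").orderBy(col("update_timestamp").desc)
val merged = df1.union(df2)
  .withColumn("rn", row_number().over(latestPerId))
  .filter(col("rn") === 1)
  .drop("rn")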
I noticed that in Delta Lake a similar operation is done using the following code:
import io.delta.tables._
import org.apache.spark.sql.functions._
val updatesDF = ... // define the updates DataFrame[date, eventId, data]

DeltaTable.forPath(spark, "/data/events/")
  .as("events")
  .merge(
    updatesDF.as("updates"),
    "events.eventId = updates.eventId")
  .whenMatched
  .updateExpr(
    Map("data" -> "updates.data"))
  .whenNotMatched
  .insertExpr(
    Map(
      "date" -> "updates.date",
      "eventId" -> "updates.eventId",
      "data" -> "updates.data"))
  .execute()
Here, the "updatesDF" can be considered our "incremental data", which comes from a CDC source.
My questions:
1) How does merge/upsert work internally? Does it load the entire "updatesDF" and "/data/events/" into Spark memory?
2) If not, does it apply the delta changes in a way similar to Apache Hudi?
3) During deduplication, how does this upsert logic know to take the latest version of a record? I don't see any setting to specify the "update timestamp" column.
1) How does merge/upsert work internally? Does it load the entire "updatesDF" and
"/data/events/" into Spark memory?
No, Spark does not need to load the entire Delta DataFrame it needs to update into memory.
It wouldn't be scalable otherwise.
The approach it takes is very similar to other Spark jobs: the whole table is transparently split into multiple partitions if the dataset is large enough (or you could create explicit partitioning). Each partition is then assigned a single task that makes up part of your merge job. Tasks can run on different Spark executors, etc.
2) If not, does it apply the delta changes in a way similar to Apache Hudi?
I have heard about Apache Hudi, but haven't looked at it.
Internally, Delta looks like versioned Parquet files.
Changes to the table are stored as ordered, atomic units called commits.
When you save a table, look at what files it has: it will have files like 000000.json, 000001.json, etc., and each of them will reference a set of operations on the underlying Parquet files in subdirectories. For example, 000000.json will say that this version in time references Parquet files 001 and 002, while 000001.json will say that this version in time should no longer reference those two older Parquet files and should only use Parquet file 003.
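To see this for yourself, you can inspect the transaction log directly; a minimal sketch (the table path is a placeholder):

import org.apache.spark.sql.functions.col

// every commit file in _delta_log is a list of JSON actions ("add"/"remove" of parquet files)
val log = spark.read.json("/data/events/_delta_log/*.json")
log.printSchema()                                   // add, remove, commitInfo, metaData, ...
log.select(col("add.path")).na.drop().show(false)   // parquet files added by the commits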
3) During deduplication, how does this upsert logic know to take the latest version of a record?
I don't see any setting to specify the "update timestamp" column.
By default it references the most recent changeset.
Timestamping is internal to how this versioning is implemented in Delta.
You can, however, reference an older snapshot through the AS OF syntax; see
https://docs.databricks.com/delta/delta-batch.html#syntax
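A minimal sketch of reading such an older snapshot (the version number is a placeholder):

// read the table as of a specific commit version (or use "timestampAsOf" instead)
val oldSnapshot = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("/data/events/")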
I'm trying to learn the whole open source big data stack, and I've started with HDFS, Hadoop MapReduce and Spark. I'm more or less limited to MapReduce and Spark (SQL?) for "ETL" and HDFS for storage, with no other limitations for anything else.
I have a situation like this:
My Data Sources
Data Source 1 (DS1): Lots of data, totaling around 1 TB. I have IDs (let's call them ID1) inside each row, used as a key. Format: 1000s of JSON files.
Data Source 2 (DS2): Additional "metadata" for data source 1. I have IDs (let's call them ID2) inside each row - used as a key. Format: Single TXT file
Data Source 3 (DS3): Mapping between Data Source 1 and 2. Only pairs of ID1, ID2 in CSV files.
My workspace
I currently have a VM with enough disk space, about 128 GB of RAM and 16 CPUs to handle my problem (the whole project is for research, not for production use). I have CentOS 7 and Cloudera 6.x installed. Currently, I'm using HDFS, MapReduce and Spark.
The task
I need only some attributes (ID and a few strings) from Data Source 1. My guess is that it comes to less than 10% in data size.
I need to connect ID1s from DS3 (pairs: ID1, ID2) to IDs in DS1 and ID2s from DS3 (pairs: ID1, ID2) to IDs in DS2.
I need to add attributes from DS2 (using "mapping" from the previous bullet) to my extracted attributes from DS1
I need to make some "queries", like:
Find the most used words by years
Find the most common words, used by a certain author
Find the most common words, used by a certain author, on a yearly basis
etc.
I need to visualize data (i.e. wordclouds, histograms, etc.) at the end.
My questions:
Which tool should I use to extract data from the JSON files in the most efficient way: MapReduce or Spark (SQL?)?
I have arrays inside the JSON. I know the explode function in Spark can transpose my data, but what is the best way to go here? Is it best to
extract the IDs from DS1, put the exploded data next to them, and write them to new files? Or is it better to combine everything? How do I achieve this with Hadoop or Spark?
My current idea was to do something like this:
Extract the attributes needed (except arrays) from DS1 with Spark and write them to CSV files.
Extract the attributes needed (exploded arrays only + IDs) from DS1 with Spark and write them to CSV files, each exploded attribute to its own file(s).
This means I have extracted all the data I need and can easily connect it with only one ID. I then wanted to write queries for the specific questions and run them as MapReduce jobs.
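A rough Spark sketch of that extraction and join plan (all paths, column names and file layouts below are made up for illustration):

import org.apache.spark.sql.functions._

val ds1 = spark.read.json("/data/ds1/*.json")                           // large JSON source
val ds2 = spark.read.option("sep", "\t").csv("/data/ds2/metadata.txt")
  .toDF("ID2", "author", "year")                                        // assumed metadata layout
val ds3 = spark.read.csv("/data/ds3/*.csv")
  .toDF("ID1", "ID2")                                                   // the ID1 <-> ID2 mapping

// keep only the needed DS1 attributes and explode the array column into one row per element
val ds1Small = ds1.select(col("ID1"), explode(col("words")).as("word"))

// connect DS1 -> mapping -> DS2 metadata
val joined = ds1Small.join(ds3, "ID1").join(ds2, "ID2")
joined.write.mode("overwrite").parquet("/data/extracted")

// example query: most used words by year
joined.groupBy("year", "word").count().orderBy(desc("count")).show()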
The question: Is this a good idea? If not, what can I do better? Should I insert data into a database? If yes, which one?
Thanks in advance!
Thanks for asking! Having been a big data developer for the last 1.5 years, with experience in both MapReduce and Spark, I think I can point you in the right direction.
The final goals you want to achieve can be reached with either MapReduce or Spark. For visualization you can use Apache Zeppelin, which can run on top of your final data.
Spark jobs are memory-intensive: the whole computation runs in memory (RAM), and only the final result is written to HDFS. MapReduce, on the other hand, uses less memory and writes intermediate stage results to HDFS, which means more I/O operations and more time spent.
You can use Spark's DataFrame feature. You can load structured data (it can also be a plaintext file) directly into a DataFrame, which gives you the required data in tabular format. You can write the DataFrame to a plaintext file, or store it in a Hive table from where you can visualize the data. With MapReduce, on the other hand, you would first have to store the data in a Hive table, then write Hive operations to manipulate it, and store the final data in another Hive table. Writing native MapReduce jobs can be very hectic, so I would suggest not choosing that option.
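A minimal illustration of that DataFrame route (the path and table name are placeholders):

// load structured data straight into a DataFrame ...
val df = spark.read.json("/data/ds1/*.json")

// ... and persist it as a Hive table that Zeppelin or Spark SQL can query later
df.write.mode("overwrite").saveAsTable("extracted_ds1")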
In the end, I would suggest using Spark as the processing engine (128 GB of RAM and 16 cores is enough for Spark) to get your final result as soon as possible.
We have a fact table (30 columns) stored in Parquet files on S3; we also create a table over these files and cache it afterwards. The table is created using this code snippet:
val factTraffic = spark.read.parquet(factTrafficData)
factTraffic.write.mode(SaveMode.Overwrite).saveAsTable("f_traffic")
%sql CACHE TABLE f_traffic
We run many different calculations on this table (these files) and are looking for the best way to cache the data for faster access in subsequent calculations. The problem is that, for some reason, it's faster to read the data from Parquet and do the calculation than to access it from memory. One important note is that we do not use every column; usually we use around 6-7 columns per calculation, and different columns each time.
Is there a way to cache this table in memory so we can access it faster than by reading from Parquet?
It sounds like you're running on Databricks, so your query might be automatically benefitting from the Databricks IO Cache. From the Databricks docs:
The Databricks IO cache accelerates data reads by creating copies of remote files in nodes’ local storage using fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then executed locally, which results in significantly improved reading speed.
The Databricks IO cache supports reading Parquet files from DBFS, Amazon S3, HDFS, Azure Blob Storage, and Azure Data Lake. It does not support other storage formats such as CSV, JSON, and ORC.
The Databricks IO Cache is supported on Databricks Runtime 3.3 or newer. Whether it is enabled by default depends on the instance type that you choose for the workers on your cluster: currently it is enabled automatically for Azure Ls instances and AWS i3 instances (see the AWS and Azure versions of the Databricks documentation for full details).
If this Databricks IO cache is taking effect then explicitly using Spark's RDD cache with an untransformed base table may harm query performance because it will be storing a second redundant copy of the data (and paying a roundtrip decoding and encoding in order to do so).
Explicit caching can still make sense if you're caching a transformed dataset, e.g. after filtering to significantly reduce the data volume, but if you only want to cache a large and untransformed base relation then I'd personally recommend relying on the Databricks IO cache and avoiding Spark's built-in RDD cache.
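For example, explicitly caching only a narrowed-down projection rather than the full base table could look like this (the column names and filter are placeholders):

import org.apache.spark.sql.functions.col

val narrow = spark.table("f_traffic")
  .select("colA", "colB", "colC")   // only the handful of columns actually used
  .where(col("colA") > 0)           // plus whatever filter reduces the volume
  .cache()

narrow.count()                      // force materialization of the cache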
See the full Databricks IO cache documentation for more details, including information on cache warming, monitoring, and a comparison of RDD and Databricks IO caching.
To materialize the dataframe in cache, you should do:
val factTraffic = spark.read.parquet(factTrafficData)
factTraffic.write.mode(SaveMode.Overwrite).saveAsTable("f_traffic")
val df_factTraffic = spark.table("f_traffic").cache
df_factTraffic.rdd.count
// now df_factTraffic is materialized in memory
See also https://stackoverflow.com/a/42719358/1138523
But it's questionable whether this makes sense at all because parquet is a columnar file format (meaning that projection is very efficient), and if you need different columns for each query the caching will not help you.
I'm writing to a partitioned, external Hive table (format Parquet) in Apache Spark 1.6.3. I want to write 1 file per partition (the data is small, and I want to avoid the overhead of many small files). Let's say the partition column is date. I can achieve this like this:
df
  .coalesce(1)
  .write.mode("overwrite").partitionBy("date")
  .insertInto(newTable)
But this gives me only 1 task for the insertInto stage, even though many different files are created. So why isn't Spark using multiple tasks to write the files, instead of using 1 task to write them all?
On the other hand, if I omit the coalesce(1), I get a huge number of tiny files per Hive partition.
So how can I speed up the above code? Or, more generally speaking, how can I control the number of files created when inserting into a Hive table (without using coalesce(1))?
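One commonly suggested alternative to coalesce(1) (an assumption on my part, not something stated above) is to repartition by the partition column, so that all rows for a given date land in the same task and each Hive partition ends up with roughly one file:

import org.apache.spark.sql.functions.col

df
  .repartition(col("date"))                      // all rows of a given date end up in one task
  .write.mode("overwrite").partitionBy("date")
  .insertInto(newTable)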