PySpark partitioning creates an extra empty file for every partition - Azure

I am facing a problem in Azure Databricks. In my notebook I am executing a simple write command with partitioning:
df.write.format('parquet').partitionBy("startYear").save(output_path,header=True)
And I see something like this:
Can someone explain why Spark is creating these additional empty files for every partition and how to disable it?
I tried different write modes, different partitioning, and different Spark versions.

I reproduced the above and got the same results when I used Blob Storage.
Can someone explain why spark is creating this additional empty files
for every partition and how to disable it?
Spark itself won't create these types of files. Blob Storage creates blobs like the ones above when we create parquet files with partitions.
You cannot avoid these files if you use Blob Storage. You can avoid them by using ADLS storage.
These are my Results with ADLS:
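For reference, a minimal sketch of the same partitioned write pointed at an ADLS Gen2 location (the storage account, container, and folder names below are hypothetical placeholders):
# Hypothetical ADLS Gen2 path: abfss://<container>@<account>.dfs.core.windows.net/<folder>
output_path = "abfss://output@mystorageaccount.dfs.core.windows.net/partitioned_data"

# Same write as in the question; here only the startYear=... folders with their
# parquet part files are created, without the extra empty blobs
df.write.format("parquet").mode("overwrite").partitionBy("startYear").save(output_path)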

Related

Writing spark dataframe in Azure Databricks

I am new to Azure Databricks. I have two input files and a Python AI model. I am cleaning the input files and applying the AI model to them to get final probabilities. Reading the files, loading the model, cleaning the data, preprocessing the data, and displaying the output with probabilities takes me only a few minutes.
But when I try to write the result to a table or a parquet file, it takes me more than 4-5 hours. I have tried various approaches with repartition/partitionBy/saveAsTable, but none of them is fast enough.
My output Spark dataframe consists of three columns with 120000000 rows. My shared cluster is a 9-node cluster with 56GB of memory per node.
My doubts are:
1.) Is this slow writing expected behavior in Azure Databricks?
2.) Is it true that we can't tune Spark configurations in Azure Databricks, and that Azure Databricks tunes itself with the available memory?
The performance depends on multiple factors. To investigate further, could you please share the below details:
What is the size of the data?
What is the size of the worker type?
Share the code which you are running.
I would suggest you go through the below articles, which help to improve performance:
Optimize performance with caching
7 Tips to Debug Apache Spark Code Faster with Databricks
Azure Databricks Performance Notes
I have used Azure Databricks and have written data to Azure storage, and it has been fast.
Also, Databricks on Azure is hosted just as it is on AWS, so all Spark configurations can be set.
As Pradeep asked, what is the data size and the number of partitions? You can get the partition count using df.rdd.getNumPartitions().
Have you tried a repartition before the write? Thanks.
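As a rough illustration of that last suggestion, a minimal sketch (the partition count and output path below are hypothetical placeholders, to be tuned to your data size and mount points):
# Check how many partitions the result dataframe currently has
print(df.rdd.getNumPartitions())

# Repartition to a more reasonable number before writing, then write as parquet
df.repartition(200) \
    .write \
    .mode("overwrite") \
    .parquet("/mnt/output/probabilities")  # hypothetical output path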

How to merge Hive partitioned and bucketed files into one big file?

I am working on an Azure HDInsight cluster for big data processing. A few days back I created a partitioned and bucketed table in Hive by merging many files.
Since Azure does not give any option to stop the cluster, I had to delete the cluster to save on cost. The data is stored independently in an Azure storage account. When I create a new cluster using the same storage account, I can see the database and the table using HDFS commands, but Hive cannot read that database or table, maybe because Hive does not have metadata about it.
The only option I am left with is to merge all those partitioned and bucketed files into a single file and then create the table again. So is there any way I can migrate that table to another database, or merge it so that it would be easier to migrate?
You can create an EXTERNAL TABLE (with the same properties as before) pointing to that HDFS location. Since you mentioned it has partitions, you can run MSCK REPAIR TABLE table-name so that you can see the partitions as well.
Hope this helps.
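As an illustration, the same idea expressed from a PySpark session with Hive support (the database, table, columns, partition column, and storage location below are all hypothetical; match them to your original table definition, including the bucketing clause if you need it):
# Recreate the table metadata on top of the existing files in the storage account
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS mydb.my_table (
        id BIGINT,
        value STRING
    )
    PARTITIONED BY (year INT)
    STORED AS PARQUET
    LOCATION 'wasbs://container@account.blob.core.windows.net/path/to/table'
""")

# Register the existing partition directories with the metastore
spark.sql("MSCK REPAIR TABLE mydb.my_table")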

Write a csv file into azure blob storage

I am trying to use pyspark to analyze my data in Databricks notebooks. Blob storage has been mounted on the Databricks cluster, and after analyzing, I would like to write the CSV back into blob storage. Because pyspark works in a distributed fashion, the CSV file is broken into small blocks and written to blob storage. How can I overcome this and write a single CSV file to blob storage when doing the analysis with pyspark? Thanks.
Do you really want a single file? If yes, the only way you can overcome it is by merging all the small CSV files into a single CSV file. You can make use of the map function on the Databricks cluster to merge them, or maybe you can use some background job to do the same.
Have a look here: https://forums.databricks.com/questions/14851/how-to-concat-lots-of-1mb-cvs-files-in-pyspark.html
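Another commonly used workaround (not from the linked thread; just a minimal sketch, assuming a hypothetical mounted blob path) is to collapse the dataframe to a single partition before writing, at the cost of funnelling all data through one task:
# Write everything through a single task so only one part-*.csv file is produced.
# Spark still creates a folder containing that single part file plus marker files.
df.coalesce(1) \
    .write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv("/mnt/blobmount/analysis_output")  # hypothetical mounted blob path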

DATABRICKS DBFS

I need some clarity on Databricks DBFS.
In simple basic terms, what is it, what is the purpose of it and what does it allow me to do?
The Databricks documentation says, to this effect:
"Files in DBFS persist to Azure Blob storage, so you won’t lose data even after you terminate a cluster."
Any insight will be helpful; I haven't been able to find documentation that goes into the details of it from an architecture and usage perspective.
I have experience with DBFS; it is a great storage layer holding data that you can upload from your local computer using the DBFS CLI! The CLI setup is a bit tricky, but once you manage it, you can easily move whole folders around in this environment (remember to use --overwrite!). With the CLI you can:
create folders
upload files
modify, remove files and folders
With Scala you can easily pull in the data you store in this storage with code like this:
val df1 = spark
.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("dbfs:/foldername/test.csv")
.select("some_column_name")  // placeholder: select the column(s) you need
Or read in the whole folder to process all the CSV files available:
val df1 = spark
.read
.format("csv")
.option("header", "true")
.option("inferSchema", "true")
.load("dbfs:/foldername/*.csv")
.select("some_column_name")  // placeholder: select the column(s) you need
I think it is easy to use and learn. I hope you find this info helpful!
Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters.
DBFS is an abstraction on top of scalable object storage and offers the following benefits:
1) Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
2) Allows you to interact with object storage using directory and file semantics instead of storage URLs.
3) Persists files to object storage (Blob), so you won’t lose data after you terminate a cluster.
The below link will help you get a better understanding of the Databricks utility commands:
databricks-file-system link
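From a notebook, the dbutils.fs utilities expose the same directory/file semantics in Python; the paths below are just hypothetical examples:
# List the DBFS root
display(dbutils.fs.ls("dbfs:/"))

# Create a folder and write a small text file into it
dbutils.fs.mkdirs("dbfs:/foldername")
dbutils.fs.put("dbfs:/foldername/hello.txt", "hello from DBFS", overwrite=True)

# Copy and remove files
dbutils.fs.cp("dbfs:/foldername/hello.txt", "dbfs:/foldername/hello_copy.txt")
dbutils.fs.rm("dbfs:/foldername/hello_copy.txt")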
A few points in addition to the other answers worth mentioning:
AFAIK, you don’t pay for storage costs associated with DBFS. Instead you pay an hourly fee to run jobs on DBX.
Even though it is storing the data in blob/S3 in the cloud, you can’t access that storage directly. That means you have to use the DBX APIs or CLI to access this storage.
Which leads to the third and obvious point: using DBFS will more tightly couple your Spark applications to DBX, which may or may not be what you want.

Spark Permanent Tables Management

I have a question regarding best practices for managing permanent tables in Spark. I have been working previously with Databricks, and in that context, Databricks manages permanent tables so you do not have to 'create' or reference them each time a cluster is launched.
Let's say that in a Spark cluster session a permanent table is created with the saveAsTable command, using the option to partition the table. Data is stored in S3 as parquet files.
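For concreteness, the kind of write being described might look roughly like this (the database/table name, partition column, and S3 path are hypothetical placeholders):
# Create a metastore table backed by partitioned parquet files at an explicit S3 path
df.write \
    .format("parquet") \
    .partitionBy("event_date") \
    .option("path", "s3://my-bucket/tables/events") \
    .saveAsTable("mydb.events")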
Next day, a new cluster is created and it needs to access that table for different purposes:
SQL query for exploratory analysis
ETL process for appending a new chunk of data
What is the best way to make the saved table available again as the same table, with the same structure/options/path? Maybe there is a way to store Hive metastore settings to be reused between Spark sessions? Or maybe, each time a Spark cluster is created, I should run CREATE EXTERNAL TABLE with the correct options to tell it the format (parquet), the partitioning, and the path?
Furthermore, if I want to access those parquet files from another application, e.g. Apache Impala, is there a way to store and retrieve the Hive metastore information, or does the table have to be created again?
Thanks
