How to write partitioned Parquet files to Blob Storage - Azure

I want to load data from an on-premises SQL Server to Blob Storage with a Copy activity in ADF. The target file is Parquet and is about 5 GB.
The pipeline works well and writes a single Parquet file; now I need to split it into multiple Parquet files to optimize loading with PolyBase and for other uses.
With Spark we can split the output into multiple files with this syntax:
df.repartition(5).write.parquet("path")
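For context, a slightly fuller sketch of that Spark approach, writing directly to Blob Storage; the storage account, container, key setting, and paths below are made-up placeholders:
# Hedged sketch: repartition a DataFrame and write it out as multiple Parquet
# files in Azure Blob Storage. Account, container, key, and paths are
# hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-parquet").getOrCreate()

# Credentials for the (hypothetical) storage account "myaccount".
spark.conf.set(
    "fs.azure.account.key.myaccount.blob.core.windows.net",
    "<storage-account-key>")

df = spark.read.parquet(
    "wasbs://mycontainer@myaccount.blob.core.windows.net/staging/table.parquet")

# repartition(5) yields five part files instead of one large file.
df.repartition(5).write.mode("overwrite").parquet(
    "wasbs://mycontainer@myaccount.blob.core.windows.net/curated/table/")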

Short question, short answer.
Partitioned data: https://learn.microsoft.com/en-us/azure/data-factory/how-to-read-write-partitioned-data
Parquet format: https://learn.microsoft.com/en-us/azure/data-factory/format-parquet
Blob storage connector: https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-blob-storage
Hope this helped!

Related

Use Unmanaged table in Delta lake on Top of ADLS Gen2

I use ADF to ingest data from SQL Server into ADLS Gen2 as Snappy-compressed Parquet, but the file in the sink grows up to 120 GB. That size causes me a lot of problems when I read the file in Spark and join its data with many other Parquet files.
I am thinking of using a Delta Lake unmanaged table with its location pointing to the ADLS folder. I am able to create an unmanaged table if I don't specify any partitions, using this:
CONVERT TO DELTA parquet.`<path to the folder containing the Parquet file(s)>`
But if I want to partition this table for query optimization:
CONVERT TO DELTA parquet.`<path to the folder containing the Parquet file(s)>` PARTITIONED BY (<partition_column> <datatype>)
It gives me the error below (text copied from the screenshot):
org.apache.spark.sql.AnalysisException: Expecting 1 partition column(s): [<PARTITIONED_COLUMN>], but found 0 partition column(s): [] from parsing the file name: abfss://mydirectory#myADLS.dfs.core.windows.net/level1/Level2/Table1.parquet.snappy;
There is no way for me to create this Parquet file with partition details using ADF (I am open to suggestions).
Am I using the wrong syntax, or can this even be done?
OK, I found the answer to this. When you convert Parquet files to Delta using the above approach, Delta looks for a directory structure that carries the partition information, with folder names based on the column mentioned in the PARTITIONED BY clause.
For example, I have a folder called /Parent. Inside it is a directory structure that encodes the partition information; the partitioned Parquet files sit one level further down, inside the partition folders, whose names look like this:
/Parent/Subfolder=0/part-00000-62ef2efd-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=1/part-00000-fsgvfabv-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=2/part-00000-fbfdfbfe-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
/Parent/Subfolder=3/part-00000-gbgdbdtb-b88b-4dd1-ba1e-3a146e986212.c000.snappy.parquet
In this case, Subfolder is the partition column, materialized as folders inside /Parent.
CONVERT TO DELTA parquet.`/Parent/` PARTITIONED BY (Subfolder INT)
will take this directory structure, convert the whole partitioned dataset to Delta, and store the partition information in the metastore.
Summary: this command only works with Parquet files that are already laid out in partition folders. To partition a single Parquet file you would have to take a different route (a rough sketch follows below), which I can explain in more detail if you are interested ;)
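A minimal sketch of that different route, assuming PySpark: rewrite the data partitioned by the column so that the Subfolder=N directory layout exists before running CONVERT TO DELTA. The paths and the partition column here are illustrative assumptions.
# Hedged sketch: rewrite one large Parquet file as a partitioned layout
# (/Parent/Subfolder=N/part-*.snappy.parquet) that CONVERT TO DELTA can use.
# Container, account, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet(
    "abfss://mycontainer@myadls.dfs.core.windows.net/level1/Level2/Table1.parquet.snappy")

(df.write
   .mode("overwrite")
   .partitionBy("Subfolder")   # creates Subfolder=<value>/ directories
   .parquet("abfss://mycontainer@myadls.dfs.core.windows.net/level1/Level2/Parent/"))
After that, the CONVERT TO DELTA command above with PARTITIONED BY (Subfolder INT) should find the partition directories it expects.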

Would S3 Select speed up Spark analyses on Parquet files?

You can use S3 Select with Spark on Amazon EMR and with Databricks, but only for CSV and JSON files. I am guessing that S3 Select isn't offered for columnar file formats because it wouldn't help that much.
Let's say we have a data lake of people with first_name, last_name and country columns.
If the data is stored as CSV files and you run a query like peopleDF.select("first_name").distinct().count(), then S3 will transfer all the data for all the columns to the EC2 cluster to run the computation. This is really inefficient because we don't need any of the last_name and country data to run this query.
If the data is stored as CSV files and you run the query with S3 Select, then S3 will only transfer the data in the first_name column to run the query.
spark
.read
.format("s3select")
.schema(...)
.options(...)
.load("s3://bucket/filename")
.select("first_name")
.distinct()
.count()
If the data is stored in a Parquet data lake and peopleDF.select("first_name").distinct().count() is run, then S3 will only transfer the data in the first_name column to the EC2 cluster. Parquet is a columnar file format, and this is one of its main advantages.
So based on my understanding, S3 Select wouldn't help speed up an analysis on a Parquet data lake because columnar file formats offer the S3 Select optimization out of the box.
I am not sure because a coworker is certain I am wrong and because S3 Select supports the Parquet file format. Can you please confirm that columnar file formats provide the main optimization offered by S3 Select?
This is an interesting question. I don't have any real numbers, though I did write the S3 Select binding code in the hadoop-aws module. Amazon EMR have some numbers, as do Databricks.
For CSV IO: yes, S3 Select will speed things up given aggressive filtering of the source data, e.g. many GB of data read but not much coming back. Why? Although the read itself is slower, you save on the limited bandwidth to your VM.
For Parquet, though, the workers split a large file into parts and schedule the work across them (assuming a splittable compression format like Snappy is used), so more than one worker can process the same file. They also only read a fraction of the data (so the bandwidth benefit is smaller), but they do seek around within that file (so the seek policy needs optimising, or you pay the cost of aborting and reopening HTTP connections).
I'm not convinced that Parquet reads done inside S3 via S3 Select can beat a Spark cluster, provided there's enough capacity in the cluster and you've tuned your S3 client settings (for s3a this means: seek policy, thread pool size, HTTP pool size) for performance too; a sketch of those settings follows below.
Like I said though: I'm not sure. Numbers are welcome.
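For reference, a hedged sketch of those s3a client settings expressed as Spark configuration; the values and the bucket path are illustrative placeholders, not tuning recommendations:
# Hedged sketch: the s3a tuning knobs mentioned above (seek policy, thread
# pool size, HTTP connection pool size), set via Spark's Hadoop configuration.
# Values and the bucket path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-tuning")
    # Seek policy: "random" suits columnar formats such as Parquet/ORC.
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
    # Thread pool used for parallel S3 operations.
    .config("spark.hadoop.fs.s3a.threads.max", "64")
    # Maximum number of pooled HTTP connections to S3.
    .config("spark.hadoop.fs.s3a.connection.maximum", "128")
    .getOrCreate()
)

df = spark.read.parquet("s3a://bucket/path/")  # placeholder location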
Came across this Spark package for S3 Select on Parquet [1].
[1] https://github.com/minio/spark-select

Renaming Exported files from Spark Job

We are currently using a Spark job on Databricks which does processing on our data lake in S3.
Once the processing is done, we export our result to an S3 bucket using a normal
df.write()
The issue is that when we write the DataFrame to S3, the file names are controlled by Spark, but as per our agreement we need to rename these files to meaningful names.
Since S3 doesn't have a rename feature, we are currently using boto3 to copy each file to a key with the expected name.
This process is complex and doesn't scale as more clients come on board.
Is there a better solution for renaming files exported from Spark to S3?
It's not possible to do this directly in Spark's save.
Spark uses the Hadoop output format, which requires data to be written in partitions - that's why you get part- files. If the data is small enough to fit into memory, one workaround is to convert it to a pandas DataFrame and save it as CSV from there.
df_pd = df.toPandas()
df_pd.to_csv("path")
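If the data is too large for pandas, the boto3 copy-based rename mentioned in the question might look roughly like this (a hedged sketch; the bucket, prefix, and target file name are made up):
# Hedged sketch: rename the single part- file Spark wrote under a prefix by
# copying it to a meaningful key and deleting the original. Bucket, prefix,
# and target name are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "my-export-bucket"
prefix = "exports/run1/"

# Locate the part- file Spark produced under the output prefix.
objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)["Contents"]
part_key = next(obj["Key"] for obj in objects if "part-" in obj["Key"])

# Copy it to the agreed name, then remove the original key.
s3.copy_object(
    Bucket=bucket,
    Key=prefix + "result.csv",
    CopySource={"Bucket": bucket, "Key": part_key},
)
s3.delete_object(Bucket=bucket, Key=part_key)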

Best file formats for S3 using Spark for ETL on EMR

We are planning to perform ETL processing using Spark, with the source data sitting on S3. The data volume for ETL processing is less than 100 million. What is the best format for storing the data in S3 in this scenario, i.e. the best compression and file format (text, sequence, Parquet, etc.)?
ORC or Parquet for queries, compressed with Snappy. Avro is another general-purpose format, but it is far less efficient for Spark SQL queries, as you have to scan a lot more data.
Important: at the time of writing (June 2017), you cannot safely use S3 as a direct destination of Spark RDD/DataFrame save() calls. See the Cloud Integration docs for an explanation. Write to HDFS, then copy; a sketch of the Parquet write itself follows below.
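A minimal sketch of that write, assuming PySpark and placeholder paths; Snappy is the default Parquet codec in current Spark versions, but the option below makes the choice explicit:
# Hedged sketch: store ETL output as Snappy-compressed Parquet on HDFS
# (to be copied to S3 afterwards, per the answer above). Paths are
# illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-output").getOrCreate()

df = spark.read.json("hdfs:///staging/raw_events/")   # placeholder source

(df.write
   .mode("overwrite")
   .option("compression", "snappy")   # Snappy-compressed Parquet
   .parquet("hdfs:///warehouse/events_parquet/"))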

Create Hive ORC table from ORC files of other server

We have two clusters: one MapR and another of our own. We want to create a new setup on our own hardware using the MapR data. So far:
1. I copied all the ORC files from the MapR cluster, keeping the same folder structure.
2. Created an ORC-formatted table with its location pointing to the files from step 1.
3. Executed "MSCK REPAIR TABLE <>".
The above steps passed without error, but when I query the partitions, the job fails with the error below:
java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 4958903
at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:193)
at org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:238)
Can someone tell me whether we can create Hive ORC partitioned tables directly from the ORC files?
My storage is Azure Data Lake.
According to your description, my understanding is that you want to copy all the ORC files from one cluster to another and load them as a Hive table.
To do that, try the command below to create an external table over the ORC data:
CREATE EXTERNAL TABLE IF NOT EXISTS <table name> (<column_name column_type>, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS ORC
LOCATION '<orcfile path>'
If you don't know the column list of an ORC file, you can refer to the Hive manual's ORC File Dump Utility to print the ORC file metadata in JSON format via hive --orcfiledump -j -p <location-of-orc-file-or-directory>.
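Since the table in the question is partitioned, here is a hedged sketch of the partitioned variant, run through Spark SQL (it could equally be issued from the Hive CLI); the table, columns, partition column, and location are made-up placeholders:
# Hedged sketch: a partitioned variant of the DDL above plus MSCK REPAIR,
# issued through Spark SQL with Hive support. Table, columns, partition
# column, and the ADLS location are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS my_orc_table (
        id INT,
        name STRING
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
    LOCATION 'adl://myadls.azuredatalakestore.net/warehouse/my_orc_table'
""")

# Register the load_date= partition directories that already exist on disk.
spark.sql("MSCK REPAIR TABLE my_orc_table")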
