Databricks Spark CREATE TABLE takes forever for 1 million small XML files - apache-spark

I have a set of 1 million XML files, each ~14 KB in size, in Azure Blob Storage mounted in Azure Databricks, and I am trying to use CREATE TABLE, expecting one record per file.
The Experiment
The content structure of the files is depicted below. For simplicity and performance experimentation, all content of the files except the <ID> element is kept identical.
<OBSERVATION>
  <HEADER>...</HEADER>
  <RESULT>
    <ID>...</ID>
    <VALUES>...</VALUES>
  </RESULT>
</OBSERVATION>
For parsing/deserialization, I am using spark-xml by Databricks. At this moment, I am expecting records having two columns HEADER and RESULT, which is what I am getting.
CREATE TABLE Observations
USING XML
OPTIONS (
  path "/mnt/blobstorage/records/*.xml",
  rowTag "RESULT",
  rootTag "OBSERVATION",
  excludeAttribute True
)
The Problem
The CREATE TABLE statement runs for 5.5 hours (a SQL query named sql at SQLDriverLocal.scala:87 in the Spark UI), of which only 1 hour is spent in Spark jobs (in the Jobs tab of the Spark UI).
I have noticed that the cell with the CREATE TABLE command remains stuck at Listing files at "/mnt/blobstorage/records/*.xml" for most of that time. At first I thought it was a scaling problem in the storage connector, but I can run the same command on ~500K JSON files of similar size in ~25 s (so is it a problem with XML vs. JSON?).
I also know that spark-xml reads all the files to infer the schema, which might be the bottleneck. To eliminate this possibility, I tried to:
predefine a schema (from only the first XML file)
ingest as plaintext without parsing (using the TEXT provider).
The same problem persists in both cases.
The same statement runs within 20 s for 10K records and in 30 minutes for 200K records. With linear scaling (which is obviously not happening), 1 million records would be done in ~33 minutes (extrapolating from the 10K case: 100 × 20 s).
My Databricks cluster has 1 driver node and 3 worker nodes, each with 256 GB of RAM and 64 cores, so there should not be a caching bottleneck. I have reproduced the issue consistently in multiple runs over 4 days.
The Question
What am I doing wrong here? If there is some partitioning / clustering I can do during the CREATE TABLE, how do I do it?

My guess is that you are running into a small-file problem, since you are processing only about 15 GB in total. I would merge the small files into bigger files of ca. 250 MB each.
As your dataset is still small, you could do the merge on the driver. The following code shows this on a driver node (without considering the optimal file size):
1. Copy the files from Blob Storage to the driver's local filesystem and generate a shell script for the merge:
# copy files from mounted storage to driver-local storage
dbutils.fs.cp("dbfs:/mnt/blobstorage/records/", "file:/databricks/driver/temp/records", recurse=True)

unzipdir = 'temp/records/'
gzipdir = 'temp/gzip/'

# generate a shell script (create the output dir, concatenate all XML files into one,
# then gzip the result) and write it to the driver's local filesystem
script = ("mkdir -p " + gzipdir + "\n" +
          "cat " + unzipdir + "*.xml > " + gzipdir + "all.xml\n" +
          "gzip " + gzipdir + "all.xml")
dbutils.fs.put("file:/databricks/driver/scripts/makeone.sh", script, True)
2. Run the shell script
%sh
sudo sh ./scripts/makeone.sh
3. Copy the files back to the mounted storage
dbutils.fs.mv("file:/databricks/driver/" + gzipdir, "dbfs:/mnt/blobstorage/recordsopt/", recurse=True)
Another important point is that the spark-xml library takes a two-step approach:
1. It parses the data to infer the schema. Unless the parameter samplingRatio is changed, it does this for the whole dataset. Often it is enough to do this on a smaller sample only, or you can predefine the schema (use the parameter schema for this); then you don't need this step at all.
2. It reads the data.
Finally, I would recommend storing the data in Parquet, so that the more sophisticated queries run on a column-based format rather than directly on the XMLs, and using the spark-xml library only for this preprocessing step.
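As an illustration, here is a minimal PySpark sketch of that approach. It assumes the spark-xml package is attached (so the short name "xml" resolves), that spark is the notebook's predefined session, and that the columns and Parquet output path match the question; adjust these to your data.

from pyspark.sql.types import StructType, StructField, StringType

# Predefined schema, so spark-xml skips its schema-inference pass over all files.
# Column names and types are assumptions derived from the <RESULT> snippet above.
result_schema = StructType([
    StructField("ID", StringType(), True),
    StructField("VALUES", StringType(), True),
])

df = (spark.read.format("xml")
      .option("rowTag", "RESULT")
      .option("excludeAttribute", True)
      .schema(result_schema)
      .load("/mnt/blobstorage/records/*.xml"))

# Alternative: keep inference but only sample a fraction of the rows, e.g.
#   .option("samplingRatio", 0.001)

# One-off conversion; later queries run against the Parquet copy (hypothetical path).
df.write.mode("overwrite").parquet("/mnt/blobstorage/observations_parquet/")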

Related

Output Many CSV files, and Combining into one without performance impact with Transform data using mapping data flows via Azure Data Factory

I followed the example below, and all is going well.
https://learn.microsoft.com/en-gb/azure/data-factory/tutorial-data-flow
The tutorial says the following about the output files and rows:
If you followed this tutorial correctly, you should have written 83
rows and 2 columns into your sink folder.
The result from my run is correct, with the same number of rows and columns.
However, looking at the output, the total number of files is 77, not 83 and not 1.
Question: Is it correct to have so many CSV files (77 items)?
Question: How can I combine all the files into one file without slowing down the process?
I can create one file by following the link below, but it warns that doing so slows down the process.
How to remove extra files when sinking CSV files to Azure Data Lake Gen2 with Azure Data Factory data flow?
The number of files generated from the process is dependent upon a number of factors. If you've set the default partitioning in the optimize tab on your sink, that will tell ADF to use Spark's current partitioning mode, which will be based on the number of cores available on the worker nodes. So the number of files will vary based upon how your data is distributed across the workers. You can manually set the number of partitions in the sink's optimize tab. Or, if you wish to name a single output file, you can do that, but it will result in Spark coalescing to a single partition, which is why you see that warning. You may find it takes a little longer to write that file because Spark has to coalesce existing partitions. But that is the nature of a big data distributed processing cluster.
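For reference, the single-file option corresponds to coalescing to one partition in plain Spark; here is a minimal PySpark sketch of the trade-off (the DataFrame and sink paths are placeholders, not part of the tutorial):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).toDF("id")  # stand-in for the mapping-data-flow output

# Default: one CSV file per partition; the partition count depends on how the
# data is spread across the workers, hence output like the 77 files above.
df.write.mode("overwrite").csv("/tmp/sink/many_files/")

# Single output file: Spark first coalesces everything into one partition, so the
# whole write runs in a single task; this is the slowdown the warning refers to.
df.coalesce(1).write.mode("overwrite").csv("/tmp/sink/single_file/")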

LoadIncrementalHFiles: Split occurred while grouping HFiles

I implemented a Spark(v2.4) application that processes raw data and stores it into containerized Hbase(v2.1).
I would like to bulk load the data into Hbase and for that purpose, I use apache/hbase-connectors. I followed this example.
I pre-split the HBase table into 10 regions and transformed each key by hashing it, taking the hash value modulo 10, and concatenating the result as a prefix to the key.
for example: key = a123, newKey = 0_a123 (assume: hash(a123) mod 10 = 0).
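For reference, a minimal Python sketch of that salting scheme (the concrete hash function is an assumption; any stable hash works):

import hashlib

NUM_BUCKETS = 10  # matches the 10 pre-split regions

def salt_key(key: str, buckets: int = NUM_BUCKETS) -> str:
    """Prefix the key with hash(key) mod buckets, e.g. 'a123' -> '0_a123'."""
    # md5 is used here purely as an illustration of a stable hash
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return f"{h % buckets}_{key}"

print(salt_key("a123"))  # prints something like '3_a123', depending on the hash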
When I run my Spark app, I can see that the HFiles have been created, but when I try to doBulkLoad with LoadIncrementalHFiles I get the following error:
LoadIncrementalHFiles: Split occurred while grouping HFiles, retry
attempt 12 with 10 files remaining to group or split
I saw the following solution which I think is similar to what I have already done.
Why does LoadIncrementalHFiles fail?
Should the Hfiles be on the container as well before doing the LoadIncrementalHFiles in a containerized environment?
Should I pre-split Hbase regions differently?
Is there any formula to calculate the number of regions?
In Hbase logs I can see the following error:
regionserver.SecureBulkLoadManager: Failed to complete bulk load
java.io.FileNotFoundException: File ... does not exist
The problem was with the location of the HFiles.
Reading the HBase logs, I saw that it was looking for the HFiles on my host machine, hence the FileNotFoundException.
I mounted the HFiles directory into the HBase container, and the problem was solved.
Since you can hash your rows for better distribution, you might want to pre-split your table by using the Hex region splitter utility. It will automatically figure out how to split the table nicely across the hexadecimal space, based on how many region servers you have. Maybe this can help you bypass the unnecessary splitting on the fly. You can use it from command-line like this:
hbase org.apache.hadoop.hbase.util.RegionSplitter TableName HexStringSplit -c 10 -f CF
TableName is your table name
10 is the number of region servers you have in the cluster
CF is the name of the Column Family to create
The table shouldn't exist when you are launching this.

Apache Spark: How to read millions (5+ million) small files (10kb each) from S3

A high level overview of my goal: I need to find the file(s) (they are in JSON format) that contain a particular ID. Basically need to return a DF (or a list) of the ID and the file name that contains it.
// Read in the data from s3
val dfLogs = spark.read.json("s3://some/path/to/data")
.withColumn("fileSourceName", input_file_name())
// Filter for the ID and select then id and fileSourceName
val results = dfLogs.filter($"id" === "some-unique-id")
.select($"id", $"fileSourceName")
// Return the results
results.show(false)
Sounds simple enough, right? However, the challenge I'm facing is that the S3 directory I'm reading from contains millions of files (approximately 5+ million), averaging about 10 KB each. The small-file problem! To do this I've been spinning up a 5-node cluster (m4.xlarge) on EMR and using Zeppelin to run the above code interactively.
However, I keep getting the following error when running the first Spark statement (the read):
org.apache.thrift.transport.TTransportException at
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
I'm having a hard time finding out more about this error, but I suspect it has to do with the requests my Spark job makes to S3.
Does anyone have suggestions on how to handle so many small files? Should I do an s3-dist-cp from S3 to HDFS on the EMR cluster and then run the query above against HDFS? Or some other option? This is a one-time activity... is it worth creating a very large cluster? Would that improve the performance or solve my error? I've thought about grouping the files together into bigger ones, but I need the unique files that contain the ID (see the sketch at the end of this question).
I would love to change the way in which these files are being aggregated in S3...but there is nothing I can do about it.
Note: I've seen a few posts around here, but they're quite old. Another link, but I don't think it pertains to my situation.
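For the grouping idea above, a minimal PySpark sketch of a one-off compaction that keeps the source file name as a column, so "which file contains this ID" can still be answered afterwards (the compacted path and the partition count of 200 are assumptions; spark is the predefined Zeppelin session):

from pyspark.sql import functions as F

# One-off compaction: read the small JSON files once, remember each record's source
# file, and rewrite the data as a few hundred larger Parquet files.
compacted = (spark.read.json("s3://some/path/to/data")
             .withColumn("fileSourceName", F.input_file_name()))

(compacted
 .repartition(200)  # a few hundred larger files instead of millions of small ones
 .write.mode("overwrite")
 .parquet("s3://some/path/to/data-compacted/"))

# Subsequent lookups run against the compacted copy:
results = (spark.read.parquet("s3://some/path/to/data-compacted/")
           .filter(F.col("id") == "some-unique-id")
           .select("id", "fileSourceName"))
results.show(truncate=False)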

Atomic data reaggregation with Apache Spark

We've implemented batch processing with Apache Spark. A batch arrives once every 15 minutes and contains around 5 GB of data in Parquet format. We store the data partitioned with the schema
/batch=11/dt=20170102/partition=2
batch is a monotonically increasing number, dt is the date, and partition is a number from 0 to 30 derived from clientid; it is needed for faster querying.
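For reference, a minimal PySpark sketch of how such a layout could be produced (the column names, sample row, and HDFS path are assumptions for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# tiny stand-in for a real 5 GB batch
raw_df = spark.createDataFrame(
    [(11, "20170102", "client-42", 3.14)],
    ["batch", "dt", "clientid", "value"],
)

# partition = hash(clientid) mod 31, giving the 0..30 values described above
df = raw_df.withColumn("partition", F.abs(F.hash(F.col("clientid"))) % 31)

(df.write
   .mode("append")
   .partitionBy("batch", "dt", "partition")
   .parquet("hdfs:///data/batches"))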
Data is mainly queried from this structure by date and/or by clientid. From this folder we prepare additional transformations, using the batch id as a pointer.
For one day we get about 3,000 folders (100 batches per day, each with ~30 partition subfolders) and around 3,000,000 files inside. After some time we want to make bigger batches in order to reduce the number of folders and files stored in HDFS.
For example from
/batch=100/dt=.../partition=...
...
/batch=9999/dt=.../partition=...
we want to make
/batch=9999/dt=20170102/partition=...
/batch=9999/dt=20170103/partition=...
etc...
But the problem is that users can run queries on this folder, and if we move data between batches, clients may read the same data twice or miss it entirely.
Can you suggest an appropriate way to consolidate the batches atomically? Or can you suggest a better storage schema for this purpose?

Azure SQL DW data loads taking long time

I am trying to load data from my external tables into SQL DW internal tables. The data is stored in a compressed format in Blob Storage, and the external tables point to the Blob Storage location.
I have around 24 files, around 22 GB in total, and I am trying to load the data from an external table into an internal table on 300 DWU with a largerc resource class service/user account.
My INSERT INTO statement (which is very straightforward) has been running for more than 10 hours.
insert into Trxdata.Details_data select * from Trxdata.Stage_External_Table_details_data;
I also tried the statement below; it has also been running for more than 10 hours.
CREATE TABLE Trxdata.Details_data12
WITH
(
DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT *
FROM Trxdata.Stage_External_Table_details_data
;
I can see both SQL statements running with ACTIVE status in sys.dm_pdw_exec_requests. [I thought it might be a concurrency-slot issue and that the query simply hadn't been given a slot to run in, but that is not the case.]
I was also hoping that scaling up the DWU might improve performance, but looking at the DWU usage in portal.azure.com I am not convinced I should increase it, because the usage chart shows less than 50 DWU for the last 12 hours.
DWU usage chart
So I am trying to understand: how can I find out what is taking so long, and how can I improve the performance of my data load?
I suspect your problem lies with the file(s) being compressed. Many Azure documents state that you will only get one reader per compressed file. As a test, I would suggest you decompress your data and try a load, and see whether decompressing plus loading is faster than the 10 hours you are currently seeing for the compressed data. I also have better luck with several files rather than one large file, if that is an option for your system.
Please have a look at the below blog from SQL CAT on data loading optimizations.
https://blogs.msdn.microsoft.com/sqlcat/2016/02/06/azure-sql-data-warehouse-loading-patterns-and-strategies/
Based on the info provided, a couple things to consider are:
1) Locality of the blob files compared to the DW instance. Make sure they are in the same region.
2) Clustered Columnstore is on by default. If you are loading 22GB of data, a HEAP load may perform better (but not sure on row count either). So:
CREATE TABLE Trxdata.Details_data12
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM Trxdata.Stage_External_Table_details_data ;
If the problem still persists, please file a support ticket:
https://azure.microsoft.com/en-us/documentation/articles/sql-data-warehouse-get-started-create-support-ticket/
You mention that the data is in a compressed format. How many compressed files does the data reside in? For compressed files, you'll achieve more parallelism and thus better performance when the data is spread across many files. Having the data in multiple files is not needed for uncompressed files in order to achieve better performance, so another way to test if this is your performance issue is to un-compress your files.
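If splitting into more files is an option, here is a minimal Python sketch (the file names and part count are assumptions) that splits one large gzip-compressed text file into several smaller gzip files so the load can use more readers:

import gzip

def split_gzip(src, dst_prefix, parts=24):
    # Round-robin the lines of one gzip text file into `parts` smaller gzip files.
    outs = [gzip.open(f"{dst_prefix}_{i:02d}.gz", "wt") for i in range(parts)]
    try:
        with gzip.open(src, "rt") as f:
            for i, line in enumerate(f):
                outs[i % parts].write(line)
    finally:
        for out in outs:
            out.close()

split_gzip("details_data.csv.gz", "details_data_part", parts=24)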
