Convert schema-evolving (SCD) JSON/XML into ORC/Parquet format

We receive a variety of JSON and XML inputs whose schema is always evolving. I want to process them in ORC or Parquet format in a Hadoop/Hive environment for better performance.
I know the following common way of achieving this:
Use a JSON SerDe or XML SerDe library to create a Hive table over the raw data. Then run a select * query on each XML/JSON Hive table and save the result as ORC or Parquet into another table. Once that succeeds, I can drop the SerDe tables and the XML/JSON data.
What are other good ways of doing the same?

As you suggest, that is the most common way to do an offline conversion of JSON/XML data to Parquet format.
Another way is to parse the JSON/XML yourself and create a Parquet Group for each record. Essentially:
1. Open the JSON file.
2. Read each individual record.
3. Open another file.
4. Create a Parquet Group from the record read in #2.
5. Write the Parquet Group to the file opened in #3.
6. Do this for all records in the file.
7. Close both files.
We came up with such a converter for one of our use cases.
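A minimal Python sketch of the record-by-record idea, under stated assumptions: the input is one JSON object per line, only top-level fields evolve, and pyarrow stands in for the Parquet Group API mentioned above (it is a different library, swapped in for brevity). The function names are hypothetical. The part that matters for an evolving schema is the normalization step, which gives every record the union of all fields seen so far.

```python
import json
from typing import Any, Dict, Iterable, List


def read_json_lines(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def unify_schema(records: List[Dict[str, Any]]) -> List[str]:
    """Return the union of all top-level field names, in first-seen order."""
    columns: List[str] = []
    seen = set()
    for record in records:
        for name in record:
            if name not in seen:
                seen.add(name)
                columns.append(name)
    return columns


def normalize(records: List[Dict[str, Any]], columns: List[str]) -> List[Dict[str, Any]]:
    """Give every record every column; fields a record lacks become None."""
    return [{name: record.get(name) for name in columns} for record in records]


def write_parquet(rows: List[Dict[str, Any]], path: str) -> None:
    """Write the rows as one Parquet file (assumes pyarrow is installed)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    pq.write_table(pa.Table.from_pylist(rows), path)


# Two records where the second one introduces a new field.
sample = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "name": "b", "email": "b@example.com"}',
]
records = read_json_lines(sample)
columns = unify_schema(records)
rows = normalize(records, columns)
```

Calling write_parquet(rows, "out.parquet") would then produce a single file whose schema covers both record shapes; nested fields and type conflicts between records would need extra handling that this sketch leaves out.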

Related

What Happens When a Delta Table is Created in Delta Lake?

With the Databricks Lakehouse platform, it is possible to create 'tables' or to be more specific, delta tables using a statement such as the following,
DROP TABLE IF EXISTS People10M;
CREATE TABLE People10M
USING parquet
OPTIONS (
path "/mnt/training/dataframes/people-10m.parquet",
header "true"
);
What I would like to know is, what exactly happens behind the scenes when you create one of these tables? What exactly is a table in this context? Because the data is actually contained in files in data lake (data storage location) that delta lake is running on top of.. right? Are tables some kind of abstraction that allows us to access the data stored in these files using something like SQL?
What does the USING parquet portion of this statement do? Are parquet tables different to CSV tables in some way? Or does this just depend on the format of the source data?
Any links to material that explains this idea would be appreciated? I want to understand this in depth from a technical point of view.
There are a few aspects here. Your table definition is not a Delta Lake table; it is Spark SQL (or Hive) syntax for defining a table. A table is just metadata that lets users work with the data without knowing where it is located, what format it is in, etc. You can read more about databases & tables in the Databricks documentation.
The actual storage format is specified by the USING clause. In your case it is parquet, so when people or code read or write data, the underlying engine first reads the table metadata, figures out the location of the data and the file format, and then uses the corresponding reader or writer.
Delta is another file format (really a storage layer) built on top of Parquet as the data format, adding capabilities such as ACID transactions, time travel, etc. (see doc). If you want to use Delta instead of Parquet, you either need to use CONVERT TO DELTA to convert existing Parquet data into Delta, or specify USING delta when creating a completely new table.
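The two options in the last paragraph could look roughly like this from Python. This is only a sketch: the helper names are hypothetical, it assumes the path from the question is a directory of Parquet files, and `spark` is assumed to be an existing SparkSession with Delta Lake available (nothing here creates one).

```python
from typing import Iterable


def convert_to_delta_sql(parquet_path: str) -> str:
    """SQL that converts an existing Parquet directory to Delta in place."""
    return f"CONVERT TO DELTA parquet.`{parquet_path}`"


def register_delta_table_sql(table: str, location: str) -> str:
    """SQL that defines a table stored as Delta at the given location."""
    return f"CREATE TABLE IF NOT EXISTS {table} USING delta LOCATION '{location}'"


def run(spark, statements: Iterable[str]) -> None:
    """Execute each statement on a SparkSession-like object that has .sql()."""
    for statement in statements:
        spark.sql(statement)


# Path and table name taken from the question above.
path = "/mnt/training/dataframes/people-10m.parquet"
statements = [
    convert_to_delta_sql(path),
    register_delta_table_sql("People10M", path),
]
```

On a cluster you would then call run(spark, statements). The table definition stays pure metadata either way; only the USING format and the files at the location change.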

Update existing records of parquet file in Azure

I am converting my table into Parquet file format using Azure Data Factory, and querying the Parquet files from Databricks for reporting. I want to update only the existing records that were updated in the original SQL Server table. Since the table is very big and this runs daily, I don't want to truncate and reload the entire table, as that would be costly.
Is there any way I can update those Parquet files without a truncate and reload?
Parquet is immutable by default, so the only way to change the data is to rewrite the table. Updating becomes possible if you switch to the Delta file format, which supports updating and deleting entries and also supports the MERGE operation.
You can still use the Parquet format to produce the data, but you then need to use that data to update the Delta table.
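As a rough illustration of the MERGE route, the sketch below builds an upsert statement for a Delta table. The table name, the source view name, and the key column are all hypothetical, and it assumes the changed rows from SQL Server have already been loaded (for example from the daily Parquet export) into a view with the same columns as the target.

```python
def merge_sql(target: str, source: str, key: str) -> str:
    """Build a Delta MERGE that updates matching rows and inserts new ones."""
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical names: a Delta table "reporting.orders" and a view "daily_changes".
statement = merge_sql("reporting.orders", "daily_changes", "order_id")
```

With an existing SparkSession, spark.sql(statement) would apply the day's changes without rewriting the whole table by hand.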
I have found a workaround for this problem:
1. Read the Parquet file into a data frame using any tool or a Python script.
2. Create a temporary table or view from the data frame.
3. Run SQL queries to modify, update, and delete records.
4. Convert the table back into a data frame.
5. Overwrite the existing Parquet files with the new data.
Always go for soft deletes when working with NoSQL. Hard deletes are very costly.
Also, with soft deletes, downstream pipelines can consume the update and act upon it.
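To make the soft-delete idea concrete, here is a small in-memory sketch. It is not tied to any storage engine; the change-record shape (`id`, `op`, `row`) and the `is_deleted` flag are assumptions chosen for illustration.

```python
from typing import Dict, Iterable


def apply_changes(current: Dict[int, dict], changes: Iterable[dict]) -> Dict[int, dict]:
    """Upsert changes by id; a change with op == "delete" only flags the row."""
    result = {key: dict(row) for key, row in current.items()}
    for change in changes:
        key = change["id"]
        if change["op"] == "delete":
            # Soft delete: keep the row so downstream readers can see the event.
            if key in result:
                result[key]["is_deleted"] = True
        else:
            result[key] = {**change["row"], "is_deleted": False}
    return result


table = {1: {"name": "a", "is_deleted": False}, 2: {"name": "b", "is_deleted": False}}
changes = [
    {"id": 1, "op": "upsert", "row": {"name": "a2"}},
    {"id": 2, "op": "delete"},
    {"id": 3, "op": "upsert", "row": {"name": "c"}},
]
updated = apply_changes(table, changes)
```

Readers that only want live rows filter on the flag, while a downstream pipeline can still pick up row 2 and propagate the deletion.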

Question about using parquet for time-series data

I'm exploring ways to store a high volume of data from sensors (time series data), in a way that's scalable and cost-effective.
Currently, I'm writing a CSV file for each sensor, partitioned by date, so my filesystem hierarchy looks like this:
client_id/sensor_id/year/month/day.csv
My goal is to be able to perform SQL queries on this data (typically fetching time ranges for a specific client/sensor, performing aggregations, etc.). I've tried loading it into Postgres and TimescaleDB, but the volume is just too large and the queries are unreasonably slow.
I am now experimenting with using Spark and Parquet files to perform these queries, but I have some questions I haven't been able to answer from my research on this topic, namely:
I am converting this data to parquet files, so I now have something like this:
client_id/sensor_id/year/month/day.parquet
But my concern is that when Spark loads the top folder containing the many Parquet files, the metadata for the rowgroup information is not as optimized as if I used one single parquet file containing all the data, partitioned by client/sensor/year/month/day. Is this true? Or is it the same to have many parquet files or a single partitioned Parquet file? I know that internally the parquet file is stored in a folder hierarchy like the one I am using, but I'm not clear on how that affects the metadata for the file.
The reason I am not able to do this is that I am continuously receiving new data, and from my understanding, I cannot append to a Parquet file because of the way the footer metadata works. Is this correct? Right now, I simply convert the previous day's data to Parquet and create a new file for each sensor of each client.
Thank you.
You can use Structured Streaming with Kafka (as you are already using it) for real-time processing of your data and store the data in Parquet format. And yes, you can append data to a Parquet dataset: SaveMode.Append adds new files to the target directory rather than modifying existing files. For example:
df.write.mode('append').parquet(path)
You can even partition your data on an hourly basis, for example client/sensor/year/month/day/hour, which will further improve query performance.
You can create the hour partition from the system time or from a timestamp column, depending on the type of queries you want to run on your data.
You can use watermarking to handle late records if you choose to partition on a timestamp column.
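A sketch of what the partitioned layout might look like. Spark's partition discovery relies on key=value directory names, so the helper below builds such a path from an event timestamp; the column names are assumptions based on the hierarchy in the question. The second function shows the matching PySpark writer call and assumes `df` already has those columns.

```python
from datetime import datetime


def partition_dir(client_id: str, sensor_id: str, ts: datetime) -> str:
    """Return the key=value directory an event with this timestamp falls into."""
    return (
        f"client_id={client_id}/sensor_id={sensor_id}/"
        f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/hour={ts.hour:02d}"
    )


def append_partitioned(df, path: str) -> None:
    """Append a Spark DataFrame as new Parquet files under partition directories."""
    (
        df.write.mode("append")
        .partitionBy("client_id", "sensor_id", "year", "month", "day", "hour")
        .parquet(path)
    )


example = partition_dir("c1", "s7", datetime(2021, 3, 5, 14, 30))
# example == "client_id=c1/sensor_id=s7/year=2021/month=03/day=05/hour=14"
```

A query that filters on these columns only needs to read the matching directories, which is where the performance gain comes from. Very fine partitioning on a low-volume sensor can produce many small files, so the hour level is worth measuring before adopting.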
Hope this helps!
I can share my experience and the technology stack used at AppsFlyer.
We have a lot of data, about 70 billion events per day.
Our time-series data for near-real-time analytics is stored in Druid and ClickHouse. ClickHouse holds real-time data for the last two days; Druid (0.9) wasn't able to manage it. Druid holds the rest of our data, which is populated daily via Hadoop.
Druid is the right candidate if you don't need raw data but pre-aggregated data, on a daily or hourly basis.
I would suggest giving ClickHouse a chance; it lacks documentation and examples, but it is robust and fast.
Also, you might take a look at Apache Hudi.

What is the Parquet summary file?

On Apache's official website, this is the official explanation of this parameter:
When true, the Parquet data source merges schemas collected from all data files, otherwise the schema is picked from the summary file or a random data file if no summary file is available.
In fact, my question is, what is the summary file?
Apache Parquet uses metadata to store all the information required to load the data from a file, such as column metadata, dictionaries, row groups, and so on.
The format is designed so that this metadata can be embedded in the file itself or stored in a separate file. That separate file is the summary file.
A Parquet summary file contains a collection of footers from the actual Parquet data files in a directory. It can be used to skip row groups when reading without fetching the footer from each individual Parquet file, which may be expensive if you have a lot of files and/or are on blob stores.
https://github.com/apache/parquet-mr/blob/65b95fb72be8f5a8a193a6f7bc4560fdcd742fc7/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L407
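To show what "fetching the footer" involves, here is a small stdlib-only sketch that locates the footer metadata in a Parquet file's bytes. It relies on the file layout (the file starts and ends with the magic bytes PAR1, and the 4 bytes before the trailing magic hold the metadata length in little-endian order) and does not decode the Thrift-encoded metadata itself. The function name is hypothetical, and the sample bytes are a fake stand-in, not a real Parquet file.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"


def footer_metadata_range(data: bytes) -> Tuple[int, int]:
    """Return (start, end) byte offsets of the serialized file metadata."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    # The metadata length is a 4-byte little-endian integer before the magic.
    (length,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8
    start = end - length
    if start < 4:
        raise ValueError("corrupt footer length")
    return start, end


# Fake file: magic, stand-in row group bytes, stand-in metadata, length, magic.
fake_file = MAGIC + b"rowgroups" + b"META!" + struct.pack("<I", 5) + MAGIC
start, end = footer_metadata_range(fake_file)
```

A reader has to do this seek-and-read once per data file, which is the per-file cost a summary file is meant to avoid.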
Parquet is a column-oriented file format, which means that the data for a particular column across all rows is stored adjacently. This has two main benefits: a better compression ratio and increased query performance.

Design of Spark + Parquet "database"

I've got 100G text files coming in daily, and I wish to create an efficient "database" accessible from Spark. By "database" I mean the ability to execute fast queries on the data (going back about a year), and incrementally add data each day, preferably without read locks.
Assuming I want to use Spark SQL and parquet, what's the best way to achieve this?
give up on concurrent reads/writes and append new data to the existing parquet file.
create a new parquet file for each day of data, and use the fact that Spark can load multiple parquet files to allow me to load e.g. an entire year. This effectively gives me "concurrency".
something else?
Feel free to suggest other options, but let's assume I'm using parquet for now, as from what I've read this will be helpful to many others.
My Level 0 design for this:
Use partitioning by date/time (if your queries are based on date/time, this avoids scanning all the data)
Use Append SaveMode where required
Run the Spark SQL distributed SQL engine so that you:
  enable querying of the data from multiple clients/applications/users
  cache the data only once across all clients/applications/users
Use just HDFS, if you can, to store all your Parquet files
I have a very similar requirement in my system. If you load the whole year's data, at 100 GB a day that is about 36 TB; if you need to load 36 TB daily, that can't be fast anyway. It is better to save the processed daily results somewhere (such as counts, sums, and distinct results) and use those to go back over the whole year.
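A tiny sketch of that pre-aggregation idea, using plain Python in place of Spark. The event shape (day, value) and the function names are assumptions for illustration. Each day is summarized once, and longer ranges are answered from the summaries instead of the raw data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def daily_summary(events: Iterable[Tuple[str, float]]) -> Dict[str, dict]:
    """Summarize (day, value) events into a per-day count and sum."""
    summary = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for day, value in events:
        summary[day]["count"] += 1
        summary[day]["sum"] += value
    return dict(summary)


def rollup(summary: Dict[str, dict], days: Iterable[str]) -> dict:
    """Combine per-day summaries over a range without touching raw events."""
    selected = [summary[day] for day in days if day in summary]
    count = sum(item["count"] for item in selected)
    total = sum(item["sum"] for item in selected)
    return {"count": count, "sum": total, "mean": total / count if count else None}


events = [("2024-01-01", 2.0), ("2024-01-01", 4.0), ("2024-01-02", 6.0)]
summary = daily_summary(events)
result = rollup(summary, ["2024-01-01", "2024-01-02"])
```

Counts and sums combine by simple addition, so a year-long query only reads 365 small rows. Distinct counts do not add up across days, so for those you would need to keep the distinct values (or an approximate structure) per day.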
