AWS Glue for handling incremental data loading of different sources - apache-spark

I am planning to leverage AWS Glue for incremental data processing. On an hourly schedule, a trigger invokes a Glue Crawler and a Glue ETL job: the crawler loads the incremental data into the catalog and the ETL job processes the incremental files. That looks pretty straightforward, but I ran into a couple of issues.
Let's say we have data being streamed for various tables and various databases into S3 locations, and we want to create databases and tables based on the landing data.
eg: s3://landingbucket/database1/table1/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/database1/table2/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/database1/somedata/tablex/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/database2/table1/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/datasource_external/data/table1/YYYYMMDDHH/some_incremental_files.json
With the data landing in the above S3 structure, we want to build the Glue catalog for these databases and tables with a limited number of crawlers. Right now we need as many crawlers as we have databases.
Note: We have a crawler for database1 and it creates tables under database1, which is good and as expected. But we have an exceptional case, "somedata" in database1, whose structure does not follow the standard of the other tables; for it the crawler created a table named somedata with partitions "partition_0=tablex" and "partition_1=YYYYMMDDHH". Is there a better way to handle these with fewer crawlers than one per database?
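One thing I'm considering (just a sketch; the crawler name and IAM role below are placeholders) is giving a single crawler several explicit S3 include paths, one per table prefix, so that one crawler per database, or even fewer, can still create properly named tables instead of inferring partitions for odd prefixes like somedata:

import boto3

glue = boto3.client("glue")

# One crawler for database1, pointed at each table prefix explicitly so it
# roots its tables at table1/table2/tablex instead of inferring
# partition_0=tablex style partitions for the irregular "somedata" prefix.
glue.create_crawler(
    Name="database1-crawler",                                   # placeholder
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",      # placeholder
    DatabaseName="database1",
    Targets={
        "S3Targets": [
            {"Path": "s3://landingbucket/database1/table1/"},
            {"Path": "s3://landingbucket/database1/table2/"},
            {"Path": "s3://landingbucket/database1/somedata/tablex/"},
        ]
    },
)

But I'm not sure this scales well as new tables keep arriving, hence the question.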
Glue ETL: we have a similar challenge. We want to convert the incoming data to a standard Parquet format, with one bucket per database and the tables sitting under it. Because the data is huge, we don't want a single table with the database and table as partitions, so that the incoming load doesn't run into S3 slowdown issues. Many teams will be querying this data, and we don't want S3 slowdown errors hitting their analytics jobs either.
Instead of having one ETL job per table per database, is there a way to handle this with a limited number of jobs? As new tables arrive, the ETL job should transform their JSON data into the formatted zone, with both the input and output paths handled dynamically instead of hardcoded.
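What I have in mind for this (sketch only; --source_path and --target_path are job arguments I would define and pass in from the trigger, not anything Glue provides by default) is one parameterized Glue job reused for every table:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions

# --source_path / --target_path are custom job arguments passed per table and
# per hour, so the same job serves any database/table without hardcoded paths.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "source_path", "target_path"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read one table's incremental JSON files and rewrite them as Parquet
incoming = spark.read.json(args["source_path"])
incoming.write.mode("append").parquet(args["target_path"])

Is something along these lines the right direction, or is there a more standard pattern?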
Open for any better idea!
Thanks,
Krish!

Related

Best option for storage in spark

A third party is producing a complete daily snapshot of their database table (Authors) and storing it as a Parquet file in S3. Currently the number of records is around 55 million+ and will increase daily. There are 12 columns.
Initially I want to take this whole dataset, do some processing on the records, normalise them, and then block them into groups of authors based on some specific criteria. I will then need to repeat this process daily, filtering it to only include authors that have been added or updated since the previous day.
I am using AWS EMR on EKS (Kubernetes) as my Spark cluster. My current thoughts are that I can save my blocks of authors on HDFS.
The main use for the blocks of data will be a separate Spark Streaming job deployed onto the same EMR cluster; it will read events from a Kafka topic, do a quick search to see which blocks of data are related to that event, and then do some matching (pairwise) against each item of that block.
I have two main questions:
Is using HDFS a performant and viable option for this use case?
The third party database table dump is just the initial goal. Later on, there will quite possibly be tens or even hundreds of other sources that I need to match against, which means trillions of records being blocked, and those blocks will need to be stored somewhere. Would this option still be viable at that stage?

Can I use aws glue crawlers to create master data in delta lake tables?

I am setting up a new data lake and have been tasked with creating the master data tables in the Databricks Delta Lake component. I'm trying to do this in a use-case-agnostic way (or as agnostic as possible), and need to automate the process where possible. I have researched AWS Glue crawlers, and they seem like a good way to automatically create a schema and catalog for the data.
However, I'm not sure how to proceed. I'm assuming that creating the master data means identifying common fields in all the data sources and creating a schema for all the data using a single crawler, and then dividing this schema into facts and dimensions. After that I could use Spark jobs on Databricks to extract what I need from the raw data and populate the master data, while checking for duplicates and doing whatever other transformations need to be done.
This plan seems like it requires a lot of manual labor though, and it's not use case agnostic in any way. Does anyone know how it could be automated further?
Any help would be much appreciated.

Question about using parquet for time-series data

I'm exploring ways to store a high volume of data from sensors (time series data), in a way that's scalable and cost-effective.
Currently, I'm writing a CSV file for each sensor, partitioned by date, so my filesystem hierarchy looks like this:
client_id/sensor_id/year/month/day.csv
My goal is to be able to run SQL queries on this data (typically fetching time ranges for a specific client/sensor, performing aggregations, etc.). I've tried loading it into Postgres and TimescaleDB, but the volume is just too large and the queries are unreasonably slow.
I am now experimenting with using Spark and Parquet files to perform these queries, but I have some questions I haven't been able to answer from my research on this topic, namely:
I am converting this data to parquet files, so I now have something like this:
client_id/sensor_id/year/month/day.parquet
But my concern is that when Spark loads the top folder containing the many Parquet files, the metadata for the rowgroup information is not as optimized as if I used one single parquet file containing all the data, partitioned by client/sensor/year/month/day. Is this true? Or is it the same to have many parquet files or a single partitioned Parquet file? I know that internally the parquet file is stored in a folder hierarchy like the one I am using, but I'm not clear on how that affects the metadata for the file.
The reason I am not able to do this is that I am continuously receiving new data, and from my understanding I cannot append to a Parquet file because of the way the footer metadata works. Is this correct? Right now, I simply convert the previous day's data to Parquet and create a new file for each sensor of each client.
Thank you.
You can use Structured Streaming with Kafka (as you are already using it) for real-time processing of your data and store the data in Parquet format. And yes, you can append data to Parquet files; use SaveMode.Append for that, such as
df.write.mode('append').parquet(path)
You can even partition your data on an hourly basis, e.g.
client/sensor/year/month/day/hour, which will give you a further performance improvement while querying.
You can create the hour partition based on either the system time or a timestamp column, depending on the type of queries you want to run on your data.
If you choose to partition based on the timestamp column, you can use watermarking to handle late records.
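For example, a rough end-to-end sketch of that (the payload schema, Kafka broker/topic, and S3 paths are placeholders, not from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, year, month, dayofmonth, hour
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("sensor-ingest").getOrCreate()

# Assumed sensor payload
schema = StructType([
    StructField("client_id", StringType()),
    StructField("sensor_id", StringType()),
    StructField("ts", TimestampType()),
    StructField("value", DoubleType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
       .option("subscribe", "sensor-events")               # placeholder
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*")
          .withColumn("year", year("ts"))
          .withColumn("month", month("ts"))
          .withColumn("day", dayofmonth("ts"))
          .withColumn("hour", hour("ts")))

# Append-only Parquet sink, partitioned down to the hour of the event timestamp
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://my-bucket/sensor_data/")                        # placeholder
         .option("checkpointLocation", "s3://my-bucket/checkpoints/sensors/")  # placeholder
         .partitionBy("client_id", "sensor_id", "year", "month", "day", "hour")
         .outputMode("append")
         .start())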
Hope this helps!
I can share my experience and the technology stack we use at AppsFlyer.
We have a lot of data, about 70 billion events per day.
Our time-series data for near-real-time analytics is stored in Druid and ClickHouse. ClickHouse holds the real-time data for the last two days; Druid (0.9) wasn't able to manage that. Druid holds the rest of our data, which is populated daily via Hadoop.
Druid is the right candidate if you don't need the raw rows but pre-aggregated data, on a daily or hourly basis.
I would suggest giving ClickHouse a chance; it lacks documentation and examples but works robustly and fast.
Also, you might take a look at Apache Hudi.

Azure Data Lake incremental load with file partition

I'm designing Data Factory pipelines to load data from Azure SQL DB into Azure Data Lake.
My initial load/POC was a small subset of the data, and I was able to load it from the SQL tables into Azure DL.
Now there is a huge volume of tables (some with a billion+ rows) that I want to load from SQL DB into Azure DL using DF.
MS docs mentioned two options, i.e. watermark columns and change tracking.
Let's say I have a "cust_transaction" table that has millions of rows; if I load it into the DL, it lands as "cust_transaction.txt".
Questions.
1) What would be an optimal design to incrementally load the source data from SQL DB into that file in the data lake?
2) How do I split or partition the files into smaller files?
3) How should I merge and load the deltas from source data into the files?
Thanks.
You will want multiple files. Typically, my data lakes have multiple zones. The first zone is Raw. It contains a copy of the source data organized into entity/year/month/day folders where entity is a table in your SQL DB. Typically, those files are incremental loads. Each incremental load for an entity has a file name similar to Entity_YYYYMMDDHHMMSS.txt (and maybe even more info than that) rather than just Entity.txt. And the timestamp in the file name is the end of the incremental slice (max possible insert or update time in the data) rather than just current time wherever possible (sometimes they are relatively the same and it doesn't matter, but I tend to get a consistent incremental slice end time for all tables in my batch). You can achieve the date folders and timestamp in the file name by parameterizing the folder and file in the dataset.
Melissa Coates has two good articles on Azure Data Lake: Zones in a Data Lake and Data Lake Use Cases and Planning. Her naming conventions are a bit different than mine, but both of us would tell you to just be consistent. I would land the incremental load file in Raw first. It should reflect the incremental data as it was loaded from the source. If you need to have a merged version, that can be done with Data Factory or U-SQL (or your tool of choice) and landed in the Standardized Raw zone. There are some performance issues with small files in a data lake, so consolidation could be good, but it all depends on what you plan to do with the data after you land it there. Most users would not access data in the RAW zone, instead using data from Standardized Raw or Curated Zones. Also, I want Raw to be an immutable archive from which I could regenerate data in other zones, so I tend to leave it in the files as it landed. But if you found you needed to consolidate there, that would be fine.
Change tracking is a reliable way to get changes, but I don't like their naming conventions/file organization in their example. I would make sure your file name has the entity name and a timestamp on it. They have Incremental - [PipelineRunID]. I would prefer [Entity]_[YYYYMMDDHHMMSS]_[TriggerID].txt (or leave the run ID off) because it is more informative to others. I also tend to use the Trigger ID rather than the pipeline RunID. The Trigger ID is across all the packages executed in that trigger instance (batch) whereas the pipeline RunID is specific to that pipeline.
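Just to illustrate that naming (in Python for brevity; in ADF you would build the same folder and file names with dataset parameters and expressions, and the helper and zone prefix here are made up):

from datetime import datetime

def raw_zone_path(entity: str, slice_end: datetime, trigger_id: str) -> str:
    # entity/year/month/day folders with an informative, sortable file name
    return (f"raw/{entity}/{slice_end:%Y/%m/%d}/"
            f"{entity}_{slice_end:%Y%m%d%H%M%S}_{trigger_id}.txt")

print(raw_zone_path("cust_transaction", datetime(2019, 5, 1, 13, 0, 0), "trigger-001"))
# raw/cust_transaction/2019/05/01/cust_transaction_20190501130000_trigger-001.txt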
If you can't do the change tracking, the watermark is fine. I usually can't add change tracking to my sources and have to go with watermark. The issue is that you are trusting that the application's modified date is accurate. Are there ever times when a row is updated and the modified date is not changed? When a row is inserted, is the modified date also updated or would you have to check two columns to get all new and changed rows? These are the things we have to consider when we can't use change tracking.
To summarize:
Load incrementally and name your incremental files intelligently
If you need a current version of the table in the data lake, that is a separate file in your Standardized Raw or Curated Zone.

Design of Spark + Parquet "database"

I've got 100G text files coming in daily, and I wish to create an efficient "database" accessible from Spark. By "database" I mean the ability to execute fast queries on the data (going back about a year), and incrementally add data each day, preferably without read locks.
Assuming I want to use Spark SQL and parquet, what's the best way to achieve this?
give up on concurrent reads/writes and append new data to the existing parquet file.
create a new parquet file for each day of data, and use the fact that Spark can load multiple parquet files to allow me to load e.g. an entire year. This effectively gives me "concurrency".
something else?
Feel free to suggest other options, but let's assume I'm using parquet for now, as from what I've read this will be helpful to many others.
My Level 0 design of this
Use partitioning by date/time (if your queries are based on date/time to avoid scanning of all data)
Use Append SaveMode where required
Run the Spark SQL distributed SQL engine so that
you enable querying of the data from multiple clients/applications/users, and
the data is cached only once across all clients/applications/users
Use just HDFS if you can to store all your Parquet files
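A minimal sketch of that Level 0 design (assuming the daily files are JSON lines with an event_time column; both are assumptions on my part):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.appName("daily-ingest").getOrCreate()

# Assumed landing location for one day's text files
daily = spark.read.json("hdfs:///landing/2021-06-01/")

(daily.withColumn("event_date", to_date(col("event_time")))
      .write
      .mode("append")                    # SaveMode.Append: each day just adds new partitions
      .partitionBy("event_date")         # date filters then prune to the matching day folders
      .parquet("hdfs:///warehouse/events/"))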
I have a very similar requirement in my system. I would say that if you load the whole year's data, at 100 GB a day that is roughly 36 TB a year, and scanning 36 TB for every query cannot be fast no matter what. It is better to save the processed daily results somewhere (such as counts, sums, distinct results) and use those when you need to go back over the whole year.
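For example, a rough daily roll-up along those lines (the grouping keys, metric columns, and paths are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-rollup").getOrCreate()

# Read the partitioned store and roll up a single day into a small summary table
events = spark.read.parquet("hdfs:///warehouse/events/")
one_day = events.where(F.col("event_date") == "2021-06-01")

summary = (one_day.groupBy("event_date", "customer_id")
                  .agg(F.count("*").alias("events"),
                       F.sum("amount").alias("total_amount"),
                       F.approx_count_distinct("session_id").alias("approx_sessions")))

# Whole-year questions can then be answered from this much smaller table
summary.write.mode("append").partitionBy("event_date").parquet("hdfs:///warehouse/events_daily_summary/")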
