Can I use aws glue crawlers to create master data in delta lake tables? - delta-lake

I am setting up a new data lake and have been tasked with creating the master data tables in the data bricks delta lake component. I'm trying to do this in a use-case agnostic way (or as agnostic as possible), and need to automate the process where possible. I have researched aws glue crawlers, and it seems it is a good way to automatically create a schema and catalog for the data.
However, I'm not sure how to proceed. I'm assuming that creating the master data means identifying common fields in all the data sources and creating a schema for all the data using a single crawler, and then dividing this schema into facts and dimensions. After that I could use spark jobs on data bricks to extract what I need from the raw data and to populate the master data, while checking for duplicates and doing whatever other transformations that need to be done.
This plan seems like it requires a lot of manual labor though, and it's not use case agnostic in any way. Does anyone know how it could be automated further?
Any help would be much appreciated.

Related

Azure Data Factory ETL Process

I would like to merge data from different data sources (ERP system, Excel files) with the ADF and make it available in an AzureSQLDB for further analyzing.
I'm not sure where and when I do the transformations and joins between the tables. Can I run all of this directly in the pipeline and then load the data into AzureDB, or do I need to stage the data first?
My understanding is to load the data into the ADF using Copy Activities and Datasets. Transforming and merging the datasets there with MappingDataFlows or similar activities. Then they are loaded into the AzureSQLDB
Your question is fully requirement based. You can go for either ETL or ELT process. Since your sink is AzureSQLDB , I would suggest to go with ELT , as you can handle lots of transformations in SQL itself by creating views on the landing tables.
If you have complex transformations to handle , then go with ETL and use Dataflow instead.
Also, regarding staging tables, if your requirement is to perform daily incremental load after the first full load, then you should opt for staging table.
Checkout this video for full load and incremental load .

how to synchronize an external database on Spark session

I have a Delta Lake on an s3 Bucket.
Since I would like to use Spark's SQL API, I need to synchronize the Delta Lake with the local Spark session. Is there a quick way to have all the tables available, without having to create a temporary view for each one?
At the moment this is what I do (Let's suppose I have 3 tables into the s3_bucket_path "folder").
s3_bucket_path = 's3a://bucket_name/delta_lake/'
spark.read.format('delta').load(s3_bucket_path + 'table_1').createOrReplaceTempView('table_1')
spark.read.format('delta').load(s3_bucket_path + 'table_2').createOrReplaceTempView('table_2')
spark.read.format('delta').load(s3_bucket_path + 'table_3').createOrReplaceTempView('table_3')
I was wondering if there was a quicker way to have all the tables available (without having to use boto3 and iterate through the folder to get the table names), or if I wasn't following the best practices in order to work with Spark Sql Apis: should I use a different approach? I've been studying Spark for a week and I'm not 100% familiar with its architecture yet.
Thank you very much for your help.
Sounds like you'd like to use managed tables, so you have easy access to query the data with SQL, without manually registering views.
You can create a managed table as follows:
df.write.format("delta").saveAsTable("table_1")
The table path and schema information is stored in the Hive megastore (or another metastore if you've specified another metastore). Managed tables will prevent you from manually having to create the views yourself.

AWS Glue for handling incremental data loading of different sources

I am planning to leverage AWG Glue for incremental data processing. Based on hourly schedule a trigger will invoke Glue Crawler and Glue ETL Job which loads incremental data to catalog and processed the incremental files through ETL. And looks pretty straight forward as well. With this I ran into couple of issues.
Let's say we have data getting streamed for various tables and for various data bases to S3 locations, and we want to create data bases and tables based on landing data.
eg: s3://landingbucket/database1/table1/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/database1/table2/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/database1/somedata/tablex/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/database2/table1/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/datasource_external/data/table1/YYYYMMDDHH/some_incremental_files.json
With the data getting landed in above s3 structure, we want to create glue catalog for these data bases and tables with limited Crawlers. Here we have number of databases as number of crawlers.
Note: We have a crawler for database1, its creating tables under database1, which is good and as expected, but we have an exceptional guy "somedata" in database1, whose structure is not in standard with other tables, with this it created table somedata and with partitions "partitions_0=tablex and partition_1=YYYYMMDDHH". Is there a better way to handle these with less number of crawlers than one crawler per data base.
Glue ETL, we have similar challenge, we want to format the incoming data to standard parquet format, and have one bucket per database and tables will be sitting under that, as the data is huge we don't want one table with partitions as data_base and data. So that we will not getting into s3 slowdown issues for the incoming load. As many teams will be querying the data from this, so we don't want to have s3 slowdown issue coming for their analytics jobs.
Instead of having one ETL job per table, per data base, is there a way we can handle this with limited jobs. As and when new tables are coming, there should be a way the ETL job should transform this json data to formatted zone. So input data and output path both can be handled dynamically, instead of hardcoding.
Open for any better idea!
Thanks,
Krish!

Synchronize data lake with the deleted record

I am building data lake to integrate multiple data sources for advanced analytics.
In the begining, I select HDFS as data lake storage. But I have a requirement for updates and deletes in data sources which I have to synchronise with data lake.
To understand the immutable nature of Data Lake I will consider LastModifiedDate from Data source to detect that this record is updated and insert this record in Data Lake with a current date. The idea is to select the record with max(date).
However, I am not able to understand how
I will detect deleted records from sources and what I will do with Data Lake?
Should I use other data storage like Cassandra and execute a delete command? I am afraid it will lose the immutable property.
can you please suggest me good practice for this situation?
1. Question - Detecting deleted records from datasources
Detecting deleted records from data sources, requires that your data sources supports this. Best is that deletion is only done logically, e. g. with a change flag. For some databases it is possible to track also deleted rows (see for example for SQL-Server). Also some ETL solutions like Informatica offer CDC (Changed Data Capture) capabilities.
2. Question - Changed data handling in a big data solution
There are different approaches. Of cause you can use a key value store adding some kind of complexity to the overall solution. First you have to clarify, if it is also of interest to track changes and deletes. You could consider loading all data (new/changed/deleted) into daily partitions and finally build an actual image (data as it is in your data source). Also consider solutions like Databricks Delta addressing this topics, without the need of an additional store. For example you are able to do an upsert on parquet files with delta as follows:
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
UPDATE SET
events.data = updates.data
WHEN NOT MATCHED
THEN INSERT (date, eventId, data) VALUES (date, eventId, data)
If your solution also requires low latency access via a key (e. g. to support an API) then a key-values store like HBase, Cassandra, etc. would be helpfull.
Usually this is always a constraint while creating datalake in Hadoop, one can't just update or delete records in it. There is one approach that you can try is
When you are adding lastModifiedDate, you can also add one more column naming status. If a record is deleted, mark the status as Deleted. So the next time, when you want to query the latest active records, you will be able to filter it out.
You can also use cassandra or Hbase (any nosql database), if you are performing ACID operations on a daily basis. If not, first approach would be your ideal choice for creating datalake in Hadoop

Azure Data Factory Data Migration

Not really sure this is an explicit question or just a query for input. I'm looking at Azure Data Factory to implement a data migration operation. What I'm trying to do is the following:
I have a No SQL DB with two collections. These collections are associated via a common property.
I have a MS SQL Server DB which has data that is related to the data within the No SQL DB Collections via an attribute/column.
One of the NoSQL DB collections will be updated on a regular basis, the other one on a not so often basis.
What I want to do is be able to prepare a Data Factory pipline that will grab the data from all 3 DB locations combine them based on the common attributes, which will result in a new dataset. Then from this dataset push the data wihin the dataset to another SQL Server DB.
I'm a bit unclear on how this is to be done within the data factory. There is a copy activity, but only works on a single dataset input so I can't use that directly. I see that there is a concept of data transformation activities that look like they are specific to massaging input datasets to produce new datasets, but I'm not clear on what ones would be relevant to the activity I am wanting to do.
I did find that there is a special activity called a Custom Activity that is in effect a user defined definition that can be developed to do whatever you want. This looks the closest to being able to do what I need, but I'm not sure if this is the most optimal solution.
On top of that I am also unclear about how the merging of the 3 data sources would work if the need to join data from the 3 different sources is required but do not know how you would do this if the datasets are just snapshots of the originating source data, leading me to think that the possibility of missing data occurring. I'm not sure if a concept of publishing some of the data someplace someplace would be required, but seems like it would in effect be maintaining two stores for the same data.
Any input on this would be helpful.
There are a lot of things you are trying to do.
I don't know if you have experience with SSIS but what you are trying to do is fairly common for either of these integration tools.
Your ADF diagram should look something like:
1. You define your 3 Data Sources as ADF Datasets on top of a
corresponding Linked service
2. Then you build a pipeline that brings information from SQL Server into a
temporary Data Source (Azure Table for example)
3. Next you need to build 2 pipelines that will each take one of your NoSQL
Dataset and run a function to update the temporary Data Source which is the ouput
4. Finally you can build a pipeline that will bring all your data from the
temporary Data Source into your other SQL Server
Steps 2 and 3 could be switched depending on which source is the master.
ADF can run multiple tasks one after another or concurrently. Simply break down the task in logical jobs and you should have no problem coming up with a solution.

Resources