How to synchronize an external database with a Spark session - apache-spark

I have a Delta Lake on an s3 Bucket.
Since I would like to use Spark's SQL API, I need to synchronize the Delta Lake with the local Spark session. Is there a quick way to have all the tables available, without having to create a temporary view for each one?
At the moment this is what I do (let's suppose I have 3 tables in the s3_bucket_path "folder").
s3_bucket_path = 's3a://bucket_name/delta_lake/'
spark.read.format('delta').load(s3_bucket_path + 'table_1').createOrReplaceTempView('table_1')
spark.read.format('delta').load(s3_bucket_path + 'table_2').createOrReplaceTempView('table_2')
spark.read.format('delta').load(s3_bucket_path + 'table_3').createOrReplaceTempView('table_3')
I was wondering if there was a quicker way to make all the tables available (without having to use boto3 and iterate through the folder to get the table names), or whether I'm not following best practices for working with the Spark SQL APIs: should I use a different approach? I've been studying Spark for a week and I'm not 100% familiar with its architecture yet.
Thank you very much for your help.

Sounds like you'd like to use managed tables, so you have easy access to query the data with SQL, without manually registering views.
You can create a managed table as follows:
df.write.format("delta").saveAsTable("table_1")
The table path and schema information are stored in the Hive metastore (or another metastore if you've configured one). Managed tables save you from having to create the views manually.
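If the data already exists as Delta tables on S3, you also don't have to rewrite it with saveAsTable: an existing Delta location can be registered in the metastore as an external table. A minimal sketch, assuming the bucket path and table names from the question:
s3_bucket_path = 's3a://bucket_name/delta_lake/'

# Register each existing Delta table in the metastore by its location,
# so it can be queried with spark.sql() without temp views or rewriting data.
for table_name in ['table_1', 'table_2', 'table_3']:
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS {table_name}
        USING DELTA
        LOCATION '{s3_bucket_path}{table_name}'
    """)

spark.sql('SELECT COUNT(*) FROM table_1').show()
Once registered, the tables stay available across sessions as long as they share the same metastore.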

Related

Can I use AWS Glue crawlers to create master data in Delta Lake tables?

I am setting up a new data lake and have been tasked with creating the master data tables in the Databricks Delta Lake component. I'm trying to do this in a use-case-agnostic way (or as agnostic as possible), and need to automate the process where possible. I have researched AWS Glue crawlers, and they seem like a good way to automatically create a schema and catalog for the data.
However, I'm not sure how to proceed. I'm assuming that creating the master data means identifying common fields in all the data sources and creating a schema for all the data using a single crawler, and then dividing this schema into facts and dimensions. After that I could use Spark jobs on Databricks to extract what I need from the raw data, populate the master data, check for duplicates, and do whatever other transformations need to be done.
This plan seems like it requires a lot of manual labor though, and it's not use-case agnostic in any way. Does anyone know how it could be automated further?
Any help would be much appreciated.

Writing Data to External Databases Through PySpark

I want to write the data from a PySpark DataFrame to external databases, say an Azure MySQL database. So far, I have managed to do this using .write.jdbc():
spark_df.write.jdbc(url=mysql_url, table=mysql_table, mode="append", properties={"user": mysql_user, "password": mysql_password, "driver": "com.mysql.cj.jdbc.Driver"})
Here, if I am not mistaken, the only options available for mode are append and overwrite. However, I want to have more control over how the data is written. For example, I want to be able to perform update and delete operations.
How can I do this? Is it possible to say, write SQL queries to write data to the external databases? If so, please give me an example.
First, I suggest you use the dedicated Azure SQL connector: https://learn.microsoft.com/en-us/azure/azure-sql/database/spark-connector.
Then I recommend you use bulk mode, as row-by-row mode is slow and can incur unexpected charges if you have Log Analytics turned on.
Lastly, for any kind of data transformation, you should use an ELT pattern:
Load raw data into an empty staging table
Run SQL code, or even better a stored procedure, that performs the required logic (for example, merging the staging data into a final table), as sketched below
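A rough sketch of that pattern over plain JDBC (the connection details, staging table, and stored procedure name are placeholders, and mysql-connector-python is just one of several ways to run SQL from the driver):
# Step 1: bulk-load the DataFrame into an empty staging table (placeholder name).
spark_df.write.jdbc(
    url=mysql_url,
    table="staging_events",  # hypothetical staging table
    mode="overwrite",
    properties={
        "user": mysql_user,
        "password": mysql_password,
        "driver": "com.mysql.cj.jdbc.Driver",
    },
)

# Step 2: run the merge/update/delete logic on the database side.
# Spark's JDBC writer can only append or overwrite, so arbitrary DML has to be
# executed through a regular client connection from the driver.
import mysql.connector  # assumes the package is installed

conn = mysql.connector.connect(
    host="your-server.mysql.database.azure.com",  # placeholder host
    user=mysql_user,
    password=mysql_password,
    database="your_database",  # placeholder database
)
cursor = conn.cursor()
cursor.callproc("merge_staging_into_events")  # hypothetical stored procedure
conn.commit()
cursor.close()
conn.close()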

Truncate tables on databricks

I'm working with two environments in Azure: Databricks and SQL Database. I have a function that generates a DataFrame which is going to be used to overwrite a table stored in the SQL Database. I'm having a lot of problems because df.write.jdbc(mode = 'overwrite') only drops the table and, I'm guessing, my user doesn't have the right permissions to create it again (I've already seen that I need DML and DDL permissions to do that). In summary, my function only drops the table without recreating it.
We discussed what the problem could be and concluded that maybe the best thing I can do is truncate the table and re-add the new data. I'm trying to find out how to truncate the table; I tried these two approaches but can't find more information about them:
df.write.jdbc()
&
spark.read.jdbc()
Can you help me with this? The overwrite doesn't work (maybe I don't have adequate permissions) and I can't figure out how to truncate the table over JDBC.
It's in the Spark documentation - you need to add the truncate option when writing:
df.write.mode("overwrite").option("truncate", "true")....save()
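Spelled out a bit more (the URL, table name, and credentials are placeholders for an Azure SQL / SQL Server target):
# Overwrite with truncate: Spark issues TRUNCATE TABLE instead of DROP + CREATE,
# so the table definition and permissions stay intact.
(df.write
    .format("jdbc")
    .mode("overwrite")
    .option("truncate", "true")
    .option("url", "jdbc:sqlserver://your-server.database.windows.net;databaseName=your_db")
    .option("dbtable", "dbo.target_table")
    .option("user", db_user)        # placeholder credentials
    .option("password", db_password)
    .save())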
Also, if you have a lot of data, then maybe it's better to use Microsoft's Spark connector for SQL Server - it has some performance optimizations that should allow you to write faster.
You can also create a stored procedure for truncating or dropping the table in SQL Server and call that stored procedure from Databricks over an ODBC connection.
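A minimal sketch of that ODBC route, assuming the pyodbc package is installed on the cluster and using a made-up procedure name:
import pyodbc  # assumed to be available on the Databricks cluster

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=your-server.database.windows.net;"  # placeholder server
    "DATABASE=your_db;"
    "UID=" + db_user + ";PWD=" + db_password
)
cursor = conn.cursor()
cursor.execute("EXEC dbo.truncate_target_table")  # hypothetical stored procedure
conn.commit()
conn.close()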

Synchronize data lake with deleted records

I am building a data lake to integrate multiple data sources for advanced analytics.
In the beginning, I selected HDFS as the data lake storage. But the data sources receive updates and deletes, which I have to synchronise with the data lake.
To respect the immutable nature of the data lake, I will use the LastModifiedDate from the data source to detect that a record was updated and insert that record into the data lake with the current date. The idea is then to select the record with max(date).
However, I am not able to understand how I will detect deleted records from the sources, and what I should then do with the data lake.
Should I use another data store like Cassandra and execute a delete command? I am afraid that would lose the immutable property.
Can you please suggest a good practice for this situation?
1. Question - Detecting deleted records from data sources
Detecting deleted records from data sources requires that your data sources support this. It is best if deletion is only done logically, e.g. with a change flag. For some databases it is also possible to track deleted rows (see, for example, SQL Server). Some ETL solutions like Informatica also offer CDC (Change Data Capture) capabilities.
2. Question - Changed data handling in a big data solution
There are different approaches. Of course you can use a key-value store, which adds some complexity to the overall solution. First you have to clarify whether it is also of interest to track changes and deletes. You could consider loading all data (new/changed/deleted) into daily partitions and finally building a current image (the data as it is in your data source). Also consider solutions like Databricks Delta, which address these topics without the need for an additional store. For example, you can do an upsert on Parquet files with Delta as follows:
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
  UPDATE SET events.data = updates.data
WHEN NOT MATCHED THEN
  INSERT (date, eventId, data) VALUES (date, eventId, data)
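The same upsert can also be expressed through the Delta Lake Python API (a sketch assuming the delta-spark package, an updates_df DataFrame, and a placeholder table path):
from delta.tables import DeltaTable

events = DeltaTable.forPath(spark, "/delta/events")  # placeholder table path

(events.alias("events")
    .merge(updates_df.alias("updates"), "events.eventId = updates.eventId")
    .whenMatchedUpdate(set={"data": "updates.data"})
    .whenNotMatchedInsert(values={
        "date": "updates.date",
        "eventId": "updates.eventId",
        "data": "updates.data",
    })
    .execute())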
If your solution also requires low-latency access via a key (e.g. to support an API), then a key-value store like HBase, Cassandra, etc. would be helpful.
This is usually a constraint when creating a data lake in Hadoop: one can't just update or delete records in it. One approach that you can try is:
When you are adding lastModifiedDate, you can also add one more column named status. If a record is deleted, mark the status as Deleted. The next time you want to query the latest active records, you will be able to filter the deleted ones out.
You can also use Cassandra or HBase (any NoSQL database) if you are performing ACID operations on a daily basis. If not, the first approach would be your ideal choice for creating a data lake in Hadoop.
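To make the first approach concrete, here is a hypothetical PySpark sketch (the path, key, and column names are made up) that keeps the full history and exposes only the latest, non-deleted version of each record:
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Full history of records; each row carries lastModifiedDate and status.
history = spark.read.parquet("/data_lake/customers")  # placeholder path

latest_active = (
    history
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("customer_id")  # placeholder business key
              .orderBy(F.col("lastModifiedDate").desc())))
    .filter((F.col("rn") == 1) & (F.col("status") != "Deleted"))
    .drop("rn")
)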

How to migrate data between two tables in Cassandra properly

I have to change the schema of one of my tables in Cassandra. It cannot be done by simply using the ALTER TABLE command, because there are some changes to the primary key.
So the question is: How to do such a migration in the best way?
Using the COPY command in CQL is not an option here because the dump file can be really huge.
Can I solve this problem without creating a custom application?
As Guillaume has suggested in the comment - you can't do this directly in Cassandra. Schema-altering operations are very limited here. You have to perform such a migration manually using one of the suggested tools, OR, if you have very large tables, you can leverage Spark.
Spark can efficiently read the data from your nodes, transform it locally, and save it back to the database. Remember that such a migration requires reading the whole table's content, so it might take a while. It might be the most performant solution, but it needs more preparation - a Spark cluster setup.
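As a rough sketch with the DataStax Spark Cassandra Connector (assuming the connector is on the classpath; keyspace, table, and column names are placeholders, and the target table with the new primary key must already exist):
# Read the old table, reshape it for the new primary key, and write it
# into the new table.
old_table = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="old_table")
    .load())

migrated = old_table.select("new_partition_key", "clustering_col", "value")

(migrated.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="new_table")
    .mode("append")
    .save())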
