Writing Data to External Databases Through PySpark - apache-spark

I want to write the data from a PySpark DataFrame to external databases, say an Azure MySQL database. So far, I have managed to do this using .write.jdbc(),
spark_df.write.jdbc(url=mysql_url, table=mysql_table, mode="append", properties={"user":mysql_user, "password": mysql_password, "driver": "com.mysql.cj.jdbc.Driver" })
Here, if I am not mistaken, the only options available for mode are append and overwrite, however, I want to have more control over how the data is written. For example, I want to be able to perform update and delete operations.
How can I do this? Is it possible to say, write SQL queries to write data to the external databases? If so, please give me an example.

First I suggest you use the specific Azure SQL connector. https://learn.microsoft.com/en-us/azure/azure-sql/database/spark-connector.
Then I recommend you use bulk mode as row by row mode is slow, and can incur unexpected charges if you have log analytics turned on.
Lastly, for any kind of data transformation, you should use an ELT pattern:
Load raw data into an empty staging table
Run SQL code, or even better, a stored procedure which performs required logic (for example merging into a final table) run DML such as a stored proc

Related

Truncate tables on databricks

I'm working with two environments in Azure: Databricks and SQL Database. I'm working with a function that generate a dataframe that it's going to be used to overwrite the table that is stored in the SQL Database. I have many problems because the df.write.jdbc(mode = 'overwrite') only drops the table and, I'm guessing, my user didn't have the right permissions to created again (I've already seen for DML and DDL permission that I need to do that). In resume, my functions only drops the table but without recreating again.
We discuss about what could be the problem and we conclude that maybe the best thing that I can do is truncate the table and re-add the new data there. I'm trying to find how to truncate the table, I tried these two approaches but I can't find more information related to that:
df.write.jdbc()
&
spark.read.jdbc()
Can you help me with these? The overwrite doesn't work (maybe I don't have the adequate permissions) and I can't figure out how to truncate that table using a jdbc.
It's in the Spark documentation - you need to add the truncate when writing:
df.write.mode("overwrite").option("truncate", "true")....save()
Also, if you have a lot of data, then maybe it's better to use Microsoft's Spark connector for SQL Server - it has some performance optimizations that should allow to write faster.
You can create stored procedure for truncating or dropping in SQL Server and call that stored procedure in databricks using ODBC connection.

copy data from one spanner db to an existing spanner db

I need to find a tool or a technique that will generate insert statements from a spanner db so I can insert them into another spanner db. I need to selectively choose which insert statements or rows to migrate so the spanner export/import tool will not work. The destination db will already exist and it will have existing data in it. The amount of data is small - roughly 15 tables with 10 to 20 rows in each table. Any suggestions would be greatly appreciated.
You can use the Cloud Spanner Dataflow Connector to write your pipeline/data loader to move data in and out of Spanner. You can use a custom SQL query with the Dataflow reader to read the subset of data that you want to export.
Depending on how wide your tables are, if you are dealing with a relatively small amount of data, a simpler way to this could be using the gcloud spanner databases execute-sql command-line utility. For each of your tables, you could use the utility to run a SQL query to get the rows you want to export from the table and write the result to a file in the csv format using the --format=csv argument. Then you could write a small wrapper around Cloud Spanner Insert APIs to read the data from the CSV files and send insert mutations to the target database.

Is Presto is a data store for storing data?

I am new to work on Presto. I have some doubts regarding Presto.
Whether Presto is a data store(database)?
If it is a query engine ? Whether there is any common query syntax for accessing Hive, SQL, Cassandra data using connectors or it will accept all data source queries based on connectors ?
Where the query execution will takes place in Presto or in connected data source end?
It is a query engine. However it accesses data from many different data sources.
Yes. It is ANSI SQL. When accessing data from underlying data source then it's specific interface is used (thrift, hdfs, jdbc etc), but this is hidden from the user.
In both places. Presto is capable to push down some data filtering down to underlying data source (projection, where clauses). There is current effort to also push more parts of SQL query (see https://github.com/prestosql/presto/issues/18). Rest is evaluated in Presto.

Synchronize data lake with the deleted record

I am building data lake to integrate multiple data sources for advanced analytics.
In the begining, I select HDFS as data lake storage. But I have a requirement for updates and deletes in data sources which I have to synchronise with data lake.
To understand the immutable nature of Data Lake I will consider LastModifiedDate from Data source to detect that this record is updated and insert this record in Data Lake with a current date. The idea is to select the record with max(date).
However, I am not able to understand how
I will detect deleted records from sources and what I will do with Data Lake?
Should I use other data storage like Cassandra and execute a delete command? I am afraid it will lose the immutable property.
can you please suggest me good practice for this situation?
1. Question - Detecting deleted records from datasources
Detecting deleted records from data sources, requires that your data sources supports this. Best is that deletion is only done logically, e. g. with a change flag. For some databases it is possible to track also deleted rows (see for example for SQL-Server). Also some ETL solutions like Informatica offer CDC (Changed Data Capture) capabilities.
2. Question - Changed data handling in a big data solution
There are different approaches. Of cause you can use a key value store adding some kind of complexity to the overall solution. First you have to clarify, if it is also of interest to track changes and deletes. You could consider loading all data (new/changed/deleted) into daily partitions and finally build an actual image (data as it is in your data source). Also consider solutions like Databricks Delta addressing this topics, without the need of an additional store. For example you are able to do an upsert on parquet files with delta as follows:
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
UPDATE SET
events.data = updates.data
WHEN NOT MATCHED
THEN INSERT (date, eventId, data) VALUES (date, eventId, data)
If your solution also requires low latency access via a key (e. g. to support an API) then a key-values store like HBase, Cassandra, etc. would be helpfull.
Usually this is always a constraint while creating datalake in Hadoop, one can't just update or delete records in it. There is one approach that you can try is
When you are adding lastModifiedDate, you can also add one more column naming status. If a record is deleted, mark the status as Deleted. So the next time, when you want to query the latest active records, you will be able to filter it out.
You can also use cassandra or Hbase (any nosql database), if you are performing ACID operations on a daily basis. If not, first approach would be your ideal choice for creating datalake in Hadoop

Can i upload data from multiple datasources to azure DW at same time

Can i retrieve data from multiple data sources to Azure SQL DataWarehouse at the same time using single pipeline?
SQL DW can certainly load multiple tables concurrently using external (aka PolyBase) tables, bcp, or insert statements. As hirokibutterfield asks, are you referring to a specific loading tool like Azure Data Factory?
Yes you can, but there you have to mention a copy activity for each of the data source being copied to the azure data warehous.
Yes you can, and depending on the extent of transformation required, there would be 2 ways to do this. Regardless of the method, the data source does not matter to ADF since your data movement happens via the copy activity which looks at the dataset and takes care of firing the query on the related datasource.
Method 1:
If all your transformation for a table can be done in a SELECT query on the source systems, you can have a set of copy activities specifying SELECT statements. This is the simple approach
Method 2:
If your transformation requires complex integration logic, first use copy activities to copy over the raw data from the source systems to staging tables in the SQLDW instance (Step 1). Then use a set of stored procedures to do the transformations (Step 2).
The ADF datasets which are the output from Step1 will be input datasets to Step 2 in order to maintain consistency.

Resources