I am working with a team to build a Spark data warehouse/data platform. The approved architecture calls for the databases to be created on the fly from persisted Parquet files, which are kept up to date with current and historical data.
The data is then loaded into landing, staging and presentation layers (built on an Apache Spark pool in Azure Synapse) using Azure Synapse pipelines. As per the architecture, these layers are not persisted; they are dropped and recreated every night.
Since I did not want users to be impacted by a failure, I was looking at options to load the data into new databases every night, drop the old databases, and then rename the newly created databases. This would ensure that business users have access to the data at all times.
Question:
Is there any way to rename Spark databases in Azure Synapse?
NB: I am not sure whether this is the right format for this question, but any comments/suggestions will be much appreciated.
I am trying to load data (tabular data in tables, in a schema named 'x') from a Spark pool in Azure Synapse into Azure Machine Learning studio. I can't seem to find out how to do that. Until now I have only linked Synapse and my pool to the ML studio. How can I do this?
The Lake Database contents are stored as Parquet files and exposed via your Serverless SQL endpoint as External Tables, so you can technically just query them via the endpoint. This is true for any tool or service that can connect to SQL, like Power BI, SSMS, Azure Machine Learning, etc.
WARNING, HERE THERE BE DRAGONS: Due to the manner in which the serverless engine allocates memory for text queries, using this approach may result in significant performance issues, up to and including service interruption. Speaking from personal experience, this approach is NOT recommended. I recommend that you limit use of the Lake Database to Spark workloads or very limited investigation in the SQL pool. Fortunately there are a couple of ways to sidestep these problems.
Approach 1: Read directly from your Lake Database's storage location. This will be in your workspace's root container (declared at creation time) under the following path structure:
synapse/workspaces/{workspacename}/warehouse/{databasename}.db/{tablename}/
These are just Parquet files, so there are no special rules about accessing them directly.
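For example, from an Azure ML notebook you could read that folder directly with pandas. A minimal sketch, assuming the adlfs and pyarrow packages are installed and using placeholder names for the storage account ("mydatalake"), root container ("workspace"), workspace, database and table:

```python
# Minimal sketch: read a Lake Database table straight from storage.
# Account, container, workspace, database and table names are placeholders.
import pandas as pd
from azure.identity import DefaultAzureCredential

# Path follows the Lake Database layout described above.
path = (
    "abfss://workspace@mydatalake.dfs.core.windows.net/"
    "synapse/workspaces/myworkspace/warehouse/mydb.db/mytable/"
)

# pandas (via fsspec/adlfs) can read the directory of Parquet files directly.
df = pd.read_parquet(path, storage_options={"credential": DefaultAzureCredential()})
print(df.head())
```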
Approach 2: You can also create Views over your Lake Database (External Table) in a serverless database and use the WITH clause to explicitly assign properly sized schemas. Similarly, you can ignore the External Table altogether and use OPENROWSET over the same storage mentioned above. I recommend this approach if you need to access your Lake Database via the SQL Endpoint.
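As a rough sketch of Approach 2 from Python, you could run an OPENROWSET query with an explicit WITH schema through the serverless SQL endpoint. The endpoint name, database, credentials, storage path and column definitions below are all placeholders, so adjust them to your environment:

```python
# Minimal sketch of Approach 2: query the warehouse folder via the serverless
# SQL endpoint with an explicit, right-sized schema. All names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=sandbox;UID=sqluser;PWD=<your-password>"
)

query = """
SELECT *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/workspace/synapse/workspaces/myworkspace/warehouse/mydb.db/mytable/**',
    FORMAT = 'PARQUET'
) WITH (
    id INT,
    name VARCHAR(100)
) AS rows;
"""

for row in conn.cursor().execute(query):
    print(row)
```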
So I have data providers who will upload files using Power Apps; then an ETL job will run, read the contents of the file, and save them to a database in the cloud. I want a solution in which, when the schema of the file changes (a column is added, removed or changed), the ETL job handles it itself.
Is it possible in ADF?
Yes. ADF Mapping Data Flows support schema drift for exactly this ETL scenario: https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-schema-drift
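In ADF the drift handling is configured on the data flow source and sink rather than coded. Just to illustrate the general idea in plain Python (this is not how ADF implements it, and the column names are made up), a drift-tolerant load aligns whatever columns arrive to a known target schema:

```python
# Generic illustration of schema-drift-tolerant loading, NOT ADF's mechanism.
# Target column names are hypothetical.
import pandas as pd

TARGET_COLUMNS = ["customer_id", "amount", "created_at"]

def load_file(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    # Keep known columns, fill any that are missing, ignore new ones,
    # so added/removed columns in the source file do not break the load.
    return df.reindex(columns=TARGET_COLUMNS)
```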
My requirements are as below :
Move 3 local SAP databases to 3 Azure SQL Databases.
Then sync daily transactions or data to Azure every night. If transactions from the local DB already exist in Azure, they should be updated; if not, they should be inserted.
The local systems will not stop after moving to Azure; they will keep running for about 6 more months.
Note:
Azure Data Sync does not fit our case because of its limitations: it only supports 500 tables, cannot sync tables without primary keys, and does not sync views or procedures. It also increases database size on both sides (local and Azure).
Azure Data Factory pipelines can fulfill my requirements, but I would have to create a pipeline and procedure manually for each table (SAP has over 2,000 tables, which is not practical for me).
We don't use Azure VMs or Managed Instance.
Can you guide me to the best solution to move and sync? I am new to Azure.
Thanks all.
Since you mentioned that ADF basically meets your needs, I will start from ADF. Actually, you don't need to create a pipeline for each table manually, one by one. The creation can be done with the ADF SDK, a PowerShell script, or the REST API. Please refer to the official documentation: https://learn.microsoft.com/en-us/azure/data-factory/
So, if you can get the list of SAP table names (I found this thread: https://answers.sap.com/questions/7375575/how-to-get-all-the-table-names.html), you can loop over the list and execute code to create the pipelines in a batch, as sketched below. Only the table name property needs to be set.
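A rough sketch with the Python SDK (azure-mgmt-datafactory) could look like the following. It assumes two parameterized datasets ("SapTableDs" and "AzureSqlDs") with a "tableName" parameter already exist in the factory; the subscription, resource group, factory and table names are placeholders:

```python
# Sketch: create one copy pipeline per SAP table in a loop.
# All names below are placeholders; verify model/parameter names against
# the version of azure-mgmt-datafactory you are using.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureSqlSink, CopyActivity, DatasetReference, PipelineResource, SapTableSource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, factory = "my-rg", "my-adf"

sap_tables = ["MARA", "VBAK", "VBAP"]  # replace with the full list from SAP

for table in sap_tables:
    copy = CopyActivity(
        name=f"copy_{table}",
        inputs=[DatasetReference(reference_name="SapTableDs",
                                 parameters={"tableName": table})],
        outputs=[DatasetReference(reference_name="AzureSqlDs",
                                  parameters={"tableName": table})],
        source=SapTableSource(),
        sink=AzureSqlSink(),
    )
    client.pipelines.create_or_update(
        rg, factory, f"pl_copy_{table}", PipelineResource(activities=[copy])
    )
```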
I have several external data APIs that I access using some Python scripts. My scripts run from an on-premises server, transform the data, and store it in a SQL Server database on the same server. I suppose it's a rudimentary ETL system run with Python and T-SQL.
The system is about to grow quite a bit with new APIs and will require more complex data pipelines (for example, some of the API data will be spun off to more than one table). I think this would be a good time to move the system onto Azure (we are heavily integrated with Microsoft so it will have to be Azure!).
I have spent a few days researching the Azure products that would let me run Python scripts to access data from web APIs and store the processed data in a cloud database. I'm looking for advice on what sort of Azure products other people have used for similar jobs. At the moment it seems I will need:
Azure SQL Database to hold the processed data that can be accessed by various colleagues.
Azure Data Factory to manage, log, and schedule the pipeline jobs and to run my custom Python scripts (is this even possible?).
Azure Batch to run the aforementioned Python scripts but I'm not sure about this.
Basically I want to put together a proposal and start thinking about costs, but it would be good to hear from someone who has done something similar: am I on the right track or completely off? Should I just stay on-premises? Thank you in advance.
Azure SQL Database and Azure SQL Data Warehouse are good for relational data. If you want to use NoSQL, you could go with Azure Cosmos DB, and if you want to store data as files, you could use Azure Data Lake.
For the Python scripts, you could use a Custom Activity or Databricks with Azure Data Factory.
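As a rough sketch of the kind of script a Custom Activity (running on Azure Batch) or a Databricks notebook would execute for your scenario: pull from a web API, transform, and write to Azure SQL Database. The API URL, table name and connection string are placeholders:

```python
# Sketch of an API-to-Azure-SQL load step; all endpoints and credentials
# below are placeholders.
import pandas as pd
import requests
import sqlalchemy

API_URL = "https://api.example.com/v1/readings"
SQL_URL = ("mssql+pyodbc://user:password@myserver.database.windows.net/mydb"
           "?driver=ODBC+Driver+18+for+SQL+Server")

def run() -> None:
    # Pull raw records from the external API.
    records = requests.get(API_URL, timeout=30).json()
    df = pd.DataFrame(records)

    # Example transformation: normalize column names for the target table.
    df = df.rename(columns=str.lower)

    # Append the processed rows to the Azure SQL Database table.
    engine = sqlalchemy.create_engine(SQL_URL)
    df.to_sql("api_readings", engine, if_exists="append", index=False)

if __name__ == "__main__":
    run()
```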
Azure SQL Data Warehouse should be used if the amount of data you want to load is in the petabyte range. Also, Azure SQL Data Warehouse is not meant for complex transformations; I would recommend it for plain data loads with PolyBase.
I am working for an energy provider company. Currently, we are generating 1 GB of data per day in the form of flat files. We have decided to use Azure Data Lake Store to store our data, and we want to do batch processing on it on a daily basis. My question is: what is the best way to transfer the flat files into Azure Data Lake Store? And after the data is pushed into Azure, is it a good idea to process it with HDInsight Spark, e.g. the DataFrame API or Spark SQL, and finally visualize it with Azure?
For a daily load from a local file system I would recommend using Azure Data Factory version 2. You have to install a self-hosted Integration Runtime on premises (more than one for high availability) and consider several security topics (local firewalls, network connectivity, etc.). Detailed documentation can be found here, and there are also some good tutorials available. With Azure Data Factory you can trigger your upload to Azure with a Get Metadata activity and use, for example, an Azure Databricks Notebook activity for further Spark processing.
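The Spark processing step itself (whether on HDInsight or in the Databricks notebook mentioned above) would mostly be DataFrame API or Spark SQL over the files landed in the lake. A minimal sketch, with placeholder storage account, container, path and column names:

```python
# Sketch of a daily batch job over flat files landed in ADLS Gen2.
# Account, container, path and column names are placeholders; the cluster is
# assumed to already have access to the storage account configured.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw_path = "abfss://landing@mydatalake.dfs.core.windows.net/daily/2024-01-01/*.csv"

df = spark.read.option("header", "true").csv(raw_path)

# Example batch aggregation with the DataFrame API.
daily_usage = (
    df.withColumn("consumption_kwh", F.col("consumption_kwh").cast("double"))
      .groupBy("meter_id")
      .agg(F.sum("consumption_kwh").alias("total_kwh"))
)

# Persist the result for downstream visualization (e.g. Power BI).
daily_usage.write.mode("overwrite").parquet(
    "abfss://curated@mydatalake.dfs.core.windows.net/daily_usage/2024-01-01/"
)
```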