Disclaimer: this isn't strictly a code question, but it is directly related to one.
I find it difficult to handle scenarios like this in Databricks, where there is no shell prompt, only notebooks. I have two clusters on Azure, dev and prod, and the databases and tables in each environment can only be accessed through that environment's Databricks notebooks.
The problem arises when I want to:
Query data in dev from the prod environment, and vice versa. From a SQL prompt alone, this seems impossible to achieve.
Populate a dev table from a prod table; there's no way to establish a connection from within the dev notebook to query a table in the prod environment.
The workaround I've established for now to copy prod data into dev is:
Download a full dump from production as a CSV to my local machine.
Upload it to DBFS in the dev environment.
Create a temp table from the CSV, or insert it directly into the dev table.
Any comments on how I can remove this download-upload process and query prod directly from the dev notebook?
The DBFS root is not really a production-grade solution; it's recommended that you always mount external storage (e.g. Azure Storage, either Blob or ADLS Gen2) and use it to store your tables.
If you use external storage, the problem becomes quite simple: all you have to do is mount the production storage on the dev cluster, and you can access it, since tables can be defined both over the DBFS root and over mounted data sources. You can then have a notebook that copies data from one to the other (and, hopefully, does all of the data anonymization/sampling that you need), as in the sketch below. You can also set up a more explicit process for this using Azure Data Factory, in most cases with only a simple copy activity.
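A minimal sketch of that pattern in a dev notebook, assuming ADLS Gen2 and a service principal with read access to the prod storage; the storage account, container, secret scope, table names and the Delta format are all placeholders/assumptions to adapt to your setup:

```python
# Hypothetical names throughout: replace the storage account, container, secret
# scope, service principal and table names with your own.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="prod-scope", key="sp-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the production container into the dev workspace's DBFS namespace.
dbutils.fs.mount(
    source="abfss://data@prodstorage.dfs.core.windows.net/",
    mount_point="/mnt/prod",
    extra_configs=configs,
)

# Read a prod table from the mount (format assumed to be Delta here) and write it
# into a dev table, applying whatever sampling/anonymization you need in between.
prod_df = spark.read.format("delta").load("/mnt/prod/tables/customers")
prod_df.sample(0.1).write.mode("overwrite").saveAsTable("dev_db.customers")
```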
Related
I am trying to load data (tabular data in tables, in a schema named 'x') from a Spark pool in Azure Synapse. I can't seem to find how to do that. So far I have only linked Synapse and my pool to the ML studio. How can I do that?
The Lake Database contents are stored as Parquet files and exposed via your Serverless SQL endpoint as External Tables, so you can technically just query them via the endpoint. This is true for any tool or service that can connect to SQL, like Power BI, SSMS, Azure Machine Learning, etc.
WARNING, HERE THERE BE DRAGONS: Due to the manner in which the serverless engine allocates memory for text queries, using this approach may result in significant performance issues, up to and including service interruption. Speaking from personal experience, this approach is NOT recommended. I recommend that you limit use of the Lake Database to Spark workloads, or to very limited investigation via the SQL pool. Fortunately, there are a couple of ways to sidestep these problems.
Approach 1: Read directly from your Lake Database's storage location. This will be in your workspace's root container (declared at creation time) under the following path structure:
synapse/workspaces/{workspacename}/warehouse/{databasename}.db/{tablename}/
These are just Parquet files, so there are no special rules about accessing them directly.
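For example, a minimal sketch of reading those files with a Spark session; the storage account, workspace, schema, and table names are placeholders:

```python
# Hypothetical path: substitute your workspace's root container and storage
# account, plus the database ('x' in the question) and table you want.
path = (
    "abfss://root@myworkspacestorage.dfs.core.windows.net/"
    "synapse/workspaces/myworkspace/warehouse/x.db/mytable/"
)

# The Lake Database tables are plain Parquet under this path, so a normal
# Parquet read is all that's needed.
df = spark.read.parquet(path)
df.show(5)
```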
Approach 2: You can also create Views over your Lake Database (External Table) in a serverless database and use the WITH clause to explicitly assign properly sized schemas. Similarly, you can ignore the External Table altogether and use OPENROWSET over the same storage mentioned above. I recommend this approach if you need to access your Lake Database via the SQL Endpoint.
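As a rough sketch of the OPENROWSET variant, queried here from Python over the serverless endpoint; pyodbc, the server name, storage URL, and column definitions are all assumptions to adapt to your workspace:

```python
import pyodbc

# Hypothetical connection to the workspace's serverless ("on-demand") SQL endpoint.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)

# The WITH clause sizes the text columns explicitly instead of the default
# VARCHAR(8000), which is what drives the memory issues mentioned above.
query = """
SELECT *
FROM OPENROWSET(
    BULK 'https://myworkspacestorage.dfs.core.windows.net/root/synapse/workspaces/myworkspace/warehouse/x.db/mytable/**',
    FORMAT = 'PARQUET'
) WITH (
    id   INT,
    name VARCHAR(100)
) AS rows;
"""

for row in conn.execute(query).fetchmany(5):
    print(row)
```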
I have a backup file that came from Server A. I copied that .bak file to my local machine and restored the database in SQL Server Management Studio. After setting it up, I deployed it to Azure SQL Database. But now there have been changes to the data in Server A, because it's still being used, so I need to get all of those changes into the Azure SQL Database I just deployed. How am I going to do that?
Note: I'm using Azure for my server and I have a local copy of the Server A database. Basically, in terms of data and structure, my local copy and the earlier Server A DB are the same. But after a few days the Server A data has been updated, while my local DB is still the same as when I took the backup from Server A.
How can I update the DB in Azure so that it picks up all the changes made in Server A?
You've got a few choices; it's really a question of which data you're going to migrate. Let's say it's a neat, complete replacement. In that case, I'd suggest looking at the bacpac mechanism. That's a way to export a database, its structure and data, and then import it into a new location; it's one common mechanism for moving to Azure.
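For instance, a sketch of that flow using sqlpackage, driven from Python here just for consistency with the other examples (server names, database names, and credentials are placeholders; you'd normally run these commands directly from a shell):

```python
import subprocess

# Export the source database (structure + data) to a .bacpac file.
subprocess.run([
    "sqlpackage", "/Action:Export",
    "/SourceServerName:ServerA",
    "/SourceDatabaseName:MyDb",
    "/TargetFile:MyDb.bacpac",
], check=True)

# Import the .bacpac into Azure SQL Database. The import target must be a new
# (or empty) database, so you import under a new name and then swap.
subprocess.run([
    "sqlpackage", "/Action:Import",
    "/TargetServerName:myserver.database.windows.net",
    "/TargetDatabaseName:MyDb_v2",
    "/TargetUser:sqladmin",
    "/TargetPassword:<password>",
    "/SourceFile:MyDb.bacpac",
], check=True)
```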
If you can't simply replace everything, you need to look at other options. First, there's SSIS. You can build a pipeline to move the data you need. There's also export and import through sqlcmd, which can connect to Azure SQL Database. You can also look to a third party tool like Redgate SQL Data Compare as a way to pick and choose the data that gets moved. There are a whole bunch of other possible Extract/Transform/Load (ETL) tools out there that can help.
Do you want to sync schema changes as well as data changes, or just data? If it is just data, then the best service to use would be Azure Database Migration Service, which can copy the data delta to Azure incrementally, in both online and offline modes, and you can also decide on the schedule.
Copying data between various instances of ADLS using DISTCP
Hi All
Hope you are doing well.
We have a use case around using ADLS for the different tiers of our ingestion process, and we'd appreciate your opinions on its feasibility.
INFRASTRUCTURE: There will be two ADLS instances, named LAND and RAW. The LAND instance will receive files directly from the source, while the RAW instance will receive files once validations have passed in the LAND instance. We also have a Cloudera cluster hosted on Azure, which will have connectivity established to both ADLS instances.
PROCESS: A set of data and control files will land in one of the ADLS instances (say LAND). We need to run Spark code on the Cloudera cluster to perform a count validation between the data and control files present in the LAND instance. Once the validation is successful, we want a distcp command to copy the data from the LAND instance to the RAW instance. We are assuming that the DistCp utility will already be installed on the Cloudera cluster.
Can you suggest whether the above approach looks fine?
Primarily, our question is whether the DistCp utility supports data movement between two different ADLS instances.
We also considered other options like AdlCopy, but DistCp appeared better.
NOTE: We haven't considered using Azure Data Factory, since it may have certain security challenges, though we know Data Factory is well suited to the above use case.
If your use case requires you to copy data between multiple storage accounts, DistCp is the right way to do this.
Note that even if you were to encapsulate this solution in Data Factory, the pipeline's copy activity would invoke DistCp under the hood.
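To make the flow concrete, here is a rough sketch of the validate-then-copy step run from the Cloudera cluster. The paths, file formats, control-file layout, and URI scheme are assumptions (ADLS Gen1 uses adl:// URIs, Gen2 uses abfss://):

```python
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("land-to-raw").getOrCreate()

# Hypothetical layout: a data directory plus a control file whose first line
# holds the expected row count.
land_data = "adl://landinstance.azuredatalakestore.net/incoming/batch1/data/"
land_ctrl = "adl://landinstance.azuredatalakestore.net/incoming/batch1/control.txt"
raw_data  = "adl://rawinstance.azuredatalakestore.net/validated/batch1/"

actual = spark.read.csv(land_data, header=True).count()
expected = int(spark.read.text(land_ctrl).first()[0])

if actual != expected:
    raise ValueError(f"Count validation failed: control says {expected}, data has {actual}")

# Validation passed: DistCp runs as a MapReduce job and can copy between the two
# ADLS accounts as long as the cluster holds credentials for both.
subprocess.run(["hadoop", "distcp", land_data, raw_data], check=True)
```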
I have a directory on Azure Data Lake mounted to an Azure Databricks cluster. Browsing the file system using the CLI tools, or just running dbutils through a notebook, I can see that there are files and data in that directory. Further, executing queries against those files works: data is successfully read in and written out.
I can also successfully browse to the root of my mount ('/mnt', just because that's what the documentation used here: https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake.html) in the 'Create New Table' UI (via Data -> Add Table -> DBFS).
However, there are no subdirs listed under that root directory.
Is this a quirk of DBFS? A quirk of the UI? Or do I need to reconfigure something to allow me to add tables via that UI?
The Data UI currently does not support mounts; it only works with the internal DBFS, so there is no configuration option for this. If you want to use this UI for data upload (rather than, e.g., Storage Explorer), the only solution is to move the data afterwards from the internal DBFS to the mount directory via dbutils.fs.mv.
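A minimal sketch of that workaround; the upload filename and mount directory are placeholders, and /FileStore/tables is assumed as the location where the Create New Table UI drops uploads:

```python
# Move the file uploaded through the Data UI from internal DBFS onto the ADLS mount.
dbutils.fs.mv(
    "dbfs:/FileStore/tables/my_upload.csv",   # default upload location of the UI
    "dbfs:/mnt/datalake/raw/my_upload.csv",   # directory backed by the ADLS mount
)
```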
We're trying to get information about Azure and/or AWS in terms of their ability to create snapshots of data drives that are writable and can be attached to VMs.
Currently we use a model with our test environments on-prem, where we have a clone of a set of production databases/logs on drives that are quite large (+2TB) on our EMC SAN. Instead of making full copies of the clone for each test environment DB server, we use EMC VNX redirect-on-write snapshots. This allows us to quickly provision the DB server VM in the test environment without having to make a full copy of the DB/logs, and saves on SAN space, as only the delta from new writes to the snapshot are stored as new data. This works really well as we only need one full copy of the source DBs/logs.
Does anyone know if Azure or AWS has the ability to do something similar, or a reasonable alternative? Making full copies of the databases/logs for each test environment is not really an option for us. We started looking at the Azure SQL Database copy feature, but we were not sure whether it creates full copies or writable snapshots.
Thanks in advance.
Does anyone know if Azure or AWS has the ability to do something similar or a reasonable alternative?
Azure VM disks use Azure page blobs to store data. Currently, a snapshot of an Azure blob can be read, copied, or deleted, but not modified.
I am sorry to say that Azure doesn't provide anything similar that fits your requirement. In Azure, we need to copy the whole blob (e.g. with AzCopy) to get a new, writable blob.
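For completeness, a sketch of that copy step using the azure-storage-blob Python SDK instead of AzCopy; the container, blob, and snapshot names are placeholders, and depending on your auth setup the copy source may need a SAS token appended to its URL:

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")

# Point at an existing snapshot of the data disk's page blob...
snapshot_blob = service.get_blob_client(
    container="vhds", blob="datadisk.vhd", snapshot="<snapshot-timestamp>"
)

# ...and copy it to a new, regular (writable) blob that a VM can attach.
new_blob = service.get_blob_client(container="vhds", blob="datadisk-testenv.vhd")
new_blob.start_copy_from_url(snapshot_blob.url)
```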