How to query Azure Data Lake? [closed]

Coming from the database world, whenever we have something related to data we use a UI tool to query it, be it big or small.
Is there anything like SSMS, SQL Workbench (for big data on Redshift), or Athena (for querying big data on S3) for Azure Data Lake?
I see that Data Lake Analytics just queries the data and stores the result in a file. Is there any way to query the data on Azure Data Lake via a UI tool or web-based tool?

No, there is not (yet). Sure, you can run a query using the portal, Visual Studio (docs), or Visual Studio Code (docs), but all of those tools give you access to the generated output file (which can easily be downloaded or previewed).
The main reason is that U-SQL / Data Lake Analytics is geared toward long-running jobs (which can take anywhere from a few minutes to hours) that process vast amounts of data. Keeping that in mind, you can hopefully better understand why this kind of direct-query tooling is not (yet?) available.
EDIT: try upvoting this on the feedback site. What you are asking for is a highly requested feature.

You can download Azure Storage Explorer from here: https://azure.microsoft.com/en-us/features/storage-explorer/
Upload, download, and manage Azure blobs, files, queues, and tables, as well as Azure Cosmos DB and Azure Data Lake Storage entities. Easily access virtual machine disks, and work with either Azure Resource Manager or classic storage accounts. Manage and configure cross-origin resource sharing rules.

You can create an external table in SQL Server pointing to Data Lake files; the only thing is that you have to take care of schema changes manually. A rough sketch is shown below.
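For illustration only (not from the original answer), here is a sketch of that setup. It assumes SQL Server 2019 or later with PolyBase enabled and a database-scoped credential named adls_cred that can reach the storage account; every server, storage, table, and column name is a placeholder, and the exact DDL options vary by SQL Server version.

```python
# Rough sketch: expose CSV files in Azure Data Lake Storage Gen2 as an external
# table in SQL Server, driven from Python with pyodbc. Assumes PolyBase is
# enabled and a database-scoped credential named adls_cred already exists.
# The column list must be kept in sync with the CSV layout by hand (the
# "schema changes manually" caveat above). All names are placeholders.
import pyodbc

DDL = """
CREATE EXTERNAL DATA SOURCE adls_source
WITH (
    -- exact options (e.g. TYPE) vary by SQL Server version
    LOCATION = 'abfss://data@mystorageaccount.dfs.core.windows.net',
    CREDENTIAL = adls_cred
);

CREATE EXTERNAL FILE FORMAT csv_format
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
);

CREATE EXTERNAL TABLE dbo.SalesExternal (
    SaleId   INT,
    Amount   DECIMAL(18, 2),
    SaleDate DATE
)
WITH (
    LOCATION = '/sales/',            -- folder of CSV files in the container
    DATA_SOURCE = adls_source,
    FILE_FORMAT = csv_format
);
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.example.com;DATABASE=mydb;UID=myuser;PWD=mypassword"
)
conn.autocommit = True                    # simplest for DDL batches
for statement in DDL.split(";\n\n"):      # run each statement on its own
    if statement.strip():
        conn.cursor().execute(statement)
conn.close()
```

Once created, dbo.SalesExternal can be queried with ordinary T-SQL (SELECT ... FROM dbo.SalesExternal), which is what makes it feel like a regular table despite the data living in the lake.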

You can use Spark SQL through Azure Databricks to query Azure Data Lake files, for example as in the sketch below.
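A minimal PySpark sketch of that idea (the storage account, container, and path are placeholders; it assumes the cluster already has access to the storage account, e.g. via credential passthrough or a service principal):

```python
# Minimal sketch: query CSV files in ADLS Gen2 with Spark SQL from an Azure
# Databricks notebook. Assumes the cluster already has access to the storage
# account; the account, container, and path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in Databricks

path = "abfss://data@mystorageaccount.dfs.core.windows.net/sales/*.csv"

# Register the files as a temporary view so they can be queried with plain SQL.
df = spark.read.option("header", "true").option("inferSchema", "true").csv(path)
df.createOrReplaceTempView("sales")

spark.sql("""
    SELECT SaleDate, SUM(Amount) AS total_amount
    FROM sales
    GROUP BY SaleDate
    ORDER BY SaleDate
""").show()
```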

Related

Design Data Processing Solution in Azure [closed]

We have large numbers of CSV files which arrive on a dedicated drive (e.g. D:) on a daily basis. A set of SSIS packages then picks up those files, performs transformations on them, and ingests the results into several tables in a database. Logging and error handling also exist.
As we are exploring a possible move to the cloud (Azure in particular), we went for a lift-and-shift scenario at the beginning. In this approach, we simply deployed the same SSIS packages to Azure SQL Server, created Azure Data Factory (ADF) pipelines, and ran those packages from there.
We would now like to refactor our solution and replace the SSIS packages with cloud-native Azure services.
My questions are:
Based on the scenario explained in the first paragraph, is this considered a batch-processing scenario? Does Azure Batch fit in as a potential service to use, or would it be more efficient to use Azure Databricks with ADF?
Below are the solution environment and the main tasks on-premises. I would like to have a comparison between what we do in SSIS and its counterpart in the Azure world:
| Item | On-Premise World | Azure World |
| --- | --- | --- |
| Storage to receive CSV files | Normal Disk Drive D:\ | ? |
| CSV File Processing | SSIS -> Data Flow -> Script Component | ? |
| Ingest to Destination Table | SSIS -> Data Flow -> OLE DB Destination | ? |
| Custom Scripting | Script Task & Script Component | ? |
| Database | SQL Server | ? |
Any recommendations, best practices, or approaches used in similar migration projects?
You could use Azure Data Factory for the ETL part. (In fact, it even supports your existing SSIS packages.)
I don't think Azure Batch is the right choice in this case, although you can use it. Azure Batch is used more for compute-intensive processes, e.g. rendering 3D art.
Azure Synapse Analytics is probably a good fit for this. You could stand up the individual products (e.g. SQL DB, Azure Data Factory, etc.), but you get easier integration between the components with Synapse:
| Item | On-Premise World | Azure World |
| --- | --- | --- |
| Storage to receive CSV files | Normal Disk Drive D:\ | Azure Data Lake Gen 2 |
| CSV File Processing | SSIS -> Data Flow -> Script Component | Polybase |
| Ingest to Destination Table | SSIS -> Data Flow -> OLE DB Destination | CTAS |
| Custom Scripting | Script Task & Script Component | Synapse Notebooks |
| Database | SQL Server | Dedicated SQL Pools |
Components
Azure Data Lake Gen 2 - when you provision a Synapse workspace you will have the option to provision data lake storage if you don't already have it
Polybase - or external tables; allows you to create virtual tables over your .csv files stored in the lake for easy reading
CTAS - or CREATE TABLE AS SELECT; use CTAS to materialise your csv files from the data lake into physical tables in your database (see the sketch below)
Synapse Notebooks - use Synapse Notebooks for very custom processing, basically things you can't already do easily with SQL or Synapse pipelines. These even support C#, so you could do more of a lift-and-shift for certain pieces of code, although some customisation would definitely be required
Dedicated SQL pools - scalable MPP database with pause and resume.
If you have < 1 TB of data, you should consider Azure SQL DB and standalone ADF instead.
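To make the Polybase and CTAS rows concrete, here is a rough sketch of that load pattern against a dedicated SQL pool, driven from Python with pyodbc. It assumes an external data source (adls_source) and file format (csv_format) have already been created (as in the SQL Server sketch earlier); the workspace, database, credentials, and table/column names are all placeholders, not anything from the original question.

```python
# Rough sketch of the Polybase -> CTAS pattern in a Synapse dedicated SQL pool,
# run via pyodbc. Assumes an external data source (adls_source) and file format
# (csv_format) already exist; all names are placeholders.
import pyodbc

SQL = """
-- Virtual table over the raw CSV files in the lake (Polybase / external table)
CREATE EXTERNAL TABLE dbo.SalesStaging (
    SaleId   INT,
    Amount   DECIMAL(18, 2),
    SaleDate DATE
)
WITH (
    LOCATION = '/landing/sales/',
    DATA_SOURCE = adls_source,
    FILE_FORMAT = csv_format
);

-- Materialise the external data into a physical, distributed table (CTAS)
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT SaleId, Amount, SaleDate
FROM dbo.SalesStaging;
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;DATABASE=mypool;UID=myuser;PWD=mypassword"
)
conn.autocommit = True
for statement in SQL.split(";\n\n"):   # run each statement on its own
    if statement.strip():
        conn.cursor().execute(statement)
conn.close()
```

The same pattern can also be orchestrated from a Synapse pipeline or notebook instead of an external Python script; the SQL itself is what matters here.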

Azure Datalake on-premise or hybrid stack

We are trying to evaluate a good fit for our solution. We want to process big data, and for that we want to build the solution around the Hadoop stack. We wanted to know how Azure can help in these situations. The solution we are building is a SaaS, but some of our clients have confidential data which they want to keep only on their premises.
So can we run Azure Data Lake on-premises for those clients?
Can we have a hybrid model where the storage is on-premises but the processing is done in the cloud?
The reason we are asking is to address the questions of scalability and reliability.
I know this is vague but if you need more clarification please let us know.
Azure Data Lake (Gen2) hierarchical filesystem support in Azure Stack would enable you to use it natively for your storage requirements. Unfortunately, Azure Stack does not currently support Azure Data Lake.
You can find the list of available services here: https://azure.microsoft.com/en-gb/overview/azure-stack/keyfeatures/. Some of the big data ecosystem Azure tools are in development but not yet generally available.
There is a feature request to add this support. You can vote this up to help Microsoft prioritize this. https://feedback.azure.com/forums/344565-azure-stack/suggestions/38222872-support-adls-gen2-on-azure-stack

How can I backup an Azure Cosmos DB [closed]

I have an Azure Cosmos DB and I need to delete all the resources from this subscription. Is there any way to take an offline backup from the portal?
UPDATE: Cosmos DB now supports backup; see Online backup and on-demand data restore in Azure Cosmos DB.
You can use the Data Migration Tool suggested in the Automatic online backup and restore with Azure Cosmos DB article to do the same.
There is no way to take a backup and import it into Azure Cosmos DB.
The recommendation is to open a support ticket (e.g. via the Azure Portal) or call Azure Support to streamline the backup/restore strategy and to request that Azure restore the latest backup in case of a disaster event. In addition, you can contact the Azure Cosmos DB team by sending an email to AskCosmosDB@microsoft.com.
Sajeetharan's answer is wrong.
mongodump --uri="PRIMARY_CONNECTION_STRING"
Use this command. It will create a dump in your current working directory.
MS has finally introduced a backup policy feature for Cosmos DB! See https://learn.microsoft.com/en-us/azure/cosmos-db/online-backup-and-restore#modify-the-backup-interval-and-retention-period
I guess this removes the need for third-party and/or custom tools to do such a basic Ops routine.
Automatic backup is not free... it requires a Standard support plan ($100/mo).
I'm using the free Azure Cosmos DB Data Migration Tool.
EDIT: thanks @e2eDev for the link: Azure Cosmos DB Data Migration Tool 1.8.3 on GitHub.
There is both a GUI tool (dtui.exe) and a CLI tool (dt.exe).
It supports many protocols and even JSON format (for both import and export).

How is Azure Storage Tables implemented?

I'm the type of developer that likes to understand the whole stack, and viewing Azure Storage Tables as a black box makes me uncomfortable.
RDBMS is an entire field of study in computer science. The components necessary to support ACID operations and query optimizations, down to the details of the B-trees used to create indexes, are essentially a well-documented, solved problem.
Apache HBase and MongoDB are open source, and Google has published multiple papers on BigTable, but I can't find anything on Microsoft's Azure Storage Tables other than usage/developer documentation. Has Microsoft published any details on the actual implementation (algorithms, data structures, and infrastructure) behind Azure Storage Tables?
The Azure Storage team presented a paper at SOSP11 describing the inner workings of the Azure Storage Service (including the Table Services).

What tools can provide scheduled backups of Azure blob storage? [closed]

I'm looking for the best way to prevent accidental deletion by IT - perhaps copying to disk, to a separate Azure Storage account, or to Amazon. What tools can do this? Redgate Cloud Services seems like the closest fit for what I want, but it seems to require configuration per container. I know some other tools like Cloud Storage Studio and Azure Sync Tool exist, but I don't think they support scheduled backups of blob storage.
Windows Azure Storage is backed by geo-replication, which means there are a total of six copies of your data at any given time. There is no built-in service in Windows Azure to back up data from Azure Storage to a location outside Azure Storage or to a user-defined location.
Windows Azure Storage is managed through a RESTful interface, so third-party vendors have created applications for such purposes. Besides the above, I have had the chance to use the Gladinet Cloud Backup solution, which could be useful in your case. Based on my experience, there are a few backup tools available, but not a single one is a perfect match for everybody's expectations.
A cheap way to prevent accidental deletion by IT is to snapshot the blobs into a backup container. IT would have to be very persistent and delete all of the snapshots taken of the original blob in order to accidentally delete it.
"A blob that has snapshots cannot be deleted unless the snapshots are also deleted. You can delete a snapshot individually, or tell the storage service to delete all snapshots when deleting the source blob. If you attempt to delete a blob that still has snapshots, your call will return an error."
http://msdn.microsoft.com/en-us/library/windowsazure/hh488361
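The quoted behaviour is from the classic (2012-era) documentation, but the in-place snapshot approach it describes still works today. Here is a minimal sketch using the current azure-storage-blob (v12) Python SDK; the connection string and container name are placeholders, and you would schedule the script yourself (e.g. with an Azure Functions timer trigger or cron) to get regular snapshots.

```python
# Minimal sketch: snapshot every blob in a container so it cannot be deleted
# without first deleting its snapshots. Uses the azure-storage-blob (v12) SDK;
# connection string and container name are placeholders.
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "<storage-account-connection-string>"
CONTAINER = "important-data"

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container = service.get_container_client(CONTAINER)

for blob in container.list_blobs():
    blob_client = container.get_blob_client(blob.name)
    snapshot = blob_client.create_snapshot()   # read-only point-in-time copy
    print(f"Snapshotted {blob.name} at {snapshot['snapshot']}")
```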
CloudBerry Backup: it supports Amazon S3, Azure, Google, and many more cloud storage providers.
http://www.cloudberrylab.com/amazon-s3-microsoft-azure-google-storage-online-backup.aspx
