Azure CosmosDB - Download all documents in collection to local directory

I am trying to download all the documents in my Cosmos DB collection to a local directory. I want to modify a few things in all of the JSON documents using Python, then upload them to another Azure account. What is the simplest, fastest way to download all of the documents in my collection? Should I use the Cosmos DB emulator? I've been told to check out Azure Data Factory; would that help with downloading files locally? I've also been referred to the Cosmos DB Data Migration Tool, and I saw that it facilitates importing data into Cosmos DB, but I can't find much on exporting. I have about 6 GB of JSON files in my collection.
Thanks.

In the past I've used the DocumentDb (CosmosDb) Data Migration Tool which is available for download from Microsoft.
When running the app you need to specify a source and a target.
Make sure that you choose to Import from DocumentDb and specify the connection string and collection you want to export from. If you want to dump the entire contents of your collection the query would just be
SELECT * FROM c
Then under the Target Information you can choose a JSON file which will be saved to your local hard drive. You're free to modify the contents of that file in any way and then use it as Source Information later when you're ready to import it back to another collection.
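For the "modify the contents of that file" step, here is a minimal Python sketch, assuming the tool wrote a single JSON array of documents. The function names, the file paths and the example update are hypothetical. The underscore-prefixed keys are the system properties Cosmos DB adds to stored documents; stripping them before importing into another collection is my assumption about what you would want, not something the tool requires.

```python
import json

# System properties that Cosmos DB adds to every stored document.
SYSTEM_KEYS = {"_rid", "_self", "_etag", "_attachments", "_ts"}


def clean_document(doc, updates=None):
    """Return a copy of doc without system properties, with updates applied."""
    cleaned = {k: v for k, v in doc.items() if k not in SYSTEM_KEYS}
    cleaned.update(updates or {})
    return cleaned


def transform_export(src_path, dst_path, updates=None):
    """Read an exported JSON array, clean each document, write a new array."""
    with open(src_path, encoding="utf-8") as src:
        docs = json.load(src)
    cleaned = [clean_document(d, updates) for d in docs]
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(cleaned, dst, ensure_ascii=False)
    return len(cleaned)
```

Something like transform_export("export.json", "modified.json", {"migrated": True}) would then produce the file to use as Source Information. Note that json.load reads the whole array into memory, so for an export of several gigabytes you would probably want to split the export into several smaller files first (for example with narrower queries).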

I used the migration tool and found that it is great if you have a reasonably sized database, as it uses processing and bandwidth for a considerable period. I had to chunk a 10 GB database and that took too long, so I ended up using Data Lake Analytics to transfer the data via script to SQL Server and Blob Storage. It gives you a lot of flexibility to transform the data and store it either in Data Lake or in other distributed systems. It also helps if you are using Cosmos DB for staging and need to run the data through any cleaning algorithms.
The other advantages are that you can set up batching and you get a lot of processing stats to determine how to optimize large data transformations. Hope this helps. Cheers.

Related

Copy millions of files from root AZStorage Blob to subfolders

I’ve got multiple Azure storage blob containers, each with over 1M JSON files in the root. They are impossible to work with (no shocker), so I'm trying to use Data Factory to move them into multiple folders, using a timestamp in the files to create a YYYY-MM-DD/HH folder setup as a partition system. But every approach I’ve tried fails with timeouts or too-many-items limits. I need to open each file, get the timestamp, and use it to move the file to a dynamic path built from the timestamp data. Ideas?
UPDATE: I was able to get around this, but I wouldn't call it an "answer", so I'll just update the question. To create smaller collections, I parameterized the pipeline to accept a file name wildcard. I then created another pipeline that uses an array of 0-9,a-z and passes each value as a parameter on the dataset. Brute force workaround... I assume there's got to be a better solution, but this works for now.
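To make the workaround concrete, this is roughly the idea in Python rather than in Data Factory itself. It is only a sketch: it assumes the file names start with a digit or a lowercase letter and end in .json, and the helper names are made up for illustration.

```python
import fnmatch
import string

# One leading character per pipeline run: "0".."9" then "a".."z".
PREFIXES = string.digits + string.ascii_lowercase


def wildcard_patterns(suffix="*.json"):
    """Return one file name wildcard per prefix, e.g. '0*.json'."""
    return [prefix + suffix for prefix in PREFIXES]


def select(names, pattern):
    """Return the names a single wildcard would pick up (case-sensitive)."""
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]
```

Each of the 36 patterns then covers roughly 1/36 of the container if the names are evenly distributed (GUID-like names would be), which is what keeps each run under the item limits.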
Read doc: Move data to and from Azure Blob storage
The following articles describe how to move data to and from Azure Blob storage using different technologies.
Azure Storage-Explorer
AzCopy
Python SDK (others: .NET, Java, Node.js, Go, PHP, Ruby)
SSIS
In your case, I would suggest using an SDK, which is available for .NET, Java, Node.js, Python, Go, PHP, and Ruby.
Believe me, if you want to migrate your data out of Azure Blob storage, Data Factory is not a good way; it makes the problem more complicated.
(This is my suggestion after migrating over 100 million JSON files (over 2 TB) from Azure Blob storage.)
If you have time... I would do the following:
Create an Azure Function that reads the file, gets your timestamp, and does the move operation. Scope the function to a single file. Then use events (Event Grid) in the storage account to trigger the function when a blob is created. That way any new file is moved to the right spot. (Remember that in the Consumption model you need to reach a million executions before functions start billing, so this is a low-cost option.)
For the current files, create another function (or, if you want more control, use a Logic App, though it will cost a bit more) and set the parallelism on the function or Logic App to a low amount (to keep an eye on your executions). It runs a simple for-each with limits that calls your first function. This will slowly move your files out of that container, eventually getting you to a reasonable item count to work with in tools like ADF. This might just solve your problem for the long run, as any new files will be categorized accordingly and your backlog is slowly moved as required. If you need to update a DB with a pointer to where each file lives, you could put that piece of code in your function or Logic App as well. Just my two cents :)
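The "get your timestamp" part of such a function could look something like the sketch below. It assumes each file is a JSON object with an ISO 8601 value in a field called timestamp (both the field name and the format are guesses about your data), and it only computes the YYYY-MM-DD/HH destination path; the actual copy and delete would go through the storage SDK and are not shown.

```python
import json
from datetime import datetime


def destination_path(blob_name, content, timestamp_field="timestamp"):
    """Build 'YYYY-MM-DD/HH/<name>' from a timestamp inside the JSON body."""
    doc = json.loads(content)
    raw = doc[timestamp_field]
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    if raw.endswith("Z"):
        raw = raw[:-1] + "+00:00"
    ts = datetime.fromisoformat(raw)
    return f"{ts:%Y-%m-%d}/{ts:%H}/{blob_name}"
```

Keeping this as a pure function makes it easy to test on a handful of sample files before pointing the function at a container with a million blobs.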
It is not clear whether you are using the hierarchical folder structure provided by Azure Data Lake Storage Gen2. Generation 1 simulates a folder structure, but it is not optimal.
ADLS Gen2 has several advantages that should help in your case, mainly related to move operations.
To migrate from ADLS Gen 1 to ADLS Gen 2 have a look here.
Additionally, you may explore optimizations on your specific case with the following paper here.

Best practice - Storage options for external reference data that is queried in different ways

We have a cloud platform with various Health Care applications. Each application needs what we call reference data. Reference data is always external data coming from a provider on a daily or some regular schedule. An example of reference data is FDB MedKnowledge which includes a comprehensive compendium of consumer medication monographs, along with drug images and imprints.
Various applications will query the reference data to present it to their target customers (who can be physicians, nurses, technicians, procurement department etc...). A common global API will be developed to return the requested data.
Historical information is required (for example, FDB in 2017 had NDC1, which then got deleted from the FDB feed in 2019, so a physician who prescribed NDC1 should be able to query the information for that drug by going through history).
Each day we receive the feed from the external provider and use it as the input source to merge (update, insert, delete) into our reference data copy, so that its live table reflects the latest external feed.
In Azure, we have the following storage options:
Blob storage
Cosmos DB
Azure SQL Database with system versioning
Azure Data Warehouse
Azure Data Lake
What is the best practice to store external reference data? We are leaning toward Azure SQL Database with system versioning. Have any of you worked with external reference data? If yes, what is your storage decision and has it worked well for you? I would like to hear your comments and opinions. Thank you!
You need to base your choice on the type of data you are trying to store, and how you need to reference it. It sounds like you might actually need a few different technologies here.
For example, Azure SQL is great for storing relational data. So if your data is tabular in form and needs to have relationships between tables, then this is a good choice. However, if you're going to be storing millions and millions of rows, then performance might suffer in a relational database. In that sort of scenario, or one where you have lots of transactional data, you might want to look at Cosmos DB.
You mentioned images at one point. Putting these in a database is not a good idea; for that sort of data you are going to want to look at using blob storage.
"Reference Data" really doesn't mean anything. Look at the individual types of data you need to store and how this data is used, and make decisions based on this. For lots of different types of data, there is unlikely to be a one-size-fits-all solution.

What would be the best way to migrate data from SQL Azure to Azure Table

For a project, I am using both SQL Azure and Azure table. A requirement here is that for the first 7 days, all data are stored in SQL Azure. After the first 7 days, the data are migrated into Azure table.
Is there any reliable project to achieve this goal? Or any idea to implement this?
thanks,
I think your best bet is to have a set of SQL queries (or sprocs) that return data older than 7 days. Then have table-insertion code that writes this data to one or more tables, with appropriate partition/row keys based on your query needs. Then, just build some type of background operation to perform the read+write+delete. There's no tool to do this (that I know of), since one is a relational database and the other is a NoSQL variant with no specific schema.
To optimize your writes, see if you can write batches of rows at the same time (this is called an Entity Group Transaction). It reduces the number of transactions, plus the rows in a group will be written atomically. See more info on entity group transactions, here.
You also may want to consider using a queue for workload assignment. That is, maybe once a day (or hour, whenever), push a queue message telling some background process to transfer data from SQL to Table Storage. This way, in case something fails during the operation, you can process it again later, since the queue message will still be there (you'd only delete the message if the operation succeeded).
If you're looking for a tool to do so, take a look at Cloud Storage Studio (http://www.cerebrata.com/products/cloudstoragestudio), which has a feature to import data from SQL Server to Azure Table Storage. I haven't checked for a long time, but I believe ClumsyLeaf's TableXplorer (http://www.clumsyleaf.com) also has this feature. A long time ago, we also built an open source tool to do the same. You can find it here: http://azuredatabaseupload.codeplex.com/.
As David mentioned, you could basically write some views in your database to fetch data older than 7 days. The idea is simple: You fetch the data, map the SQL Server data types to Azure data types, choose appropriate PartitionKey/RowKey values, convert the data into entities and then upload entities in batches.
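The "convert to entities and upload in batches" part can be sketched like this in Python. The column names passed in are hypothetical and the upload call itself is left out; what the sketch shows is the grouping, since an entity group transaction only accepts entities that share a PartitionKey and at most 100 of them per batch. It does not check the total payload size of a batch, which is also limited.

```python
from collections import defaultdict

MAX_BATCH_SIZE = 100  # entity group transactions accept at most 100 entities


def row_to_entity(row, partition_field, row_field):
    """Convert a SQL row (as a dict) into a table entity dict."""
    entity = {k: v for k, v in row.items()
              if k not in (partition_field, row_field)}
    entity["PartitionKey"] = str(row[partition_field])
    entity["RowKey"] = str(row[row_field])
    return entity


def make_batches(entities, max_size=MAX_BATCH_SIZE):
    """Group entities by PartitionKey, then split each group into batches."""
    by_partition = defaultdict(list)
    for entity in entities:
        by_partition[entity["PartitionKey"]].append(entity)
    batches = []
    for group in by_partition.values():
        for start in range(0, len(group), max_size):
            batches.append(group[start:start + max_size])
    return batches
```

Each returned batch could then be submitted as one transaction by whatever table client you use, and the corresponding SQL rows deleted only after that batch succeeds.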

Azure: SQL Compact possible?

I have a RESTful service running on azure. Currently, it has zero persistence. (It is just a REST gateway to another api.) I run it in a single, minimal Azure instance, and expect this will handle all the load this will ever get.
I now need to add some very lightweight persistence to it. A simple table, of 40-200 rows, eight data columns. The data is very static.
Doing the whole SQL Azure thing seems like big overkill for my needs.
My thoughts have been to use:
An XML file, loaded into memory as the db. The XML file is deployed with the code.
Some better way to deploy XML, so it can be rolled out/updated easier
SQL Compact (can I do this on Azure?)
___ ?
What is the right path here?
Thank you!
SQL Server Compact would need to store its data somewhere in a persistent manner, so you would need to sync it regularly to persistent storage. That's a lot of extra work and I have no idea how to do it reliably, so it's likely not a very good idea.
For your simple table the Azure Table Storage might be just enough. If that's not enough then SQL Azure is the next choice.
You can use the XML file as your store. There is no harm in it; rather, it is a very easy and cost-efficient solution, but there is a catch. As you mentioned, you are currently using only one Azure instance, and in that case you can store the XML file in your App_Data. But if in the future you want to move to two Azure instances, you will have to replicate the App_Data folder. In other words, you will need to keep the App_Data folders in sync.
Suggestion
Instead of storing the file in App_Data, store it in a BLOB; you can retrieve it using WebClient and then keep it in memory.
Pros: The advantage of a BLOB is that you don't have to sync it.
Cons: There is a cost associated with the number of transactions you make. This will depend on how many times you update the file.
Summary
If you are going to work with only one Azure Instance, use App_Data
More than one Azure Instance, use BLOB with no syncing or use App_Data with sync.
Do not use Azure Table, as BLOB is the designated store provided for this purpose only.
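Whichever place the file lives in, the "load it into memory" part is small. Here is a rough sketch, written in Python for brevity even though the WebClient mention suggests a .NET service; the rows/row layout with attributes and the function names are invented for illustration and would need to match your actual file.

```python
import xml.etree.ElementTree as ET


def load_table(xml_text):
    """Parse <rows><row .../></rows> into a list of dicts kept in memory."""
    root = ET.fromstring(xml_text)
    return [dict(row.attrib) for row in root.iter("row")]


def index_by(rows, key):
    """Build a lookup dict so the service can fetch a row by one column."""
    return {row[key]: row for row in rows}
```

With 40-200 static rows, the service could do this once at startup (after reading the file from App_Data or downloading the blob) and serve every request from the resulting dict.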
EDIT
From MSDN post
As far as I know, Windows Azure does not support SQL Compact Edition. SQL Compact Edition stores data in file system which will not be synchronized in multiple instances (a web role may be deployed to more than one instance. An instance is similar to a virtual machine). And files stored in file system will lost when the instance is restarted or reimaged.
Hope this helps you.

How can I back up my Windows Azure table storage?

I would like to be able to back up my table storage and also I have a need to move the data (export and import) from my production to development environment on my desktop.
Does anyone know of any tools or methods that I can use to do this?
You can use Cerebrata's Azure Management Cmdlets product. It allows you to download and restore your Azure table storage (and many more things). You can download it from here.
http://clumsyleaf.com/products/tablexplorer
TableXplorer will let you export all table data to an XML or CSV file.
As mentioned by others, there are tools out there that let you download your data in various formats, but it's worth noting that none of these are a true backup like you might be used to getting with SQL server.
As far as I'm aware they all just run a regular table storage query to scan through all of the records in the table and save out the results. If you have a reasonable amount of data (and if you're using table storage then you probably do) it's quite possible that this backup could take an hour or more.
For the sake of simplicity let's say you have two large related tables, A and B. If the backup starts by backing up table A, then moves on to table B, by the time it finishes backing up table B, it might contain records that rely on data in table A that's just not there.
If you just want to refresh the data in your development environment, this could be perfectly acceptable, but you do need to be aware of it.
You can try this solution; it lets you back up and restore your tables and blobs to the same or a different storage account.
