Migrating existing Cosmos DB collections to Autopilot - Azure

I have a number of existing collections with manual RU provisioning which I would like to migrate to be Autopilot-managed, to better deal automatically with varying levels of demand.
The collections contain many GB of historical time-series data, and I cannot have any downtime where new or historical data is unavailable to customers. I must also ensure no data is lost during the migration.
Once a day, a new day of data is bulk uploaded to the Cosmos DB collection, and the collections can be queried at any time by the customer-facing service in front of them.
For migration, I was considering the following:
1. Create a new Autopilot collection
2. Modify the service to query both the old and new collection and deduplicate any data present in both (see the sketch below)
3. Redirect data upload to new collection
4. Use ADF (Azure data factory) to copy the contents of the old collection to the new Autopilot one
5. Update service to only query the new collection
6. Drop old collection.
Is this the best migration strategy, or is there an alternative approach which would provide a better customer experience, or be less work?
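For step 2, roughly what I have in mind is something like this (a sketch with the azure-cosmos Python SDK; the container and field names are placeholders, and it assumes every reading has a unique id):
    from azure.cosmos import CosmosClient

    client = CosmosClient("<account-uri>", "<account-key>")
    db = client.get_database_client("timeseries-db")
    old = db.get_container_client("manual-throughput-container")
    new = db.get_container_client("autopilot-container")

    def query_merged(sql: str):
        """Query both containers and keep the first copy of each document id."""
        seen, merged = set(), []
        for container in (new, old):  # prefer the copy in the new container
            for item in container.query_items(sql, enable_cross_partition_query=True):
                if item["id"] not in seen:
                    seen.add(item["id"])
                    merged.append(item)
        return merged

    rows = query_merged("SELECT * FROM c WHERE c.day = '2020-03-01'")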

While Autopilot is in preview you will need to manually migrate data to Autopilot containers. Once we GA, we plan to allow customers to seamlessly migrate containers from standard to Autopilot throughput.
For the scenario you describe, I find it easier to use the change feed when I need to do a near-zero-downtime migration. Create a new Autopilot-configured container, then create an Azure Function using the Cosmos DB bindings to read from the source container and write to the new Autopilot container so the data stays in sync.
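Purely as an illustration of that read-from-source / write-to-target pattern (the binding-based Function is the better fit for production), a sketch using the azure-cosmos Python SDK's change feed API could look like this, with the account, database and container names as placeholders:
    from azure.cosmos import CosmosClient

    client = CosmosClient("<account-uri>", "<account-key>")
    db = client.get_database_client("timeseries-db")
    source = db.get_container_client("manual-throughput-container")
    target = db.get_container_client("autopilot-container")

    # Read every change from the beginning of the source container and copy it
    # across. upsert keeps the copy idempotent if the loop is restarted; in a
    # real migration you would keep polling (or let the Function trigger do it)
    # until the target has caught up with the daily bulk loads.
    for doc in source.query_items_change_feed(is_start_from_beginning=True):
        target.upsert_item(doc)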
Rewrite your consuming apps to use the new container and your bulk load scripts to write to the new container. Once that is done, deploy the changes. I like using slots for Web Apps (or whatever you choose) for zero or near zero downtime.
One thing to keep an eye on: since you are bulk loading this data, the Azure Function will likely fall far behind while keeping the data in sync. You'll want to monitor how long it takes to catch up so you know when you can flip the switch on the migration.
Hope that helps.

Related

Is there a way to purge all documents from a Cosmos DB container using the Azure Portal?

I'm developing an app that uses a CosmosDB container. Every so often, I want to purge one of the testing containers. Using the Azure Portal, I drop the container and create a new one with the same name and attributes.
Simple enough, but this feels unnecessary. Every so often I'll type something wrong. Is there a way to delete all documents in a container, without the need to recreate it, via the web Portal? It feels as if this might exist in a menu somewhere and I'm just unable to find it.
You can set the time to live (TTL) of the container to something like 1 second (link). It will take some time depending on the number of documents and the throughput of your Cosmos DB account.
Deletion by TTL only uses leftover RU/s, so it will not affect your application while it is live.
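You asked about the Portal (the container's Scale & Settings page in Data Explorer has a Time to Live option), but for completeness, the same TTL trick from the azure-cosmos Python SDK would be roughly the following sketch; the names and partition key path are placeholders, and the partition key must match the container's existing one since it cannot be changed:
    from azure.cosmos import CosmosClient, PartitionKey

    client = CosmosClient("<account-uri>", "<account-key>")
    db = client.get_database_client("test-db")
    container = db.get_container_client("test-container")

    # Setting default_ttl to 1 second makes every document expire almost
    # immediately; the actual deletion happens in the background using spare RU/s.
    db.replace_container(
        container,
        partition_key=PartitionKey(path="/pk"),  # must be the container's existing partition key
        default_ttl=1,
    )

    # Remember to turn TTL back off afterwards, otherwise new test documents
    # will keep disappearing.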

Convert Cosmos DB serverless to a provisioned throughput DB

I am getting ready to create a brand new mobile application that communicates with Cosmos DB and I will probably go the serverless way. The serverless way has some minor disadvantages compared to provisioned throughput (e.g. only 50 GB per container, no geo-redundancy, no multi-region writes, etc.).
If I later need to convert my DB to a provisioned-throughput one, can I do it somehow?
I know that I can probably use the change feed and from that (I guess) recreate a new DB (a provisioned-throughput one), but this might open Pandora's box, especially while a mobile app connects to a specific DB.
As Gaurav mentioned, there is no way to change from serverless to provisioned throughput once you create an account.
You will need to recreate the account with provisioned throughput and use one of the approaches below to migrate the data:
(i) Data Migration Tool - you can easily migrate from one account to another
(ii) Change feed and restore - push the changes to the new instance of Azure Cosmos DB
Once you are synced, switch over to the new one.
Based on the documentation available here: https://learn.microsoft.com/en-us/azure/cosmos-db/serverless#using-serverless-resources, it is currently not possible to change a Cosmos DB serverless account to provisioned throughput.
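If you do end up recreating the account, containers in it can be created with provisioned or autoscale throughput directly from the SDK; a minimal sketch with the azure-cosmos Python SDK (a recent SDK version is assumed, and all names are placeholders):
    from azure.cosmos import CosmosClient, PartitionKey, ThroughputProperties

    client = CosmosClient("<provisioned-account-uri>", "<account-key>")
    db = client.create_database_if_not_exists("app-db")

    # Fixed provisioned throughput (manual RU/s)...
    db.create_container_if_not_exists(
        id="items",
        partition_key=PartitionKey(path="/userId"),
        offer_throughput=400,
    )

    # ...or autoscale, which is closer to "set it and forget it".
    db.create_container_if_not_exists(
        id="events",
        partition_key=PartitionKey(path="/userId"),
        offer_throughput=ThroughputProperties(auto_scale_max_throughput=4000),
    )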

Azure Logic App for migration of millions of files

I have the following requirements, for which I am considering Azure Logic Apps:
Files placed in Azure Blob Storage must be migrated into a custom place (it can be different from case to case)
The number of files is around 1,000,000
When the process is over, we should have a report saying how many records (files) failed
If the process stops somewhere in the middle, the next run must pick up only the files that have not yet been migrated
The process must be as fast as it can be, and the files must be migrated within N hours
What worries me is that I cannot find any examples or articles (including the official Azure documentation) where the same thing is achieved with Azure Logic Apps.
I have some ideas about my requirements and Azure Logic App:
I think that I must use pagination to deal with this number of files, because an Azure Logic App will not be able to read millions of file names at once - https://learn.microsoft.com/en-us/azure/logic-apps/logic-apps-exceed-default-page-size-with-pagination
I can add a record to Azure Table Storage to track failed migrations (something like creating a record to say that the process has started and updating it when the file is moved to the destination) - see the sketch after this list
I have no idea how I can restart the Azure Logic App without a custom tracking mechanism (for instance, the same Azure Table Storage instance)
And the question about splitting the work across several units is still open.
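To make the tracking idea above concrete, here is a rough sketch of what I mean, expressed in Python with the azure-storage-blob and azure-data-tables SDKs rather than a Logic App (all names and the table schema are placeholders):
    from azure.core.exceptions import ResourceNotFoundError
    from azure.data.tables import TableServiceClient
    from azure.storage.blob import BlobServiceClient

    CONN = "<storage-connection-string>"
    blob_service = BlobServiceClient.from_connection_string(CONN)
    tracking = TableServiceClient.from_connection_string(CONN).create_table_if_not_exists("migrationstatus")

    source = blob_service.get_container_client("source-container")
    target = blob_service.get_container_client("target-container")

    for blob in source.list_blobs():            # the SDK pages through results for us
        row_key = blob.name.replace("/", "|")   # RowKey cannot contain '/'
        try:
            entity = tracking.get_entity("migration-run-1", row_key)
            if entity.get("Status") == "Done":
                continue                        # already migrated in a previous run
        except ResourceNotFoundError:
            pass                                # not tracked yet, migrate it now
        try:
            src_url = source.get_blob_client(blob.name).url  # add a SAS if the target is another account
            target.get_blob_client(blob.name).start_copy_from_url(src_url)
            status = "Done"
        except Exception as exc:
            status = f"Failed: {exc}"           # feeds the end-of-run failure report
        tracking.upsert_entity({"PartitionKey": "migration-run-1", "RowKey": row_key, "Status": status})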
Do you think that Azure Logic Apps is the right choice for my needs, or should I consider something else? If Azure Logic Apps can work for me, could you please share your thoughts and ideas on how I can achieve the given requirements?
I don't think a Logic App is a good solution for implementing this requirement, because the number of files is about 1,000,000 and that's too many. For this requirement, I suggest you use Azure Data Factory.
To migrate data in Azure Blob Storage with Data Factory, you can refer to this document

Cosmos DB selective regional replication

We are planning to use a Cosmos DB single-master deployment where all master data is maintained from a single region. The application is spread across various regions and we need to provide read access to the individual regions. However, we would like to have filtered replication, as not all regions will be interested in all the data in Cosmos DB. Is there any way to use selective, region-specific replication? I am aware that we could use a Cosmos DB trigger and a Function App etc. to replicate traffic, but that is an overhead in terms of maintenance and monitoring. Hence I would be interested to know if we can make use of any native functionality.
The built-in geo-replication mechanism is completely transparent to you. You can't see it and you can't do anything about it. There is no way to do what you described without writing something custom.
If you really want to have selected data replicated then you would need to do the following (It's a terrible solution and you should NOT go with it):
Create a main source of truth Cosmos DB account. That's "single master" that you described.
Create a few other accounts in whichever region you want.
Use a Cosmos DB-triggered Azure Function or the Change Feed Processor library to listen to changes on the main account and then use your filtering logic to replicate them into the other accounts that need them (see the sketch after this list).
Use a different connection string per application based on its deployment environment.
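A rough sketch of that fan-out with the azure-cosmos Python SDK's change feed API (the account URIs, the db/data names and the region field on the documents are all assumptions about your setup):
    from azure.cosmos import CosmosClient

    def container(uri, key):
        return CosmosClient(uri, key).get_database_client("db").get_container_client("data")

    main = container("<main-account-uri>", "<main-key>")    # single source of truth
    regional = {                                            # one account per interested region
        "emea": container("<emea-account-uri>", "<emea-key>"),
        "apac": container("<apac-account-uri>", "<apac-key>"),
    }

    # One-shot pass for illustration; in practice this runs continuously
    # (Change Feed Processor / Azure Function) and tracks a continuation token.
    for doc in main.query_items_change_feed(is_start_from_beginning=True):
        target = regional.get(doc.get("region"))   # <- your filtering logic
        if target is not None:
            target.upsert_item(doc)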
What's wrong with just having your data replicated across all regions though? There are no drawbacks.

Pulling data asynchronously from third-party web service on Windows Azure Platform

I want to pull a large amount of data, frequently, from different third-party API web services and store it in a staging area (this is what I want to decide right now), from where it will then be moved, one by one as required, into my application's database.
I wanted to know whether I can use the Azure platform to achieve the above, and how well suited the Azure platform is for this task.
What if the amount of data to be pulled is large and the frequency of the pull is high, i.e. maybe half-hourly or hourly for 2,000 different users?
I assume that if this is at all possible, then bandwidth, data storage, server capacity, etc. will not be my concern but Microsoft's. And obviously, I should be able to access the data whenever I need it.
If I had to implement this on Windows servers, I know I would use a Windows service to do it. But I don't know how it could be done on the Windows Azure platform, if it is possible at all.
As Rinat stated, you can use Lokad's solution. If you choose to do it yourself, you can run a timed task in your worker role - maybe spawn a thread that sleeps, waking every 30 minutes to perform its task. It can then reach out to the web services in question (or maybe one thread per web service?) and fetch data. You can store it temporarily in Azure Table Storage, which is a fraction of the cost of SQL Azure ($0.15 per GB), and then easily read it out of Table Storage on demand and transfer it to SQL Azure.
Assuming your services, storage and SQL Azure are hosted in the same data center (by setting the affinity appropriately), you'd only pay for bandwidth when pulling data from the web service. There'd be no bandwidth charge to retrieve from Table Storage or insert into SQL Azure.
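To make the "thread that sleeps, waking every 30 minutes" idea concrete, here's a bare-bones sketch of the pattern in Python (the original context is a .NET worker role, so treat this purely as an illustration; the URLs, interval and staging step are placeholders):
    import threading
    import time

    import requests

    def stage(payload) -> None:
        # Placeholder: persist the raw payload to cheap staging storage
        # (e.g. Table Storage), to be moved into the application's database later.
        ...

    def poll_service(url: str, interval_seconds: int = 30 * 60) -> None:
        """Fetch data from one third-party web service on a fixed schedule."""
        while True:
            response = requests.get(url, timeout=60)
            response.raise_for_status()
            stage(response.json())
            time.sleep(interval_seconds)   # sleep until the next half-hourly run

    # "maybe one thread per web service?" - one polling thread per upstream API
    for service_url in ("https://api.example.com/feed-a", "https://api.example.com/feed-b"):
        threading.Thread(target=poll_service, args=(service_url,)).start()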
In Windows Azure it's usually a Worker Role that is used to host the cloud processing. To accomplish your tasks you'll either need to implement this messaging/scheduling infrastructure yourself or use something like the Lokad.Cloud or Lokad.CQRS open-source projects for Azure.
We use Lokad.Cloud for distributed BI processing of hundreds of thousands of series, and Lokad.CQRS allows us to reliably retrieve and synchronize millions of products on a schedule.
There are samples, docs and community in both projects to get you started.