Archiving Azure Search Service - azure

Need suggestion on archiving unused data from search service and reload it back when needed(reload to be done later).
Initial design draft looks like this:
Find the keys from search service based on some conditions(like take inactive, how old) that need to be archived.
Run achiever job(need suggestion here, could be a web job, function app)
Fetch the data and insert to blob storage and delete it from the search service.
Now the real way is to run the job in the pool and should be asynchronous

There's no right / wrong answer for this question. What you need to do is perform batch queries (up to 1000 docs), and schedule it to archive past data (eg. run an Azure function which will trigger and search for docs where createdDate > DataTime.Now).
Then persist that data somewhere (can be a cosmos db or as blob into storage account). Once you need to upload it again, I would consider it as a new insert, so it should follow your current insert process.
You can also take a look on this tool which helps to copy data from your index pretty quick:
https://github.com/liamca/azure-search-backup-restore

Related

Logic app to copy files from a blob even if there is no change

I have a logic app that is triggered when there is a change to the blob, this works fine but what if I want this process to run and overwrite these files is this possible. I seem to lose all dynamic options as soon as I change the bloc modified trigger
I am not sure if I understand your question. But if you just want a scheduled job kind of process to pick or put files from/to azure blob storage, you can use blob actions rather than trigger. You can use a 'Recurrence' trigger to start the logic app and use one of the appropriate blob actions to do the required operation. Let me know if you are looking for something else.

How can I do bulk inserts into the Common Data Service?

I have 1000 records that I need to sync daily from an API. I am currently bulk inserting them into a SQL Database, however I would like to use Dataverse/a Common Data Service database instead.
The Logic App connector seems to do 1 record at a time and the SDK does PUTS and POSTS. How can I either insert 1000 records into the Common Data Service in bulk OR somehow synchronise my SQL DB with the CDS?
As far as I know there is no another way to do that without programming. You can extended your Power Automate Flow with Azure Functions to insert these records in a single transaction.
In this link explain how can be do it.
https://learn.microsoft.com/en-us/powerapps/developer/data-platform/webapi/execute-batch-operations-using-web-api#when-to-use-batch-requests
Please let me know wtih anything
If you want to regularly ingest data (1000 rows) into Dataverse (CDS), then use Dataflows. The following link to MS Docs describes how to set up scheduled bulk data updates. It is therefore a pull rather than push model.
https://learn.microsoft.com/en-us/powerapps/maker/data-platform/create-and-use-dataflows

Logic App to push data from Cosmosdb into CRM and perform an update

I have created a logic app with the goal of pulling data from a container within cosmosdb (with a query), looping over the results and then pushing this data into CRM (or Common Data Service). When the data is pushed to CRM, an ID will be generated. I wish to then update cosmosdb with this new ID. Here is what I have so far:
This next step is querying for the data within our cosmosdb database and selecting all IDS with a length that is greater than 15. (This tells us that the ID is not yet within the CRM database)
Then we loop over the results and push this into CRM (Dynamics365 or the Common Data Service)
Dilemma: The first part of this process appears to be correct, however, I want to make sure that I am on the right track with this. Furthermore, once the data is successfully pushed to CRM, CRM automatically generates an ID for each record. How would I then update cosmosDB with the newly generated IDs?
Any suggestion is appreciated
Thanks
I see a red flag in your approach here with this query with length(c.id) > 15. This is not something I would do. I don't know how big your database is going to be but generally not very performant to do high volumes of cross partition queries, especially if the database is going to keep growing.
Cosmos DB already provides an awesome streaming capability so rather than doing this in a batch I would use Change Feed and use that to accomplish whatever your doing here in your Logic App. This will likely give you better control of the process and likely allow you to get the id back out of your CRM app to insert back into Cosmos DB.
Because you will be writing back to Cosmos DB, you will need a flag to ignore the update in Change Feed when the item is updated.

Bringing incremental data in from REST APIs into SQL azure

My needs are following:
- Need to fetch data from a 3rd party API into SQL azure.
The API's will be queried everyday for incremental data and may require pagination as by default any API response will give only Top N records.
The API also needs an auth token to work, which is the first call before we start downloading data from endpoints.
Due to last two reasons, I've opted for Function App which will be triggered daily rather than data factory which can query web APIs.
Is there a better way to do this?
Also I am thinking of pushing all JSON into Blob store and then parsing data from the JSON into SQL Azure. Any recommendations?
How long does it take to call all of the pages? If it is under ten minutes, then my recommendation would be to build an Azure Function that queries the API and inserts the json data directly into a SQL database.
Azure Function
Azure functions are very cost effective. The first million execution are free. If it takes longer than ten, then have a look at durable functions. For handling pagination, we have plenty of examples. Your exact solution will depend on the API you are calling and the language you are using. Here is an example in C# using HttpClient. Here is one for Python using Requests. For both, the pattern is similar. Get the total number of pages from the API, set a variable to that value, and loop over the pages; Getting and saving your data in each iteration. If the API won't provide the max number of pages, then loop until you get an error. Protip: Make sure specify an upper bound for those loops. Also, if your API is flakey or has intermittent failures, consider using a graceful retry pattern such as exponential backoff.
Azure SQL Json Indexed Calculated Columns
You mentioned storing your data as json files into a storage container. Are you sure you need that? If so, then you could create an external table link between the storage container and the database. That has the advantage of not having the data take up any space in the database. However, if the json will fit in the database, I would highly recommend dropping that json right into the SQL database and leveraging indexed calculated columns to make querying the json extremely quick.
Using this pairing should provide incredible performance per penny value! Let us know what you end up using.
Maybe you can create a time task by SQL server Agent.
SQL server Agent--new job--Steps--new step:
In the Command, put in your Import JSON documents from Azure Blob Storage sql statemanets for example.
Schedules--new schedule:
Set Execution time.
But I think Azure function is better for you to do this.Azure Functions is a solution for easily running small pieces of code, or "functions," in the cloud. You can write just the code you need for the problem at hand, without worrying about a whole application or the infrastructure to run it. Functions can make development even more productive, and you can use your development language of choice, such as C#, F#, Node.js, Java, or PHP.
It is more intuitive and efficient.
Hope this helps.
If you could set the default top N values in your api, then you could use web activity in azure data factory to call your rest api to get the response data.Then configure the response data as input of copy activity(#activity('ActivityName').output) and the sql database as output. Please see this thread :Use output from Web Activity call as variable.
The web activity support authentication properties for your access token.
Also I am thinking of pushing all JSON into Blob store and then
parsing data from the JSON into SQL Azure. Any recommendations?
Well,if you could dump the data into blob storage,then azure stream analytics is the perfect choice for you.
You could run the daily job to select or parse the json data with asa sql ,then dump the data into sql database.Please see this official sample.
One thing to consider for scale would be to parallelize both the query and the processing. If there is no ordering requirement, or if processing all records would take longer than the 10 minute function timeout. Or if you want to do some tweaking/transformation of the data in-flight, or if you have different destinations for different types of data. Or if you want to be insulated from a failure - e.g., your function fails halfway through processing and you don't want to re-query the API. Or you get data a different way and want to start processing at a specific step in the process (rather than running from the entry point). All sorts of reasons.
I'll caveat here to say that the best degree of parallelism vs complexity is largely up to your comfort level and requirements. The example below is somewhat of an 'extreme' example of decomposing the process into discrete steps and using a function for each one; in some cases it may not make sense to split specific steps and combine them into a single one. Durable Functions also help make orchestration of this potentially easier.
A timer-driven function that queries the API to understand the depth of pages required, or queues up additional pages to a second function that actually makes the paged API call
That function then queries the API, and writes to a scratch area (like Blob) or drops each row into a queue to be written/processed (e.g., something like a storage queue, since they're cheap and fast, or a Service Bus queue if multiple parties are interested (e.g., pub/sub)
If writing to scratch blob, a blob-triggered function reads the blob and queues up individual writes to a queue (e.g., a storage queue, since a storage queue would be cheap and fast for something like this)
Another queue-triggered function actually handles writing the individual rows to the next system in line, SQL or whatever.
You'll get some parallelization out of that, plus the ability to start from any step in the process, with a correctly-formatted message. If your processors encounter bad data, things like poison queues/dead letter queues would help with exception cases, so instead of your entire process dying, you can manually remediate the bad data.

Switching production azure tables powering cloud service

Would like to know what would be the best way to handle the following scenario.
I have an azure cloud service uses a Azure storage table to lookup data against requests. The data in the table is generated offline periodically (once a week).
When new data is generated offline I would need to upload it into a separate table and make config changes (change table name) to the service to pick up data from the new table and re-deploy the service. (Every time data changes I change the table name - stored as a constant in my code - and re-deploy)
The other way would be to keep a configuration parameter for my azure web role which specifies the name of the table which holds current production data. Then, within the service I read the config variable for every request - get a reference to the table and fetch data from there.
Is the second approach above ok - or would it have a performance hit because I read the config, create a table client on every request that comes to the service. (The SLA for my service is less than 2 seconds)
To answer your question, 2nd approach is definitely better than the 1st one. I don't think you will take a performance hit because the config settings are cached on 1st read (I read it in one of the threads here) and creating table client does not create a network overhead because unless you execute some methods on the table client, this object just sits in the memory. One possibility would be to read from config file and put that in a static variable. When you change the config setting, capture the role environment changing event and update the static variable to the new value from the config file.
A 3rd alternative could be to soft code the table name in another table and have your application read the table name from there. You could update the table name as part of your upload process by first uploading the data and then updating this table with the new table name where data has been uploaded.

Resources