Is it possible to Push (using API) and Pull (using indexer) data into the same Index with Azure Search?

I have an index in Azure Search, let's say it's called Hotels.
I have a Hotels table in Azure SQL with the same schema, i.e. a copy of the Hotels index found in Azure Search.
My back-end pushes to both the Azure SQL table and Azure Search on create/update/delete.
In a scenario where my data was pushed to Azure SQL but failed to be pushed to Azure Search, is it possible to set up my Azure SQL Hotels table as an indexer data source, so that the indexer can sync to my Azure Search index (hotels) the data that failed to be pushed from my back-end?

Yes, you can both mix push and pull as well as have multiple pull indexers targeting the same index. We see this done often when part of the data is in one data source and part in another, where the index is the point where they converge, coordinated by their key.
The pattern you're describing is not as common, but generally speaking it should work. You'd have to account for cases where your own write conflicts with an indexer write, and make sure the writes you do as they happen ultimately win. Also, if you go down this path, make sure to configure a change detection (and deletion detection, if you delete rows) policy so we index from SQL incrementally and don't read everything on every run.
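For reference, here is a minimal sketch of wiring that up with the Python SDK (azure-search-documents), assuming a high-water-mark column named LastUpdated and a soft-delete column named IsDeleted on the SQL table; those column names, the service URL and the key are all hypothetical placeholders:

```python
# pip install azure-search-documents
# Sketch: wire an Azure SQL data source with change + deletion detection
# to an existing "hotels" index. Column and resource names are assumptions.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexer,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
    HighWaterMarkChangeDetectionPolicy,
    SoftDeleteColumnDeletionDetectionPolicy,
)

client = SearchIndexerClient("https://<service>.search.windows.net",
                             AzureKeyCredential("<admin-key>"))

data_source = SearchIndexerDataSourceConnection(
    name="hotels-sql",
    type="azuresql",
    connection_string="<azure-sql-connection-string>",
    container=SearchIndexerDataContainer(name="Hotels"),
    # Only rows whose LastUpdated is beyond the last high-water mark are re-read.
    data_change_detection_policy=HighWaterMarkChangeDetectionPolicy(
        high_water_mark_column_name="LastUpdated"),
    # Rows flagged IsDeleted = '1' are removed from the index instead of re-added.
    data_deletion_detection_policy=SoftDeleteColumnDeletionDetectionPolicy(
        soft_delete_column_name="IsDeleted", soft_delete_marker_value="1"),
)
client.create_or_update_data_source_connection(data_source)

indexer = SearchIndexer(
    name="hotels-sql-indexer",
    data_source_name="hotels-sql",
    target_index_name="hotels",
    schedule=None,  # or an IndexingSchedule to run periodically
)
client.create_or_update_indexer(indexer)
```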
An alternative approach, if you're worried about missing writes, is to push all your writes into a queue, and then pull from the queue and write into Azure Search. That way you have a single stream of writes instead of two.
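A rough sketch of that queue pattern using an Azure Storage queue and the Python SDKs; the queue name, document shape and connection placeholders are assumptions, not a prescribed design:

```python
# pip install azure-storage-queue azure-search-documents
# Sketch: every write goes onto a queue; one consumer drains the queue
# and upserts into the search index, so Azure Search sees a single write stream.
import json
from azure.core.credentials import AzureKeyCredential
from azure.storage.queue import QueueClient
from azure.search.documents import SearchClient

queue = QueueClient.from_connection_string("<storage-connection-string>", "hotel-writes")
search = SearchClient("https://<service>.search.windows.net", "hotels",
                      AzureKeyCredential("<admin-key>"))

def enqueue_write(hotel: dict) -> None:
    # Called by the back end right after the row is written to Azure SQL.
    queue.send_message(json.dumps(hotel))

def drain_queue() -> None:
    # Called by a worker on a timer; a message only leaves the queue
    # once it is explicitly deleted, so failed runs are retried.
    for msg in queue.receive_messages(messages_per_page=32):
        doc = json.loads(msg.content)
        search.merge_or_upload_documents(documents=[doc])
        queue.delete_message(msg)
```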

Related

Solution architecture for data transfer from SQL Server database to external API and back

I am looking for a proper solution architecture for a data transfer scenario from SQL Server to an external API and then from the API back to SQL Server. We are thinking of using Azure technologies.
We have a database hosted on an Azure VM. When the author value in the book table changes, we would like to get all the data for that book from the related tables and transfer it to an external API. The quantity of rows to be transferred is huge, so the select-join query takes a long time to execute. After this data is read, it is transformed and then sent to an external API (over which we have no control). The transfer of the data to the API could take up to an hour. After the data is written into this API, we read some reports from this API and write these reports back into the original database.
We must repeat this process more than 50 times per day.
We are thinking of using a Logic App to detect the trigger from SQL Server (as it is hosted on an Azure VM), publish this event to Azure Event Grid, and then use Azure Durable Functions to read the SQL data, transform it, and send it to the external API.
Does this make sense? Does anybody have any better ideas?
Thanks in advance
At this moment, the Logic App SQL connector can't detect when a particular row changes; it will perform a select (which you'll provide) and then check for changes every X interval (which you'll specify).
In other words, SQL Database doesn't offer a change feed like Cosmos DB, where you can subscribe to events and trigger an Azure Function.
Things you can do:
1-Add a trigger on SQL after insert/update which will insert the new/changed row into a separate table, and then use Logic App / Azure Functions to query this table and retrieve the data.
2-Migrate to Cosmos DB and use the change feed + Azure Functions.
3-Change your code so that, after inserting into the SQL Database, it also adds a message with the identifier of the row you just inserted/updated to a queue, which is then consumed by an Azure Function (a minimal sketch of that consumer follows below).
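A minimal sketch of the consumer side of option 3, as a Python queue-triggered Azure Function; the queue binding itself lives in function.json, and the message shape and field names here are assumptions:

```python
# Sketch of option 3: an Azure Function (Python, queue trigger) that reads the
# row identifier from the queue message, loads the row, and processes it.
# The queue binding is configured in function.json; names here are assumptions.
import json
import logging
import azure.functions as func

def main(msg: func.QueueMessage) -> None:
    payload = json.loads(msg.get_body().decode("utf-8"))
    row_id = payload["id"]  # identifier written by the insert/update code
    logging.info("Processing changed row %s", row_id)
    # ... fetch the row from SQL by row_id, transform it, call the external API ...
```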

Azure data factory - Continue on conflict on insert

We are building a data migration pipeline using Azure Data Factory (ADF). We are transferring data from one Cosmos DB instance to another. We plan to enable dual writes, so that we write to both databases before the migration begins, to ensure that if any data point changes during migration both databases get the most updated data. However, in ADF only Insert or Upsert options are available. Our case is: on Insert, if there is a conflict, continue without failing the pipeline. Can anyone give any pointers on how to achieve that in ADF?
Other option would be to create our own custom tool using CosmosDb libraries to transfer data.
If you are doing a live migration, ADF is not the right tool to use, as it is intended for offline migrations. If you are migrating from one Cosmos DB account to another, your best option is to use the Cosmos DB Live Data Migrator.
This tool also provides dead-letter support, which is another requirement you have.

Is it possible to download a million files in parallel from Rest API endpoint using Azure Data Factory into Blob?

I am fairly new to Azure and I have a task in hand to make use of any Azure service (or group of Azure services working together, for that matter) to download a million files in parallel from a third-party REST API endpoint, which returns one file at a time, using Azure Data Factory into Blob Storage.
WHAT I RESEARCHED :
From what I researched, my task had these requirements in a nutshell:
Parallel runs in the millions - For this I deduced Azure Batch would be a good option, as it lets you run a large number of tasks in parallel on VMs (it uses that concept for graphics rendering or machine learning tasks).
Save the response from the REST API to Blob Storage - I found that Azure Data Factory can handle this kind of ETL operation in a source/sink style, where I could set the REST API as the source and Blob Storage as the sink.
WHAT I HAVE TRIED:
Here are some things to note:
I added the REST API and Blob as linked services.
The API endpoint takes in a query string param named : fileName
I am passing the whole URL with the query string
The Rest API is protected by Bearer Token, which I am trying to pass using additional headers.
THE MAIN PROBLEM:
I get an error message when publishing the pipeline saying that the model is not appropriate; just that one line, and it gives no insight into what's wrong.
OTHER QUERIES:
Is it possible to pass query string values dynamically from a SQL table, such that each file name is picked up as a single row/column item from a single-column result set returned by a stored procedure/inline query?
Is it possible to make this pipeline run in parallel using Azure Batch somehow? How can we integrate this process?
Is it possible to achieve the million-file parallelism without Data Factory, just using Batch?
Hard to help with your main problem - you need to provide more examples of your code.
In relation to your other queries:
You can use a "Lookup activity" to fetch a list of files from a database (with either sproc or inline query). The next step would be a ForEach activity that iterates over the array and copies the file from the REST endpoint to the storage account. You can adjust the parallelism on the ForEach activity to match your requirement but around 20 concurrent executions is what you normally see.
Using Azure Batch just to download a file seems a bit overkill, as it should be a fairly quick operation. If you want to see an example of an Azure Batch job written in C#, I can recommend this example => https://github.com/Azure-Samples/batch-dotnet-quickstart/blob/master/BatchDotnetQuickstart. In terms of parallelism, I think you will manage to achieve a higher degree on Azure Batch compared to Azure Data Factory.
If you need to actually download 1M files in parallel, I don't think you have any other option than Azure Batch to get close to such numbers. But you must have a pretty beefy API if it can handle 1M requests within a second or two.
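For illustration, here is a rough sketch of the per-file work a single Batch task (or a ForEach iteration) could do: call the endpoint with the fileName query parameter and a Bearer token, then write the response to Blob Storage. The endpoint URL, container name and placeholders are assumptions:

```python
# pip install requests azure-storage-blob
# Sketch: download one file from a Bearer-protected REST endpoint and
# upload it to Blob Storage. All names/URLs below are assumptions.
import requests
from azure.storage.blob import BlobClient

API_URL = "https://example-api.invalid/files"   # hypothetical endpoint
TOKEN = "<bearer-token>"
STORAGE_CONN = "<storage-connection-string>"

def download_to_blob(file_name: str) -> None:
    resp = requests.get(
        API_URL,
        params={"fileName": file_name},
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=300,
    )
    resp.raise_for_status()

    blob = BlobClient.from_connection_string(
        STORAGE_CONN, container_name="downloads", blob_name=file_name)
    blob.upload_blob(resp.content, overwrite=True)

if __name__ == "__main__":
    import sys
    # Azure Batch would pass one file name per task, e.g. on the command line.
    download_to_blob(sys.argv[1])
```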

Azure Search default database type

I am new to Azure Search and I have just seen this tutorial https://azure.microsoft.com/en-us/documentation/articles/search-howto-dotnet-sdk/ on how to create/delete an index, upload and search for documents. However, I am wondering what type of database is behind the Azure Search functionality. In the given example I couldn't see it specified. Am I right if I assume it is implicitly DocumentDb?
At the same time, how could I specify the type of another database inside the code? How could I possibly use a Sql Server database? Thank you!
However, I am wondering what type of database is behind the Azure Search functionality.
Azure Search is offered to you as a service. The team hasn't made the underlying storage mechanism public, so it's not possible to know what kind of database they are using to store the data. However, you interact with the service in the form of JSON records. Each document in your index is sent/retrieved (and possibly saved) in the form of JSON.
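For example, with the Python SDK (azure-search-documents) a document is just a JSON-shaped dict going in and coming out; the index name and fields below are assumptions:

```python
# pip install azure-search-documents
# Sketch: you only ever see JSON-shaped documents; the storage behind the
# service is not exposed. Index and field names here are assumptions.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient("https://<service>.search.windows.net", "hotels",
                      AzureKeyCredential("<admin-key>"))

# Upload: a plain dict that matches the index schema, serialized as JSON.
client.upload_documents(documents=[{
    "HotelId": "1",
    "HotelName": "Fancy Stay",
    "Rating": 4.5,
}])

# Retrieve: what comes back is again just a JSON-shaped dict.
doc = client.get_document(key="1")
print(doc["HotelName"])
```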
At the same time, how could I specify the type of another database inside the code? How could I possibly use a Sql Server database?
Short answer: you can't. Because it is a service, you can't tell the service which database to use to store its data. However, what you can do is ask the search service to populate its database (read: index) from multiple sources - SQL databases, DocumentDB collections and blob containers (currently in preview). This is achieved through something called Data Sources and Indexers. Once configured properly, the Azure Search service will constantly update the index with the latest data in the specified data source.
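As a sketch, this is roughly what pointing an indexer at an existing data source looks like with the Python SDK; the data source name "hotels-sql", the index name and the 5-minute schedule are assumptions:

```python
# pip install azure-search-documents
# Sketch: have the service pull from an already-created data source into the
# "hotels" index on a schedule, then run it on demand and check its status.
from datetime import timedelta
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import SearchIndexer, IndexingSchedule

client = SearchIndexerClient("https://<service>.search.windows.net",
                             AzureKeyCredential("<admin-key>"))

indexer = SearchIndexer(
    name="hotels-indexer",
    data_source_name="hotels-sql",       # assumed to exist already
    target_index_name="hotels",
    schedule=IndexingSchedule(interval=timedelta(minutes=5)),
)
client.create_or_update_indexer(indexer)

client.run_indexer("hotels-indexer")                  # kick off a run on demand
status = client.get_indexer_status("hotels-indexer")  # check progress/errors
print(status.last_result.status if status.last_result else status.status)
```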

How do I backup my azure search index?

I am using Azure Search and would like to make sure I can recover from a self-inflicted disaster before I push more docs in there. How do I back up my index?
Is creating Azure Search replicas equivalent to making a backup?
How would one restore that?
Thanks
Microsoft have released a console app on GitHub that can be used to back up and restore Azure Search indexes - it's not perfect, but I use it almost daily for backups and restores from prod to CI/QC/Dev instances:
https://learn.microsoft.com/en-us/samples/azure-samples/azure-search-dotnet-samples/azure-search-backup-restore-index/
Right now you can't do that from the API or the portal; just save a copy of the JSON schema to a .js file, for example. See the Get Index API.
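For example, a small sketch of saving the schema via the Get Index REST API (service name, key and api-version are placeholders); note this captures the index definition only, not the documents:

```python
# pip install requests
# Sketch: fetch the index definition with the Get Index REST API and save it
# to a file so the schema can be recreated later. Placeholders are assumptions.
import json
import requests

SERVICE = "<service-name>"
API_KEY = "<admin-key>"
API_VERSION = "2020-06-30"  # any GA api-version your service supports

resp = requests.get(
    f"https://{SERVICE}.search.windows.net/indexes/hotels",
    params={"api-version": API_VERSION},
    headers={"api-key": API_KEY},
    timeout=30,
)
resp.raise_for_status()

with open("hotels-index-schema.json", "w") as f:
    json.dump(resp.json(), f, indent=2)
```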
Normally you don't need to touch the index very often, only add, update or remove documents.
You would need to use an indexer from an external source to push the data into Search and be able to create backups at the same time.
If it's an Azure SQL database, this may do it for you automatically, depending on your subscription.
Create a table with the same fields as the Azure Search index, add a deleted flag and a last-update date, then import all of your data into the database. Set the last-update date to the time you imported the data.
At the top of the Azure Search service page in the portal there is an option to 'Import Data'. This will allow you to connect the data source; this way you can create an index which will look at the last modified date and the deleted flag when you create the connection.
The wizard will take you through all of the options
From there, just update the SQL table with your changes and the indexer will automatically push them to Azure search.
Thank you for the answer about https://learn.microsoft.com/en-us/rest/api/searchservice/Get-Index
Sometimes the Azure Search index is the only source from which to restore data.
For example, in Microsoft QnA Maker - if you delete the Azure Web App or Azure App Service, you can no longer even export the knowledge base from QnA Maker.
To somehow restore data from QnA Maker, I used the Azure Search index.
