How can I verify data uploaded to CosmosDB?

I have a dataset of 442k JSON documents in a single ~2.13 GB file in Azure Data Lake Store.
I've uploaded it to a collection in Cosmos DB via an Azure Data Factory pipeline. The pipeline completed successfully.
But when I looked at the collection in the Azure portal, I noticed its size is only 1.5 GB. I tried running SELECT COUNT(c.id) FROM c against the collection, but it returns only 19k. I've also seen complaints that this count function is not reliable.
If I open the collection preview, the first ~10 records match my expectations (the ids and content are the same as in the ADLS file).
Is there a way to quickly get the real record count? Or some other way to be sure that nothing was lost during the import?

According to this article:
When using the Azure portal's Query Explorer, note that aggregation queries may return partially aggregated results over a query page. The SDKs produce a single cumulative value across all pages.
To perform aggregation queries in code, you need .NET SDK 1.12.0, .NET Core SDK 1.1.0, or Java SDK 1.9.5 or above.
So I suggest you first try using the Azure DocumentDB SDK to get the count value.
For more details on how to use it, refer to this article.
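The difference between the two numbers comes down to pagination: the portal may show only one page's partial aggregate, while the SDKs drain every continuation page and sum the partial counts before returning a single value. A minimal sketch of that cross-page aggregation, with made-up page data and a stand-in for the SDK's query call:

```python
def total_count(fetch_page):
    """Drain every continuation page and sum the partial COUNT values,
    which is what the SDKs do before returning a single number."""
    total, token = 0, None
    while True:
        partial, token = fetch_page(token)
        total += partial
        if token is None:
            return total

# fetch_page is a stand-in for an SDK query call; these three fake
# pages are keyed by an illustrative continuation token.
pages = {None: (19000, "t1"), "t1": (200000, "t2"), "t2": (223000, None)}
print(total_count(pages.get))  # 442000
```

If the summed count matches the 442k documents in the source file, the import is complete; the 19k figure was just the first page's partial result.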

Related

Is it possible to download a million files in parallel from Rest API endpoint using Azure Data Factory into Blob?

I am fairly new to Azure, and I have been tasked with using some Azure service (or several Azure services integrated together) to download a million files in parallel from a third-party REST API endpoint, which returns one file at a time, into Blob Storage using Azure Data Factory.
WHAT I RESEARCHED:
From what I researched, my task had these requirements in a nutshell:
Parallel runs in the millions: for this I deduced Azure Batch would be a good option, as it lets you run a large number of tasks in parallel on VMs (it is used this way for graphics rendering and machine-learning workloads).
Saving the REST API response to Blob Storage: I found that Azure Data Factory can handle this kind of ETL operation in a source/sink style, where I could set the REST API as the source and Blob Storage as the sink.
WHAT I HAVE TRIED:
Here are some things to note:
I added the REST API and Blob as linked services.
The API endpoint takes a query string parameter named fileName.
I am passing the whole URL including the query string.
The REST API is protected by a bearer token, which I am passing via additional headers.
THE MAIN PROBLEM:
I get a one-line error message on publishing the pipeline saying the model is not appropriate, and it gives no insight into what's wrong.
OTHER QUERIES:
Is it possible to pass query string values dynamically from a SQL table, so that each fileName is picked from a single-column result set returned by a stored procedure or inline query?
Is it possible to make this pipeline run in parallel using Azure Batch somehow? How can we integrate that?
Is it possible to achieve a million parallel downloads without Data Factory, using just Batch?
It's hard to help with your main problem; you would need to provide more of your pipeline code.
In relation to your other queries:
You can use a Lookup activity to fetch the list of files from a database (via either a stored procedure or an inline query). The next step is a ForEach activity that iterates over the array and copies each file from the REST endpoint to the storage account. You can adjust the parallelism on the ForEach activity to match your requirements, but around 20 concurrent executions is typical.
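As a sketch, the Lookup-then-ForEach wiring in pipeline JSON might look like the fragment below. All activity names and the SQL query are made up, and the Copy activity's REST source and blob sink dataset references are omitted for brevity; the structure follows Data Factory's pipeline schema:

```json
{
  "name": "CopyFilesFromRest",
  "properties": {
    "activities": [
      {
        "name": "LookupFileNames",
        "type": "Lookup",
        "typeProperties": {
          "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT fileName FROM dbo.FilesToFetch"
          },
          "firstRowOnly": false
        }
      },
      {
        "name": "ForEachFile",
        "type": "ForEach",
        "dependsOn": [
          { "activity": "LookupFileNames", "dependencyConditions": ["Succeeded"] }
        ],
        "typeProperties": {
          "isSequential": false,
          "batchCount": 20,
          "items": {
            "value": "@activity('LookupFileNames').output.value",
            "type": "Expression"
          },
          "activities": [
            { "name": "CopyOneFile", "type": "Copy" }
          ]
        }
      }
    ]
  }
}
```

Inside the Copy activity, the current row is available as @item(), so @item().fileName can feed the relative URL (or a parameter) of the REST dataset.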
Using Azure Batch just to download a file seems a bit overkill, as it should be a fairly quick operation. If you want to see an example of an Azure Batch job written in C#, I can recommend this sample: https://github.com/Azure-Samples/batch-dotnet-quickstart/blob/master/BatchDotnetQuickstart. In terms of parallelism, I think you will achieve a higher degree on Azure Batch than on Azure Data Factory.
If you actually need to download 1M files in parallel, I don't think you have any option other than Azure Batch to get close to such numbers. But you must have a pretty beefy API if it can handle 1M requests within a second or two.

Retrieve data from Azure Logic App pagination

I am using an Azure Logic App to upload existing data from OneDrive to Azure File Storage.
In OneDrive there are more than 300 directories and more than 10,000 files.
I tried to use the OneDrive "List files in folder" connector to list all files and directories, and from that result I can filter out the files. But the connector returns only 20 entries.
I could not get all the entries. I searched quite a lot but could not find any resources.
In Azure Logic Apps there is a nextLink property for getting data from subsequent pages, but I couldn't find proper documentation on how to use it.
Does anybody have an idea how to retrieve paginated data in an Azure Logic App?
We recently worked on a Logic App where we get paged data from Azure Activity Logs; there, too, responses are paged by default. We used an "Until" loop in Azure Logic Apps that runs until nextLink comes back as undefined.
The following is what the condition in the Until loop looks like (Get_Logs is our Azure Monitor API connector; you can replace it with your connector that gets the file list from OneDrive):
@equals(coalesce(body('Get_Logs')?.nextLink, 'undefined'), 'undefined')
Hope this helps!
Method 1:
1. Create a variable of string type.
2. Use the Until connector.
3. When there are no further records, nextLink will be undefined.
4. Check for this using coalesce, since the expression language cannot compare against undefined directly.
5. Append the results to the variable.
Method 2:
1. Use the inline code connector, which gives you the ability to write JavaScript.
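The Until loop above is just a nextLink-following loop. The same pattern, sketched in Python with a stand-in for the OneDrive connector and two fake pages of file entries:

```python
def fetch_all(get_page, first_url):
    """Follow nextLink until it is absent - the pattern the Until loop
    with coalesce(..., 'undefined') expresses in a Logic App."""
    items, url = [], first_url
    while url is not None:
        body = get_page(url)
        items.extend(body["value"])
        url = body.get("nextLink")  # None once the last page is reached
    return items

# Stand-in for the connector: two illustrative pages of file names.
pages = {
    "page1": {"value": ["a.txt", "b.txt"], "nextLink": "page2"},
    "page2": {"value": ["c.txt"]},
}
print(fetch_all(pages.get, "page1"))  # ['a.txt', 'b.txt', 'c.txt']
```

In the Logic App, the loop body calls the connector again with the previous response's nextLink, and each page's entries are appended to the accumulating variable.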

Refresh an Azure search index

I've been exploring Azure Search recently, as I'd like to use it in some of our apps. I've created an index, imported the data, and have begun querying the data using both the Search Explorer and the REST APIs. All well and good.
I changed the underlying data to test out the fuzzy search capabilities. However, I was getting incorrect results, as the data being returned was still the old data. I eventually found out how to forcibly refresh the underlying data from the Azure portal, but is there a way to do this using a REST API, or to automate it in some way? I don't want to have to keep manually refreshing the Azure Search index going forward.
An indexer normally runs once, immediately after it is created. You can run it again on demand using the portal, the REST API, or the .NET SDK. You can also configure an indexer to run periodically on a schedule.
Source data changes over time, and you may want Azure Cognitive Search indexers to automatically process the changed data. You can schedule indexers in Azure Cognitive Search with a custom interval (between 5 minutes and 24 hours).
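Running an indexer on demand is a single REST call; the service name, indexer name, and admin key below are placeholders, and the api-version is one of the stable versions:

```http
POST https://[service name].search.windows.net/indexers/[indexer name]/run?api-version=2020-06-30
api-key: [admin key]
```

A 202 Accepted response means the run was queued. To run it periodically instead, add a schedule (an ISO 8601 interval) to the indexer definition, for example:

```json
"schedule": { "interval": "PT2H" }
```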

How to get sorted results in powershell using az storage entity query?

When you are using Azure Storage Explorer, you can click on the name of each column to sort the results by that field.
Is there any way to sort query results in PowerShell using az storage entity query?
In other words, I can get the results in the Azure CLI as an object and sort them using Sort-Object, but I want the entries sorted on the Azure Storage server and returned pre-sorted. It's not useful to pull all of the data from the server and sort it manually.
Please see this page: https://learn.microsoft.com/en-us/rest/api/storageservices/Query-Operators-Supported-for-the-Table-Service?redirectedfrom=MSDN
It has the complete list of supported query operators for Azure Table storage; OrderBy is sadly not among them.
This means you will need to retrieve the data first, then do the sorting.
but I want to sort entries on the Azure Storage Server and get sorted-results. It's not useful to get all of the data from the server and sort it manually.
It is not possible, as Azure Tables does not support server-side sorting. You will need to fetch the desired data on the client and perform the sorting there.
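Since the Table service only filters and never sorts, the sort has to happen after retrieval; in PowerShell that is the Sort-Object pipe the question mentions. The same client-side sort, sketched in Python over already-fetched entities (the rows and column names here are illustrative):

```python
# Entities as `az storage entity query` might return them (made-up rows).
entities = [
    {"PartitionKey": "p1", "RowKey": "3", "Name": "chi"},
    {"PartitionKey": "p1", "RowKey": "1", "Name": "alpha"},
    {"PartitionKey": "p1", "RowKey": "2", "Name": "beta"},
]

# The server can only filter; sorting by an arbitrary column happens here.
by_name = sorted(entities, key=lambda e: e["Name"])
print([e["RowKey"] for e in by_name])  # ['1', '2', '3']
```

If the table is large, narrow the result set with a server-side filter first so the client only sorts the rows it actually needs.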

Azure Search default database type

I am new to Azure Search, and I have just seen this tutorial https://azure.microsoft.com/en-us/documentation/articles/search-howto-dotnet-sdk/ on how to create/delete an index, upload documents, and search for them. However, I am wondering what type of database is behind the Azure Search functionality. It isn't specified in the given example. Am I right to assume it is implicitly DocumentDB?
At the same time, how could I specify a different database in the code? How could I use a SQL Server database, for instance? Thank you!
However, I am wondering what type of database is behind the Azure Search functionality.
Azure Search is offered to you as a service. The team hasn't made the underlying storage mechanism public, so it's not possible to know what kind of database they are using to store the data. However, you interact with the service in the form of JSON records: each document in your index is sent and retrieved (and presumably stored) as JSON.
At the same time, how could I specify the type of another database inside the code? How could I possibly use a SQL Server database?
Short answer: you can't. Because it is a service, you can't choose what it uses for its own storage. What you can do is have the search service populate its database (read: the index) from multiple sources: SQL databases, DocumentDB collections, and blob containers (currently in preview). This is achieved through Data Sources and Indexers. Once configured properly, the Azure Search service will keep the index updated with the latest data in the specified data source.
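As a sketch of those Data Sources and Indexers (all names and the connection string are placeholders; the shape follows the search service's REST definitions), you first register the SQL database as a data source:

```json
{
  "name": "sql-datasource",
  "type": "azuresql",
  "credentials": { "connectionString": "[SQL connection string]" },
  "container": { "name": "[table or view name]" }
}
```

and then create an indexer that pumps it into your index on a schedule:

```json
{
  "name": "sql-indexer",
  "dataSourceName": "sql-datasource",
  "targetIndexName": "[index name]",
  "schedule": { "interval": "PT15M" }
}
```

These are POSTed to the service's /datasources and /indexers endpoints respectively; the indexer then re-reads the source on each scheduled run.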