Azure Synapse (Azure Data Factory) REST source pagination with dynamic limit - azure

In Azure Synapse I'm performing a data copy activity. It has a REST API as its source and needs to store the data in an on-premises SQL table. The whole setup is configured and the table is filled with 100 records. This is a limit from the API: it returns 100 records by default. The total amount of data I need to collect from this endpoint is somewhere around 150,000 records, and it grows by the day.
I saw that pagination might help me here: once 100 records are collected, I can start collecting the next 100, and so on until I've retrieved the entire data set. I don't want to set the total limit in the configuration; I would like it to keep collecting dynamically until it reaches the maximum by itself.
How can I set this up?
Thanks in advance!
I've set up pagination with the value 'skip', which refers to the relative URL I defined as:
?take=100&skip={skip}. I've tried to work with the value parameters, but I have no clue how to set that up.

I don't want to set the total limit in the configuration; I would like it to keep collecting dynamically until it reaches the maximum by itself.
To achieve this, we need to follow two steps:
Set "AbsoluteUrl.offset": "RANGE:0::20", In the range of Pagination rules and leave the end of range empty. Here it will start from 0 and end is not defined with offset of 20.
Set end-condition rules according to the expected last response; otherwise, the pipeline run will not stop, because we have not provided any end to the pagination range.
Here I used 'the pagination ends when the value of the key in the response equals a user-defined constant value': my data contains an id, and when it equals 2000 the pipeline run stops.
My sample data contains around 100 objects.
My dataset settings and the preview after applying the pagination rule: I set the offset to 20 and the limit/take to 1, which is why only 5 objects are shown (just to check whether it runs all the way to the last of the data).
The pipeline completed successfully.
Refer to the Microsoft documentation for more details on pagination rules.
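For comparison, here is a minimal Python sketch (outside of Synapse) of the skip/take loop that the pagination rule automates, assuming a hypothetical endpoint that accepts the take and skip query parameters from the question and returns an empty list once the data is exhausted:

```python
import requests

BASE_URL = "https://api.example.com/records"  # hypothetical endpoint
PAGE_SIZE = 100                               # the API's default/maximum page size

def fetch_all():
    """Collect every record by advancing 'skip' until the API returns no more data."""
    records, skip = [], 0
    while True:
        resp = requests.get(BASE_URL, params={"take": PAGE_SIZE, "skip": skip})
        resp.raise_for_status()
        page = resp.json()
        if not page:               # empty page: the end condition has been reached
            break
        records.extend(page)
        skip += PAGE_SIZE          # same idea as a RANGE rule stepping the skip value
    return records

all_records = fetch_all()
print(f"Collected {len(all_records)} records")
```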

Related

Configuring Azure Log Alerts using two columns from a summarized table

I am trying to configure an alert in Azure that will send an email when a device has responded as "offline" 3 or more times in the last 15 minutes. The query I am using returns a summarized table with two columns, "Name" and "Count", where Count represents the number of offline responses.
Name | Count
ABC  | 4
DEF  | 3
My issue comes into play when trying to set up the conditions for the alert. Ideally I want to trigger an alert for any row where Count is greater than or equal to 3.
I can measure off of Table Rows or Count, but I cannot seem to wrap my head around how to set up the measurement and dimension splitting in a way that behaves like the goal I described above. Thus far I have been able to set it up using Count, but it seems to only allow aggregating the values in Count rather than looking at each row individually.
I have considered writing individual queries for each device that would return every offline response, and simply alerting off of the number of rows returned. I would much rather keep this contained to a single query. Thank you for any help you can provide.

Remote pagination and last_page: filter during, or after, database query?

I would like to use Tabulator's remote pagination to load records from my database table, one page at a time. I expect that I should be able to use the page and size parameters sent by Tabulator to my remote back-end to determine which records to select from the database.
For example, with page=2 and size=10, I can use MySQL's LIMIT 10, 10 (skip the first 10 rows, return the next 10) to select the records to be shown on page 2.
However, doing this precludes me from using the count of all records to determine the number of pages in the table. Doing a count on the returned records will only yield 10 records, even if there are a total of 500 records (for example), so only one pagination button will be shown (instead of the expected 50 buttons).
So, in order to do remote pagination "correctly" in Tabulator, it seems I must either query all records from my database (with no LIMIT), count them to determine last_page, and then use something like PHP's array_slice to extract the nth page's worth of records to return as the dataset; or do two database queries: count all records to determine the number of pages, and then do a LIMIT [offset],[count] query.
Is this correct?
Tabulator needs to know the last page number in order to lay out the pagination buttons in the table footer so that users can select the page they want to view from the list of pages.
You simply need to run a query to count the total number of records and divide it by the page size passed in the request. You can run a count query quite efficiently, returning only the count and no data.
You can then run a standard query with a LIMIT set to retrieve the records for that page.
If you want to optimize things further, you could cache the count value so that you don't need to generate it on each request.
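A minimal sketch of that two-query approach; the question mentions PHP, but this sketch uses Python with SQLite as a stand-in driver, and the records table and its columns are assumptions. It returns the last_page and data fields that Tabulator's remote pagination expects:

```python
import math
import sqlite3  # stand-in for any DB-API driver; a MySQL driver works the same way

def fetch_page(conn: sqlite3.Connection, page: int, size: int) -> dict:
    """Return one page of records plus the page count Tabulator needs."""
    # Query 1: a cheap COUNT(*) to work out how many pages exist.
    total = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    last_page = max(1, math.ceil(total / size))

    # Query 2: fetch only the requested page with LIMIT/OFFSET.
    offset = (page - 1) * size
    rows = conn.execute(
        "SELECT id, name FROM records ORDER BY id LIMIT ? OFFSET ?",
        (size, offset),
    ).fetchall()

    # Tabulator's remote pagination expects these two keys in the JSON response.
    return {
        "last_page": last_page,
        "data": [{"id": r[0], "name": r[1]} for r in rows],
    }

# Usage: fetch_page(sqlite3.connect("app.db"), page=2, size=10)
```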

Handle >5000 rows in Lookup # Azure Data Factory

I have a Copy Activity which copies a Table from MySQL to Azure Table Storage.
This works great.
But when I do a Lookup on the Azure Table, I get an error (too much data).
This is by design, according to the documentation:
The Lookup activity has a maximum of 5,000 rows, and a maximum size of 2 MB.
There is also a workaround mentioned:
Design a two-level pipeline where the outer pipeline iterates over an inner pipeline, which retrieves data that doesn't exceed the maximum rows or size.
How can I do this? Is there a way to define an offset (e.g., only read 1,000 rows)?
Do you really need 5,000 iterations of your foreach? What kind of processing are you doing in the foreach; isn't there a more efficient way of doing it?
Otherwise, maybe the following solution might be possible.
Create a new pipeline with two integer variables, iterations and count, with 0 as their defaults.
First, determine the needed number of iterations. Do a lookup to determine the total number of records; in your query, divide this by 5,000 and round it upwards. Set the value of the iterations variable to this result using the Set Variable activity.
Next, add a loop (ADF's Until activity) with an expression along the lines of @less(variables('count'), variables('iterations')); keep in mind that Until runs until its expression becomes true, so invert the comparison accordingly. In this loop, call your current pipeline and pass the count variable as a parameter. After the Execute Pipeline activity, increment the count variable by 1.
In your current pipeline, you can use the LIMIT/OFFSET clause in combination with the passed parameter in a MySQL query to get rows 0-5,000 on your first iteration, 5,000-10,000 on your second iteration, and so on.
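As a standalone illustration of that chunking arithmetic, here is a small Python sketch; the table name and ordering column are assumptions, and count is what the outer loop would pass as a parameter:

```python
CHUNK_SIZE = 5000  # the Lookup activity's row limit

def build_chunk_query(count: int, chunk_size: int = CHUNK_SIZE) -> str:
    """Build the MySQL query for iteration 'count' (0-based), one 5,000-row chunk at a time."""
    offset = count * chunk_size
    return f"SELECT * FROM source_table ORDER BY id LIMIT {chunk_size} OFFSET {offset}"

# Outer-loop equivalent: keep calling the inner pipeline until all chunks are read.
total_rows = 12_345                        # would come from the initial lookup
iterations = -(-total_rows // CHUNK_SIZE)  # ceiling division; 3 iterations here
for count in range(iterations):
    print(build_chunk_query(count))
```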
If you really need to iterate over the Table storage, the only solution I see is to implement pagination over the result set yourself; you could use a Logic App for this purpose and call it by using a webhook.

How to retrieve every nth row in Azure Storage?

I have the following scenario: information collected every minute is sent to and stored in Azure Table storage. Now I am trying to display this data in a graph. If I only show data for the last day, it would be relatively easy to filter 1,440 (24 * 60) data points down to the 200 I display. However, if we consider showing data over a month, I would have to handle over 40,000 data points (24 * 60 * 30), of which I only need to show 200. Assuming 40,000 points, I would only select every 200th data point, or row. Is this functionality possible in Azure Storage, or would I have to select batches at a time, pick the 200th element, and then move on to the next batch?
You should be able to select every 200th data point in your case. You could either use $top to limit the entities returned from a set, or a $filter query to get the results you want. I'd rather suggest using Power BI, which is free and compatible with Table storage; it basically turns data into graphs, and you can apply additional filters to whatever suits you.
You can read more about it here:
Power BI links: https://powerbi.microsoft.com/en-us/integrations/azure-table-storage/ and https://powerbi.microsoft.com/en-us/desktop/
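If you do end up sampling on the client side instead, here is a minimal sketch assuming the azure-data-tables package and a hypothetical 'Metrics' table whose PartitionKey is the series name and whose RowKey is an ISO-8601 timestamp; it streams one time window and keeps every nth entity:

```python
from itertools import islice
from azure.data.tables import TableClient

conn_str = "<storage connection string>"  # placeholder
client = TableClient.from_connection_string(conn_str, table_name="Metrics")

def sample_every_nth(query_filter: str, n: int) -> list:
    """Stream entities matching the filter and keep only every nth one."""
    entities = client.query_entities(query_filter)
    return list(islice(entities, 0, None, n))

# e.g. one month of per-minute points (~43,200) reduced to roughly 216 points
points = sample_every_nth(
    "PartitionKey eq 'device01' and RowKey ge '2024-01-01' and RowKey lt '2024-02-01'",
    n=200,
)
```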

How to iterate over a SOLR shard which has over 100 million documents?

I would like to iterate over all these documents without having to load the entire result in memory, which apparently is what happens: QueryResponse.getResults() returns a SolrDocumentList, which is an ArrayList.
I can't find anything in the documentation. I am using Solr 4.
A note on the background of the problem: I need to do this when adding a new Solr shard to the existing shard cluster. In that case, I would like to move some documents from the existing shards to the newly added shard(s) based on consistent hashing. Our data grows constantly, and we need to keep introducing new shards.
You can set the 'rows' and 'start' query params to paginate a result set. Query first with start = 0, then start = rows, start = 2*rows, etc. until you reach the end of the complete result set.
http://wiki.apache.org/solr/CommonQueryParameters#start
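A minimal Python sketch of that start/rows loop against a hypothetical Solr core, using the standard select handler with a stable sort (note that plain start/rows paging gets progressively slower deep into a result set of this size):

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/mycore/select"  # hypothetical core URL
ROWS = 1000

def iterate_all(query: str = "*:*"):
    """Yield documents page by page using start/rows offsets."""
    start = 0
    while True:
        resp = requests.get(SOLR_SELECT, params={
            "q": query,
            "start": start,
            "rows": ROWS,
            "wt": "json",
            "sort": "id asc",  # a stable sort keeps pages consistent between requests
        })
        resp.raise_for_status()
        docs = resp.json()["response"]["docs"]
        if not docs:
            break
        yield from docs
        start += ROWS

for doc in iterate_all():
    pass  # e.g. route the document to the new shard based on consistent hashing
```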
I have a possible solution I'm testing, pasted from "Solr paging 100 Million Document result set":
I am trying to do deep paging of very large result sets (e.g., over 100 million documents) using a separate indexed integer field into which I insert a random value (between 0 and some known MAXINT). When querying large result sets, I do the initial field query with no rows returned, and then, based on the count, I divide the range 0 to MAXINT into sub-ranges that should each yield on average PAGE_COUNT results; I then repeat the query across each sub-range of the random field and grab all the rows in that range. Obviously the actual number of rows per sub-range will vary, but it should follow a predictable distribution.
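A sketch of that bucketing idea in Python, assuming a hypothetical indexed integer field named rand populated with uniform random values in [0, MAXINT) at index time; the per-bucket rows headroom is also an assumption:

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/mycore/select"  # hypothetical core URL
MAXINT = 2**31 - 1
PAGE_COUNT = 10_000  # desired average number of documents per bucket

def count_matches(query: str) -> int:
    """Initial query with rows=0, just to get the total hit count."""
    resp = requests.get(SOLR_SELECT, params={"q": query, "rows": 0, "wt": "json"})
    resp.raise_for_status()
    return resp.json()["response"]["numFound"]

def iterate_by_random_buckets(query: str = "*:*"):
    """Walk the result set in roughly PAGE_COUNT-sized buckets of the random field."""
    total = count_matches(query)
    buckets = max(1, total // PAGE_COUNT)  # aim for ~PAGE_COUNT docs per bucket
    step = MAXINT // buckets + 1
    for lower in range(0, MAXINT, step):
        upper = min(lower + step - 1, MAXINT)
        resp = requests.get(SOLR_SELECT, params={
            "q": query,
            "fq": f"rand:[{lower} TO {upper}]",  # range filter on the random field
            "rows": PAGE_COUNT * 3,              # headroom above the expected average
            "wt": "json",
        })
        resp.raise_for_status()
        yield from resp.json()["response"]["docs"]
```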
