Handle >5000 rows in Lookup (Azure Data Factory)

I have a Copy Activity which copies a Table from MySQL to Azure Table Storage.
This works great.
But when I do a Lookup on the Azure Table I get an error (too much data).
This is by design, according to the documentation:
The Lookup activity has a maximum of 5,000 rows, and a maximum size of 2 MB.
Also there is a Workaround mentioned:
Design a two-level pipeline where the outer pipeline iterates over an inner pipeline, which retrieves data that doesn't exceed the maximum rows or size.
How can I do this? Is there a way to define an offset (e.g. only read 1,000 rows)?

Do you really need 5000 iterations of your foreach? What kind of process are you doing in the foreach? Isn't there a more efficient way of doing it?
Otherwise, maybe the following solution might be possible.
Create a new pipeline with two integer variables, iterations and count, both defaulting to 0.
First, determine the needed number of iterations: do a Lookup to get the total number of records, divide that total by 5,000 in your query and round it up. Set the iterations variable to this value using a Set Variable activity.
Next, add a loop with an expression along the lines of @less(variables('count'), variables('iterations')) (in Data Factory the loop construct is the Until activity, whose expression is the exit condition, so you would negate this). Inside the loop, call your current pipeline and pass the count variable as a parameter. After the Execute Pipeline activity, increment count by 1.
In your current pipeline you can then use a LIMIT/OFFSET clause, combined with the passed parameter, in the MySQL query: rows 0-4,999 on the first iteration, 5,000-9,999 on the second, and so on.
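For illustration, here is a rough Python sketch of the same chunking logic the two pipelines implement, assuming a hypothetical MySQL connection, table name and a page size of 5,000; in Data Factory the LIMIT/OFFSET query lives in the inner pipeline's source, parameterised by count:

```python
import math
import mysql.connector  # assumes the mysql-connector-python package

PAGE_SIZE = 5000  # matches the Lookup activity's row limit

conn = mysql.connector.connect(
    host="example-host", user="user", password="secret", database="mydb"  # hypothetical
)
cursor = conn.cursor()

# "Outer pipeline": work out how many iterations are needed.
cursor.execute("SELECT COUNT(*) FROM my_table")  # hypothetical table
iterations = math.ceil(cursor.fetchone()[0] / PAGE_SIZE)

# "Inner pipeline": fetch one page per iteration using LIMIT/OFFSET.
for count in range(iterations):
    cursor.execute(
        "SELECT * FROM my_table LIMIT %s OFFSET %s",
        (PAGE_SIZE, count * PAGE_SIZE),
    )
    page = cursor.fetchall()
    # ... process this page (at most 5,000 rows) ...
```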
If you really need to iterate over the table storage itself, the only solution I see is to implement pagination over the result set yourself; you could use a Logic App for this purpose and call it via a webhook.
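If you do end up paginating the Azure Table yourself (from a Logic App, an Azure Function, or any other caller), the table SDK already exposes the continuation handling you need. A minimal Python sketch using the azure-data-tables package, with a hypothetical connection string and table name:

```python
from azure.data.tables import TableClient

conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;"  # placeholder
client = TableClient.from_connection_string(conn_str, table_name="MyTable")  # hypothetical table

# Ask for pages of at most 1,000 entities; the SDK follows the
# continuation tokens for you as you iterate page by page.
pages = client.list_entities(results_per_page=1000).by_page()

for page_number, page in enumerate(pages, start=1):
    entities = list(page)
    # ... hand this chunk (<= 1,000 rows) to the next step ...
    print(f"page {page_number}: {len(entities)} entities")
```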

Related

Azure Synapse (Azure Data Factory) REST source pagination with dynamic limit

In Azure Synapse I'm performing a data copy action. It has a REST API as a source and needs to store this data in an on-premises SQL table. The whole setup is configured and the table is filled with 100 records. This is a limit from the API: it returns 100 records by default. The total amount of data that I need to collect from this endpoint is somewhere around 150,000 records and grows by the day.
I saw that pagination might help me here, so that once 100 records are collected I can start collecting the next 100, and so on until I've reached the end of the data set. I don't want to set the total limit in the configuration; I would like it to be collected dynamically until it reaches the maximum by itself.
How can I set this up?
Thanks in advance!
I've set pagination with the header value 'skip', which refers to the relative URL I defined as:
?take=100&skip={skip}. I've tried to work with the value parameters, but I have no clue how to set that up.
I don't want to set the total limit in the configuration, I would like to see that it is being collected dynamically until it reaches the maximum by itself.
To achieve this scenario, we need to follow two steps:
Set "AbsoluteUrl.offset": "RANGE:0::20" in the pagination rules and leave the end of the range empty. The range starts at 0, has no defined end, and advances with a step (offset) of 20.
Set End Condition rules based on the last response; otherwise the pipeline run will never stop, since we haven't provided an end to the pagination range.
Here I used the rule "pagination ends when the value of the key in the response equals a user-defined constant": my data contains an id, and when it equals 2000 the pipeline run stops.
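For clarity, this is the loop the pagination rules automate. A hedged Python sketch of the same skip/take pattern against a hypothetical endpoint, with an empty (or short) response acting as the end condition:

```python
import requests

BASE_URL = "https://example.com/api/records"  # hypothetical REST endpoint
TAKE = 100                                    # the API's page size

def fetch_all():
    records, skip = [], 0
    while True:
        resp = requests.get(BASE_URL, params={"take": TAKE, "skip": skip})
        resp.raise_for_status()
        page = resp.json()
        if not page:            # end condition: nothing left to fetch
            break
        records.extend(page)
        if len(page) < TAKE:    # a short page also means we've hit the end
            break
        skip += TAKE            # advance the offset, like the RANGE rule's step
    return records

all_records = fetch_all()
print(len(all_records))
```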
My sample data contains around 100 objects; my dataset settings are shown in the screenshots (omitted here).
Preview after applying the pagination rule: I set the offset to 20 and the limit/take to 1, which is why only 5 objects are shown (just to check whether it pages through to the last of the data).
The pipeline completed successfully.
Refer to the Microsoft documentation for more detail on pagination rules.

Loop through Spark Dataframe, save results and use results on the previous iteration

How can I loop through a Spark dataframe, apply business logic and use the results in the next iteration? I'm moving a script from pandas/numpy to Spark because of the amount of data we have to process in this job. The business logic we have is very complicated, and I've been able to move it to Spark. The issue I'm having is how to carry the results from Group 1 over to Group 2 to be used there. The problem isn't that simple either: there are about 10 variables that depend on the previous group and are used in the current group's calculations. I've been thinking about maybe streaming in the groups and saving the results to a temp table of some sort, then using those results on the next stream? Not sure how that would work yet. Any ideas?
For added Context:
I have a dataframe with a ton of logic implemented on it. There's a column with values from 1 to 20. I have defined a ton of logic for Group 1. I need to pass those same transformations, with the calculations in place, to Group 2 and so on. Is it possible to pass the dataframe to a function with outputs?
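I don't know the exact shape of your logic, but one way to structure it in PySpark, assuming the values the next group depends on can be reduced to a handful of scalars (the columns and calculations below are purely illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative input: a 'group' column (1-20) plus whatever columns the logic needs.
df = spark.createDataFrame(
    [(1, 10.0), (1, 20.0), (2, 5.0), (2, 15.0)],
    ["group", "value"],
)

carried = {"prev_total": 0.0}  # outputs of the previous group, carried forward
results = []

group_values = sorted(r["group"] for r in df.select("group").distinct().collect())
for g in group_values:
    group_df = df.filter(F.col("group") == g)

    # Business logic for this group, using the previous group's outputs.
    group_result = group_df.withColumn(
        "running_value", F.col("value") + F.lit(carried["prev_total"])
    )

    # Collect the handful of scalars the next group depends on.
    carried["prev_total"] = group_result.agg(F.sum("value")).first()[0]
    results.append(group_result)

# Stitch the per-group results back together.
final = results[0]
for part in results[1:]:
    final = final.unionByName(part)
final.show()
```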

DocumentDB COUNT Inconsistent Results

I have been trying some queries using the COUNT aggregate in DocumentDB that was just recently released. Even when I run the exact same query multiple times, I am regularly getting different results. I know my data isn't changing. Is there a bug with the aggregate functions, could I be reaching my RU limit and it is only returning the counts that fit within my RU amount, or is something else going on?
My query looks like:
Select COUNT(c.id) FROM c WHERE Array_Contains(c.Property, "SomethingIAmSearchingFor")
My collection contains about 12k documents that are very small (3 or 4 string properties each and one array with less than 10 string items in it)
In DocumentDB, aggregate functions are distributed across 1-N partitions, and within each partition they are executed in chunks/pages based on the available RUs, as you guessed. The SDK fetches the partial aggregates and returns the final result (e.g. it sums the counts from each partial result).
If you run the query to completion, you will always get the same aggregate result even if the individual partial executions return different results.
In the portal use the "Load more →" link to get the count of the next portion. You need to manually record the counts shown so far and sum them to get the final aggregated count.
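To make "run the query to completion" concrete, here is a hedged sketch with the current Python SDK (azure-cosmos), where the iterator is drained and the partial counts are summed client-side; the account URL, key, database and container names are placeholders:

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")  # placeholders
container = client.get_database_client("mydb").get_container_client("mycoll")            # hypothetical names

query = (
    "SELECT VALUE COUNT(1) FROM c "
    "WHERE ARRAY_CONTAINS(c.Property, 'SomethingIAmSearchingFor')"
)

# Iterating the result drains every partition/page, so the summed partial
# counts come out the same on every run even if individual pages differ.
total = sum(container.query_items(query, enable_cross_partition_query=True))
print(total)
```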

How to iterate over a SOLR shard which has over 100 million documents?

I would like to iterate over all these documents without having to load the entire result set in memory, which apparently is what happens: QueryResponse.getResults() returns a SolrDocumentList, which is an ArrayList.
I can't find anything in the documentation. I'm using Solr 4.
Note on the background of problem: I need to do this when adding a new SOLR shard to the existing shard cluster. In that case, I would like to move some documents from the existing shards to the newly added shard(s) based on consistent hashing. Our data grows constantly and we need to keep introducing new shards.
You can set the 'rows' and 'start' query params to paginate a result set. Query first with start = 0, then start = rows, start = 2*rows, etc. until you reach the end of the complete result set.
http://wiki.apache.org/solr/CommonQueryParameters#start
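As a reference point, the start/rows loop looks roughly like this in Python against a hypothetical core (note that very deep start values get progressively slower, which is what motivates the range-bucketing idea below):

```python
import requests

SOLR_URL = "http://localhost:8983/solr/collection1/select"  # hypothetical core
ROWS = 1000

def iterate_all(query="*:*"):
    start = 0
    while True:
        resp = requests.get(SOLR_URL, params={
            "q": query, "start": start, "rows": ROWS, "wt": "json",
        })
        docs = resp.json()["response"]["docs"]
        if not docs:        # past the end of the result set
            break
        for doc in docs:
            yield doc
        start += ROWS       # next page
```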
I have a possible solution I'm testing, pasted from "Solr paging 100 Million Document result set":
I am trying to do deep paging of very large result sets (e.g., over 100 million documents) using a separate indexed field (integer) into which I insert a random variable (between 0 and some known MAXINT). When querying large result sets, I do the initial field query with no rows returned and then based on the count, I divide the range 0 to MAXINT in order to get on average PAGE_COUNT results by doing the query again across a sub-range of the random variable and grabbing all the rows in that range. Obviously the actual number of rows will vary but it should follow a predictable distribution.
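A hedged sketch of that random-field bucketing approach, assuming a hypothetical indexed integer field named random_field populated with uniform values in [0, MAXINT):

```python
import requests

SOLR_URL = "http://localhost:8983/solr/collection1/select"  # hypothetical core
MAXINT = 2**31 - 1
PAGE_COUNT = 1000

def iterate_by_random_buckets(query="*:*"):
    # 1) Count the full result set without fetching any rows.
    total = requests.get(
        SOLR_URL, params={"q": query, "rows": 0, "wt": "json"}
    ).json()["response"]["numFound"]

    # 2) Slice the random field's range so each slice holds ~PAGE_COUNT docs on average.
    buckets = max(1, total // PAGE_COUNT)
    step = MAXINT // buckets + 1

    for lower in range(0, MAXINT, step):
        upper = lower + step - 1
        resp = requests.get(SOLR_URL, params={
            "q": query,
            "fq": f"random_field:[{lower} TO {upper}]",  # hypothetical field name
            "rows": PAGE_COUNT * 2,  # headroom: actual slice sizes vary around the mean
            "wt": "json",
        })
        for doc in resp.json()["response"]["docs"]:
            yield doc
```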

What's a better counting algorithm for Azure Table Storage log data?

I'm using Windows Azure and venturing into Azure Table Storage for the first time in order to make my application scale to high-density traffic loads.
My goal is simple: log every incoming request against a set of parameters and, for reporting, count or sum the data from the log. I have come up with two options and I'd like to know what more experienced people think is the better one.
Option 1: Use Boolean Values and Count the "True" rows
Because each row is written once and never updated, store each count parameter as a bool; in the summation thread, pull the rows with a query and count the true values in each column to get the total for each parameter.
This would save space if there are a lot of parameters because I imagine Azure Tables store bool as a single bit value.
Option 2: Use Int Values and Sum the rows
Each row is written as above, but instead each parameter column is added as a value of 0 or 1. Summation would occur by querying all of the rows and using a Sum operation for each column. This would be quicker because Summation could happen in a single query, but am I losing something in storing 32 bit integers for a Boolean value?
I think at this point for query speed, Option 2 is best, but I want to ask out loud to get opinions on the storage and retrieval aspect because I don't know Azure Tables that well (and I'm hoping this helps other people down the road).
Table storage doesn't do aggregation server-side, so for both options, you'd end up pulling all the rows (with all their properties) locally and counting/summing. That makes them both equally terrible for performance. :-)
I think you're better off keeping a running total instead of re-summing everything every time. We talked about a few patterns for that on Cloud Cover Episode 43: http://channel9.msdn.com/Shows/Cloud+Cover/Cloud-Cover-Episode-43-Scalable-Counters-with-Windows-Azure
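For reference, one way to keep such a running total in Table storage itself is an optimistic-concurrency read-increment-update on a counter entity. A hedged Python sketch using the azure-data-tables package (the connection string, table and key names are made up):

```python
from azure.core import MatchConditions
from azure.core.exceptions import ResourceModifiedError, ResourceNotFoundError
from azure.data.tables import TableClient, UpdateMode

conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;"  # placeholder
client = TableClient.from_connection_string(conn_str, table_name="Counters")  # hypothetical table

def increment(parameter_name: str, delta: int = 1) -> None:
    """Bump a running total instead of re-summing every log row at report time."""
    while True:
        try:
            entity = client.get_entity(partition_key="totals", row_key=parameter_name)
            entity["Count"] += delta
            client.update_entity(
                entity,
                mode=UpdateMode.MERGE,
                etag=entity.metadata["etag"],
                match_condition=MatchConditions.IfNotModified,  # optimistic concurrency
            )
            return
        except ResourceNotFoundError:
            # First hit for this parameter; a concurrent create would raise and could be retried too.
            client.create_entity(
                {"PartitionKey": "totals", "RowKey": parameter_name, "Count": delta}
            )
            return
        except ResourceModifiedError:
            continue  # someone else updated the counter in between; retry with a fresh read
```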
