CloudTable.ExecuteQuery: how to get the number of transactions - Azure

AFAIK, ExecuteQuery handles segmented queries internally, and every request (i.e., call to storage) for a segment counts as a billable transaction (see http://www.windowsazure.com/en-us/pricing/details/storage/). Right?
Is there a way to find out how many segments a query was split into? If I run a query and get 5000 items, I can assume the query was split into 5 segments (due to the limit of 1000 items per segment). But for a complex query there is also the 5-second timeout per call, so the item count alone doesn't tell me the number of requests.

I don't believe there's a way to get at that in the API. You could set up an HTTPS proxy to log the requests, if you just want to check it in development.
If it's really important that you know, use the BeginExecuteSegmented and EndExecuteSegmented calls instead. Your code will get a callback for each segment, so you can easily track how many calls there are.
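Those are the .NET APM methods; for illustration, here is a minimal sketch of the same segment-by-segment pattern using the legacy azure-storage Node.js package (the table name and filter are made up for the example). Each callback corresponds to one request to storage, i.e. one billable transaction:
const azure = require("azure-storage");
// Assumes AZURE_STORAGE_CONNECTION_STRING is set in the environment.
const tableService = azure.createTableService();
const query = new azure.TableQuery().where("PartitionKey eq ?", "mypartition");

function countSegments(tableName, callback) {
    let segments = 0;
    let entities = [];
    (function fetchSegment(token) {
        tableService.queryEntities(tableName, query, token, (err, result) => {
            if (err) return callback(err);
            segments++; // one segment = one storage request = one transaction
            entities = entities.concat(result.entries);
            if (result.continuationToken) {
                fetchSegment(result.continuationToken); // more segments remain
            } else {
                callback(null, { entities, segments });
            }
        });
    })(null);
}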

Related

Firestore Query performance issue on Firebase Cloud Functions

I am facing timeout issues on a Firebase HTTPS function, so I decided to optimize each line of code and realized that a single query is taking about 10 seconds to complete.
let querySnapshot = await admin.firestore()
.collection("enrollment")
.get()
The enrollment collection has about 23k documents, totaling approximately 6MB.
To my understanding, since the HTTPS function runs on a stateless Cloud Functions server, it should not suffer from the size of the query result. Both Firestore and Cloud Functions are running in the same region (us-central). Yet 10 seconds is a long time for such a simple query that returns a small snapshot.
An interesting fact is that later in the code I update those 23k documents back with a new field using BulkWriter, and it takes less than 3 seconds to run bulkWriter.commit().
Another fact is that the HTTPS function is not returning any of the query results to the client, so there shouldn't be any "downloading" time affecting the function's performance.
Why on earth does it take 3x longer to read values from a collection than to write to it? I always thought Firestore's architecture was designed for apps with high read rates rather than high write rates.
Is there anything you would propose to optimize this?
When you perform the get(), a query is created over all document snapshots and the results are returned. These results are fetched sequentially within a single execution, i.e. the list is returned and parsed sequentially until all documents have been listed.
While the data may be small, are there any subcollections? These can add latency as the API fetches and parses them.
Updating the fields with BulkWriter is over 3x the speed because BulkWriter operations are performed in parallel and are queued based upon Promises. This allows many more operations per second.
The best way to optimize listing all documents is summarised in this link, and Google's recommendation follows the same guideline: use an index for faster queries, and use multiple readers that fetch the documents in parallel.
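As a rough sketch of the pagination half of that advice (assuming the enrollment collection from the question; truly parallel readers would additionally partition the document-ID space across workers), one could page through the collection with cursors so each request stays small:
const admin = require("firebase-admin");

async function listAllEnrollments(pageSize = 1000) {
    const col = admin.firestore().collection("enrollment");
    const docs = [];
    let last = null;
    while (true) {
        let q = col.orderBy(admin.firestore.FieldPath.documentId()).limit(pageSize);
        if (last) q = q.startAfter(last);        // resume after the previous page
        const snap = await q.get();
        docs.push(...snap.docs);
        if (snap.size < pageSize) break;         // last page reached
        last = snap.docs[snap.size - 1];
    }
    return docs;
}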

How to run batch transactions in Neo4j

We are working on a recommendation engine use case at the production level.
We had a use case to fetch products sorted by geography (nearest first), so we thought we'd use spatial functions like distance and point to sort them.
For that, we need latitude and longitude properties on the Product nodes, which we don't have yet. We do have a postal address, so we figured we'd call apoc.spatial.geocodeOnce to fetch and set the latitude and longitude properties on all the Product nodes.
The problem is that this is an expensive operation: we have around 5000 Product nodes, and it takes an average of 1000 ms to update each node; by that calculation, on a single core, it'd take around 90 minutes to update all the nodes. We are really curious whether there's a smarter way to handle the transaction in chunks (say, updating 500 products in one tx, then the next 500, and so on). We thought apoc.periodic.iterate was the way to go, but we are looking for suggestions on how to solve this problem efficiently.
P.S. When we tried a few apoc.spatial.geocodeOnce calls on a couple of products with the postal addresses we have in our db, some calls returned no result. What could be the possible reasons for this? (Maybe we don't have standardized postal addresses for those products? If so, how shall we address the problem: shall we use the Google geocoding API for those products, or are there smarter ways built into Neo4j?)
Our query below for reference:
CALL apoc.periodic.iterate(
  "MATCH (p:Product) RETURN p",
  "CALL apoc.spatial.geocodeOnce(p.postal_address) YIELD location
   SET p.latitude = location.latitude, p.longitude = location.longitude",
  {batchSize:500, iterateList:true, parallel:true}
)
It's due to throttling of the apoc.spatial.geocodeOnce API calls: there's a rate limiter on the number of requests you can make to the geocoding server.
You can get a Google geocoding API key and configure it in the apoc.conf file like this:
apoc.spatial.geocode.provider=google
apoc.spatial.geocode.google.throttle=1
apoc.spatial.geocode.google.key={YOUR_API_KEY}
Note: the throttle value here is in milliseconds.

DocumentDB: How to run a query without timing out

I am new to DocumentDB. I wrote a stored procedure that checks all records and updates them under certain circumstances.
Current scenario:
It runs 100 records at a time and updates them; after running a few times (taking 100 records at a time and updating them), it times out.
Expectation
Run the script on all the records without timing out.
The collection has close to a million records, so running the same script multiple times manually is not the approach I am looking for.
Can anyone please advise me how I can achieve that?
tl;dr; Keep calling the sproc with the query continuation token being passed back and forth.
A few thoughts:
There is no RU capacity for a collection that will allow you to process all million documents in one call to the sproc.
Sprocs run in isolation on a single replica. This means they can be transactional, but using them gives lower throughput than a regular query, which can use all replicas to satisfy the request. So unless you need it to be in a sproc, I recommend using direct queries for reads that don't need to be transactional with writes. Even then, with a million documents, your query will max out and you'll have to run it again with a continuation token.
If you must use a sproc... As you are probably aware, since you have done the 100-at-a-time thing, each query returns a continuation token. You can add that token to the package that you send back from your sproc when it times out. Then you can pass it back into another call to the same sproc, writing your sproc to pick up where it left off. The documentdb-utils library for Node.js automatically re-calls the sproc until done, as long as you follow this pattern when writing your sprocs. If you are using Node.js, you could use that library (although it has not yet been upgraded to support partitioned collections), or you could write the equivalent on whatever platform you are using.
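Here is a minimal sketch of that pattern as a server-side sproc (the response-body shape, with an updated count and a continuation field, is my own convention, and the update itself is a stand-in for your real logic):
function bulkUpdate(continuationToken) {
    var collection = getContext().getCollection();
    var response = getContext().getResponse();
    var updated = 0;

    query(continuationToken);

    function query(token) {
        var accepted = collection.queryDocuments(
            collection.getSelfLink(),
            "SELECT * FROM c",
            { pageSize: 100, continuation: token },
            function (err, docs, options) {
                if (err) throw err;
                update(docs, options.continuation);
            });
        // Out of time/RUs: hand the token back so the caller can re-invoke us.
        if (!accepted) response.setBody({ updated: updated, continuation: token });
    }

    function update(docs, nextToken) {
        if (docs.length === 0) {
            if (nextToken) query(nextToken);
            else response.setBody({ updated: updated, continuation: null });
            return;
        }
        var doc = docs.shift();
        doc.processed = true; // stand-in for your "certain circumstances" update
        var accepted = collection.replaceDocument(doc._self, doc, function (err) {
            if (err) throw err;
            updated++;
            update(docs, nextToken);
        });
        if (!accepted) response.setBody({ updated: updated, continuation: nextToken });
    }
}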

How fast is the Azure Search indexer, and how can I index faster?

Each index batch is limited to between 1 and 1000 documents. When I call it from my local machine or an Azure VM, I get 800 ms to 3000 ms per 1000-doc batch. If I submit multiple batches asynchronously, the time spent is roughly the same. That means it would take 15-20 hours for my ~50M document collection.
Is there a way I can make it faster?
It looks like you are using our Standard S1 search service. Although there are a lot of things that can impact how fast data can be ingested, I would expect ingestion to a single-partition search service at a rate of about 700 docs/second for an average index, so I think your numbers are not far off from what I would expect. Please note that these are purely rough estimates, and you may see different results based on any number of factors (such as the number of fields, quantity of facets, etc.).
It is possible that some of the extra time you are seeing is due to the latency of uploading the content from your local machine to Azure. It would likely be faster if you did this directly from Azure, but if this is just a one-time upload it is probably not worth the effort.
You can slightly increase the speed of data ingestion by increasing the number of partitions you have, and the S2 search service will also ingest data faster, although both of these come at a cost.
By the way, if you have 50M documents, please make sure that you allocate enough partitions: a single S1 partition can handle 15M documents or 25GB, so you will definitely need extra partitions for this service.
As another side note, when you are uploading your content (especially if you choose to do parallelized uploads), keep an eye on the HTTP responses. If the search service exceeds the resources available, you could get HTTP 207 (indicating one or more items failed to apply) or 503s indicating the whole batch failed due to throttling. If throttling occurs, you should back off a bit to let the service catch up.
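For illustration, here is a rough Node.js sketch of that batching-plus-backoff advice using the @azure/search-documents SDK (the endpoint, index name, key, and the assumption that the index key field is named id are all placeholders):
const { SearchClient, AzureKeyCredential } = require("@azure/search-documents");

const client = new SearchClient(
    "https://<your-service>.search.windows.net", // placeholder endpoint
    "my-index",                                  // placeholder index name
    new AzureKeyCredential("<admin-key>"));      // placeholder admin key

async function uploadAll(allDocs, batchSize = 1000) {
    for (let i = 0; i < allDocs.length; i += batchSize) {
        let batch = allDocs.slice(i, i + batchSize);
        let delay = 1000;
        while (batch.length > 0) {
            try {
                const result = await client.uploadDocuments(batch);
                // 207 case: keep only the documents that failed and retry them.
                const failed = new Set(
                    result.results.filter(r => !r.succeeded).map(r => r.key));
                batch = batch.filter(d => failed.has(d.id));
            } catch (e) {
                if (e.statusCode !== 503) throw e; // 503: whole batch throttled
            }
            if (batch.length > 0) {
                await new Promise(res => setTimeout(res, delay));
                delay = Math.min(delay * 2, 30000); // exponential backoff
            }
        }
    }
}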
I think you're reaching the request capacity:
https://azure.microsoft.com/en-us/documentation/articles/search-limits-quotas-capacity/
I would try another tier (S1, S2). If you still face the same problem, try getting in touch with the support team.
Another option:
Instead of pushing data, try adding your data to Blob storage, DocumentDB, or SQL Database, and then use the pull approach:
https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/

Does an Azure Search document lookup count as a query?

If I look up a document to bring data from Azure Search, does it count against the queries per second per index indicated here?
I want to know if I can use Azure Search to host some data and access it without affecting search performance.
thanks
Yes, a lookup is considered a query. Please note that we do not throttle your queries, and the number listed on the page you point to is only meant as a very rough indication of what a single search unit with an "average" index and an "average" set of queries could handle. In some cases (for example, if you were just doing lookups, which are very simple queries), you might well get more than 15 QPS with a very good latency rate. In other cases (for example, if you have queries with a huge number of facets), you might get less. Also note that although we do not throttle you, it is possible to exceed the resources of the units allocated to you, at which point you will start to receive throttling HTTP responses.
In general, the best thing to do is track the latency of your queries. If you start seeing latency go higher than what you find acceptable, that is typically a good time to consider adding another replica.
Ultimately, the only way to know for sure is to test your specific index with the types of queries and load you expect.
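As a small illustration of tracking lookup latency from Node.js with the @azure/search-documents SDK (endpoint, index name, and key are placeholders):
const { SearchClient, AzureKeyCredential } = require("@azure/search-documents");

const client = new SearchClient(
    "https://<your-service>.search.windows.net", // placeholder endpoint
    "my-index",                                  // placeholder index name
    new AzureKeyCredential("<query-key>"));      // placeholder query key

async function timedLookup(key) {
    const start = Date.now();
    const doc = await client.getDocument(key);   // a lookup counts as one query
    console.log(`lookup took ${Date.now() - start} ms`);
    return doc;
}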
I hope that helps.
Liam
