Azure Cosmos DB Python SDK : Query items from change feed using checkpoints?

I'm new to Cosmos DB, so please shed some light.
@Matias Quaranta - Thank you for the samples.
From the official samples it seems like the Change feed can be queried either from the beginning or from a specific point in time.
options["startFromBeginning"] = True
or
options["startTime"] = time
What other options does the QueryItemsChangeFeed method support?
Does it support querying from a particular check point within a partition?

Glad the samples are useful. Strictly speaking, the concept of "checkpoints" does not exist in the Change Feed. A "checkpoint" is simply you storing the last processed batch or continuation after every execution, in case your process halts.
When the process starts again, you can take your stored continuation and use it.
This is what the Change Feed Processor Library and our Azure Cosmos DB Trigger for Azure Functions do for you internally.
To pass the continuation in Python, set options['continuation']. You should be able to read the new value from the 'x-ms-continuation' response header.
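
A minimal sketch of that store-and-resume pattern. It assumes a hypothetical `fetch_page(continuation)` callable that wraps the SDK call (passing the value as options['continuation']) and returns the documents together with the 'x-ms-continuation' header value. The file layout and helper names are illustrative and not part of the SDK:

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the stored continuation token, or None if nothing was saved yet."""
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f).get("continuation")


def save_checkpoint(path, continuation):
    """Write the token to a temp file, then swap it in, so a crash cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"continuation": continuation}, f)
    os.replace(tmp_path, path)


def process_change_feed(fetch_page, handle, checkpoint_path):
    """Read one page from the change feed and checkpoint after handling it.

    fetch_page is a hypothetical wrapper around the SDK call: it takes the
    previous continuation (or None) and returns (documents, new_continuation).
    """
    continuation = load_checkpoint(checkpoint_path)
    documents, new_continuation = fetch_page(continuation)
    for doc in documents:
        handle(doc)
    # Checkpoint only after the batch was handled, so a failure mid-batch
    # replays the batch instead of skipping it.
    if new_continuation is not None:
        save_checkpoint(checkpoint_path, new_continuation)
    return len(documents)
```

Because the token is saved only after the batch is handled, a restart may deliver the last batch again, so the handler should tolerate duplicates.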

Following the ReadFeedForTime sample code, I tried options["startTime"], but it doesn't work: the response is the same list of documents as when starting from the beginning.

Related

Azure CosmosDB SQL Record counts

I have a CosmosDB Collection which I'm querying using the REST API.
I'd like to access the total number of documents which match my query. I know I can do a count, but that means two calls, one for the count and a subsequent one to retrieve the actual records.
I would assume this is not possible in a single call, BUT.. the Data Explorer in Azure Portal seems to manage it, so just wondering if anyone has been able to figure out what calls it makes, to get this:
Showing Results 1 - 10
Retrieved document count 342
Retrieved document size 2868425 bytes
Output document count 10
It's the Retrieved Document Count I need - if the portal can do it, there ought to be a way :)
I've tried the Java SDK as well as REST, but can't see any useful options there either.

As so often is the case in this game, asking a question triggers the answer... so apologies in advance.
The answer is to send the x-ms-documentdb-populatequerymetrics header in the request.
The response then gives a whole bunch of useful stuff in x-ms-documentdb-query-metrics.
What I would still like to understand is whether this has any performance impact.
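
Once the header comes back, pulling the counts out of it might look like the sketch below. It assumes the x-ms-documentdb-query-metrics value is a semicolon-separated list of key=value pairs; the sample string is illustrative, with the counts copied from the portal output above, and `parse_query_metrics` is a hypothetical helper name:

```python
def parse_query_metrics(header_value):
    """Split a 'key=value;key=value' metrics string into a dict of numbers."""
    metrics = {}
    for pair in header_value.split(";"):
        if not pair.strip():
            continue
        key, _, raw = pair.partition("=")
        try:
            value = float(raw)
        except ValueError:
            continue  # skip anything that is not numeric
        metrics[key.strip()] = int(value) if value.is_integer() else value
    return metrics


# Illustrative header value, not a captured response.
sample = "retrievedDocumentCount=342;retrievedDocumentSize=2868425;outputDocumentCount=10"
metrics = parse_query_metrics(sample)
print(metrics["retrievedDocumentCount"])  # prints 342
```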

Azure function slow executing a stored procedure

I'm using an Azure function like a scheduled job, using the cron timer. At a specific time each morning it calls a stored procedure.
The function is now taking 4 mins to run a stored procedure that takes a few seconds to run in SSMS. This time is increasing despite efforts to successfully improve the speed of the stored procedure.
The function is not doing anything intensive.
using (SqlConnection conn = new SqlConnection(str))
{
    conn.Open();
    using (var cmd = new SqlCommand("Stored Proc Here", conn) { CommandType = CommandType.StoredProcedure, CommandTimeout = 600 })
    {
        // T-SQL parameter names start with '@'
        cmd.Parameters.Add("@Param1", SqlDbType.DateTime2).Value = DateTime.Today.AddDays(-30);
        cmd.Parameters.Add("@Param2", SqlDbType.DateTime2).Value = DateTime.Today;
        var result = cmd.ExecuteNonQuery();
    }
}
I've checked and the database is not under load with another process when the stored procedure is running.
Is there anything I can do to speed up the Azure function? Or any approaches to finding out why it's so slow?
UPDATE.
I don't believe Azure Functions is at fault; the issue seems to be with SQL Server.
I eventually ran the production SP and had a look at the execution plan. I noticed that the statistics were way off: for example, a join expected 20 returned rows, but the actual figure was closer to 800k.
The solution for my issue was to update the statistics on a specific table each week.
As for why the stats were so far off: the client does a batch update each night and inserts several hundred thousand rows. I can only assume this affected the stats, and that the effect is cumulative, so it gets worse with time.

Please be careful adding with recompile hints. Often compilation is far more expensive than execution for a given simple query, meaning that you may not get decent perf for all apps with this approach.
There are different possible reasons for your experience. One common reason for this kind of scenario is that you got different query plans in the app vs. SSMS paths. This can happen for various reasons (I will summarize below). You can determine whether you are getting different plans by using the Query Store (which records summary data about queries, plans, and runtime stats). Please review a summary of it here:
https://learn.microsoft.com/en-us/sql/relational-databases/performance/monitoring-performance-by-using-the-query-store?view=sql-server-2017
You need a recent SSMS to get the UI, though you can use direct queries from any TDS client.
Now for a summary of some possible reasons:
One possible reason for plan differences is SET options. These are environment settings for a query, such as turning ANSI nulls on or off. Each different setting could change the plan choice and thus perf. Unfortunately, the defaults for different language drivers differ (historical artifacts from when each was built, and hard to change now without breaking apps). You can review the Query Store to see if there are different "context settings" (each unique combination of SET options is a unique context setting in the Query Store). Each different set implies different possible plans and thus potential perf changes.
The second major reason for plan changes like the ones you describe in your post is parameter sniffing. Depending on the scope of compilation (example: inside a sproc vs. as ad hoc query text), SQL Server will sometimes look at the current parameter value during compilation to infer the frequency of the common value in future executions. Instead of ignoring the value and just using a default frequency, using a specific value can generate a plan that is optimal for a single value (or set of values) but potentially slower for values outside that set. You can see this in the query plan choice in the Query Store as well, by the way.
There are other possible reasons for performance differences beyond what I mentioned. Sometimes there are perf differences when running in MARS mode vs. not in the client. There may also be differences in how you call the client drivers that impact perf beyond this.
I hope this gives you a few tools to debug possible reasons for the difference. Good luck!

For a project I worked on, we ran into the same thing. It's not a function issue but a SQL Server issue. In our case we were updating sprocs during development, and it turns out that, per execution plan, SQL Server will cache certain routes/indexes (layman's explanation), and that cache gets out of sync with the new sproc.
We resolved it by specifying WITH (RECOMPILE) at the end of the sproc and the API call and SSMS had the same timings.
Once the system is settled, that statement can and should be removed.
Search for "slow sproc, fast SSMS" and similar phrases to find others who have run into this situation.

Elasticsearch how to check for a status of a bulk indexing request?

I am bulk indexing into Elasticsearch docs containing country shapes (files here), based on the cshapes dataset.
The geoshapes have a lot of points in "geometry":{"type":"MultiPolygon", and the bulk request takes a long time to complete (and sometimes does not complete, which is a separate and already reported problem).
Since the client times out (I use the official ES Node.js client), I would like to have a way to check the status of the bulk request without having to use enormous timeout values.
What I would like is a status such as active/running, completed or aborted. I guess that just querying a single doc in the batch would not tell me whether the request has been aborted.
Is this possible?

I'm not sure if this is exactly what you're looking for, but may be helpful. Whenever I'm curious about what my cluster is doing, I check out the tasks API.
The tasks API shows you all of the tasks that are currently running on your cluster. It will give you information about individual tasks, such as the task ID, start time, and running time. Here's the command:
curl -XGET 'http://localhost:9200/_tasks?group_by=parents' | python -m json.tool

Elasticsearch doesn't provide a way to check the status of an ongoing Bulk request (documentation reference here).
First, check that your request succeeds with a smaller input, so you know there is no problem with the way you are making the request. Second, try dividing the data into smaller chunks and calling the Bulk API on them in parallel.
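
As a rough sketch of the chunking step, the helpers below split a list of documents into smaller batches and build a newline-delimited Bulk API body for each. The helper names, the chunk size and the index name are illustrative assumptions. Older Elasticsearch versions also expect a `_type` in the action line, which is omitted here, and sending each body (in parallel or not) is left to whichever client you use:

```python
import json


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_body(docs, index_name):
    """Build a newline-delimited Bulk API body: one action line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the Bulk API requires a trailing newline


# Hypothetical usage: 5 small documents in batches of 2.
docs = [{"name": "country-%d" % i} for i in range(5)]
bodies = [bulk_body(batch, "countries") for batch in chunked(docs, 2)]
```

Smaller bodies also make it easier to see which batch failed or timed out, since each request covers only a few documents.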
You can also try with a higher request_timeout value, but I guess that is something you don't want to do.

Just a side note on why your requests might take a long time (unless you are simply indexing too many documents in a single bulk run): if you have configured your own precision for geo shapes, also make sure you configure distance_error_pct. Otherwise no error is assumed, resulting in documents with a lot of terms that take a long time to index.

Using GTFS data, how should I extend it with GTFS-realtime?

I am building an application using GTFS data. I am a bit confused when it comes to GTFS-realtime.
I have stored all the GTFS information in a database (Mongo), and I am able to retrieve the stop times of a specific bus stop.
So now I want to integrate GTFS-realtime information into it. What is the best way to deal with the information retrieved? I am using gtfs-realtime-bindings (the Node.js library) by Google.
I have the following idea:
Store the realtime-GTFS information in a separate database and query it after getting the stoptime from GTFS. And I can update the database periodically to make sure the real time info is up to date.
Also, I know the retrieved data is in .proto binary format. Should I store it as ASCII, or is there a better way to deal with it?
I couldn't find much information about how to deal with the realtime data, so I hope someone can point me in the right direction.
Thanks!

In your case GTFS-Realtime can be used as "ephemeral" data, and I would go with an object in memory, with the stop_id/route_id as keys.
For every request:
Check whether the realtime object contains the ID. If it does, present the realtime data. Otherwise, load from the database.
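
A minimal sketch of that lookup, written in Python for brevity. The class name, the `load_scheduled` callable and the shape of the update data are all hypothetical; the callable stands in for whatever query reads the static GTFS stop times from the database:

```python
class RealtimeCache:
    """In-memory realtime updates keyed by (stop_id, route_id)."""

    def __init__(self, load_scheduled):
        # load_scheduled is a hypothetical callable that reads the static
        # GTFS stop times from the database for a (stop_id, route_id) pair.
        self._load_scheduled = load_scheduled
        self._realtime = {}

    def replace_feed(self, updates):
        """Swap in the latest feed; updates maps (stop_id, route_id) to data."""
        self._realtime = dict(updates)

    def lookup(self, stop_id, route_id):
        """Return realtime data when present, else the scheduled data."""
        key = (stop_id, route_id)
        if key in self._realtime:
            return {"source": "realtime", "data": self._realtime[key]}
        return {"source": "scheduled", "data": self._load_scheduled(stop_id, route_id)}
```

Replacing the whole dictionary on every feed refresh means stale entries disappear on their own, which fits the "ephemeral" nature of the realtime data.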

Gremlin: SetProperty iteratively to existing graph database

I am trying to run JUNG's PageRank algorithm on my existing neo4j graph database and save each node's score as a property for future reference.
So I created the following groovy file:
import edu.uci.ics.jung.algorithms.scoring.PageRank
g = new Neo4jGraph('/path/to/graph.db')
j = new GraphJung(g)
pr = new PageRank<Vertex,Edge>(j, 0.15d)
pr.evaluate()
g.V.sideEffect{it.pagerank=pr.getVertexScore(it)}
and run it through gremlin.
It runs smoothly and if I were to check the property via g.v(2381).map() I get what I'd expect.
However, when I leave gremlin and start up my neo4j server, these modifications are nonexistent.
Can anyone explain why and how to fix this?
My hunch is that it has something to do with my graph in gremlin being embedded:
gremlin> g
==>neo4jgraph[EmbeddedGraphDatabase [/path/to/graph.db]]
Any ideas?

You will need a g.shutdown() at the end of your groovy script. Without a g.shutdown() all changes to the graph are most likely to stay in memory. Re-initializing the graph from disk (/path/to/graph.db in your case), will lose the changes which were still in memory. g.shutdown() will flush the current transaction from memory to disk. This will make sure your changes persist and will be retrieved when you try to access the database again.
Hope this helps.
Note: You are correct on the hunch for embedded database. This issue will not occur if you use Neo4j's REST interface because every REST API request is treated as a single transaction.
