Throttling EF queries to save DTUs - Azure

We have an ASP.NET application using EF 6 hosted in Azure. The database runs at about 20% DTU usage most of the time, except during certain rare actions.
These are essentially database dumps in Excel format, e.g. all orders of the last X years, which (power) users can trigger and then receive later by email.
The problem is that these queries use up all available DTUs and the whole application slows to a crawl. We would like to throttle these non-critical queries, as it doesn't matter if they take 10-15 minutes longer.
Googling turned up the option to reduce DEADLOCK_PRIORITY, but this won't fix the issue of using up all resources.
Thanks for any pointers, ideas or solutions.

Optimizing is going to be hard, as it is more or less a database dump.
Azure SQL Database doesn't have Resource Governor available, so you'll have to handle this in code.
Azure SQL Database runs in READ COMMITTED SNAPSHOT mode, so slowing down the session that dumps the data from a table (or any streaming query plan) should reduce its DTU consumption without adversely affecting other sessions.
To do this, put waits in the loop that reads the query results: either an IEnumerable<TEntity> returned from a LINQ query, or a SqlDataReader returned from an ADO.NET SqlCommand.
But you'll have to loop directly over the streaming results. You can't copy the query results into memory first using IQueryable<TEntity>.ToList(), DataTable.Load(), SqlDataAdapter.Fill(), etc., as those would read as fast as possible.
e.g.
var results = new List<TEntity>();
int rc = 0;
using (var dr = cmd.ExecuteReader())
{
    while (dr.Read())
    {
        rc++;
        var e = new TEntity();
        e.Id = dr.GetInt32(0);
        e.Name = dr.GetString(1);
        // ...
        results.Add(e);
        // pause every 100 rows so the dump doesn't consume all DTUs
        if (rc % 100 == 0)
            Thread.Sleep(100);
    }
}
or
var results = new List<TEntity>();
int rc = 0;
foreach (var e in db.MyTable.AsEnumerable())
{
    rc++;
    results.Add(e);
    // pause every 100 rows so the dump doesn't consume all DTUs
    if (rc % 100 == 0)
        Thread.Sleep(100);
}
For extra credit, use async waits and stream the results directly to the client without batching in memory.
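As a rough illustration of that async variant, here is a minimal sketch (not code from the answer itself), assuming an ADO.NET SqlCommand cmd and a hypothetical WriteRowAsync helper that streams each row out to the client (e.g. an Excel writer over the HTTP response):
int rc = 0;
using (var dr = await cmd.ExecuteReaderAsync())
{
    while (await dr.ReadAsync())
    {
        rc++;
        // write the row out immediately instead of buffering it in a list
        // (WriteRowAsync is a placeholder for your own streaming/export code)
        await WriteRowAsync(dr.GetInt32(0), dr.GetString(1) /* ... */);
        // yield every 100 rows so the dump doesn't monopolize DTUs
        if (rc % 100 == 0)
            await Task.Delay(100);
    }
}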
Alternatively, or in addition, you can limit the number of sessions that can concurrently perform the dump to one, or one per table, etc., using named Application Locks.
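For illustration, a minimal sketch of acquiring a named application lock with sp_getapplock before running the dump; it assumes an already-open SqlConnection named conn, and the lock name and the 10-minute timeout are arbitrary illustrative choices, not values from the answer:
using (var lockCmd = conn.CreateCommand())
{
    // try to acquire a session-scoped app lock named 'OrderDump' so only one dump runs at a time;
    // sp_getapplock waits up to @LockTimeout ms and returns a negative value on failure
    lockCmd.CommandText = @"
        DECLARE @result int;
        EXEC @result = sp_getapplock @Resource = 'OrderDump',
                                     @LockMode = 'Exclusive',
                                     @LockOwner = 'Session',
                                     @LockTimeout = 600000;
        SELECT @result;";
    var result = (int)lockCmd.ExecuteScalar();
    if (result < 0)
        throw new TimeoutException("Another dump is already running.");
}
// run the throttled dump on the same connection, then release the lock with
// EXEC sp_releaseapplock @Resource = 'OrderDump', @LockOwner = 'Session';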

Related

Some trivial transactions take dozens of seconds to complete on Spanner microinstance

Here are some bits of context.
Node.js server, connecting to Cloud Spanner from a development machine.
Most of the time the queries take about 200-400 ms, including data transfer from the servers' location to my dev machine.
But sometimes these trivial transactions take 12-16 seconds, which is surely not acceptable for the use case: session storage for a backend server.
In the local dev context the sessions service runs on the same machine as the main backend; at staging and prod they run in the same Kubernetes cluster.
This is not about the amount of data; there is very little data in our staging Spanner database overall, a few MB across all tables and only about 10 rows in the table in question.
Spanner instance stats:
Processing units: 100
CPU utilization: 4.3% for the staging database and 10% overall for instance.
The table looks like this (a few other small fields omitted):
CREATE TABLE sessions
(
    id STRING(255) NOT NULL,
    created TIMESTAMP,
    updated TIMESTAMP,
    status STRING(16),
    is_local BOOL,
    user_id STRING(255),
    anonymous BOOL,
    expires_at TIMESTAMP,
    last_activity_at TIMESTAMP,
    json_data STRING(MAX),
) PRIMARY KEY(id);
The transaction in question runs a single query like this:
UPDATE ${schema.reportsTable}
SET ${statusCol.columnName} = @status_recycled
WHERE ${idCol.columnName} = @id_value
  AND ${statusCol.columnName} = @status_active
with parameters like this:
{
    "id_value": "some_session_id",
    "status_active": "active",
    "status_recycled": "recycled"
}
Yes, that status field of STRING(16) with readable names instead of a boolean field is not ideal, I know, but this concept is inherited from older code. What concerns me is that while we do not yet have much data there, just 10 rows or so, experiencing this sort of delay is surely unexpected at this scale.
Okay, I understand I am on the other side of the globe from the Spanner servers, but this usually gives delays of 200-1200 ms, not 12-16 seconds.
The delay happens quite rarely and randomly, but it seems to happen on queries like this.
The delay comes at commit, not at, e.g., sending the SQL command itself or obtaining a transaction.
I tried a different query first, like
DELETE FROM Sessions WHERE id = @id_value
and it was the same: a rare, random 12-16 second delay for such a trivial query.
Thanks a lot for your help and time.
PS: Update: actually this 12-16 second delay can happen on any random transaction in the described context, and all of these transactions are standard single-row CRUD operations.
Update 2:
The code that sends the transaction is our own wrapper over the standard @google-cloud/spanner client library for Node.js.
The library just gives an easy-to-use wrapper around the Spanner instance, database, and transaction.
The Spanner instance and database objects are long-lived singletons; they are not recreated from scratch for every transaction.
The main purpose of that wrapper is to give logic like:
let result = await useDataContext(async (ctx) => {
    let sql = await ctx.getSQLRunner();
    return await sql.runSQLUpdate({
        sql: `Some SQL Trivial Statement`,
        parameters: {
            param1: 1,
            param2: true,
            param3: "some string"
        }
    });
});
The purpose of that is to give some guarantees: if changes were made to the data, transaction.commit will surely be called; if no changes were made, transaction.end will be called; and if an error is thrown in the called code (e.g. invalid SQL was generated, or some variable is undefined or null), a transaction rollback will be initiated.

What happens when you do a ToList() on a gigantic table storage table?

I had some really old code somewhere in my application that I accidentally triggered:
var json = table.CreateQuery<ActionLog>().ToList().ToJson();
another suspect is:
var action_log_list = await table.CreateQuery<ActionLog>()
.Where(log => log.StartTime > startTime)
.AsTableQuery()
.[...]
The problem is that this table is gigantic - probably hundreds of millions of entities.
Around the same time I hit this code, it took out one instance of my application, and that instance didn't come back for more than one hour, even after restarts.
Now, I was actually investigating some mild performance problems, so I'm wondering: was this a coincidence, or could the code above bring down a Table Storage table, like a 'really long-running query' that afterwards blocks e.g. inserts or reads on that table?
Based on my knowledge, we could use ExecuteSegmentedAsync to improve the performance. The following is demo code:
var query = table.CreateQuery<ActionLog>().AsTableQuery();
TableContinuationToken continuationToken = null;
do
{
    // execute the query asynchronously, one segment at a time, until there are no more results
    var queryResult = await query.ExecuteSegmentedAsync(continuationToken);
    // process the current segment (queryResult.Results) here
    continuationToken = queryResult.ContinuationToken;
} while (continuationToken != null);
As it is a gigantic table, it may still take a long time. I haven't tested it on my side.
But based on my experience, if we want to deal with such a huge number of records, I recommend using Azure Data Factory to do that.

Performance bottlenecks when using async-await with Azure Storage API

I'm hitting a performance bottleneck on insertion requests using the Azure Table Storage API. I'm trying to reach a speed of at least 1 insert per 30 ms into a table (unique partition keys).
What is the recommended way to achieve this request rate and how can I fix my program to overcome my bottleneck?
I have a test program that inserts into the Azure table at roughly 1 insert / 30 ms. With this test program, the latency continuously increases and requests begin to take more than 15 seconds per insert.
Below is the code for my test program. It creates async tasks that log the time it takes to await on the CloudTable ExecuteAsync method. Unfortunately, the insertion latency just grows as the program runs.
List<Task> tasks = new List<Task>();
while (true)
{
    Thread.Sleep(30);
    tasks = tasks.Where(t => t.IsCompleted == false).ToList(); // Remove completed tasks
    DynamicTableEntity dte = new DynamicTableEntity() { PartitionKey = Guid.NewGuid().ToString(), RowKey = "abcd" };
    tasks.Add(AddEntityToTableAsync(dte));
}
...
public static async Task<int> AddEntityToTableAsync<T>(T entity) where T : class, ITableEntity
{
    Stopwatch timer = Stopwatch.StartNew();
    var tableResult = await cloudTable.ExecuteAsync(TableOperation.InsertOrReplace(entity));
    timer.Stop();
    Console.WriteLine($"Table Insert Time: {timer.ElapsedMilliseconds}, Inserted {entity.PartitionKey}");
    return tableResult.HttpStatusCode;
}
I thought that it might be my test program running out of threads for the outgoing Network IO, so I tried monitoring the available thread counts during the program's execution:
ThreadPool.GetAvailableThreads(out workerThreads, out completionIoPortThreads);
It showed that nearly all the IO threads were available during execution (just in case, I even tried increasing the available threads, but that had no effect on the issue).
As I understand it, for async tasks, the completion port threads don't get "reserved" until there's data on them to process, so I started thinking that there might be an issue with my connection to Azure Table Storage.
However, I confirmed that was not the case by lowering the request rate (1 insert / 100ms) and launching 30 instances of my test program on the same machine. With 30 instances, I was able to maintain a stable ~90ms / insert without any increase in latency.
What can I do to enable a single test program to achieve performance similar to what I was getting when running 30 programs on the same machine?
The test program was hitting the System.Net.ServicePointManager.DefaultConnectionLimit limit. The default value is 2.
Increasing the number to 100 fixes the problem and allows the single program to achieve the same speed as the 30-program scenario.
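For example, the fix is a single line; the value 100 is from the answer above, while placing it at application startup (before the first storage request) is an assumption:
// raise the outbound connection limit once at startup, before any Table Storage calls
System.Net.ServicePointManager.DefaultConnectionLimit = 100;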

ArangoDB truncate fails on a large collection

I get a timeout in arangosh and the ArangoDB service becomes unresponsive if I try to truncate a large collection of ~40 million docs. Message:
arangosh [database_xxx]> db.[collection_yyy].truncate() ; JavaScript exception in file '/usr/share/arangodb/js/client/modules/org/arangodb/arangosh.js' at 104,13: [ArangoError 2001: Error reading from: 'tcp://127.0.0.1:8529' 'timeout during read'] !
throw new ArangoError(requestResult); ! ^ stacktrace: Error
at Object.exports.checkRequestResult (/usr/share/arangodb/js/client/modules/org/arangodb/arangosh.js:104:13)
at ArangoCollection.truncate (/usr/share/arangodb/js/client/modules/org/arangodb/arango-collection.js:468:12)
at <shell command>:1:11
ArangoDB 2.6.9 on Debian Jessie, AWS ec2 m4.xlarge, 16G RAM, SSD.
The service becomes unresponsive. I suspect it got stuck (not just busy), because it doesn't work until I stop the service, delete the database in /var/lib/arangodb/databases/, and start it again.
I know I may be pushing the limits of performance due to the size, but I would guess that the intention is not to fail regardless of size.
However, on a non-cloud Windows 10 machine (16 GB RAM, SSD) the same action succeeded, after a while.
Is it a bug? I have some Python code that loads dummy data into a collection, if that helps. Please let me know if I should provide more info.
Would it help to fiddle with --server.request-timeout?
Increasing --server.request-timeout for the ArangoShell will only increase the timeout that the shell will use before it closes an idle connection.
The arangod server will also shut down lingering keep-alive connections, and that may happen earlier. This is controlled via the server's --server.keep-alive-timeout setting.
However, increasing both won't help much. The actual problem seems to be the truncate() operation itself. And yes, it may be very expensive. truncate() is a transactional operation, so it will write a deletion marker for each document it removes into the server's write-ahead log. It will also buffer each deletion in memory so the operation can be rolled back if it fails.
A much less intrusive operation than truncate() is to instead drop the collection and re-create it. This should be very fast.
However, indexes and special settings of the collection need to be recreated / restored manually if they existed before dropping it.
It can be achieved like this (the function below handles both document and edge collections):
function dropAndRecreateCollection (collectionName) {
    // save state
    var c = db._collection(collectionName);
    var properties = c.properties();
    var type = c.type();
    var indexes = c.getIndexes();
    // drop existing collection
    db._drop(collectionName);
    // restore collection
    var i;
    if (type == 2) {
        // document collection
        c = db._create(collectionName, properties);
        i = 1;
    }
    else {
        // edge collection
        c = db._createEdgeCollection(collectionName, properties);
        i = 2;
    }
    // restore indexes
    for (; i < indexes.length; ++i) {
        c.ensureIndex(indexes[i]);
    }
}

Azure Table Storage slow to update records

I have an Azure Table which stores thousands of discount codes, partitioned by the first letter of the code, so there are roughly 30 partitions with 1000 records each. In my application I enter a code and get the specific record from the table. I then update the discount code to say that it has been used. When load testing this application with 1000 concurrent users for 30 seconds, the response times for reading the codes are under 1 second, but updating the record takes over 10 seconds. Is this typical behavior for Table Storage, or is there a way to speed this up?
//update discount code
string code = "A0099";
CloudStorageAccount storageAccount = CloudStorageAccount.Parse("constring...");
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
CloudTable table = tableClient.GetTableReference("discounts");
string partitionKey = code[0].ToString().ToUpper();
TableOperation retrieveOperation = TableOperation.Retrieve<DiscountEntity>(partitionKey, code);
TableResult retrievedResult = table.Execute(retrieveOperation);
if (retrievedResult.Result != null)
{
    DiscountEntity discount = (DiscountEntity)retrievedResult.Result;
    discount.Used = true;
    TableOperation updateOperation = TableOperation.Replace(discount);
    table.Execute(updateOperation);
}
This is not the default behavior, but I've seen it before... First of all, check your VM size, because the bigger the VM size, the faster the I/O (there's an MS doc somewhere that says that fat VMs have "fast I/O" or something like that...), but 10 seconds is a lot even for an extra-small VM...
To speed things up, I would suggest that you:
Implement a cache! Instead of searching for one code at a time, fetch the whole "letter" partition of unused codes at once, cache it, and then search the cache for the record to update (see the sketch after this list).
Don't update live; instead, update the cache and then use the async methods to save things back.
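A minimal sketch of that caching idea, using the same DiscountEntity and WindowsAzure.Storage types as the question; the Used filter, the in-memory dictionary, and running this inside an async method are illustrative assumptions, not code from the answer:
// fetch every unused code for one "letter" partition in a single query
string partitionKey = "A";
var query = new TableQuery<DiscountEntity>().Where(
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, partitionKey),
        TableOperators.And,
        TableQuery.GenerateFilterConditionForBool("Used", QueryComparisons.Equal, false)));

// cache the partition keyed by code (the RowKey) for fast lookups
var cache = table.ExecuteQuery(query).ToDictionary(d => d.RowKey);

// later: mark the code as used in the cache, then persist it with the async API
if (cache.TryGetValue("A0099", out DiscountEntity discount))
{
    discount.Used = true;
    await table.ExecuteAsync(TableOperation.Replace(discount));
}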
One thing you can check is the E2E time for a specific request vs. how much time the server spent processing the request. That would allow you to see whether the bottleneck is the client/network or the server.
For more information on enabling Windows Azure Storage Analytics (specifically Logging), please refer to the How To Monitor a Storage Account and Storage Analytics articles.
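If it helps, here is a hedged sketch of turning on Table service logging with the same classic SDK used in the answers above (the retention period is an arbitrary choice here, and tableClient is the CloudTableClient from the earlier snippet):
// enable Storage Analytics logging for the Table service
var serviceProperties = tableClient.GetServiceProperties();
serviceProperties.Logging.LoggingOperations = LoggingOperations.All;
serviceProperties.Logging.RetentionDays = 7;
serviceProperties.Logging.Version = "1.0";
tableClient.SetServiceProperties(serviceProperties);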
