Azure Table Storage: Efficient way to query multiple PK-RK pairs

I am doing a query with this filter:
(PartitionKey eq 'A' or PartitionKey eq 'B' or ...) and RowKey eq 'RK'
I noticed that this kind of query with 20 to 100 PKs takes 3 to 5 seconds. The table does not hold that many items (roughly 1 million).
I think it is doing a partial table scan. I assumed it would issue several point queries instead, but that seems not to be the case.
My other option is to do independent parallel queries and then merge the results.
Is this a good option for 100 items? Will I run into problems with the network connections? (I increase the limit with ServicePointManager.DefaultConnectionLimit.)
Note: not every PK/RK pair will return a record.

My other option is do independent parallel queries and then merge the results.
It will save query time on Azure Storage, but it will spend more time on request and response round trips. I have a table with 160K entities. I wrote two code samples to compare the total time of querying the entities with one combined query versus multiple parallel queries.
Following are my code samples.
Query entities with one combined query:
int entitesCount = 20;
TableQuery<CustomerEntity> customerQuery = new TableQuery<CustomerEntity>();

// Build the filter: (PartitionKey eq '0' or ... or PartitionKey eq '19') and RowKey eq '42'
string filter = "(";
for (int i = 0; i < entitesCount; i++)
{
    filter = filter + "PartitionKey eq '" + i + "'";
    if (i < entitesCount - 1)
    {
        filter = filter + " or ";
    }
}
filter = filter + ") and RowKey eq '42'";
customerQuery.FilterString = filter;

Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
var customers = table.ExecuteQuery(customerQuery);
// Count() forces the lazy query to execute before the stopwatch is stopped.
Console.WriteLine(customers.Count().ToString());
stopWatch.Stop();
TimeSpan ts = stopWatch.Elapsed;
Console.WriteLine(ts.ToString());
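For reference, the same filter can also be composed with the TableQuery helper methods instead of raw string concatenation, which avoids quoting mistakes. This is just a sketch under the same assumptions as above (classic Microsoft.WindowsAzure.Storage SDK, a CustomerEntity table entity):

// Build (PartitionKey eq '0' or ... or PartitionKey eq '19') and RowKey eq '42'
// without hand-writing the OData string.
string pkFilter = null;
for (int i = 0; i < 20; i++)
{
    string condition = TableQuery.GenerateFilterCondition(
        "PartitionKey", QueryComparisons.Equal, i.ToString());
    pkFilter = pkFilter == null
        ? condition
        : TableQuery.CombineFilters(pkFilter, TableOperators.Or, condition);
}
string combined = TableQuery.CombineFilters(
    pkFilter,
    TableOperators.And,
    TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "42"));
var safeQuery = new TableQuery<CustomerEntity>().Where(combined);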
Query entities with multiple parallel queries:
ServicePointManager.DefaultConnectionLimit = 100;

Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();

int entitesCount = 20;
// Use a thread-safe collection, because the loop body runs on multiple threads.
ConcurrentBag<CustomerEntity> customers = new ConcurrentBag<CustomerEntity>();
Parallel.For(0, entitesCount, i =>
{
    TableQuery<CustomerEntity> customerQuery = new TableQuery<CustomerEntity>();
    customerQuery.FilterString = "PartitionKey eq '" + i.ToString() + "' and RowKey eq '88'";
    foreach (var c in table.ExecuteQuery(customerQuery))
    {
        customers.Add(c);
    }
});
// Parallel.For blocks until all iterations complete, so no busy-wait is needed.

Console.WriteLine(customers.Count.ToString());
stopWatch.Stop();
TimeSpan ts = stopWatch.Elapsed;
Console.WriteLine(ts.ToString());
I suggest you test both approaches on your side and determine which one fits your scenario better.
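Since every lookup here is an exact PartitionKey/RowKey pair, a third option worth measuring is to issue point retrieves in parallel with TableOperation.Retrieve, which turns each lookup into a single-entity GET. A minimal sketch, assuming the same table and CustomerEntity as above and an async calling context; pairs that do not exist simply come back with a null Result:

// One point retrieve per PK/RK pair; each retrieve is a single-entity GET.
var pairs = Enumerable.Range(0, 20).Select(i => (Pk: i.ToString(), Rk: "88"));
var tasks = pairs
    .Select(p => table.ExecuteAsync(TableOperation.Retrieve<CustomerEntity>(p.Pk, p.Rk)))
    .ToList();
TableResult[] results = await Task.WhenAll(tasks);
// Pairs with no matching entity return a null Result; OfType filters them out.
var found = results.Select(r => r.Result).OfType<CustomerEntity>().ToList();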

Related

AutoQuery/OrmLite incorrect total value when using joins

I have this AutoQuery implementation:
var q = AutoQuery.CreateQuery(request, base.Request).SelectDistinct();
var results = Db.Select<ProductDto>(q);
return new QueryResponse<ProductDto>
{
    Offset = q.Offset.GetValueOrDefault(0),
    Total = (int)Db.Count(q),
    Results = results
};
The request has some joins:
public class ProductSearchRequest : QueryDb<GardnerRecord, ProductDto>
, ILeftJoin<GardnerRecord, RecordToBicCode>, ILeftJoin<RecordToBicCode, GardnerBicCode>
{
}
The records get returned correctly, but the total is wrong. I can see 40,000 records in the database, but it tells me there are 90,000. There are multiple RecordToBicCode rows for each GardnerRecord, so it is giving me the number of records multiplied by the number of RecordToBicCode rows.
How do I match the total to the number of GardnerRecord rows matching the query?
I am using PostgreSQL, so I need the count statement to be like:
select count(distinct r.id) from gardner_record r etc...
Does OrmLite have a way to do this?
I tried:
var q2 = q;
q2.SelectExpression = "select count(distinct \"gardner_record\".\"id\")";
q2.OrderByExpression = null;
var count = Db.Select<int>(q2);
But I get an "object reference not set" error.
AutoQuery is returning the correct total count for your query, which has left joins and so will naturally return more results than the original source table.
You can perform a Distinct count with:
Total = Db.Scalar<long>(q.Select(x => Sql.CountDistinct(x.Id)));
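Wired into the original handler, that would look roughly like the following; note that the page of results is selected before the count changes the query's projection, since OrmLite's SqlExpression is mutated by Select (a sketch, not verified against your schema):

var q = AutoQuery.CreateQuery(request, base.Request).SelectDistinct();
// Materialize the results before Select() below replaces the projection.
var results = Db.Select<ProductDto>(q);
return new QueryResponse<ProductDto>
{
    Offset = q.Offset.GetValueOrDefault(0),
    // COUNT(DISTINCT id) over the joined row set.
    Total = (int)Db.Scalar<long>(q.Select(x => Sql.CountDistinct(x.Id))),
    Results = results
};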

Losing data on bulk inserts in Cassandra

I'm losing data on inserts in my Cassandra cluster.
I am doing large bulk inserts from CSV files, which I read via a Stream. The data is duplicated into two tables because of different query patterns. At every 30,000th element I split my data into a new partition (chunkCounter).
private PersistenceInformation persist(final String period, final String tradePartner, final Integer version, Stream<Transaction> transactions) {
    int elementsInChunkCounter = 0;
    int chunkCounter = 1;
    int elementCounter = 0;
    Iterator<Transaction> iterator = transactions.filter(beanValidator).iterator();
    List<List<?>> listImportData = new ArrayList<>(30000);
    List<List<?>> listGtins = new ArrayList<>(30000);
    while (iterator.hasNext()) {
        Transaction tr = iterator.next();
        List<Object> importTemp = new ArrayList<>(9);
        importTemp.add(period);
        importTemp.add(tradePartner);
        importTemp.add(version);
        importTemp.add(chunkCounter);
        importTemp.add(tr.getMdhId());
        importTemp.add(tr.getGtin());
        importTemp.add(tr.getQuantity());
        importTemp.add(tr.getTransactionId());
        importTemp.add(tr.getTimestamp());
        listImportData.add(importTemp);
        List<Object> gtinTemp = new ArrayList<>(8);
        gtinTemp.add(period);
        gtinTemp.add(tradePartner);
        gtinTemp.add(version);
        gtinTemp.add(chunkCounter);
        gtinTemp.add(tr.getMdhId());
        gtinTemp.add(tr.getGtin());
        gtinTemp.add(tr.getQuantity());
        gtinTemp.add(tr.getTimestamp());
        listGtins.add(gtinTemp);
        elementsInChunkCounter++;
        elementCounter++;
        if (elementsInChunkCounter == 30000) {
            elementsInChunkCounter = 0;
            chunkCounter++;
            ingestImportData(listImportData);
            listImportData.clear();
            ingestGtins(listGtins);
            listGtins.clear();
        }
    }
    if (!listImportData.isEmpty()) {
        ingestImportData(listImportData);
    }
    if (!listGtins.isEmpty()) {
        ingestGtins(listGtins);
    }
    return new PersistenceInformation();
}

private void ingestImportData(List<List<?>> list) {
    String cqlIngest = "INSERT INTO import_data (pd, tp, ver, chunk, mdh_id, gtin, qty, id, ts) VALUES (?,?,?,?,?,?,?,?,?)";
    cassandraOperations.ingest(cqlIngest, list);
}

private void ingestGtins(List<List<?>> list) {
    String cqlIngest = "INSERT INTO gtins (pd, tp, ver, chunk, mdh_id, gtin, qty, ts) VALUES (?,?,?,?,?,?,?,?)";
    cassandraOperations.ingest(cqlIngest, list);
}
This worked pretty well until I noticed that sometimes a data set goes missing: there is an entry in the second table (gtins), but the corresponding data set in the main table was not inserted. The application counted it, but the database did not write it.
The table is built this way:
CREATE TABLE import_data (
    tp text,
    pd text,
    ver int,
    chunk int,
    mdh_id uuid,
    gtin text,
    qty float,
    id text,
    ts timestamp,
    PRIMARY KEY ((tp, pd, ver, chunk), ts, mdh_id)
) WITH CLUSTERING ORDER BY (ts DESC);
The mdh_id is a UUID generated by my application, so that every data set has a unique key and is not accidentally overwritten.
The Cassandra log files didn't even show a warning.
At the moment I am evaluating BatchStatement, but then I need to flush the batch every 8 data sets because of the 5 KB batch size limit; otherwise the database loses even more entries.
Any suggestion as to what is going wrong in my application is highly appreciated. Thanks a lot.

Most efficient way to read from bottom of Azure Table Storage

I have an Azure table which serves as an event log. I need to read the bottom of the table to retrieve the most recent entries.
What is the most efficient way of doing this?
First of all, I would really advise you to base your partition key on UTC ticks. You can do it in a way that all the entities are ordered from latest to oldest.
Then if you want to get, let's say, the 100 latest logs, you just call query.Take(100) (assuming the query is an IQueryable from your favorite client; we use Lucifure Stash).
If you want to fetch entities for a certain period, you write query.Where(x => x.PartitionKey <= value); or something similar.
The value variable has to be constructed based on the way you construct the values for the partition key.
Assuming you want to fetch the data for the last 15 minutes, try this pseudo-code:
DateTime toDateTime = DateTime.UtcNow;
DateTime fromDateTime = toDateTime.AddMinutes(-15);
string myPartitionKeyFrom = fromDateTime.ToString("yy-MM");
string myPartitionKeyTo = toDateTime.ToString("yy-MM");
string query = "";
if (myPartitionKeyFrom.Equals(myPartitionKeyTo)) // Both time periods fall in the same month, so we can directly hit that partition.
{
    query += "(PartitionKey eq '" + myPartitionKeyFrom + "') ";
}
else // Otherwise we need a greater-than and less-than range.
{
    query += "(PartitionKey ge '" + myPartitionKeyFrom + "' and PartitionKey le '" + myPartitionKeyTo + "') ";
}
query += "and (RowKey ge '" + fromDateTime.ToString() + "' and RowKey le '" + toDateTime.ToString() + "')";
If you want to fetch the latest n entries, then you need to slightly modify your PartitionKey and RowKey values so that the latest entries are pushed to the top of the table.
For this you compute both keys from DateTime.MaxValue.Subtract(DateTime.UtcNow).Ticks instead of DateTime.UtcNow.
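As a rough illustration of that reversed-key scheme: the "d19" zero-padding keeps the lexicographic order of the string keys in line with the numeric tick order, so newer entries sort first. LogEntity here is a hypothetical entity type:

// Newer timestamps produce smaller reversed tick values, so they sort to the top.
string reversedTicks = DateTime.MaxValue.Subtract(DateTime.UtcNow).Ticks.ToString("d19");
var entity = new LogEntity
{
    PartitionKey = reversedTicks,
    RowKey = Guid.NewGuid().ToString() // disambiguates entries written in the same tick
};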
Microsoft provides the SemanticLogging framework, which has a specific sink for logging to an Azure Table.
If you look at the library code, it generates a partition key (in reverse order) based on a DateTime:
static string GeneratePartitionKeyReversed(DateTime dateTime)
{
    dateTime = dateTime.AddMinutes(-1.0);
    return GetTicksReversed(
        new DateTime(dateTime.Year, dateTime.Month, dateTime.Day, dateTime.Hour, dateTime.Minute, 0));
}

static string GetTicksReversed(DateTime dateTime)
{
    return (DateTime.MaxValue - dateTime.ToUniversalTime())
        .Ticks.ToString("d19", (IFormatProvider)CultureInfo.InvariantCulture);
}
So you can implement the same logic in your application to build your partition key.
If you want to retrieve the logs for a specific date range, you can write a query that looks like this:
var minDate = GeneratePartitionKeyReversed(DateTime.UtcNow.AddHours(-2));
var maxDate = GeneratePartitionKeyReversed(DateTime.UtcNow.AddHours(-1));

// Get the cloud table
var cloudTable = GetCloudTable();

// Build the query
IQueryable<DynamicTableEntity> query = cloudTable.CreateQuery<DynamicTableEntity>();

// Condition for the max date (reversed keys sort newest first)
query = query.Where(a => string.Compare(a.PartitionKey, maxDate,
    StringComparison.Ordinal) >= 0);

// Condition for the min date
query = query.Where(a => string.Compare(a.PartitionKey, minDate,
    StringComparison.Ordinal) <= 0);

Big query Query Statistics

How do I get query statistics, such as the time taken to execute and other parameters, in BigQuery using C#?
QueryRequest _r = new QueryRequest();
_r.Query = "SELECT Id, Name FROM [Sample.Test] LIMIT 1000";
QueryResponse _qr = _service.Jobs.Query(_r, "samplequery").Fetch();
List<string> _fieldNames = _qr.Schema.Fields.ToList().Select(x => x.Name).ToList();
List<Google.Apis.Bigquery.v2.Data.TableRow> _rows = _qr.Rows.ToList();
There is a JobStatistics class, but I am not getting job statistics from the above query execution. If there is any other way to get statistics, please suggest it.
Thanks
I got it.
Job _j = _service.Jobs.Get(_qr.JobReference.ProjectId, _qr.JobReference.JobId).Fetch();
JobStatistics _js = _j.Statistics;
this.StartTime = _js.StartTime;
this.EndTime = _js.EndTime;
this.BytesProcessed = _js.TotalBytesProcessed;
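The v2 API reports StartTime and EndTime as milliseconds since the Unix epoch, so the elapsed time can be derived from them directly. A small sketch (the nullable handling is an assumption about the client library's property types):

// Job timestamps are epoch milliseconds; their difference is the query duration.
long startMs = _js.StartTime ?? 0;
long endMs = _js.EndTime ?? 0;
TimeSpan elapsed = TimeSpan.FromMilliseconds(endMs - startMs);
Console.WriteLine("Query took " + elapsed.TotalSeconds + " s and processed " + _js.TotalBytesProcessed + " bytes");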

Last records from WadLogsTable

How can I get the last 100 records from the WADLogsTable, ordered by date?
I tried to do it with this piece of code, but it doesn't work:
var query = (from entity in tsc.CreateQuery<LogsObject>("WADLogsTable")
             where entity.PartitionKey.CompareTo(startTime.ToUniversalTime().Ticks.ToString("D19")) >= 0
             orderby entity.EventTickCount descending
             select entity);
Where tsc is the TableServiceContext.
I can get the records, but I'm interested in the most recent logs.
Thanks,
Windows Azure table storage doesn't support sorting, so entities always come back sorted by their PartitionKey+RowKey. But I suspect the log entries are already in reverse chronological order. Aren't they?
[EDIT] Apparently they're not. :-)
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(ATCommon.DiagnosticConfig);
CloudTableClient cloudTableClient = storageAccount.CreateCloudTableClient();
TableServiceContext serviceContext = cloudTableClient.GetDataServiceContext();
IQueryable<WadLogEntity> traceLogsTable = serviceContext.CreateQuery<WadLogEntity>("WADLogsTable");

// WADLogsTable partition keys are "0" + tick count, so this selects everything newer than the cutoff.
var selection = from row in traceLogsTable
                where row.PartitionKey.CompareTo("0" + DateTime.UtcNow.AddHours(hours).Ticks) >= 0
                select row;
//var selection = from row in traceLogsTable where row.PartitionKey.CompareTo("0" + DateTime.UtcNow.AddMinutes(-5.0).Ticks) >= 0 select row;

CloudTableQuery<WadLogEntity> query = selection.AsTableServiceQuery<WadLogEntity>();
IEnumerable<WadLogEntity> output = query.Execute();

// Sorting has to happen client-side, since table storage returns rows in PartitionKey + RowKey order.
return output.OrderByDescending(s => s.Timestamp).Take(100).ToList();
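One small note: the question's entity exposes EventTickCount, and sorting client-side on that field instead of Timestamp may order events more precisely, since Timestamp is the storage-side write time rather than the time the event was logged. This variant assumes WadLogEntity also carries EventTickCount:

// EventTickCount records when the event occurred, not when the row was written.
return output.OrderByDescending(s => s.EventTickCount).Take(100).ToList();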
