I have an Azure table which serves as an event log. I need the most efficient way to read the bottom of the table to retrieve the most recent entries.
What is the most efficient way of doing this?
First of all, I would really advise you to base your partition key on UTC ticks. You can do it in a way that all the entities are ordered from latest to oldest.
Then if you want to get, let's say, the 100 latest logs, you just call (assuming query is an IQueryable from your favorite client - we use Lucifure Stash): query.Take(100);
If you want to fetch entities for a certain period you write: query.Where(x => x.PartitionKey <= value); or something similar.
The "value" variable has to be constructed based on the way you construct the values for the partition key.
Assuming you want to fetch the data for the last 15 minutes, try this pseudocode:
DateTime toDateTime = DateTime.UtcNow;
DateTime fromDateTime = toDateTime.AddMinutes(-15);

// In this scheme there is one partition per month ("yy-MM").
string myPartitionKeyFrom = fromDateTime.ToString("yy-MM");
string myPartitionKeyTo = toDateTime.ToString("yy-MM");

string query = "";
if (myPartitionKeyFrom.Equals(myPartitionKeyTo)) // Both times fall in the same month, so we can hit that partition directly.
{
    query += "(PartitionKey eq '" + myPartitionKeyFrom + "') ";
}
else // Otherwise we need a greater-than / less-than range on the partition key.
{
    query += "(PartitionKey ge '" + myPartitionKeyFrom + "' and PartitionKey le '" + myPartitionKeyTo + "') ";
}
// Use a sortable, culture-independent format for the row key bounds so the string comparison matches the time order.
query += "and (RowKey ge '" + fromDateTime.ToString("s") + "' and RowKey le '" + toDateTime.ToString("s") + "')";
If you want to fetch the latest 'n' entries, then you need to slightly modify your PartitionKey and RowKey values so that the latest entries are pushed to the top of the table.
For this you need to compute both keys using DateTime.MaxValue.Subtract(DateTime.UtcNow).Ticks instead of DateTime.UtcNow.
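As a minimal sketch of that reversed-ticks scheme (the helper name and the entity shape below are my own illustration, not something from the original post), the keys could be built like this:
// Later timestamps give smaller values, so reading the table in its natural
// ascending key order returns the newest entries first.
static string GetReversedTicksKey(DateTime utcTime)
{
    long reversedTicks = DateTime.MaxValue.Subtract(utcTime).Ticks;
    // Zero-pad to a fixed width so lexicographic order matches numeric order.
    return reversedTicks.ToString("d19");
}

// Hypothetical usage when writing a log entity:
// entity.PartitionKey = GetReversedTicksKey(DateTime.UtcNow);
// entity.RowKey = Guid.NewGuid().ToString();
With keys like these, query.Take(100) returns the 100 most recent entries without any date filtering.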
Microsoft provides a SemanticLogging framework that has a specific sink for logging to Azure Table storage.
If you look at the library code, it generates a partition key (in reverse chronological order) based on a DateTime:
static string GeneratePartitionKeyReversed(DateTime dateTime)
{
    // Shift back one minute and truncate to whole minutes so entries are grouped per minute.
    dateTime = dateTime.AddMinutes(-1.0);
    return GetTicksReversed(
        new DateTime(dateTime.Year, dateTime.Month, dateTime.Day, dateTime.Hour, dateTime.Minute, 0));
}

static string GetTicksReversed(DateTime dateTime)
{
    // Subtracting from DateTime.MaxValue reverses the order: newer dates give smaller keys.
    return (DateTime.MaxValue - dateTime.ToUniversalTime())
        .Ticks.ToString("d19", (IFormatProvider)CultureInfo.InvariantCulture);
}
So you can implement the same logic in your application to build your partition key.
If you want to retrieve the logs for a specific date range, you can write a query that looks like this:
var minDate = GeneratePartitionKeyReversed(DateTime.UtcNow.AddHours(-2));
var maxDate = GeneratePartitionKeyReversed(DateTime.UtcNow.AddHours(-1));

// Get the cloud table
var cloudTable = GetCloudTable();

// Build the query
IQueryable<DynamicTableEntity> query = cloudTable.CreateQuery<DynamicTableEntity>();

// Condition for the max date (remember the keys are reversed, so the newer
// bound produces the smaller partition key value).
query = query.Where(a => string.Compare(a.PartitionKey, maxDate,
                                        StringComparison.Ordinal) >= 0);

// Condition for the min date
query = query.Where(a => string.Compare(a.PartitionKey, minDate,
                                        StringComparison.Ordinal) <= 0);
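The query is only sent to the table service when it is enumerated; a minimal usage sketch, assuming the classic Microsoft.WindowsAzure.Storage table client shown above:
// Enumerating the IQueryable executes the query; ToList() materializes the
// matching entities, with continuation handled by the client.
var recentEntities = query.ToList();
Console.WriteLine(recentEntities.Count);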
I have a problem regarding merging CSV files using PySpark SQL with a Delta table. I managed to create an upsert function that updates if matched and inserts if not matched.
I want to add an ID column to the final Delta table and increment it each time we insert data. This column identifies each row in our Delta table. Is there any way to put that in place?
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit

def Merge(dict1, dict2):
    res = {**dict1, **dict2}
    return res

def create_default_values_dict(correspondance_df, marketplace):
    dict_output = {}
    for field in get_nan_keys_values(get_mapping_dict(correspondance_df, marketplace)):
        dict_output[field] = 'null'
        # We want to increment the id row each time we perform an insertion (TODO TODO TODO)
        # if field == 'id':
        #     dict_output['id'] = col('id') + 1
        # else:
    return dict_output

def create_matched_update_dict(mapping, products_table, updates_table):
    output = {}
    for k, v in mapping.items():
        if k == 'source_name':
            output['products.source_name'] = lit(v)
        else:
            output[products_table + '.' + k] = F.when(
                col(updates_table + '.' + v).isNull(),
                col(products_table + '.' + k)
            ).when(
                col(updates_table + '.' + v).isNotNull(),
                col(updates_table + '.' + v)
            )
    return output
insert_dict = create_not_matched_insert_dict(mapping, 'products', 'updates')
default_dict = create_default_values_dict(correspondance_df_products, 'Cdiscount')
insert_values = Merge(insert_dict, default_dict)
update_values = create_matched_update_dict(mapping, 'products', 'updates')

delta_table_products.alias('products').merge(
        updates_df_table.limit(20).alias('updates'),
        "products.barcode_ean == updates.ean") \
    .whenMatchedUpdate(set=update_values) \
    .whenNotMatchedInsert(values=insert_values) \
    .execute()
I tried to increment the id column in the function create_default_values_dict, but it doesn't seem to work well; it does not auto-increment by 1. Is there another way to solve this problem? Thanks in advance :)
Databricks has IDENTITY columns for hosted Spark
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters
GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY
[ ( [ START WITH start ] [ INCREMENT BY step ] ) ]
This works on Delta tables.
Example:
create table gen1 (
id long GENERATED ALWAYS AS IDENTITY
, t string
)
Requires Runtime version 10.4 or above.
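For instance, a minimal usage sketch with the gen1 table above (assuming a SparkSession named spark on a recent enough Databricks runtime that supports column lists in INSERT); the id values are generated automatically when you insert only the other columns:
# Insert rows without supplying id; Databricks generates the identity values.
spark.sql("INSERT INTO gen1 (t) VALUES ('first'), ('second')")
spark.sql("SELECT id, t FROM gen1").show()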
Delta does not support auto-increment column types.
In general, Spark doesn't use auto-increment IDs, instead favoring monotonically increasing IDs. See functions.monotonically_increasing_id().
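For example, a minimal sketch of that approach (the DataFrame name is hypothetical); the generated IDs are unique and increasing, but not consecutive and not starting at 1:
from pyspark.sql import functions as F

# Adds a unique 64-bit id per row without shuffling the data.
updates_with_id = updates_df.withColumn("id", F.monotonically_increasing_id())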
If you want to achieve auto-increment behavior you will have to use multiple Delta operations, e.g., query the current max value, add it to a row_number() column computed via a window function, and then write (see the sketch after the two issues below). This is problematic for two reasons:
Unless you introduce an external locking mechanism or some other way to ensure that no updates to the table happen in-between finding the max value and writing, you can end up with invalid data.
Using row_number() will reduce parallelism to 1, forcing all the data through a single core, which will be very slow with large data.
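A sketch of that max-plus-row_number approach, under the caveats above (the table and DataFrame names are assumptions, and nothing here protects against concurrent writers):
from pyspark.sql import functions as F, Window

# 1. Read the current maximum id from the target Delta table.
current_max = spark.sql("SELECT COALESCE(MAX(id), 0) AS max_id FROM products").first()["max_id"]

# 2. Number the new rows; ordering by a constant forces everything through one partition.
w = Window.orderBy(F.lit(1))
new_rows_with_id = new_rows_df.withColumn("id", F.row_number().over(w) + F.lit(current_max))

# 3. Append (or merge) the rows; any writer that slips in between steps 1 and 3 breaks uniqueness.
new_rows_with_id.write.format("delta").mode("append").saveAsTable("products")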
Bottom line, you really do not want to use auto-increment columns with Spark.
Hope this helps.
I am doing a query with this filter:
(PartitionKey eq 'A' or PartitionKey eq 'B' or ...) and RowKey eq 'RK'
I realized that this kind of query with 20 to 100 PKs takes 3 to 5 seconds. The total number of items in the table is not large (roughly 1 million).
I think it is doing a partial scan of the table. I assumed it would do several point queries, but that doesn't seem to be the case.
My other option is to do independent parallel queries and then merge the results.
Is this a good option for 100 items?
Will I have problems with the network connections? (I increase the limit with ServicePointManager.DefaultConnectionLimit.)
Note: not every PK/RK pair will retrieve a record.
My other option is to do independent parallel queries and then merge the results.
It will save query time on Azure Storage, but it will spend more time on the query requests and result responses. I have a table with 160K entities, and I wrote two code samples to compare the total time of querying the entities with one combined query versus multiple parallel queries.
Following is my sample code.
Querying the entities with a single combined query:
int entitesCount = 20;
TableQuery<CustomerEntity> customerQuery = new TableQuery<CustomerEntity>();

// Build one combined filter: (PartitionKey eq '0' or ... or PartitionKey eq '19') and RowKey eq '42'
string filter = "(";
for (int i = 0; i < entitesCount; i++)
{
    filter = filter + "PartitionKey eq '" + i + "'";
    if (i < entitesCount - 1)
    {
        filter = filter + " or ";
    }
}
filter = filter + ") and RowKey eq '42'";
customerQuery.FilterString = filter;

Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
var customers = table.ExecuteQuery(customerQuery);
Console.WriteLine(customers.Count().ToString());
stopWatch.Stop();
TimeSpan ts = stopWatch.Elapsed;
Console.WriteLine(ts.ToString());
Querying the entities with multiple parallel queries:
ServicePointManager.DefaultConnectionLimit = 100;
Stopwatch stopWatch = new Stopwatch();
stopWatch.Start();
int entitesCount = 20;
// Use a thread-safe collection, since several queries add their results concurrently.
ConcurrentBag<CustomerEntity> customers = new ConcurrentBag<CustomerEntity>();
Parallel.For(0, entitesCount, i =>
{
    TableQuery<CustomerEntity> customerQuery = new TableQuery<CustomerEntity>();
    customerQuery.FilterString = "PartitionKey eq '" + i.ToString() + "' and RowKey eq '88'";
    var cs = table.ExecuteQuery(customerQuery);
    foreach (var c in cs)
    {
        customers.Add(c);
    }
});
// Parallel.For blocks until all iterations have finished, so no extra waiting is required.
Console.WriteLine(customers.Count.ToString());
stopWatch.Stop();
TimeSpan ts = stopWatch.Elapsed;
Console.WriteLine(ts.ToString());
For more information, see "Azure Table Storage: Efficient way to query multiple PK-RK pairs".
I suggest you test it on your side and determine which approach works better for you.
Is it possible to return both objects of a join as the result of a GridGain cache query?
We can get either one of the sides of the join or fields from both (and then use these to retrieve each object separately), but looking at the examples and documentation there seems to be no way to get both objects.
Thanks!
@dejan - In GridGain and Apache Ignite, you can use the _key and _val functionality with SqlFieldsQuery in order to return the objects. For example -
SqlFieldsQuery sql = new SqlFieldsQuery(
    "select a._key, a._val, b._val from SomeTypeA a, SomeTypeB b " +
    "where a.id = b.otherId");

try (QueryCursor<List<?>> cursor = cache.query(sql)) {
    for (List<?> row : cursor)
        System.out.println("Row: " + row);
}
Note that in this case the objects will be returned in serialized form.
First of all, GridGain Open Source edition is now Apache Ignite, so I would recommend switching.
In Ignite, you can return exactly the fields you need from a query using SqlFieldsQuery, like so:
SqlFieldsQuery sql = new SqlFieldsQuery(
    "select fieldA1, fieldA2, fieldB3 from SomeTypeA a, SomeTypeB b " +
    "where a.id = b.otherId");

try (QueryCursor<List<?>> cursor = cache.query(sql)) {
    for (List<?> row : cursor)
        System.out.println("Row: " + row);
}
In the GridGain open source edition, you can use the GridGain fields query APIs as well.
How can I select the count from a table and include a where clause to return a long? Ideally I would use db.Count instead of db.Select. I'm just not sure how to use db.Count and cannot find documentation on it.
long totalCount = 0;
using (IDbConnection db = dbFactory.OpenDbConnection())
{
totalCount = db.Count<Content>( ?? );
}
Console.WriteLine(totalCount);
You answered your own question in your comment ;) You should use the Count extension method with an expression parameter. Example below:
long amout = db.Count<Post>(x => x.Subject == "test");
OrmLite generates the following SQL:
SELECT Count(*) FROM POST WHERE (SUBJECT = 'test')
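Adapted to the question's Content type, it would look something like the sketch below (the Status property is just a placeholder for whatever condition you actually need):
long totalCount;
using (IDbConnection db = dbFactory.OpenDbConnection())
{
    // Roughly: SELECT COUNT(*) FROM Content WHERE (Status = 'Published')
    totalCount = db.Count<Content>(x => x.Status == "Published");
}
Console.WriteLine(totalCount);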
I have a problem with LINQ. Basically, a third-party database that I need to connect to is using the now-deprecated text column type (I can't change this), and I need to execute a Distinct() in my LINQ on results that contain this field.
I don't want to do a ToList() before executing the Distinct(), as that would bring back thousands of records from the database that I don't require and would annoy the client, who gets charged for bandwidth usage. I only need the first 15 distinct records.
Anyway, the query is below:
var query = (from s in db.tSearches
join sc in db.tSearchIndexes on s.GUID equals sc.CPSGUID
join a in db.tAttributes on sc.AttributeGUID equals a.GUID
where s.Notes != null && a.Attribute == "Featured"
select new FeaturedVacancy
{
Id = s.GUID,
DateOpened = s.DateOpened,
Notes = s.Notes
});
return query.Distinct().OrderByDescending(x => x.DateOpened);
I know I can do a subquery to achieve the same thing as above (tSearches contains unique records), but I'd prefer a more straightforward solution if one is available, as I need to change a number of similar queries throughout the code to get this working.
There were no answers on how to do this, so I went with my first suggestion: retrieve the unique records first from tSearches, then construct a subquery with the non-unique records and filter the search results by it. Answer below:
var query = (from s in db.tSearches
where s.DateClosed == null && s.ConfidentialNotes != null
orderby s.DateOpened descending
select new FeaturedVacancy
{
Id = s.GUID,
Notes = s.ConfidentialNotes
});
/* Now filter by our 'Featured' attribute */
var subQuery = from sc in db.tSearchIndexes
join a in db.tAttributes on sc.AttributeGUID equals a.GUID
where a.Attribute == "Featured"
select sc.CPSGUID;
query = query.Where(x => subQuery.Contains(x.Id));
return query;
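Since only the first 15 distinct records are needed, the caller can apply Take before enumerating; a sketch (the wrapper method name GetFeaturedVacancies is hypothetical), which keeps the row limit on the database side thanks to deferred execution:
// Only the newest 15 featured vacancies are fetched; Take(15) is translated
// into the SQL (TOP/LIMIT) sent to the server.
var latestFeatured = GetFeaturedVacancies().Take(15).ToList();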