I am setting the TTL on my Cosmos container to 1 to force the deletion of all items. I then query SELECT VALUE COUNT(1) FROM c to check that all items are deleted before setting the TTL back to its previous value.
My issue is that I can see via the portal that the items are deleted, but my query via the SDK keeps returning the old, stale value for an inordinately long time. Is there a way to force it to read the "real" value from the back end, or to establish a fresh connection, etc.?
// I create my client like so; setting ConsistencyLevel.STRONG will throw an error as
// it is a higher consistency level than the database account is configured for
CosmosClient cosmosClient = new CosmosClientBuilder()
        .endpoint(DATABASE_HOST)
        .key(DATABASE_KEY)
        .consistencyLevel(ConsistencyLevel.SESSION)
        .contentResponseOnWriteEnabled(true)
        .buildClient();
//get the database
CosmosDatabase dataBase = cosmosClient.getDatabase(databaseName);
return dataBase;
//get the container
CosmosContainer container = theDatabase.getContainer(containerProps.getId());
//update the TTL
containerProps.setDefaultTimeToLiveInSeconds(1);
container.replace(containerProps);
Thread.sleep(1000);
//now confirm that the container contents are deleted
//i tried refreshing my client/db/container objects to see if it would help
CosmosClient refreshedCosmosClient = createSyncCosmosClient();
CosmosDatabase refreshedDatabase = refreshedCosmosClient.getDatabase(DATABASE_NAME);
CosmosContainer refreshedContainer = refreshedDatabase.getContainer(container.getId());
//query the number of ITEMS in the container
CosmosPagedIterable<JsonNode> countOfDocs =
refreshedContainer.queryItems(CHECK_CONTAINER_EMPTY_QUERY, new CosmosQueryRequestOptions(),
JsonNode.class);
context.getLogger().info("wooooooooooooooooaaaa" +countOfDocs.toString());
//THIS VALUE IS NOT UP TO DATE. IT IS THE OLD VALUE
JsonNode count = countOfDocs.iterator().next();
int numberOfDocuments = count.asInt();
When you set the TTL to 1 second, yes, it will eventually delete all the documents, but this does not happen instantaneously. Depending on the volume of data, it can take some time. What happens is that documents pending TTL deletion can no longer be seen by read operations (hence the COUNT shows 0).
If you disable TTL and there are documents still in the process of TTL deletion, then those are now back to being accessible (because the process by which they were being deleted is disabled).
Reference: https://learn.microsoft.com/azure/cosmos-db/nosql/time-to-live
Deletion of expired items is a background task that consumes left-over Request Units, that is Request Units that haven't been consumed by user requests. Even after the TTL has expired, if the container is overloaded with requests and if there aren't enough RU's available, the data deletion is delayed. Data is deleted once there are enough RUs available to perform the delete operation. Though the data deletion is delayed, data is not returned by any queries (by any API) after the TTL has expired.
Setting the ConsistencyLevel to BOUNDED_STALENESS (with a staleness window of 5 seconds) seems to improve things. I cannot use ConsistencyLevel.STRONG due to the database account's configuration.
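For reference, a minimal sketch (not the poster's exact code) of a client built with that level, reusing the builder calls from the question; the bounded-staleness window itself (for example 5 seconds) is configured on the Cosmos DB account rather than in the SDK:
CosmosClient boundedClient = new CosmosClientBuilder()
        .endpoint(DATABASE_HOST)
        .key(DATABASE_KEY)
        // must not be stronger than the account-level consistency
        .consistencyLevel(ConsistencyLevel.BOUNDED_STALENESS)
        .contentResponseOnWriteEnabled(true)
        .buildClient();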
Initially, our flow of communicating with Google Pub/Sub was as follows:
1. The application accepts a message.
2. It checks that the message doesn't already exist in the idempotency store.
3.1 If it doesn't exist, put it into the idempotency store (the key is the value of a unique header, the value is the current timestamp).
3.2 If it does exist, just ignore the message.
4. When processing is finished, send an acknowledgement.
5. In the acknowledgement success callback, remove the message from the metadata store.
Point 5 is wrong, because theoretically we can get a duplicated message even after the message has been processed. Moreover, we found out that sometimes the message might not be removed even though the success callback was invoked (Message is received from Google Pub/Sub subscription again and again after acknowledge [Heisenbug]). So we decided to update the value after the message is processed and replace the timestamp with a "FINISHED" string.
But sooner or later this table will become overcrowded, so we have to clean up messages in the MetadataStore. We can remove messages that have been processed and were processed more than 1 day ago.
As was mentioned in the comments of https://stackoverflow.com/a/51845202/2674303, I can add an additional column to the metadataStore table where I could mark whether a message has been processed. That is not a problem at all. But how can I use this flag in my cleaner? The MetadataStore has only a key and a value.
In the acknowledgement success callback, remove the message from the metadata store.
I don't see a reason for this step at all.
Since you say that you store a timestamp in the value, that means you can analyze this table from time to time and remove entries that are definitely old.
In one of my projects we have a daily job in the database to archive a table for better main-process performance, simply because we don't need the old data any more. For that we check a timestamp in each row to determine whether it should go into the archive or not. I wouldn't remove data immediately after processing, because there is a chance of redelivery from the external system.
On the other hand, for better performance I would add an extra indexed column of a timestamp type to that metadata table and populate its value via a trigger on each update or insert. Note that the MetadataStore just inserts an entry from the MetadataStoreSelector:
return this.metadataStore.putIfAbsent(key, value) == null;
So you need an on-insert trigger to populate that date column. This way, at the end of the day, you will know whether an entry needs to be removed or not.
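For illustration, here is a minimal sketch of what a daily cleaner could look like on the Java side, assuming a JDBC-backed metadata store whose backing table (INT_METADATA_STORE is the Spring Integration JDBC default) has gained the extra trigger-populated CREATED_AT column; the column name and schedule are assumptions, not part of the framework:
import java.sql.Timestamp;
import java.time.Duration;
import java.time.Instant;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class MetadataStoreCleaner {

    private final JdbcTemplate jdbcTemplate;

    public MetadataStoreCleaner(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Runs once a day and removes entries older than one day.
    // Optionally also check the stored value (e.g. 'FINISHED') if only processed entries may be removed.
    @Scheduled(cron = "0 0 3 * * *")
    public void purgeOldEntries() {
        Timestamp cutoff = Timestamp.from(Instant.now().minus(Duration.ofDays(1)));
        int removed = jdbcTemplate.update(
                "DELETE FROM INT_METADATA_STORE WHERE CREATED_AT < ?", cutoff);
        // 'removed' can be logged for monitoring
    }
}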
I'm using Hazelcast with a distributed cache in embedded mode. My cache is defined with an Eviction Policy and an Expiry Policy.
CacheSimpleConfig cacheSimpleConfig = new CacheSimpleConfig()
.setName(CACHE_NAME)
.setKeyType(UserRolesCacheKey.class.getName())
.setValueType((new String[0]).getClass().getName())
.setStatisticsEnabled(false)
.setManagementEnabled(false)
.setReadThrough(true)
.setWriteThrough(true)
.setInMemoryFormat(InMemoryFormat.OBJECT)
.setBackupCount(1)
.setAsyncBackupCount(1)
.setEvictionConfig(new EvictionConfig()
.setEvictionPolicy(EvictionPolicy.LRU)
.setSize(1000)
.setMaximumSizePolicy(EvictionConfig.MaxSizePolicy.ENTRY_COUNT))
.setExpiryPolicyFactoryConfig(
new ExpiryPolicyFactoryConfig(
new TimedExpiryPolicyFactoryConfig(ACCESSED,
new DurationConfig(
1800,
TimeUnit.SECONDS))));
hazelcastInstance.getConfig().addCacheConfig(cacheSimpleConfig);
ICache<UserRolesCacheKey, String[]> userRolesCache = hazelcastInstance.getCacheManager().getCache(CACHE_NAME);
MutableCacheEntryListenerConfiguration<UserRolesCacheKey, String[]> listenerConfiguration =
new MutableCacheEntryListenerConfiguration<>(
new UserRolesCacheListenerFactory(), null, false, false);
userRolesCache.registerCacheEntryListener(listenerConfiguration);
The problem I am having is that my Listener seems to be firing prematurely in a production environment; the listener is executed (REMOVED) even though the cache has been recently queried.
As the expiry listener fires, I get the CacheEntry, but I'd like to be able to see the reason for the Expiry, whether it was Evicted (due to MaxSize policy), or Expired (Duration). If Expired, I'd like to see the timestamp of when it was last accessed. If Evicted, I'd like to see the number of entries in the cache, etc.
Are these stats/metrics/metadata available anywhere via Hazelcast APIs?
Local cache statistics (entry count, eviction count) are available using ICache#getLocalCacheStatistics(). Notice that you need to setStatisticsEnabled(true) in your cache configuration for statistics to be available. Also, notice that the returned CacheStatistics object only reports statistics for the local member.
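As a quick sketch, assuming the userRolesCache handle from the question and statistics enabled in the configuration, the local statistics can be read like this:
// Requires .setStatisticsEnabled(true) in the CacheSimpleConfig above.
com.hazelcast.cache.CacheStatistics stats = userRolesCache.getLocalCacheStatistics();
long localEntryCount = stats.getOwnedEntryCount(); // entries owned by this member
long localEvictions = stats.getCacheEvictions();   // evictions performed on this member
System.out.println("entries=" + localEntryCount + ", evictions=" + localEvictions);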
When seeking info on a single cache entry, you can use the EntryProcessor functionality to unwrap the MutableEntry to the Hazelcast-specific class com.hazelcast.cache.impl.CacheEntryProcessorEntry and inspect that one. The Hazelcast-specific implementation provides access to the CacheRecord that provides metadata like creation/accessed time.
Caveat: the Hazelcast-specific implementation may change between versions. Here is an example:
cache.invoke(KEY, (EntryProcessor<String, String, Void>) (entry, arguments) -> {
CacheEntryProcessorEntry hzEntry = entry.unwrap(CacheEntryProcessorEntry.class);
// getRecord does not update the entry's access time
System.out.println(hzEntry.getRecord().getLastAccessTime());
return null;
});
Azure Function utilising Azure Table Storage
I have an Azure Function which is triggered from Azure Service Bus topic subscription, let's call it "Process File Info" function.
The message on the subscription contains file information to be processed. Something similar to this:
{
"uniqueFileId": "adjsdakajksajkskjdasd",
"fileName":"mydocument.docx",
"sourceSystemRef":"System1",
"sizeBytes": 1024,
... and other data
}
The function carries out the following two operations -
Check the individual file storage table for the existence of the file. If it exists, update that record. If it's new, add the file to the storage table (stored on a per system | per fileId basis).
Capture metrics on the file size in bytes and store them in a second storage table, called metrics (constantly incrementing the byte count, stored on a per system | per year/month basis).
The following diagram gives a brief summary of my approach:
The difference between the individualFileInfo table and the fileMetric table is that the individual table has one record per file, whereas the metric table stores one record per month that is constantly updated (incremented), gathering the total bytes passed through the function.
Data in the fileMetrics table is stored as follows:
The issue...
Azure Functions are brilliant at scaling; in my setup I have a maximum of 6 of these functions running at any one time. Presuming each file message being processed is unique, updating (or inserting) the record in the individualFileInfo table works fine, as there are no race conditions.
However, updating the fileMetric table is proving problematic: if all 6 functions fire at once, they all try to update the metrics table at the same time (constantly incrementing the new-file counter or the existing-file counter).
I have tried using the ETag for optimistic updates, along with a little bit of recursion to retry should a 412 response come back from the storage update (code sample below). But I can't seem to avoid this race condition. Does anyone have suggestions on how to work around this constraint, or has anyone come up against something similar before?
Sample code that is executed in the function for storing the fileMetric update:
internal static async Task UpdateMetricEntry(IAzureTableStorageService auditTableService,
    string sourceSystemReference, long addNewBytes, long addIncrementBytes, int retryDepth = 0)
{
    const int maxRetryDepth = 3; // only recursively attempt a max of 3 times
    var todayYearMonth = DateTime.Now.ToString("yyyyMM");

    try
    {
        // Attempt to get existing record from table storage.
        var result = await auditTableService.GetRecord<VolumeMetric>("VolumeMetrics", sourceSystemReference, todayYearMonth);

        // If the volume metrics table exists in storage - add or edit the records as required.
        if (result.TableExists)
        {
            VolumeMetric volumeMetric = result.RecordExists ?
                // Existing metric record.
                (VolumeMetric)result.Record.Clone()
                :
                // Brand new metrics record.
                new VolumeMetric
                {
                    PartitionKey = sourceSystemReference,
                    RowKey = todayYearMonth,
                    SourceSystemReference = sourceSystemReference,
                    BillingMonth = DateTime.Now.Month,
                    BillingYear = DateTime.Now.Year,
                    ETag = "*"
                };

            volumeMetric.NewVolumeBytes += addNewBytes;
            volumeMetric.IncrementalVolumeBytes += addIncrementBytes;

            await auditTableService.InsertOrReplace("VolumeMetrics", volumeMetric);
        }
    }
    catch (StorageException ex)
    {
        if (ex.RequestInformation.HttpStatusCode == 412)
        {
            // Retry the update with an incremented depth so the retry limit is enforced.
            if (retryDepth < maxRetryDepth)
                await UpdateMetricEntry(auditTableService, sourceSystemReference, addNewBytes, addIncrementBytes, retryDepth + 1);
        }
        else
            throw;
    }
}
The ETag keeps track of conflicts, and if this code gets a 412 HTTP response it will retry, up to a maximum of 3 times (an attempt to mitigate the issue). My issue here is that I cannot guarantee the updates to table storage across all instances of the function.
Thanks for any tips in advance!!
You can put the second part of the work into a second queue and function, maybe even put a trigger on the file updates.
Since the other operation sounds like it might take most of the time anyway, this could also take some of the heat off the second step.
You can then solve any remaining race conditions by focusing only on that function. You can use sessions to limit the concurrency. In your case, the system ID could be a possible session key. If you use that, only one function instance will process data from a given system at any one time, effectively solving your race conditions.
https://dev.to/azure/ordered-queue-processing-in-azure-functions-4h6c
Edit: If you can't use Sessions to logically lock the resource, you can use locks via blob storage:
https://www.azurefromthetrenches.com/acquiring-locks-on-table-storage/
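For the blob-based locking idea, here is a rough Java sketch of the lease pattern (the linked article uses C#; the storage connection setting, container and blob names below are illustrative, and the lock blob is assumed to already exist, one per source system):
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.specialized.BlobLeaseClient;
import com.azure.storage.blob.specialized.BlobLeaseClientBuilder;

public class MetricsLockSketch {
    public static void main(String[] args) {
        // One pre-created blob per source system acts as the lock object.
        BlobClient lockBlob = new BlobServiceClientBuilder()
                .connectionString(System.getenv("AzureWebJobsStorage"))
                .buildClient()
                .getBlobContainerClient("locks")
                .getBlobClient("System1.lock");

        BlobLeaseClient leaseClient = new BlobLeaseClientBuilder()
                .blobClient(lockBlob)
                .buildClient();

        // Acquire a 15-second lease; while it is held, no other instance can acquire it.
        String leaseId = leaseClient.acquireLease(15);
        try {
            // ... read, increment and write the fileMetric row here ...
        } finally {
            leaseClient.releaseLease();
        }
    }
}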
My goal here is as follows:
Insert bulk records into BigQuery every half an hour
Delete the record if it already exists
Those records are transactions whose statuses change between pending, success, fail and expired.
BigQuery does not allow me to delete rows that were inserted just half an hour ago, as they are still in the streaming buffer.
Can anyone suggest a workaround, as I am getting duplicate rows in my table?
A better course of action would be to:
Perform periodic loads into a staging table (loading is a free operation)
After the load completes, execute a MERGE statement.
You would want something like this:
MERGE dataset.TransactionTable dt
USING dataset.StagingTransactionTable st
ON dt.tx_id = st.tx_id
WHEN MATCHED THEN
  UPDATE SET dt.status = st.status
WHEN NOT MATCHED THEN
  INSERT (tx_id, status) VALUES (st.tx_id, st.status)
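If it helps, here is a rough sketch of how the half-hourly job could drive this with the BigQuery Java client, assuming the batch lands as newline-delimited JSON in a GCS bucket (the bucket path and dataset/table names are illustrative):
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class TransactionMergeJob {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // 1. Batch-load the latest half hour of records into the staging table, replacing its contents.
        LoadJobConfiguration load = LoadJobConfiguration
                .newBuilder(TableId.of("dataset", "StagingTransactionTable"),
                        "gs://my-bucket/transactions/*.json") // illustrative GCS path
                .setFormatOptions(FormatOptions.json())
                .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
                .build();
        Job loadJob = bigquery.create(JobInfo.of(load));
        loadJob.waitFor();

        // 2. Merge the staged rows into the main table using the statement above.
        String mergeSql =
                "MERGE dataset.TransactionTable dt "
                + "USING dataset.StagingTransactionTable st "
                + "ON dt.tx_id = st.tx_id "
                + "WHEN MATCHED THEN UPDATE SET dt.status = st.status "
                + "WHEN NOT MATCHED THEN INSERT (tx_id, status) VALUES (st.tx_id, st.status)";
        bigquery.query(QueryJobConfiguration.newBuilder(mergeSql).build());
    }
}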
The same code works fine when letting CouchDB auto-generate UUIDs. I am starting off with a completely empty database, yet I keep getting this:
error: conflict
reason: Document update conflict
To reiterate, I am posting new documents to an empty database, so I am not sure how I can get update conflicts when nothing is being updated. Even stranger, the conflicting documents still show up in the DB with only a single revision, but overall there are missing records.
I am trying to insert about 38,000 records with _bulk_docs in batches of 100. I am getting these records (100 at a time) from a RETS server; each record already has a unique ID that I want to use as the CouchDB _id instead of their UUIDs. I am using a promise-based library to get the records and axios to insert them into CouchDB. After getting the first batch of 100, I then run this code to add an _id to each of the 100 records before inserting:
let batch = [];
batch = records.results.map((listing) => {
let temp = listing;
temp._id = listing.ListingKey;
return temp;
});
Then insert:
axios.post('http://127.0.0.1:5984/rets_store/_bulk_docs', { docs: batch })
This is all inside of a function that I call recursively.
I know this probably won't be enough to see the issue, but I thought I'd start here. I know for sure it has something to do with my map() and adding the _id = ListingKey.
Thanks!