Append to Azure Append Blob Using AppendTextAsync Results in Missing Data - azure

I'm attempting to create a logger for an application in Azure using the new Azure append blobs and the Azure Storage SDK 6.0.0. So I created a quick test application to get a better understanding of append blobs and their performance characteristics.
My test program simply loops 100 times and appends a line of text to the append blob. If I use the synchronous AppendText() method, everything works fine; however, it appears to be limited to about 5-6 appends per second. So I attempted to use the asynchronous AppendTextAsync() method; however, when I use this method, the loop runs much faster (as expected) but the append blob is missing about 98% of the appended text, without any exception being thrown.
If I add a Thread.Sleep and sleep for 100 milliseconds between each append operation, I end up with about 50% of the data. Sleep for 1 second and I get all of the data.
This seems similar to an issue that was discovered in v5.0.0 but was fixed in v5.0.2: https://github.com/Azure/azure-storage-net/releases/tag/v5.0.2
Here is my test code if you'd like to try to reproduce this issue:
static void Main(string[] args)
{
    var accountName = "<account-name>";
    var accountKey = "<account-key>";
    var credentials = new StorageCredentials(accountName, accountKey);
    var account = new CloudStorageAccount(credentials, true);
    var client = account.CreateCloudBlobClient();
    var container = client.GetContainerReference("<container-name>");
    container.CreateIfNotExists();
    var blob = container.GetAppendBlobReference("append-blob.txt");
    blob.CreateOrReplace();

    for (int i = 0; i < 100; i++)
        blob.AppendTextAsync(string.Format("Appending log number {0} to an append blob.\r\n", i));

    Console.WriteLine("Press any key to exit.");
    Console.ReadKey();
}
Does anyone know if I'm doing something wrong with my attempt to append lines of text to an append blob? Otherwise, any idea why this would just lose data without throwing some kind of exception?
I'd really like to start using this as a repository for my application logs (since it was largely created for that purpose). However, it would be quite unreliable if logs would just go missing without warning if the rate of logging went above 5-6 logs per second.
Any thoughts or feedback would be greatly appreciated.

I now have a working solution based upon the information provided by @ZhaoxingLu-Microsoft. According to the API documentation, the AppendTextAsync() method should only be used in a single-writer scenario because the API internally uses the append-offset conditional header to avoid duplicate blocks, which does not work in a multiple-writer scenario.
Here is the documentation that specifies this behavior is by design:
https://msdn.microsoft.com/en-us/library/azure/mt423049.aspx
So the solution is to use the AppendBlockAsync() method instead. The following implementation appears to work correctly:
var tasks = new Task[100];
for (int i = 0; i < 100; i++)
{
    var message = string.Format("Appending log number {0} to an append blob.\r\n", i);
    var bytes = Encoding.UTF8.GetBytes(message);
    var stream = new MemoryStream(bytes);
    tasks[i] = blob.AppendBlockAsync(stream);
}
Task.WaitAll(tasks);
Please note that I am not explicitly disposing the memory stream in this example, as doing so would require a using block with async/await inside it so that the append operation finishes before the memory stream is disposed... but that causes a completely unrelated issue.
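For reference, a minimal sketch of what that disposal pattern might look like (illustrative only, not part of my working solution; the AppendLineAsync helper name is made up):
// Sketch only: await the append inside the using block so the stream
// isn't disposed before the operation completes.
static async Task AppendLineAsync(CloudAppendBlob blob, string message)
{
    var bytes = Encoding.UTF8.GetBytes(message);
    using (var stream = new MemoryStream(bytes))
    {
        await blob.AppendBlockAsync(stream);
    }
}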

You are using the async method incorrectly. blob.AppendTextAsync() is non-blocking, but the operation hasn't actually finished when the call returns. You should wait for all the async tasks to complete before exiting the process.
The following code is the correct usage:
var tasks = new Task[100];
for (int i = 0; i < 100; i++)
    tasks[i] = blob.AppendTextAsync(string.Format("Appending log number {0} to an append blob.\r\n", i));
Task.WaitAll(tasks);
Console.WriteLine("Press any key to exit.");
Console.ReadKey();

Related

Can the CosmosClient give out the number of retries that happened because of throttling (http 429)?

I have the following code where I bulk insert a lot of items
var batchSize = 200;
for (int i = 0; i < items.Count() / batchSize + 1; i++)
{
    var insertOperations = new List<Task<ItemResponse<MyType>>>();
    var batchItems = items.Skip(i * batchSize).Take(batchSize);
    if (batchItems.Count() == 0)
        break;
    foreach (var batchItem in batchItems)
    {
        insertOperations.Add(container.CreateItemAsync(batchItem));
    }
    await Task.WhenAll(insertOperations);
}
My CosmosClient is
CosmosClientOptions options = new CosmosClientOptions
{
    ConnectionMode = ConnectionMode.Direct,
    AllowBulkExecution = true,
    MaxRetryAttemptsOnRateLimitedRequests = 30,
    MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(180)
};
When inserting a lot of items I seem to get throttled, as per the metrics in the Azure portal. Since I allow long wait times and a lot of retries the code will take a bit longer but it will eventually get the work done. I would be interested in logging the number of retries needed because it would help me see which operations are causing the most throttling.
Is it possible to find out how many retries the cosmos client needed to perform the operation?
You can see them in the Diagnostics exposed as part of the operation response. Each container.CreateItemAsync(batchItem) returns an ItemResponse<T>.
The ItemResponse.Diagnostics.ToString() contains the information for that particular operation, including any retries it might have gone through.
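For example, a minimal sketch of reading the diagnostics per operation (the Console.WriteLine destination and the MyType name are just placeholders):
ItemResponse<MyType> response = await container.CreateItemAsync(batchItem);

// CosmosDiagnostics describes the request pipeline for this single operation,
// including any 429 retries it went through.
string diagnostics = response.Diagnostics.ToString();
Console.WriteLine(diagnostics);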

CosmosDB insertion loop stops inserting after a certain number of iterations (Node.js)

I'm doing a few tutorials on CosmosDB. I've got the database set up with the Core (SQL) API, and I'm using Node.js to interface with it. For development, I'm using the emulator.
This is the bit of code that I'm running:
const CosmosClient = require('@azure/cosmos').CosmosClient
process.env.NODE_TLS_REJECT_UNAUTHORIZED = "0";
const options = {
    endpoint: 'https://localhost:8081',
    key: REDACTED,
    userAgentSuffix: 'CosmosDBJavascriptQuickstart'
};
const client = new CosmosClient(options);
(async () => {
    let cost = 0;
    let i = 0
    while (i < 2000) {
        i += 1
        console.log(i + " Creating record, running cost: " + cost)
        let response = await client.database('TestDB').container('TestContainer').items.upsert({}).catch(console.log);
        cost += response.requestCharge;
    }
})()
This, without fail, stops at around iteration 1565 and doesn't continue. I've tried it with different payloads without much difference (it may do a few more or a few fewer iterations, but it almost always stops around that number).
On the flipside, a similar .NET Core example works great to insert 10,000 documents:
double cost = 0.0;
int i = 0;
while (i < 10000)
{
    i++;
    ItemResponse<dynamic> resp = await this.container.CreateItemAsync<dynamic>(new { id = Guid.NewGuid() });
    cost += resp.RequestCharge;
    Console.WriteLine("Created item {0} Operation consumed {1} RUs. Running cost: {2}", i, resp.RequestCharge, cost);
}
So I'm not sure what's going on.
So, after a bit of fiddling, this doesn't seem to have anything to do with CosmosDB or its library.
I was running this in the debugger, and Node would just crap out after x iterations. I noticed that if I didn't use a console.log it would actually work. Also, if I ran the script with node file.js it also worked. So there seems to be some sort of issue with debugging the script while also printing to the console. Not exactly sure what's up with that, but I'm going to go ahead and mark this as solved.

"The specified block list is invalid" while uploading blobs in parallel

I've a (fairly large) Azure application that uploads (fairly large) files in parallel to Azure blob storage.
In a few percent of uploads I get an exception:
The specified block list is invalid.
System.Net.WebException: The remote server returned an error: (400) Bad Request.
This is when we run a fairly innocuous looking bit of code to upload a blob in parallel to Azure storage:
public static void UploadBlobBlocksInParallel(this CloudBlockBlob blob, FileInfo file)
{
    blob.DeleteIfExists();
    blob.Properties.ContentType = file.GetContentType();
    blob.Metadata["Extension"] = file.Extension;
    byte[] data = File.ReadAllBytes(file.FullName);
    int numberOfBlocks = (data.Length / BlockLength) + 1;
    string[] blockIds = new string[numberOfBlocks];
    Parallel.For(
        0,
        numberOfBlocks,
        x =>
        {
            string blockId = Convert.ToBase64String(Guid.NewGuid().ToByteArray());
            int currentLength = Math.Min(BlockLength, data.Length - (x * BlockLength));
            using (var memStream = new MemoryStream(data, x * BlockLength, currentLength))
            {
                var blockData = memStream.ToArray();
                var md5Check = System.Security.Cryptography.MD5.Create();
                var md5Hash = md5Check.ComputeHash(blockData, 0, blockData.Length);
                blob.PutBlock(blockId, memStream, Convert.ToBase64String(md5Hash));
            }
            blockIds[x] = blockId;
        });
    byte[] fileHash = _md5Check.ComputeHash(data, 0, data.Length);
    blob.Metadata["Checksum"] = BitConverter.ToString(fileHash).Replace("-", string.Empty);
    blob.Properties.ContentMD5 = Convert.ToBase64String(fileHash);
    data = null;
    blob.PutBlockList(blockIds);
    blob.SetMetadata();
    blob.SetProperties();
}
All very mysterious; I'd think the algorithm we're using to calculate the block list should produce strings that are all the same length...
We ran into a similar issue; however, we were not specifying any block ID or even using block IDs anywhere. In our case, we were using:
using (CloudBlobStream stream = blob.OpenWrite(condition))
{
    //// [write data to stream]
    stream.Flush();
    stream.Commit();
}
This would cause The specified block list is invalid. errors under parallelized load. Switching this code to use the UploadFromStream(…) method while buffering the data into memory fixed the issue:
using (MemoryStream stream = new MemoryStream())
{
    //// [write data to stream]
    stream.Seek(0, SeekOrigin.Begin);
    blob.UploadFromStream(stream, condition);
}
Obviously this could have negative memory ramifications if too much data is buffered into memory, but this is a simplification. One thing to note is that UploadFromStream(...) uses Commit() in some cases, but checks additional conditions to determine the best method to use.
This exception can also happen when multiple threads open a stream to a blob with the same file name and try to write to it simultaneously.
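As a purely illustrative sketch (not from the original answer; the files and container variables are assumed to exist), one way to avoid that collision is to give each parallel writer its own blob name, for example by appending a unique suffix:
// Hypothetical example: every upload targets a distinct blob, so no two
// threads ever write to the same blob name concurrently.
Parallel.ForEach(files, file =>
{
    var blobName = string.Format("{0}-{1}", file.Name, Guid.NewGuid());
    var blockBlob = container.GetBlockBlobReference(blobName);

    using (var stream = File.OpenRead(file.FullName))
    {
        blockBlob.UploadFromStream(stream);
    }
});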
NOTE: this solution is based on the Azure JDK code, but I think we can safely assume that the pure REST version (and any other language, for that matter) will behave the same way.
Since I spent an entire work day fighting this issue, even though it is actually a corner case, I'll leave a note here; maybe it will help someone.
I did everything right. I had block IDs in the right order, I had block IDs of the same length, I had a clean container with no leftovers of some previous blocks (these three reasons are the only ones I was able to find via Google).
There was one catch: I've been building my block list for commit via
CloudBlockBlob.commitBlockList(Iterable<BlockEntry> blockList)
with use of this constructor:
BlockEntry(String id, BlockSearchMode searchMode)
passing
BlockSearchMode.COMMITTED
in the second argument. And THAT proved to be the root cause. Once I changed it to
BlockSearchMode.UNCOMMITTED
and eventually landed on the one-parameter constructor
BlockEntry(String id)
which uses UNCOMMITTED by default, committing the block list worked and the blob was successfully persisted.

Caching requests to reduce processing (TPL?)

I'm currently trying to reduce the number of similar requests being processed in a business layer by:
Caching the requests a method receives
Performing the slow processing task (once for all similar requests)
Return the result to each of the requesting method calls
Things to note, are that:
The original method calls are not currently in a async BeginMethod() / EndMethod(IAsyncResult)
The requests arrive faster than the time it takes to generate the output
I'm trying to use TPL where possible, as I am currently trying to learn more about this library
e.g. improving the following:
byte[] RequestSlowOperation(string operationParameter)
{
    // Perform slow task here...
}
Any thoughts?
Follow up:
class SomeClass
{
    private int _threadCount;

    public SomeClass(int threadCount)
    {
        _threadCount = threadCount;
        int parameter = 0;
        var taskFactory = Task<int>.Factory;
        for (int i = 0; i < threadCount; i++)
        {
            int i1 = i;
            taskFactory
                .StartNew(() => RequestSlowOperation(parameter))
                .ContinueWith(result => Console.WriteLine("Result {0} : {1}", result.Result, i1));
        }
    }

    private int RequestSlowOperation(int parameter)
    {
        Lazy<int> result2;
        var result = _cacheMap.GetOrAdd(parameter, new Lazy<int>(() => RequestSlowOperation2(parameter))).Value;
        //_cacheMap.TryRemove(parameter, out result2); <<<<< Thought I could remove immediately, but this causes blobby behaviour
        return result;
    }

    static ConcurrentDictionary<int, Lazy<int>> _cacheMap = new ConcurrentDictionary<int, Lazy<int>>();

    private int RequestSlowOperation2(int parameter)
    {
        Console.WriteLine("Evaluating");
        Thread.Sleep(100);
        return parameter;
    }
}
Here is a fast, safe and maintainable way to do this:
static ConcurrentDictionary<string, Lazy<byte[]>> cacheMap = new ConcurrentDictionary<string, Lazy<byte[]>>();

byte[] RequestSlowOperation(string operationParameter)
{
    return cacheMap.GetOrAdd(operationParameter, key => new Lazy<byte[]>(() => RequestSlowOperation2(key))).Value;
}

byte[] RequestSlowOperation2(string operationParameter)
{
    // Perform slow task here...
}
This will execute RequestSlowOperation2 at most once per key. Please be aware that the memory held by the dictionary will never be released.
The user delegate passed to the ConcurrentDictionary is not executed under lock, meaning that it could execute multiple times! My solution allows multiple lazies to be created but only one of them will ever be published and materialized.
Regarding locking: this solution will take locks, but it does not matter because the work items are far more expensive than the (few) lock operations.
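To illustrate the point, here is a small self-contained sketch of my own (a plain console demo, not production code) showing that the slow operation runs only once per key even with many concurrent callers:
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class LazyCacheDemo
{
    static readonly ConcurrentDictionary<string, Lazy<byte[]>> CacheMap =
        new ConcurrentDictionary<string, Lazy<byte[]>>();

    static byte[] RequestSlowOperation(string operationParameter)
    {
        // Several Lazy wrappers may be constructed under contention, but only the
        // one that wins the GetOrAdd race is ever materialized through .Value.
        return CacheMap.GetOrAdd(
            operationParameter,
            key => new Lazy<byte[]>(() => RequestSlowOperation2(key))).Value;
    }

    static byte[] RequestSlowOperation2(string operationParameter)
    {
        Console.WriteLine("Evaluating {0}", operationParameter); // printed once per key
        Thread.Sleep(100); // stand-in for the slow work
        return new byte[0];
    }

    static void Main()
    {
        // Ten concurrent callers, a single evaluation.
        Parallel.For(0, 10, _ => RequestSlowOperation("some-key"));
    }
}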
Honestly, the use of TPL as a technology here is not really important, this is just a straight up concurrency problem. You're trying to protect access to a shared resource (the cached data) and, to do that, the only approach is to lock. Either that or, if the cache entry does not already exist, you could allow all incoming threads to generate it and then subsequent requesters benefit from the cached value once it's stored, but there's little value in that if the resource is slow/expensive to generate and cache.
Perhaps some more details will make it clear exactly why you're trying to accomplish this without a lock. I'll happily revise my answer if more detail makes it clearer what you're trying to do.

Download an undefined number of files with HttpWebRequest.BeginGetResponse

I have to write a small app which downloads a few thousand files. Some of these files contain references to other files that must be downloaded as part of the same process. The following code downloads the initial list of files, but I would like to download the other files as part of the same loop. What is happening here is that the loop completes before the first request comes back. Any idea how to achieve this?
var countdownLatch = new CountdownEvent(Urls.Count);

string url;
while (Urls.TryDequeue(out url))
{
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.BeginGetResponse(
        new AsyncCallback(ar =>
        {
            using (HttpWebResponse response = (ar.AsyncState as HttpWebRequest).EndGetResponse(ar) as HttpWebResponse)
            {
                using (var sr = new StreamReader(response.GetResponseStream()))
                {
                    string myFile = sr.ReadToEnd();
                    // TODO: Look for a reference to another file. If found, queue a new Url.
                }
            }
            countdownLatch.Signal();
        }), webRequest);
}

countdownLatch.Wait();
One solution which comes to mind is to keep track of the number of pending requests and only finish the loop once no requests are pending and the Url queue is empty:
string url = null;
int requestCounter = 0;
AutoResetEvent requestFinished = new AutoResetEvent(false);

while (Interlocked.CompareExchange(ref requestCounter, 0, 0) > 0 || Urls.TryDequeue(out url))
{
    if (url != null)
    {
        Interlocked.Increment(ref requestCounter);

        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
        webRequest.BeginGetResponse(
            new AsyncCallback(ar =>
            {
                try
                {
                    using (HttpWebResponse response = (ar.AsyncState as HttpWebRequest).EndGetResponse(ar) as HttpWebResponse)
                    {
                        using (var sr = new StreamReader(response.GetResponseStream()))
                        {
                            string myFile = sr.ReadToEnd();
                            // TODO: Look for a reference to another file. If found, queue a new Url.
                        }
                    }
                }
                finally
                {
                    Interlocked.Decrement(ref requestCounter);
                    requestFinished.Set();
                }
            }), webRequest);

        url = null;
    }
    else
    {
        // No URL was dequeued, but requests are still pending.
        requestFinished.WaitOne();
    }
}
You are trying to write a web crawler. In order to write a good web crawler, you first need to define some parameters...
1) How many requests do you want to download simultaneously? In other words, how much throughput do you want? This will determine things like how many requests you want outstanding, what the thread pool size should be, etc.
2) You will have to have a queue of URLs. This queue is populated by each request that completes. You now need to decide what the growth strategy of the queue is. For example, you cannot have an unbounded queue, as you can pump work items into the queue faster than you can download from the network.
Given this, you can design a system as follows:
Have at most N worker threads that actually download from the web. They take one item from the queue and download the data. They parse the data and populate your URL queue.
If there are more than 'M' URLs in the queue, then the queue blocks and does not allow any more URLs to be queued. Now, here you can do one of two things. You can either cause the thread that is enqueuing to block, or you can just discard the work item being enqueued. Once another work item completes on another thread and a URL is dequeued, the blocked thread will be able to enqueue successfully.
With a system like this, you can ensure that you will not run out of system resources while downloading the data.
Implementation:
Note that if you are using async, then you are using an extra I/O thread to do the download. This is fine, as long as you are mindful of this fact. You can do a pure async implementation, where you have 'N' BeginGetResponse() calls outstanding, and for each one that completes, you start another one. This way you will always have 'N' requests outstanding.
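As a rough sketch of the design above (my own illustration, not the original poster's code; MaxWorkers, MaxQueuedUrls, and ExtractUrls are made-up names, and termination logic is omitted), a bounded queue plus N workers could look like this:
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;

class Crawler
{
    const int MaxWorkers = 4;        // "N": simultaneous downloads
    const int MaxQueuedUrls = 1000;  // "M": Add() blocks once the queue holds this many URLs

    static readonly BlockingCollection<string> Urls =
        new BlockingCollection<string>(boundedCapacity: MaxQueuedUrls);

    static void Main()
    {
        Urls.Add("http://example.com/start");

        var workers = new Task[MaxWorkers];
        for (int i = 0; i < MaxWorkers; i++)
        {
            workers[i] = Task.Run(() =>
            {
                // GetConsumingEnumerable() blocks while the queue is empty and
                // finishes once CompleteAdding() has been called.
                foreach (string url in Urls.GetConsumingEnumerable())
                {
                    using (var client = new WebClient())
                    {
                        string content = client.DownloadString(url);
                        foreach (string found in ExtractUrls(content))
                            Urls.Add(found); // blocks if the queue already holds MaxQueuedUrls items
                    }
                }
            });
        }

        // In a real crawler you would call Urls.CompleteAdding() once no downloads are
        // pending and the queue is empty (e.g., using a pending-request counter as in the
        // previous answer); otherwise this wait never returns.
        Task.WaitAll(workers);
    }

    // Placeholder: parse the downloaded content and return any URLs it references.
    static IEnumerable<string> ExtractUrls(string content)
    {
        yield break;
    }
}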
