"The specified block list is invalid" while uploading blobs in parallel

I've a (fairly large) Azure application that uploads (fairly large) files in parallel to Azure blob storage.
In a few percent of uploads I get an exception:
The specified block list is invalid.
System.Net.WebException: The remote server returned an error: (400) Bad Request.
This happens when we run a fairly innocuous-looking bit of code to upload a blob in parallel to Azure storage:
public static void UploadBlobBlocksInParallel(this CloudBlockBlob blob, FileInfo file)
{
    blob.DeleteIfExists();
    blob.Properties.ContentType = file.GetContentType();
    blob.Metadata["Extension"] = file.Extension;
    byte[] data = File.ReadAllBytes(file.FullName);
    int numberOfBlocks = (data.Length / BlockLength) + 1;
    string[] blockIds = new string[numberOfBlocks];
    Parallel.For(
        0,
        numberOfBlocks,
        x =>
        {
            string blockId = Convert.ToBase64String(Guid.NewGuid().ToByteArray());
            int currentLength = Math.Min(BlockLength, data.Length - (x * BlockLength));
            using (var memStream = new MemoryStream(data, x * BlockLength, currentLength))
            {
                var blockData = memStream.ToArray();
                var md5Check = System.Security.Cryptography.MD5.Create();
                var md5Hash = md5Check.ComputeHash(blockData, 0, blockData.Length);
                blob.PutBlock(blockId, memStream, Convert.ToBase64String(md5Hash));
            }
            blockIds[x] = blockId;
        });
    byte[] fileHash = _md5Check.ComputeHash(data, 0, data.Length);
    blob.Metadata["Checksum"] = BitConverter.ToString(fileHash).Replace("-", string.Empty);
    blob.Properties.ContentMD5 = Convert.ToBase64String(fileHash);
    data = null;
    blob.PutBlockList(blockIds);
    blob.SetMetadata();
    blob.SetProperties();
}
All very mysterious; I'd think the algorithm we're using to calculate the block list should produce strings that are all the same length...

We ran into a similar issue, however we were not specifying any block ID or even using the block ID anywhere. In our case, we were using:
using (CloudBlobStream stream = blob.OpenWrite(condition))
{
    //// [write data to stream]
    stream.Flush();
    stream.Commit();
}
This would cause "The specified block list is invalid." errors under parallelized load. Switching this code to use the UploadFromStream(…) method while buffering the data into memory fixed the issue:
using (MemoryStream stream = new MemoryStream())
{
    //// [write data to stream]
    stream.Seek(0, SeekOrigin.Begin);
    blob.UploadFromStream(stream, condition);
}
Obviously this could have negative memory ramifications if too much data is buffered into memory, but this is a simplification. One thing to note is that UploadFromStream(...) uses Commit() in some cases, but checks additional conditions to determine the best method to use.

This exception can also happen when multiple threads open a stream to a blob with the same file name and try to write to that blob simultaneously.

NOTE: this solution is based on Azure JDK code, but I think we can safely assume that the pure REST version will behave the same way (as will any other language, actually).
Since I spent an entire work day fighting this issue, I'll leave a note here even though it is a corner case; maybe it will help someone.
I did everything right. I had block IDs in the right order, I had block IDs of the same length, I had a clean container with no leftovers of some previous blocks (these three reasons are the only ones I was able to find via Google).
There was one catch: I've been building my block list for commit via
CloudBlockBlob.commitBlockList(Iterable<BlockEntry> blockList)
with use of this constructor:
BlockEntry(String id, BlockSearchMode searchMode)
passing
BlockSearchMode.COMMITTED
in the second argument. And THAT proved to be the root cause. Once I changed it to
BlockSearchMode.UNCOMMITTED
and eventually landed on the one-parameter constructor
BlockEntry(String id)
which uses UNCOMMITTED by default, committing the block list worked and the blob was successfully persisted.
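The checks listed in this answer (block IDs of equal length, listed in upload order) can be exercised without touching storage at all. The sketch below is plain Java with no Azure SDK calls; BlockPlan, blockCount, blockId and blockIds are hypothetical names, and the 6-digit ID width is an assumption. It also uses ceiling division for the block count, because an expression of the form (length / blockLength) + 1, as in the question, yields one extra zero-length block whenever the length is an exact multiple of the block length.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.Locale;

class BlockPlan {
    // Ceiling division: the number of blocks needed, with no trailing empty block.
    static int blockCount(long dataLength, int blockLength) {
        return (int) ((dataLength + blockLength - 1) / blockLength);
    }

    // Zero-padded index (assumed width: 6 digits), Base64-encoded,
    // so every ID has the same length.
    static String blockId(int index) {
        String padded = String.format(Locale.ROOT, "%06d", index);
        return Base64.getEncoder().encodeToString(padded.getBytes(StandardCharsets.US_ASCII));
    }

    // IDs in upload order, ready to be wrapped in block entries for the commit.
    static List<String> blockIds(long dataLength, int blockLength) {
        int count = blockCount(dataLength, blockLength);
        List<String> ids = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            ids.add(blockId(i));
        }
        return ids;
    }
}
```

Because the IDs are derived from the block index rather than generated randomly, the list passed to the commit call can be rebuilt from the block count alone.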

Related

Append to Azure Append Blob Using AppendTextAsync Results in Missing Data

I'm attempting to create a logger for an application in Azure using the new Azure append blobs and the Azure Storage SDK 6.0.0. So I created a quick test application to get a better understanding of append blobs and their performance characteristics.
My test program simply loops 100 times and appends a line of text to the append blob. If I use the synchronous AppendText() method everything works fine, however, it appears to be limited to writing about 5-6 appends per second. So I attempted to use the asynchronous AppendTextAsync() method; however, when I use this method, the loop runs much faster (as expected) but the append blob is missing about 98% of the appended text without any exception being thrown.
If I add a Thread.Sleep and sleep for 100 milliseconds between each append operation, I end up with about 50% of the data. Sleep for 1 second and I get all of the data.
This seems similar to an issue that was discovered in v5.0.0 but was fixed in v5.0.2: https://github.com/Azure/azure-storage-net/releases/tag/v5.0.2
Here is my test code if you'd like to try to reproduce this issue:
static void Main(string[] args)
{
    var accountName = "<account-name>";
    var accountKey = "<account-key>";
    var credentials = new StorageCredentials(accountName, accountKey);
    var account = new CloudStorageAccount(credentials, true);
    var client = account.CreateCloudBlobClient();
    var container = client.GetContainerReference("<container-name>");
    container.CreateIfNotExists();
    var blob = container.GetAppendBlobReference("append-blob.txt");
    blob.CreateOrReplace();
    for (int i = 0; i < 100; i++)
        blob.AppendTextAsync(string.Format("Appending log number {0} to an append blob.\r\n", i));
    Console.WriteLine("Press any key to exit.");
    Console.ReadKey();
}
Does anyone know if I'm doing something wrong with my attempt to append lines of text to an append blob? Otherwise, any idea why this would just lose data without throwing some kind of exception?
I'd really like to start using this as a repository for my application logs (since it was largely created for that purpose). However, it would be quite unreliable if logs would just go missing without warning if the rate of logging went above 5-6 logs per second.
Any thoughts or feedback would be greatly appreciated.

I now have a working solution based upon the information provided by @ZhaoxingLu-Microsoft. According to the API documentation, the AppendTextAsync() method should only be used in a single-writer scenario, because the API internally uses the append-offset conditional header to avoid duplicate blocks, which does not work in a multiple-writer scenario.
Here is the documentation that specifies this behavior is by design:
https://msdn.microsoft.com/en-us/library/azure/mt423049.aspx
So the solution is to use the AppendBlockAsync() method instead. The following implementation appears to work correctly:
// One task per append, so that all of them can be awaited before the process exits
var tasks = new Task[100];
for (int i = 0; i < 100; i++)
{
    var message = string.Format("Appending log number {0} to an append blob.\r\n", i);
    var bytes = Encoding.UTF8.GetBytes(message);
    var stream = new MemoryStream(bytes);
    tasks[i] = blob.AppendBlockAsync(stream);
}
Task.WaitAll(tasks);
Please note that I am not explicitly disposing the memory stream in this example as that solution would entail a using block with an async/await inside the using block in order to wait for the async append operation to finish before disposing the memory stream... but that causes a completely unrelated issue.

You are using the async method incorrectly. blob.AppendTextAsync() is non-blocking, so the append has not necessarily finished when the call returns. You should wait for all the async tasks before exiting the process.
The following code shows the correct usage:
var tasks = new Task[100];
for (int i = 0; i < 100; i++)
    tasks[i] = blob.AppendTextAsync(string.Format("Appending log number {0} to an append blob.\r\n", i));
Task.WaitAll(tasks);
Console.WriteLine("Press any key to exit.");
Console.ReadKey();

j2me - Out of memory exception, does it have anything to do with the maximum heap or jar size?

I'm currently developing an app for taking orders. Before I ask my question, let me give you a few details of the basic functionality of my app:
The first thing the app does once the user is logged in, is to read data from a webservice (products, prices and customers) so that the user can work offline.
Once the user has all the necessary data, they can start taking orders.
Finally, at the end of the day, the user sends all the orders to a server for its processing.
Now that you know how my app works, here's the problem:
When I run my app in the emulator it works, but now that I'm running tests on a physical device, the following error appears on the screen when I read data from the web service:
Out of memory error java/lang/OutOfMemoryError
At first I thought that the data that is read from the WS (in json format) was too much for the StringBuffer :
hc = (HttpConnection) Connector.open(mUrl);
if (hc.getResponseCode() == HttpConnection.HTTP_OK) {
    is = hc.openInputStream();
    int ch;
    while ((ch = is.read()) != -1) {
        stringBuffer.append((char) ch);
    }
}
But it turned out that the error occurs when I tried to convert the result from the WS (string in json Format) into a JSONArray .
I do this because I need to loop through all the objects (for example the Products) and then save them using RMS. Here's part of the code I use:
rs = RecordStore.openRecordStore(mRecordStoreName, true);
try {
// This is the line that throws the exception
JSONArray js = new JSONArray(data);
for (int i = 0; i < js.length(); i++) {
JSONObject jsObj = js.getJSONObject(i);
stringJSON = jsObj.toString();
id = saveRecord(stringJSON, rs);
}
The saveRecord Method
public int saveRecord(String stringJSON, RecordStore rs) throws JSONException {
int id = -1;
try {
byte[] raw = stringJSON.getBytes();
id= rs.addRecord(raw, 0, raw.length);
} catch (RecordStoreException ex) {
System.out.println(ex);
}
return id;
}
Searching a little, I found these functions: Runtime.getRuntime().freeMemory() and Runtime.getRuntime().totalMemory()
With those functions I found out that the total memory is 2097152 bytes and the free memory before the error occurs is 69584 bytes.
Now the big question (or rather questions) is:
Where is this small amount of memory taken from? The heap size?
The device's specifications say that it has 4 MB.
Another thing that worries me is whether RMS increases the JAR size, because the specifications also say that the maximum JAR size is 2 MB.
As always, I really appreciate all your comments and suggestions.
Thanks in advance.

I don't know the exact reason, but these J2ME devices do have a memory problem.
My app works with contacts, and when I tried to receive the JSON of contacts from the server, the conversion caused an out-of-memory error if the JSON was too long.
The solution I found is paging: divide the data that you receive from the server into parts and read it part by part.
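A minimal sketch of the paging idea, assuming the web service can return a slice of the records given an offset and a limit (the question does not say whether it can). PageSource, RecordSink and downloadAll are hypothetical names; a real MIDlet would implement them over HttpConnection and RecordStore. Only one page is held in memory at a time.

```java
class PagedDownload {
    // Hypothetical view of the web service: returns at most `limit` records
    // starting at `offset`, and a shorter (possibly empty) array at the end.
    interface PageSource {
        String[] fetchPage(int offset, int limit);
    }

    // Hypothetical sink, e.g. a thin wrapper around the record store.
    interface RecordSink {
        void save(String record);
    }

    // Fetches page after page until a short page signals the end.
    // Returns the total number of records saved.
    static int downloadAll(PageSource source, RecordSink sink, int pageSize) {
        int offset = 0;
        while (true) {
            String[] page = source.fetchPage(offset, pageSize);
            for (int i = 0; i < page.length; i++) {
                sink.save(page[i]);
            }
            offset += page.length;
            if (page.length < pageSize) {
                return offset;
            }
        }
    }
}
```

The page size is the tuning knob: it should be small enough that one page of JSON, plus its parsed form, fits comfortably in the free heap reported by Runtime.getRuntime().freeMemory().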

Caching requests to reduce processing (TPL?)

I'm currently trying to reduce the number of similar requests being processed in a business layer by:
Caching the requests a method receives
Performing the slow processing task (once for all similar requests)
Returning the result to each requesting method call
Things to note are that:
The original method calls do not currently follow the async BeginMethod() / EndMethod(IAsyncResult) pattern
The requests arrive faster than the time it takes to generate the output
I'm trying to use TPL where possible, as I am currently trying to learn more about this library
E.g., improving the following:
byte[] RequestSlowOperation(string operationParameter)
{
    // Perform slow task here...
}
Any thoughts?
Follow up:
class SomeClass
{
private int _threadCount;
public SomeClass(int threadCount)
{
_threadCount = threadCount;
int parameter = 0;
var taskFactory = Task<int>.Factory;
for (int i = 0; i < threadCount; i++)
{
int i1 = i;
taskFactory
.StartNew(() => RequestSlowOperation(parameter))
.ContinueWith(result => Console.WriteLine("Result {0} : {1}", result.Result, i1));
}
}
private int RequestSlowOperation(int parameter)
{
Lazy<int> result2;
var result = _cacheMap.GetOrAdd(parameter, new Lazy<int>(() => RequestSlowOperation2(parameter))).Value;
//_cacheMap.TryRemove(parameter, out result2); <<<<< Thought I could remove immediately, but this causes blobby behaviour
return result;
}
static ConcurrentDictionary<int, Lazy<int>> _cacheMap = new ConcurrentDictionary<int, Lazy<int>>();
private int RequestSlowOperation2(int parameter)
{
Console.WriteLine("Evaluating");
Thread.Sleep(100);
return parameter;
}
}

Here is a fast, safe and maintainable way to do this:
static readonly ConcurrentDictionary<string, Lazy<byte[]>> cacheMap = new ConcurrentDictionary<string, Lazy<byte[]>>();
byte[] RequestSlowOperation(string operationParameter)
{
    // The value factory receives the key; the Lazy defers the slow work until .Value is read
    return cacheMap.GetOrAdd(operationParameter, key => new Lazy<byte[]>(() => RequestSlowOperation2(key))).Value;
}
byte[] RequestSlowOperation2(string operationParameter)
{
    // Perform slow task here...
}
This will execute RequestSlowOperation2 at most once per key. Please be aware that the memory held by the dictionary will never be released.
The user delegate passed to the ConcurrentDictionary is not executed under lock, meaning that it could execute multiple times! My solution allows multiple lazies to be created but only one of them will ever be published and materialized.
Regarding locking: this solution will take locks, but it does not matter because the work items are far more expensive than the (few) lock operations.
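For comparison, the same publish-one-winner idea can be sketched in Java, where FutureTask plays the role of Lazy<T> and putIfAbsent the role of GetOrAdd. This is an analogue written for illustration, not a translation of any .NET API; OncePerKeyCache and its members are hypothetical names. Several candidate tasks may be created under contention, but only the one that wins putIfAbsent is ever run.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.function.Function;

class OncePerKeyCache<K, V> {
    private final ConcurrentMap<K, FutureTask<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> slowOperation;

    OncePerKeyCache(Function<K, V> slowOperation) {
        this.slowOperation = slowOperation;
    }

    V get(K key) {
        FutureTask<V> task = cache.get(key);
        if (task == null) {
            // Creating a candidate is cheap; the slow work has not started yet.
            FutureTask<V> candidate = new FutureTask<V>(() -> slowOperation.apply(key));
            task = cache.putIfAbsent(key, candidate);
            if (task == null) {
                // This thread published its candidate, so it runs the slow work.
                task = candidate;
                task.run();
            }
        }
        try {
            // Every other caller blocks here until the winner has finished.
            return task.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

As with the dictionary above, entries are never evicted in this sketch, so the memory caveat applies here too.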

Honestly, the use of TPL as a technology here is not really important, this is just a straight up concurrency problem. You're trying to protect access to a shared resource (the cached data) and, to do that, the only approach is to lock. Either that or, if the cache entry does not already exist, you could allow all incoming threads to generate it and then subsequent requesters benefit from the cached value once it's stored, but there's little value in that if the resource is slow/expensive to generate and cache.
Perhaps some more details will make it clear exactly why you're trying to accomplish this without a lock. I'll happily revise my answer if more detail makes it clearer what you're trying to do.

Download an undefined number of files with HttpWebRequest.BeginGetResponse

I have to write a small app which downloads a few thousand files. Some of these files contain references to other files that must be downloaded as part of the same process. The following code downloads the initial list of files, but I would like to download the other files as part of the same loop. What is happening here is that the loop completes before the first request comes back. Any idea how to achieve this?
var countdownLatch = new CountdownEvent(Urls.Count);
string url;
while (Urls.TryDequeue(out url))
{
    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    webRequest.BeginGetResponse(
        new AsyncCallback(ar =>
        {
            using (HttpWebResponse response = (ar.AsyncState as HttpWebRequest).EndGetResponse(ar) as HttpWebResponse)
            {
                using (var sr = new StreamReader(response.GetResponseStream()))
                {
                    string myFile = sr.ReadToEnd();
                    // TODO: Look for a reference to another file. If found, queue a new Url.
                }
            }
        }), webRequest);
}
countdownLatch.Wait();

One solution which comes to mind is to keep track of the number of pending requests and only finish the loop once no requests are pending and the Url queue is empty:
int requestCounter = 0;
AutoResetEvent requestFinished = new AutoResetEvent(false);
while (true)
{
    // Read the counter before dequeuing: if nothing was pending and the
    // queue is empty, no callback can add further URLs and we are done.
    bool requestsPending = Interlocked.CompareExchange(ref requestCounter, 0, 0) > 0;
    string url;
    if (Urls.TryDequeue(out url))
    {
        Interlocked.Increment(ref requestCounter);
        HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
        webRequest.BeginGetResponse(
            new AsyncCallback(ar =>
            {
                try
                {
                    using (HttpWebResponse response = (ar.AsyncState as HttpWebRequest).EndGetResponse(ar) as HttpWebResponse)
                    {
                        using (var sr = new StreamReader(response.GetResponseStream()))
                        {
                            string myFile = sr.ReadToEnd();
                            // TODO: Look for a reference to another file. If found, queue a new Url.
                        }
                    }
                }
                finally
                {
                    Interlocked.Decrement(ref requestCounter);
                    requestFinished.Set();
                }
            }), webRequest);
    }
    else if (requestsPending)
    {
        // no url but requests are still pending
        requestFinished.WaitOne();
    }
    else
    {
        break;
    }
}

You are trying to write a web crawler. In order to write a good web crawler, you first need to define some parameters...
1) How many requests do you want to download simultaneously? In other words, how much throughput do you want? This will determine things like how many requests you want outstanding, what the thread pool size should be, etc.
2) You will have to have a queue of URLs. This queue is populated by each request that completes. You now need to decide what the growth strategy of the queue is. For example, you cannot have an unbounded queue, as you can pump work items into the queue faster than you can download from the network.
Given this, you can design a system as follows:
Have at most N worker threads that actually download from the web. They take one item from the queue and download the data. They parse the data and populate your URL queue.
If there are more than 'M' URLs in the queue, then the queue blocks and does not allow any more URLs to be queued. Here you can do one of two things: either cause the thread that is enqueuing to block, or just discard the work item being enqueued. Once another work item completes on another thread and a URL is dequeued, the blocked thread will be able to enqueue successfully.
With a system like this, you can ensure that you will not run out of system resources while downloading the data.
Implementation:
Note that if you are using async, then you are using an extra I/O thread to do the download. This is fine, as long as you are mindful of this fact. You can do a pure async implementation, where you can have 'N' BeginGetResponse() outstanding, and for each one that completes, you start another one. This way you will always have 'N' requests outstanding.
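The bounded-queue design described in this answer can be sketched with plain threads. This is a simplified illustration in Java rather than the BeginGetResponse-based approach from the question: Fetcher is a hypothetical stand-in for the download-and-parse step, BoundedCrawler and its methods are made-up names, and the sketch takes the "discard" option when the queue is full. A pending counter (queued plus in-flight URLs) tells the workers when to stop.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedCrawler {
    // Hypothetical download step: fetch one URL and return the URLs it references.
    interface Fetcher {
        List<String> fetchAndExtractLinks(String url);
    }

    // Runs `workers` threads over a queue holding at most `maxQueued` URLs.
    // Returns the set of URLs that were accepted for download.
    static Set<String> crawl(List<String> seeds, Fetcher fetcher, int workers, int maxQueued) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(maxQueued);
        Set<String> seen = ConcurrentHashMap.newKeySet();
        AtomicInteger pending = new AtomicInteger(); // URLs queued or being downloaded

        Runnable worker = () -> {
            try {
                while (true) {
                    String url = queue.poll(50, TimeUnit.MILLISECONDS);
                    if (url == null) {
                        if (pending.get() == 0) return; // nothing queued, nothing in flight
                        continue;
                    }
                    try {
                        for (String link : fetcher.fetchAndExtractLinks(url)) {
                            enqueue(link, queue, seen, pending);
                        }
                    } finally {
                        pending.decrementAndGet(); // this URL is finished
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        for (String seed : seeds) enqueue(seed, queue, seen, pending);
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(worker);
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return seen;
    }

    // Counts the URL as pending before offering it; drops it if the queue is full.
    private static void enqueue(String url, BlockingQueue<String> queue,
                                Set<String> seen, AtomicInteger pending) {
        if (!seen.add(url)) return; // already queued or downloaded
        pending.incrementAndGet();
        if (!queue.offer(url)) {
            pending.decrementAndGet();
            seen.remove(url); // discarded; may be picked up if it is referenced again
        }
    }
}
```

Replacing queue.offer with queue.put would give the blocking variant instead, but workers that both produce and consume can then stall each other once the queue is full, which is why this sketch discards.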

How do I tell my C# application to close a file it has open in a FileInfo object or possibly Bitmap object?

So I was writing a quick application to sort my wallpapers neatly into folders according to aspect ratio. Everything is going smoothly until I try to actually move the files (using FileInfo.MoveTo()). The application throws an exception:
System.IO.IOException
The process cannot access the file because it is being used by another process.
The only problem is, there is no other process running on my computer that has that particular file open. I thought perhaps that because of the way I was using the file, perhaps some internal system subroutine on a different thread or something has the file open when I try to move it. Sure enough, a few lines above that, I set a property that calls an event that opens the file for reading. I'm assuming at least some of that happens asynchronously. Is there any way to make it run synchronously? I must change that property or rewrite much of the code.
Here are some relevant bits of code, please forgive the crappy Visual C# default names for things, this isn't really a release quality piece of software yet:
private void button1_Click(object sender, EventArgs e)
{
for (uint i = 0; i < filebox.Items.Count; i++)
{
if (!filebox.GetItemChecked((int)i)) continue;
//This calls the selectedIndexChanged event to change the 'selectedImg' variable
filebox.SelectedIndex = (int)i;
if (selectedImg == null) continue;
Size imgAspect = getImgAspect(selectedImg);
//This is gonna be hella hardcoded for now
//In the future this should be changed to be generic
//and use some kind of setting schema to determine
//the sort/filter results
FileInfo file = ((FileInfo)filebox.SelectedItem);
if (imgAspect.Width == 8 && imgAspect.Height == 5)
{
finalOut = outPath + "\\8x5\\" + file.Name;
}
else if (imgAspect.Width == 5 && imgAspect.Height == 4)
{
finalOut = outPath + "\\5x4\\" + file.Name;
}
else
{
finalOut = outPath + "\\Other\\" + file.Name;
}
//Me trying to tell C# to close the file
selectedImg.Dispose();
previewer.Image = null;
//This is where the exception is thrown
file.MoveTo(finalOut);
}
}
//The suspected event handler
private void filebox_SelectedIndexChanged(object sender, EventArgs e)
{
FileInfo selected;
if (filebox.SelectedIndex >= filebox.Items.Count || filebox.SelectedIndex < 0) return;
selected = (FileInfo)filebox.Items[filebox.SelectedIndex];
try
{
//The suspected line of code
selectedImg = new Bitmap((Stream)selected.OpenRead());
}
catch (Exception) { selectedImg = null; }
if (selectedImg != null)
previewer.Image = ResizeImage(selectedImg, previewer.Size);
else
previewer.Image = null;
}
I have a long-fix in mind (that's probably more efficient anyway) but it presents more problems still :/
Any help would be greatly appreciated.

Since you are using selectedImg as a class-scoped variable, the file stays locked for as long as the Bitmap is open. I would use a using statement and copy the Bitmap into the variable you are using; this releases the lock that the Bitmap keeps on the file.
Something like this:
using (Stream stream = selected.OpenRead())
using (Bitmap img = new Bitmap(stream))
{
    // new Bitmap(img) copies the pixels, so the stream and the source bitmap can be disposed
    selectedImg = new Bitmap(img);
}

New answer:
I looked at the line where you call OpenRead(). Clearly, this locks your file. It would be better to provide the file path instead of a stream, because you can't dispose the stream without the bitmap becoming invalid.
Another thing in your code that could be bad practice is binding to FileInfo. It is better to create a data-transfer object/value object, some object which has the properties you need to show in your control, and bind to a collection of that type. That would help to avoid file locks.
On the other hand, you could use a trick: why not show the images stretched to screen resolution and compressed, so that they are far smaller than the originals, and provide a button called "Show in HQ"? That should solve the problem of preloading HD images. When the user clicks the "Show in HQ" button, load that image into memory, and dispose it when it is closed.
Is that OK for you?
If I'm not wrong, FileInfo doesn't lock any file. You're not opening it, just reading its metadata.
On the other hand, if your application shows images, you should move the visible ones into memory and load them into your form from a memory stream.
That's reasonable because you can open a file stream, read its bytes and move them to a memory stream, releasing the lock on that file.
NOTE: This solution is fine for not-so-large images... Let me know if you're working with HD images.

using(selectedImg = new Bitmap((Stream)selected))
Will that do it?
