I have about 1,000,000 JSON files that I would like to update every 30 minutes. Each update simply appends a new array to the end of the existing content.
A single update uses code similar to:
CloudBlockBlob blockBlob = container.GetBlockBlobReference(blobName);
JObject jObject = null;

// If the blob exists, then we may need to update it.
if (blockBlob.Exists())
{
    MemoryStream memoryStream = new MemoryStream();
    blockBlob.DownloadToStream(memoryStream);
    jObject = JsonConvert.DeserializeObject(Encoding.UTF8.GetString(memoryStream.ToArray())) as JObject;
} // End of: the blob exists

if (null == jObject)
{
    jObject = new JObject();
    jObject.Add(new JProperty("identifier", identifier));
} // End of: the blob did not exist

JArray jsonArray = new JArray();
jObject.Add(new JProperty(string.Format("entries{0}", timestamp), jsonArray));

foreach (var entry in newEntries)
{
    jsonArray.Add(new JObject(
        new JProperty("someId", entry.id),
        new JProperty("someValue", value)
    ));
} // End of loop

string jsonString = JsonConvert.SerializeObject(jObject);

// Upload
blockBlob.Properties.ContentType = "application/json";
blockBlob.UploadFromStream(new MemoryStream(Encoding.UTF8.GetBytes(jsonString)));
Basically:
Check if the blob exists.
If it does, download the data and create a JSON object from the existing details.
If it does not, create a new object with the details.
Push the update to the blob.
The problem with this is performance. I've done what I can to increase it: the updates run in five parallel threads, and I have set ServicePointManager.UseNagleAlgorithm to false.
It still runs slowly, though; roughly 100,000 updates can take up to an hour.
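For reference, a minimal sketch of that client-side tuning (UseNagleAlgorithm comes from the description above; DefaultConnectionLimit and Expect100Continue are extra tweaks commonly recommended for Azure Storage and are assumptions, not part of the original code):
// Apply these once at startup, before any storage requests are made.
ServicePointManager.UseNagleAlgorithm = false;
ServicePointManager.Expect100Continue = false;
ServicePointManager.DefaultConnectionLimit = 100; // raise the small default so parallel updates aren't throttled client-side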
So I guess my questions are:
Should I be using Azure Blob storage for this? (I'm open to alternative suggestions).
If so, any suggestions on improving performance?
Note: The file basically contains a history of events and I cannot re-generate the entire file based on existing data. This is why the contents are downloaded before being updated.
I'm trying out the Azure Blob Change Feed feature and it behaves strangely with Append Blobs: append events are missing from the feed.
My scenario is:
Create a storage account and enable the change feed feature.
Create the Append Blob if it does not exist (1) and append some input to it (2).
private void WriteBlob(string input)
{
    MemoryStream stream = new MemoryStream(Encoding.UTF8.GetBytes(input));
    try
    {
        if (client == null)
        {
            // ClientSecretCredential takes the tenant id, client id and client secret.
            var credential = new ClientSecretCredential("...", "...", "...");
            client = new AppendBlobClient(new Uri("..."), credential);
        }
        client.CreateIfNotExists(); // (1)
        client.AppendBlock(stream); // (2)
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }
}
Fetch the Change Feed entries in a separate console app.
public static List<BlobChangeFeedEvent> GetChanges()
{
    // ClientSecretCredential takes the tenant id, client id and client secret.
    var credential = new ClientSecretCredential("...", "...", "...");
    BlobChangeFeedClient blobChangeFeedClient = new BlobChangeFeedClient(new Uri("..."), credential);

    List<BlobChangeFeedEvent> events = new List<BlobChangeFeedEvent>();
    foreach (BlobChangeFeedEvent changeFeedEvent in blobChangeFeedClient.GetChanges())
    {
        events.Add(changeFeedEvent);
    }
    return events;
}
The problem is that after a few runs of the WriteBlob method I only get a single change feed event, which corresponds to the blob creation; the subsequent appends are missing from the feed, even though the input is appended to the blob successfully.
Why does it work this way? I didn't find anything in the docs about the Append Blob type being handled specially by the change feed.
Currently, append events for append blobs are not supported in the change feed.
As per this doc, only the following event types are supported:
BlobCreated
BlobDeleted
BlobPropertiesUpdated
BlobSnapshotCreated
And in the source code of the Azure.Storage.Blobs.ChangeFeed package, there is no append event type.
A feature request for this has been submitted; hopefully it will be added in a future release.
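As a quick sanity check, here is a minimal sketch (reusing the GetChanges helper above, plus System.Linq) that tallies the event types coming back from the feed; for an append blob you should only see BlobCreated:
// Group the returned change feed events by type to see which event types actually appear.
foreach (var group in GetChanges().GroupBy(e => e.EventType))
{
    Console.WriteLine($"{group.Key}: {group.Count()} event(s)");
}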
The following trigger removes EXIF data from blobs (which are images) after they are uploaded to Azure Storage. The problem is that the blob trigger fires at least five times for each blob.
In the trigger, the blob is updated by writing a new stream of data to it. I had assumed that blob receipts would prevent further firing of the blob trigger against this blob.
[FunctionName("ExifDataPurge")]
public async System.Threading.Tasks.Task RunAsync(
[BlobTrigger("container/{name}.{extension}", Connection = "")]CloudBlockBlob image,
string name,
string extension,
string blobTrigger,
ILogger log)
{
log.LogInformation($"C# Blob trigger function Processed blob\n Name:{name}");
try
{
var memoryStream = new MemoryStream();
await image.DownloadToStreamAsync(memoryStream);
memoryStream.Position = 0;
using (Image largeImage = Image.Load(memoryStream))
{
if (largeImage.Metadata.ExifProfile != null)
{
//strip the exif data from the image.
for (int i = 0; i < largeImage.Metadata.ExifProfile.Values.Count; i++)
{
largeImage.Metadata.ExifProfile.RemoveValue(largeImage.Metadata.ExifProfile.Values[i].Tag);
}
var exifStrippedImage = new MemoryStream();
largeImage.Save(exifStrippedImage, new SixLabors.ImageSharp.Formats.Jpeg.JpegEncoder());
exifStrippedImage.Position = 0;
await image.UploadFromStreamAsync(exifStrippedImage);
}
}
}
catch (UnknownImageFormatException unknownImageFormatException)
{
log.LogInformation($"Blob is not a valid Image : {name}.{extension}");
}
}
Triggers are handled in such a way that the runtime tracks which blobs have been processed by storing receipts in the azure-webjobs-hosts container. Any blob that has no receipt, or an old receipt (based on the blob's ETag), will be processed (or reprocessed).
Since you are calling await image.UploadFromStreamAsync(exifStrippedImage);, the trigger fires again (assuming the blob hasn't already been marked as processed).
When you call await image.UploadFromStreamAsync(exifStrippedImage);, it updates the blob, so the blob function triggers again.
To break the loop, you can check an existing property on the blob, such as CacheControl, and skip the update if it has already been set.
// Set the CacheControl property to expire in 1 hour (3600 seconds)
blob.Properties.CacheControl = "max-age=3600";
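A minimal sketch of that guard, assuming the CloudBlockBlob binding from the question (the max-age value is arbitrary):
// Skip blobs that a previous invocation already rewrote; otherwise set the marker
// property before uploading so the next trigger run short-circuits.
await image.FetchAttributesAsync();
if (!string.IsNullOrEmpty(image.Properties.CacheControl))
{
    log.LogInformation($"Blob {name} has already been processed, skipping.");
    return;
}

// ... strip the EXIF data as in the question ...

image.Properties.CacheControl = "max-age=3600";
await image.UploadFromStreamAsync(exifStrippedImage); // properties are sent with the upload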
So I've addressed this by storing a Status in metadata against the blob as it's processed.
https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-container-properties-metadata
The trigger then contains a guard to check for the metadata.
if (image.Metadata.ContainsKey("Status") && image.Metadata["Status"] == "Processed")
{
    // Any subsequent trigger for this blob enters this block and does nothing.
    log.LogInformation($"blob: {name} has already been processed");
}
else
{
    // First time the trigger fires for this blob: strip the EXIF data, then mark it.
    image.Metadata.Add("Status", "Processed");
    await image.SetMetadataAsync();
}
The other answers pointed me in the right direction, but I think it is more correct to use metadata. Storing an ETag elsewhere seems redundant when we can store metadata, and using CacheControl feels like too much of a hack; other developers might be confused about what I did and why.
I'm currently using v2 of Azure Functions. I've set the environment to 64 bit and am compiling against .NET Standard 2.0; host.json specifies version 2.
I'm reading in a .csv and it works fine for smaller files. But when I read a 180 MB .csv into a List of string[], memory balloons to over a GB on read, and when I try to parse it, usage goes over 2 GB and then an 'Out of Memory' exception is thrown. Even running on an App Service plan with more than 3.5 GB of memory hasn't solved the issue.
Edit:
I'm using this:
Uri blobUri = AppendSasOnUri(blobName);
_webClient = new WebClient();
Stream sourceStream = _webClient.OpenRead(blobUri);
_reader = new StreamReader(sourceStream);
However, since it's a CSV, I'm splitting out entire columns of data, so it's pretty hard to get away from this:
internal async Task<List<string[]>> ReadCsvAsync()
{
    while (!_reader.EndOfStream)
    {
        string[] currentCsvRow = await ReadCsvRowAsync();
        _fullBlobCsv.Add(currentCsvRow);
    }
    return _fullBlobCsv;
}
The goal is to store JSON into a blob when all is said and done.
Try using a stream (StreamReader) to read the input .csv file and process one line at a time.
I'm able to parse 300 MB files on the Consumption plan using streams. My use case isn't the same, but it's similar: parse a large concatenated PDF file, split it into 5,000+ smaller files, and store the separated files in a blob container. Below is my code for reference.
For your use case you may want to use CloudAppendBlob instead of CloudBlockBlob if you're pushing all the parsed data into a single blob (see the sketch after the code).
public static async Task ExtractSmallerFiles(CloudBlockBlob myBlob, string fileDate, ILogger log)
{
    using (var reader = new StreamReader(await myBlob.OpenReadAsync()))
    {
        CloudBlockBlob blockBlob = null;
        var fileContents = new StringBuilder(string.Empty);

        while (!reader.EndOfStream)
        {
            var line = reader.ReadLine();
            if (line.StartsWith("%%MS_SKEY_0000_000_PDF:"))
            {
                var matches = Regex.Match(line, @"%%MS_SKEY_0000_000_PDF: A(\d+)_SMFL_B1234_D(\d{8})_A\d+_M(\d{15}) _N\d+");
                var smallFileDate = matches.Groups[2];
                var accountNumber = matches.Groups[3];
                var fileName = $"SmallerFiles/{smallFileDate}/{accountNumber}.pdf";
                blockBlob = myBlob.Container.GetBlockBlobReference(fileName);
            }

            fileContents.AppendLine(line);

            if (line.Equals("%%EOF"))
            {
                log.LogInformation($"Uploading {fileContents.Length} characters to {blockBlob.Name}");
                await blockBlob.UploadTextAsync(fileContents.ToString());
                fileContents = new StringBuilder(string.Empty);
            }
        }

        await myBlob.DeleteAsync();
        log.LogInformation("Extracted smaller files");
    }
}
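For the CSV-to-JSON case in the question, a minimal sketch of the CloudAppendBlob suggestion above might look like this (container, csvBlob, the blob name, and the per-row JSON shape are assumptions, not from the original post):
// Stream the CSV one row at a time and append each converted row to an append blob,
// so the whole file never has to be held in memory.
CloudAppendBlob appendBlob = container.GetAppendBlobReference("parsed/output.json");
await appendBlob.CreateOrReplaceAsync();

using (var reader = new StreamReader(await csvBlob.OpenReadAsync()))
{
    while (!reader.EndOfStream)
    {
        string line = await reader.ReadLineAsync();
        string rowJson = JsonConvert.SerializeObject(line.Split(',')); // one JSON array per row
        await appendBlob.AppendTextAsync(rowJson + Environment.NewLine);
    }
}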
I'm trying to convert a current application that uses NPOI for creating an xls document on the server into an Azure-hosted application. I have little experience with NPOI and Azure, so that's two strikes right there. I have the app uploading the xls to a blob container, but it is always blank (9 bytes). From what I understand, NPOI uses a FileStream to write to the file, so I just changed that to write to the blob container.
Here is what I think are the relevant portions:
internal void GenerateExcel(DataSet ds, int QuoteID, string ReportFileName)
{
    string ExcelFileName = string.Format("{0}_{1}.xls", ReportFileName, QuoteID);
    try
    {
        // These 2 strings will get deleted, but they are left here for now to run side by side at the moment.
        string ReportDirectoryPath = HttpContext.Current.Server.MapPath(".") + "\\Reports";
        if (!Directory.Exists(ReportDirectoryPath))
        {
            Directory.CreateDirectory(ReportDirectoryPath);
        }

        string ExcelReportFullPath = ReportDirectoryPath + "\\" + ExcelFileName;
        if (File.Exists(ExcelReportFullPath))
        {
            File.Delete(ExcelReportFullPath);
        }

        // Create a new workbook.
        var workbook = new HSSFWorkbook();

        // Rest of the NPOI xls rows, cells, etc.; all works fine when writing to disk.

        // Retrieve the storage account from the connection string.
        CloudStorageAccount storageAccount = CloudStorageAccount.Parse(CloudConfigurationManager.GetSetting("StorageConnectionString"));

        // Create the blob client.
        CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();

        // Retrieve a reference to a container.
        CloudBlobContainer container = blobClient.GetContainerReference("pricingappreports");

        // Create the container if it doesn't already exist.
        if (container.CreateIfNotExists())
        {
            container.SetPermissions(new BlobContainerPermissions { PublicAccess = BlobContainerPublicAccessType.Blob });
        }

        // Retrieve a reference to a blob with the same name.
        CloudBlockBlob blockBlob = container.GetBlockBlobReference(ExcelFileName);

        // Write the output to a file on the server.
        String file = ExcelReportFullPath;
        using (FileStream fs = new FileStream(file, FileMode.Create))
        {
            workbook.Write(fs);
            fs.Close();
        }

        // Write the output to a file on Azure Storage.
        String Blobfile = ExcelFileName;
        using (FileStream fs = new FileStream(Blobfile, FileMode.Create))
        {
            workbook.Write(fs);
            blockBlob.UploadFromStream(fs);
            fs.Close();
        }
    }
I'm uploading to the blob and the file exists, so why doesn't the data get written to the xls?
Any help would be appreciated.
Update: I think I found the problem. It doesn't look like you can write to a file in blob storage directly. I found this blog post, which pretty much answers my question; it doesn't use NPOI, but the concept is the same: http://debugmode.net/2011/08/28/creating-and-updating-excel-file-in-windows-azure-web-role-using-open-xml-sdk/
Thanks
Can you install Fiddler and check the request and response packets? You may also need to seek back to 0 between the two writes, so the correct code here would add the seek below before trying to write the stream to the blob.
workbook.Write(fs);
fs.Seek(0, SeekOrigin.Begin); // rewind so the upload starts from the beginning of the file
blockBlob.UploadFromStream(fs);
fs.Close();
I also noticed that you are using String Blobfile = ExcelFileName instead of String Blobfile = ExcelReportFullPath.
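Alternatively, a minimal sketch (an assumption on my part, not the answer's exact fix) that skips the intermediate file entirely by building the workbook in memory; depending on the NPOI version, Write may close the stream, so copying to a byte array first is the safer route:
// Build the xls in memory and upload it; ToArray still works even if NPOI closed the stream.
using (var ms = new MemoryStream())
{
    workbook.Write(ms);
    byte[] bytes = ms.ToArray();
    blockBlob.UploadFromStream(new MemoryStream(bytes));
}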
I'm converting a website from a standard ASP.NET website over to use Azure. The website had previously taken an Excel file uploaded by an administrative user and saved it on the file system. As part of the migration, I'm saving this file to Azure Storage. It works fine when running against my local storage through the Azure SDK. (I'm using version 1.3 since I didn't want to upgrade during the development process.)
When I point the code to run against Azure Storage itself, though, the process usually fails. The error I get is:
System.IO.IOException occurred
Message=Unable to read data from the transport connection: The connection was closed.
Source=Microsoft.WindowsAzure.StorageClient
StackTrace:
at Microsoft.WindowsAzure.StorageClient.Tasks.Task`1.get_Result()
at Microsoft.WindowsAzure.StorageClient.Tasks.Task`1.ExecuteAndWait()
at Microsoft.WindowsAzure.StorageClient.CloudBlob.UploadFromStream(Stream source, BlobRequestOptions options)
at Framework.Common.AzureBlobInteraction.UploadToBlob(Stream stream, String BlobContainerName, String fileName, String contentType) in C:\Development\RateSolution2010\Framework.Common\AzureBlobInteraction.cs:line 95
InnerException:
The code is as follows:
public void UploadToBlob(Stream stream, string BlobContainerName, string fileName,
    string contentType)
{
    // Set up the connection to Windows Azure Storage.
    CloudStorageAccount storageAccount = CloudStorageAccount.Parse(GetConnStr());

    DiagnosticMonitorConfiguration dmc = DiagnosticMonitor.GetDefaultInitialConfiguration();
    dmc.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);
    dmc.Logs.ScheduledTransferLogLevelFilter = LogLevel.Verbose;
    DiagnosticMonitor.Start(storageAccount, dmc);

    CloudBlobClient BlobClient = null;
    CloudBlobContainer BlobContainer = null;
    BlobClient = storageAccount.CreateCloudBlobClient();

    // For large file copies you need to set up a custom timeout period,
    // and using parallel settings appears to spread the copy across multiple threads.
    // If you have big bandwidth you can increase the thread number below,
    // because Azure accepts blobs broken into blocks in any order of arrival.
    BlobClient.Timeout = new System.TimeSpan(1, 0, 0);
    Role serviceRole = RoleEnvironment.Roles.Where(s => s.Value.Name == "OnlineRates.Web").First().Value;
    BlobClient.ParallelOperationThreadCount = serviceRole.Instances.Count;

    // Get and create the container.
    BlobContainer = BlobClient.GetContainerReference(BlobContainerName);
    BlobContainer.CreateIfNotExist();

    // Delete the prior version if one exists.
    BlobRequestOptions options = new BlobRequestOptions();
    options.DeleteSnapshotsOption = DeleteSnapshotsOption.None;
    CloudBlob blobToDelete = BlobContainer.GetBlobReference(fileName);
    Trace.WriteLine("Blob " + fileName + " deleted to be replaced by newer version.");
    blobToDelete.DeleteIfExists(options);

    // Set the stream to the starting position.
    stream.Position = 0;
    long totalBytes = 0;

    // Open the stream and read it back.
    using (stream)
    {
        // Create the blob and upload the file.
        CloudBlockBlob blob = BlobContainer.GetBlockBlobReference(fileName);
        try
        {
            BlobClient.ResponseReceived += new EventHandler<ResponseReceivedEventArgs>((obj, responseReceivedEventArgs) =>
            {
                if (responseReceivedEventArgs.RequestUri.ToString().Contains("comp=block&blockid"))
                {
                    totalBytes += Int64.Parse(responseReceivedEventArgs.RequestHeaders["Content-Length"]);
                }
            });

            blob.UploadFromStream(stream);

            // Set the metadata on the blob.
            blob.Metadata["FileName"] = fileName;
            blob.SetMetadata();

            // Set the properties.
            blob.Properties.ContentType = contentType;
            blob.SetProperties();
        }
        catch (Exception exc)
        {
            Logging.ExceptionLogger.LogEx(exc);
        }
    }
}
I've tried a number of different alterations to the code: deleting a blob before replacing it (although the problem exists on new blobs as well), setting container permissions, not setting permissions, etc.
Your code looks like it should work, but it has a lot of extra functionality that is not strictly required. I would cut it down to an absolute minimum and go from there. It's really only a gut feeling, but I think it might be the using statement giving you grief. This entire function could be written (presuming the container already exists) as:
public void UploadToBlob(Stream stream, string BlobContainerName, string fileName,
    string contentType)
{
    // Set up the connection to Windows Azure Storage.
    CloudStorageAccount storageAccount = CloudStorageAccount.Parse(GetConnStr());
    CloudBlobClient BlobClient = storageAccount.CreateCloudBlobClient();
    CloudBlobContainer BlobContainer = BlobClient.GetContainerReference(BlobContainerName);
    CloudBlockBlob blob = BlobContainer.GetBlockBlobReference(fileName);
    stream.Position = 0;
    blob.UploadFromStream(stream);
}
Notes on the stuff that I've removed:
You should set up diagnostics just once when your app starts, not every time a method is called; usually this is done in RoleEntryPoint.OnStart() (see the sketch after these notes).
I'm not sure why you're trying to set ParallelOperationThreadCount higher if you have more instances. Those two things seem unrelated.
It's not good form to check for the existence of a container/table every time you save something to it. It's more usual to do that check once when your app starts, or to have a process external to the website make sure all the required containers/tables/queues exist. Of course, if you're trying to dynamically create containers this doesn't apply.
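A minimal sketch of that startup configuration, assuming the Windows Azure SDK 1.x diagnostics APIs used in the question (the connection-string setting name is the conventional one and is an assumption here):
// Configure diagnostics once at role startup instead of inside UploadToBlob.
public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        DiagnosticMonitorConfiguration dmc = DiagnosticMonitor.GetDefaultInitialConfiguration();
        dmc.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);
        dmc.Logs.ScheduledTransferLogLevelFilter = LogLevel.Verbose;

        // Assumed setting name; use whatever diagnostics connection string your role defines.
        DiagnosticMonitor.Start("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", dmc);

        return base.OnStart();
    }
}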
The problem turned out to be the firewall settings on my laptop. It's my personal laptop, originally set up at home, so the firewall rules weren't configured for a corporate environment, resulting in slow performance on uploads and downloads.