I have 30,000 images in blob storage and I want to fetch them in descending order of modified date. Is there any way to fetch them in chunks of 1000 images per call?
Here is my code, but it takes too much time. Basically, can I sort ListBlobs() by LastUpdated date?
CloudBlobContainer rootContainer = blobClient.GetContainerReference("installations");
CloudBlobDirectory dir1;
var items = rootContainer.ListBlobs(id + "/Cameras/" + camId.ToString() + "/", false);
foreach (var blob in items.OfType<CloudBlob>()
.OrderByDescending(b => b.Properties.LastModified).Skip(1000).Take(500))
{
}
Basically, can I sort ListBlobs() by LastUpdated date?
No, you can't do server-side sorting on LastUpdated. The Blob Storage service returns the data sorted by blob name, so you would need to fetch the complete list on the client and sort it there.
Another alternative would be to store the blob's information (such as its URL and last modified date) in a SQL database and fetch the list from there. That would give you the ability to sort the data any way you like.
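For the first option, here is a minimal sketch using the same legacy WindowsAzure.Storage SDK as the question (it reuses the question's blobClient, id and camId; the other variable names are my own): page through the listing 1000 blobs at a time with ListBlobsSegmented, collect the results, and then sort on the client.
CloudBlobContainer rootContainer = blobClient.GetContainerReference("installations");
string prefix = id + "/Cameras/" + camId.ToString() + "/";

var allBlobs = new List<CloudBlob>();
BlobContinuationToken token = null;
do
{
    // maxResults = 1000 is the maximum page size the service returns per call
    BlobResultSegment segment = rootContainer.ListBlobsSegmented(
        prefix, true, BlobListingDetails.None, 1000, token, null, null);
    allBlobs.AddRange(segment.Results.OfType<CloudBlob>());
    token = segment.ContinuationToken;
} while (token != null);

// The sort itself still has to happen on the client - the service only orders by name
var newestFirst = allBlobs.OrderByDescending(b => b.Properties.LastModified).ToList();
This keeps each request at the 1000-blob page limit, but it does not avoid enumerating the whole prefix before you can sort.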
I have sorted the blobs by last modified date as in the example below, and it is the only solution I could think of :)
/**
 * list the blob items in the blob container, ordered by the last modified date
 * @return the blob items sorted with the most recently modified first
 */
public List<FileProperties> listFiles() {
    Iterable<ListBlobItem> listBlobItems = rootContainer.listBlobs();
    List<FileProperties> list = new ArrayList<>();
    for (ListBlobItem listBlobItem : listBlobItems) {
        if (listBlobItem instanceof CloudBlob) {
            String name = ((CloudBlob) listBlobItem).getName();
            FileProperties info = new FileProperties(name, ((CloudBlob) listBlobItem).getProperties().getLastModified());
            list.add(info);
        }
    }
    // sort the listed blob items by last modified date, newest first
    list.sort(new Comparator<FileProperties>() {
        @Override
        public int compare(FileProperties o1, FileProperties o2) {
            return Long.compare(o2.getLastModifiedDate().getTime(), o1.getLastModifiedDate().getTime());
        }
    });
    return list;
}
I'm currently consuming the change feed on an Azure storage account and would like to distinguish between blobs that were created (uploaded) and those that were just modified.
In the example below I upload a blob (agent-diag.txt) and then edit the file (add some text).
In both cases it raises 'BlobCreated'; there seems to be no concept of 'BlobUpdated'.
From MS Doc: The following event types are captured in the change feed records:
BlobCreated
BlobDeleted
BlobPropertiesUpdated
BlobSnapshotCreated
BlobPropertiesUpdated is recorded if the metadata or tags etc. are changed, but if the file's content is modified I can't see any way to identify this. Any ideas?
Operation Name: PutBlob
Api: Azure.Storage.Blobs.ChangeFeed.BlobChangeFeedEventData
Subject: /blobServices/default/containers/myblobs/blobs/agent-diag.txt
Event Type: BlobCreated
Event Time: 17/11/2021 23:25:42 +00:00
Operation Name: PutBlob
Api: Azure.Storage.Blobs.ChangeFeed.BlobChangeFeedEventData
Subject: /blobServices/default/containers/myblobs/blobs/agent-diag.txt
Event Type: BlobCreated
Event Time: 17/11/2021 23:26:07 +00:00
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.ChangeFeed;

namespace Changefeed
{
    class Program
    {
        const string conString = "DefaultEndpointsProtocol=BlahBlah";

        public static async Task<List<BlobChangeFeedEvent>> ChangeFeedAsync(string connectionString)
        {
            // Get a new blob service client.
            BlobServiceClient blobServiceClient = new BlobServiceClient(connectionString);
            // Get a new change feed client.
            BlobChangeFeedClient changeFeedClient = blobServiceClient.GetChangeFeedClient();
            List<BlobChangeFeedEvent> changeFeedEvents = new List<BlobChangeFeedEvent>();
            // Get all the events in the change feed.
            await foreach (BlobChangeFeedEvent changeFeedEvent in changeFeedClient.GetChangesAsync())
            {
                changeFeedEvents.Add(changeFeedEvent);
            }
            return changeFeedEvents;
        }

        public static void showEventData(List<BlobChangeFeedEvent> changeFeedEvents)
        {
            foreach (BlobChangeFeedEvent changeFeedEvent in changeFeedEvents)
            {
                string subject = changeFeedEvent.Subject;
                string eventType = changeFeedEvent.EventType.ToString();
                string eventTime = changeFeedEvent.EventTime.ToString();
                string api = changeFeedEvent.EventData.ToString();
                string operation = changeFeedEvent.EventData.BlobOperationName.ToString();
                Console.WriteLine("Subject: " + subject + "\n" +
                                  "Event Type: " + eventType + "\n" +
                                  "Event Time: " + eventTime + "\n" +
                                  "Operation Name: " + operation + "\n" +
                                  "Api: " + api);
            }
        }

        public static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
            List<BlobChangeFeedEvent> feedlist = ChangeFeedAsync(conString).GetAwaiter().GetResult();
            Console.WriteLine("Feedlist: " + feedlist.Count());
            showEventData(feedlist);
        }
    }
}
Each blob has two system-defined properties, creation date and last modified, which tell you when the blob was created and when it was last modified respectively.
When a blob is created, both of these properties have the same value. However, when the same blob is overwritten (i.e. its content is updated), only the last modified value changes.
What you could do is use these properties to identify whether a new blob was created or the content of an existing blob was updated.
You would still work with the BlobCreated event. The one additional step is to fetch the properties of the blob and compare these two values to make the distinction.
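A rough sketch of that extra step, using the same Azure.Storage.Blobs client as the question's code (the container and blob names below are simply taken from the sample event subject; parsing the subject is omitted):
// Assumes blobServiceClient is the BlobServiceClient already created from the connection string,
// and that the container/blob names were parsed from the event subject, e.g.
// "/blobServices/default/containers/myblobs/blobs/agent-diag.txt".
BlobContainerClient containerClient = blobServiceClient.GetBlobContainerClient("myblobs");
BlobClient blobClient = containerClient.GetBlobClient("agent-diag.txt");

BlobProperties props = (await blobClient.GetPropertiesAsync()).Value;
if (props.CreatedOn == props.LastModified)
{
    Console.WriteLine("BlobCreated: new blob was uploaded");
}
else
{
    Console.WriteLine("BlobCreated: existing blob was overwritten");
}
Keep in mind the blob may have been overwritten again between the change feed event and the property fetch, so the comparison reflects the blob's current state rather than the exact event.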
I have stored a bunch of images in Azure Blob Storage. Now I want to retrieve them & resize them.
I have successfully managed to read a lot of information from the account, such as the file name, the date last modified, and the size, but how do I get the actual image? Examples I have seen show how to download it to a file, but that is no use to me; I want to download it as an image so I can process it.
This is what I have so far:
BlobContainerClient containerClient = blobServiceClient.GetBlobContainerClient(containerName);
Console.WriteLine("Listing blobs...");
// build table to hold the info
DataTable table = new DataTable();
table.Columns.Add("ID", typeof(int));
table.Columns.Add("blobItemName", typeof(string));
table.Columns.Add("blobItemLastModified", typeof(DateTime));
table.Columns.Add("blobItemSizeKB", typeof(double));
table.Columns.Add("blobImage", typeof(Image));
// row counter for table
int intRowNo = 0;
// divider to convert Bytes to KB
double dblBytesToKB = 1024.00;
// List all blobs in the container
await foreach (BlobItem blobItem in containerClient.GetBlobsAsync())
{
    // increment row number
    intRowNo++;
    //Console.WriteLine("\t" + blobItem.Name);

    // length in bytes
    long? longContentLength = blobItem.Properties.ContentLength;
    double dblKb = 0;
    if (longContentLength.HasValue == true)
    {
        long longContentLengthValue = longContentLength.Value;
        // convert to double DataType
        double dblContentLength = Convert.ToDouble(longContentLengthValue);
        // Convert to KB
        dblKb = dblContentLength / dblBytesToKB;
    }

    // get the image
    // **** Image thisImage = what goes here ?? actual data from blobItem ****

    // Last modified date
    string date = blobItem.Properties.LastModified.ToString();
    try
    {
        DateTime dateTime = DateTime.Parse(date);
        //Console.WriteLine("The specified date is valid: " + dateTime);
        table.Rows.Add(intRowNo, blobItem.Name, dateTime, dblKb);
    }
    catch (FormatException)
    {
        Console.WriteLine("Unable to parse the specified date");
    }
}
You need to open a read stream for your image, and construct your .NET Image from this stream:
await foreach (BlobItem item in containerClient.GetBlobsAsync())
{
    var blobClient = containerClient.GetBlobClient(item.Name);
    using Stream stream = await blobClient.OpenReadAsync();
    Image myImage = Image.FromStream(stream);
    //...
}
The BlobClient class also exposes some other helpful methods, such as downloading to a stream.
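For example, a sketch of the download-to-stream route, using the same containerClient and loop variable as above (System.Drawing's Image.FromStream is assumed for the image handling, as in the question):
var blobClient = containerClient.GetBlobClient(item.Name);
using var memoryStream = new MemoryStream();
// Download the whole blob into the in-memory stream
await blobClient.DownloadToAsync(memoryStream);
memoryStream.Position = 0; // rewind before decoding
Image myImage = Image.FromStream(memoryStream);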
I'm importing data from an Excel file using Maatwebsite, and before creating a new row in my model I check whether the record already exists in order to avoid duplicates. But this takes too long.
In my ProductImport.php:
public function model(array $row)
{
    $exists = Product::where('product_description', $row["product_description"])
        ->where('product_code', $row["product_code"])
        ->first();

    if ($exists) {
        return null;
    }

    ++$this->rows;

    // Autoincrement id
    return new Product([
        "product_description" => $row["art_descripcion"],
        "product_code" => $row["cui"],
        "id_user" => $this->id_user,
        ...
    ]);
}

public function chunkSize(): int
{
    return 1000;
}
As you can see, I'm also using chunkSize, because there are 5000 rows per Excel file.
The problem:
The product_description is between 800 and 900 characters long (varchar(1000)), which makes the where() query very slow on each iteration within the foreach.
Is there a better way to handle this? Maybe using updateOrCreate instead of searching first and then creating? Although I think that would amount to the same approach.
So the main problem is: how do I compare those 800-900 character strings more quickly? This search is taking a lot of time to execute:
$exists = Product::where('product_description', $row["product_description"])
    ->where('product_code', $row["product_code"])
    ->first();
I have a function that deletes every table entity and blob that belongs to the affected user.
CloudTable uploadTable = CloudStorageServices.GetCloudUploadsTable();
TableQuery<UploadEntity> uploadQuery = uploadTable.CreateQuery<UploadEntity>();
List<UploadEntity> uploadEntity = (from e in uploadTable.ExecuteQuery(uploadQuery)
                                   where e.PartitionKey == "uploads" && e.UserName == User.Identity.Name
                                   select e).ToList();

foreach (UploadEntity uploadTableItem in uploadEntity)
{
    // Delete table entity
    TableOperation retrieveOperationUploads = TableOperation.Retrieve<UploadEntity>("uploads", uploadTableItem.RowKey);
    TableResult retrievedResultUploads = uploadTable.Execute(retrieveOperationUploads);
    UploadEntity deleteEntityUploads = (UploadEntity)retrievedResultUploads.Result;
    TableOperation deleteOperationUploads = TableOperation.Delete(deleteEntityUploads);
    uploadTable.Execute(deleteOperationUploads);

    // Delete blob
    CloudBlobContainer blobContainer = CloudStorageServices.GetCloudBlobsContainer();
    CloudBlockBlob blob = blobContainer.GetBlockBlobReference(uploadTableItem.BlobName);
    blob.Delete();
}
Each table entity has its own blob, so if the list contains 3 upload entities, the 3 entities and the 3 blobs will be deleted.
I heard you can use table batch operations to reduce cost and load. I tried it, but failed miserably. Anyone interested in helping me? :)
I'm guessing table batch operations are for tables only, so they're a no-go for blobs, right?
How would you add a TableBatchOperation to this code? Do you see any other improvements that could be made?
Thanks!
I wanted to use batch operations but I didn't know how. Anyhow, I figured it out after some testing.
Improved code for deleting several entities:
CloudTable uploadTable = CloudStorageServices.GetCloudUploadTable();
TableQuery<UserUploadEntity> uploadQuery = uploadTable.CreateQuery<UserUploadEntity>();
List<UserUploadEntity> uploadEntity = (from e in uploadTable.ExecuteQuery(uploadQuery)
                                       where e.PartitionKey == "useruploads" && e.MapName == currentUser
                                       select e).ToList();

var batchOperation = new TableBatchOperation();

foreach (UserUploadEntity uploadTableItem in uploadEntity)
{
    // Delete upload entities
    batchOperation.Delete(uploadTableItem);

    // Delete blobs
    CloudBlobContainer blobContainer = CloudStorageServices.GetCloudBlobContainer();
    CloudBlockBlob blob = blobContainer.GetBlockBlobReference(uploadTableItem.BlobName);
    blob.Delete();
}

uploadTable.ExecuteBatch(batchOperation);
I am aware that batch operations are limited to 100 entities, but in my case that's nothing to worry about.
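If the list could ever grow past that limit, one way to handle it (a sketch on my part, not something the code above needs) is to flush the batch every 100 deletes; this works here because all the entities share the "useruploads" partition key, which batch operations require.
var batchOperation = new TableBatchOperation();
foreach (UserUploadEntity uploadTableItem in uploadEntity)
{
    batchOperation.Delete(uploadTableItem);
    if (batchOperation.Count == 100)
    {
        // A batch can hold at most 100 operations, so execute and start a new one
        uploadTable.ExecuteBatch(batchOperation);
        batchOperation = new TableBatchOperation();
    }
}
if (batchOperation.Count > 0)
{
    uploadTable.ExecuteBatch(batchOperation);
}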
How would I execute a query equivalent to "select top 10" in CouchDB?
For example, I have a "schema" like so:
title, body, modified
and I want to select the last 10 modified documents.
As an added bonus, it would be great if someone could come up with a way to do the same per category. So for:
title, category, body, modified
return a list of the latest 10 documents in each category.
I am just wondering if such a query is possible in CouchDB.
To get the first 10 documents from your db you can use the limit query option.
E.g. calling
http://localhost:5984/yourdb/_design/design_doc/_view/view_name?limit=10
you get the first 10 documents.
View rows are sorted by key; adding descending=true to the query string reverses their order. You can also restrict the output to the documents you are interested in by using the query string to select a key range (startkey/endkey).
So in your view you write your map function like:
function(doc) {
emit([doc.category, doc.modified], doc);
}
And you query it like this (note that with descending=true the startkey must be the "highest" key, so the two bounds are swapped):
http://localhost:5984/yourdb/_design/design_doc/_view/view_name?startkey=["yourcategory", date_in_the_future]&endkey=["yourcategory"]&limit=10&descending=true
Here is what you need to do.
Map function
function(doc)
{
  if (doc.category)
  {
    emit([doc.category], doc.modified);
  }
}
Then you need a list function that groups them. You might be tempted to abuse a reduce to do this, but it will probably throw errors about not reducing fast enough with large sets of data.
function(head, req)
{
  // this sort function assumes that modified is a number
  // and it sorts in descending order
  function sortCategory(a, b) { b.value - a.value; }

  var categories = {};
  var row;

  while (row = getRow())
  {
    if (!categories[row.key[0]])
    {
      categories[row.key[0]] = [];
    }
    categories[row.key[0]].push(row);
  }

  for (var cat in categories)
  {
    categories[cat].sort(sortCategory);
    categories[cat] = categories[cat].slice(0, 10);
  }

  send(toJSON(categories));
}
You can now get the top 10 for every category with
http://localhost:5984/database/_design/doc/_list/top_ten/by_categories
and get the docs with
http://localhost:5984/database/_design/doc/_list/top_ten/by_categories?include_docs=true
You can also POST a list of keys to limit which categories are returned:
curl -X POST http://localhost:5984/database/_design/doc/_list/top_ten/by_categories -d '{"keys":[["category1"],["category2"],["category3"]]}'
You could also avoid hard-coding the 10 and pass the number in through the req variable.
Here is some more View/List trickery.
Slight correction: it was not sorting until I added the "return" keyword in your sortCategory function. It should be like this:
function sortCategory(a,b) { return b.value - a.value; }