I'm looking for an Azure service that allows me to log ~200 million datasets a month. These are tracking datasets, so writing has to be fast.
The data will be read once or twice a day to aggregate the tracked data.
Does anyone know of an Azure service that makes sense for this?
Thanks in advance,
Michael
Azure Data Lake seems to be a good option for your case.
Please provide more details about the kind of "dataset". Is it a flat file with 100 records, or several million records?
How often will it be sent to the system: a few dozen per minute, or 10 million at the end of the day?
With that information, if you know something about AWS, you can find the equivalent Azure services here:
https://learn.microsoft.com/en-us/azure/architecture/aws-professional/services
If you are going to ingest a lot of requests (like tracking events or anything similar) and want to summarize them over some time span, you can run some tests with the IoT Suite:
https://azure.microsoft.com/suites/iot-suite/
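For the ingestion side, a minimal sketch using the Event Hubs Python SDK (the kind of high-throughput ingestion the IoT offerings are built on) might look like the following; the connection string, hub name, and event shape are assumptions, not anything taken from the question:

```python
# pip install azure-eventhub
# Sketch only: connection string, hub name, and event payload are placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-connection-string>",
    eventhub_name="tracking-events",
)

with producer:
    # Append-style, write-optimized ingestion; the daily aggregation job reads later.
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({"userId": 42, "action": "page_view"})))
    producer.send_batch(batch)
```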
We are looking to provide a historic activity log on objects in our system (similar to Jira's history tab). We are looking at Azure Data Explorer as a potential tool for addressing this use case.
Sample queries we need to answer:
Give me all objects that have changed in the last 30 days.
Give me all objects that have changed in the last 30 days that have the value of key1 set to value1.
Give me all the objects that userA changed in the last year.
The amount of data (objects) we have is huge (could be tens of millions), but the activity itself is not, and it will definitely not arrive in a streaming format. Is this the right use case for Azure Data Explorer?
Yes, Azure Data Explorer (Kusto) is the ideal cloud service for this functionality. You can learn more about the service by watching the recent online event and using the quick starts in the docs.
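To give a feel for it, here is a hedged sketch of the third sample query ("all objects that userA changed in the last year") run through the Python Kusto client; the cluster URL, database, and the ObjectActivity table/column names are assumptions:

```python
# pip install azure-kusto-data
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Assumed cluster and database; the table/columns are illustrative only.
kcsb = KustoConnectionStringBuilder.with_az_cli_authentication("https://<cluster>.kusto.windows.net")
client = KustoClient(kcsb)

query = """
ObjectActivity
| where Timestamp > ago(365d) and User == 'userA'
| distinct ObjectId
"""
response = client.execute("<database>", query)
for row in response.primary_results[0]:
    print(row["ObjectId"])
```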
I'm aware of the many different ways of scheduling system-centric events in Azure. E.g. Azure Scheduler, Logic Apps, etc. These can be used for things like backups, sending batch emails, or other maintenance functions.
However, I'm less clear on what technology is available for events relating to a large volume of documents or records.
For example, imagine I have 100,000 documents in Cosmos and some of the datetime properties on those documents relate to events: e.g. expiry, reminders, escalations, timeouts, etc. Each record has a different set of dates and times.
What approaches are there to fire off code whenever one of those datetimes is reached?
Stuff I've thought of so far:
Have a scheduled task that runs once per minute and looks for anything relating to that particular minute in Cosmos then does "stuff".
Schedule tasks on Service Bus queues with a future date as-and-when the Cosmos records are created and then have something to receive those messages and do "stuff".
But are there better ways of doing this? Is there a ready-made Azure service that would take away much of the background infrastructure work and just let me schedule a single one-off event at a particular point in time and hit a webhook or something like that?
Am I mis-categorising Azure Scheduler as something that you'd use for a handful of regularly scheduled tasks rather than the mixed bag of dates and times you'd find in 100,000 Cosmos records?
FWIW, in my use-case there isn't really a precision issue - stuff scheduled for 10:05:00 happening at 10:05:32 would be perfectly acceptable, for example.
Appreciate your thoughts.
First of all, Azure Scheduler is being replaced by Azure Logic Apps:
Azure Logic Apps is replacing Azure Scheduler, which is being retired. To schedule jobs, follow this article for moving to Azure Logic Apps instead.
(source)
That said, Azure Logic Apps is one of your options, since you can define a logic app that starts a one-time job by using a delay action. See the docs for details.
It scales very well and you can pay for what you use (or use a fixed pricing model).
Another option is using a durable Azure Function with a timer in it. Once the timer elapses, you can do your thing. You can use a Consumption plan, so you pay only for what you use, or a fixed pricing model. It also scales very well, so hundreds of those instances won't be a problem.
In both cases you have to trigger the function or logic app when the Cosmos records are created. Put the due time as context in the trigger and there you go.
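As a rough sketch of the durable-function variant (Python orchestrator body only; the surrounding function app bindings and the HandleDueDocument activity name are assumptions, not prescribed):

```python
# pip install azure-functions-durable
import azure.durable_functions as df
from datetime import datetime

def orchestrator_function(context: df.DurableOrchestrationContext):
    # The due time is passed as input when the orchestration is started,
    # e.g. from whatever triggers on Cosmos record creation (assumption).
    payload = context.get_input()  # e.g. {"docId": "...", "dueTime": "2024-06-01T10:05:00+00:00"}
    fire_at = datetime.fromisoformat(payload["dueTime"])
    yield context.create_timer(fire_at)                    # durable timer survives restarts
    yield context.call_activity("HandleDueDocument", payload["docId"])  # your "stuff"

main = df.Orchestrator.create(orchestrator_function)
```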
Now, given your statement
I'm aware of the many different ways of scheduling system-centric events in Azure. E.g. Azure Scheduler, Logic Apps, etc. These can be used for things like backups, sending batch emails, or other maintenance functions.
That is up to you; you can do anything you want. You don't specify in your question what work needs to be done when the due time is reached, but I doubt it is something you can't do with those services.
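As a side note, the Service Bus idea from the question (scheduling a message with a future date when the Cosmos record is created) maps directly to scheduled messages. A minimal sketch with the Python SDK, where the connection string, queue name, and payload are assumptions:

```python
# pip install azure-servicebus
import json
from datetime import datetime, timedelta, timezone
from azure.servicebus import ServiceBusClient, ServiceBusMessage

# In practice the due time would come from the Cosmos record (assumption).
due_time = datetime.now(timezone.utc) + timedelta(days=30)

with ServiceBusClient.from_connection_string("<service-bus-connection-string>") as client:
    with client.get_queue_sender(queue_name="due-events") as sender:
        msg = ServiceBusMessage(json.dumps({"docId": "abc123", "event": "expiry"}))
        # The message only becomes visible to receivers at the scheduled time.
        sender.schedule_messages(msg, due_time)
```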
Looking for the best Azure services for holding and manipulating data for an e-commerce application (an online book store) with millions of books.
As of now the e-commerce application runs on ASP.NET and an on-premises SQL Server. Stock availability and prices change very frequently (every hour), so we are updating millions of records within a specific time window; millions of records are updated within 30 minutes using SSIS packages.
Now that we intend to move our application to Azure, can someone help me select the data storage service on Azure that best meets our expectations?
Expectations:
1- Can store relational data
2- Data can be updated within a strict timeline - takes minimum time to complete the full transaction
3- Highly scalable and highly available
As an experiment I am managing this data with Azure SQL Database (P1 tier) but am not fully satisfied, because a task that the on-premises SQL Server completes in 30 minutes takes Azure SQL more than 7 hours. I also tried batches but am still struggling.
Can someone please suggest a solution?
I'd be happy to help.
Unfortunately there is a lot that's different between a P1 in a remote data center, with a 99.99% SLA, automatic HA, and very specific CPU/IOPS/memory resources, and your on-premises server where the app logic and SQL Server all run in the same OS context. Putting network latency aside, I would guess that the hardware resources of that server (CPU/IOPS/memory) are many times larger than the resources a P1 has.
Using the data you already have: upgrading to a P2 will approximately double the resources available in this test, a P3 will quadruple them, and so on.
Happy to talk offline to help you build a more apples to apples comparison. guyhay at microsoft.com
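As a side note on the batching mentioned in the question, one common pattern (separate from the tier-sizing point above) is to bulk-load the changed rows into a staging table and then apply them with a single set-based MERGE. A hedged sketch with pyodbc; the connection string, table, and column names are assumptions:

```python
# pip install pyodbc
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<server>.database.windows.net;Database=<db>;UID=<user>;PWD=<pwd>"
)
cur = conn.cursor()
cur.fast_executemany = True  # send parameter batches instead of row-by-row round trips

rows = [("9780132350884", 42, 29.99), ("9781491950357", 7, 44.50)]  # (isbn, stock, price) examples
cur.execute("CREATE TABLE #staging (isbn VARCHAR(20) PRIMARY KEY, stock INT, price DECIMAL(10,2));")
cur.executemany("INSERT INTO #staging (isbn, stock, price) VALUES (?, ?, ?)", rows)

# dbo.Books is an assumed target table; one set-based update instead of millions of singletons.
cur.execute("""
    MERGE dbo.Books AS t
    USING #staging AS s ON t.isbn = s.isbn
    WHEN MATCHED THEN UPDATE SET t.stock = s.stock, t.price = s.price;
""")
conn.commit()
```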
There must be a solution to this already, but I'm having trouble finding it.
We have data stored in table storage and we are syncing it with an offline capable client web app over a restful api (Web API).
We are using a high watermark (currently a datetime) to make sure we only download the data that has changed or been added.
e.g. clients/get?watermark=2013-12-16 10:00
The problem we are facing with this approach is what happens in the edge case where multiple servers are inserting data whilst a get happens. There is a possibility that data could be inserted with a timestamp lower than the client's timestamp.
Should we worry about this or can someone recommend a better way of doing this?
I believe our main issue is inserting the data into the store. At that point there is no way to guarantee which timestamp is used, or that one Azure box has the correct time relative to the other Azure boxes.
Are you able to insert data into queues when inserting data into table storage? If so, you can build a sync process that monitors the queue and inserts data based on what's in the queue. This lets you stop worrying about timestamps and date-sync issues.
It will also make your table storage scanning faster, as you'll be able to go directly to table storage by the partition/row keys that would presumably be in the queue messages.
Edited to provide further information:
I re-read your question and realized you're looking to sync with many client applications, not necessarily with a single on-premises sync system, which is what I assumed originally.
In this case, I'm slightly tweaking my suggestion:
Consider using Service Bus and publishing a message to a Service Bus topic every time you change or insert an Azure Table Storage (ATS) entity. The message could contain an individual PartitionKey/RowKey or perhaps some other meta information about which ATS entities have been changed.
Your individual disconnectable clients would subscribe to the Service Bus topic through their own Service Bus topic subscription, pull and handle the individual Service Bus messages, and sync whatever ATS entities are described in those messages.
This way you won't really care about the last-modified timestamps of your entities and will only care about pulling messages from the Service Bus topic. If your client pulls all of the messages from a topic and synchronizes all of the entities that those messages describe, it has synchronized itself, regardless of the number of workers inserting data into ATS and the timestamps with which they insert those entities.
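A minimal sketch of that flow with the Python SDKs; the connection strings, table/topic names, and entity shape are assumptions:

```python
# pip install azure-data-tables azure-servicebus
import json
from azure.data.tables import TableClient
from azure.servicebus import ServiceBusClient, ServiceBusMessage

entity = {"PartitionKey": "client-001", "RowKey": "record-42", "Status": "updated"}

# Write the entity, then publish its keys so disconnected clients know what to pull.
table = TableClient.from_connection_string("<storage-connection-string>", table_name="clients")
table.upsert_entity(entity)

with ServiceBusClient.from_connection_string("<service-bus-connection-string>") as sb:
    with sb.get_topic_sender(topic_name="ats-changes") as sender:
        sender.send_messages(ServiceBusMessage(json.dumps(
            {"PartitionKey": entity["PartitionKey"], "RowKey": entity["RowKey"]}
        )))
```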
When you're working in a disconnected/distributed environment, it is hard to keep things in sync based on actual time (for that to work correctly, the clocks of all actors need to be in sync).
Instead you should try looking at logical clocks (like a vector clock). You'll find plenty of Java examples but if you're planning to do this in .NET the examples are pretty limited.
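For illustration, a tiny vector clock sketch in Python (not tied to any particular library); it shows how two concurrent writes can be detected without trusting wall clocks:

```python
def increment(clock: dict, actor: str) -> dict:
    """Return a copy of the clock with this actor's counter bumped."""
    clock = dict(clock)
    clock[actor] = clock.get(actor, 0) + 1
    return clock

def merge(a: dict, b: dict) -> dict:
    """Component-wise maximum: the clock after seeing both histories."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def happened_before(a: dict, b: dict) -> bool:
    """True if every counter in a is <= the one in b and the clocks differ."""
    keys = a.keys() | b.keys()
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

# Two servers writing concurrently: neither clock dominates the other,
# so the client knows it cannot rely on wall-clock ordering alone.
c1 = increment({}, "server1")   # {"server1": 1}
c2 = increment({}, "server2")   # {"server2": 1}
print(happened_before(c1, c2), happened_before(c2, c1))  # False False -> concurrent
```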
On the other hand you might want to take a look at how the Sync Framework handles synchronization.
A couple of questions that I can't find any answers to. Hope someone can help:
I will be using entity sizes of almost 1 MB. I can't find any information on read latency for such large entities. Does anyone out there have any information on this?
Is there any way to determine how much space a row uses in Azure Table Storage? Is there an API for this?
Thanks
If most of your entities are large, it somewhat defeats the purpose of Table Storage in the first place. Indeed, you will only be able to retrieve or update them about 4 at a time at most, as entity group transactions are limited to 4 MB.
You can find many useful measurements for Table Storage, as well as the other Azure storage services, in the AzureScope project.
Then, if you want to accurately check the weight of your rows, just use Fiddler to intercept your web requests, and directly look at the XML being produced.
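If you only need a ballpark figure from the client side, a rough approximation like the one below can help; note this estimates the serialized payload (the same idea as inspecting the traffic with Fiddler), not Azure's exact billing formula, and the entity shape is just an example:

```python
def approx_entity_bytes(entity: dict) -> int:
    """Very rough size estimate: property names and strings as UTF-16, 8 bytes for scalars."""
    total = 0
    for name, value in entity.items():
        total += len(name.encode("utf-16-le"))      # property names are stored as Unicode
        if isinstance(value, str):
            total += len(value.encode("utf-16-le"))
        elif isinstance(value, bytes):
            total += len(value)
        else:
            total += 8                              # ints/doubles/datetimes: roughly 8 bytes
    return total

entity = {"PartitionKey": "books", "RowKey": "9780132350884", "Blob": b"\x00" * 900_000}
print(approx_entity_bytes(entity))  # close to the 1 MB per-entity limit
```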