Azure Event Hubs - custom consumer with SQL checkpoints

We are currently evaluating Azure Event Hubs as a mechanism to dispatch messages to background processors; at the moment we use queue-based systems.
Most processors are writing data to SQL Server databases, and the writes are wrapped in transactions.
Event Hubs is positioned as an at-least-once communication channel, so duplicate messages should be expected. EventProcessorHost is the recommended API on the read side; it automates lease management and checkpointing using Azure Blob Storage.
However, for the most critical processors we are considering implementing checkpointing ourselves, using a SQL Server table inside the same database and writing the checkpoint inside the same transaction as the processor's own writes. This should give us an effectively exactly-once guarantee where we need it.
Ignoring lease management for now (we would just run one processor per partition), is SQL-based checkpointing a good idea? Are there other drawbacks besides having to work at a lower level of the API and handle checkpoints ourselves?
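To make the idea concrete, here is a rough sketch in Python with pyodbc (the connection string, table and column names are made up; our real processors are .NET): the processor writes its output and advances a per-partition checkpoint row inside a single SQL transaction, so both either commit or roll back together.

```python
import pyodbc

# Hypothetical connection string and table names; one checkpoint row per Event Hub partition.
CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")

def process_event(partition_id: str, offset: str, sequence_number: int, body: bytes) -> None:
    conn = pyodbc.connect(CONN_STR, autocommit=False)
    try:
        cur = conn.cursor()

        # 1. The processor's normal business write (hypothetical table).
        cur.execute(
            "INSERT INTO dbo.ProcessedMessages (PartitionId, SequenceNumber, Payload) VALUES (?, ?, ?)",
            partition_id, sequence_number, body)

        # 2. Advance the checkpoint in the SAME transaction (upsert one row per partition).
        cur.execute(
            "UPDATE dbo.EventHubCheckpoints SET Offset = ?, SequenceNumber = ? WHERE PartitionId = ?",
            offset, sequence_number, partition_id)
        if cur.rowcount == 0:
            cur.execute(
                "INSERT INTO dbo.EventHubCheckpoints (PartitionId, Offset, SequenceNumber) VALUES (?, ?, ?)",
                partition_id, offset, sequence_number)

        # Either both the business write and the checkpoint become visible, or neither does.
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()
```

Because the checkpoint only advances when the business write commits, restarting from the stored checkpoint neither skips events nor re-applies ones that were already committed.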

As per Fred's advice, we implemented our own Checkpoint Manager based on a table in SQL Server. You can find the code sample here.
This implementation plugs nicely into EventProcessorHost. We also had to implement ILeaseManager, because the two are highly coupled in the default implementation.
In my blog post I've described my motivation for such a SQL-based implementation and given a high-level view of the overall solution.

Azure Storage is the built-in solution, but you are not limited to it. If most of your processors write data to SQL Server databases and you do not want EventProcessorHost to store checkpoints in Azure Storage (which requires a storage account), then, in my view, storing checkpoints in your SQL database is a good solution: it gives you an easy way to process the event and manage the checkpoint transactionally.
You can write your own checkpoint manager against the ICheckpointManager interface to store checkpoints in your SQL database.
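As a rough structural sketch of such a checkpoint manager (in Python rather than the actual .NET ICheckpointManager interface, with hypothetical table and method names), the store only needs a few per-partition operations:

```python
import pyodbc

class SqlCheckpointStore:
    """Conceptual SQL-backed checkpoint store: create/read/update one row per partition.
    This mirrors the general shape of a checkpoint manager, not any specific SDK interface."""

    def __init__(self, conn_str: str):
        self._conn_str = conn_str

    def create_if_not_exists(self, partition_id: str) -> None:
        with pyodbc.connect(self._conn_str) as conn:
            conn.cursor().execute(
                "IF NOT EXISTS (SELECT 1 FROM dbo.EventHubCheckpoints WHERE PartitionId = ?) "
                "INSERT INTO dbo.EventHubCheckpoints (PartitionId, Offset, SequenceNumber) VALUES (?, NULL, NULL)",
                partition_id, partition_id)
            conn.commit()

    def get_checkpoint(self, partition_id: str):
        with pyodbc.connect(self._conn_str) as conn:
            row = conn.cursor().execute(
                "SELECT Offset, SequenceNumber FROM dbo.EventHubCheckpoints WHERE PartitionId = ?",
                partition_id).fetchone()
            return (row.Offset, row.SequenceNumber) if row else None

    def update_checkpoint(self, partition_id: str, offset: str, sequence_number: int) -> None:
        with pyodbc.connect(self._conn_str) as conn:
            conn.cursor().execute(
                "UPDATE dbo.EventHubCheckpoints SET Offset = ?, SequenceNumber = ? WHERE PartitionId = ?",
                offset, sequence_number, partition_id)
            conn.commit()
```

In the .NET solution described above, these operations would sit behind ICheckpointManager (and ILeaseManager) so that EventProcessorHost calls them instead of the Azure Blob Storage defaults.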

Related

Azure Data Factory (ADF) vs Azure Functions: How to choose?

Currently we are using a Blob-triggered Azure Function to move JSON data into Cosmos DB. We are planning to replace the Azure Function with an Azure Data Factory (ADF) pipeline.
I am new to Azure Data Factory, so I am not sure: would an ADF pipeline be the better option or not?
Though my answer is a bit late, I would like to add that I would not recommend replacing your current setup with ADF. Reasons:
Cost: ADF is too expensive; it costs considerably more than Azure Functions.
Custom logic: ADF is not built to perform cleansing logic or any custom code. Its primary goal is data integration from external systems, using its vast connector pool.
Latency: ADF has much higher latency due to the large overhead of its job framework.
Based on your requirements, Azure Data Factory is your perfect option. You could follow this tutorial to configure a Cosmos DB output and an Azure Blob Storage input.
The advantage over Azure Functions is that you don't need to write any custom code unless data cleansing is involved, and Azure Data Factory is the recommended option here; even if you want an Azure Function for other purposes, you can add it within the pipeline.
The fundamental use of Azure Data Factory is data ingestion. Azure Functions are serverless (Function as a Service) and are best suited to short-lived executions; functions that run for many seconds are far more expensive. Azure Functions are good for event-driven microservices. For data ingestion, Azure Data Factory is the better option, as its running cost for huge volumes of data is lower than that of Azure Functions. You can also integrate Spark processing pipelines into ADF for more advanced ingestion pipelines.
Moreover, it depends on your situation. Azure Functions are serverless, lightweight processes meant to respond quickly to an event, rather than to deliver the volumetric responses of batch processes.
So, if your requirement is to respond quickly to an event with a small amount of data, stay with Azure Functions; if you need a batch process, switch to ADF.
Cost
The pricing figures below are taken from here.
Let's calculate the cost:
If your file is large (7.514 GB, copy duration 43:51, i.e. about 43.867 h):
4 (DIU) * 43.867 (h) * 0.25 ($/DIU-hour) = $43.867
$43.867 / 7.514 GB = 5.838 $/GB
If your file is small (2.497 MB, copy takes about 45 seconds, counted here as 1/60 h):
4 (DIU) * 1/60 (h) * 0.25 ($/DIU-hour) = $0.0167
2.497 MB / 1024 = 0.00244 GB
$0.0167 / 0.00244 GB = 6.844 $/GB
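For convenience, the same per-GB arithmetic can be re-run with your own numbers; the 4 DIUs and the 0.25 $/DIU-hour rate are simply the values used in the example above and may differ for your region and configuration:

```python
def adf_copy_cost_per_gb(diu: float, hours: float, rate_per_diu_hour: float, gigabytes: float) -> float:
    """Cost of an ADF copy activity divided by the amount of data moved."""
    total_cost = diu * hours * rate_per_diu_hour
    return total_cost / gigabytes

# Large file from the example: 4 DIUs for ~43.867 h moving 7.514 GB.
print(adf_copy_cost_per_gb(4, 43.867, 0.25, 7.514))          # ~5.84 $/GB

# Small file from the example: 4 DIUs for 1 minute moving 2.497 MB.
print(adf_copy_cost_per_gb(4, 1 / 60, 0.25, 2.497 / 1024))   # ~6.8 $/GB
```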
Scale
The maximum number of instances an Azure Function app can scale out to is 200.
ADF can run 3,000 concurrent external activities. In my test, only 1,500 copy activities ran in parallel. (This test wasted a lot of money.)

How to store temporary data in an Azure multi-instance (scale set) virtual machine?

We developed a server service that (in a few words) supports communication between two devices. We want to take advantage of the scalability offered by an Azure scale set (multi-instance VM), but we are not sure how to share memory between the instances.
Our service basically stores temporary data on the local virtual machine, and this data is read, modified and sent to the devices connected to the server.
If this data is stored locally on one of the instances, the other instances cannot access it and do not have the same information. Is that correct?
If one of the devices starts making requests to the server, the instance that processes each request will not always be the same, so in the end the data is spread across instances.
So the question might be, how to share memory between Azure instances?
Thanks
Depending on the type of data you want to share and how much latency matters, as well as Service Fabric (low latency, but you would need to re-architect/rebuild parts of your solution), you could look at a shared back-end repository: Redis Cache is ideal as a distributed cache; SQL Azure if you want to use a relational database to store the data; storage queues/blob storage, or File storage in a storage account (which lets you write to a mounted network drive from both VM instances). DocumentDB is another option, suited to storing JSON data.
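If you go the Redis route, a minimal sketch with the redis-py client (the cache host name, access key and key naming are placeholders) shows how any instance in the scale set can read and modify the same temporary state:

```python
import json
from typing import Optional

import redis

# Placeholder values; Azure Redis Cache exposes an SSL endpoint on port 6380.
cache = redis.StrictRedis(
    host="mycache.redis.cache.windows.net",
    port=6380,
    password="<access-key>",
    ssl=True,
)

def save_device_state(device_id: str, state: dict) -> None:
    # Any VM instance can write the shared state...
    cache.set(f"device:{device_id}", json.dumps(state), ex=3600)  # expire after 1 hour

def load_device_state(device_id: str) -> Optional[dict]:
    # ...and any other instance can read and modify it.
    raw = cache.get(f"device:{device_id}")
    return json.loads(raw) if raw else None
```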
You could use Service Fabric and take advantage of Reliable Collections to have your state automagically replicated across all instances.
From https://azure.microsoft.com/en-us/documentation/articles/service-fabric-reliable-services-reliable-collections/:
The classes in the Microsoft.ServiceFabric.Data.Collections namespace provide a set of out-of-the-box collections that automatically make your state highly available. Developers need to program only to the Reliable Collection APIs and let Reliable Collections manage the replicated and local state.
The key difference between Reliable Collections and other high-availability technologies (such as Redis, Azure Table service, and Azure Queue service) is that the state is kept locally in the service instance while also being made highly available.
Reliable Collections can be thought of as the natural evolution of the System.Collections classes: a new set of collections that are designed for the cloud and multi-computer applications without increasing complexity for the developer. As such, Reliable Collections are:
Replicated: State changes are replicated for high availability.
Persisted: Data is persisted to disk for durability against large-scale outages (for example, a datacenter power outage).
Asynchronous: APIs are asynchronous to ensure that threads are not blocked when incurring IO.
Transactional: APIs utilize the abstraction of transactions so you can manage multiple Reliable Collections within a service easily.
Working with Reliable Collections -
https://azure.microsoft.com/en-us/documentation/articles/service-fabric-work-with-reliable-collections/

Is AKKA trying to do in memory the same as Azure Service Bus Queue does on disk?

There are many benefits that an actor model like Akka.NET brings to the table, such as scalability, reactiveness, in-memory caching and so on. When I tried to compare Akka with Azure Service Bus queues, I saw pretty much the same primary benefits in Azure Service Bus, except for in-memory caching.
In a production environment, Akka requires multiple VMs with more memory and processing power to handle millions of actors in memory. With Azure Service Bus queues, powerful hosts are not needed. Even if we use the actor model, there is no need for supervision or for creating an actor system to manage millions of actors. Scalability is automatic with Azure Service Bus.
In the long run, I think Azure Service Bus queues are more cost-effective. No IT admin is required to manage them as the load increases, and there is no need for powerful multi-core systems either.
Is the Akka actor model only suitable for on-premises data centers with multi-core systems, and not for apps that can use Azure services, when thinking in terms of cost-effectiveness?
I would say that the mailbox component of Akka.NET does a job similar in principle to Azure Service Bus queues (or MSMQ, for that matter). The difference is that a Service Bus queue (or, in the case of MSMQ, a persistent queue rather than a memory queue) persists the messages to storage, while Akka's mailbox out of the box does not.
But that is where the similarities end. Service Bus has no concept of actors, processors or distribution; it is only a messaging solution.
If you are looking to implement actors where persistence (and transactional "actors") is the goal, then I would suggest not using Akka.NET but instead either MassTransit or NServiceBus. Both are very capable products, and while their performance won't equal Akka.NET's (MT and NSB serialize and persist to disk; Akka.NET doesn't), they are capable and sturdy and will recover transparently when an actor/node crashes, since the whole work item sits in a transaction, which causes the message to be reinserted/reprocessed as the default behavior.
I have long looked at Azure Table Storage as a good event store, but I would use a service bus (MassTransit, NServiceBus) instead if you wish to keep a record of commands, because to me it feels like your needs are more "transactional" as well as "near-real-time" and "multiprocessor" in nature.
Just my personal opinion.

Why we need Transient Fault Handling for storage?

I saw some threads saying that the storage library already has a retry policy,
so why should we use the Transient Fault Handling block?
Can anyone show me some samples of how to use the Transient Fault Handling block properly with my blob and table storage?
Thanks!
For implementation examples, read on in the link you sent, in the Key Scenarios section. If you aren't having problems connecting, then don't implement it. We use it, but it hasn't helped as far as we know: every fault we've encountered has been a longer-term, Azure-internal network issue that caused faults the TFHB couldn't handle.
One reason (and the reason I use it in my application) is that the Transient Fault Handling Application Block provides retry logic not only for storage (tables, blobs and queues) but also for SQL Azure and Service Bus queues. If your project makes use of these additional resources (namely SQL Azure and Service Bus queues) and you want a single library to handle transient faults, I would recommend using it over the storage client library.
Another reason I would give for using this library is its extensibility. You could probably extend it to handle other error scenarios (not covered by the storage client library's retry policies) or use it against other web resources, such as the Service Management API.
If you're just using blob and table storage, you could very well use the retry policies which come with storage client library.
I don't use the Transient Fault Handling block for blob storage; it may be more applicable to table storage or when transmitting larger chunks of data. Given that I use blob storage containers to archive debug information (in the form of short txt files) for certain areas of the site, it seems a little convoluted. I've never once witnessed a failure writing to storage, and we write tens of thousands of logs a week. Of course, different usage of storage may yield different reliability.
For tables and blobs you do not need any external transient retry blocks, as far as I know; the retry policies implemented within the SDK are fairly robust. If you think you need a special retry policy, the way to do that is to implement your own policy inheriting from the Azure Storage IRetryPolicy interface and pass it to your storage requests via the TableRequestOptions.RetryPolicy property.
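For readers who just want to see what the pattern looks like, here is a generic retry-with-exponential-backoff sketch in Python; it only illustrates the idea behind the Transient Fault Handling block and the storage client library's built-in retry policies, not either library's actual API (the wrapped upload_blob call is hypothetical):

```python
import random
import time

def with_retries(operation, max_attempts=5, base_delay=0.5,
                 transient_exceptions=(IOError, TimeoutError)):
    """Run `operation`, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except transient_exceptions:
            if attempt == max_attempts:
                raise  # no retries left: surface the fault to the caller
            # Back off 0.5 s, 1 s, 2 s, ... plus a little jitter.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))

# Usage (hypothetical storage call):
# result = with_retries(lambda: upload_blob(container, blob_name, data))
```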

Migrating database to SQL Azure

As far as I know, the key points for migrating an existing database to SQL Azure are:
Tables have to contain a clustered index; this is mandatory (a quick check for tables that lack one is sketched below the question).
Schema and data migration should be done through Data Sync, bulk copy, or the SQL Azure Migration Wizard, but not with the restore option in SSMS.
The .NET code should handle the transient conditions related to SQL Azure.
Logins have to be created in the master database.
Some T-SQL features may not be supported.
And I think that's all, am I right? Am I missing any other consideration before starting a migration?
Kind regards.
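Regarding the clustered-index requirement, tables without a clustered index are stored as heaps and show up in sys.indexes with type = 0, so a quick pre-migration check could look like this (the connection string is a placeholder):

```python
import pyodbc

CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=onpremserver;DATABASE=mydb;Trusted_Connection=yes")

# Tables without a clustered index are stored as heaps (sys.indexes.type = 0).
HEAP_QUERY = """
SELECT s.name AS schema_name, t.name AS table_name
FROM sys.tables AS t
JOIN sys.schemas AS s ON t.schema_id = s.schema_id
JOIN sys.indexes AS i ON t.object_id = i.object_id
WHERE i.type = 0
ORDER BY s.name, t.name;
"""

with pyodbc.connect(CONN_STR) as conn:
    for schema_name, table_name in conn.cursor().execute(HEAP_QUERY):
        print(f"Missing clustered index: {schema_name}.{table_name}")
```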
Update 2015-08-06
The Web and Business editions are no longer available; they have been replaced by the Basic, Standard and Premium tiers.
CLR stored procedure support is now available.
New: SQL Server support for linked servers and distributed queries against Windows Azure SQL Database; more info.
Additional considerations:
Basic tier allows 2 GB
Standard tier allows 250 GB
Premium tier allows 500 GB
The following features are NOT supported:
Distributed Transactions, see feature request on UserVoice
SQL Service broker, see feature request on UserVoice
I'd add bandwidth considerations (for the initial population and for ongoing traffic). This has both cost and performance implications.
Another potential consideration is any long-running process or large transaction that could be subject to SQL Azure's rather cryptic throttling techniques.
Another key area to point out is SQL jobs. Since SQL Agent is not running, SQL jobs are not supported.
One way to migrate these jobs is to refactor them so that a worker role can kick off the tasks. The body of a job might be moved into a stored procedure to reduce re-architecture; the worker role can then be designed to wake up at the appropriate time and kick off the stored procedure (see the sketch below).
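A very small sketch of that worker pattern (plain Python standing in for the worker role; the stored procedure name, connection string and schedule are hypothetical):

```python
import time

import pyodbc

CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=tcp:myserver.database.windows.net,1433;"
            "DATABASE=mydb;UID=appuser;PWD=<password>")
INTERVAL_SECONDS = 24 * 60 * 60  # wake up once a day, like the old SQL Agent schedule

def run_job_once() -> None:
    # The body of the former SQL Agent job lives in a stored procedure (hypothetical name).
    with pyodbc.connect(CONN_STR) as conn:
        conn.cursor().execute("EXEC dbo.NightlyCleanupJob")
        conn.commit()

if __name__ == "__main__":
    while True:
        run_job_once()
        time.sleep(INTERVAL_SECONDS)
```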
