Apache pulsar infinite retention

Apache pulsar infinite retention - apache-pulsar

In Apache Pulsar topic documentation it says can we set a topic time retention policy to -1 for infinite time based retention, What are the downsides of having infinite retention and can we use pulsar as message store where data lives forever in topics and build event sourcing application around them?

The downside is that your data will grow forever. However, due to the segment based architecture of the underlying storage (bookkeeper), more space can by added by adding storage nodes (i.e. all the data doesn't have to fit on one machine, as is the case in some other systems).
The segment based architecture also makes it fairly straightforward to move data to a bulk storage system (s3 or something) while still having it available from Pulsar. However, this is still in earlier stages of discussion right now.

Actually, you can and should use Pulsar's Tiered Storage option to offload your older data to more cost effective storage such as S3, Google Blob Storage, or HDFS. Unlike Kafka, Pulsar has decoupled the serving layers from the storage layers, which allows this. In Kafka, you would have to "add hard drives endlessly" and broker instances to store them.

Using the benefits to Pulsar is a better option because it provides more organization for your data store. Since Pulsar's strength is a storage layer that separates tiered storage away from topics, I would recommend going that route because your data will both me more secure and easily accessible.

Related

Apache Pulsar - use cases for infinite retention of a topic

I am actually planing our next version of our telemetry system architecture. I am strongly considering Pulsar at the messaging solution.
To better understand what's this technology is best for, can someone share their use cases of why their use the infinite retention of a topic other than audit trail ?
I was main goal is to see if our telemetry data could be simply stored in a pulsar topic and query that for analytics purpose instead of using a time series database like Apache Druid.
Thanks !

The use-case I've had for infinite retention is when you want to store the history going back to the beginning: e.g. in an event-sourcing style approach, the longer you're keeping the events archived, the more able you are to remix your state.
With durable-log style storage, remember that it heavily optimizes for slurping the log starting at some point. For higher-volume queries or queries with strict latency requirements, this is generally pretty unsuited for that sort of workload, and even more so if you can't limit reads to a single partition (remember also that with multiple partitions, even the ordering of the messages in the log may be difficult to reconstruct). For infrequent queries with loose latency requirements, though, storing them in pulsar might not be that bad, especially if you'd be using pulsar already to feed data into the time-series store (as you can then dispense with the time-series store).

Azure: Redis vs Table Storage for web cache

We currently use Redis as our persistent cache for our web application but with it's limited memory and cost I'm starting to consider whether Table storage is a viable option.
The data we store is fairly basic json data with a clear 2 part key which we'd use for the partition and row key in table storage so I'm hoping that would mean fast querying.
I appreciate one is in memory and one is out so table storage will be a bit slower but as we scale I believe there is only one CPU serving data from a Redis cache whereas with Table storage we wouldn't have that issue as it would be down to the number of web servers we have running.
Does anyone have any experience of using Table storage in this way or comparisons between the 2.
I should add we use Redis in a very minimalist way get/set and nothing more, we evict our own data and failing that leave the eviction to Redis when it runs out of space.

This is a fairly broad/opinion-soliciting question. But from an objective perspective, these are the attributes you'll want to consider when deciding which to use:
Table Storage is a durable, key/value store. As such, content doesn't expire. You'll be responsible for clearing out data.
Table Storage scales to 500TB.
Redis is scalable horizontally across multiple nodes (or, scalable via Redis Service). In contrast, Table Storage will provide up to 2,000 transactions / sec on a partition, 20,000 transactions / sec across the storage account, and to scale beyond, you'd need to utilize multiple storage accounts.
Table Storage will have a significantly lower cost footprint than a VM or Redis service.
Redis provides features beyond Azure Storage tables (such as pub/sub, content eviction, etc).
Both Table Storage and Redis Cache are accessible via an endpoint, with many language-specific SDK wrappers around the API's.

I find some metrials about the azure redis and table, hope that it can help you.There is a video about Azure Redis that also including a demo to compare between table storage and redis about from 50th minute in the videos.
Perhaps it can be as reference. But detail performance it depends on your application, data records and so on.
The pricing of the table storage depends on the capacity of table storage, please refer to details. It is much cheaper than redis.

There are many differences you might care about, including price, performance, and feature set. And, persistence of data, and data consistency.
Because redis is an in-memory data store it is pretty expensive. This is so that you may get low latency. Check out Azure's planning FAQ here for a general understanding of redis performance in a throughput sense.
Azure Redis planning FAQ
Redis does have an optional persistence feature, that you can turn on, if you want your data persisted and restored when the servers have rare downtime. But it doesn't have a strong consistency guarantee.
Azure Table Storage is not a caching solution. It’s a persistent storage solution, and saves the data permanently on some kind of disk. Historically (disclaimer I have not look for the latest and greatest performance numbers) it has much higher read and write latency. It is also strictly a key-value store model (with two-part keys). Values can have properties but with many strict limitations, around size of objects you can store, length of properties, and so on. These limitations are inflexible and painful if your app runs up against them.
Redis has a larger feature set. It can do key-value but also has a bunch of other data structures like sets and lists, and many apps can find ways to benefit from that added flexibility.
See 'Introduction to Redis' (redis docs) .
CosmosDB could be yet another alternative to consider if you're leaning primarily towards Azure technologies. It is pretty expensive, but quite fast and feature-rich. While also being primarily intended to be a persistent store.

Client / Server syncing with Azure Table Storage

There must be a solution to this already but i'm having an issue finding it.
We have data stored in table storage and we are syncing it with an offline capable client web app over a restful api (Web API).
We are using a high watermark(currently a date time) to make sure we only download the data which has changed/added.
e.g. clients/get?watermark=2013-12-16 10:00
The problem we are facing with this approach is what happens in the edge case where multiple servers are inserting data whilst a get happens. There is a possibility that data could be inserted with a timestamp lower than the client's timestamp.
Should we worry about this or can someone recommend a better way of doing this?
I believe our main issue is inserting the data into the store. At this point there is no way to guarantee the timestamp used or the Azure box has the correct time against the other azure boxes.

Are you able to insert data into queues when inserting data into table storage? If you are able to do so, you can build off a sync that monitors the queue and inserts data based upon what's in the queue. This will allow you to not worry about timestamps and date-sync issues.
Will also make your table storage scanning faster, as you'll be able to go direct to table storage by Partition/Row keys that would presumably be in the queue messages
Edited to provide further information:
I re-read your question and realized you're looking to sync with many client applications and not necessary with a single premise-sync system which I assumed originally.
In this case, I'm slightly tweaking my suggestion:
Consider using Service Bus and publishing messages to a Service Bus Topic, everytime you change/insert Azure Table Story (ATS) entity. This message could contain an individual PartitionKey/RowKey or perhaps some other meta information as to which ATS entities have been changed.
Your individual disconnectable clients would subscribe to the Service Bus Topic through an individual Service Bus Topic Subscription and be able to pull and handle individual service bus messages and sync whatever ATS entities described in those messages.
This way you'll not really care about last-modified timestamps of your entities and only care about handling pulling messages from the service bus topic. If your client pulls all of the messages from a topic and synchronizes all of the entities that those messages describe, it has synchronized itself, regardless of the number of workers that are inserting data into ATS and timestamps with which they insert those entities.

When you're working in a disconnected/distributed environment is hard to keep things in sync based on actual time (for this to work correctly the time needs to be in sync between all actors).
Instead you should try looking at logical clocks (like a vector clock). You'll find plenty of Java examples but if you're planning to do this in .NET the examples are pretty limited.
On the other hand you might want to take a look at how the Sync Framework handles synchronization.

Azure table storage and caching

Is it worth caching data from Azure Table storage with the Azure Caching Preview?
Or is the table storage fast enough in large scale applications?
Thanks

The short answer is it depends. In the application I am currently working on there is some information that we use caching for to handle both the latency of retrieving data from Table Storage and to accommodate the desired number of transactions per second.
We started out serving the information from Table Storage and moved to caching only when our performance requirements dictated it. I'd recommend a similar approach: make it work, then make it fast.

In addition to what Robert said, you should also consider following points:
Windows Azure Table Storage allows to store up to 100TB in size (in chunks). At first glance, that size of data may seem overwhelming. However, Table Storage can be partitioned. Each partition of Table Storage can be moved to a separate server by the Azure controller thereby reducing the load on any single server and improving performance.
If you have very high load on your application, you cache with frequent inserts will approach the maximum cache size very quickly and then cache items eviction process starts. In most cases frequent inserts into cache and frequent cache items eviction processes end up with performance degradation instead of improvement. Then you would need to increase cache maximum size, which in turn will affect your application cost (sometimes this might be a blocker).
Last but not least, you can access Windows Azure Table Storage data using the OData protocol and LINQ queries with WCF Data Service .NET Libraries; you do not have that ability with Azure Cache.
Please bear in mind that those points may or may not be valid in your case. All depends on your system architecture; expected load etc.
I hope my answer will help you in making good system architecture decisions.

Is storing data in Windows Azure cheaper when using RavenDB rather than SQL Azure?

SQL Azure storage is a lot more expensive than Windows Azure Storage. Would implementing a no-sql solution like RavenDB allow me to store data on the cheaper Azure Storage?
Are there other things to consider, like backup, speed or security?
Thank you.

You have to consider that with SQL Azure you not only get the storage, but the database server too. If you implement RavenDB, you will will need a worker role to host it in and, in order to allow for failure of that worker role, another worker role (replica), which also doubles up the storage.
Bear in mind that with SQL Azure you get a highly available (3x replicated with failover) SQL solution that surfaces a familiar (ADO.NET) API. Make your choices based on aspects other than storage cost, such as operational effort and development effort. If you choose RavenDB it should be because of the potential cost savings in development effort (because of the closeness on the document API to the object graph) and operational cost, because RavenDB is 'administered' as part of the application. Cost of storage of actual bytes, particularly at scale, is a marginal consideration.

Adding a bit to #Simon's answer: When considering Table Storage and its low cost, also consider whether you can use it directly, instead of going with an installed-and-managed-by-you NoSQL database engine. As it stands, Table Storage offers a schemaless solution that lets you store essentially a property bag within a row, indexed by partitionkey+rowkey. Does that work for you? Could you work with a few extra tables to give you additional indexing? If so, your storage cost is going to be really low (and still durable, triple-replicated).
If you find yourself writing significant code to manage Table Storage, then it may be more efficient to invest in the Compute instances needed to run RavenDB. When considering this, also consider that you'll likely want larger VM sizes if you're moving significant data (as you get approx. 100Mbps per core). A database like MongoDB, working with memory-mapped files, really ramps up speed-wise with more RAM. Not sure if this is the same with RavenDB.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string