I am considering Redis for storing a large amount of data in cache. Currently I store it in my own cache written in Java. My use case is below.
I get 15-minute data from a source and I need to aggregate the data hourly. So for a given object A, every hour I will get 4 values and I need to aggregate them into one value; the formula I will use is max / min / sum.
For making the key I plan to use the following:
a) object id - long
b) time - long
c) property id - int (each object may have many property which I need to aggregate for each property separately)
So the final key would look like:
objectid_time_propertyid
Every 15 minutes I may get around 50 to 60 million keys. Each time I need to fetch these keys, convert the property value to double, apply the formula (max/min/sum etc.), then convert it back to String and store it back.
So for every key I have one read and one write, plus a conversion in each case.
My questions are the following:
Is it advisable to use Redis for such a use case? Going forward I may aggregate hourly data to daily, daily to weekly and so on.
What would the performance of reads and writes in the cache be? (I did a sample test on Windows and reading and writing 100K keys took 30-40 seconds; that's not great, but I did it on Windows and I will ultimately need to run on Linux.)
I want to use the persistence function of Redis; what are its pros and cons?
If anyone has real experience using Redis as a cache that requires frequent updates, please give a suggestion.
Is it advisable to use Redis for such a use case? Going forward I may aggregate hourly data to daily, daily to weekly and so on.
Whether it's advisable depends on who you ask, but I certainly feel Redis will be up to the job. If a single server isn't enough, your description suggests that the dataset can be easily sharded, so a cluster will let you scale.
I would advise, however, that you store your data a little differently. First, every key in Redis has an overhead, so the more of these you have, the more RAM you'll need. Therefore, instead of keeping a key per object-time-property, I recommend Hashes as a means of aggregating some values together. For example, you could use an object_id:timestamp key and store the property_id:value pairs under it.
Furthermore, instead of keeping the 4 discrete measurements for each object-property by timestamp and recomputing your aggregates, I suggest you keep just the aggregates and update these with new measurements. So, you'd basically have an object_id Hash, with the following structure:
object_id:hourtimestamp -> property_id1:max = x
                           property_id1:min = y
                           property_id1:sum = z
When getting new data - d - for an object's property, just recompute the aggregates:
property_id1:max = max(x, d)
property_id1:min = min(y, d)
property_id1:sum = z + d
Repeat the same for every resolution needed, e.g. use object_id:daytimestamp to keep day-level aggregates.
Finally, don't forget to expire your keys once they are no longer required (e.g. set a 24-hour TTL for the hourly counters and so forth).
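To make the Hash-based aggregation concrete, here is a rough Java sketch using the Jedis client (Jedis, the key format and the helper name are my assumptions, not something prescribed above; note also that the read-then-write for max/min is not atomic, so with concurrent writers you would wrap it in a Lua script or a MULTI/EXEC transaction):

import redis.clients.jedis.Jedis;

public class HourlyAggregator {

    // Fold one new 15-minute measurement into the hourly aggregates kept in a Hash
    // keyed by "<objectId>:<hourTimestamp>" with fields "<propertyId>:max|min|sum".
    public static void update(Jedis jedis, long objectId, long hourTimestamp,
                              int propertyId, double value) {
        String key = objectId + ":" + hourTimestamp;
        String field = propertyId + ":";

        String max = jedis.hget(key, field + "max");
        if (max == null || value > Double.parseDouble(max)) {
            jedis.hset(key, field + "max", Double.toString(value));
        }

        String min = jedis.hget(key, field + "min");
        if (min == null || value < Double.parseDouble(min)) {
            jedis.hset(key, field + "min", Double.toString(value));
        }

        // HINCRBYFLOAT keeps the running sum server-side, no read-modify-write needed.
        jedis.hincrByFloat(key, field + "sum", value);

        // Expire the hourly hash once it is no longer needed (24 hours here).
        jedis.expire(key, 24 * 60 * 60);
    }
}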
There are other possible approaches, mainly using Sorted Sets, that can be applicable to solve your querying needs (remember that storing the data is the easy part - getting it back is usually harder ;)).
What would the performance of reads and writes in the cache be? (I did a sample test on Windows and reading and writing 100K keys took 30-40 seconds; that's not great, but I did it on Windows and I will ultimately need to run on Linux.)
Redis, when running on my laptop on Linux in a VM, does in excess of 500K reads and writes per second. Performance is very dependent on how you use Redis' data types and API. Given your throughput of 60 million values over 15 minutes, or ~70K/sec writes of smallish data, Redis is more than equipped to handle that.
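For what it's worth, a common reason a loop over 100K keys takes tens of seconds is that each command pays a full network round trip; batching with pipelining usually changes the picture dramatically. A minimal sketch with the Jedis client (Jedis is my assumption, since the question mentions a Java cache):

import java.util.Map;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

public class BulkWriter {

    // Send many SET commands in one network round trip instead of one per command.
    public static void bulkSet(Jedis jedis, Map<String, String> entries) {
        Pipeline pipeline = jedis.pipelined();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            pipeline.set(e.getKey(), e.getValue());
        }
        pipeline.sync(); // flush buffered commands and wait for all replies
    }
}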
I want to use the persistence function of Redis; what are its pros and cons?
This is an extremely well-documented subject - please refer to http://redis.io/topics/persistence and http://oldblog.antirez.com/post/redis-persistence-demystified.html for starters.
I have a single structured row as input with a write rate of 10K per second. Each row has 20 columns. Some queries should be answered on these inputs. Because most of the queries need different WHERE, GROUP BY or ORDER BY clauses, the final data model ended up like this:
primary key for table of query1 : ((column1,column2),column3,column4)
primary key for table of query2 : ((column3,column4),column2,column1)
and so on
I am aware of the limit on the number of tables in a Cassandra data model (200 triggers a warning and 500 would fail).
Because for every input row I have to do an insert into every table, the final writes per second become big * big data:
writes per second = 10K (input) * number of tables (queries) * replication factor
(For example, with 5 query tables and a replication factor of 3, that would already be 150K writes per second.)
The main question: am I on the right path? Is it normal to have a table for every query even when the input rate is already so high?
Shouldn't I use something like Spark or Hadoop instead of relying on the bare data model? Or even HBase instead of Cassandra?
It could be that Elassandra would resolve your problem.
The query system is quite different from CQL, but the duplication for indexing would automatically be managed by Elassandra on the backend. All the columns of one table will be indexed so the Elasticsearch part of Elassandra can be used with the REST API to query anything you'd like.
In one of my tests, I pushed a huge amount of data to an Elassandra database (8 GB) going non-stop and I never timed out. Also, the search engine remained ready pretty much the whole time. More or less what you are talking about. The docs say that it takes 5 to 10 seconds for newly added data to become available in the Elassandra indexes. I guess it will somewhat depend on your installation, but I think that's more than enough speed for most applications.
The use of Elassandra may sound a bit hairy at first, but once in place, it's incredible how fast you can find results. It includes very powerful WHERE support for sure. The GROUP BY is a bit difficult to put in place. The ORDER BY is simple enough; however, when (re-)ordering you lose some speed... Something to keep in mind. On my tests, though, even the ORDER BY equivalents were very fast.
It may be too much turkey over the holidays, but I've been thinking about a potential problem that we could have with Couchbase.
Currently we paginate based on time, but I'm thinking a similar issue could occur with other values used for paging, for example the atomic counter. I'll try to explain as best I can; this would only occur in a load-balanced environment.
For example, say we have 4 servers load balanced and storing data to our Couchbase cluster. We currently sort our records based on timestamps. If any of the 4 servers writing the data starts to lag behind the others, then our pagination could be missing records when retrieving on the client side. With a SQL DB, for example, an auto-increment value or timestamp can be created when the record is stored to the DB, which avoids similar issues. With a NoSQL DB like Couchbase, you define the data you need to retrieve on before it is stored to the DB. So what I am getting at is that if there is a delay in storing to the DB and you are retrieving in a paginated fashion while this delay has occurred, you run a real possibility of missing data. Since we are paging, that data may never be viewed.
Interested in what other thoughts people have on this.
EDIT:
Response to Andrew:
For example, a Facebook or Pinterest type app is storing data to a DB; they have many load-balanced servers on the frontend writing to the DB. If for some reason a write is delayed, it's a non-issue with a SQL DB because the timestamp or auto-increment happens when the data is actually stored to the DB. There will be no missing data when paging: asking for 1-7 will give you only data that is already stored in the DB, and 7-* will contain anything that was delayed, because an auto-increment value has not been created for a record that is not actually stored.
In Couchbase it's different: you actually get your auto-increment value (atomic counter) first and then save. So, for example, say a record is going to be stored as atomic counter number 4. For some reason this is delayed in storing to the DB. Other servers are grabbing 5, 6, 7 and storing that data just fine. The client now asks for all data between 1 and 7, and 4 is still not stored. Then the next paging request is 7 to *. 4 will never be viewed.
Is there a way around this? Can it be modelled differently in CB, or is this just a potential weakness in CB when needing to page results? As I mentioned, our paging is timestamp-sensitive.
Michael,
Couchbase is an eventually consistent database with respect to views. It is ACID with respect to documents. There are durability interfaces that let you manage this. This means that you can rest assured you won't lose data and that indexes will catch up eventually.
In my experience with Couchbase, you need to expect that the nodes will never be in-sync. There are many things the database is doing, such as compaction and replication. The most important thing you can do to enhance performance is to put your views on a separate spindle from the data. And you need to ensure that your main data spindles across your cluster can sustain between 3-4 times your ingestion bandwidth. Also, make sure your main document key hashes appropriately to distribute the load.
It sounds like you are discussing a situation where the data exists in your system for less time than it takes to be processed through the view system. If you are removing data that fast, you need either a bigger cluster or faster disk arrays. Of the two choices, I would expand the size of your cluster. I like to think of Couchbase as building a RAIS, Redundant Array of Independent Servers. By expanding the cluster, you reduce the coincidence of hotspots and gain disk bandwidth. My ideal node has two local drives, one each for data and views, and enough RAM for my working set.
Anon,
Andrew
I am trying to build a real-time stock application.
Every second I can get some data from a web service, like below:
[{"amount":"20","date":1386832664,"price":"183.8","tid":5354831,"type":"sell"},{"amount":"22","date":1386832664,"price":"183.61","tid":5354833,"type":"buy"}]
tid is the ticket ID for stock buying and selling;
date is the number of seconds since 1970-01-01;
price/amount indicate at what price and how much stock was traded.
Requirement
My requirement is to show the user the highest/lowest price for every minute/5 minutes/hour/day in real time, and to show the user the sum of the amount for every minute/5 minutes/hour/day in real time.
Question
My question is how to store the data in Redis so that I can easily and quickly get the highest/lowest trade from the DB for different periods.
My design is something like below:
[date]:[tid]:amount
[date]:[tid]:price
[date]:[tid]:type
I am new to Redis. If the design is like this, does that mean I need to use a sorted set, and will there be any performance issue? Or is there another way to get the highest/lowest price for different periods?
Looking forward to your suggestions and design.
My suggestion is to store min/max/total for all intervals you are interested in and update the current ones with every arriving data point. To avoid network latency when reading previous data for comparison, you can do it entirely inside the Redis server using Lua scripting.
One key per data point (or, even worse, per data point field) is going to consume too much memory. For the best results, you should group data into small lists/hashes (see http://redis.io/topics/memory-optimization). Redis only allows one level of nesting in its data structures: if your data has multiple fields and you want to store more than one item per key, you need to somehow encode it yourself. Fortunately, the standard Redis Lua environment includes msgpack support, which is a very efficient binary JSON-like format. The JSON entries in your example encoded with msgpack "as is" will be 52-53 bytes long. I suggest grouping by time so that you have 100-1000 entries per key. Suppose a one-minute interval fits this requirement. Then the keying scheme would be like this:
YYmmddHHMMSS — a hash from tid to msgpack-encoded data points for the given minute.
5m:YYmmddHHMM, 1h:YYmmddHH, 1d:YYmmdd — window data hashes which contain min, max, sum fields.
Let's look at a sample Lua script that will accept one data point and update all keys as necessary. Due to the way Redis scripting works, we need to explicitly pass the names of all keys that will be accessed by the script, i.e. the live data key and all three window keys. Redis Lua also has a JSON parsing library available, so for the sake of simplicity let's assume we just pass it a JSON dictionary. That means we have to parse the data twice, on the application side and on the Redis side, but the performance effect of this is not clear.
local function update_window(winkey, price, amount)
  -- HGETALL returns a flat array in Lua, so read the fields individually instead
  local wmax = tonumber(redis.call('HGET', winkey, 'max'))
  local wmin = tonumber(redis.call('HGET', winkey, 'min'))
  local wsum = tonumber(redis.call('HGET', winkey, 'sum')) or 0
  if wmax == nil or price > wmax then
    redis.call('HSET', winkey, 'max', price)
  end
  if wmin == nil or price < wmin then
    redis.call('HSET', winkey, 'min', price)
  end
  redis.call('HSET', winkey, 'sum', wsum + amount)
end

local currkey, fiveminkey, hourkey, daykey = unpack(KEYS)
local data = cjson.decode(ARGV[1])
local packed = cmsgpack.pack(data)
local tid = data.tid

-- store the raw data point in the current minute's hash, keyed by ticket id
redis.call('HSET', currkey, tid, packed)

local price = tonumber(data.price)
local amount = tonumber(data.amount)

update_window(fiveminkey, price, amount)
update_window(hourkey, price, amount)
update_window(daykey, price, amount)
This setup can do thousands of updates per second, is not very hungry on memory, and window data can be retrieved instantly.
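If it helps, here is a rough sketch of how the script above could be invoked from a Java application via EVAL, using the Jedis client (the client choice, the key-formatting helpers and the minute-level truncation are my assumptions):

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import redis.clients.jedis.Jedis;

public class TradeIngest {

    private static DateTimeFormatter utc(String pattern) {
        return DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
    }

    // Push one JSON-encoded trade through the Lua script, building the four keys
    // (current minute hash plus the 5m/1h/1d window hashes) on the application side.
    public static void ingest(Jedis jedis, String luaScript, String tradeJson, long epochSeconds) {
        Instant ts = Instant.ofEpochSecond(epochSeconds);
        // round down to the start of the 5-minute bucket
        Instant fiveMin = Instant.ofEpochSecond((epochSeconds / 300) * 300);

        List<String> keys = Arrays.asList(
                utc("yyMMddHHmm").format(ts),              // per-minute hash of raw points
                "5m:" + utc("yyMMddHHmm").format(fiveMin),
                "1h:" + utc("yyMMddHH").format(ts),
                "1d:" + utc("yyMMdd").format(ts));

        // EVAL runs the script atomically on the Redis server
        jedis.eval(luaScript, keys, Arrays.asList(tradeJson));
    }
}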
UPDATE: On the memory side, 50-60 bytes per point is still a lot if you want to store more than a few million of them. With this kind of data I think you can get as low as 2-3 bytes per point using a custom binary format, delta encoding, and subsequent compression of chunks using something like snappy. Whether it's worth doing depends on your requirements.
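As a rough illustration of that last idea (not part of the original answer), here is a small Java sketch that delta-encodes fixed-point prices and compresses the chunk; java.util.zip.Deflater stands in for snappy to keep it dependency-free, and the actual byte savings will depend on your data:

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

public class ChunkCodec {

    // Delta-encode fixed-point values (e.g. price * 100 stored as long) and compress.
    public static byte[] encode(long[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Long.BYTES);
        long previous = 0;
        for (long v : values) {
            buffer.putLong(v - previous); // small deltas compress far better than raw values
            previous = v;
        }

        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(buffer.array());
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] scratch = new byte[4096];
        while (!deflater.finished()) {
            out.write(scratch, 0, deflater.deflate(scratch));
        }
        deflater.end();
        return out.toByteArray();
    }
}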
I have a case where I need to record a user action in Cassandra, then later retrieve a sorted list of users with the highest number of that action in an arbitrary time period.
Can anyone suggest a way to store and retrieve this data in a pre-aggregated method?
Outside of Cassandra, I would recommend using stream-summary or count-min sketch; you would be able to solve this with much less space and have immediate results. Just update it and periodically serialize and persist it (assuming you don't need guaranteed accuracy).
In Cassandra you can keep a row per period of time (e.g. per hour) and have a counter per user in that row, incrementing counters on use. Then use a batch job to run through them and find the heavy hitters. You would be constrained to a minimum queryable time of 1 hour, and it won't be particularly cheap or fast to compute, but it would work.
Generally it would be good to treat these as a log of operations: every time there is an event, store it and have batch jobs run analytics against it with Hadoop or custom code. If you need it in real time, I'd recommend the above approach of keeping stream summaries in memory.
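For reference, a bare-bones count-min sketch is only a few dozen lines; the Java sketch below is illustrative (the hashing scheme and parameters are my own choices, and in practice a library such as stream-lib would be the sensible route):

import java.util.Random;

// Approximate per-item counts in fixed memory; estimates can only over-count.
public class CountMinSketch {

    private final int depth;      // number of hash rows
    private final int width;      // counters per row
    private final long[][] table;
    private final long[] seeds;

    public CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new long[depth];
        Random random = new Random(seed);
        for (int i = 0; i < depth; i++) {
            seeds[i] = random.nextLong();
        }
    }

    private int bucket(String item, int row) {
        long h = item.hashCode() * 0x9E3779B97F4A7C15L + seeds[row];
        h ^= (h >>> 33);
        return (int) ((h & Long.MAX_VALUE) % width);
    }

    public void add(String item, long count) {
        for (int i = 0; i < depth; i++) {
            table[i][bucket(item, i)] += count;
        }
    }

    public long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            min = Math.min(min, table[i][bucket(item, i)]);
        }
        return min;
    }
}

Feeding every user action into add(userId, 1) and calling estimate() for candidate users gives an approximate leaderboard; stream-summary (Space-Saving) is the better fit if you only ever need the top-N.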
I have some software which collects data over a long period of time, approx 200 readings per second. It uses an SQL database for this. I am looking to move a lot of my old "archived" data to Azure.
The software uses a multi-tenant type architecture, so I am planning to use one Azure Table per Tenant. Each tenant is perhaps monitoring 10-20 different metrics, so I am planning to use the Metric ID (int) as the Partition Key.
Since each metric will only have one reading per minute (max), I am planning to use DateTime.Ticks.ToString("d19") as my RowKey.
I am lacking a little understanding as to how this will scale, however, so I was hoping somebody might be able to clear this up:
For performance, Azure will/might split my table by PartitionKey in order to keep things nice and quick. This would result in one partition per metric in this case.
However, my RowKey could potentially represent data over approx 5 years, so I estimate approx 2.5 million rows.
Is Azure clever enough to then split based on RowKey as well, or am I designing in a future bottleneck? I know normally not to prematurely optimise, but with something like Azure that doesn't seem as sensible as normal!
Looking for an Azure expert to let me know if I am on the right line or whether I should be partitioning my data into more tables too.
A few comments:
Apart from storing the data, you may also want to look into how you would want to retrieve the data as that may change your design considerably. Some of the questions you might want to ask yourself:
When I retrieve the data, will I always be retrieving the data for a particular metric and for a date/time range?
Or do I need to retrieve the data for all metrics for a particular date/time range? If this is the case then you're looking at a full table scan. Obviously you could avoid this by doing multiple queries (one query per PartitionKey).
Do I need to see the latest results first, or do I not really care? If it's the former, then your RowKey strategy should be something like (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19").
Also, since PartitionKey is a string value, you may want to convert the int value to a string value with some "0" padding in front so that all your ids appear in order; otherwise you'll get 1, 10, 11, ..., 19, 2, ... etc.
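To illustrate those two key-formatting points (the original expressions are .NET; this is just an analogous Java sketch, with epoch milliseconds standing in for DateTime ticks):

public class TableKeys {

    // Zero-pad the numeric metric id so PartitionKeys sort in numeric order as strings.
    public static String partitionKey(int metricId) {
        return String.format("%010d", metricId);
    }

    // Reverse-chronological RowKey: later timestamps yield lexicographically smaller
    // strings, so the newest rows come back first (same idea as
    // (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19")).
    public static String descendingRowKey(long epochMillis) {
        return String.format("%019d", Long.MAX_VALUE - epochMillis);
    }
}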
To the best of my knowledge, Windows Azure partitions the data based on PartitionKey only and not the RowKey. Within a partition, the RowKey serves as the unique key. Windows Azure will try to keep data with the same PartitionKey on the same node, but since each node is a physical device (and thus has a size limitation), the data may flow to another node as well.
You may want to read this blog post from Windows Azure Storage Team: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx.
UPDATE
Based on your comments below and some information from above, let's try and do some math. This is based on the latest scalability targets published here: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/04/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx. The documentation states that:
Single Table Partition – a table partition is all of the entities in a table with the same partition key value, and usually tables have many partitions. The throughput target for a single table partition is: up to 2,000 entities per second. Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning can process up to the 20,000 entities/second, which is the overall account target described above.
Now, you mentioned that you have 10-20 different metric points, and for each metric point you'll write a maximum of 1 record per minute. That means you would be writing a maximum of 20 entities / minute / table, which is well under the scalability target of 2,000 entities / second.
Now the question of reading remains. Assume a user would read a maximum of 24 hours' worth of data (i.e. 24 * 60 = 1,440 points) per partition. If the user gets the data for all 20 metrics for 1 day, then each user (and thus each table) will fetch a maximum of 28,800 data points. The question left for you, I guess, is how many requests like this you can get per second and whether that meets your threshold. If you can somehow extrapolate this information, I think you can reach some conclusion about the scalability of your architecture.
I would also recommend watching this video as well: http://channel9.msdn.com/Events/Build/2012/4-004.
Hope this helps.