Do we need a cache layer for cassandra? - cassandra

I am playing/evaluating a realtime streaming application with millions of user activity logs, the design was to use Cassandra as a persistent store and use redis as a cache layer to store recent activities (last 1000). I am looking for a suggestion whether such a cache layer necessary along with cassandra. Is cassandra capable to get best read and write performance? The activities are streamed to front end as pages of 10 or 15 records. Suggestions are expected to use any alternative noSQL solutions as well

It depends a lot on your requirements - Cassandra is reasonably fast for most common purposes, but redis will be faster, so having a caching layer is a reasonable and common approach. It's not strictly necessary, but it's not a bad idea.

Related

Apache Pulsar - use cases for infinite retention of a topic

I am actually planing our next version of our telemetry system architecture. I am strongly considering Pulsar at the messaging solution.
To better understand what's this technology is best for, can someone share their use cases of why their use the infinite retention of a topic other than audit trail ?
I was main goal is to see if our telemetry data could be simply stored in a pulsar topic and query that for analytics purpose instead of using a time series database like Apache Druid.
Thanks !
The use-case I've had for infinite retention is when you want to store the history going back to the beginning: e.g. in an event-sourcing style approach, the longer you're keeping the events archived, the more able you are to remix your state.
With durable-log style storage, remember that it heavily optimizes for slurping the log starting at some point. For higher-volume queries or queries with strict latency requirements, this is generally pretty unsuited for that sort of workload, and even more so if you can't limit reads to a single partition (remember also that with multiple partitions, even the ordering of the messages in the log may be difficult to reconstruct). For infrequent queries with loose latency requirements, though, storing them in pulsar might not be that bad, especially if you'd be using pulsar already to feed data into the time-series store (as you can then dispense with the time-series store).

web real time analytics dashboard: which technologies should use? (node/django, cassandra/mongodb...)

we want to develop a dashboard to analyze geospatial data.
This is a small and close approach to what we want to do: http://adilmoujahid.com/images/data-viz-talkingdata.gif
Our main concerns are about the backend technologies to be used. (front will be D3.js, DC.js, leaflet.js...)
Between Django and node.js, we think that we will use node.js, cause we've read than its faster than Django for this kind of tasks. But we are not sure and we are open to ideas.
But about Mongo or Cassandra, we are so confused. Our data is mostly structured, so store it in tables like Cassandra would make it easy to manage, also Cassandra seems to have better performance. However, we also have IoT devices data, with lots of real-time GPS location...
Which suggestions can you give to us to achieve our goal?
TL;DR Summary;
Dashboard with hundreds of simultaneous users.
Stored data will be mostly structured text/numbers, but will include also images, GPS-arrays, IoT sensors, geographical data (vector-polygons & rasters)
Databases will receive high write load coming from sensors.
Dashboard performance is so important. Its more important to read data in real time, than keeping it uncorrupted/secure.
Most calculus/math will be calculated in the client's browser, the server will try to avoid mathematical operations.
Disclaimer: I'm a DataStax employee so I'll comment on the Cassandra piece.
Cassandra is a good choice for this if your dashboard can be planned around a set of known queries. If those users will be doing ad-hoc queries directly to the database from the dashboard, you'll want something with a little more flexibility like ElasticSearch or (shameless plug) DataStax Search. Especially if you expect the queries/database to handle some of the geospatial logic.
JaguarDB has very strong support of geospatial data (2D and 3D). It allows you to store multi-measurements per point location while other databases support only one measurement (pointm). Many complex queries such as Voronoi polygon, convexhull are also supported. It is open source, distributed and sharded, multiple columns indexes, etc.
Concerning Postgresql and Cassandra, is there much difference in RAM/CPU/DISK usage between them?
Our use case does not require transactions, it will be in a single node and we will have IoT devices writing data up to 500 times per second. However ive read that Geographical data that works better with Potstgis than cassandra...
According to this use case, do you recommend Cassandra or Postgis?

Are there any benefits of using hazel cast over Cassandra

I want to implement session store for my web application. Here's the profile of my application.
The information associated with a session does not change a lot, but
it does change sometimes.
Session reads(session.getAttribute()) are more frequent than writes(session.setAttribute()).
I don't want to deal with master node based architecture(like redis).
Data associated with a session is small, but the number of sessions could be large.
The lookup is always in the form of key value like in hash map.
I am OK with eventual consistency.
I want to be able to specify the replication factor. i.e. number of nodes that will hold data for a given session
I am only looking for open source solutions that wouldn't incur license cost for above features.
For now I want to store upto 10,000 sessions with 10kb data per session(on average), but eventually I want to scale to 100,000 sessions or more!
In my app hazelcast is already being used for some other functionality. But I don't want that to be the deciding factor. Cassandra seems to fulfill all my requirements and it seems to be quite popular. Any reason I should chose hazelcast over cassandra?
Disclaimer: Hazelcast employee
In general I would argue that if you can exchange Hazelcast with Cassandra OR Cassandra with Hazelcast, one of the tools is misused.
We have plenty of people using them as companions, meaning Cassandra as the storage layer and Hazelcast as the caching layer, Cassandra, however, is not a cache and Hazelcast is not a database.
If you want to persist your storages to disk, go for Cassandra (maybe add caching with Hazelcast), if you just want to distribute, go with Hazelcast. Latter case especially if it "doesn't really matter" if you loose sessions once in a while if you (for some reason or another) restart the cluster.
We use both of them in our project. We use Cassandra as persistent storage and Hazelcast for temporary and frequently changed data (e.g. distributed queues and synchronization primitives).
Any reason I should chose hazelcast over cassandra?
In my opinion, Hazelcast is easier from developer point of view and it does not require as much attention as Cassandra on production environment (deep configuration and tunning, repeair, restarting...), so Hazelcast is cheaper in support.

DB access from a Mapper in MapReduce

I planning the next generation of an analysis system I'm developing and I think of implementing it using one of the MapReduce/Stream-Processing platforms like Flink, Spark Streaming etc.
For the analysis, the mappers must have DB access.
So my greatest concern is when a mapper is paralleled, the connections from the connection pool will all be in use and there might be a mapper that fail to access the DB.
How should I handle that?
Is it something I need to concern about?
As you have pointed out, a pull-style strategy is going to be inefficient and/or complex.
Your strategy for ingesting the meta-data from the DB will be dictated by the amount of meta-data and the frequency that the meta-data changes. Either way, moving away from fetching the meta-data when it's needed, and toward receiving updates when the meta-data is changed, is likely to be a good approach.
Some ideas:
Periodically dump the meta-data to flat file/s into distributed file system
Streaming meta-data updates to your pipeline at write-time to keep an in-memory cache up-to-date
Use a separate mechanism to fetch the meta-data, for instance Akka Actor/s polling for changes
It will depend on the trade-offs you are able to make for your given use-case.
If DB interactivity is unavoidable, I do wonder if map-reduce style frameworks would be the best approach to solve your problem. But any failed tasks should be retried by the framework.

Rate limiting - using CouchDB with Redis or CouchDB on its own

I've written an application with a CouchDB backend. I have invested a lot of time into CouchDB and so I'm reluctant to move everything over to a different NoSQL database (like Redis).
The problem is that I now need to implement a rate limiting (based on IP address) feature.
There are plenty of examples on how good Redis is for this kind of task, however because I don't want to drop CouchDB for other tasks this means I would essentially be running (and supporting) two databases (1 for most data, 1 for rate limiting) and so...
Is running CouchDB in tandem with Redis unheard of?
Is CouchDB itself suitable for handling rate limiting itself?
Is running CouchDB in tandem with Redis unheard of?
Redis is commonly used in complement with other storage solutions (MySQL, PostgreSQL, MongoDB, CouchDB, etc ...). Like many other NoSQL solutions, Redis is not adapted to all kind of workloads or situations. The authors of Redis are pragmatic and open people, and they routinely suggest to use other solutions rather than Redis, when they are more adapted to the situation.
Redis is therefore a good team player, and it is generally easy to integrate in an existing infrastructure.
Here is an example of usage of Redis with CouchDB.
Is CouchDB itself suitable for handling rate limiting itself?
CouchDB has a number of useful features to implement the rate limiting strategy described in Chris O'Hara's article. For instance, it supports bulk operations on several documents (with optional atomicity). A "bucket span" can be stored in a single document. In-place incrementation of counters can be covered by using update handlers.
IMO, the main missing feature would be automatic item expiration (which CouchDB does not provide AFAIK). So you would have to design a clever mechanism to get rid of obsolete data on top of CouchDB.
The main problem is CouchDB is not really designed for this kind of workload: it is a log structured document oriented database. Each time a counter has to be incremented, it would involve JSON unpacking/packing operations, some Javascript code to be executed, and writing a new revision of the whole document in append only files. You can find a good article describing how CouchDB stores its data here.
I suspect a rate limiting strategy implemented on top of CouchDB would not scale very well (too many I/Os, too much CPU consumption, inefficient network protocol). For instance, CouchDB is a RESTful server; I would not feel comfortable to initiate client HTTP operations (REST queries to CouchDB) to rate limit each incoming HTTP query of my system.
Redis is much more adapted to this kind of workload (fast, in-memory, no I/O, efficient client protocol, no JSON parsing/formatting, incrementations are native atomic operations, etc ...)
You can do rate limiting with Memcached - it has a nice counter increment command as you mention, plus obsolete data is automatically purged from the cache in due course, so it has all the benefits of Redis for this application without the annoying duplication of capability (and complexity) that running Redis on top of CouchDB would bring.
http://simonwillison.net/2009/jan/7/ratelimitcache/
You could add memcached to your own setup easily enough or you could investigate CouchBase whose current server product integrates a CouchDB derived database with Memcached compatibility baked in:
http://www.couchbase.com/memcached
Personally I dislike the way Couchbase forked from CouchDB, but for your application it might be a perfect fit.

Resources