DataStax Cassandra seems expensive, is there a best practice configuration to use Apache Cassandra in Production? - cassandra

DataStax seems expensive. Is there a best practice configuration that is available to use Apache Cassandra in production? I am trying to setup Cassandra on EC2.
Thanks

Instead of giving you a commercial for some other product, let me give you some practical advice when choosing to go with OSS vs Commerical licensed products.
You have two things to spend when using any technology. Time or money. Ultimately time is money, but for the sake of this let's say they are different. By your question, you have more time so let's focus on that.
Spend the time to learn the fundamentals. The term black magic is FUD. Some of the world's largest workloads are running on Cassandra. You can do it too.
Seek out peers and learn from those who have been successful. There are organizations that have been running Cassandra in prod for years.
Focus on a single use case/project. Nothing worse than trying to replace all of your infrastructure with a new technology when you are learning. Pick one thing and become proficient. Use that experience for the next projects.
You can get some free training at DataStax Academy. http://academy.datastax.com
Learn from peers by watching talks from the community of awesome users.
You can find something in these 135 talks here: https://www.youtube.com/playlist?list=PLm-EPIkBI3YoiA-02vufoEj4CgYvIQgIk
If you need to ask questions. Stack Overflow, the Cassandra mailing list, and DataStax Academy Slack are all good resources.
Using a commercial product or spending the time is up to you, but don't let anyone try to convince you that it's too hard and you should use something else. We are all here to help if we can.

Disclaimer: I'm a ScyllaDB employee.
There are several alternative to operate Cassandra/Scylla like workloads.
Use OpenSource Cassandra, with best practices. Most of them, unfortunately, where created couple of years ago. So you'll need to learn the black magic of tuning JVM and Cassandra loads.
https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
There are no "official" AMIs on AWS for recent releases of Cassandra.
Use Scylla OpenSource. It is a drop replacement for Cassandra. Scylla autotunes itself, to minimize the intervention of the operator in the day-to-day operations. Also, Scylla provides opensoure AMIs for EC2 deployment, so, all you need is an AWS account.
Scylla is a C++ implementation of Cassandra, which benefits from the great (and costly) resources on AWS. Thus, offer a better ops/$$ ratio. Scylla highly recommends the usage of I3 instances, you'd be using contemporary CPU technology, excellent I/O (NVMe based) and lots of memory at the fraction of the cost of other EC2 instances.
You can read more about it here:
http://www.scylladb.com/2017/05/10/faster-and-better-what-to-expect-running-scylla-on-aws-i3-instances/
ScyllaDB is committed to provide opensource, optimized AMI versions.
Buy enterprise licenses from DataStax or Scylla.
Hire consultants to help you install a Cassandra setup.
Companies like "the last pickle" or Pythian can help you in that regard.
Use DBaaS offerings from the following companies:
Scylla:
IBM Compose: https://www.compose.com/databases/scylladb
Joyent Triton:https://www.joyent.com/blog/free-trial-managed-scylladb-beta-on-triton
Scylla and Cassandra
Instaclustr: https://www.instaclustr.com/
Hope this helps.

Related

Janus Graph backend cassandra vs Bigtable

I am planning to use Janusgraph for building graph of different uses our team handles and I see that janus graph has option to use BigTable or Cassandra as storage backend. I am looking for any recommendation on which backend is more optimal/performant ( I am mainly talking about gremlin query performance on 2 hop neighbor of a node ) with JanusGraph.
I understand that performance is pretty subjective and varies based on datasize/graph connectivity and use case so best approach will be to try out myself, which I am planning to do. But has anyone else has done similar performance comparison ? Is there any general recommendation about storage backend here ?
You're right in that performance is both:
subjective
depends largely on data size
I can tell you that I have done this exercise as well. To that end, I think it's important to share this comparison from DB-Engines.com.
In terms of performance, the biggest thing I'd be looking at is how each handles consistency. As a general rule, databases which enforce stronger levels of consistency typically have to sacrifice performance.
BigTable == strong-consistent
Cassandra == eventually consistent
Other factors worth considering, are the fact that BigTable limits you to Google Cloud (GCP). And if you don't want to lose performance over the network, you'll also need to pay for more (Janus) instances on GCP for data locality.
In terms of raw DB-Engine "score," Cassandra is currently at 114.112, while BigTable is at a paltry 3.582. These scores will change month-to-month, but in general this signifies that Cassandra has a much stronger community around it. Similarly, Cassandra has 18182 questions on this site, while BigTable only has 449. Bottom line, is that it'll be much easier to get support and answers to questions.
Just based on the underlying strength of the community, Cassandra is the better option here.
Having supported JanusGraph on Cassandra for the last few years, I can tell you that overall it's been solid. The difficulties tend to come into play with bulk data loading. But outside of that, things seem to run pretty well.

JanusGraph + Cassandra (Generic questions)

I have a few questions regarding the integration of the two tools. Not technical questions and how to setup( i will have my fun with that later ) but more on the course of the project and the direction, seeing that JanusGraph is still very young.
I am starting a new project and already decided to use Cassandra for storage and using a graph on top sounds very appealing to me.
A couple of things that i would like to know in advance before i take that road.
JanusGraph is very young and it picks up from where Titan left about a year or so ago. There is gap there but the fact that is part of the Linux Foundation and all the big players are going to support it sounds promising. Is it safe to assume at this point that JanusGraph is here to stay? Would it be safe to depend on Janus as a startup project? And follow development of course and be up to date as much as possible.
Cassandra. Titan/JanusGraph integrates with Cassandra 2.1.9 using the thrift api which will be deprecated eventually in Cassandra 4. I know that work is being done at the moment to make janus work with Cassandra 3 and eventually work with CQL as well. Is it safe to start with existing janus and Cassandra 2.1.9 and deal with the migration later on? Will it be a huge task for a startup to handle?
Production ready JanusGraph.(This question relates to any kind of software in it's early stages and whether it's safe for a start up to use). As i understand it, it will take some time for JanusGraph to be production ready and catch up with the rest of the tools it integrates with( although work is being done as we speak:)). Again would it be safe to start using Janus at this point and follow development and finally migrate to a production ready version? What is the overall roadmap for JanusGraph?
My concern in general is whether the combination of the tools is a safe choice for a start up. The whole stack is already new to us and we are excited to try and learn but we will hit a migration period pretty quickly. Is it something that you would do/recommend? Is it a suicide?
Please share your thoughts and keep in mind that it doesn't have to be about the stack i am talking about. It could be any startup company dealing with any kind of software in its early stages.
Cheers
Full disclosure, I'm a developer for JanusGraph on Compose.
It's as safe as any other OSS software project with a large amount of backers. Everyone could jump on some new toy tomorrow, but I doubt it. Companies are putting money into it and the development community is very active.
There is a CQL backend for Janus that's compatible with the Thrift data model. Migration to CQL should be simple and pretty painless when 0.2.0 is released.
I know there are already people using Titan for production applications. With JanusGraph being forked from Titan, I think it's pretty reasonable to start in with JanusGraph from everything I've seen. As far as a roadmap, I'd check out the JanusGraph mailing list (dev/users) and see what's going on and what's being talked about.
Disclosure: I am one of the co-founders of the JanusGraph project; I am also seeking out and adding production users to our GitHub repo and website, so I may be slightly biased. :)
Regarding your questions:
Is it safe to use?
The project is young, but it is built on a foundation of Titan, a very popular graph database that's been around since 2012 and has already been running in production. We have contributors from a number of well-known companies, and several companies are building their business-critical applications directly on JanusGraph, e.g.,
GRAKN.AI is building their knowledge graph on JanusGraph
IBM's Compose.io has built a managed JanusGraph service
Uber is already running JanusGraph in production (having previously run Titan)
several other companies run JanusGraph as a core part of their production environment
We are also starting to identify companies who will provide consulting services around JanusGraph in case someone needs production-level support for their own self-managed deployments.
So as you can see, there is significant interest in and support for this project.
Cassandra upgrade
#pantalohnes answered this question; I won't repeat it here.
Production readiness
As I linked above (GitHub repo and website), we already have production users of JanusGraph which you can find there. Those are just the companies that are publicly willing to lend their name/logo to the project; I'm sure there are more. Also, Titan has been running in many production environments for several years; JanusGraph is a more up-to-date version of Titan, despite the low version number.
I am also speaking with other companies who are planning to migrate to JanusGraph soon; look for announcements via the #JanusGraph Twitter handle to learn about more production deployments.

Has anyone on stackoverflow successfully used CouchDB for a webapp and deployed to a production environment? [duplicate]

I have been using CouchDB on some prototype applications and it has been brilliant, very easy to use and extremely quick. I was wondering if anyone has been using it in production and have any views on it's reliability, performance suitability for operational management etc ?? I am considering using it to support a service layer and would make use of its replication functionality.
Any comments/experiences would be most welcome.
I've used CouchDB for a few small in-house applications - it's been very stable and I've had no serious complaints. Setting that aside, a few small gripes -
1) Databases can be synchronized, but not nodes. That is, if you have four servers and twenty databases, you have to specify each server, and each database to synchronize. A minor gripe, but I prefer less management to more.
2) Since databases are append only, a database with a bunch of activity gets really big really quickly. Compacting fixes this, but isn't exactly fast, especially on big (e.g. 20 gigabytes) database. Scheduling compact for the weekends solved this, but doing that is probably less of an option for high availability applications.
3) Javascript is the de facto view language. What is not well advertised is that since CouchDB is written in Erlang, it also supports Erlang views, which are faster as they are "native". For applications doing a lot of operations in views, Erlang probably makes more sense.
Setting those minor issues aside, I'd wholeheartedly recommend it.
CouchDB ships in Ubuntu and is a fundamental component of the Ubuntu One service.

CouchDB in Production

I have been using CouchDB on some prototype applications and it has been brilliant, very easy to use and extremely quick. I was wondering if anyone has been using it in production and have any views on it's reliability, performance suitability for operational management etc ?? I am considering using it to support a service layer and would make use of its replication functionality.
Any comments/experiences would be most welcome.
I've used CouchDB for a few small in-house applications - it's been very stable and I've had no serious complaints. Setting that aside, a few small gripes -
1) Databases can be synchronized, but not nodes. That is, if you have four servers and twenty databases, you have to specify each server, and each database to synchronize. A minor gripe, but I prefer less management to more.
2) Since databases are append only, a database with a bunch of activity gets really big really quickly. Compacting fixes this, but isn't exactly fast, especially on big (e.g. 20 gigabytes) database. Scheduling compact for the weekends solved this, but doing that is probably less of an option for high availability applications.
3) Javascript is the de facto view language. What is not well advertised is that since CouchDB is written in Erlang, it also supports Erlang views, which are faster as they are "native". For applications doing a lot of operations in views, Erlang probably makes more sense.
Setting those minor issues aside, I'd wholeheartedly recommend it.
CouchDB ships in Ubuntu and is a fundamental component of the Ubuntu One service.

Cassandra vs Riak [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I am looking for an eventually consistent data store and it looks like it may be coming down to Riak or Cassandra. Has anyone got expereinces of a view on this?
As you probably know, they are both architecturally strongly influenced by Dynamo (eventually consistent, no single points of failure, etc). Both also go beyond Dynamo in providing a "richer than pure K/V" data model -- in Cassandra's case, providing a Bigtable-like ColumnFamily mode, in Riak's, a Document-oriented one. I have seen sane people choose both.
I believe points that favor Cassandra include
speed
support for clusters spanning multiple data centers
big names using it (digg, twitter, facebook, webex, ... -- http://n2.nabble.com/Cassandra-users-survey-tp4040068p4040393.html)
Points that favor Riak include
map/reduce support out of the box
/Cassandra dev, fwiw
Riak is used by
Mozilla Foundation
Ask.com sponsored listings
Comcast
Citigroup
Bet365
I think they both pass the test of credible reference customers/users.
Cassandra seems more mature, and is currently doing better in benchmarks. Riak seems easier to add a node to as your cluster grows.
For completeness: A good (probably biased) comparison between the two can be found at http://docs.basho.com/riak/1.3.2/references/appendices/comparisons/Riak-Compared-to-Cassandra/
Use and download are different. Best to get references.
Perhaps a private conversation could be had where Riak references in these companies could be shared? Not sure how to get such with Cassandra, but there is a community of companies that support Cassandra that seem like a good place to start. As these probably have community participants in Cassandra development, it may be a REALLY reasonable place to start.
I would like to hear Riak's answer to recent and large deployments where customers are happy.
I also would like to see the roadmap for each product. Cassandra is a bit easier to track (http://wiki.apache.org/cassandra/) than Riak in my view as Cassandra's wiki discusses limitations and things that are probably going to change going forward, but neither outline futures well. I could understand that of an open source community ... perhaps ... but I cannot for a product for which I must pay.
I also would suggest research of Cloudant, which has what appears to be a very nice layering of capabilities. It also looks like it is bringing to bear the capabilities elsewhere in Apache land. CouchDB is the Apache platform on which Cloudant is based. BUT the indexing with Lucene seems but the tip of the iceberg when it comes to where Cloudant could go. Creating and managing an index is a very systematic process, a kind of data pipeline, that could be scripted using other Apache community assets. AND capabilities like NLP also could be added through Lucene indirectly, or maybe directly into what is persisted.
It would be nice to see a proposed Cloudant roadmap, especially since the team could mine the riches of the Apache community and integrate such into Cloudant. Such probably exists as there is an operational component to the Cloudant revenue model that will require it, if for no other reason.
Another area of interest ... Cloudant's pricing model ... it is clear their revenue model is not based on software, but around service. That is quite attractive, and it seems consistent with the ecosystem surrounding Cassandra too. I don't know if the Basho folks have won over enough of the nosql community as yet ... don't see such from any buzz around their web site or product.
I like this Cloudant web page (https://cloudant.com/the-data-layer/). I was surprised to see the embedded Erlang capability ... I did not know CouchDB was written in Erlang as this seems unusual to me in the Apache community (my ignorance); CouchDB appears to be older than other nosql products I know (now) to be written in Erlang. Whatever their strategy, they at least count Amazon EC2 and Microsoft Azure as hosting partners, indicating an appreciation of Microsoft and !Microsoft worlds - all very important if properly recognizing the middleware value potential (beyond cache or hash table applications) that these types of data stores could have.
Finally, while I don't know the board well, Andy Palmer's guidance looks like it will be valuable. He can bring some guidance vis-a-vis structured data (through VoltDB) to a world that rightly or wrongly may be unfairly branded as KVP hash tables of unstructured data. The need for structure and ecosystem surrounding nosql "databases" is being recognized ... witness Google's efforts with Spanner ... KVP/little structure/need for search-ability motivated Google's investment in the Spanner space. While we all may not need something like Spanner, we probably do need an improving and robust "enterprise" management and interoperability capability in these nosql databases to make it reasonable to incorporate them into modern cloud architectures. The needed structure can come from ease of interoperability and functional richness. It can also come from new capabilities that support conversion of unstructured data to structured data (e.g. indexes, use of NLP to create structured and parsed renderings of things inside of a KVP blob, and plenty of other things that, if put into a roadmap and published, could entice and grow a user base). Cloudant looks like it has a good chance of success ... I will take a closer look at it ...
And look what I found about CouchDB ...
CouchDB comes with a suite of features, such as on-the-fly document transformation and real-time change notifications, that makes web app development a breeze. It even comes with an easy to use web administration console. You guessed it, served up directly out of CouchDB! We care a lot about distributed scaling. CouchDB is highly available and partition tolerant, but is also eventually consistent. And we care a lot about your data. CouchDB has a fault-tolerant storage engine that puts the safety of your data first.

Resources