Janus Graph backend cassandra vs Bigtable - cassandra

I am planning to use Janusgraph for building graph of different uses our team handles and I see that janus graph has option to use BigTable or Cassandra as storage backend. I am looking for any recommendation on which backend is more optimal/performant ( I am mainly talking about gremlin query performance on 2 hop neighbor of a node ) with JanusGraph.
I understand that performance is pretty subjective and varies based on datasize/graph connectivity and use case so best approach will be to try out myself, which I am planning to do. But has anyone else has done similar performance comparison ? Is there any general recommendation about storage backend here ?

You're right in that performance is both:
subjective
depends largely on data size
I can tell you that I have done this exercise as well. To that end, I think it's important to share this comparison from DB-Engines.com.
In terms of performance, the biggest thing I'd be looking at is how each handles consistency. As a general rule, databases which enforce stronger levels of consistency typically have to sacrifice performance.
BigTable == strong-consistent
Cassandra == eventually consistent
Other factors worth considering, are the fact that BigTable limits you to Google Cloud (GCP). And if you don't want to lose performance over the network, you'll also need to pay for more (Janus) instances on GCP for data locality.
In terms of raw DB-Engine "score," Cassandra is currently at 114.112, while BigTable is at a paltry 3.582. These scores will change month-to-month, but in general this signifies that Cassandra has a much stronger community around it. Similarly, Cassandra has 18182 questions on this site, while BigTable only has 449. Bottom line, is that it'll be much easier to get support and answers to questions.
Just based on the underlying strength of the community, Cassandra is the better option here.
Having supported JanusGraph on Cassandra for the last few years, I can tell you that overall it's been solid. The difficulties tend to come into play with bulk data loading. But outside of that, things seem to run pretty well.

Related

web real time analytics dashboard: which technologies should use? (node/django, cassandra/mongodb...)

we want to develop a dashboard to analyze geospatial data.
This is a small and close approach to what we want to do: http://adilmoujahid.com/images/data-viz-talkingdata.gif
Our main concerns are about the backend technologies to be used. (front will be D3.js, DC.js, leaflet.js...)
Between Django and node.js, we think that we will use node.js, cause we've read than its faster than Django for this kind of tasks. But we are not sure and we are open to ideas.
But about Mongo or Cassandra, we are so confused. Our data is mostly structured, so store it in tables like Cassandra would make it easy to manage, also Cassandra seems to have better performance. However, we also have IoT devices data, with lots of real-time GPS location...
Which suggestions can you give to us to achieve our goal?
TL;DR Summary;
Dashboard with hundreds of simultaneous users.
Stored data will be mostly structured text/numbers, but will include also images, GPS-arrays, IoT sensors, geographical data (vector-polygons & rasters)
Databases will receive high write load coming from sensors.
Dashboard performance is so important. Its more important to read data in real time, than keeping it uncorrupted/secure.
Most calculus/math will be calculated in the client's browser, the server will try to avoid mathematical operations.
Disclaimer: I'm a DataStax employee so I'll comment on the Cassandra piece.
Cassandra is a good choice for this if your dashboard can be planned around a set of known queries. If those users will be doing ad-hoc queries directly to the database from the dashboard, you'll want something with a little more flexibility like ElasticSearch or (shameless plug) DataStax Search. Especially if you expect the queries/database to handle some of the geospatial logic.
JaguarDB has very strong support of geospatial data (2D and 3D). It allows you to store multi-measurements per point location while other databases support only one measurement (pointm). Many complex queries such as Voronoi polygon, convexhull are also supported. It is open source, distributed and sharded, multiple columns indexes, etc.
Concerning Postgresql and Cassandra, is there much difference in RAM/CPU/DISK usage between them?
Our use case does not require transactions, it will be in a single node and we will have IoT devices writing data up to 500 times per second. However ive read that Geographical data that works better with Potstgis than cassandra...
According to this use case, do you recommend Cassandra or Postgis?

Performance Difference between Apache Ignite compute grid and Spark

I need to do some computation on a large set of data. In this computation, data will be manipulated and stored (persisted) to Database.
So can anyone suggest which technology is preferable in accordance with performance and Resource Utilization?
The best way here is to implement your logic and check how it works with both frameworks. Also, you can run different benchmarks in your environment.
Information about benchmarks for Apache Ignite you and read here:
https://apacheignite.readme.io/docs/perfomance-benchmarking
In case of any technical questions, you can ask them to the community.

Need architecture hint: Data replication into the cloud + data cleansing

I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data I need to do some data analytics.
To achieve this goal, I'm searching for an open source framework or cloud solution I can use for. I took a look into Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint which frameworks you would use for such an task?
From my quick read on APEX it requires Hadoop underneath coupling to more dependencies than you probably want early on.
Kafka on the other hand is used for transmitting messages (it has other APIs such as streams and connect which im not as familiar with).
Im currently using Kafka to stream log files in real time from a client system. Out of the box Kafka really only provides fire and forget semantics. I have had to add a bit to make it an exactly once delivery semantic (Kafka 0.11.0 should solve this).
Overall, think of KAFKA being a more low level solution with logical message domains with queues and from what I skimmed over APEX being a more heavy packaged library with alot more things to explore.
Kafka would allow you to switch out the underlying analytical system of your choosing with their consumer api.
The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which on the cloud it can quickly build up. Of course, the size of data is also important.
These are a few things you should consider:
batch vs streaming: do the updates flow continuously, or the process is run on demand/periodically (sounds the latter rather than the former)
what's the latency required ? That is, what's the maximum time that it would take an update to propagate through the system ? Answer to this question influences question 1)
how much data are we talking about ? If you're up the Gbyte size, Tbyte or Pbyte ? Different tools have different 'maximum altitude'
and what format ? Do you have text files, or are you pulling from relational DBs ?
Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning on using to do that part ? Depending on question 3), data size, deduping usually requires a join by ID, which is done in constant time in a key value store, but requires a sort (generally O(nlogn)) in most other data systems (spark, hadoop, etc)
So, while you ponder all this questions, if you're not sure, I'd recommend you start your cloud work with an elastic solution, that is, pay as you go vs setting up entire clusters on the cloud, which could quickly become expensive.
One cloud solution that you could quickly fire up is amazon athena (https://aws.amazon.com/athena/). You can dump your data in S3, where it's read by Athena, and you just pay per query, so you don't pay when you're not using it. It is based on Apache Presto, so you could write the whole system using basically SQL.
Otherwise you could use Elastic Mapreduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html). Or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on what language/technology you're most comfortable with. Also, there are similar products from Google (BigData, etc) and Microsoft (Azure).
Yes, you can use Apache Apex for your use case. Apache Apex is supported with Apache Malhar which can help you build application quickly to load data using JDBC input operator and then either store it to your cloud storage ( may be S3 ) or you can do de-duplication before storing it to any sink. It also supports Dedup operator for such kind of operations. But as mentioned in previous reply, Apex do need Hadoop underneath to function.

How does AWS SimpleDb differ to Azure DocumentDb? How do both differ to ElasticSearch

In terms of
scalability,
performance,
maintenance,
Ease of use / Learning curve
cost,
In order of significance but wouldn't mind a general answer as I appreciate I m probably asking for too much :)
Thanks
EDIT: I m looking for a database that will serve as the single authoritative data store and I need all attributes of the documents stored to be indexed for various business reasons. Therefore I know that other solutions won't do what I m looking for.
tl;dr; If you are using JavaScript and building browser apps, node.js and DocumentDB are a match made in heaven. If you are using .NET and/or other Azure services, then DocumentDB is favored. If you are using other AWS services, then SimpleDB might be better.
I know that questions like this are not ideal for Stack Overflow, but I often see value in answers like this and my most popular answer on SO is essentially informed opinion backed by evidence. I have not used SimpleDB but I looked into it before deciding on DocumentDB. I rejected it pretty quickly... although I did give AWS Lambda a serious look before deciding on DocumentDB. So:
scalability. DocumentDB has a very straight forward and explicit scaling model -- add more collections if you need either more space or more operations per second. SimpleDB's scaling model is similar except less straight forward since you add domains which are overloaded to both provide type separation (think tables) and scalability. You can scale either to whatever you need.
performance. Since I never built anything on it, I can't say anything about SimpleDB's performance. However, I've been very impressed with the performance of DocumentDB. Latency is less than 10ms for simple id-based reads and I get impressive latency and throughput for queries. The DocumentDB implementation of our current app returns complex n-dimensional aggregations (done in stored procedures on DocumentDB using documentdb-lumenize) in 1/4 the time of the functionally-equivalent MongoDB/node.js implementation. You'd have to do your own performance testing on your actuall application to have a definitive answer here.
maintenance. Both are much more hands off than traditional data stores. There just aren't that many knobs to turn maintaining either of them. SimpleDB geographically distributes your data by default. You'd have to do the equivalent manually in DocumentDB. Possible, but harder. DocumentDB has good import/export tools and their backup solution is about to be significantly upgraded.
ease of use / learning curve. If you are JavaScript programmer, than DocumentDB has a lot to recommend it. DocumentDB uses JSON natively. SimpleDB uses XML. DocumentDB has ACID-enabling stored procedures written in JavaScript. You'd need to combine SimpleDB with something else (Lambda maybe, but the XML/JavaScript mismatch would make this less than ideal) to get the equivalent. Both allow use to use SQL but DocumentDB also allows for JavaScript native queries.
There is one huge mindset hurdle that you will have to get over in order to be successful with DocumentDB. Despite the fact that they both scale by adding more domains/collections, SimpleDB domains are closer conceptually to tables. The word choice of "collection" by the DocumentDB team is unfortunate since they are more akin to partitions and should not be thought of as tables. The hard part is getting used to the idea that you store all of your different data types in the same collection. Once you get over that, I find DocumentDB's approach refreshing and incredibly flexible. I can efficiently model inheritance and type-mixins. Collections nay partitions have one purpose -- scalability. Domains are used for both scalability and data type separation which is actually harder in practice.
cost. Not much to say here. Both allow you to scale your cost gradually. For really small implementations, DocumentDB is probably more expensive since the smallest unit of usage is a single collection which is $25/month minimum. You'd have to do your own modeling/what-if analysis to determine which would be less expensive for you. Note, that Azure is being every aggressive in general and even pushing AWS to lower prices in some cases. My gut is that they would be roughly equal in cost for most applications.
Other thoughts:
You wrote, "I need all attributes of the documents stored to be indexed". One really nice feature of DocumentDB is that you can specify the size of your indexes By default, every field is indexed into a 3-byte per field hash index, which is highly space efficient. I do not know if SimpleDB has the equivalent.
This is a bit like comparing apples to oranges. I consider DocumentDB to be like MongoDB or CouchDB in it's data model and VoltDB in its use execution model (although VoltBD sprocs are written in Java). SimpleDB feels more like a simple XML object store. If you already have a big XML mindset, then it might be easier, but I think there are more folks using JSON today than XML.
Writing ACID-enabling stored procedures in JavaScript is a killer feature that only DocumentDB has. Some say the days of stored procedures are over; that you should put all such logic in your application server layer. If you implementing a simple CRUD API, that may be, but almost every application requires some sort of transaction where more than one row is changed at a time. This is mind bogglingly hard to do correctly without transaction support in your data store. Even if you do implement the equivalent of transactions with your NoSQL database, the overhead of the implementation eats away any development/performance/scalability advantages that you got by choosing NoSQL rather than SQL.
DocumentDB's user defined functions and triggers (also written in JavaScript) might be useful, although I believe the trigger implementation is crippled at this moment in time and I haven't found a use for UDFs myself yet.
DocumentDB has built-in attachment support. You need to integrate manually with S3 for the equivalent on AWS.
DocumentDB has geo indexing and operators.
SimpleDB's 1K per document limit is a serious limitation. This tells me that it's designed mostly for logging or as an index to S3 and not a full-fledged document store. The limit for DocumentDB is 512K.
If comparison to SimpleDB is like apples to oranges, then comparison to ElasticSearch is like apples to fire engines. My impression of ElasticSearch is that it's all about full-text searching and analytics. I don't think it's space/execution/api efficient to serve as a primary transactional store. Built on Lucene, it was not designed to have the reliability/durability to be your primary store. Further, even when hosted, it's more of an IaaS offering, wherease DocumentDB and SimpleDB are true PaaS offerings. The maintenance will be much higher with ElasticSearch.

Cassandra vs Riak [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I am looking for an eventually consistent data store and it looks like it may be coming down to Riak or Cassandra. Has anyone got expereinces of a view on this?
As you probably know, they are both architecturally strongly influenced by Dynamo (eventually consistent, no single points of failure, etc). Both also go beyond Dynamo in providing a "richer than pure K/V" data model -- in Cassandra's case, providing a Bigtable-like ColumnFamily mode, in Riak's, a Document-oriented one. I have seen sane people choose both.
I believe points that favor Cassandra include
speed
support for clusters spanning multiple data centers
big names using it (digg, twitter, facebook, webex, ... -- http://n2.nabble.com/Cassandra-users-survey-tp4040068p4040393.html)
Points that favor Riak include
map/reduce support out of the box
/Cassandra dev, fwiw
Riak is used by
Mozilla Foundation
Ask.com sponsored listings
Comcast
Citigroup
Bet365
I think they both pass the test of credible reference customers/users.
Cassandra seems more mature, and is currently doing better in benchmarks. Riak seems easier to add a node to as your cluster grows.
For completeness: A good (probably biased) comparison between the two can be found at http://docs.basho.com/riak/1.3.2/references/appendices/comparisons/Riak-Compared-to-Cassandra/
Use and download are different. Best to get references.
Perhaps a private conversation could be had where Riak references in these companies could be shared? Not sure how to get such with Cassandra, but there is a community of companies that support Cassandra that seem like a good place to start. As these probably have community participants in Cassandra development, it may be a REALLY reasonable place to start.
I would like to hear Riak's answer to recent and large deployments where customers are happy.
I also would like to see the roadmap for each product. Cassandra is a bit easier to track (http://wiki.apache.org/cassandra/) than Riak in my view as Cassandra's wiki discusses limitations and things that are probably going to change going forward, but neither outline futures well. I could understand that of an open source community ... perhaps ... but I cannot for a product for which I must pay.
I also would suggest research of Cloudant, which has what appears to be a very nice layering of capabilities. It also looks like it is bringing to bear the capabilities elsewhere in Apache land. CouchDB is the Apache platform on which Cloudant is based. BUT the indexing with Lucene seems but the tip of the iceberg when it comes to where Cloudant could go. Creating and managing an index is a very systematic process, a kind of data pipeline, that could be scripted using other Apache community assets. AND capabilities like NLP also could be added through Lucene indirectly, or maybe directly into what is persisted.
It would be nice to see a proposed Cloudant roadmap, especially since the team could mine the riches of the Apache community and integrate such into Cloudant. Such probably exists as there is an operational component to the Cloudant revenue model that will require it, if for no other reason.
Another area of interest ... Cloudant's pricing model ... it is clear their revenue model is not based on software, but around service. That is quite attractive, and it seems consistent with the ecosystem surrounding Cassandra too. I don't know if the Basho folks have won over enough of the nosql community as yet ... don't see such from any buzz around their web site or product.
I like this Cloudant web page (https://cloudant.com/the-data-layer/). I was surprised to see the embedded Erlang capability ... I did not know CouchDB was written in Erlang as this seems unusual to me in the Apache community (my ignorance); CouchDB appears to be older than other nosql products I know (now) to be written in Erlang. Whatever their strategy, they at least count Amazon EC2 and Microsoft Azure as hosting partners, indicating an appreciation of Microsoft and !Microsoft worlds - all very important if properly recognizing the middleware value potential (beyond cache or hash table applications) that these types of data stores could have.
Finally, while I don't know the board well, Andy Palmer's guidance looks like it will be valuable. He can bring some guidance vis-a-vis structured data (through VoltDB) to a world that rightly or wrongly may be unfairly branded as KVP hash tables of unstructured data. The need for structure and ecosystem surrounding nosql "databases" is being recognized ... witness Google's efforts with Spanner ... KVP/little structure/need for search-ability motivated Google's investment in the Spanner space. While we all may not need something like Spanner, we probably do need an improving and robust "enterprise" management and interoperability capability in these nosql databases to make it reasonable to incorporate them into modern cloud architectures. The needed structure can come from ease of interoperability and functional richness. It can also come from new capabilities that support conversion of unstructured data to structured data (e.g. indexes, use of NLP to create structured and parsed renderings of things inside of a KVP blob, and plenty of other things that, if put into a roadmap and published, could entice and grow a user base). Cloudant looks like it has a good chance of success ... I will take a closer look at it ...
And look what I found about CouchDB ...
CouchDB comes with a suite of features, such as on-the-fly document transformation and real-time change notifications, that makes web app development a breeze. It even comes with an easy to use web administration console. You guessed it, served up directly out of CouchDB! We care a lot about distributed scaling. CouchDB is highly available and partition tolerant, but is also eventually consistent. And we care a lot about your data. CouchDB has a fault-tolerant storage engine that puts the safety of your data first.

Resources