Best way to benchmark Cassandra and Hbase for performance?


What's the best way to benchmark Cassandra and HBase for performance?
I'm working on a web application whose workload is about 80% reads and 20% writes. Users can also perform CRUD (Create, Read, Update, Delete) operations on the data. Our data is all structured and comes from an RDBMS. I have heard about YCSB (Yahoo! Cloud Serving Benchmark).
Has anyone benchmarked Cassandra against HBase for a similar use case?

I will assume that your Cassandra is sitting behind a web app?
If so (as you mentioned CRUD), just benchmark the endpoints of your CRUD operations, the WRITE (the Create) and the READ, with ApacheBench (ab) or Siege under load (i.e. concurrent calls, etc.).
Update
If you purely want to test whether your Cassandra configuration is correct for raw power:
http://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsCStress_t.html
But if you want to test the application as a whole, ApacheBench and Siege will test your app.
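As a rough illustration of what "benchmark the endpoints under concurrent load" means for an 80/20 read/write mix, here is a minimal Python sketch. It is not a replacement for ApacheBench or Siege. The function names `run_mixed_load` and `summarize` are made up for this example, and the read and write operations are stand-in callables that you would replace with real HTTP calls to your own endpoints.

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_mixed_load(read_op, write_op, total_ops=1000, read_ratio=0.8,
                   concurrency=8, seed=42):
    """Call read_op/write_op concurrently; return per-call latencies in seconds."""
    rng = random.Random(seed)
    # Decide the operation mix up front so a run is reproducible.
    plan = ["read" if rng.random() < read_ratio else "write"
            for _ in range(total_ops)]

    def timed(kind):
        op = read_op if kind == "read" else write_op
        start = time.perf_counter()
        op()
        return kind, time.perf_counter() - start

    latencies = {"read": [], "write": []}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for kind, elapsed in pool.map(timed, plan):
            latencies[kind].append(elapsed)
    return latencies


def summarize(samples):
    """Return count, median and an approximate 95th percentile."""
    if not samples:
        return {"count": 0, "p50": None, "p95": None}
    ordered = sorted(samples)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {"count": len(ordered),
            "p50": statistics.median(ordered),
            "p95": ordered[p95_index]}


if __name__ == "__main__":
    # Stand-in operations (assumption): swap in calls to your real endpoints.
    store = {}
    result = run_mixed_load(lambda: store.get("k"),
                            lambda: store.__setitem__("k", 1))
    for kind, samples in result.items():
        print(kind, summarize(samples))
```

Reporting percentiles rather than only an average is a deliberate choice here: under concurrent load, tail latency is usually what users notice first.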

Most databases provide some tool for performance testing. In my opinion, the best way to get an unbiased view is to use a third-party tool like https://github.com/brianfrankcooper/YCSB, which supports testing different types of ACID and NoSQL databases.
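For the 80% read / 20% write mix described in the question, a YCSB core workload file could look roughly like what the sketch below generates. The property names (`recordcount`, `operationcount`, `readproportion`, `updateproportion`, and so on) are those of YCSB's core workload as far as I know; the workload class package differs between YCSB releases (older ones use `com.yahoo.ycsb`), so check it against your version. The helper `ycsb_workload_properties` and the default counts are assumptions made for illustration.

```python
def ycsb_workload_properties(record_count=100000, operation_count=100000,
                             read=0.8, update=0.2,
                             workload_class="site.ycsb.workloads.CoreWorkload"):
    """Build the text of a YCSB core workload file for a read/update mix."""
    if abs(read + update - 1.0) > 1e-9:
        raise ValueError("read and update proportions must sum to 1")
    props = {
        "workload": workload_class,      # package name depends on YCSB version
        "recordcount": record_count,     # rows loaded before the run
        "operationcount": operation_count,
        "readproportion": read,
        "updateproportion": update,
        "scanproportion": 0,
        "insertproportion": 0,
        "requestdistribution": "zipfian",  # skewed access, a few hot keys
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    # Write the result to a file and pass it to YCSB with its -P option.
    print(ycsb_workload_properties(), end="")
```

Running the same workload file against both databases, on comparable hardware and cluster sizes, is what makes the comparison meaningful.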

Related

web real time analytics dashboard: which technologies should use? (node/django, cassandra/mongodb...)

we want to develop a dashboard to analyze geospatial data.
This is a small and close approach to what we want to do: http://adilmoujahid.com/images/data-viz-talkingdata.gif
Our main concerns are about the backend technologies to use. (The front end will be D3.js, DC.js, leaflet.js...)
Between Django and node.js, we think we will use node.js, because we've read that it's faster than Django for this kind of task. But we are not sure and we are open to ideas.
But between Mongo and Cassandra, we are confused. Our data is mostly structured, so storing it in tables as in Cassandra would make it easy to manage; Cassandra also seems to have better performance. However, we also have IoT device data, with lots of real-time GPS locations...
What suggestions can you give us to achieve our goal?
TL;DR summary:
Dashboard with hundreds of simultaneous users.
Stored data will be mostly structured text/numbers, but will include also images, GPS-arrays, IoT sensors, geographical data (vector-polygons & rasters)
Databases will receive high write load coming from sensors.
Dashboard performance is very important. It's more important to read data in real time than to keep it uncorrupted/secure.
Most of the math will be computed in the client's browser; the server will try to avoid mathematical operations.

Disclaimer: I'm a DataStax employee so I'll comment on the Cassandra piece.
Cassandra is a good choice for this if your dashboard can be planned around a set of known queries. If those users will be doing ad-hoc queries directly to the database from the dashboard, you'll want something with a little more flexibility like ElasticSearch or (shameless plug) DataStax Search. Especially if you expect the queries/database to handle some of the geospatial logic.

JaguarDB has very strong support for geospatial data (2D and 3D). It allows you to store multiple measurements per point location, while other databases support only one measurement (pointm). Many complex queries, such as Voronoi polygons and convex hulls, are also supported. It is open source, distributed and sharded, supports multi-column indexes, etc.

Concerning PostgreSQL and Cassandra, is there much difference in RAM/CPU/disk usage between them?
Our use case does not require transactions, it will run on a single node, and we will have IoT devices writing data up to 500 times per second. However, I've read that geographical data works better with PostGIS than with Cassandra...
Given this use case, do you recommend Cassandra or PostGIS?

Need architecture hint: Data replication into the cloud + data cleansing

I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data I need to do some data analytics.
To achieve this goal, I'm searching for an open source framework or cloud solution I can use for this. I took a look at Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint as to which frameworks you would use for such a task?

From my quick read on Apex, it requires Hadoop underneath, which couples you to more dependencies than you probably want early on.
Kafka, on the other hand, is used for transmitting messages (it has other APIs, such as Streams and Connect, which I'm not as familiar with).
I'm currently using Kafka to stream log files in real time from a client system. Out of the box, Kafka really only provides fire-and-forget semantics. I have had to add a bit to get exactly-once delivery semantics (Kafka 0.11.0 should solve this).
Overall, think of Kafka as a lower-level solution with logical message domains and queues and, from what I skimmed, Apex as a heavier packaged library with a lot more things to explore.
Kafka would allow you to switch out the underlying analytical system of your choosing via its consumer API.

The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which can quickly build up in the cloud. Of course, the size of the data is also important.
These are a few things you should consider:
1) Batch vs streaming: do the updates flow continuously, or is the process run on demand/periodically? (It sounds like the latter rather than the former.)
2) What latency is required? That is, what's the maximum time an update may take to propagate through the system? The answer to this question influences question 1).
3) How much data are we talking about? Are you at the GB, TB or PB scale? Different tools have different 'maximum altitudes'.
4) And in what format? Do you have text files, or are you pulling from relational DBs?
5) Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning to use for that part? Depending on question 3), data size, deduping usually requires a join by ID, which is a constant-time lookup per record in a key-value store, but requires a sort (generally O(n log n)) in most other data systems (Spark, Hadoop, etc.).
So, while you ponder all these questions, if you're not sure, I'd recommend you start your cloud work with an elastic solution, that is, pay as you go rather than setting up entire clusters in the cloud, which could quickly become expensive.
One cloud solution that you could quickly fire up is Amazon Athena (https://aws.amazon.com/athena/). You can dump your data in S3, where it's read by Athena, and you just pay per query, so you don't pay when you're not using it. It is based on Presto, so you could write the whole system using basically SQL.
Otherwise you could use Elastic MapReduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html) or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on which language/technology you're most comfortable with. Also, there are similar products from Google (BigQuery, etc.) and Microsoft (Azure).
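The difference between the two deduplication strategies mentioned above (key lookup vs sort) can be sketched in a few lines of Python. This only covers exact duplicates on a single ID field; the record layout and the function names are assumptions for illustration, and fuzzy duplicates would need more than this.

```python
from itertools import groupby
from operator import itemgetter


def dedupe_by_hash(records, key="id"):
    """Keep the first record per key using a dict: one O(1) lookup per record."""
    seen = {}
    for record in records:
        seen.setdefault(record[key], record)
    return list(seen.values())


def dedupe_by_sort(records, key="id"):
    """Sort by key (O(n log n)), then keep the first record of each run."""
    ordered = sorted(records, key=itemgetter(key))  # stable sort
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


if __name__ == "__main__":
    customers = [
        {"id": 2, "name": "b"},
        {"id": 1, "name": "a"},
        {"id": 2, "name": "b (duplicate)"},
    ]
    print(dedupe_by_hash(customers))
    print(dedupe_by_sort(customers))
```

The hash variant needs all distinct keys to fit in memory (or in a key-value store), which is why sort-based approaches remain common in batch systems at larger data sizes.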

Yes, you can use Apache Apex for your use case. Apache Apex is supported by Apache Malhar, which can help you quickly build an application that loads data using the JDBC input operator and then either stores it in your cloud storage (maybe S3) or de-duplicates it before storing it in any sink. It also provides a Dedup operator for this kind of operation. But as mentioned in the previous reply, Apex does need Hadoop underneath to function.

stubbed cassandra for data storage

I need an embedded cassandra for my project and I was wondering if I can use Stubbed Cassandra for data storage. Because I need a system to simulate CQL requests and responses.
Thanks everyone.

You can't use it as a real datastore. Use real Cassandra as a real Cassandra datastore. Check out ccm, which is probably more what you're looking for.
There are wrappers for it in dtests (python) and the java driver uses it for testing and has a java wrapper.

I don't really have any experience with SCassandra, but I have worked on several projects using Apache Cassandra, and there are some use cases, like multi-datacenter infrastructure, that you may want to experiment with and that I don't think SCassandra can handle. So if you plan to do simple tests, that's fine, but advanced use cases really need to be tested on a real Cassandra distribution.

As others have mentioned, you will need the real Cassandra for data storage. However, if you want to test CQL requests/responses then you can use this library:
Cassandra-Spy
It runs an actual embedded Cassandra and also can simulate failures for inserts/selects. This helps you test your app's behaviour in failure cases. I wrote the library to address this specific use case.

Do I need to run NoSQL databases on some cloud environments before I benchmark them?

I have installed Cassandra (from DataStax) and Riak on my computer. I want to benchmark them with a variety of workloads and record sizes. I am using the YCSB tool.
Do I need to use a public datacenter/cloud environment before I benchmark, or are the databases already running in some cloud environment?

The short answer, I believe if I follow your question, is no.
It is quite possible to benchmark the databases on your local computer without having to set up environments in the "cloud". If you have the databases correctly set up on your local machine and point YCSB to them properly, you should be able to run the tests.
There are of course some additional considerations in benchmarking like:
Why are you benchmarking Cassandra and Riak? What is the use case? The two databases are similar but have different ideal use cases and different performance profiles. Just benchmarking them head to head via YCSB will only tell you a small part of the story.
Testing a single node of either database is also not going to give you a realistic picture of their performance as they are designed to be used as clusters of nodes.
If you have a specific use case you want to test you might consider writing your own benchmarking test vs using YCSB.
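If you do write your own test, the core of it can be quite small. The sketch below times writes and reads for several record sizes against any pair of `write_fn(key, value)` / `read_fn(key)` callables. The function name, the default sizes and the in-memory dict used as a stand-in store are all assumptions; in a real test you would pass functions that call your Cassandra or Riak client.

```python
import os
import time


def benchmark_record_sizes(write_fn, read_fn, sizes=(100, 1000, 10000),
                           ops_per_size=500):
    """Measure write and read throughput (ops/second) per record size in bytes."""
    results = {}
    for size in sizes:
        payload = os.urandom(size)  # random bytes of the requested size
        keys = [f"key-{size}-{i}" for i in range(ops_per_size)]

        start = time.perf_counter()
        for key in keys:
            write_fn(key, payload)
        write_seconds = max(time.perf_counter() - start, 1e-9)

        start = time.perf_counter()
        for key in keys:
            read_fn(key)
        read_seconds = max(time.perf_counter() - start, 1e-9)

        results[size] = {
            "writes_per_s": ops_per_size / write_seconds,
            "reads_per_s": ops_per_size / read_seconds,
        }
    return results


if __name__ == "__main__":
    # In-memory stand-in (assumption): replace with real client calls.
    store = {}
    for size, stats in benchmark_record_sizes(store.__setitem__,
                                              store.__getitem__).items():
        print(size, stats)
```

This loop is single-threaded and runs against whatever the callables point at, so the caveat above still applies: numbers from a single local node say little about a production cluster.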

Performance testing in Cassandra

I'm currently making some improvements to Apache Cassandra 1.2.8, and I want to do some performance testing on the database. What is the best way to do performance testing on this kind of NoSQL database? Are there any tools or standards that we can use for performance testing?

Check out YCSB. While not a standard it has been used by quite a few products including Cassandra.
