Does Accumulo support aggregation? - accumulo

I am new to Accumulo. I know that I can write Java code to scan, insert, update and delete data using Hadoop and MapReduce. What I would like to know is whether aggregation is possible in Accumulo.
I know that in MySql we can use groupby,orderby,max,min,count,sum,joins, nested queries, etc. Is their is any possibility to use these functions in Accumulo either directly or indirectly.

Accumulo does support aggregation through the use of combiner iterators (Accumulo Combiner Example ).
Iterators mostly run server-side, but can be run client-side, and can perform quite a bit of computation before sending the data back to your client.
Accumulo comes packaged with many iterators, more specifically the summingCombiner is used to sum the values of entries. Dave Medinet's has a blog that has some good examples (Accumulo Blog). More specifically, using the summingCombiner to implement wordcount (Word Count in Accumulo). I also suggest signing up for the Accumulo users mailing lists (mailing lists).

I like to think Accumulo has great agg functionality. I run an OLAP solution on it with hundreds of millions of keys on 40 nodes. In addition to the basic SummingCombiner, I recommend the newer statscombiner as well
http://accumulo.apache.org/1.4/apidocs/org/apache/accumulo/examples/simple/combiner/StatsCombiner.html
which gives you basic stats about a set of keys.
You can set combiners at maj compaction, minor compaction or scan time. If you have a ton of data with a lot of trickled keys, I don't recommend scan time combining, because it can slow down the scan time (not always).
HTH

Some aggregation is supported in Accumulo, over multiple entries, and even multiple rows, within each tablet. Aggregation across tablets would need to be done on the client side or in a MapReduce job.

Yes, Aggregations are possible in Accumulo. you can achieve them by -
1) Using in built Combiners which aggregate data when you ingest.
2) Make Customised Aggregation Iterator and then deploy it at minor or majour compactions.

Related

Getting data OUT of Cassandra?

How can I export data, over a period of time (like hourly or daily) or updated records from a Cassandra database? It seems like using an index with a date field might work, but I definitely get timeouts in my cqlsh when I try that by hand, so I'm concerned that it's not reliable to do that.
If that's not the right way, then how do people get their data out of Cassandra and into a traditional database (for analysis, querying with JOINs, etc..)? It's not a java shop, so using Spark is non-trivial (and we don't want to change our whole system to use Spark instead of cassandra directly). Do I have to read sstables and try to keep track of them that way? Is there a way to say "get me all records affected after point in time X" or "get me all changes after timestamp X" or something similar?
It looks like Cassandra is really awesome at rapidly reading and writing individual records, but beyond that Cassandra seems to not be the right tool if you want to pull its data into anything else for analysis or warehousing or querying...
Spark is the most typical to do exactly that (as you say). It does it efficiently and is used often so pretty reliable. Cassandra is not really designed for OLAP workloads but things like spark connector help bridge the gap. DataStax Enterprise might have some more options available to you but I am not sure their current offerings.
You can still just query and page through the whole data set with normal CQL queries, its just not as fast. You can even use ALLOW FILTERING just be wary as its very expensive and can impact your cluster (creating a separate dc for the workload and using LOCOL_CL queries against it helps). You will probably also in that scenario add a < token() and > token() to the where clause to split up the query and prevent too much work on any one coordinator. Organizing your data so that this query is more efficient would be strongly recommended (ie if doing time slices, put things in a partition bucketed by time and clustering key timeuuids so its sequential read for each part of time).
Kinda cheesy sounding but the CSV dump from cqlsh is actually fast and might work for you if your data set is small enough.
I would not recommend going to the sstables directly unless you are familiar with internals and using hadoop or spark.

Pros and Cons of Cassandra User Defined Functions

I am using Apache Cassandra to store mostly time series data. And I am grouping the data and aggregating/counting it based on some conditions. At the moment I am doing this in a Java 8 application, but with the release of Cassandra 3.0 and the User Defined Functions, I have been asking myself if extracting the grouping and aggregation/counting logic to Cassandra is a good idea. To my understanding this functionallity is something like the stored procedures in SQL.
My concern is if this will impact the computation performance and the overall performance of the database. I am also not sure if there are other issues with it and if this new feature is something like the secondary indexes in Cassandra - you can do them, but it is not recommended at all.
Have you used user defined functions in Cassandra? Do you have any observations on the performance? What are the good and bad sides of this new functionality? Is it applicable in my use case?
You can compare it to using count() or avg() kind of aggregations. They can save you a lot of network traffic and object creation/GC by having the coordinator only send the result, but its easy to get carried away and make the coordinator do a lot of work. This extra work takes away from normal C* duties, and can just as likely increase GCs as reduce them.
If your aggregating 100 rows in a partition its probably fine and if your aggregating 10000 its probably not end of the world if its very rare. If your calling it once a second though its a problem. If your aggregating over 1000 I would be very careful.
If you absolutely need to do it and its a lot of data often, you may want to create dedicated proxy coordinators (-Djoin_ring=false) to bear the brunt of the load without impacting normal C* read/writes. At that point its just as easy to create dedicated workload DC for it or something (with RF=0 for your keyspace, and set application to be part of that DC with DCAwareRoundRobinPolicy). This also is the point where using Spark is probably the right thing to do.

Dynamic Cassandra queries

I have a messenger application with a history page, on which you can see your sent and received messages.
Since the amount of messages has lowered my performance I have been thinking about using Cassandra.
After researching on the topic of Cassandra, I found out that you have to build tables to satisfy your queries.
Now the problem: on the history page you can use x amount of different filters at the same time. e.g filter by date,receiver and sender.
If I were to use Cassandra, would I need to create a table for every combination of these filters?
Or is this a bad use case for Cassandra in general?
If so, are there any alternatives?
Why don't you just make a SELECT statement.
You should definately have a look into CQL (Cassandra Query Language).
While CQL and SQL share a similar syntax queries are a lot different.
The reasons for these differences is the fact that Cassandra is dealing with distributed data and aims to prevent inefficient queries.
See this link for reference. It shows queries you can or cannot do.

Is it bad to use INDEX in Cassandra if performance is not important?

Background
We have recently started a "Big Data" project where we want to track what users are doing with our product - how often they are logging in, which features they are clicking on, etc - your basic user analytics stuff. We still don't know exactly what questions we will be asking, but most of it will be "how often did X occur over the last Y months?" type of thing, so we started storing the data sooner rather than later thinking we can always migrate, re-shape etc when we need to but if we don't store it it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?
Impact of adding more indexes:
This really depends on your data structure, distribution and how you access it; you were right before when you compared this process to RDMS. For Cassandra, it's best to define your queries first and then build the data model.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query where it is known exactly which node needs to be quired). So there's not just an impact on writes, but on read performance as well.
In terms of working out the performance on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that Datastax have built, to quickly generate profile yamls:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.

Cassandra bulk insert operation, internally

I am looking for Cassandra/CQL's cousin of the common SQL idiom of INSERT INTO ... SELECT ... FROM ... and have been unable to find anything to do such an operation programmatically or in CQL. Is it just not supported?
My use case is to do a reasonably bulky copy from one table to another. I don't need any particular concurrent guarantees, but it's a lot of data so I'd like to avoid the additional network overhead of writing a client that retrieves data from one table, then issues batches of inserts into the other table. I understand that the changes will still need to be transported between nodes of the Cassandra cluster according to the replication set-up, but it seems reasonable for there to be an "internal" option to do a bulk operation from one table to another. Is there such a thing in CQL or elsewhere? I'm currently using Hector to talk to Cassandra.
Edit: it looks like sstableloader might be relevant, but is awfully low-level for something that I'd expect to be a fairly common use case. Taking just a subset of rows from one table to another also seems less than trivial in that framework.
Correct, this is not supported natively. (Another alternative would be a map/reduce job.) Cassandra's API focuses on short requests for applications at scale, not batch or analytical queries.

Resources