I have a Cassandra database and a Spark cluster that will get its input from Cassandra to do some processing.
In my Cassandra database, I have some tables that contain time series. I am looking for a way to visualize these time series easily, without multiplying databases.
Grafana is a great tool for that, but unfortunately it seems there is no way to plug it into Cassandra.
So for now I am using Zeppelin notebooks on my Cassandra/Spark cluster, but the available features for displaying time series aren't as good as Grafana's.
I also cannot replace Cassandra with InfluxDB, because Cassandra is not used only for storing time series.
Unfortunately, there is no direct plugin for Cassandra as a datasource for Grafana. Below are the different possible ways you can achieve Cassandra-to-Grafana integration.
There is a pull request for Cassandra as a datasource (https://github.com/grafana/grafana/pull/9774), though it has not been merged into the Grafana master branch. You could run a fork of Grafana with this PR and use the plugin.
You can use KairosDB on top of Cassandra (KairosDB can be configured to use Cassandra as its datastore, so no multiple databases :) and use the KairosDB plugin. This approach has some drawbacks, though:
- You need to map the Cassandra schema to the KairosDB schema, and the KairosDB schema is metrics-based.
- Although KairosDB uses Cassandra as a datastore, it stores the data in its own schema and tables, so data is duplicated.
- If your app writes data to Cassandra directly, you need to write a simple client to pull the latest data from Cassandra and push it to KairosDB (see the sketch after this list).
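For that last point, here is a minimal sketch of such a bridge client in Scala, using the DataStax Java driver and KairosDB's REST ingestion endpoint. The keyspace, table, column, and metric names are illustrative assumptions, not anything from your actual schema:

```scala
import com.datastax.driver.core.Cluster
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import scala.collection.JavaConverters._

object CassandraToKairos extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect("my_keyspace") // hypothetical keyspace

  // Pull the latest points for one sensor (illustrative schema).
  val rows = session.execute(
    "SELECT ts, value FROM sensor_data WHERE sensor_id = ? LIMIT 100", "sensor-1")

  // KairosDB ingestion payload: [{"name": ..., "datapoints": [[epoch_ms, value], ...]}]
  val datapoints = rows.asScala
    .map(r => s"[${r.getTimestamp("ts").getTime}, ${r.getDouble("value")}]")
    .mkString(",")
  val payload =
    s"""[{"name": "sensor.value", "datapoints": [$datapoints], "tags": {"sensor": "sensor-1"}}]"""

  // POST to KairosDB's REST API (default port 8080); it replies 204 on success.
  val conn = new URL("http://localhost:8080/api/v1/datapoints")
    .openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setDoOutput(true)
  conn.setRequestProperty("Content-Type", "application/json")
  conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))
  println(s"KairosDB responded: ${conn.getResponseCode}")
  cluster.close()
}
```

In practice you would track a high-water mark (e.g. the last pushed timestamp), so repeated runs only ship new rows.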
You can implement the SimpleJSON plugin for Grafana (https://github.com/grafana/simple-json-datasource). There are lots of examples of SimpleJSON implementations available; write one for Cassandra and open-source it :)
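To give an idea of the scope, here is a minimal skeleton of a SimpleJSON backend using only the JDK's built-in HTTP server. The /query handler below returns a hardcoded series just to show the response shape the protocol expects; a real implementation would parse the posted time range and targets and run CQL queries against Cassandra:

```scala
import com.sun.net.httpserver.{HttpExchange, HttpServer}
import java.net.InetSocketAddress

object SimpleJsonBackend extends App {
  val server = HttpServer.create(new InetSocketAddress(3003), 0)

  def respond(ex: HttpExchange, body: String): Unit = {
    val bytes = body.getBytes("UTF-8")
    ex.getResponseHeaders.add("Content-Type", "application/json")
    ex.sendResponseHeaders(200, bytes.length)
    ex.getResponseBody.write(bytes)
    ex.close()
  }

  // Health check: SimpleJSON expects a 200 on "/".
  server.createContext("/", (ex: HttpExchange) => respond(ex, "{}"))

  // /search lists available targets (a real backend would read Cassandra metadata).
  server.createContext("/search", (ex: HttpExchange) =>
    respond(ex, """["sensor.value"]"""))

  // /query returns series as [[value, epoch_millis], ...] per target.
  server.createContext("/query", (ex: HttpExchange) =>
    respond(ex,
      """[{"target":"sensor.value","datapoints":[[21.5,1262304000000],[22.1,1262304060000]]}]"""))

  server.start()
}
```

Point a SimpleJSON datasource at http://localhost:3003 and the hardcoded series should render in a panel.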
You can push the data to Elasticsearch and use it as a datasource. ES is supported as a datasource by all major visualization tools.
A bit too late, but there is now a direct integration: a Cassandra datasource for Grafana:
https://github.com/HadesArchitect/GrafanaCassandraDatasource
I would suggest using Banana for visualization, but for this Solr should be enabled on the time series table. Banana is a fork of Kibana and also has powerful dashboard configuration capabilities.
https://github.com/lucidworks/banana
Related
I am using HBase as the backend for JanusGraph. I have to migrate to Cassandra as the backend. What is the best way to migrate the old data?
One way to go about it is to read the data from HBase and put it into Cassandra using Java code.
Migrating data from JanusGraph is not well supported, so I would prefer to start from copies of the data that were made before it was ingested into JanusGraph. If that is not an option, your suggestion of using Java code to read from one graph and ingest into the other comes first.
Naturally, you want to parallelize this, because millions of operations on a single thread and process take too long to be practical. Although JanusGraph supports OLAP traversals for reading vertices and edges in parallel, JanusGraph OLAP has its own problems, and you are probably better off segmenting the data using a mixed index in JanusGraph and having each process/thread read the segment assigned to it using an OLTP traversal (see the sketch below).
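As a rough sketch of what one such worker could look like on the TinkerPop/JanusGraph APIs: each process copies the vertices in one segment, selected here by a range on a hypothetical indexed createdAt property. The property name, config file paths, and segmentation scheme are all assumptions; edge copying needs a second pass that maps source vertex ids to target vertex ids and is omitted here:

```scala
import org.apache.tinkerpop.gremlin.process.traversal.P
import org.janusgraph.core.JanusGraphFactory
import scala.collection.JavaConverters._

object CopySegment extends App {
  // Source graph on HBase, target graph on Cassandra (hypothetical config files).
  val source = JanusGraphFactory.open("conf/janusgraph-hbase.properties")
  val target = JanusGraphFactory.open("conf/janusgraph-cql.properties")
  val gSrc = source.traversal()
  val gDst = target.traversal()

  // Segment bounds assigned to this worker, e.g. "0 1000000".
  val Array(lo, hi) = args.map(_.toLong)

  gSrc.V().has("createdAt", P.between(Long.box(lo), Long.box(hi)))
    .toList.asScala.foreach { v =>
      // Copy the vertex label and all of its properties into the target graph.
      val copy = gDst.addV(v.label()).next()
      v.keys().asScala.foreach(k => copy.property(k, v.value[AnyRef](k)))
    }
  target.tx().commit()
  source.close(); target.close()
}
```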
I would like to ask whether Ignite is suitable for my use case, which is:
- Load all the data from the Oracle tables into the Ignite cache, then run various SQL queries (aggregation/join/subquery) against the data in the cache.
- When Oracle has newly created or updated data, there should be some way to insert that data into the cache or update the corresponding entries in the cache.
- When the cache goes down, there should be some way to restore the data from Oracle.
I'm not sure whether Ignite SQL Grid can fit this use case.
Also, I notice that IgniteRDD is not immutable. Is IgniteRDD suitable for this use case? That is, could I first load the data from Oracle into an IgniteRDD, and then apply the corresponding changes to the IgniteRDD as data is created or updated in Oracle? But it looks like IgniteRDD doesn't support complicated SQL (aggregation/join/subquery)?
This is one of the basic use cases supported by Ignite.
Data can be pre-loaded from Oracle using one of the methods covered in this documentation section.
If you're planning to update the data in Ignite first and propagate it to Oracle afterwards (which is the preferred way), then it makes sense to use Oracle as a CacheStore in write-through/read-through mode. Ignite will make sure to sync up the data with the persistence layer. Moreover, it will be straightforward to pre-load data from Oracle if the cluster is restarted.
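A minimal sketch of that wiring, assuming a hypothetical persons cache with Long keys and String values; OraclePersonStore is a stand-in where the real JDBC calls against your Oracle schema would go:

```scala
import javax.cache.Cache
import javax.cache.configuration.FactoryBuilder
import org.apache.ignite.Ignition
import org.apache.ignite.cache.store.CacheStoreAdapter
import org.apache.ignite.configuration.CacheConfiguration

// Hypothetical store: a real one would run JDBC against Oracle in each method.
class OraclePersonStore extends CacheStoreAdapter[java.lang.Long, String] {
  override def load(key: java.lang.Long): String = {
    // SELECT ... FROM persons WHERE id = key  (JDBC call goes here)
    null
  }
  override def write(e: Cache.Entry[_ <: java.lang.Long, _ <: String]): Unit = {
    // INSERT or UPDATE the row in Oracle
  }
  override def delete(key: Any): Unit = {
    // DELETE the row in Oracle
  }
}

object IgniteWithOracle extends App {
  val cacheCfg = new CacheConfiguration[java.lang.Long, String]("persons")
  cacheCfg.setCacheStoreFactory(FactoryBuilder.factoryOf(classOf[OraclePersonStore]))
  cacheCfg.setReadThrough(true)   // cache misses fall through to Oracle
  cacheCfg.setWriteThrough(true)  // cache writes propagate to Oracle

  val ignite = Ignition.start()
  val cache = ignite.getOrCreateCache(cacheCfg)
  cache.put(1L, "Alice") // written to Ignite and, via the store, to Oracle
}
```

With this in place, a cache miss falls through to Oracle and every put() is persisted there, which is also what makes re-loading after a cluster restart straightforward.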
Finally, you can take advantage of the GridGain Web Console by connecting to Oracle and mapping Oracle's schema to Ignite cache configurations and POJO objects.
As I mentioned, it's recommended to make all updates through Ignite first, which will persist them to Oracle. But if Oracle is updated by other applications that are not aware of Ignite, you need to update the Ignite cluster on your own somehow. Ignite doesn't have any feature that covers this use case. However, this can easily be implemented with GridGain, which is built on top of Ignite, using its Oracle GoldenGate integration.
Once the data is in the Ignite cluster, use SQL Grid to query and/or update your data. The SQL Grid engine is ANSI-99 compliant and doesn't have any limitations.
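For illustration, a cross-cache join with aggregation via SqlFieldsQuery; the Person/City types and the persons/cities cache names are hypothetical and would need to be configured as SQL entities:

```scala
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery
import scala.collection.JavaConverters._

object SqlGridExample extends App {
  val ignite = Ignition.start()
  val cache = ignite.cache[Any, Any]("persons")

  // Each cache is exposed as a SQL schema named after the cache.
  val query = new SqlFieldsQuery(
    """SELECT c.name, AVG(p.salary)
      |FROM "persons".Person p JOIN "cities".City c ON p.cityId = c.id
      |GROUP BY c.name""".stripMargin)

  cache.query(query).getAll.asScala
    .foreach(row => println(row.asScala.mkString(", ")))
}
```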
As for the Ignite Shared RDD, it stores data in a distributed Ignite cache, which is why it's mutable, unlike native Spark RDDs. The Shared RDD's SQL capabilities are exactly the same: it's just one more API on top of SQL Grid.
I've been reading about Apache Cassandra lately to learn how it works and how to use it for IoT projects, especially where a time-series-based database is needed.
However, I started to notice that Apache Spark is often mentioned when people talk about Cassandra too.
The question is: as long as I can use a Cassandra cluster of nodes to serve my app, to store and read data, why would I need Apache Spark? Any useful use cases are appreciated!
The answer is broad, but to summarize: Cassandra is highly scalable and there are lots of scenarios where it fits, but CQL syntax has some limitations if your schema isn't designed for certain queries.
If you want to make use of your data without restrictions, run analytical workloads over your Cassandra data, or join it with other tables, Spark is the most appropriate complement. Spark has tight integration with Cassandra.
I recommend you check out these slides: http://www.slideshare.net/patrickmcfadin/apache-cassandra-and-spark-you-got-the-the-lighter-lets-start-the-fire?qid=48e2528c-a03c-49b4-879e-45599b2aff34&v=&b=&from_search=5
Cassandra is for storing data, whereas Spark is for performing computation on top of it. An analogy with Hadoop: Cassandra is like HDFS, whereas Spark is like MapReduce.
This matters especially for computations: when using the DataStax Cassandra connector, data locality can be exploited. If you need to do some computation that modifies a row (but doesn't really depend on anything else), that operation is optimized to run locally on each machine in the cluster without any data movement over the network.
The same goes for a lot of other Spark workloads: the actions (functions that modify the data) are executed locally, and only the result is sent to the client. As far as I know, when you want to do analytics on top of data stored in Cassandra, Spark is a well-supported and popular choice. Even if you don't need to do any operations on the data, you can still use Spark for other purposes, as mentioned below.
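As a concrete, hedged example of that locality-aware pattern with the DataStax Spark Cassandra connector (the iot keyspace, sensor_data table, and column names are made up):

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object CassandraAnalytics extends App {
  val conf = new SparkConf()
    .setAppName("cassandra-analytics")
    .set("spark.cassandra.connection.host", "127.0.0.1")
  val sc = new SparkContext(conf)

  // Each Spark partition reads the Cassandra token ranges local to its node.
  val avgBySensor = sc.cassandraTable("iot", "sensor_data")
    .select("sensor_id", "value")
    .map(row => (row.getString("sensor_id"), (row.getDouble("value"), 1L)))
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .mapValues { case (sum, count) => sum / count }

  // Only the small aggregated result travels back to the driver.
  avgBySensor.collect().foreach(println)
}
```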
Spark Streaming can be used to ingest or export data from Cassandra (I have used it a lot personally). The same import/export can be achieved with small hand-written JDBC agents, but the Spark Streaming code I wrote for ingesting 10 GB of data from Cassandra is less than 20 lines, with multi-machine multi-threading built in and an admin UI where I can see the job's progress.
With Spark + Zeppelin, we can visualize Cassandra data using Spark; we can build beautiful UIs with little Spark code, where users can even enter input and see the result as a graph, table, etc.
Note: visualization can actually be better with Kibana/Elasticsearch or Solr/Banana when used with Cassandra, but they are hard to set up, and indexing has its own issues to deal with.
There are a lot of other use cases, but personally I used Spark as a Swiss army knife for multiple tasks.
Apache Cassandra has features like fast reads and writes, so you can use it with Apache Spark Streaming to write your data directly into Cassandra with minimal lag.
As a use case, consider a video application that uploads video via streaming and stores it directly into a Cassandra blob.
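A small sketch of that pattern with the connector's streaming support; the socket source and the iot.events table are illustrative stand-ins (a real video pipeline would more likely consume from Kafka and write blob payloads):

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamToCassandra extends App {
  val conf = new SparkConf()
    .setAppName("stream-to-cassandra")
    .set("spark.cassandra.connection.host", "127.0.0.1")
  val ssc = new StreamingContext(conf, Seconds(5))

  // Parse "id,payload" lines and append each micro-batch to Cassandra.
  ssc.socketTextStream("localhost", 9999)
    .map { line =>
      val Array(id, payload) = line.split(",", 2)
      (id, payload)
    }
    .saveToCassandra("iot", "events", SomeColumns("event_id", "payload"))

  ssc.start()
  ssc.awaitTermination()
}
```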
I am looking for directions:
I have a Cassandra database with latitude and longitude data. I need to search for data within a radius, or within a box of coordinates, around a point. I am using the golang (gocql) client to query Cassandra.
I need some understanding regarding Spark and Cassandra, as this seems like the way to go.
Are the following assumptions correct? I have 2 Cassandra nodes (the data has a replication factor of 2).
Should I then install an extra node, install Spark on it, and connect it to the other two existing Cassandra nodes containing the data (with the Spark Connector from DataStax)?
And do the two existing Cassandra nodes need to have Spark workers installed on them to work with the Spark master node?
When the Spark setup is in place, do you query (Scala) the existing data, then save the results onto the Spark node and query those with the golang (gocql) client?
Any directions are welcome. Thanks in advance.
Geospatial searching is a pretty deep topic. If it's just searches you're after (not batch/analytics), I can tell you that you probably don't want to use Spark. Spark isn't very good at 'searching' for data, even when it's geospatial. The main reason is that Spark doesn't index data for efficient searches, and you'd have to create a job/context (unless you're using a job server) every time you want to do a search. That takes forever when you're thinking in terms of user-facing application response time.
Solr, Elasticsearch, and DataStax Enterprise Search (disclaimer: I work for DataStax) are all capable of box and radius searches on Cassandra data, and they do so in near real time.
To answer your original question, though: if the bulk of your analytics in general come from Cassandra data, it may be a good idea to run Spark on the same nodes as Cassandra for data locality. The nice thing is that Spark scales quite nicely, so if you find Spark taking too many resources from Cassandra, you can simply scale out (both Cassandra and Spark).
Should I then install an extra node and install Spark on this and then connect it to the other two existing Cassandra nodes containing the data (with the Spark Connector from DataStax)?
Spark is a cluster compute engine, so it needs a cluster of nodes to work well. You'll need to install it on all nodes if you want it to be as efficient as possible.
And do the two existing Cassandra nodes need to have Spark workers installed on them to work with the Spark Master node?
I don't think they 'have' to have them, but it's a good idea for locality. There's a really good video on academy.datastax.com that shows how the Spark Cassandra connector reads data from Cassandra into Spark. I think it will clear a lot of things up for you: https://academy.datastax.com/demos/how-spark-cassandra-connector-reads-data
When the Spark setup is in place, do you query (Scala) the existing data and then save the data onto the Spark node and then query this with the golang (gocql) client?
The Spark-Cassandra connector can communicate with both Cassandra and Spark. There are methods, saveToCassandra() for example, that will write data back to Cassandra after your jobs are processed. Then you can use your client as you normally would.
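For example, a hedged sketch of that cycle: read raw rows, compute something, and save the result into a table that your gocql client then queries. All keyspace, table, and column names here are made up:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object WriteBack extends App {
  val sc = new SparkContext(new SparkConf()
    .setAppName("write-back")
    .set("spark.cassandra.connection.host", "127.0.0.1"))

  sc.cassandraTable("geo", "positions")
    .map(row => (row.getString("device_id"), 1L))
    .reduceByKey(_ + _)                          // e.g. points recorded per device
    .saveToCassandra("geo", "position_summaries",
      SomeColumns("device_id", "point_count"))   // the gocql client reads this table
}
```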
There are some really good free Spark + Cassandra tutorials at academy.datastax.com. This is also a good place to start: http://rustyrazorblade.com/2015/01/introduction-to-spark-cassandra/
I am trying to get all inserts, updates, and deletes on a normalized DB2 database (hosted on an IBM mainframe) synced to a Cassandra database. I also need to denormalize these changes before I write them to Cassandra, so that the data structure meets my Cassandra model.
I searched on Google, but the tools I found either lack processing support or streaming CDC support.
Is there any tool out there that can help me achieve the above?
It's likely that no stock tool exists. What's the format of the CDC stream coming out? What queries do you need to run? Like any other Cassandra data modeling question, start with the queries you need to run and work backwards to the table structure(s).
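As a tiny illustration of working backwards from the query (everything here, keyspace, table, and columns alike, is hypothetical rather than derived from your DB2 model): if the read is "all orders for a customer, newest first", the denormalized Cassandra table is keyed for exactly that, and the CDC consumer writes each DB2 change into that shape:

```scala
import com.datastax.driver.core.Cluster

object QueryFirstModel extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect()

  session.execute(
    "CREATE KEYSPACE IF NOT EXISTS shop WITH replication = " +
    "{'class': 'SimpleStrategy', 'replication_factor': 1}")

  // Keyed by the read path: partition on customer, cluster newest-first.
  session.execute(
    """CREATE TABLE IF NOT EXISTS shop.orders_by_customer (
      |  customer_id text,
      |  order_ts    timestamp,
      |  order_id    uuid,
      |  total       decimal,
      |  PRIMARY KEY ((customer_id), order_ts)
      |) WITH CLUSTERING ORDER BY (order_ts DESC)""".stripMargin)

  // The CDC consumer would denormalize each DB2 change into this shape.
  session.execute(
    "INSERT INTO shop.orders_by_customer (customer_id, order_ts, order_id, total) " +
    "VALUES (?, ?, uuid(), ?)",
    "cust-42", new java.util.Date(), new java.math.BigDecimal("19.99"))

  // The target query then reads straight off one partition:
  // SELECT * FROM shop.orders_by_customer WHERE customer_id = 'cust-42' LIMIT 20;
  cluster.close()
}
```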