I am new to DSE graphs.
I have around 1000 records in a CSV file, where each record has around 20 attributes, which I want to load into Gremlin. Each record would form a separate vertex in the graph.
Is there a way to load all the records directly in one go? I found a link that used DynamoDB as the storage backend, but that link didn't help; my backend is DSE Cassandra.
I also tried populating the Cassandra DB with the records and then trying to form a graph from this data, but it didn't work.
Please let me know if there is a way to do this. Thanks in advance.
The DSE Graph Loader is a standalone tool that can load CSV files into DSE Graph. You would need to create a mapping script to use with the loader. You can find its documentation at https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/graph/dgl/dglCSV.html
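If you would rather script the load yourself instead of using the loader, a minimal sketch in Scala using the DSE Java driver's graph API could look like the following. The graph name, vertex label, and the two sample properties are hypothetical placeholders; you would add one property() step per attribute in your schema:

```scala
import com.datastax.driver.dse.DseCluster
import com.datastax.driver.dse.graph.{GraphOptions, SimpleGraphStatement}
import scala.io.Source

object CsvVertexLoader {
  def main(args: Array[String]): Unit = {
    // "records" is a hypothetical graph name; change it to your own.
    val cluster = DseCluster.builder()
      .addContactPoint("127.0.0.1")
      .withGraphOptions(new GraphOptions().setGraphName("records"))
      .build()
    val session = cluster.connect()

    val lines = Source.fromFile("records.csv").getLines()
    lines.next() // skip the header row

    for (line <- lines) {
      val cols = line.split(",")
      // One addV per row; 'record', 'rid' and 'rname' are hypothetical --
      // add a .property(...) step per attribute in your real schema.
      val stmt = new SimpleGraphStatement(
        "g.addV('record').property('id', rid).property('name', rname)")
        .set("rid", cols(0))
        .set("rname", cols(1))
      session.executeGraph(stmt)
    }
    cluster.close()
  }
}
```

For ~1000 records a row-by-row load like this finishes quickly; the Graph Loader is the better fit once volumes grow.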
Related
I am using HBase as the backend for JanusGraph. I have to migrate to Cassandra as the backend. What is the best way to migrate the old data?
One way to go about it is to read the data from HBase and put it into Cassandra using Java code.
Migrating data from JanusGraph is not well supported, so I would prefer to start from copies of the data that were made before ingesting it into JanusGraph. If that is not an option, your suggestion of using Java code to read from one graph and ingest into the other comes first.
Naturally, you want to parallelize this, because millions of operations on a single thread and process take too long to be practical. Although JanusGraph supports OLAP traversals for reading vertices and edges in parallel, JanusGraph OLAP has its own problems, and you are probably better off segmenting the data using a mixed index in JanusGraph and having each process/thread read the segment assigned to it using an OLTP traversal, as in the sketch below.
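For illustration, a minimal sketch of such a segmented copy in Scala (the two property files and the indexed segment property are hypothetical, and edges would need a similar second pass):

```scala
import org.janusgraph.core.JanusGraphFactory
import org.apache.tinkerpop.gremlin.structure.T
import scala.collection.JavaConverters._

object SegmentedMigration {
  def migrateSegment(segment: Int): Unit = {
    // Hypothetical config files, one per storage backend.
    val source = JanusGraphFactory.open("janusgraph-hbase.properties")
    val target = JanusGraphFactory.open("janusgraph-cql.properties")
    val g = source.traversal()

    // "segment" stands in for any mixed-index-backed property that
    // splits the data into disjoint chunks, one per thread/process.
    g.V().has("segment", Int.box(segment)).toList.asScala.foreach { v =>
      val copy = target.addVertex(T.label, v.label())
      v.properties[AnyRef]().asScala.foreach { p =>
        copy.property(p.key(), p.value())
      }
    }
    target.tx().commit() // real code would commit in batches
    source.close(); target.close()
  }

  def main(args: Array[String]): Unit =
    (0 until 8).par.foreach(migrateSegment) // one segment per thread
}
```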
Can you guys explain to me the best way to export an extensive list of tables from various Oracle and SQL Server schemas in JSON format using Apache Spark? Can Spark handle multiple DataFrames in the same application?
Thanks!
Yes, you can. Assuming you have data in both SQL Server and an Oracle DB, create a JDBC connection for each and load the data into two DataFrames; after that you can use toJSON or similar functions and create your own JSON structure as per the requirement. In short, yes, Spark can handle reading from multiple different sources in a single application.
Examples of reading from various sources such as Oracle and PostgreSQL are easy to find online.
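A minimal sketch of that pattern in Scala (hosts, credentials, and table names here are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object ExportToJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-to-json").getOrCreate()

    // Hypothetical connection details; swap in your real hosts/credentials.
    val oracleDf = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//orahost:1521/ORCL")
      .option("dbtable", "SCHEMA1.SOME_TABLE")
      .option("user", "user").option("password", "pass")
      .option("driver", "oracle.jdbc.OracleDriver")
      .load()

    val sqlServerDf = spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://mssqlhost:1433;databaseName=mydb")
      .option("dbtable", "dbo.OTHER_TABLE")
      .option("user", "user").option("password", "pass")
      .load()

    // One application, many DataFrames: write each table out as JSON.
    oracleDf.write.json("/data/json/some_table")
    sqlServerDf.write.json("/data/json/other_table")

    spark.stop()
  }
}
```

df.write.json produces JSON Lines (one object per row); use toJSON instead if you want the documents back as a Dataset of strings to reshape first.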
I have a Cassandra database and a Spark cluster that gets its input from Cassandra to do some processing.
In my Cassandra database, I have some tables that are time series. I am looking for a way to visualize those time series easily without multiplying databases.
Grafana is a great tool for that, but unfortunately, it seems there is no way to plug it into Cassandra.
So, for now I am using Zeppelin notebooks with my Cassandra/Spark cluster, but the available features for displaying time series aren't as good as those of Grafana.
I also cannot replace Cassandra with InfluxDB, because my Cassandra is not used only for time-series storage.
Unfortunately there is no direct plugin for Cassandra as a datasource for Grafana. Below are the different possible ways you can achieve Cassandra to Grafana integration.
There is a pull request for Cassandra as a datasource (https://github.com/grafana/grafana/pull/9774), though it has not been merged into the Grafana master branch. You could run a fork of Grafana with this PR and use the plugin.
You can use KairosDB on top of Cassandra (KairosDB can be configured to use Cassandra as its datastore, so no multiple databases :) and use the KairosDB plugin, but this approach has some drawbacks:
- You need to map the Cassandra schema to the KairosDB schema, which is a metrics-based schema.
- Though KairosDB uses Cassandra as a datastore, it stores the data in a different schema and tables, so data is duplicated.
- If your app writes data to Cassandra directly, you need to write a simple client to pull the latest data from Cassandra and push it to KairosDB.
You can implement the SimpleJSON plugin for Grafana (https://github.com/grafana/simple-json-datasource). There are lots of examples of SimpleJSON implementations available; write one for Cassandra and open-source it :) (see the sketch after this list).
You can push the data to Elasticsearch and use it as a datasource. ES is supported as a datasource by all major visualization tools.
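For the SimpleJSON route, here is a rough sketch of what such a backend could look like in Scala, using only the JDK HTTP server and the DataStax Java driver. The metrics.cpu_load table and the metric name are hypothetical, and a real implementation would also parse the time range Grafana posts to /query:

```scala
import com.datastax.driver.core.Cluster
import com.sun.net.httpserver.{HttpExchange, HttpServer}
import java.net.InetSocketAddress
import scala.collection.JavaConverters._

object CassandraSimpleJson {
  def main(args: Array[String]): Unit = {
    val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect()
    val server = HttpServer.create(new InetSocketAddress(8081), 0)

    // Grafana probes "/" to test the datasource connection.
    server.createContext("/", (ex: HttpExchange) => respond(ex, "{}"))

    // "/search" returns the list of selectable metrics.
    server.createContext("/search", (ex: HttpExchange) =>
      respond(ex, """["cpu_load"]"""))

    // "/query" returns datapoints as [[value, epoch_millis], ...].
    server.createContext("/query", (ex: HttpExchange) => {
      val rows = session.execute(
        "SELECT ts, value FROM metrics.cpu_load LIMIT 1000").all().asScala
      val points = rows.map(r =>
        s"[${r.getDouble("value")}, ${r.getTimestamp("ts").getTime}]")
      respond(ex,
        s"""[{"target":"cpu_load","datapoints":[${points.mkString(",")}]}]""")
    })

    server.start()
  }

  private def respond(ex: HttpExchange, body: String): Unit = {
    val bytes = body.getBytes("UTF-8")
    ex.getResponseHeaders.add("Content-Type", "application/json")
    ex.sendResponseHeaders(200, bytes.length)
    val os = ex.getResponseBody
    os.write(bytes)
    os.close()
  }
}
```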
A bit too late, but there is a direct integration now, a Cassandra datasource for Grafana:
https://github.com/HadesArchitect/GrafanaCassandraDatasource
I would suggest using the Banana visualization tool, but for this Solr should be enabled on the time-series table. Banana is a fork of Kibana and also has powerful dashboard configuration capabilities.
https://github.com/lucidworks/banana
I have created a table, ztest7, in the default database in Hive. I am able to query it using beeline. In Tableau, I can query it using custom SQL.
However, the table does NOT show up when I search for it.
Am I missing something here?
Tableau Desktop Version = v10.1.1
Hive = v2.0.1
Spark = v2.1.0
Best Regards
I have the same issue with Tableau Desktop 10 (Mac) connecting to Hive (2.1.1) via Spark SQL 2.1 (on a CentOS 7 server).
This is what I got from Tableau Support:
In Tableau Desktop, the ability to connect to Spark SQL without defining a default schema is not currently built into the product.
As a preliminary step, to define a default schema, configure the Spark SQL hivemetastore to utilize a SchemaRDD or DataFrame. This must be defined in the Hive metastore for Tableau Desktop to be able to access it. Pure schema-less Spark RDDs cannot be queried by Spark SQL because of the lack of a schema. RDDs can be converted into SchemaRDDs, which have additional schema metadata, as Spark SQL provides access to SchemaRDDs. When a SchemaRDD is created, it is only available in the local namespace or context, and is unavailable to external services accessing Spark through ODBC and the Spark Thrift Server. For Tableau to have access, the SchemaRDD needs to be registered in a catalog that is available outside of just the local context; the Hive metastore is currently the only supported service.
I don't know how to check/implement this, though the sketch below shows what registering a table in the metastore can look like.
PS: I'd have posted this as a comment, but I am not allowed to as I am new to Stack Overflow.
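For reference, a minimal sketch of what registering a DataFrame in the Hive metastore can look like in Spark 2.x (the source path and table name are hypothetical). Tables persisted this way become visible through the Spark Thrift Server, and therefore to Tableau:

```scala
import org.apache.spark.sql.SparkSession

object RegisterTable {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() makes the session use the Hive metastore.
    val spark = SparkSession.builder()
      .appName("register-table")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.read.json("/data/ztest7.json") // hypothetical source

    // createOrReplaceTempView is session-local: invisible to Tableau.
    df.createOrReplaceTempView("ztest7_tmp")

    // saveAsTable persists the table in the Hive metastore, so external
    // clients going through the Spark Thrift Server can see it.
    df.write.saveAsTable("default.ztest7")

    spark.stop()
  }
}
```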
In the field labeled Table on the left side of the screen, try selecting Contains, entering part of your table name, and hitting Enter.
I ran into a similar issue. In my case, I had loaded the tables using Hive, but the Tableau connection to the data source was made using Impala.
To fix the issue of the tables not showing in the Tableau dropdown, try running INVALIDATE METADATA database.table_name in the Impala interface. This fixed the problem for me.
To know why this fixes the issue, refer to this link.
I am looking for directions:
I have a Cassandra database with latitude and longitude data. I need to search for data within a radius, or within box coordinates, around a point. I am using a Golang (gocql) client to query Cassandra.
I need some understanding regarding Spark and Cassandra, as this seems like the way to go.
Are the following assumptions correct? I have 2 Cassandra nodes (the data has a replication factor of 2).
Should I then install an extra node, install Spark on it, and then connect it to the other two existing Cassandra nodes containing the data (with the Spark Connector from DataStax)?
And do the two existing Cassandra nodes need to have Spark workers installed on them to work with the Spark master node?
When the Spark setup is in place, do you query (Scala) the existing data, then save the data onto the Spark node, and then query this with the Golang (gocql) client?
Any directions are welcome.
Thanks in advance.
Geospatial searching is a pretty deep topic. If it's just searches that you're after (not batch/analytics), I can tell you that you probably don't want to use Spark. Spark isn't very good at 'searching' for data, even when it's geospatial. The main reason is that Spark doesn't index data for efficient searches, and you'd have to create a job/context (unless you're using a job server) every time you want to do a search. That takes forever when you're thinking in terms of user-facing application response time.
Solr, Elasticsearch, and DataStax Enterprise Search (disclaimer: I work for DataStax) are all capable of box and radius searches on Cassandra data and do so in near real time.
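As a hedged illustration of the search-engine route, a radius query against a DSE Search index could be issued from Scala via the DataStax Java driver like this; the geo.points table, its coords column, and the Search index on it are all assumed to exist:

```scala
import com.datastax.driver.core.Cluster

object RadiusSearch {
  def main(args: Array[String]): Unit = {
    val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect()

    // Hypothetical: table geo.points with a DSE Search index on "coords".
    // geofilt does a radius search (d is in km); bbox would do a box search.
    val rs = session.execute(
      """SELECT id, lat, lon FROM geo.points WHERE solr_query =
        |'{"q":"*:*","fq":"{!geofilt sfield=coords pt=59.33,18.06 d=10}"}'""".stripMargin)

    rs.forEach(row => println(row.getString("id")))
  }
}
```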
To answer your original question though: if the bulk of your analytics in general comes from Cassandra data, it may be a good idea to run Spark on the same nodes as Cassandra for data locality. The nice thing is that Spark scales quite nicely, so if you find Spark taking too many resources from Cassandra, you can simply scale out (both Cassandra and Spark).
Should I then install an extra node, install Spark on it, and then connect it to the other two existing Cassandra nodes containing the data (with the Spark Connector from DataStax)?
Spark is a cluster compute engine so it needs a cluster of nodes to work well. You'll need to install it on all nodes if you want it to be as efficient as possible.
And do the two existing Cassandra nodes need to have Spark workers installed on them to work with the Spark master node?
I don't think they 'have' to have them, but it's a good idea for locality. There's a really good video on academy.datastax.com that shows how the Spark Cassandra Connector reads data from Cassandra into Spark. I think it will clear a lot of things up for you: https://academy.datastax.com/demos/how-spark-cassandra-connector-reads-data
When the Spark setup is in place, do you query (Scala) the existing data, then save the data onto the Spark node, and then query this with the Golang (gocql) client?
The Spark-Cassandra connector can communicate with both Cassandra and Spark. There are methods, saveToCassandra() for example, that will write data back to Cassandra after your jobs are processed. Then you can use your client as you normally would.
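A minimal sketch of that round trip with the connector, in Scala (the geo keyspace, table, and column names are hypothetical):

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object GeoJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("geo-job")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read the raw coordinates; "geo.points" is a hypothetical table.
    val points = sc.cassandraTable[(String, Double, Double)]("geo", "points")
      .select("id", "lat", "lon")

    // Hypothetical processing step: keep points inside a bounding box.
    val inBox = points.filter { case (_, lat, lon) =>
      lat > 59.0 && lat < 60.0 && lon > 17.0 && lon < 19.0 }

    // Write results back so the gocql client can query them over CQL.
    inBox.saveToCassandra("geo", "points_in_box",
      SomeColumns("id", "lat", "lon"))

    sc.stop()
  }
}
```

The results land in a plain Cassandra table, so the gocql client reads them with an ordinary CQL SELECT, with no Spark involvement at query time.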
There are some really good free Spark + Cassandra tutorials at academy.datastax.com. This is also a good place to start: http://rustyrazorblade.com/2015/01/introduction-to-spark-cassandra/