Load example GraphSON file in Gremlin with Cassandra

I'm trying to load the example Graph of the Gods file that is distributed with Titan using the loadGraphSON function. I am working with Titan 0.5.4 with Hadoop 2 and have executed the following steps:
Downloaded and unpacked a fresh Titan 0.5.4 with Hadoop 2.
Started Titan, Rexster, Cassandra, ElasticSearch with the command bin/titan.sh -c cassandra-es start
Ran Gremlin with: bin/gremlin.sh
Opened a new TitanFactory instance with the required settings: g = TitanFactory.open('conf/titan-cassandra-es.properties')
Then I tried to load the Graph of the Gods from the examples-directory with g.loadGraphSON("examples/graph-of-the-gods")
I do not get an error, but trying to show all vertices with g.V returns nothing. Am I executing the right steps here, or am I doing something wrong?

Note that this question was answered on the Aurelius Graphs Mailing List:
https://groups.google.com/d/msg/aureliusgraphs/FiCvX891r6g/BkmWj3xc3ikJ
Basically:
1) the filename should be examples/graph-of-the-gods.json
2) you can also use GraphOfTheGodsFactory.load(g) which will also create indexes and type definitions
I'd say the second point above would be the preferred manner in which to load Graph of the Gods.
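For reference, a minimal Gremlin session along those lines might look like this (the properties file path is taken from the question; GraphOfTheGodsFactory ships with the Titan distribution):
g = TitanFactory.open('conf/titan-cassandra-es.properties')
GraphOfTheGodsFactory.load(g)
g.V.count()
After the load, g.V.count() should report the vertices of the sample graph rather than nothing.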

If you're not setting up a Titan Hadoop job, you could try using the Blueprints GraphSON reader to load graph data. See https://github.com/tinkerpop/blueprints/wiki/GraphSON-Reader-and-Writer-Library
In a Gremlin shell it looks a bit like this:
inStream = new FileInputStream("../examples/graph-of-the-gods.json")
GraphSONReader.inputGraph(g, inStream)

Related

How can I install flashtext on every executor?

I am using the flashtext library in a couple of UDFs. It works when I run it locally in client mode, but once I try to run it in the Cloudera Workbench with several executors, I get a ModuleNotFoundError.
After some research I found that it is possible to add archives (and packages?) to a SparkSession when creating it, so I tried:
SparkSession.builder.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz')
but it didn't help; the same error remains.
According to the Spark Configuration doc, there are other configs I could try, e.g. spark.submit.pyFiles, but I don't understand what these py-files to be added would have to look like.
Would it be enough to just create a Python script with this content?
from flashtext import KeywordProcessor
Could you tell me the easiest way to install flashtext on every node?
Edit:
In the meantime, I figured out that not only Flashtext was causing issues, but also every relative import from other scripts that I intended to use in a UDF. In order to fix it, I followed this article. I also took the source code from Flashtext and imported it into the main file without installing the actual library.
I think that in order to point Spark executors to Python modules extracted from your archive, you will need to add another config setting that adds their location to PYTHONPATH. Something like this:
SparkSession.builder \
.config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs') \
.config('spark.executorEnv.PYTHONPATH', './myUDFs')
Citing from the same link you have in the question:
spark.executorEnv.[EnvironmentVariableName]: Add the environment variable specified by EnvironmentVariableName to the Executor process. The user can specify multiple of these to set multiple environment variables.
There are no environment details in your question (or I'm simply not familiar with Cloudera Workbench), but if you're trying to run Spark on YARN, you may need to use a slightly different setting, spark.yarn.dist.archives.
Also, please make sure that your driver log contains a message confirming that the archive was actually uploaded, as in:
:
22/11/08 INFO yarn.Client: Uploading resource file:/absolute/path/to/your/archive.zip -> hdfs://nameservice/user/<your-user-id>/.sparkStaging/<application-id>/archive.zip
:
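Putting the pieces together, a minimal PySpark sketch might look like the following (the archive name comes from the question; it assumes the archive unpacks so that the flashtext package sits directly under ./myUDFs, which you should verify for your build):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = (SparkSession.builder
    .config('spark.archives', 'flashtext-2.7-pyh9f0a1d_0.tar.gz#myUDFs')
    .config('spark.executorEnv.PYTHONPATH', './myUDFs')
    .getOrCreate())

@udf(StringType())
def first_keyword(text):
    # Import inside the UDF so the lookup happens on the executor,
    # where ./myUDFs is on the PYTHONPATH.
    from flashtext import KeywordProcessor
    kp = KeywordProcessor()
    kp.add_keyword('spark')
    found = kp.extract_keywords(text or '')
    return found[0] if found else None
The import should then resolve on every executor, since each one extracts the archive into its own working directory.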

Error While inserting rows into Kudu using Spark Shell

I am new to Apache Kudu. I installed it on my Ubuntu system and later created a table in it using the Apache Spark shell. Now I am trying to insert data into that table using insertRows(), for which I am using the command below:
kuduContext.insertRows(customersDF, "spark_kudu_tbl")
where customersDF is a DataFrame and spark_kudu_tbl is a table in the Kudu database. I am getting the error below:
java.lang.NoSuchMethodError: org.apache.kudu.spark.kudu.KuduContext.insertRows(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;)V
... 70 elided
I have tried different options but none of them worked. Can anyone suggest a solution?
From the error message it appears as though you are using the wrong kudu-spark artifact; you should use kudu-spark2_2.11. Please start your spark-shell as below (replace the last bit with your Kudu version):
spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.3.0
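With the matching artifact on the classpath, the insert from the question should resolve. A rough sketch of the session (the master address is a placeholder, and the KuduContext constructor has changed between releases, so check the docs for your exact version):
import org.apache.kudu.spark.kudu._

// Placeholder master address; newer kudu-spark releases also take the SparkContext as a second argument.
val kuduContext = new KuduContext("kudu-master:7051")
kuduContext.insertRows(customersDF, "spark_kudu_tbl")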

Getting started with Spark (Datastax Enterprise)

I'm trying to set up and run my first Spark query following the official example.
On our local machines we have already set up the latest version of the DataStax Enterprise package (for now it is 4.7).
I do everything exactly according to the documentation, and I appended the latest version of dse.jar to my project, but errors come right from the beginning.
Here is the snippet from their example
SparkConf conf = DseSparkConfHelper.enrichSparkConf(new SparkConf())
.setAppName( "My application");
DseSparkContext sc = new DseSparkContext(conf);
Now it appears that the DseSparkContext class has only a default empty constructor.
Right after these lines comes the following:
JavaRDD<String> cassandraRdd = CassandraJavaUtil.javaFunctions(sc)
.cassandraTable("my_keyspace", "my_table", mapColumnTo(String.class))
.select("my_column");
And here comes the main problem: the CassandraJavaUtil.javaFunctions(sc) method accepts only a SparkContext as input and not a DseSparkContext (SparkContext and DseSparkContext are completely different classes, and one does not inherit from the other).
I assume that the documentation is not up to date with the release version. If anyone has met this problem before, please share your experience.
Thank you!
It looks like there's a bug in the docs. That should be:
DseSparkContext.apply(conf)
DseSparkContext is a Scala object which uses the apply function to create new SparkContexts. In Scala you can just write DseSparkContext(conf), but in Java you must actually call the method. I know you don't have access to this code, so I'll make sure that this gets fixed in the documentation and see if we can get better API docs up.
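As a sketch, the corrected snippet from the question might then look roughly like this (assuming a static import of CassandraJavaUtil.mapColumnTo; untested against DSE 4.7):
SparkConf conf = DseSparkConfHelper.enrichSparkConf(new SparkConf())
        .setAppName("My application");

// DseSparkContext is a Scala object, so call its apply method explicitly from Java.
// It returns a plain SparkContext, which CassandraJavaUtil accepts.
SparkContext sc = DseSparkContext.apply(conf);

JavaRDD<String> cassandraRdd = CassandraJavaUtil.javaFunctions(sc)
        .cassandraTable("my_keyspace", "my_table", mapColumnTo(String.class))
        .select("my_column");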

Labelling Neo4j database using Neo4django

This question is related to the GitHub issue of Neo4django. I want to create multiple graphs using the Neo4j graph DB from the Django web framework. I'm using Django 1.4.5, neo4j 1.9.2 and neo4django 0.1.8.
As of now Neo4django doesn't support labeling, but that is my core purpose: I want to be able to create labels from Neo4django. So I went into the source code and tried to tweak it a little to see if I can make this addition. In my understanding, the file 'db/models/properties.py' has a class BoundProperty(AttrRouter) which calls a Gremlin script through the function save(instance, node, node_is_new). The script is as follows:
script = '''
node=g.v(nodeId);
results = Neo4Django.updateNodeProperties(node, propMap);
'''
The script calls the update function from library.groovy, and the function looks intuitive and nice. I'm trying to add to this function to support labeling, but I have no experience with Groovy. Does anyone have any suggestions on how to proceed? Any help would be appreciated. If it works it would be a big addition to neo4django :)
Thank you
A little background:
The Groovy code you've highlighted is executed using the Neo4j Gremlin plugin. It supports the Gremlin graph DSL (e.g. node=g.v(nodeId)), which is implemented atop the Groovy language. Groovy itself is a dynamic superset of Java, so most valid Java code will work with scripts sent via connection.gremlin(...). Each script sent should define a results variable that will be returned to neo4django, even if it's just null.
Anyway, accessing Neo4j this way is handy (though it will be deprecated, I've heard :( ) because you can use the full Neo4j embedded Java API. Try something like this to add a label to a node:
from neo4django.db import connection
connection.gremlin("""
node = g.v(nodeId)
label = DynamicLabel.label('Label_Name')
node.rawVertex.addLabel(label)
""", nodeId=node_id)
You might also need to add an import for DynamicLabel; I haven't run this code, so I'm not sure. Debugging code written this way is a little tough, so make liberal use of the Gremlin tab in the Neo4j admin.
If you come up with a working solution, I'd love to see it (or an explanatory blog post!) since I'm sure it could be helpful to other users.
HTH!
NB: Labels will be properly supported shortly after Neo4j 2.0's release; they'll replace the current in-graph type structure.

Apache Pig: Load a file that shows fine using hadoop fs -text

I have files that are named part-r-000[0-9][0-9] and that contain tab-separated fields. I can view them using hadoop fs -text part-r-00000 but can't get them loaded using Pig.
What I've tried:
x = load 'part-r-00000';
dump x;
x = load 'part-r-00000' using TextLoader();
dump x;
but that only gives me garbage. How can I view the file using Pig?
What might be of relevance is that my HDFS is still using CDH-2 at the moment.
Furthermore, if I download the file locally and run file part-r-00000, it says part-r-00000: data, and I don't know how to unzip it locally.
According to the HDFS documentation, hadoop fs -text <file> can be used on "zip and TextRecordInputStream" data, so your data may be in one of these formats.
If the file was compressed, normally Hadoop would add the extension when outputting to HDFS, but if this was missing, you could try testing by unzipping/ungzipping/unbzip2ing/etc. locally. It appears Pig should do this decompressing automatically, but may require the file extension to be present (e.g. part-r-00000.zip) -- more info.
I'm not too sure about the TextRecordInputStream... it sounds like it would just be Pig's default method, but I could be wrong. I didn't see any mention of LOADing this data via Pig when I did a quick Google search.
Update:
Since you've discovered it is a sequence file, here's how you can load it using PiggyBank:
-- using Cloudera directory structure:
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar
--REGISTER /home/hadoop/lib/pig/piggybank.jar
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
-- Sample job: grab counts of tweets by day
A = LOAD 'mydir/part-r-000{00..99}' -- not sure if Pig likes the {00..99} syntax, but worth a shot
    USING SequenceFileLoader AS (key:long, val:long, etc.);
If you want to manipulate (read/write) sequence files with Pig, then you can give Twitter's Elephant-Bird a try as well.
You can find examples of how to read/write them here.
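As a rough sketch of what reading a sequence file looks like with it (the jar name and converter classes are taken from Elephant-Bird's Pig loaders; double-check them against the version you build):
REGISTER elephant-bird-pig.jar
A = LOAD 'mydir/part-r-00000'
    USING com.twitter.elephantbird.pig.load.SequenceFileLoader(
        '-c com.twitter.elephantbird.pig.util.LongWritableConverter',
        '-c com.twitter.elephantbird.pig.util.TextConverter')
    AS (key: long, value: chararray);
DUMP A;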
If you use custom Writables in your sequence file, then you can implement a custom converter by extending AbstractWritableConverter.
Note that Elephant-Bird needs Thrift to be installed on your machine.
Before building it, make sure that it is using the Thrift version you have, and also provide the correct path to the Thrift executable in its pom.xml:
<plugin>
<groupId>org.apache.thrift.tools</groupId>
<artifactId>maven-thrift-plugin</artifactId>
<version>0.1.10</version>
<configuration>
<thriftExecutable>/path_to_thrift/thrift</thriftExecutable>
</configuration>
</plugin>
