How to export a DataStax graph based on a specific traversal using DseGraphFrame - apache-spark

I would like to export a DSE graph via a Spark job, as described in
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/graph/graphAnalytics/dseGraphFrameExport.html
All of this works fine within the spark-shell, but I want to do it in Java using DseGraphFrame. Unfortunately there is not much on this in the documentation.
I am able to package a jar with the following code and run it with spark-submit:
SparkSession spark = SparkSession
        .builder()
        .appName("Datastax Java example")
        .getOrCreate();

DseGraphFrame dseGraphFrame = DseGraphFrameBuilder.dseGraph(args[0], spark);
DataFrameWriter dataFrameWriter = dseGraphFrame.V().df().write();
dataFrameWriter.csv("vertices");
The above works fine. What I want to do instead is use a specific traversal to filter what I export, that is, something like this:
dseGraphFrame.V().hasLabel("label").df().write();
This does not work, because dseGraphFrame.V().hasLabel("label") does not have a .df() method.
Is this the correct way of doing things?
Any help would be appreciated.

A late answer to this question, perhaps still of use:
In Java, you need to cast the traversal to a DseGraphTraversal first. That can then be converted to a DataFrame with the .df() method and written out just like the unfiltered export above (the output path is only an example):
((DseGraphTraversal) dseGraphFrame.V().hasLabel("label")).df().write().csv("filtered-vertices");

Related

Not able to write data in Hive using sparksql

I am loading data from one Hive table to another using Spark SQL. I've created the SparkSession with enableHiveSupport and I'm able to create a table in Hive using Spark SQL, but when I load data from one Hive table to another Hive table I get a permission issue:
Permission denied: user=anonymous, access=WRITE, path="hivepath"
I am running this as the spark user, but I can't understand why it is using anonymous as the user instead of spark. Can anyone suggest how I should resolve this issue?
I'm using the code below:
sparksession.sql("insert overwrite table dbname.tablename select * from dbname.tablename")
If you're using Spark, you need to set the username in your Spark context:
System.setProperty("HADOOP_USER_NAME", "newUserName")

val spark = SparkSession
  .builder()
  .appName("SparkSessionApp")
  .master("local[*]")
  .getOrCreate()

println(spark.sparkContext.sparkUser)
First of all, you may try this for the anonymous user:
root#host:~# su - hdfs
hdfs#host:~$ hadoop fs -mkdir /user/anonymous
hdfs#host:~$ hadoop fs -chown anonymous /user/anonymous
In general, running
export HADOOP_USER_NAME=youruser
before spark-submit will work, along with a spark-submit configuration like the one below:
--conf "spark.yarn.appMasterEnv.HADOOP_USER_NAME=${HADOOP_USER_NAME}" \
Alternatively, you can try using
sudo -su username spark-submit --class YourClass
Note: ideally, this user name setup should be part of your initial cluster configuration; if that is done, none of the above is needed and it is seamless.
I personally don't prefer hard-coding the user name in the code; it should come from outside the Spark job.
To validate which user you are running as, run the command below:
sc.sparkUser
It will show you the current user. You can then try setting a new user as per the code below; in Scala you can set the username with
System.setProperty("HADOOP_USER_NAME", "newUserName")

Sharing a spark session

Let's say I have a Python file my_python.py in which I have created a SparkSession 'spark'. I have a jar, say my_jar.jar, in which some Spark logic is written. I am not creating a SparkSession in my jar; rather, I want to use the same session created in my_python.py. How do I write a spark-submit command that takes my Python file, my jar, and my SparkSession 'spark' as an argument to my jar file?
Is it possible?
If not, please share an alternative way to do this.
So I feel there are two questions here.
Q1. How can you reuse the already created Spark session in your Scala code?
Ans: Inside your Scala code, use the builder to get the existing session:
SparkSession.builder().getOrCreate()
Please check the Spark doc:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.html
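As a minimal sketch (the object and method names here are hypothetical, not taken from the question), the entry point inside the jar could look like this; getOrCreate() picks up the session that the PySpark driver already started rather than building a new one:
import org.apache.spark.sql.SparkSession

object MyJarLogic {
  def run(tableName: String): Unit = {
    // Reuses the existing SparkSession created by the PySpark driver
    val spark = SparkSession.builder().getOrCreate()
    spark.sql(s"SELECT COUNT(*) FROM $tableName").show()
  }
}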
Q2. How do you do spark-submit with a .py file as the driver and Scala jar(s) as supporting jars?
Ans: It should be something like this:
./spark-submit --jars myjar.jar,otherjar.jar --py-files path/to/myegg.egg path/to/my_python.py arg1 arg2 arg3
Notice the method name, getOrCreate(): it means that if a Spark session has already been created, no new session is created and the existing session is used.
Check this link for a full implementation example:
https://www.crowdstrike.com/blog/spark-hot-potato-passing-dataframes-between-scala-spark-and-pyspark/

PySpark throwing ParseException for syntactically correct Hive query

I have a DDL query that works fine within beeline, but when I try to run the same query within a SparkSession it throws a ParseException.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext

# Initialise Hive metastore
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://localhost:9083")

# Create Spark session
sparkSession = (SparkSession
                .builder
                .appName('test_case')
                .enableHiveSupport()
                .getOrCreate())

sparkSession.sql("CREATE EXTERNAL TABLE B LIKE A")
Pyspark Exception:
pyspark.sql.utils.ParseException: u"\nmismatched input 'LIKE' expecting <EOF>(line 1, pos 53)\n\n== SQL ==\nCREATE EXTERNAL TABLE B LIKE A\n-----------------------------------------------------^^^\n"
How can I make the HiveQL statement work within PySpark?
The problem seems to be that the query is executed as a Spark SQL query and not as a HiveQL query, even though I have enableHiveSupport activated for the SparkSession.
Spark SQL parses queries with its own SQL dialect by default. To enable HiveQL syntax, I believe you need to give it a hint about your intent via a comment. (In fairness, I don't think this is well documented; I've only been able to find a tangential reference to this being a thing here, and only in the Scala version of the example.)
For example, I'm able to get my command to parse by writing:
%sql
-- `USING HIVE`
CREATE TABLE narf LIKE poit
Now, I don't have Hive Support enabled on my session, so my query fails... but it does parse!
Edit: Since your SQL statement is in a Python string, you can use a multi-line string to use the single-line comment syntax, like this:
sparkSession.sql("""
-- `USING HIVE`
CREATE EXTERNAL TABLE B LIKE A
""")
There's also a delimited comment syntax in SQL, e.g.
sparkSession.sql("/* `USING HIVE` */ CREATE EXTERNAL TABLE B LIKE A")
which may work just as well.

How to save a dataframe into HBase?

I have a DataFrame with a schema, and I have also created a table in HBase with Phoenix. What I want is to save this DataFrame to HBase using Spark. I have tried the description in the following link and run the spark-shell with the Phoenix plugin dependencies:
spark-shell --jars ./phoenix-spark-4.8.0-HBase-1.2.jar,./phoenix-4.8.0-HBase-1.2-client.jar,./spark-sql_2.11-2.0.1.jar
However, I get an error even when I run the read function:
val df = sqlContext.load("org.apache.phoenix.spark", Map("table" -> "INPUT_TABLE",
  "zkUrl" -> hbaseConnectionString))
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
I have a feeling that I am on the wrong track, so if there is another way of putting data generated in Spark into HBase, I would appreciate it if you could share it with me.
https://phoenix.apache.org/phoenix_spark.html
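For what it's worth, the save path described on that Phoenix page looks roughly like the sketch below, assuming a phoenix-spark build that matches your Spark version is on the classpath; the table name and hbaseConnectionString are placeholders, and the Phoenix table is expected to exist already:
import org.apache.spark.sql.SaveMode

// Write the DataFrame into an existing Phoenix table ("OUTPUT_TABLE" is a placeholder)
df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "OUTPUT_TABLE")
  .option("zkUrl", hbaseConnectionString)
  .save()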

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any action on the data it always throws the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar to my $HDFS_USER/lib directory on the cluster and even included it using the --jars option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df = ... line, but no data is read yet. Spark only starts reading and processing the data when you ask for some kind of output (such as df.count() or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The spark-avro package marks the avro-mapred package as provided, but for one reason or another it is not available on your system (or classpath).
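As a quick sanity check (not part of the original answer), you can try to load the class directly from the shell; if the jar did not make it onto the driver classpath, this throws a ClassNotFoundException:
scala> Class.forName("org.apache.avro.mapred.AvroWrapper") // succeeds only if avro-mapred is on the driver classpath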
If anyone else runs into this problem, I finally solved it. I removed the CDH Spark package and downloaded Spark from http://spark.apache.org/downloads.html. After that everything worked fine. I'm not sure what the issue was with the CDH version, but I'm not going to waste any more time trying to figure it out.
