Not able to write data in Hive using sparksql - apache-spark

I am loading data from one Hive table to another using Spark SQL. I've created a SparkSession with enableHiveSupport and I'm able to create tables in Hive using Spark SQL, but when I load data from one Hive table to another I get a permission issue:
Permission denied: user=anonymous,access=WRITE, path="hivepath".
I am running this as the spark user, but I can't understand why it is taking anonymous as the user instead of spark. Can anyone suggest how I should resolve this issue?
I'm using the code below.
sparksession.sql("insert overwrite table dbname.tablename select * from dbname.tablename")

If you're using Spark, you need to set the username in your Spark context.
System.setProperty("HADOOP_USER_NAME", "newUserName")
val spark = SparkSession
  .builder()
  .appName("SparkSessionApp")
  .master("local[*]")
  .getOrCreate()
println(spark.sparkContext.sparkUser)

First, you may try this for the anonymous user:
root#host:~# su - hdfs
hdfs#host:~$ hadoop fs -mkdir /user/anonymous
hdfs#host:~$ hadoop fs -chown anonymous /user/anonymous
In general, exporting the user before spark-submit will work:
export HADOOP_USER_NAME=youruser
along with a spark-submit configuration like the one below:
--conf "spark.yarn.appMasterEnv.HADOOP_USER_NAME=${HADOOP_USER_NAME}" \
Alternatively, you can try using
sudo -su username spark-submit --class yourClass
Note: Ideally, this username setting should be part of your initial cluster setup; if that is done, none of the above is needed and it is seamless.
I personally don't prefer hard-coding the username in the code; it should come from outside the Spark job.

To validate which user you are running as, run the command below:
sc.sparkUser
It will show you the current user. You can then try setting a new user, and in Scala you can set the username with
System.setProperty("HADOOP_USER_NAME", "newUserName")
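Putting the pieces together, here is a minimal sketch of the whole flow; the user name, app name, and table names are placeholders for your own values:
import org.apache.spark.sql.SparkSession

// Set the Hadoop user before the SparkSession (and its Hadoop client) is created;
// "spark" is a placeholder for the account that has WRITE access to the Hive warehouse path.
System.setProperty("HADOOP_USER_NAME", "spark")

val spark = SparkSession
  .builder()
  .appName("HiveToHiveLoad")
  .enableHiveSupport()
  .getOrCreate()

println(spark.sparkContext.sparkUser)  // should no longer report "anonymous"

// Placeholder source and target tables for the Hive-to-Hive load from the question.
spark.sql("insert overwrite table dbname.target_table select * from dbname.source_table")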

Related

Querying snowflake metadata using spark connector

I want to run a 'SHOW TABLES' statement through the spark-snowflake connector. I am running Spark on the Databricks platform and getting an "Object 'SHOW' does not exist or not authorized" error.
df = spark.read \
    .format("snowflake") \
    .options(**options) \
    .option("query", "show tables") \
    .load()
df.show()
A sample query like "SELECT 1" works as expected.
I know that I could install the native Python Snowflake driver, but I want to avoid that solution if possible because I have already opened the session using Spark.
There is also a way using the "Utils.runQuery" function, but I understand it is relevant only for DDL statements (it doesn't return the actual results).
Thanks!
When using DataFrames, the Snowflake connector supports SELECT queries only.
This is documented in our docs.
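If all that is needed is the table list, one possible workaround (a sketch of my own, not part of the original answer, shown here in Scala with placeholder connection options) is to read the same metadata with an ordinary SELECT on Snowflake's INFORMATION_SCHEMA, which the DataFrame path does support:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Placeholder connection options; reuse the same values as in the question's `options`.
val connOptions: Map[String, String] = Map(
  "sfURL" -> "<account>.snowflakecomputing.com",
  "sfUser" -> "<user>",
  "sfPassword" -> "<password>",
  "sfDatabase" -> "<database>",
  "sfSchema" -> "<schema>",
  "sfWarehouse" -> "<warehouse>"
)

// SHOW TABLES fails because only SELECT queries are supported, but the same
// metadata is reachable through INFORMATION_SCHEMA with a plain SELECT.
val tables = spark.read
  .format("snowflake")  // Databricks short name; elsewhere use "net.snowflake.spark.snowflake"
  .options(connOptions)
  .option("query", "select table_name, table_type from information_schema.tables")
  .load()

tables.show()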

How to export a Datastax graph based on a specific traversal using DseGraphFrame

I would like to export a DSE graph via a Spark job, as per
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/graph/graphAnalytics/dseGraphFrameExport.html
All this works fine within the spark-shell.
I want to be doing this in Java using DseGraphFrame.
Unfortunately there is not much in the documentation.
I am able to pack a jar with the following code and do a spark-submit:
SparkSession spark = SparkSession
    .builder()
    .appName("Datastax Java example")
    .getOrCreate();
DseGraphFrame dseGraphFrame = DseGraphFrameBuilder.dseGraph(args[0], spark);
DataFrameWriter dataFrameWriter = dseGraphFrame.V().df().write();
dataFrameWriter.csv("vertices");
The above works fine.
What I want to do is use a specific traversal to filter what I export, that is, use something like this:
dseGraphFrame.V().hasLabel("label").df().write();
The above does not work because dseGraphFrame.V().hasLabel("label") does not have .df().
Is this the correct way of doing things?
Any help would be appreciated.
A late answer to this question, perhaps still of use:
In Java, you need to cast this to a DseGraphTraversal first. This can then be converted to a DataFrame with the .df() method:
((DseGraphTraversal)dseGraphFrame.V().hasLabel("label")).df().write();

Sharing a spark session

Let's say I have a Python file my_python.py in which I have created a SparkSession 'spark'. I have a jar, say my_jar.jar, in which some Spark logic is written. I am not creating a SparkSession in my jar; rather, I want to use the same session created in my_python.py. How do I write a spark-submit command which takes my Python file, my jar, and my SparkSession 'spark' as an argument to my jar file?
Is it possible?
If not, please share an alternative way to do so.
So I feel there are two questions:
Q1. How can you reuse an already created Spark session in your Scala file?
Ans: Inside your Scala code, you should use the builder to get the existing session:
SparkSession.builder().getOrCreate()
Please check the Spark doc:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.html
Q2: How do you do spark-submit with a .py file as the driver and Scala jar(s) as supporting jars?
Ans: It should be something like this:
./spark-submit --jars myjar.jar,otherjar.jar --py-files path/to/myegg.egg path/to/my_python.py arg1 arg2 arg3
If you notice the method name, it is getOrCreate() - that means that if a Spark session has already been created, no new session will be created; the existing session will be reused.
Check this link for a full implementation example:
https://www.crowdstrike.com/blog/spark-hot-potato-passing-dataframes-between-scala-spark-and-pyspark/
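For illustration, here is a rough sketch of what the Scala side of such a jar could look like; the package, object, and method names are made up for the example:
package com.example

import org.apache.spark.sql.SparkSession

object MyScalaLogic {
  // getOrCreate() picks up the session already started by the Python driver
  // instead of building a new one, because the jar's code runs in the same JVM
  // that backs the PySpark session.
  def rowCount(tableName: String): Long = {
    val spark = SparkSession.builder().getOrCreate()
    spark.table(tableName).count()
  }
}
From my_python.py you could then reach it through the JVM gateway, along the lines of spark.sparkContext._jvm.com.example.MyScalaLogic.rowCount("my_table"); the CrowdStrike post linked above walks through this pattern in more detail.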

How to start sparksession in pyspark

I want to change the default memory, executor, and core settings of a Spark session.
The first code cell in my PySpark notebook, in Jupyter on an HDInsight cluster, looks like this:
from pyspark.sql import SparkSession
spark = SparkSession\
    .builder\
    .appName("Juanita_Smith")\
    .config("spark.executor.instances", "2")\
    .config("spark.executor.cores", "2")\
    .config("spark.executor.memory", "2g")\
    .config("spark.driver.memory", "2g")\
    .getOrCreate()
On completion, I read the parameters back, and it looks like the statement worked.
However, if I look in YARN, the settings have in fact not taken effect.
Which settings or commands do I need to use to make the session configuration take effect?
Thank you for your help in advance.
By the time your notebook kernel has started, the SparkSession is already created with parameters defined in a kernel configuration file. To change this, you will need to update or replace the kernel configuration file, which I believe is usually somewhere like <jupyter home>/kernels/<kernel name>/kernel.json.
Update
If you have access to the machine hosting your Jupyter server, you can find the location of the current kernel configurations using jupyter kernelspec list. You can then either edit one of the pyspark kernel configurations, or copy it to a new file and edit that. For your purposes, you will need to add the following arguments to the PYSPARK_SUBMIT_ARGS:
"PYSPARK_SUBMIT_ARGS": "--conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=2g --conf spark.driver.memory=2g"

How to save a dataframe into HBase?

I have a DataFrame with a schema, and I have also created a table in HBase with Phoenix. What I want is to save this DataFrame to HBase using Spark. I have tried the instructions in the following link and run spark-shell with the Phoenix plugin dependencies.
spark-shell --jars ./phoenix-spark-4.8.0-HBase-1.2.jar,./phoenix-4.8.0-HBase-1.2-client.jar,./spark-sql_2.11-2.0.1.jar
However, I get an error even when I run the read function:
val df = sqlContext.load("org.apache.phoenix.spark", Map("table" -> "INPUT_TABLE",
  "zkUrl" -> hbaseConnectionString))
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
I have a feeling that I am on the wrong track, so if there is another way of putting data generated in Spark into HBase, I would appreciate it if you shared it with me.
https://phoenix.apache.org/phoenix_spark.html
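For reference, the write path documented on the Phoenix page linked above looks roughly like the sketch below; the table name, zkUrl, and sample DataFrame are placeholders, and the phoenix-spark jars need to match the Spark version in use (a NoClassDefFoundError for org.apache.spark.sql.DataFrame typically points at jars built for Spark 1.x running against Spark 2.x):
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().getOrCreate()

// Hypothetical DataFrame; in the question this is the df that already has a schema.
val df = spark.createDataFrame(Seq((1, "one"), (2, "two"))).toDF("ID", "COL1")

// Save through the Phoenix Spark plugin; "OUTPUT_TABLE" must already exist in Phoenix
// and the zkUrl placeholder must point at the HBase ZooKeeper quorum.
df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "OUTPUT_TABLE")
  .option("zkUrl", "zkhost:2181")
  .save()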

Resources