How to save a dataframe into HBase? - apache-spark

I have a df with a schema, also create a table in HBase with phoenix. What i want is to save this df to HBase using spark. I have tried the descriptions in the following link and run the spark-shell with phoenix plugin dependencies.
spark-shell --jars ./phoenix-spark-4.8.0-HBase-1.2.jar,./phoenix-4.8.0-HBase-1.2-client.jar,./spark-sql_2.11-2.0.1.jar
However, i got an error saying even when i run the read function ;
val df = sqlContext.load("org.apache.phoenix.spark", Map("table" -> "INPUT_TABLE",
| "zkUrl" -> hbaseConnectionString))
java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
I have a feeling that i am on the wrong track. So if there is another way of putting data generated on spark into HBase, i will appreciate if you share it with me.
https://phoenix.apache.org/phoenix_spark.html

Related

How to resolve invalid column name on parquet file read itself in PySpark

I setup a standalone spark and a standalone HDFS.
I installed pyspark and was able to create spark session.
I uploaded one parquet file to HDFS under /data : hdfs://localhost:9000/data
I tried to create a dataframe out of this directory using PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName("test").getOrCreate()
df = spark.read.parquet("hdfs://localhost:9000/data").withColumnRenamed("Wafer ID", "Wafer_ID")
I am getting invalid column name even with withColumnRenamed.
I tried with the following code but I got same error for this as well
df = spark.read.parquet("hdfs://localhost:9000/data").select(col("Wafer ID").alias("Wafer_ID"))
I have means to change the column names manually (pandas) or use different file entirely but I want to know if there is a way to solve this problem.
What am I doing wrong?

Not able to write data in Hive using sparksql

I am loading Data from one Hive table to another using spark Sql. I've created sparksession with enableHiveSupport and I'm able to create table in hive using sparksql, but when I'm loading data from one hive table to another hive table using sparksql I'm getting permission issue:
Permission denied: user=anonymous,access=WRITE, path="hivepath".
I am running this using spark user but not able to understand why its taking anonymous as user instead of spark. Can anyone suggest how should I resolve this issue?
I'm using below code.
sparksession.sql("insert overwrite into table dbname.tablename" select * from dbname.tablename").
If you're using spark, you need to set username in your spark context.
System.setProperty("HADOOP_USER_NAME","newUserName")
val spark = SparkSession
.builder()
.appName("SparkSessionApp")
.master("local[*]")
.getOrCreate()
println(spark.sparkContext.sparkUser)
First thing is you may try this for ananymous user
root#host:~# su - hdfs
hdfs#host:~$ hadoop fs -mkdir /user/anonymous
hdfs#host:~$ hadoop fs -chown anonymous /user/anonymous
In general
export HADOOP_USER_NAME=youruser before spark-submit will work.
along with spark-submit configuration like below.
--conf "spark.yarn.appMasterEnv.HADOOP_USER_NAME=${HADDOP_USER_NAME}" \
alternatively you can try using
sudo -su username spark-submit --class your class
see this
Note : This user name setting should be part of your initial
cluster setup ideally if its done then no need to do all these above
and its seemless.
I personally dont prefer user name hard coding in the code it should be from outside the spark job.
To validate with which user you are running,
run below command: -
sc.sparkUser
It will show you the current user and then
you can try setting new user as per the below code
And in scala, you can set the username by
System.setProperty("HADOOP_USER_NAME","newUserName")

How to export a Datastax graph based on a specific traversal using DseGraphFrame

I would like to export a DSE graph via a spark job , as per
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/graph/graphAnalytics/dseGraphFrameExport.html
All this works fine within the spark-shell ,
I want to be doing this in Java using DseGraphFrame .
Unfortunately there is not much in the documentation
I am able to pack a jar with the following code and do a
spark-submit
SparkSession spark = SparkSession
.builder()
.appName("Datastax Java example")
.getOrCreate();
DseGraphFrame dseGraphFrame = DseGraphFrameBuilder.dseGraph(args[0], spark);
DataFrameWriter dataFrameWriter = dseGraphFrame.V().df().write();
dataFrameWriter.csv("vertices");
The above works fine ,
what I want to be doing is use a specific traversal to filter what I export.
That is use something like that
dseGraphFrame.V().hasLabel("label").df().write();
The above does not work as dseGraphFrame.V().hasLabel("label") does not have .df()
Is this the correct way of doing things
Any help would be appreciated
A late answer to this question, perhaps still of use:
In Java, you need to cast this to a DseGraphTraversal first. This can then be converted to a DataFrame with the .df() method:
((DseGraphTraversal)dseGraphFrame.V().hasLabel("label")).df().write();

Pyspark : GET row from hbase using row-key

I have a use-case to read from HBase inside a pyspark job and is currently doing a scan on the HBase table like this,
conf = {"hbase.zookeeper.quorum": host, "hbase.cluster.distributed": "true", "hbase.mapreduce.inputtable": "table_name", "hbase.mapreduce.scan.row.start": start, "hbase.mapreduce.scan.row.stop": stop}
rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat", "org.apache.hadoop.hbase.io.ImmutableBytesWritable","org.apache.hadoop.hbase.client.Result", keyConverter=keyConv, valueConverter=valueConv,conf=cmdata_conf)
I am unable to find the conf to do a GET on the HBase table. Can someone help me? I could find that filters are not supported with pyspark. But is it not possible to do a simple GET?
Thanks!

NoClassDefFoundError when using avro in spark-shell

I keep getting
java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
when calling show() on a DataFrame object. I'm attempting to do this through the shell (spark-shell --master yarn). I can see that the shell recognizes the schema when creating the DataFrame object, but if I execute any actions on the data it will always throw the NoClassDefFoundError when trying to instantiate the AvroWrapper. I've tried adding avro-mapred-1.8.0.jar in my $HDFS_USER/lib directory on the cluster and even included it using the --jar option when launching the shell. Neither of these options worked. Any advice would be greatly appreciated. Below is example code:
scala> import org.apache.spark.sql._
scala> import com.databricks.spark.avro._
scala> val sqc = new SQLContext(sc)
scala> val df = sqc.read.avro("my_avro_file") // recognizes the schema and creates the DataFrame object
scala> df.show // this is where I get NoClassDefFoundError
The DataFrame object itself is created at the val df =... line, but data is not read yet. Spark only starts reading and processing the data, when you ask for some kind of output (like a df.count(), or df.show()).
So the original issue is that the avro-mapred package is missing.
Try launching your Spark Shell like this:
spark-shell --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
The Spark Avro package marks the Avro Mapred package as provided, but it is not available on your system (or classpath) for one or other reason.
If anyone else runs into this problem, I finally solved it. I removed the CDH spark package and downloaded it from http://spark.apache.org/downloads.html. After that everything worked fine. Not sure what the issues was with the CDH version, but I'm not going to waste anymore time trying to figure it out.

Resources