Unable to access HBase via API - apache-spark

I have HBase installed on three nodes. I am trying to load data into HBase via Spark using the code below.
from __future__ import print_function
import sys
from pyspark import SparkContext
import json

if __name__ == "__main__":
    print("*******************************")
    sc = SparkContext(appName="HBaseOutputFormat")
    host = sys.argv[1]
    table = "hbase_test"
    port = "2181"
    conf = {"hbase.zookeeper.quorum": host,
            "hbase.mapred.outputtable": table,
            "hbase.zookeeper.property.clientPort": port,
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

    rdd = sc.parallelize([sys.argv[2:]]).map(lambda x: (x[0], x))
    print(rdd.collect())
    rdd.saveAsNewAPIHadoopDataset(
        conf=conf,
        keyConverter=keyConv,
        valueConverter=valueConv)
    sc.stop()
I am executing the code as:
spark-submit --driver-class-path /usr/iop/4.3.0.0-0000/hbase/lib/hbase-server.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-common.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-client.jar:/usr/iop/4.3.0.0-0000/hbase/lib/zookeeper.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-protocol.jar:/usr/iop/4.3.0.0-0000/spark2/examples/jars/scopt_2.11-3.3.0.jar:/home/tanveer/spark-examples_2.10-1.1.0.jar --conf spark.ui.port=5054 --master local[2] /data/usr/tanveer/from_home/spark/hbase_outputformat.py HBASE_MASTER_ip row1 f1 q1 value1
But the job gets stuck and doesn't proceed.
As suggested in some previous threads, I tried commenting out the localhost line in /etc/hosts, but it didn't work.
Requesting your help.

On further debugging, I referred to the Hortonworks article below on client application best practices:
https://community.hortonworks.com/articles/4091/hbase-client-application-best-practices.html
I added the HBase configuration directory to the driver class path, re-ran the code, and it worked fine.
The modified spark-submit command is:
spark-submit --driver-class-path /usr/iop/4.3.0.0-0000/hbase/lib/hbase-server.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-common.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-client.jar:/usr/iop/4.3.0.0-0000/hbase/lib/zookeeper.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-protocol.jar:/usr/iop/4.3.0.0-0000/spark2/examples/jars/scopt_2.11-3.3.0.jar:/home/tanveer/spark-examples_2.10-1.1.0.jar:/etc/hbase/conf --conf spark.ui.port=5054 --master local[2] /data/usr/tanveer/from_home/spark/hbase_outputformat.py host row1 f1 q1 value1
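Once /etc/hbase/conf is on the class path, a quick way to double-check the setup is to confirm that the ZooKeeper quorum and client port passed in the job's conf dict match what hbase-site.xml actually contains. A minimal sketch, assuming the /etc/hbase/conf path used above:

# Sketch: compare the job's ZooKeeper settings with the HBase client config.
import xml.etree.ElementTree as ET

root = ET.parse("/etc/hbase/conf/hbase-site.xml").getroot()
props = {p.findtext("name"): p.findtext("value") for p in root.findall("property")}
print(props.get("hbase.zookeeper.quorum"))
print(props.get("hbase.zookeeper.property.clientPort"))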

Related

Load/import CSV file into MongoDB using PySpark

I want to know how to load/import a CSV file into MongoDB using PySpark. I have a CSV file named cal.csv placed on the desktop. Can somebody share a code snippet?
First, read the CSV as a PySpark DataFrame:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
sc = SparkContext(conf=conf)
sql = SQLContext(sc)
df = sql.read.csv("cal.csv", header=True, mode="DROPMALFORMED")
Then write it to MongoDB:
df.write.format('com.mongodb.spark.sql.DefaultSource').mode('append')\
.option('database',NAME).option('collection',COLLECTION_MONGODB).save()
Specify NAME and COLLECTION_MONGODB as the database and collection you created.
Also, you need to pass the conf and packages along with spark-submit, according to your version:
/bin/spark-submit --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/DATABASE.COLLECTION_NAME?readPreference=primaryPreferred"
--conf "spark.mongodb.output.uri=mongodb://127.0.0.1/DATABASE.COLLECTION_NAME"
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
tester.py
Specify DATABASE and COLLECTION_NAME above; tester.py is assumed to be the name of the code file. For more information, refer to the MongoDB Spark connector documentation.
This worked for me (database: people, collection: con):
pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/people.con?readPreference=primaryPreferred" \
--conf "spark.mongodb.output.uri=mongodb://127.0.0.1/people.con" \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.0
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/people.con") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/people.con") \
    .getOrCreate()

df = spark.read.csv(path="file:///home/user/Desktop/people.csv", header=True, inferSchema=True)
df.printSchema()

df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").option("database", "people").option("collection", "con").save()
Next, go to the mongo shell and check whether the collection was written by following the steps below:
mongo
show dbs
use people
show collections
db.con.find().pretty()
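Alternatively, as a quick check from Python rather than the mongo shell, here is a minimal sketch using pymongo (assuming pymongo is installed and mongod is running on localhost); the database/collection names match the example above:

# Hypothetical verification with pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://127.0.0.1")
collection = client["people"]["con"]
print(collection.count_documents({}))   # number of documents written by the Spark job
print(collection.find_one())            # inspect one document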

Save CSV file to HBase table using Spark and Phoenix

Can someone point me to a working example of saving a CSV file to an HBase table using Spark 2.2?
Options that I tried and that failed (note: all of them work with Spark 1.6 for me):
phoenix-spark
hbase-spark
it.nerdammer.bigdata : spark-hbase-connector_2.10
All of them, after fixing everything, finally give an error similar to the one in this Spark HBase question.
Thanks
Add the parameters below to your Spark job:
spark-submit \
--conf "spark.yarn.stagingDir=/somelocation" \
--conf "spark.hadoop.mapreduce.output.fileoutputformat.outputdir=/s‌​omelocation" \
--conf "spark.hadoop.mapred.output.dir=/somelocation"
Phoenix has a Spark plugin and a JDBC thin client which can connect to (read/write) HBase; examples are at https://phoenix.apache.org/phoenix_spark.html
Option 1: connect via the ZooKeeper URL - phoenix plugin
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._

val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)

val df = sqlContext.load(
  "org.apache.phoenix.spark",
  Map("table" -> "TABLE1", "zkUrl" -> "phoenix-server:2181")
)

df
  .filter(df("COL1") === "test_row_1" && df("ID") === 1L)
  .select(df("ID"))
  .show
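The same read is also available from PySpark through the DataFrame reader. A minimal sketch, assuming an existing SparkContext `sc`, the phoenix-spark JAR on the Spark class path, and the same table/zkUrl as in the Scala example above:

# Sketch: read the Phoenix table from PySpark via the phoenix-spark data source.
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "TABLE1") \
    .option("zkUrl", "phoenix-server:2181") \
    .load()
df.filter((df["COL1"] == "test_row_1") & (df["ID"] == 1)).select("ID").show()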
Option 2: use the JDBC thin client provided by the Phoenix Query Server
More info at https://phoenix.apache.org/server.html
jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF
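For example, that thin-client URL can be used through Spark's generic JDBC reader. This is only a sketch, assuming an existing SparkSession `spark`, the Phoenix thin-client JAR on the class path, and that the driver class and table name below apply to your setup:

# Sketch: read via the Phoenix Query Server thin client using Spark's JDBC source.
# Driver class and table name are assumptions for illustration.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF")
      .option("driver", "org.apache.phoenix.queryserver.client.Driver")
      .option("dbtable", "TABLE1")
      .load())
df.show()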

Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster

I'm attempting to run a pyspark script on BigInsights on Cloud 4.2 Enterprise that accesses a Hive table.
First I create the hive table:
[biadmin@bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>
Then I create a simple pyspark script:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import HiveContext
hc = HiveContext(sc)
pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )
I attempt to execute with:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
test_pokes.py
However, I encounter the error:
Traceback (most recent call last):
File "test_pokes.py", line 8, in <module>
pokesRdd = hc.sql('select * from pokes')
File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
pyspark.sql.utils.AnalysisException: u'Table not found: pokes; line 1 pos 14'
End of LogType:stdout
If I run spark-submit standalone, I can see the table exists ok:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit test_pokes.py
…
…
16/12/21 13:09:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 18962 bytes result sent to driver
16/12/21 13:09:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 168 ms on localhost (1/1)
16/12/21 13:09:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/21 13:09:13 INFO DAGScheduler: ResultStage 0 (collect at /home/biadmin/test_pokes.py:9) finished in 0.179 s
16/12/21 13:09:13 INFO DAGScheduler: Job 0 finished: collect at /home/biadmin/test_pokes.py:9, took 0.236558 s
[Row(foo=238, bar=u'val_238'), Row(foo=86, bar=u'val_86'), Row(foo=311, bar=u'val_311')
…
…
See my previous question related to this issue: hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"
This question is similar to this other question: Spark can access Hive table from pyspark but not from spark-submit. However, unlike that question I am using HiveContext.
Update: see here for the final solution https://stackoverflow.com/a/41272260/1033422
This is because the spark-submit job is unable to find the hive-site.xml, so it cannot connect to the Hive metastore. Please add --files /usr/iop/4.2.0.0/hive/conf/hive-site.xml to your spark-submit command.
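If in doubt about which metastore the YARN-side driver is actually talking to, a quick sanity check is to list what its Hive client can see. This is a sketch; without hive-site.xml shipped, it typically shows only an empty default database backed by a local Derby metastore rather than your real tables:

# Sketch: list databases/tables visible from inside the YARN driver.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hc = HiveContext(sc)
hc.sql("show databases").show()
hc.sql("show tables").show()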
It looks like you are affected by this bug: https://issues.apache.org/jira/browse/SPARK-15345.
I had a similar issue with Spark 1.6.2 and 2.0.0 on HDP-2.5.0.0:
My goal was to create a dataframe from a Hive SQL query, under these conditions:
Python API,
cluster deploy mode (driver program running on one of the executor nodes),
YARN used to manage the executor JVMs (instead of a standalone Spark master instance).
The initial tests gave these results:
1. spark-submit --deploy-mode client --master local ... => WORKING
2. spark-submit --deploy-mode client --master yarn ... => WORKING
3. spark-submit --deploy-mode cluster --master yarn ... => NOT WORKING
In case #3, the driver running on one of the executor nodes could not find the database. The error was:
pyspark.sql.utils.AnalysisException: 'Table or view not found: `database_name`.`table_name`; line 1 pos 14'
Fokko Driesprong's answer listed above worked for me.
With the command listed below, the driver running on the executor node was able to access a Hive table in a database which is not default:
$ /usr/hdp/current/spark2-client/bin/spark-submit \
--deploy-mode cluster --master yarn \
--files /usr/hdp/current/spark2-client/conf/hive-site.xml \
/path/to/python/code.py
The python code I have used to test with Spark 1.6.2 and Spark 2.0.0 is:
(Change SPARK_VERSION to 1 to test with Spark 1.6.2. Make sure to update the paths in the spark-submit command accordingly)
SPARK_VERSION = 2
APP_NAME = 'spark-sql-python-test_SV,' + str(SPARK_VERSION)


def spark1():
    from pyspark.sql import HiveContext
    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName(APP_NAME)
    sc = SparkContext(conf=conf)
    hc = HiveContext(sc)

    query = 'select * from database_name.table_name limit 5'
    df = hc.sql(query)
    printout(df)


def spark2():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(APP_NAME).enableHiveSupport().getOrCreate()
    query = 'select * from database_name.table_name limit 5'
    df = spark.sql(query)
    printout(df)


def printout(df):
    print('\n########################################################################')
    df.show()
    print(df.count())
    df_list = df.collect()
    print(df_list)
    print(df_list[0])
    print(df_list[1])
    print('########################################################################\n')


def main():
    if SPARK_VERSION == 1:
        spark1()
    elif SPARK_VERSION == 2:
        spark2()


if __name__ == '__main__':
    main()
For me the accepted answer did not work.
(--files /usr/iop/4.2.0.0/hive/conf/hive-site.xml)
Adding the code below at the top of the code file solved it:
import findspark
findspark.init('/usr/share/spark-2.4') # for 2.4

Getting parameters of Spark submit while running a Spark job

I am running a Spark job via spark-submit and using its --files parameter to ship a log4j.properties file.
In my Spark job I need to get this parameter:
object LoggerSparkUsage {

  def main(args: Array[String]): Unit = {
    //DriverHolder.log.info("unspark")
    println("args are....." + args.mkString(" "))
    val conf = new SparkConf().setAppName("Simple_Application") //.setMaster("local[4]")
    val sc = new SparkContext(conf)
    // conf.getExecutorEnv.
    val count = sc.parallelize(Array(1, 2, 3)).count()
    println("these are files" + conf.get("files"))
    LoggerDriver.log.info("log1 for info..")
    LoggerDriver.log.info("log2 for infor..")
    f2
  }

  def f2 { LoggerDriver.log.info("logs from another function..") }
}
My spark-submit is something like this:
/opt/mapr/spark/spark-1.6.1/bin/spark-submit --class "LoggerSparkUsage" --master yarn-client --files src/main/resources/log4j.properties /mapr/cellos-mapr/user/mbazarganigilani/SprkHbase/target/scala-2.10/sprkhbase_2.10-1.0.2.jar
I tried to get the properties using
conf.get("files")
but it gives me an exception.
Can anyone give me a solution for this?
The correct key for the files is spark.files:
scala.util.Try(sc.getConf.get("spark.files"))
but to get the actual path on the workers you have to use SparkFiles:
org.apache.spark.SparkFiles.get(fileName)
If that is not sufficient, you can pass these as application arguments and retrieve them from the main args, or use a custom key in spark.conf.
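For reference, the same lookups in PySpark, as a minimal sketch (assuming the job was submitted with --files log4j.properties; the appName here is arbitrary):

# Sketch: read back the shipped-files list and resolve the local path of one of them.
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="files-demo")
print(sc.getConf().get("spark.files", "not set"))   # comma-separated list, as submitted
print(SparkFiles.get("log4j.properties"))           # local path where Spark placed the file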

spark-submit is not exiting until I hit ctrl+C

I am running this spark-submit command to run a Spark Scala program successfully on the Hortonworks VM. But once the job is completed, it does not exit from the spark-submit command until I hit Ctrl+C. Why?
spark-submit --class SimpleApp --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 12m --executor-cores 1 target/scala-2.10/application_2.10-1.0.jar /user/root/decks/largedeck.txt
Here is the code I am running:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val cards = sc.textFile(args(0)).flatMap(_.split(" "))
    val cardCount = cards.count()
    println(cardCount)
  }
}
You have to call stop() on the context to exit your program cleanly.
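For illustration only, the same pattern sketched in PySpark (the original program is Scala, but the idea is identical: finish the work, then stop the context before main returns):

# Sketch: create the context, do the work, then stop it so spark-submit can exit.
from pyspark import SparkContext

sc = SparkContext(appName="Simple Application")
card_count = sc.textFile("/user/root/decks/largedeck.txt") \
               .flatMap(lambda line: line.split(" ")) \
               .count()
print(card_count)
sc.stop()  # without this, the driver can keep spark-submit from exiting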
I had the same kind of problem when writing files to S3. I use the Spark 2.0 version; if it still doesn't exit even after adding stop(), try the settings below.
In Spark 2.0 you can use:
val spark = SparkSession.builder().master("local[*]").appName("App_name").getOrCreate()
spark.conf.set("spark.hadoop.mapred.output.committer.class","com.appsflyer.spark.DirectOutputCommitter")
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
