What is the best possible way of interacting with HBase using PySpark - apache-spark

I am using PySpark [Spark 2.3.1] with HBase 1.2.1, and I am wondering what the best way of accessing HBase from PySpark is.
I did some initial searching and found that there are a few options, such as using shc-core:1.1.1-2.1-s_2.11.jar, but wherever I look for examples, the code is almost always written in Scala or the examples are Scala-based. I tried implementing some basic code in PySpark:
from pyspark import SparkContext
from pyspark.sql import SQLContext

def main():
    sc = SparkContext()
    sqlc = SQLContext(sc)
    data_source_format = 'org.apache.spark.sql.execution.datasources.hbase'
    catalog = ''.join("""{
        "table":{"namespace":"default", "name":"firsttable"},
        "rowkey":"key",
        "columns":{
            "firstcol":{"cf":"rowkey", "col":"key", "type":"string"},
            "secondcol":{"cf":"d", "col":"colname", "type":"string"}
        }
    }""".split())
    df = sqlc.read.options(catalog=catalog).format(data_source_format).load()
    df.select("secondcol").show()

# entry point for PySpark application
if __name__ == '__main__':
    main()
and running it using:
spark-submit --master yarn-client --files /opt/hbase-1.1.2/conf/hbase-site.xml --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --jars /home/ubuntu/hbase-spark-2.0.0-alpha4.jar HbaseMain2.py
It returns empty output:
+---------+
|secondcol|
+---------+
+---------+
I am not sure what I am doing wrong. I am also not sure what the best approach to doing this would be.
Any references would be appreciated.
Regards

Finally, using SHC, I am able to connect to HBase 1.2.1 with Spark 2.3.1 from PySpark code. Here is what I did:
All my Hadoop [NameNode, DataNode, NodeManager, ResourceManager] and HBase [HMaster, HRegionServer, HQuorumPeer] daemons were up and running on my EC2 instance.
I placed the emp.csv file at the HDFS location /test/emp.csv, with the following data:
key,empId,empName,empWeight
1,"E007","Bhupesh",115.10
2,"E008","Chauhan",110.23
3,"E009","Prithvi",90.0
4,"E0010","Raj",80.0
5,"E0011","Chauhan",100.0
I created a readwriteHBase.py file with the following code [it reads the emp.csv file from HDFS, first creates tblEmployee in HBase, pushes the data into it, then reads some data back from the same table and displays it on the console]:
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.master("yarn-client").appName("HelloSpark").getOrCreate()
    dataSourceFormat = "org.apache.spark.sql.execution.datasources.hbase"
    writeCatalog = ''.join("""{
        "table":{"namespace":"default", "name":"tblEmployee", "tableCoder":"PrimitiveType"},
        "rowkey":"key",
        "columns":{
            "key":{"cf":"rowkey", "col":"key", "type":"int"},
            "empId":{"cf":"personal","col":"empId","type":"string"},
            "empName":{"cf":"personal", "col":"empName", "type":"string"},
            "empWeight":{"cf":"personal", "col":"empWeight", "type":"double"}
        }
    }""".split())
    writeDF = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/test/emp.csv")
    print("csv file read", writeDF.show())
    writeDF.write.options(catalog=writeCatalog, newtable=5).format(dataSourceFormat).save()
    print("csv file written to HBase")
    readCatalog = ''.join("""{
        "table":{"namespace":"default", "name":"tblEmployee"},
        "rowkey":"key",
        "columns":{
            "key":{"cf":"rowkey", "col":"key", "type":"int"},
            "empId":{"cf":"personal","col":"empId","type":"string"},
            "empName":{"cf":"personal", "col":"empName", "type":"string"}
        }
    }""".split())
    print("going to read data from Hbase table")
    readDF = spark.read.options(catalog=readCatalog).format(dataSourceFormat).load()
    print("data read from HBase table")
    readDF.select("empId", "empName").show()
    readDF.show()

# entry point for PySpark application
if __name__ == '__main__':
    main()
I ran this script on the VM console using the command:
spark-submit --master yarn-client --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 --repositories http://nexus-private.hortonworks.com/nexus/content/repositories/IN-QA/ readwriteHBase.py
Intermediate result, after reading the CSV file:
+---+-----+-------+---------+
|key|empId|empName|empWeight|
+---+-----+-------+---------+
| 1| E007|Bhupesh| 115.1|
| 2| E008|Chauhan| 110.23|
| 3| E009|Prithvi| 90.0|
| 4|E0010| Raj| 80.0|
| 5|E0011|Chauhan| 100.0|
+---+-----+-------+---------+
Final output, after reading data from the HBase table:
+-----+-------+
|empId|empName|
+-----+-------+
| E007|Bhupesh|
| E008|Chauhan|
| E009|Prithvi|
|E0010| Raj|
|E0011|Chauhan|
+-----+-------+
Note: While creating the HBase table and inserting data into it, SHC expects the number of regions to be greater than 3, hence I added options(catalog=writeCatalog, newtable=5) while writing the data to HBase.

Related

Last Access Time Update in Hive metastore

I am using the following property in my Hive console / .hiverc file, so that whenever I query a table, it updates the LAST_ACCESS_TIME column in the TBLS table of the Hive metastore.
set hive.exec.pre.hooks = org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;
However, if I use spark-sql or spark-shell, it does not seem to work, and LAST_ACCESS_TIME does not get updated in the Hive metastore.
Here's how I am reading the table :
>>> df = spark.sql("select * from db.sometable")
>>> df.show()
I have set up the above hook in hive-site.xml in both /etc/hive/conf and /etc/spark/conf.
Your code may be skipping some of the Hive integrations. My recollection is that to get more of the Hive-ish integration you need to bring in the HiveContext, something like this:
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Data Frame Join")
    sc = SparkContext(conf=conf)
    sqlContext = HiveContext(sc)
    df_07 = sqlContext.sql("SELECT * from sample_07")
https://docs.cloudera.com/runtime/7.2.7/developing-spark-applications/topics/spark-sql-example.html
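If you are on Spark 2.x, where HiveContext is deprecated, the analogous way to pull in the Hive integration is to enable Hive support on the SparkSession. A minimal sketch, reusing the table name from the question (whether this alone is enough to trigger the pre-hook is something you would still need to verify):

from pyspark.sql import SparkSession

# Enable Hive support so SQL statements go through the Hive metastore
spark = (SparkSession.builder
         .appName("Data Frame Join")
         .enableHiveSupport()
         .getOrCreate())

df = spark.sql("SELECT * FROM db.sometable")
df.show()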
Hope this helps

Visualization with Zeppelin using Cassandra and Spark

I'm new to Zeppelin, but it looks interesting.
I'd like to do some visualization of Cassandra data read with Spark, inside Zeppelin, but I haven't managed it yet.
This is my code:
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql
val createDDL = """CREATE TEMPORARY VIEW keyspaces9
USING org.apache.spark.sql.cassandra
OPTIONS (
table "foehis",
keyspace "tfm",
pushdown "true")"""
spark.sql(createDDL)
spark.sql("SELECT hoclic,hodtac,hohrac,hotpac FROM keyspaces").show
And I get:
res41: org.apache.spark.sql.DataFrame = []
+------+--------+------+------+
|hoclic| hodtac|hohrac|hotpac|
+------+--------+------+------+
| 1011|10180619| 510| ENPR|
| 1011|20140427| 800| ANDE|
| 1011|20140427| 800| ANDE|
| 1011|20170522| 1100| ANDE|
| 1011|20170522| 1100| ANDE|
....
But I don't get the option to build a visualization from this.
How do I convert that data into a table that Zeppelin can visualize?
Register the DataFrame as a table using df.registerTempTable (df.createOrReplaceTempView on Spark 2.x).
In your case, register the 'keyspaces' DataFrame as a table; you can then execute SQL queries against it and create visualizations.
Sample code:
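For illustration, a minimal PySpark sketch of that approach (the same pattern applies from Scala), reusing the keyspace, table, and view names from the question:

# Load the Cassandra table into a DataFrame via the Spark Cassandra connector
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="foehis", keyspace="tfm")
      .load())

# Register it as a temporary view (registerTempTable on older Spark versions)
df.createOrReplaceTempView("keyspaces")

# In a separate Zeppelin paragraph, query the view with the %sql interpreter
# to get Zeppelin's built-in table and chart output:
#   %sql
#   SELECT hoclic, hodtac, hohrac, hotpac FROM keyspaces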

DSE SearchAnalytics with Scala error

Referring to this link, I tried to query a Cassandra table into a Spark DataFrame:
val spark = SparkSession
.builder()
.appName("CassandraSpark")
.config("spark.cassandra.connection.host", "127.0.0.1")
.config("spark.cassandra.connection.port", "9042")
.master("local[2]")
.getOrCreate();
The node I'm using is a SearchAnalytics node.
Using this Spark session, I tried a SQL query:
val ss = spark.sql("select * from killr_video.videos where solr_query = '{\"q\":\"video_id:1\"}'")
Search indexing is already enabled on that table.
After running the program, here is the error I am getting:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: `killr_video`.`videos`; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation killr_video.videos
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:82)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:66)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:691)
How can I get Cassandra data into Spark?
From this error message it looks like you're running your code with standalone Spark, not via DSE Analytics (i.e. via dse spark-submit or dse spark).
In this case you need to register the tables - the DSE documentation describes how to do this for all tables, using dse client-tool and spark-sql:
dse client-tool --use-server-config spark sql-schema --all > output.sql
spark-sql --jars byos-5.1.jar -f output.sql
For my example, it looks like the following:
USE test;
CREATE TABLE t122
USING org.apache.spark.sql.cassandra
OPTIONS (
keyspace "test",
table "t122",
pushdown "true");
Here is an example of a solr_query that just works out of the box if I run it in a spark-shell started with dse spark:
scala> val ss = spark.sql("select * from test.t122 where solr_query='{\"q\":\"t:t2\"}'").show
+---+----------+---+
| id|solr_query| t|
+---+----------+---+
| 2| null| t2|
+---+----------+---+
To make your life easier, it's better to use DSE Analytics rather than bring-your-own-Spark.

Spark DataFrame losing string data in yarn-client mode

For some reason, if I add a new column, append a string to existing data/columns, or create a new DataFrame from code, it misinterprets the string data, so show() doesn't work properly and filters (such as withColumn, where, when, etc.) don't work either.
Here is example code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object MissingValue {
  def hex(str: String): String = str.getBytes("UTF-8").map(f => Integer.toHexString((f&0xFF)).toUpperCase).mkString("-")

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("MissingValue")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val list = List((101,"ABC"),(102,"BCD"),(103,"CDE"))
    val rdd = sc.parallelize(list).map(f => Row(f._1,f._2))
    val schema = StructType(StructField("COL1",IntegerType,true)::StructField("COL2",StringType,true)::Nil)

    val df = sqlContext.createDataFrame(rdd,schema)
    df.show()
    val str = df.first().getString(1)
    println(s"${str} == ${hex(str)}")
    sc.stop()
  }
}
If I run it in local mode then everything works as expected:
+----+----+
|COL1|COL2|
+----+----+
| 101| ABC|
| 102| BCD|
| 103| CDE|
+----+----+
ABC == 41-42-43
But when I run the same code in yarn-client mode it produces:
+----+----+
|COL1|COL2|
+----+----+
| 101| ^E^#^#|
| 102| ^E^#^#|
| 103| ^E^#^#|
+----+----+
^E^#^# == 5-0-0
This problem exists only for string values; the first column (Integer) is fine.
Also, if I create an RDD from the DataFrame then everything is fine, i.e. df.rdd.take(1).apply(0).getString(1)
I'm using Spark 1.5.0 from CDH 5.5.2.
EDIT:
It seems that this happens when the difference between driver memory and executor memory (--driver-memory xxG --executor-memory yyG) is too high, i.e. when I decrease executor memory or increase driver memory, the problem disappears.
This is a bug related to executor memory and the JVM's compressed OOPs (ordinary object pointer) size:
https://issues.apache.org/jira/browse/SPARK-9725
https://issues.apache.org/jira/browse/SPARK-10914
https://issues.apache.org/jira/browse/SPARK-17706
It is fixed in Spark version 1.5.2
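Until upgrading to 1.5.2 or later is possible, a workaround consistent with the EDIT in the question is to keep the driver and executor heaps similar in size when submitting the job. A sketch of such a submission; the memory values and jar name are placeholders:

spark-submit --master yarn-client \
  --driver-memory 4G \
  --executor-memory 4G \
  --class MissingValue missing-value.jar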

Using Hive along with Spark Cassandra connector?

Can I use Hive in concert with the Spark Cassandra connector?
scala> import org.apache.spark.sql.hive.HiveContext
scala> val hiveCtx = new HiveContext(sc)
This produces:
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,
/etc/hive/conf.dist/ivysettings.xml will be used
and then
scala> val rows = hiveCtx.sql("SELECT first_name,last_name,house FROM test_gce.students WHERE student_id=1")
results in this error:
org.apache.spark.sql.AnalysisException: no such table test_gce.students; line 1 pos 48
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:260)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:268)
...
Is it possible to create a HiveContext from the SparkContext and use it as I am trying to do, while also using the Spark Cassandra connector?
Here is how I called spark-shell:
spark-shell --jars ~/spark-cassandra-connector/spark-cassandra-connector-assembly-1.4.0-M1-SNAPSHOT.jar --conf spark.cassandra.connection.host=10.240.0.0
Also, I am able to successfully access Cassandra with the pure connector code rather than just using Hive:
scala> val cRDD=sc.cassandraTable("test_gce", "students")
scala> cRDD.select("first_name","last_name","house").where("student_id=?",1).collect()
res0: Array[com.datastax.spark.connector.CassandraRow] =
Array(CassandraRow{first_name: Harry, last_name: Potter, house: Godric Gryffindor})
