Save CSV file to hbase table using Spark and Phoenix - apache-spark

Can someone point me to a working example of saving a csv file to Hbase table using Spark 2.2
Options that I tried and failed (Note: all of them work with Spark 1.6 for me)
phoenix-spark
hbase-spark
it.nerdammer.bigdata : spark-hbase-connector_2.10
All of them finally after fixing everything give similar error to this Spark HBase
Thanks

Add below parameters to your spark job-
spark-submit \
--conf "spark.yarn.stagingDir=/somelocation" \
--conf "spark.hadoop.mapreduce.output.fileoutputformat.outputdir=/s‌​omelocation" \
--conf "spark.hadoop.mapred.output.dir=/somelocation"

Phoexin has plugin and jdbc thin client which can connect(read/write) to HBASE, example are in https://phoenix.apache.org/phoenix_spark.html
Option 1 : Connect via zookeeper url - phoenix plugin
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
val df = sqlContext.load(
"org.apache.phoenix.spark",
Map("table" -> "TABLE1", "zkUrl" -> "phoenix-server:2181")
)
df
.filter(df("COL1") === "test_row_1" && df("ID") === 1L)
.select(df("ID"))
.show
Option 2 : Use JDBC thin client provied by phoenix query server
more info on https://phoenix.apache.org/server.html
jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF

Related

PySpark Kafka - NoClassDefFound: org/apache/commons/pool2

I am encountering problem with printing the data to console from kafka topic.
The error message I get is shown in below image.
As you can see in the above image that after batch 0 , it doesn't process further.
All this are snapshots of the error messages. I don't understand the root cause of the errors occurring. Please help me.
Following are kafka and spark version:
spark version: spark-3.1.1-bin-hadoop2.7
kafka version: kafka_2.13-2.7.0
I am using the following jars:
kafka-clients-2.7.0.jar
spark-sql-kafka-0-10_2.12-3.1.1.jar
spark-token-provider-kafka-0-10_2.12-3.1.1.jar
Here is my code:
spark = SparkSession \
.builder \
.appName("Pyspark structured streaming with kafka and cassandra") \
.master("local[*]") \
.config("spark.jars","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.executor.extraClassPath","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.executor.extraLibrary","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.config("spark.driver.extraClassPath","file:///C://Users//shivani//Desktop//Spark//kafka-clients-2.7.0.jar,file:///C://Users//shivani//Desktop//Spark//spark-sql-kafka-0-10_2.12-3.1.1.jar,file:///C://Users//shivani//Desktop//Spark//spark-cassandra-connector-2.4.0-s_2.11.jar,file:///D://mysql-connector-java-5.1.46//mysql-connector-java-5.1.46.jar,file:///C://Users//shivani//Desktop//Spark//spark-token-provider-kafka-0-10_2.12-3.1.1.jar")\
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
#streaming dataframe that reads from kafka topic
df_kafka=spark.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers",kafka_bootstrap_servers)\
.option("subscribe",kafka_topic_name)\
.option("startingOffsets", "latest") \
.load()
print("Printing schema of df_kafka:")
df_kafka.printSchema()
#converting data from kafka broker to string type
df_kafka_string=df_kafka.selectExpr("CAST(value AS STRING) as value")
# schema to read json format data
ts_schema = StructType() \
.add("id_str", StringType()) \
.add("created_at", StringType()) \
.add("text", StringType())
#parse json data
df_kafka_string_parsed=df_kafka_string.select(from_json(col("value"),ts_schema).alias("twts"))
df_kafka_string_parsed_format=df_kafka_string_parsed.select("twts.*")
df_kafka_string_parsed_format.printSchema()
df=df_kafka_string_parsed_format.writeStream \
.trigger(processingTime="1 seconds") \
.outputMode("update")\
.option("truncate","false")\
.format("console")\
.start()
df.awaitTermination()
The error (NoClassDefFound, followed by the kafka010 package) is saying that spark-sql-kafka-0-10 is missing its transitive dependency on org.apache.commons:commons-pool2:2.6.2, as you can see here
You can either download that JAR as well, or you can change your code to use --packages instead of spark.jars option, and let Ivy handle downloading transitive dependencies
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache...'
spark = SparkSession.bulider...

Load/import CSV file in to mongodb using PYSPARK

I want to know how to load/import a CSV file in to mongodb using pyspark. I have a csv file named cal.csv placed in the desktop. Can somebody share the code snippet.
First read the csv as pyspark dataframe.
from pyspark import SparkConf,SparkContext
from pyspark.sql import SQLContext
sc = SparkContext(conf = conf)
sql = SQLContext(sc)
df = sql.read.csv("cal.csv", header=True, mode="DROPMALFORMED")
Then write it to mongodb,
df.write.format('com.mongodb.spark.sql.DefaultSource').mode('append')\
.option('database',NAME).option('collection',COLLECTION_MONGODB).save()
Specify the NAME and COLLECTION_MONGODB as created by you.
Also, you need to give conf and packages alongwith spark-submit according to your version,
/bin/spark-submit --conf "spark.mongodb.inuri=mongodb://127.0.0.1/DATABASE.COLLECTION_NAME?readPreference=primaryPreferred"
--conf "spark.mongodb.output.uri=mongodb://127.0.0.1/DATABASE.COLLECTION_NAME"
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
tester.py
Specify COLLECTION_NAME and DATABASE above. tester.py assuming name of the code file. For more information, refer this.
This worked for me. database:people Collection:con
pyspark --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/people.con?readPreference=primaryPreferred" \
--conf "spark.mongodb.output.uri=mongodb://127.0.0.1/people.con" \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.0
from pyspark.sql import SparkSession
my_spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb://127.0.0.1/people.con") \
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/people.con") \
.getOrCreate()
df = spark.read.csv(path = "file:///home/user/Desktop/people.csv", header=True, inferSchema=True)
df.printSchema()
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").option("database","people").option("collection", "con").save()
Next go to mongo and check if collection is wrtten by following below steps
mongo
show dbs
use people
show collections
db.con.find().pretty()

connect to mysql from spark

I am trying to follow the instructions mentioned here...
https://www.percona.com/blog/2016/08/17/apache-spark-makes-slow-mysql-queries-10x-faster/
and here...
https://www.percona.com/blog/2015/10/07/using-apache-spark-mysql-data-analysis/
I am using sparkdocker image.
docker run -it -p 8088:8088 -p 8042:8042 -p 4040:4040 -h sandbox sequenceiq/spark:1.6.0 bash
cd /usr/local/spark/
./sbin/start-master.sh
./bin/spark-shell --driver-memory 1G --executor-memory 1g --executor-cores 1 --master local
This works as expected:
scala> sc.parallelize(1 to 1000).count()
But this shows an error:
val jdbcDF = spark.read.format("jdbc").options(
Map("url" -> "jdbc:mysql://1.2.3.4:3306/test?user=dba&password=dba123",
"dbtable" -> "ontime.ontime_part",
"fetchSize" -> "10000",
"partitionColumn" -> "yeard", "lowerBound" -> "1988", "upperBound" -> "2016", "numPartitions" -> "28"
)).load()
And here is the error:
<console>:25: error: not found: value spark
val jdbcDF = spark.read.format("jdbc").options(
How do I connect to MySQL from within spark shell?
With spark 2.0.x,you can use DataFrameReader and DataFrameWriter.
Use SparkSession.read to access DataFrameReader and use Dataset.write to access DataFrameWriter.
Suppose using spark-shell.
read example
val prop=new java.util.Properties()
prop.put("user","username")
prop.put("password","yourpassword")
val url="jdbc:mysql://host:port/db_name"
val df=spark.read.jdbc(url,"table_name",prop)
df.show()
read example 2
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:mysql:dbserver")
.option("dbtable", “schema.tablename")
.option("user", "username")
.option("password", "password")
.load()
from spark doc
write example
import org.apache.spark.sql.SaveMode
val prop=new java.util.Properties()
prop.put("user","username")
prop.put("password","yourpassword")
val url="jdbc:mysql://host:port/db_name"
//df is a dataframe contains the data which you want to write.
df.write.mode(SaveMode.Append).jdbc(url,"table_name",prop)
Create the spark context first
Make sure you have jdbc jar files in attached to your classpath
if you are trying to read data from jdbc. use dataframe API instead of RDD as dataframes have better performance. refer to the below performance comparsion graph.
here is the syntax for reading from jdbc
SparkConf conf = new SparkConf().setAppName("app"))
.setMaster("local[2]")
.set("spark.serializer",prop.getProperty("spark.serializer"));
JavaSparkContext sc = new JavaSparkContext(conf);
sqlCtx = new SQLContext(sc);
df = sqlCtx.read()
.format("jdbc")
.option("url", "jdbc:mysql://1.2.3.4:3306/test")
.option("driver", "com.mysql.jdbc.Driver")
.option("dbtable","dbtable")
.option("user", "dbuser")
.option("password","dbpwd"))
.load();
It looks like spark is not defined, you should use the SQLContext to connect to the driver like this:
import org.apache.spark.sql.SQLContext
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val dataframe_mysql = sqlcontext.read.format("jdbc").option("url", "jdbc:mysql://Public_IP:3306/DB_NAME").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "tblage").option("user", "sqluser").option("password", "sqluser").load()
Later you can user sqlcontext where you used spark (in spark.read etc)
This is a common problem for those migrating to Spark 2.0.0 from the earlier versions. The Spark documentation is not very good. To solve this, you have to define a SparkSession, like this:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("Spark SQL Example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
This solution is hidden in the Spark SQL, Dataframes and Data Sets Guide located here. SparkSession is the new entry point to the DataFrame API and it incorporates both SQLContext and HiveContext and has some additional advantages, so there is no need to define either of those anymore. Further information about this can be found here.
Please accept this as the answer, if you find this useful.

how to use Cassandra Context in spark 2.0

In previous Version of Spark like 1.6.1, i am using creating Cassandra Context using spark Context,
import org.apache.spark.{ Logging, SparkContext, SparkConf }
//config
val conf: org.apache.spark.SparkConf = new SparkConf(true)
.set("spark.cassandra.connection.host", CassandraHost)
.setAppName(getClass.getSimpleName)
lazy val sc = new SparkContext(conf)
val cassandraSqlCtx: org.apache.spark.sql.cassandra.CassandraSQLContext = new CassandraSQLContext(sc)
//Query using Cassandra context
cassandraSqlCtx.sql("select id from table ")
But In Spark 2.0 , Spark Context is replaced with Spark session, how can i use cassandra context?
Short Answer: You don't. It has been deprecated and removed.
Long Answer: You don't want to. The HiveContext provides everything except for the catalogue and supports a much wider range of SQL(HQL~). In Spark 2.0 this just means you will need to manually register Cassandra tables use createOrReplaceTempView until an ExternalCatalogue is implemented.
In Sql this looks like
spark.sql("""CREATE TEMPORARY TABLE words
|USING org.apache.spark.sql.cassandra
|OPTIONS (
| table "words",
| keyspace "test")""".stripMargin)
In the raw DF api it looks like
spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "test", "table" -> "words"))
.load
.createOrReplaceTempView("words")
Both of these commands will register the table "words" for SQL queries.

Using Hive along with Spark Cassandra connector?

Can I use Hive in concert with the Spark cassandra connector ?
scala> import org.apache.spark.sql.hive.HiveContext
scala> hiveCtx = new HiveContext(sc)
This produces:
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,
/etc/hive/conf.dist/ivysettings.xml will be used
and then
scala> val rows = hiveCtx.sql("SELECT first_name,last_name,house FROM
test_gce.students WHERE student_id=1")
results in this error:
org.apache.spark.sql.AnalysisException: no such table test_gce.students; line 1 pos 48
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:260)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$7.applyOrElse(Analyzer.scala:268)
...
Is it possible to create a HiveContext from the SparkContext and use it as I am trying to do while using the Spark cassandra connector ?
Here is how I called spark-shell:
spark-shell --jars ~/spark-cassandra-connector/spark-cassandra-connector-assembly-1.4.0-M1-SNAPSHOT.jar --conf spark.cassandra.connection.host=10.240.0.0
Also, I am able to successfully access Cassandra with the pure connector code rather than just using Hive:
scala> val cRDD=sc.cassandraTable("test_gce", "students")
scala>cRDD.select("first_name","last_name","house").where("student_id=?",1).collect()
res0: Array[com.datastax.spark.connector.CassandraRow] =
Array(CassandraRow{first_name: Harry, last_name: Potter, house: Godric Gryffindor})

Resources