Phoenix "org.apache.phoenix.spark.DefaultSource" error - apache-spark

I am new to Phoenix and am trying to load an HBase table into Phoenix. When I try to load it, I get the error below.
java.lang.ClassNotFoundException: org.apache.phoenix.spark.DefaultSource
My code:
package com.vas.reports
import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.phoenix.spark
import java.sql.DriverManager
import com.google.common.collect.ImmutableMap
import org.apache.hadoop.hbase.filter.FilterBase
import org.apache.phoenix.query.QueryConstants
import org.apache.phoenix.filter.ColumnProjectionFilter;
import org.apache.phoenix.hbase.index.util.ImmutableBytesPtr;
import org.apache.phoenix.hbase.index.util.VersionUtil;
import org.apache.hadoop.hbase.filter.Filter
object PhoenixRead {
case class Record(NO:Int,NAME:String,DEPT:Int)
def main(args: Array[String]) {
val sc= new SparkContext("local","phoenixsample")
val sqlcontext=new SQLContext(sc)
val numWorkers = sc.getExecutorStorageStatus.map(_.blockManagerId.executorId).filter(_ != "driver").length
import sqlcontext.implicits._
val df1=sc.parallelize(List((2,"Varun", 58),
(3,"Alice", 45),
(4,"kumar", 55))).
toDF("NO", "NAME", "DEPT")
df1.show()
println(numWorkers)
println("pritning df2")
val df =sqlcontext.load("org.apache.phoenix.spark",Map("table"->"udm_main","zkUrl"->"phoenix url:2181/hbase-unsecure"))
df.show()
SPARK-SUBMIT
~~~~~~~~~~~~
spark-submit --class com.vas.reports.PhoenixRead --jars /home/hadoop1/phoenix-core-4.4.0-HBase-1.1.jar /shared/test/ratna-0.0.1-SNAPSHOT.jar
Please look into this and suggest a fix.

This is because you need to add the following library files to HBASE_HOME/lib and SPARK_HOME/lib.
in HBASE_HOME/lib:
phoenix-spark-4.7.0-HBase-1.1.jar
phoenix-4.7.0-HBase-1.1-server.jar
in SPARK_HOME/lib:
phoenix-spark-4.7.0-HBase-1.1.jar
phoenix-4.7.0-HBase-1.1-client.jar
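Alternatively, as a sketch (the jar paths below are placeholders for wherever the Phoenix jars actually live on your machine), the same jars can be passed to spark-submit with --jars instead of being copied into SPARK_HOME/lib:
spark-submit --class com.vas.reports.PhoenixRead \
  --jars /path/to/phoenix-spark-4.7.0-HBase-1.1.jar,/path/to/phoenix-4.7.0-HBase-1.1-client.jar \
  /shared/test/ratna-0.0.1-SNAPSHOT.jar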

Related

I am using Spark-Scala from local machine. Spark version 3.0.1

I am using Spark with Scala from my local machine, Spark version 3.0.1. I need to read data from a publicly open S3 bucket from IntelliJ IDEA. Below is my code:
package AWS
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.SparkConf
object Read extends App {
val spark = SparkSession.builder()
.master("local[3]")
.appName("Accessing AWS S3")
.getOrCreate()
// spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "XXXXXXXXXXXXXXXXXXXXXXXXX")
// spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "XXXXXXXXXXXXXXXXXXXXXXXXX")
// spark.sparkContext.hadoopConfiguration.set("fs.s3n.endpoint", "s3.amazonaws.com")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "XXXXXXXXX")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "XXXXXXXXXXXXXXXXXXXXXXXXX")
spark.sparkContext.hadoopConfiguration.set("fs.s3n.endpoint", "s3.amazonaws.com")
val dept_df = spark.read.format("csv").load("s3a://hr-data-lake/departments.csv")
dept_df.printSchema()
dept_df.show(truncate = false)
}
//s3://hr-data-lake/departments.csv
//sc.hadoopConfiguration.set("fs.s3a.access.key", "XXXXXXXXX")
//sc.hadoopConfiguration.set("fs.s3a.secret.key", "XXXXXXXXXXXXXXXXXXXXXXXXX")
//spark.sparkContext.set("fs.s3a.access.key", "XXXXXXXXXXXXXXXXXXXXXXXXX")
//sc.hadoopConfiguration.set("fs.s3a.secret.key", "XXXXXXXXXXXXXXXXXXXXXXXXX")
When I run it, it fails with:
Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
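This error usually means that hadoop-aws, which provides org.apache.hadoop.fs.s3a.S3AFileSystem, is not on the classpath. A minimal build.sbt sketch follows; the hadoop-aws version is an assumption and must match the hadoop-client version that your Spark build actually pulls in:
// build.sbt (sketch); align the hadoop-aws version with your Hadoop version.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.0.1",
  // Provides org.apache.hadoop.fs.s3a.S3AFileSystem and pulls in the AWS SDK bundle.
  "org.apache.hadoop" % "hadoop-aws" % "3.2.0"
)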

Unable to read Kinesis stream from SparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Milliseconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions
import com.amazonaws.auth.AWSCredentials
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.auth.SystemPropertiesCredentialsProvider
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.streaming.kinesis.KinesisInputDStream
import org.apache.spark.streaming.kinesis.KinesisInitialPositions.Latest
import org.apache.spark.streaming.kinesis.KinesisInitialPositions.TrimHorizon
import java.util.Date
val tStream = KinesisInputDStream.builder
.streamingContext(ssc)
.streamName(streamName)
.endpointUrl(endpointUrl)
.regionName(regionName)
.initialPosition(new TrimHorizon())
.checkpointAppName(appName)
.checkpointInterval(kinesisCheckpointInterval)
.storageLevel(StorageLevel.MEMORY_AND_DISK_2)
.build()
tStream.foreachRDD(rdd => if (rdd.count() > 0) rdd.saveAsTextFile("/user/hdfs/test/") else println("No record to read"))
Here, even though I can see data coming into the stream, the Spark job above isn't getting any records. I am sure that I am connecting to the right stream with the correct credentials.
Please help me out.
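One thing to check, as a sketch under the assumption that ssc is the StreamingContext built earlier and nothing else in the application starts it: a DStream only begins receiving data once the streaming context is started, so after registering the foreachRDD output the driver needs something like
ssc.start()             // nothing is consumed from Kinesis until the context is started
ssc.awaitTermination()  // keep the driver alive while the stream runs
If the context is already being started elsewhere, this is not the cause, and the checkpoint application name and initial position would be the next things to verify.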

could not find implicit value for evidence parameter of type org.apache.spark.sql.Encoder[String]

I am trying to load a dataframe into a Hive table.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
object SparkToHive {
def main(args: Array[String]) {
val warehouseLocation = "file:${system:user.dir}/spark-warehouse"
val sparkSession = SparkSession.builder.master("local[2]").appName("Saving data into HiveTable using Spark")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
.config("spark.sql.warehouse.dir", warehouseLocation)
.getOrCreate()
**import sparkSession.implicits._**
val partfile = sparkSession.read.text("partfile").as[String]
val partdata = partfile.map(part => part.split(","))
case class Partclass(id:Int, name:String, salary:Int, dept:String, location:String)
val partRDD = partdata.map(line => PartClass(line(0).toInt, line(1), line(2).toInt, line(3), line(4)))
val partDF = partRDD.toDF()
partDF.write.mode(SaveMode.Append).insertInto("parttab")
}
}
I haven't executed it yet but I am getting the following error at this line:
import sparkSession.implicits._
could not find implicit value for evidence parameter of type org.apache.spark.sql.Encoder[String]
How can I fix this ?
Please move your case class Partclass outside of the SparkToHive object. It should be fine then.
Also, there are ** in your implicits import statement. Try
import sparkSession.sqlContext.implicits._
The mistakes I made were:
The case class should be outside main and inside the object.
In the line val partfile = sparkSession.read.text("partfile").as[String], I used read.text("..") to get the file into Spark, where read.textFile("...") should be used instead.
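Putting those points together, a minimal corrected sketch (untested, following the suggestion of a top-level case class; the file and table names are taken from the question):
import org.apache.spark.sql.{SaveMode, SparkSession}
// Case class defined at the top level so Spark can derive an Encoder for it.
case class Partclass(id: Int, name: String, salary: Int, dept: String, location: String)
object SparkToHive {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .master("local[2]")
      .appName("Saving data into HiveTable using Spark")
      .enableHiveSupport()
      .getOrCreate()
    import sparkSession.implicits._
    // read.textFile returns a Dataset[String], one element per line.
    val partDF = sparkSession.read.textFile("partfile")
      .map(_.split(","))
      .map(p => Partclass(p(0).toInt, p(1), p(2).toInt, p(3), p(4)))
      .toDF()
    partDF.write.mode(SaveMode.Append).insertInto("parttab")
  }
}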

multi operating system for distributed computing in spark 2.1.0

Seeking help, as I'm new to Spark and just want to understand this.
I set up the Spark master by editing the spark-env.sh file, changing the parameters to
export SCALA_HOME=/cats/dev/scala/scala-2.12.2
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=3
export SPARK_WORKER_DIR=/cats/dev/spark-2.1.0-bin-hadoop2.7/work/sparkdata
export SPARK_MASTER_IP="192.168.1.54"
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
for a multi-node setup on the same machine.
To execute a small project I used the command
(/cats/dev/spark-2.1.0-bin-hadoop2.7/bin/spark-submit --class "Person" --master spark://cats-All-Series:7077 /cats/sbt_projects/test/people/target/scala-2.10/people-assembly-1.0.jar)
After executing it, the web UI shows that my app finished in 6 seconds, but when I run the same job on a single machine, without the master and slaves, it executes in 0.3 seconds.
Without a master, I executed it in spark-shell by first loading the code
(scala> :load /cats/scala/db2connectivity/sparkjdbc.scala) and then executing it (scala> sparkjdbc.connectSpark).
import org.apache.spark._
import org.apache.spark.sql.{Dataset, DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import java.io.Serializable
import java.util.List
import java.util.Properties
object Person extends App {
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
connectSpark()
/*Following is spark-jdbc code using scala*/
def connectSpark(): Unit ={
val url = "jdbc:db2://localhost:50000/sample"
val driver = "com.ibm.db2.jcc.DB2Driver"
val username = "db2inst1"
val password = "db2inst1"
val prop = new java.util.Properties
prop.setProperty("user",username)
prop.setProperty("password",password)
Class.forName(driver)
val jdbcDF = sqlContext.read
.format("jdbc")
.option("url", url)
.option("driver", driver)
.option("dbtable", "allbands")
.option("user", "db2inst1")
.option("password", "db2inst1")
.load()
jdbcDF.show()
println("Done loading")
}
}

Spark MiniCluster

Is it possible to create a "Spark MiniCluster" entirely programmatically to run small Spark apps from inside a Scala program? I do NOT want to start the Spark shell, but instead get a "MiniCluster" entirely fabricated in the Main of my program.
You can create an application and use a local master to run Spark locally, without any external cluster:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object LocalApp {
def main(args: Array[String]) {
val sc = new SparkContext("local[*]", "local-app", new SparkConf())
// Do whatever you need
sc.stop()
}
}
You can do exactly the same thing with any supported language.
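The same idea with the newer SparkSession API, as a minimal sketch (assuming Spark 2.x or later):
import org.apache.spark.sql.SparkSession
object LocalApp {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process using all available cores; no external cluster is needed.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("local-app")
      .getOrCreate()
    // Do whatever you need with spark or spark.sparkContext
    spark.stop()
  }
}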
