Is the plugin updated for Spark 2.0?
I can't use the plugin:
val df = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "web_stat")
  .option("zkUrl", "localhost:2181")
  .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
  .load()
ERROR:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
Connecting to Phoenix over plain JDBC works fine!
But when I just use the Spark JDBC connector, I get:
val df = spark.read
  .format("jdbc")
  .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
  .option("url", "jdbc:phoenix:localhost:2181")
  .option("dbtable", "web_stat")
  .load()
ERROR:
Exception in thread "main" java.lang.NullPointerException
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:167)
  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
  at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:53)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:345)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
  at org.apache.spark.sql.phoenix.SparkPhoenixExample$.main(SparkPhoenixExample.scala:65)
Spark 2.0 does not yet work with the Phoenix connector; see this JIRA for a patch: https://issues.apache.org/jira/browse/PHOENIX-3333
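Until that patch lands, the plain-JDBC path you already confirmed works can serve as a stopgap. A minimal sketch, assuming the Phoenix client jar is on the classpath and the table was created as web_stat (Phoenix upper-cases unquoted identifiers, hence WEB_STAT):
// Plain Phoenix JDBC access without the Spark connector (illustrative only).
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT * FROM WEB_STAT LIMIT 10")
while (rs.next()) {
  println(rs.getString(1)) // print the first column of each row
}
rs.close()
stmt.close()
conn.close()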
Related
I got an error when trying to write data to Redshift using PySpark on an EMR cluster.
df.write.format("jdbc") \
    .option("url", "jdbc:redshift://clustername.yyyyy.us-east-1.redshift.amazonaws.com:5439/db") \
    .option("driver", "com.amazon.redshift.jdbc42.Driver") \
    .option("dbtable", "public.table") \
    .option("user", user_redshift) \
    .option("password", password_redshift) \
    .mode("overwrite") \
    .save()
The error I got is:
py4j.protocol.Py4JJavaError: An error occurred while calling o143.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, , executor 1):
java.sql.SQLException: [Amazon](500310) Invalid operation: The session is read-only;
at com.amazon.redshift.client.messages.inbound.ErrorResponse.toErrorException(Unknown Source)
at com.amazon.redshift.client.PGMessagingContext.handleErrorResponse(Unknown Source)
at com.amazon.redshift.client.PGMessagingContext.handleMessage(Unknown Source)
at com.amazon.jdbc.communications.InboundMessagesPipeline.getNextMessageOfClass(Unknown Source)
at com.amazon.redshift.client.PGMessagingContext.doMoveToNextClass(Unknown Source)
at com.amazon.redshift.client.PGMessagingContext.getParameterDescription(Unknown Source)
at com.amazon.redshift.client.PGClient.prepareStatement(Unknown Source)
at com.amazon.redshift.dataengine.PGQueryExecutor.<init>(Unknown Source)
at com.amazon.redshift.dataengine.PGDataEngine.prepare(Unknown Source)
at com.amazon.jdbc.common.SPreparedStatement.<init>(Unknown Source)
...
I appreciate any help. Thanks!
We also faced the same issue on our EMR PySpark cluster (EMR "ReleaseLabel": "emr-5.33.0", Spark 2.4.7).
We resolved it with the following changes:
Used the Redshift JDBC jar redshift-jdbc42-2.0.0.7.jar from https://docs.aws.amazon.com/redshift/latest/mgmt/jdbc20-previous-driver-version-20.html
Changed the JDBC URL to the following:
jdbc:redshift://clustername.yyyyy.us-east-1.redshift.amazonaws.com:5439/db?user=username&password=password;ReadOnly=false
You can then try to run your spark-submit with the following:
spark-submit --jars s3://jars/redshift-jdbc42-2.0.0.7.jar s3://scripts/scriptname.py
where scriptname.py has
df.write \
    .format('jdbc') \
    .option("driver", "com.amazon.redshift.jdbc42.Driver") \
    .option("url", jdbcUrl) \
    .option("dbtable", "schema.table") \
    .option("aws_iam_role", "XXXX") \
    .option("tempdir", f"s3://XXXXXX") \
    .mode('append') \
    .save()
Can someone help me understand the cause behind this error:
ERROR Query alert [id = d19f51b1-8131-40dd-ab62, runId = 276833a0-235f-4d2e-bd61] terminated with error
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker$.metrics(BasicWriteStatsTracker.scala:180)
at org.apache.spark.sql.execution.streaming.FileStreamSink.basicWriteJobStatsTracker(FileStreamSink.scala:103)
at org.apache.spark.sql.execution.streaming.FileStreamSink.addBatch(FileStreamSink.scala:140)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:568)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:111)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:240)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:97)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:170)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:566)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:565)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:207)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:175)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:169)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:296)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:208)
The cluster configs are:
Databricks runtime 5.5 LTS
Scala 2.11
Spark 2.4.3
Driver: 64GB mem, 16 cores, 3DBU
Workers: 64GB mem, 16 cores, 3DBU (2-4 workers, auto-scaling)
There are 3 streaming queries running in parallel, as defined in fairscheduler.xml.
Spark configs are:
spark.sql.autoBroadcastJoinThreshold=-1
spark.sql.broadcastTimeout=1200
spark.executor.instances=4
spark.executor.cores=16
spark.executor.memory=29g
spark.sql.shuffle.partitions=32
spark.default.parallelism=32
spark.driver.maxResultSize=25g
spark.scheduler.mode=FAIR
spark.scheduler.allocation.file=/dbfs/config/fairscheduler.xml
The code flow is below:
implicit class PipedObject[A](value: A) {
  def conditionalPipe(f: A => A)(pred: Boolean): A =
    if (pred) f(value) else value
}

implicit val spark: SparkSession = SparkSession
  .builder()
  .appName("MyApp")
  .conditionalPipe(sess => sess.master("local[6]"))(false)
  .getOrCreate()

import spark.implicits._

val cookedData = getCookedStreamingData() // streaming data as input from event hub

spark.sparkContext.setLocalProperty("spark.scheduler.pool", "cook")
cookedData.writeStream
  .option("checkpointLocation", "checkpointLocation1")
  .queryName("queryName1")
  .format("avro")
  .option("path", "dir1")
  .start()

val scoredData = score(cookedData)
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "score")
scoredData.writeStream
  .option("checkpointLocation", "checkpointLocation2")
  .queryName("queryName2")
  .format("avro")
  .option("path", "dir2")
  .start()

val alertData = score(scoredData)
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "alert")
alertData.writeStream
  .option("checkpointLocation", "checkpointLocation3")
  .queryName("queryName3")
  .format("avro")
  .option("path", "dir3")
  .start()
Sample fairscheduler.xml file:
<allocations>
  <pool name="default">
    <schedulingMode>FIFO</schedulingMode>
    <weight>2</weight>
    <minShare>2</minShare>
  </pool>
  <pool name="cook">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>5</minShare>
  </pool>
  <pool name="score">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>5</minShare>
  </pool>
  <pool name="alert">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>5</minShare>
  </pool>
</allocations>
java.util.NoSuchElementException: None.get
is purely a Scala programming bug; without a code snippet I can't point to the exact spot.
If you are using Options, then before reading the element you need to check
isDefined before calling get on the Option,
or else you can use the getOrElse() function on the Option to supply a default value.
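A minimal sketch of those two safe-access patterns (the Option value here is hypothetical):
// Hypothetical Option value, used only to illustrate safe access.
val maybeName: Option[String] = None

// Unsafe: maybeName.get on a None throws java.util.NoSuchElementException: None.get

// Check isDefined before calling get:
val name1 = if (maybeName.isDefined) maybeName.get else "unknown"

// Or, more idiomatically, supply a default with getOrElse:
val name2 = maybeName.getOrElse("unknown")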
It may also arise if you are using multiple SparkContexts.
Have a look at this: Spark Streaming Exception: java.util.NoSuchElementException: None.get
I am using spark-sql 2.4.1 with Kafka 0.10.x and Java 1.8.
Dataset<Row> dataSet = sparkSession
    .readStream()
    .format("kafka")
    .option("subscribe", INFO_TOPIC)
    .option("startingOffsets", "latest")
    .option("enable.auto.commit", false)
    .option("maxOffsetsPerTrigger", 1000)
    .option("auto.offset.reset", "latest")
    .option("failOnDataLoss", false)
    .load();

StreamingQuery query = dataSet.writeStream()
    .format(PARQUET_FORMAT)
    .option("path", parqetFileName)
    .option("checkpointLocation", checkPtLocation)
    .trigger(Trigger.ProcessingTime("15 seconds"))
    .start();

query.awaitTermination();
After writing data into my HDFS path (i.e. parqetFileName), it fails with the error below.
[DataStreamer for file /user/parquet/raw/part-00001-7cba7fa3-a98f-442d-9584-b71085b7cd82-c000.snappy.parquet] WARN org.apache.hadoop.hdfs.DataStreamer - Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1249)
at java.lang.Thread.join(Thread.java:1323)
at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
What is wrong here, and how do I fix it?
You must have streamContext.awaitTermination() in your code; otherwise the application will exit immediately after starting your stream.
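For illustration, a minimal Scala sketch of the same idea; spark, df, and the paths below are assumptions, not taken from the code above. Without the final blocking call, the driver's main method returns and the application exits right after start().
val query = df.writeStream                           // df: an assumed streaming DataFrame
  .format("parquet")
  .option("path", "/tmp/stream-out")                 // hypothetical output path
  .option("checkpointLocation", "/tmp/stream-ckpt")  // hypothetical checkpoint path
  .start()

query.awaitTermination()                // block on this single query
// spark.streams.awaitAnyTermination() // or block until any active query terminates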
I'm trying to start a job that I wrote to test integrating Spark with Atlas.
It is a simple job that reads from one topic and writes to another.
val sparkConf = new SparkConf()
  .setAppName("atlas-test")
  .setMaster("local[2]")
  .set("spark.extraListeners", "com.hortonworks.spark.atlas.SparkAtlasEventTracker")
  .set("spark.sql.queryExecutionListeners", "com.hortonworks.spark.atlas.SparkAtlasEventTracker")
  .set("spark.sql.streaming.streamingQueryListeners", "com.hortonworks.spark.atlas.SparkAtlasStreamingQueryEventTracker")

val spark = SparkSession.builder()
  .config(sparkConf)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._

val df = spark.read.format("kafka")
  .option("kafka.bootstrap.servers", BROKER_SERVERS)
  .option("subscribe", "foobar2")
  .option("startingOffset", "earliest")
  .option("kafka.atlas.cluster.name", clusterName)
  .load()

println("---------------------------------------------")
df.printSchema()

val dfs = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").as[(String, String)]
dfs.show()
println("---------------------------------------------")

df.write
  .format("kafka")
  .option("kafka.bootstrap.servers", BROKER_SERVERS)
  .option("topic", "foobar-out")
  .option("kafka.atlas.cluster.name", clusterName)
  .save()
Everything seems understandable. So I try to run the job in my IDE (IntelliJ), and almost every time I get this exception:
19/08/12 17:00:08 WARN SparkExecutionPlanProcessor: Caught exception during parsing event
java.lang.NullPointerException
at org.apache.spark.sql.internal.SQLConf$$anonfun$14.apply(SQLConf.scala:133)
at org.apache.spark.sql.internal.SQLConf$$anonfun$14.apply(SQLConf.scala:133)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:133)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.simpleString(SaveIntoDataSourceCommand.scala:52)
at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:177)
at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:548)
at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:472)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$4.apply(QueryExecution.scala:197)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$4.apply(QueryExecution.scala:197)
at org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:99)
at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:197)
at com.hortonworks.spark.atlas.sql.CommandsHarvester$.com$hortonworks$spark$atlas$sql$CommandsHarvester$$getPlanInfo(CommandsHarvester.scala:214)
at com.hortonworks.spark.atlas.sql.CommandsHarvester$.com$hortonworks$spark$atlas$sql$CommandsHarvester$$makeProcessEntities(CommandsHarvester.scala:222)
at com.hortonworks.spark.atlas.sql.CommandsHarvester$SaveIntoDataSourceHarvester$.harvest(CommandsHarvester.scala:183)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:108)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:89)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:89)
at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:63)
at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:72)
at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:71)
at scala.Option.foreach(Option.scala:257)
at com.hortonworks.spark.atlas.AbstractEventProcessor.eventProcess(AbstractEventProcessor.scala:71)
at com.hortonworks.spark.atlas.AbstractEventProcessor$$anon$1.run(AbstractEventProcessor.scala:38)
I'm using Spark 2.4.0 with Scala 2.11.
I also have some confusion about the result. Honestly, I can't tell whether anything should appear in my local Atlas instance after this job, because sometimes the job runs successfully but nothing appears in Atlas.
I am a relative newbie to Spark/Cassandra, so I have a basic question. I have compiled an uber jar and loaded it onto my Spark/Cassandra server. Now I am in a pickle: how do I run it in the Cassandra (DSE) environment? I know the Spark shell command is "dse spark-submit", but when I try to do a "dse spark-submit" I get a "NullPointerException".
Here is the full output:
Exception in thread "main" java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The program code is very basic and has been proven to work in the Spark shell:
package xxx.seaoxxxx

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

class test {
  def main(args: Array[String]) {
    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "xx.xxx.xx.xx")
      .setAppName("Seasonality")
    val sc = new SparkContext("spark://xx.xxx.xx.xx:7077", "Season", conf)
    val ks = "loadset"
    val incf = "period"
    val rdd = sc.cassandraTable(ks, incf)
    rdd.count
    println("done with test")
    sc.stop()
  }
}
The spark-submit code is as follows:
dse spark-submit \
--class xxx.seaoxxxx.test \
--master spark://xxx.xx.x.xxx:7077 \
/home/ubuntu/spark/Seasonality_v6-assembly-1.0.1.jar 100
Thanks,
Eric
The current release, DataStax Enterprise 4.5, supports dse spark-class instead of dse spark-submit: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkStart.html?scroll=sparkStart__spkShrkLaunch