spark streaming kafka : Unknown error fetching data for topic-partition - apache-spark

I'm trying to read a Kafka topic from a Spark cluster using Structured Streaming API with Kafka integration in Spark
val sparkSession = SparkSession.builder()
.master("local[*]")
.appName("some-app")
.getOrCreate()
Kafka stream creation
import sparkSession.implicits._
val dataFrame = sparkSession
.readStream
.format("kafka")
.option("subscribepattern", "preprod-*")
.option("kafka.bootstrap.servers", "<brokerUrl>:9094")
.option("kafka.ssl.protocol", "TLS")
.option("kafka.security.protocol", "SSL")
.option("kafka.ssl.key.password", secretPassword)
.option("kafka.ssl.keystore.location", "/tmp/xyz.jks")
.option("kafka.ssl.keystore.password", secretPassword)
.option("kafka.ssl.truststore.location", "/abc.jks")
.option("kafka.ssl.truststore.password", secretPassword)
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
.writeStream
.format("console")
.start()
.awaitTermination()
running it using the command
/usr/local/spark/bin/spark-submit
--packages "org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.1,org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1"
myjar.jar
Getting the below error
2018-09-28 07:29:23 INFO AbstractCoordinator:505 - Discovered coordinator brokerUrl.com:32400 (id: 2147483647 rack: null) for group spark-kafka-source-c72dcb79-f3bc-4dfd-86a5-9d14be48fa04-1188588017-executor.
2018-09-28 07:29:23 INFO AbstractCoordinator:505 - Discovered coordinator brokerUrl.com:32400 (id: 2147483647 rack: null) for group spark-kafka-source-c72dcb79-f3bc-4dfd-86a5-9d14be48fa04-1188588017-executor.
2018-09-28 07:29:23 INFO AbstractCoordinator:505 - Discovered coordinator brokerUrl.com:32400 (id: 2147483647 rack: null) for group spark-kafka-source-c72dcb79-f3bc-4dfd-86a5-9d14be48fa04-1188588017-executor.
2018-09-28 07:29:23 INFO AbstractCoordinator:505 - Discovered coordinator brokerUrl.com:32400 (id: 2147483647 rack: null) for group spark-kafka-source-c72dcb79-f3bc-4dfd-86a5-9d14be48fa04-1188588017-executor.
2018-09-28 07:29:47 WARN Fetcher:594 - Unknown error fetching data for topic-partition preprod-sanity-test-5
2018-09-28 07:30:25 WARN Fetcher:594 - Unknown error fetching data for topic-partition preprod-sanity-test-7
2018-09-28 07:30:27 WARN Fetcher:594 - Unknown error fetching data for topic-partition preprod-sanity-test-7
2018-09-28 07:30:27 WARN Fetcher:594 - Unknown error fetching data for topic-partition preprod-sanity-test-5
2018-09-28 07:30:50 WARN Fetcher:594 - Unknown error fetching data for topic-partition preprod-sanity-test-8
2018-09-28 07:30:50 WARN Fetcher:594 - Unknown error fetching data for topic-partition preprod-sanity-test-4
2018-09-28 07:30:50 WARN Fetcher:594 - Unknown error fetching data for topic-partition preprod-sanity-test-7
2018-09-28 07:30:50 WARN Fetcher:594 - Unknown error fetching data for topic-partition preprod-sanity-test-8
2018-09-28 07:30:50 WARN Fetcher:594 - Unknown error fetching data for topic-partition preprod-sanity-test-4
2018-09-28 07:30:50 WARN Fetcher:594 - Unknown error fetching data for topic-partition preprod-sanity-test-5
.....
....
so on

What's your Kafka broker version? And how did you generate these messages?
If these messages have headers (https://issues.apache.org/jira/browse/KAFKA-4208), you will need to use Kafka 0.11+ to consume them as old Kafka client cannot read such messages. If so, you can use the following command:
/usr/local/spark/bin/spark-submit --packages "org.apache.kafka:kafka-clients:0.11.0.3,org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1"
myjar.jar

Related

NoSuchMethodError trying to ingest HDFS data into Elasticsearch

I'm using Spark 3.12, Scala 2.12, Hadoop 3.1.1.3.1.2-50, Elasticsearch 7.10.1 (due to license issues), Centos 7
to try an ingest json data in gzip files located on HDFS into Elasticsearch using spark streaming.
I get a
Logical Plan:
FileStreamSource[hdfs://pct/user/papago-mlops-datalake/raw/mt-log/engine=n2mt/year=2022/date=0430/hour=00]
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:356)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:244)
Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(Lorg/apache/spark/sql/SparkSession;Lorg/apache/spark/sql/execution/QueryExecution;Lscala/Function0;)Ljava/lang/Object;
at org.elasticsearch.spark.sql.streaming.EsSparkSqlStreamingSink.addBatch(EsSparkSqlStreamingSink.scala:62)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:586)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$15(MicroBatchExecution.scala:584)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:584)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:226)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:357)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:355)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:68)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:194)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:57)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:188)
at org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:334)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:317)
... 1 more
ApplicationMaster host: ac3m8x2183.bdp.bdata.ai
ApplicationMaster RPC port: 39673
queue: batch
start time: 1654588583366
final status: FAILED
tracking URL: https://gemini-rm2.bdp.bdata.ai:9090/proxy/application_1654575947385_29572/
user: papago-mlops-datalake
Exception in thread "main" org.apache.spark.SparkException: Application application_1654575947385_29572 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1269)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1627)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:904)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
using
implementation("org.elasticsearch:elasticsearch-hadoop:8.2.2")
implementation("com.typesafe:config:1.4.2")
implementation("org.apache.spark:spark-sql_2.12:3.1.2")
testImplementation("org.scalatest:scalatest_2.12:3.2.12")
testRuntimeOnly("com.vladsch.flexmark:flexmark-all:0.61.0")
compileOnly("org.apache.spark:spark-sql_2.12:3.1.2")
compileOnly("org.apache.spark:spark-core_2.12:3.1.2")
compileOnly("org.apache.spark:spark-launcher_2.12:3.1.2")
compileOnly("org.apache.spark:spark-streaming_2.12:3.1.2")
compileOnly("org.elasticsearch:elasticsearch-spark-30_2.12:8.2.2")
libraries. I tried using ES-Hadoop version 7.10.1, but ES-Spark only supports down to 7.12.0 for Spark 3.0 and I still get the same error.
My code is pretty simple
def main(args: Array[String]): Unit = {
// Set the log level to only print errors
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.config(ConfigurationOptions.ES_NET_HTTP_AUTH_USER, elasticsearchUser)
.config(ConfigurationOptions.ES_NET_HTTP_AUTH_PASS, elasticsearchPass)
.config(ConfigurationOptions.ES_NODES, elasticsearchHost)
.config(ConfigurationOptions.ES_PORT, elasticsearchPort)
.appName(appName)
.master(master)
.getOrCreate()
val streamingDF: DataFrame = spark.readStream
.schema(jsonSchema)
.format("org.apache.spark.sql.execution.datasources.json.JsonFileFormat")
.load(pathToJSONResource)
streamingDF.writeStream
.outputMode(outputMode)
.format(destination)
.option("checkpointLocation", checkpointLocation)
.start(indexAndDocType)
.awaitTermination()
// Stop the session
spark.stop()
}
}
If I can't use the ES-Hadoop libraries is there another way I can go about ingesting JSON into ES from HDFS?

Stream data from one kafka topic to another using pyspark

I have been trying to stream some sample data using pyspark from one kafka topic to another (I want to apply some transformations, but, could not get the basic data movement to work). Below is my spark code.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
from pyspark.sql.types import StructType, StringType, IntegerType
from pyspark.sql.functions import from_json, col
import time
confluentApiKey = 'someapikeyvalue'
confluentSecret = 'someapikey'
spark = SparkSession.builder\
.appName("repartition-job") \
.config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1')\
.getOrCreate()
df = spark\
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "pkc-cloud:9092") \
.option("subscribe", "test1") \
.option("topic", "test1") \
.option("sasl.mechanisms", "PLAIN")\
.option("security.protocol", "SASL_SSL")\
.option("sasl.username", confluentApiKey)\
.option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))\
.option("kafka.ssl.endpoint.identification.algorithm", "https")\
.option("sasl.password", confluentSecret)\
.option("startingOffsets", "earliest")\
.option("basic.auth.credentials.source", "USER_INFO")\
.option("failOnDataLoss", "true").load()
df.printSchema()
query = df \
.selectExpr("CAST(key AS STRING) AS key", "to_json(struct(*)) AS value") \
.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "pkc-cloud:9092") \
.option("topic", "test2") \
.option("sasl.mechanisms", "PLAIN")\
.option("security.protocol", "SASL_SSL")\
.option("sasl.username", confluentApiKey)\
.option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))\
.option("kafka.ssl.endpoint.identification.algorithm", "https")\
.option("sasl.password", confluentSecret)\
.option("startingOffsets", "latest")\
.option("basic.auth.credentials.source", "USER_INFO")\
.option("checkpointLocation", "/tmp/checkpoint").start()
I have been able to get the schema printed well.
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
And when attempting to write to another Kafka topic using writeStream, I see the below logs and dont see the data being written and spark shuts down.
22/02/04 18:29:26 INFO CheckpointFileManager: Writing atomically to file:/tmp/checkpoint/metadata using temp file file:/tmp/checkpoint/.metadata.e6c58f93-5c1c-4f26-97cf-a8d3ed389a57.tmp
22/02/04 18:29:26 INFO CheckpointFileManager: Renamed temp file file:/tmp/checkpoint/.metadata.e6c58f93-5c1c-4f26-97cf-a8d3ed389a57.tmp to file:/tmp/checkpoint/metadata
22/02/04 18:29:27 INFO MicroBatchExecution: Starting [id = 71f0aeb8-46fc-49b5-8baf-3b83cb4df71f, runId = 7a274767-8830-448c-b9bc-d03217cd4465]. Use file:/tmp/checkpoint to store the query checkpoint.
22/02/04 18:29:27 INFO MicroBatchExecution: Reading table [org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable#3be72f6d] from DataSourceV2 named 'kafka' [org.apache.spark.sql.kafka010.KafkaSourceProvider#276c9fdc]
22/02/04 18:29:27 INFO SparkUI: Stopped Spark web UI at http://spark-sample-9d328d7ec5fee0bc-driver-svc.default.svc:4045
22/02/04 18:29:27 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
22/02/04 18:29:27 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
22/02/04 18:29:27 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
22/02/04 18:29:27 INFO MicroBatchExecution: Starting new streaming query.
22/02/04 18:29:27 INFO MicroBatchExecution: Stream started from {}
22/02/04 18:29:27 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/02/04 18:29:27 INFO MemoryStore: MemoryStore cleared
22/02/04 18:29:27 INFO BlockManager: BlockManager stopped
22/02/04 18:29:27 INFO BlockManagerMaster: BlockManagerMaster stopped
22/02/04 18:29:27 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/02/04 18:29:28 INFO SparkContext: Successfully stopped SparkContext
22/02/04 18:29:28 INFO ConsumerConfig: ConsumerConfig values:
....
....
....
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
22/02/04 18:29:28 INFO ShutdownHookManager: Shutdown hook called
22/02/04 18:29:28 INFO ShutdownHookManager: Deleting directory /tmp/spark-7acecef5-0f0b-4b9a-af81-c8aa12f7fcad
22/02/04 18:29:28 INFO AppInfoParser: Kafka version: 2.4.1
22/02/04 18:29:28 INFO AppInfoParser: Kafka commitId: c57222ae8cd7866b
22/02/04 18:29:28 INFO AppInfoParser: Kafka startTimeMs: 1643999368467
22/02/04 18:29:28 INFO ShutdownHookManager: Deleting directory /var/data/spark-641f2e65-8f10-46b9-9821-d3b1f3536c0e/spark-1a103622-3329-4444-8e69-40f5a341c372/pyspark-59e822b4-a4a4-403b-9937-170d99c67584
22/02/04 18:29:28 INFO KafkaConsumer: [Consumer clientId=consumer-spark-kafka-source-75381ad2-1ce9-4e2b-a0b7-18d6ecb5ea8b--2090736517-driver-0-1, groupId=spark-kafka-source-75381ad2-1ce9-4e2b-a0b7-18d6ecb5ea8b--2090736517-driver-0] Subscribed to topic(s): test1
22/02/04 18:29:28 INFO ShutdownHookManager: Deleting directory /var/data/spark-641f2e65-8f10-46b9-9821-d3b1f3536c0e/spark-1a103622-3329-4444-8e69-40f5a341c372
22/02/04 18:29:28 INFO MetricsSystemImpl: Stopping s3a-file-system metrics system...
22/02/04 18:29:28 INFO MetricsSystemImpl: s3a-file-system metrics system stopped.
22/02/04 18:29:28 INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
Also, sometimes, I do see the below logs where the kafka connection fails to establish.
22/02/06 04:50:55 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
22/02/06 04:50:56 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
22/02/06 04:50:57 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
22/02/06 04:50:58 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
What am I doing wrong?

Though I have setMaster as local, my spark application gives error

I have the following application (I am starting and stopping spark) in Windows. I use Scala-IDE(Eclipse). I get "A master URL must be set in your configuration" error even though I have set it here. I use spark-2.4.4 version.
Can someone please help me to fix this issue.
import org.apache.spark._;
import org.apache.spark.sql._;
object SampleApp {
def main(args: Array[String]) {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("Simple Application")
val sc = new SparkContext(conf)
sc.stop()
}
}
The error is:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/10/28 22:58:56 INFO SparkContext: Running Spark version 2.4.4
19/10/28 22:58:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/10/28 22:58:56 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:368)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at com.spark.renga.SampleApp$.main(SampleApp.scala:8)
at com.spark.renga.SampleApp.main(SampleApp.scala)
19/10/28 22:58:56 ERROR Utils: Uncaught exception in thread main
java.lang.NullPointerException
at org.apache.spark.SparkContext.postApplicationEnd(SparkContext.scala:2416)
at org.apache.spark.SparkContext.$anonfun$stop$2(SparkContext.scala:1931)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1931)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:585)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at com.spark.renga.SampleApp$.main(SampleApp.scala:8)
at com.spark.renga.SampleApp.main(SampleApp.scala)
19/10/28 22:58:56 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:368)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at com.spark.renga.SampleApp$.main(SampleApp.scala:8)
at com.spark.renga.SampleApp.main(SampleApp.scala)
if you are using version 2.4.4 try this:
import org.apache.spark.sql.SparkSession
object SampleApp {
def main(args: Array[String]) {
val spark = SparkSession
.builder
.master("local[*]")
.appName("test")
.getOrCreate()
println(spark.sparkContext.version)
spark.stop()
}
}

Exception while integrating spark with Kafka

Doing Spark-kafka streaming on word-count. Built a jar using sbt.
When I do spark-submit the following exception is throwing.
Exception in thread "streaming-start" java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.isDirectory()Z
at org.apache.spark.streaming.util.FileBasedWriteAheadLog.initializeOrRecover(FileBasedWriteAheadLog.scala:245)
at org.apache.spark.streaming.util.FileBasedWriteAheadLog.<init>(FileBasedWriteAheadLog.scala:80)
at org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:142)
at org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:142)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.streaming.util.WriteAheadLogUtils$.createLog(WriteAheadLogUtils.scala:141)
at org.apache.spark.streaming.util.WriteAheadLogUtils$.createLogForDriver(WriteAheadLogUtils.scala:99)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker$$anonfun$createWriteAheadLog$1.apply(ReceivedBlockTracker.scala:256)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker$$anonfun$createWriteAheadLog$1.apply(ReceivedBlockTracker.scala:254)
at scala.Option.map(Option.scala:146)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.createWriteAheadLog(ReceivedBlockTracker.scala:254)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.<init>(ReceivedBlockTracker.scala:77)
at org.apache.spark.streaming.scheduler.ReceiverTracker.<init>(ReceiverTracker.scala:109)
at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:87)
at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:583)
at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:578)
at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:578)
at org.apache.spark.util.ThreadUtils$$anon$2.run(ThreadUtils.scala:126)
18/03/27 12:43:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#12010fd1{/streaming,null,AVAILABLE,#Spark}
18/03/27 12:43:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#552ed807{/streaming/json,null,AVAILABLE,#Spark}
18/03/27 12:43:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#7318daf8{/streaming/batch,null,AVAILABLE,#Spark}
18/03/27 12:43:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#3f1ddac2{/streaming/batch/json,null,AVAILABLE,#Spark}
18/03/27 12:43:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#37864b77{/static/streaming,null,AVAILABLE,#Spark}
18/03/27 12:43:55 INFO streaming.StreamingContext: StreamingContext started
my spark submit:
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 --class "KafkaWordCount" --master local[4] scala_project_2.11-1.0.jar localhost:2181 test-consumer-group word-count 1
scala_version: 2.11.8
spark_version: 2.2.0
sbt_version: 1.0.3
object KafkaWordCount {
def main(args: Array[String]) {
val (zkQuorum, group, topics, numThreads) = ("localhost:2181", "test-consumer-group", "word-count", 1)
val sparkConf = new SparkConf()
.setMaster("local[*]")
.setAppName("KafkaWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
val topicMap = topics.split(",").map((_, numThreads)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val words = lines.flatMap(_.split(" "))
words.foreachRDD(rdd => println("#####################rdd###################### " + rdd.first))
val wordCounts = words.map(x => (x, 1L))
.reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
}
}

How to use a specific directory for metastore with HiveContext?

So this is what I tried in Spark Shell.
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
scala> import java.nio.file.Files
import java.nio.file.Files
scala> val hiveDir = Files.createTempDirectory("hive")
hiveDir: java.nio.file.Path = /var/folders/gg/g3hk6fcj4rxc6lb1qsvxc_vdxxwf28/T/hive5050481206678469338
scala> val hiveContext = new HiveContext(sc)
15/12/31 12:05:27 INFO HiveContext: Initializing execution hive, version 0.13.1
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext#6f959640
scala> hiveContext.sql(s"SET hive.metastore.warehouse.dir=${hiveDir.toUri}")
15/12/31 12:05:34 INFO HiveContext: Initializing HiveMetastoreConnection version 0.13.1 using Spark classes.
...
res0: org.apache.spark.sql.DataFrame = [: string]
scala> Seq("create database foo").foreach(hiveContext.sql)
15/12/31 12:05:42 INFO ParseDriver: Parsing command: create database foo
15/12/31 12:05:42 INFO ParseDriver: Parse Completed
...
15/12/31 12:05:43 INFO HiveMetaStore: 0: create_database: Database(name:foo, description:null, locationUri:null, parameters:null, ownerName:aa8y, ownerType:USER)
15/12/31 12:05:43 INFO audit: ugi=aa8y ip=unknown-ip-addr cmd=create_database: Database(name:foo, description:null, locationUri:null, parameters:null, ownerName:aa8y, ownerType:USER)
15/12/31 12:05:43 INFO HiveMetaStore: 0: get_database: foo
15/12/31 12:05:43 INFO audit: ugi=aa8y ip=unknown-ip-addr cmd=get_database: foo
15/12/31 12:05:43 ERROR RetryingHMSHandler: MetaException(message:Unable to create database path file:/user/hive/warehouse/foo.db, failed to create database foo)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database_core(HiveMetaStore.java:734)
...
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
15/12/31 12:05:43 ERROR DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Unable to create database path file:/user/hive/warehouse/foo.db, failed to create database foo)
at org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:248)
...
at org.apache.hadoop.hive.ql.metadata.Hive.createDatabase(Hive.java:242)
... 78 more
15/12/31 12:05:43 ERROR Driver: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Unable to create database path file:/user/hive/warehouse/foo.db, failed to create database foo)
15/12/31 12:05:43 ERROR ClientWrapper:
======================
HIVE FAILURE OUTPUT
======================
SET hive.metastore.warehouse.dir=file:///var/folders/gg/g3hk6fcj4rxc6lb1qsvxc_vdxxwf28/T/hive5050481206678469338/
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Unable to create database path file:/user/hive/warehouse/foo.db, failed to create database foo)
======================
END HIVE FAILURE OUTPUT
======================
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Unable to create database path file:/user/hive/warehouse/foo.db, failed to create database foo)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:349)
...
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
scala>
But it doesn't seem to recognize the directory I am setting. I've removed content from the stacktrace since it was very verbose. The entire stacktrace is here.
I am not sure what I am doing wrong. Would appreciate any help provided.

Resources