Spark-submit fails when using kafka structured streaming in pyspark 2.3.1 - apache-spark

spark-submit fails when I use Kafka structured streaming in PySpark 2.3.1, but the same code works when I run it in the pyspark shell. How can I fix this?
from pyspark.sql.types import *
from pyspark.sql import SparkSession
topic="topicname"
spark=SparkSession\
.builder\
.appName("test_{}".format(topic))\
.getOrCreate()
source_df = spark.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers","ip:6667")\
.option("subscribe", topic)\
.option("failOnDataLoss","false")\
.option("maxOffsetsPerTrigger",30000)\
.load()
query=source_df.selectExpr("CAST(key AS STRING)")\
.writeStream\
.format("json")\
.option("checkpointLocation","/data/testdata_test")\
.option("path","/data/testdata_test_checkpoint")\
.start()
The commands are as follows.
# This command fails:
spark-submit --master yarn --jars hdfs:///vcrm_data/spark-sql-kafka-0-10_2.11-2.3.1.jar,hdfs:///vcrm_data/kafka-clients-1.1.1.3.2.0.0-520.jar test.py
# This command, followed by pasting the code into the shell, succeeds:
pyspark --jars hdfs:///vcrm_data/spark-sql-kafka-0-10_2.11-2.3.1.jar,hdfs:///vcrm_data/kafka-clients-2.6.0.jar
My environment:
HDP (Hortonworks) 3.0.1 (Spark 2.3.1)
Kafka 1.1.1
spark-submit log:
20/08/24 20:19:01 INFO AppInfoParser: Kafka version: 2.6.0
20/08/24 20:19:01 INFO AppInfoParser: Kafka commitId: 62abe01bee039651
20/08/24 20:19:01 INFO AppInfoParser: Kafka startTimeMs: 1598267941674
20/08/24 20:19:01 INFO KafkaConsumer: [Consumer clientId=consumer-spark-kafka-source-6d5eb2af-8039-4073-a3f1-3ba44d01fedc--47946854-driver-0-2, groupId=spark-kafka-source-6d5eb2af-8039-4073-a3f1-3ba44d01fedc--47946854-driver-0] Subscribed to topic(s): test
20/08/24 20:19:01 INFO SparkContext: Invoking stop() from shutdown hook
20/08/24 20:19:01 INFO MicroBatchExecution: Starting new streaming query.
20/08/24 20:19:01 INFO AbstractConnector: Stopped Spark@644a2858{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/08/24 20:19:01 INFO SparkUI: Stopped Spark web UI at http://p0gn001.io:4040
20/08/24 20:19:01 INFO YarnClientSchedulerBackend: Interrupting monitor thread
20/08/24 20:19:01 INFO YarnClientSchedulerBackend: Shutting down all executors
20/08/24 20:19:01 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
20/08/24 20:19:01 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
20/08/24 20:19:01 INFO YarnClientSchedulerBackend: Stopped
20/08/24 20:19:01 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/08/24 20:19:01 INFO MemoryStore: MemoryStore cleared
20/08/24 20:19:01 INFO BlockManager: BlockManager stopped
20/08/24 20:19:01 INFO BlockManagerMaster: BlockManagerMaster stopped
20/08/24 20:19:01 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/08/24 20:19:01 INFO SparkContext: Successfully stopped SparkContext
20/08/24 20:19:01 INFO ShutdownHookManager: Shutdown hook called
20/08/24 20:19:01 INFO ShutdownHookManager: Deleting directory /tmp/temporaryReader-ea374033-f3fc-4bca-9c15-4cccaa7da8ac
20/08/24 20:19:01 INFO ShutdownHookManager: Deleting directory /tmp/spark-075415ca-8e98-4bb0-916c-a89c4d4f9d1f
20/08/24 20:19:01 INFO ShutdownHookManager: Deleting directory /tmp/spark-83270204-3330-4361-bf56-a82c47d8c96f
20/08/24 20:19:01 INFO ShutdownHookManager: Deleting directory /tmp/spark-075415ca-8e98-4bb0-916c-a89c4d4f9d1f/pyspark-eb26535b-1bf6-495e-83a1-4bbbdc658c7a
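The log above contains no ERROR line: the shutdown hook fires just as the query starts, which is what one would expect if the driver script simply reaches its end after `.start()`. The interactive pyspark shell keeps the driver alive, whereas spark-submit exits when the script returns. A minimal sketch of the usual remedy, assuming that is the cause here, is to block on the query handle. `run_until_stopped` is a hypothetical helper name; `awaitTermination` is the method on the object returned by `.start()`.

```python
def run_until_stopped(query, timeout_s=None):
    """Keep the driver alive until the streaming query ends.

    `query` is the handle returned by `.writeStream ... .start()`.
    With no timeout this blocks until the query stops or fails;
    with a timeout it returns whether the query ended in time.
    """
    if timeout_s is None:
        query.awaitTermination()
        return True
    return bool(query.awaitTermination(timeout_s))


# In test.py this would be the last line, after `query = ... .start()`:
#     run_until_stopped(query)
```

Calling `query.awaitTermination()` directly at the end of the script has the same effect; the wrapper only makes the optional timeout explicit.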

Related

How to write to HBase in Azure Databricks?

I'm trying to build a lambda architecture using Kafka, Spark and HBase. I'm on Azure, and the components run on the following platforms:
1. Kafka (0.10) - HDInsight
2. Spark (2.4.3) - Databricks
3. HBase (1.2) - HDInsight
All three components are in the same VNet, so connectivity is not an issue.
I'm using Spark Structured Streaming and can successfully connect to Kafka as a source.
Since Spark has no native support for HBase, I'm using the Spark Hortonworks Connector (SHC) to write data to HBase. The write of each batch is implemented in the foreachBatch API, available from Spark 2.4 onward.
The code is below:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.Dataset
import org.apache.spark.SparkConf
import org.apache.spark.sql.DataFrame
//code dependencies
import com.JEM.conf.HbaseTableConf
import com.JEM.constant.HbaseConstant
//'Spark hortonworks connector' dependency
import org.apache.spark.sql.execution.datasources.hbase._
//---------Variables--------------//
val kafkaBroker = "valid kafka broker"
val topic = "valid kafka topic"
val kafkaCheckpointLocation = "/checkpointDir"
//---------code--------------//
import spark.sqlContext.implicits._
val kafkaIpStream = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", kafkaBroker)
.option("subscribe", topic)
.option("checkpointLocation", kafkaCheckpointLocation)
.option("startingOffsets", "earliest")
.load()
val streamToBeSentToHbase = kafkaIpStream.selectExpr("cast (key as String)", "cast (value as String)")
.withColumn("ts", split($"key", "/")(1))
.selectExpr("key as rowkey", "ts", "value as val")
.writeStream
.option("failOnDataLoss", false)
.outputMode(OutputMode.Update())
.trigger(Trigger.ProcessingTime("30 seconds"))
.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF
.write
.options(Map(HBaseTableCatalog.tableCatalog -> HbaseTableConf.getRawtableConf(HbaseConstant.hbaseRawTable), HBaseTableCatalog.newTable -> "5"))
.format("org.apache.spark.sql.execution.datasources.hbase").save
}.start()
From the logs I can see that the code reads the data successfully, but when it tries to write to HBase I get the following exception.
19/10/09 12:42:48 ERROR MicroBatchExecution: Query [id = 1a54283d-ab8a-4bf4-af65-63becc166328, runId = 670f90de-8ca5-41d7-91a9-e8d36dfeef66] terminated with error
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/NamespaceNotFoundException
at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:59)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:72)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:88)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:146)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:134)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:187)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:183)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:134)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:116)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:710)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:111)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:240)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:97)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:170)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:710)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:306)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:292)
at com.JEM.Importer.StreamingHbaseImporter$.$anonfun$main$1(StreamingHbaseImporter.scala:57)
at com.JEM.Importer.StreamingHbaseImporter$.$anonfun$main$1$adapted(StreamingHbaseImporter.scala:53)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:36)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:568)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:111)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:240)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:97)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:170)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:566)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:565)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:207)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:175)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:251)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:61)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:175)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:169)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:296)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:208)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.NamespaceNotFoundException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 44 more
19/10/09 12:42:48 INFO SparkContext: Invoking stop() from shutdown hook
19/10/09 12:42:48 INFO AbstractConnector: Stopped Spark@42f85fa4{HTTP/1.1,[http/1.1]}{172.20.170.72:47611}
19/10/09 12:42:48 INFO SparkUI: Stopped Spark web UI at http://172.20.170.72:47611
19/10/09 12:42:48 INFO StandaloneSchedulerBackend: Shutting down all executors
19/10/09 12:42:48 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
19/10/09 12:42:48 INFO SQLAppStatusListener: Execution ID: 1 Total Executor Run Time: 0
19/10/09 12:42:49 INFO SQLAppStatusListener: Execution ID: 0 Total Executor Run Time: 0
19/10/09 12:42:49 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/10/09 12:42:49 INFO MemoryStore: MemoryStore cleared
19/10/09 12:42:49 INFO BlockManager: BlockManager stopped
19/10/09 12:42:49 INFO BlockManagerMaster: BlockManagerMaster stopped
19/10/09 12:42:49 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/10/09 12:42:49 INFO SparkContext: Successfully stopped SparkContext
19/10/09 12:42:49 INFO ShutdownHookManager: Shutdown hook called
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/tmp/temporaryReader-79d0f9b8-c380-4141-9ac2-46c257c6c854
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/tmp/temporary-d00d2f73-96e3-4a18-9d5c-a9ff76a871bb
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/tmp/spark-b92c0171-286b-4863-9fac-16f4ac379da8
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/tmp/spark-ef92ca6b-2e7d-4917-b407-4426ad088cee
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/spark-9b138379-fa3a-49db-95dc-436cd7040a95
19/10/09 12:42:49 INFO ShutdownHookManager: Deleting directory /local_disk0/tmp/spark-2666c4ff-30a2-4161-868e-137af5fa3787
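A `NoClassDefFoundError` for `org/apache/hadoop/hbase/NamespaceNotFoundException` usually means that no jar visible to the driver contains that class; the package name indicates it belongs to HBase itself, not to the connector. One way to check is to look inside the jars that are actually installed. The sketch below uses only the Python standard library; `jars_containing` is a hypothetical helper name, and the jar paths passed to it are whatever is deployed on the cluster.

```python
import zipfile


def jars_containing(class_name, jar_paths):
    """Return the jars (zip archives) that contain the given class."""
    # "a.b.C" is stored inside a jar as the entry "a/b/C.class".
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for path in jar_paths:
        with zipfile.ZipFile(path) as jar:
            if entry in jar.namelist():
                hits.append(path)
    return hits
```

An empty result for `org.apache.hadoop.hbase.NamespaceNotFoundException` would suggest that the HBase client jars are missing from the classpath and need to be added alongside SHC.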
What I have tried:
There are three ways to run Spark jobs in Databricks:
Notebook: I installed the SHC jar as a library on the cluster, placed hbase-site.xml in my job jar and installed that on the cluster as well, then pasted the main class code into a notebook. When I run it, the SHC dependencies load, but I get the error above.
Jar: This is almost the same as the notebook, except that instead of a notebook I give the job the main class and the jar. It gives me the same error.
Spark submit: I created an uber-jar with all the dependencies, including SHC, uploaded the hbase-site file to a DBFS path, and passed the following parameters to the submit command:
[
"--class","com.JEM.Importer.StreamingHbaseImporter","dbfs:/FileStore/JarPath/jarfile.jar",
"--packages","com.hortonworks:shc-core:1.1.1-2.1-s_2.11",
"--repositories","http://repo.hortonworks.com/content/groups/public/",
"--files","dbfs:/PathToSiteFile/hbase_site.xml"
]
Still I get the same error. Can anyone please help?
Thanks
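One thing worth checking in the spark-submit parameter list above: spark-submit treats everything after the application jar as arguments to the application's main method, so `--packages`, `--repositories` and `--files` placed after the jar are not read as spark-submit options. Whether this explains the error is not certain, but the ordering is easy to fix. The sketch below moves such options in front of the jar. It assumes every option in `VALUE_OPTIONS` takes exactly one value; `reorder_submit_params` and `VALUE_OPTIONS` are hypothetical names.

```python
# Options assumed to take exactly one value each.
VALUE_OPTIONS = {"--class", "--jars", "--packages", "--repositories", "--files", "--conf"}


def reorder_submit_params(params):
    """Move spark-submit options that trail the application jar in front of it."""
    params = list(params)
    # Walk the leading "--option value" pairs to find the application jar.
    i = 0
    while i < len(params) and params[i] in VALUE_OPTIONS:
        i += 2
    if i >= len(params):
        return params  # no application jar found; leave the list unchanged
    leading, resource, rest = params[:i], params[i], params[i + 1:]
    moved, app_args = [], []
    j = 0
    while j < len(rest):
        if rest[j] in VALUE_OPTIONS and j + 1 < len(rest):
            moved += rest[j:j + 2]  # option and its value go before the jar
            j += 2
        else:
            app_args.append(rest[j])  # genuine application argument
            j += 1
    return leading + moved + [resource] + app_args
```

Applied to the list in the question, this puts the jar path last, after `--files` and its value. An application that legitimately accepts an argument named like one of these options would need a different approach.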

Cannot connect kafka server to spark

I am trying to stream data from a Kafka server to Spark, and as you might guess, it fails. I use Spark 2.2.0 and kafka_2.11-0.11.0.1. I added the jars to the Eclipse project and ran the code below.
package com.defne
import java.nio.ByteBuffer
import scala.util.Random
import org.apache.spark._
import org.apache.spark.streaming.dstream._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import org.apache.log4j.Level
import java.util.regex.Pattern
import java.util.regex.Matcher
import kafka.serializer.StringDecoder
import Utilities._
import org.apache.spark.streaming.kafka
import org.apache.spark.streaming.kafka.KafkaUtils
object KafkaExample {
def main(args: Array[String]) {
val ssc = new StreamingContext("local[*]", "KafkaExample", Seconds(1))
val kafkaParams = Map("metadata.broker.list" -> "kafkaIP:9092", "group.id" -> "console-consumer-9526", "zookeeper.connect" -> "localhost:2181")
val topics = List("logstash_log").toSet
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics).map(_._2)
lines.print()
ssc.checkpoint("C:/checkpoint/")
ssc.start()
ssc.awaitTermination()
}
}
I got the output below. Interestingly, there is no obvious connection error, but somehow I cannot connect to the Kafka server.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/01 10:16:55 INFO SparkContext: Running Spark version 2.2.0
17/11/01 10:16:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/01 10:16:56 INFO SparkContext: Submitted application: KafkaExample
17/11/01 10:16:56 INFO SecurityManager: Changing view acls to: user
17/11/01 10:16:56 INFO SecurityManager: Changing modify acls to: user
17/11/01 10:16:56 INFO SecurityManager: Changing view acls groups to:
17/11/01 10:16:56 INFO SecurityManager: Changing modify acls groups to:
17/11/01 10:16:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); groups with view permissions: Set(); users with modify permissions: Set(user); groups with modify permissions: Set()
17/11/01 10:16:58 INFO Utils: Successfully started service 'sparkDriver' on port 53749.
17/11/01 10:16:59 INFO SparkEnv: Registering MapOutputTracker
17/11/01 10:16:59 INFO SparkEnv: Registering BlockManagerMaster
17/11/01 10:16:59 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/11/01 10:16:59 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/11/01 10:16:59 INFO DiskBlockManager: Created local directory at C:\Users\user\AppData\Local\Temp\blockmgr-2fa455d5-ef26-4fb9-ba4b-caf9f2fa3a68
17/11/01 10:16:59 INFO MemoryStore: MemoryStore started with capacity 897.6 MB
17/11/01 10:16:59 INFO SparkEnv: Registering OutputCommitCoordinator
17/11/01 10:16:59 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/11/01 10:17:00 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.56.1:4040
17/11/01 10:17:00 INFO Executor: Starting executor ID driver on host localhost
17/11/01 10:17:00 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 53770.
17/11/01 10:17:00 INFO NettyBlockTransferService: Server created on 192.168.56.1:53770
17/11/01 10:17:00 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/11/01 10:17:00 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.56.1, 53770, None)
17/11/01 10:17:00 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.56.1:53770 with 897.6 MB RAM, BlockManagerId(driver, 192.168.56.1, 53770, None)
17/11/01 10:17:00 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.56.1, 53770, None)
17/11/01 10:17:00 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.56.1, 53770, None)
17/11/01 10:17:01 INFO VerifiableProperties: Verifying properties
17/11/01 10:17:01 INFO VerifiableProperties: Property group.id is overridden to console-consumer-9526
17/11/01 10:17:01 INFO VerifiableProperties: Property zookeeper.connect is overridden to localhost:2181
17/11/01 10:17:02 INFO SimpleConsumer: Reconnect due to error:
java.lang.NoSuchMethodError: org.apache.kafka.common.network.NetworkSend.<init>(Ljava/lang/String;Ljava/nio/ByteBuffer;)V
at kafka.network.RequestOrResponseSend.<init>(RequestOrResponseSend.scala:41)
at kafka.network.RequestOrResponseSend.<init>(RequestOrResponseSend.scala:44)
at kafka.network.BlockingChannel.send(BlockingChannel.scala:114)
at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:88)
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:86)
at kafka.consumer.SimpleConsumer.send(SimpleConsumer.scala:114)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$getPartitionMetadata$1.apply(KafkaCluster.scala:126)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$getPartitionMetadata$1.apply(KafkaCluster.scala:125)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers$1.apply(KafkaCluster.scala:346)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers$1.apply(KafkaCluster.scala:342)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at org.apache.spark.streaming.kafka.KafkaCluster.org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers(KafkaCluster.scala:342)
at org.apache.spark.streaming.kafka.KafkaCluster.getPartitionMetadata(KafkaCluster.scala:125)
at org.apache.spark.streaming.kafka.KafkaCluster.getPartitions(KafkaCluster.scala:112)
at org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)
at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
at com.defne.KafkaExample$.main(KafkaExample.scala:44)
at com.defne.KafkaExample.main(KafkaExample.scala)
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.kafka.common.network.NetworkSend.<init>(Ljava/lang/String;Ljava/nio/ByteBuffer;)V
at kafka.network.RequestOrResponseSend.<init>(RequestOrResponseSend.scala:41)
at kafka.network.RequestOrResponseSend.<init>(RequestOrResponseSend.scala:44)
at kafka.network.BlockingChannel.send(BlockingChannel.scala:114)
at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:101)
at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:86)
at kafka.consumer.SimpleConsumer.send(SimpleConsumer.scala:114)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$getPartitionMetadata$1.apply(KafkaCluster.scala:126)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$getPartitionMetadata$1.apply(KafkaCluster.scala:125)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers$1.apply(KafkaCluster.scala:346)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers$1.apply(KafkaCluster.scala:342)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at org.apache.spark.streaming.kafka.KafkaCluster.org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers(KafkaCluster.scala:342)
at org.apache.spark.streaming.kafka.KafkaCluster.getPartitionMetadata(KafkaCluster.scala:125)
at org.apache.spark.streaming.kafka.KafkaCluster.getPartitions(KafkaCluster.scala:112)
at org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)
at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
at com.defne.KafkaExample$.main(KafkaExample.scala:44)
at com.defne.KafkaExample.main(KafkaExample.scala)
17/11/01 10:17:02 INFO SparkContext: Invoking stop() from shutdown hook
17/11/01 10:17:02 INFO SparkUI: Stopped Spark web UI at http://192.168.56.1:4040
17/11/01 10:17:02 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/11/01 10:17:02 INFO MemoryStore: MemoryStore cleared
17/11/01 10:17:02 INFO BlockManager: BlockManager stopped
17/11/01 10:17:02 INFO BlockManagerMaster: BlockManagerMaster stopped
17/11/01 10:17:02 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/11/01 10:17:02 INFO SparkContext: Successfully stopped SparkContext
17/11/01 10:17:02 INFO ShutdownHookManager: Shutdown hook called
17/11/01 10:17:02 INFO ShutdownHookManager: Deleting directory C:\Users\user\AppData\Local\Temp\spark-a584950c-10ca-422b-990e-fd1980e2260c
Any help will be greatly appreciated.
Note: metadata.broker.list takes hostname:port entries for the Kafka brokers, not for ZooKeeper.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
val ssc = new StreamingContext(new SparkConf, Seconds(60))
// hostname:port for Kafka brokers, not Zookeeper
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")
val topics = Set("sometopic", "anothertopic")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
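As a small illustration of the expected value format, the sketch below splits a broker list such as the one above into host and port pairs and rejects entries without a port. It only checks the shape of the string; it cannot tell a Kafka broker port from a ZooKeeper port. `parse_broker_list` is a hypothetical helper name, not part of Spark or Kafka.

```python
def parse_broker_list(value):
    """Split "host1:port1,host2:port2" into [(host, port), ...]."""
    brokers = []
    for item in value.split(","):
        # rpartition splits on the last colon, so the port is always the tail.
        host, sep, port = item.strip().rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError("expected host:port, got %r" % item)
        brokers.append((host, int(port)))
    return brokers
```

For example, `parse_broker_list("localhost:9092,anotherhost:9092")` yields two `(host, port)` pairs, while a bare `"localhost"` raises `ValueError`.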

mongo-spark error while loading data from mongo through spark using spark stratio connector

I am trying to load MongoDB data into Spark using the Stratio Spark-MongoDB connector (version spark-mongodb_2.11-0.12.0), and I have added all the necessary dependencies. I want to create an RDD by loading data from my local MongoDB instance. Below is my code:
import org.apache.spark.SparkContext
import org.apache.spark.sql._
import com.mongodb.casbah.{WriteConcern => MongodbWriteConcern}
import com.stratio.datasource.mongodb._
import com.stratio.datasource.mongodb.config._
import com.stratio.datasource.mongodb.config.MongodbConfig._
import org.apache.spark.sql.SparkSession
import akka.actor.ActorSystem
import org.apache.spark.SparkConf
object newtest {
def main(args: Array[String]) {
System.setProperty("hadoop.home.dir", "C:\\winutil\\");
import org.apache.spark.sql.functions._
val sparkSession = SparkSession.builder().master("local").getOrCreate()
//spark.conf.set("spark.executor.memory", "2g")
val builder = MongodbConfigBuilder(Map(Host -> List("localhost:27017"), Database -> "test", Collection ->"SurvyAnswer", SamplingRatio -> 1.0, WriteConcern -> "normal"))
val readConfig = builder.build()
val columns=Array("GroupId", "_Id", "hgId")
val mongoRDD = sparkSession.sqlContext.fromMongoDB(readConfig)
mongoRDD.take(2).foreach(println)
}
}
The connection fails with the error below, and I don't understand why:
17/02/21 14:45:45 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
17/02/21 14:45:45 INFO SharedState: Warehouse path is 'file:/C:/Users/gbhog/Desktop/BDG/example/mongospark/spark-warehouse'.
17/02/21 14:45:48 INFO cluster: Cluster created with settings {hosts=[localhost:27017], mode=MULTIPLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
17/02/21 14:45:48 INFO cluster: Adding discovered server localhost:27017 to client view of cluster
Exception in thread "main" java.lang.NoSuchFieldError: NONE
at com.mongodb.casbah.WriteConcern$.<init>(WriteConcern.scala:40)
at com.mongodb.casbah.WriteConcern$.<clinit>(WriteConcern.scala)
at com.mongodb.casbah.BaseImports$class.$init$(Implicits.scala:162)
at com.mongodb.casbah.Imports$.<init>(Implicits.scala:142)
at com.mongodb.casbah.Imports$.<clinit>(Implicits.scala)
at com.mongodb.casbah.MongoClient.apply(MongoClient.scala:217)
at com.stratio.datasource.mongodb.partitioner.MongodbPartitioner.isShardedCollection(MongodbPartitioner.scala:78)
at com.stratio.datasource.mongodb.partitioner.MongodbPartitioner$$anonfun$computePartitions$1.apply(MongodbPartitioner.scala:67)
at com.stratio.datasource.mongodb.partitioner.MongodbPartitioner$$anonfun$computePartitions$1.apply(MongodbPartitioner.scala:66)
at com.stratio.datasource.mongodb.util.usingMongoClient$.apply(usingMongoClient.scala:27)
at com.stratio.datasource.mongodb.partitioner.MongodbPartitioner.computePartitions(MongodbPartitioner.scala:66)
17/02/21 14:45:48 INFO SparkContext: Invoking stop() from shutdown hook
17/02/21 14:45:48 INFO SparkUI: Stopped Spark web UI at http://192.168.242.1:4040
17/02/21 14:45:48 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/02/21 14:45:49 INFO MemoryStore: MemoryStore cleared
17/02/21 14:45:49 INFO BlockManager: BlockManager stopped
17/02/21 14:45:49 INFO BlockManagerMaster: BlockManagerMaster stopped
17/02/21 14:45:49 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/02/21 14:45:49 INFO SparkContext: Successfully stopped SparkContext
17/02/21 14:45:49 INFO ShutdownHookManager: Shutdown hook called

Why does my Spark Streaming application shut down immediately (and not process any Kafka records)?

I've created a Spark application in Python following the example described in Spark Streaming + Kafka Integration Guide (Kafka broker version 0.8.2.1 or higher) to stream Kafka messages using Apache Spark, but it's shutting down before I get the chance to send any messages.
This is where the shutdown section begins in the output.
16/11/26 17:11:06 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 1********6, 58045)
16/11/26 17:11:06 INFO VerifiableProperties: Verifying properties
16/11/26 17:11:06 INFO VerifiableProperties: Property group.id is overridden to
16/11/26 17:11:06 INFO VerifiableProperties: Property zookeeper.connect is overridden to
16/11/26 17:11:07 INFO SparkContext: Invoking stop() from shutdown hook
16/11/26 17:11:07 INFO SparkUI: Stopped Spark web UI at http://192.168.1.16:4040
16/11/26 17:11:07 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/11/26 17:11:07 INFO MemoryStore: MemoryStore cleared
16/11/26 17:11:07 INFO BlockManager: BlockManager stopped
16/11/26 17:11:07 INFO BlockManagerMaster: BlockManagerMaster stopped
16/11/26 17:11:07 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/11/26 17:11:07 INFO SparkContext: Successfully stopped SparkContext
16/11/26 17:11:07 INFO ShutdownHookManager: Shutdown hook called
16/11/26 17:11:07 INFO ShutdownHookManager: Deleting directory /private/var/folders/yn/t3pvrk7s231_11ff2lqr4jhr0000gn/T/spark-1876feee-9b71-413e-a505-99c414aafabf/pyspark-1d97c3dd-0889-42ed-b559-d0fd473faa22
16/11/26 17:11:07 INFO ShutdownHookManager: Deleting directory /private/var/folders/yn/t3pvrk7s231_11ff2lqr4jhr0000gn/T/spark-1876feee-9b71-413e-a505-99c414aafabf
Is there a way I should tell it to wait or am I missing something?
Full code:
from pyspark.streaming.kafka import KafkaUtils
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext("local[2]", "TwitterWordCount")
ssc = StreamingContext(sc, 1)
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["next"], {"metadata.broker.list": "localhost:9092"})
offsetRanges = []
def storeOffsetRanges(rdd):
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    return rdd
def printOffsetRanges(rdd):
    for o in offsetRanges:
        # The format arguments must be passed as a single tuple.
        print("Printing! %s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset))
directKafkaStream\
.transform(storeOffsetRanges)\
.foreachRDD(printOffsetRanges)
And here's the command to run it in case that's helpful.
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 producer.py
You also need to start the streaming context and wait for it to terminate. Take a look at this example:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#a-quick-example
ssc.start() # Start the computation
ssc.awaitTermination() # Wait for the computation to terminate
For Scala with Structured Streaming, when submitting to YARN in cluster mode, I had to use awaitAnyTermination:
query.start()
sparkSession.streams.awaitAnyTermination()
This follows, roughly, the Quick Example about halfway through the Structured Streaming Guide.
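A detail to watch in a callback like `printOffsetRanges`: the `%` operator binds more tightly than the comma, so the values for a format string with several placeholders have to be passed as one tuple. Otherwise Python tries to format the string with only the first value. The sketch below shows the tuple form; `OffsetRange` here is a stand-in namedtuple for illustration, not the PySpark class, and `describe` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in for the offset range objects returned by rdd.offsetRanges().
OffsetRange = namedtuple("OffsetRange", "topic partition fromOffset untilOffset")


def describe(o):
    """Format one offset range; note the parentheses around the arguments."""
    return "Printing! %s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset)
```

With this, `describe(OffsetRange("next", 0, 5, 9))` produces a single string containing all four fields.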

How to pass data from Kafka to Spark Streaming?

I am trying to pass data from kafka to spark streaming.
This is what I've done till now:
Installed both kafka and spark
Started zookeeper with default properties config
Started kafka server with default properties config
Started kafka producer
Started kafka consumer
Sent message from producer to consumer. Works fine.
Wrote kafka-spark.py to receive messages from kafka to spark.
When I run ./bin/spark-submit examples/src/main/python/kafka-spark.py, I get an error.
kafka-spark.py:
from __future__ import print_function
import sys
from pyspark.streaming import StreamingContext
from pyspark import SparkContext,SparkConf
from pyspark.streaming.kafka import KafkaUtils
if __name__ == "__main__":
    #conf = SparkConf().setAppName("Kafka-Spark").setMaster("spark://127.0.0.1:7077")
    conf = SparkConf().setAppName("Kafka-Spark")
    #sc = SparkContext(appName="KafkaSpark")
    sc = SparkContext(conf=conf)
    stream=StreamingContext(sc,1)
    map1={'spark-kafka':1}
    kafkaStream = KafkaUtils.createStream(stream, 'localhost:9092', "name", map1) #tried with localhost:2181 too
    print("kafkastream=",kafkaStream)
    sc.stop()
Full log, including the error, from running kafka-spark.py:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/18 13:05:33 INFO SparkContext: Running Spark version 1.6.0
16/01/18 13:05:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/18 13:05:33 INFO SecurityManager: Changing view acls to: username
16/01/18 13:05:33 INFO SecurityManager: Changing modify acls to: username
16/01/18 13:05:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(username); users with modify permissions: Set(username)
16/01/18 13:05:33 INFO Utils: Successfully started service 'sparkDriver' on port 54446.
16/01/18 13:05:34 INFO Slf4jLogger: Slf4jLogger started
16/01/18 13:05:34 INFO Remoting: Starting remoting
16/01/18 13:05:34 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem#127.0.0.1:50386]
16/01/18 13:05:34 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 50386.
16/01/18 13:05:34 INFO SparkEnv: Registering MapOutputTracker
16/01/18 13:05:34 INFO SparkEnv: Registering BlockManagerMaster
16/01/18 13:05:34 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-f5490271-cdb7-467d-a915-4f5ccab57f0e
16/01/18 13:05:34 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
16/01/18 13:05:34 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/18 13:05:34 INFO Server: jetty-8.y.z-SNAPSHOT
16/01/18 13:05:34 INFO AbstractConnector: Started SelectChannelConnector#0.0.0.0:4040
16/01/18 13:05:34 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/18 13:05:34 INFO SparkUI: Started SparkUI at http://127.0.0.1:4040
Java HotSpot(TM) Server VM warning: You have loaded library /tmp/libnetty-transport-native-epoll561240765619860252.so which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
16/01/18 13:05:34 INFO Utils: Copying ~/Dropbox/Work/ITNow/spark/spark-1.6.0/examples/src/main/python/kafka-spark.py to /tmp/spark-18227081-a1c8-43f2-8ca7-cfc4751f023f/userFiles-e93fc252-0ba1-42b7-b4fa-2e46f3a0601e/kafka-spark.py
16/01/18 13:05:34 INFO SparkContext: Added file file:~/Dropbox/Work/ITNow/spark/spark-1.6.0/examples/src/main/python/kafka-spark.py at file:~/Dropbox/Work/ITNow/spark/spark-1.6.0/examples/src/main/python/kafka-spark.py with timestamp 1453118734892
16/01/18 13:05:35 INFO Executor: Starting executor ID driver on host localhost
16/01/18 13:05:35 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 58970.
16/01/18 13:05:35 INFO NettyBlockTransferService: Server created on 58970
16/01/18 13:05:35 INFO BlockManagerMaster: Trying to register BlockManager
16/01/18 13:05:35 INFO BlockManagerMasterEndpoint: Registering block manager localhost:58970 with 511.1 MB RAM, BlockManagerId(driver, localhost, 58970)
16/01/18 13:05:35 INFO BlockManagerMaster: Registered BlockManager
________________________________________________________________________________________________
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka:1.6.0 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-assembly, Version = 1.6.0.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-assembly.jar> ...
________________________________________________________________________________________________
Traceback (most recent call last):
File "~/Dropbox/Work/ITNow/spark/spark-1.6.0/examples/src/main/python/kafka-spark.py", line 33, in <module>
kafkaStream = KafkaUtils.createStream(stream, 'localhost:9092', "name", map1)
File "~/Dropbox/Work/ITNow/spark/spark-1.6.0/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 80, in createStream
py4j.protocol.Py4JJavaError: An error occurred while calling o22.loadClass.
: java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
16/01/18 13:05:35 INFO SparkContext: Invoking stop() from shutdown hook
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/metrics/json,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/api,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/threadDump,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/json,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment/json,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/json,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool/json,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/json,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/json,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/job/json,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/job,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/json,null}
16/01/18 13:05:35 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs,null}
16/01/18 13:05:35 INFO SparkUI: Stopped Spark web UI at http://127.0.0.1:4040
16/01/18 13:05:35 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/01/18 13:05:35 INFO MemoryStore: MemoryStore cleared
16/01/18 13:05:35 INFO BlockManager: BlockManager stopped
16/01/18 13:05:35 INFO BlockManagerMaster: BlockManagerMaster stopped
16/01/18 13:05:35 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/01/18 13:05:35 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/01/18 13:05:35 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/01/18 13:05:35 INFO SparkContext: Successfully stopped SparkContext
16/01/18 13:05:35 INFO ShutdownHookManager: Shutdown hook called
16/01/18 13:05:35 INFO ShutdownHookManager: Deleting directory /tmp/spark-18227081-a1c8-43f2-8ca7-cfc4751f023f
16/01/18 13:05:35 INFO ShutdownHookManager: Deleting directory /tmp/spark-18227081-a1c8-43f2-8ca7-cfc4751f023f/pyspark-fcd47a97-57ef-46c3-bb16-357632580334
EDIT
When I run ./bin/spark-submit --jars spark-streaming-kafka-assembly_2.10-1.6.0.jar examples/src/main/python/kafka-spark.py, I get the object's hexadecimal memory address instead of the actual messages:
kafkastream= <pyspark.streaming.dstream.TransformedDStream object at 0x7fd6c4dad150>
Any idea what I am doing wrong? I'm really new to Kafka and Spark, so I need some help here. Thanks!
You need to submit spark-streaming-kafka-assembly_*.jar with your job:
spark-submit --jars spark-streaming-kafka-assembly_2.10-1.5.2.jar ./spark-kafka.py
Alternatively, if you want to also specify resources to be allocated at the same time:
spark-submit --deploy-mode cluster --master yarn --num-executors 5 --executor-cores 5 --executor-memory 20g --jars spark-streaming-kafka-assembly_2.10-1.6.0.jar ./spark-kafka.py
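As a rough illustration of how these commands are put together, here is a small sketch that builds the assembly jar name and the spark-submit argument list. The helper names are hypothetical, and the Scala suffix 2.10 is only assumed from the jar names used in this thread. The error message in the question's log asks for version 1.6.0, so it is generally safest to match the jar version to the Spark version you are running.

```python
def kafka_assembly_jar(spark_version, scala_version="2.10"):
    """Return the assembly jar file name for a given Spark/Scala version."""
    return "spark-streaming-kafka-assembly_{}-{}.jar".format(
        scala_version, spark_version
    )


def spark_submit_command(script, jars, master=None):
    """Build a spark-submit argument list; --jars takes a comma-separated list."""
    cmd = ["spark-submit"]
    if master:
        cmd += ["--master", master]
    cmd += ["--jars", ",".join(jars), script]
    return cmd


# Example: the jar matching the Spark 1.6.0 installation from the log.
command = spark_submit_command("./spark-kafka.py", [kafka_assembly_jar("1.6.0")])
print(" ".join(command))
```

The resulting list can be passed to subprocess.call, or joined with spaces and pasted into a shell.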
If you want to run your code in a Jupyter notebook, this may help:
from __future__ import print_function
import os  # needed for os.environ below
import sys
from pyspark.streaming import StreamingContext
from pyspark import SparkContext, SparkConf
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    # Must be set before the SparkContext is created.
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars spark-streaming-kafka-assembly_2.10-1.6.0.jar pyspark-shell'  # note that the "pyspark-shell" part is very important!
    #conf = SparkConf().setAppName("Kafka-Spark").setMaster("spark://127.0.0.1:7077")
    conf = SparkConf().setAppName("Kafka-Spark")
    #sc = SparkContext(appName="KafkaSpark")
    sc = SparkContext(conf=conf)
    stream = StreamingContext(sc, 1)
    map1 = {'spark-kafka': 1}
    kafkaStream = KafkaUtils.createStream(stream, 'localhost:9092', "name", map1)  # tried with localhost:2181 too
    print("kafkastream=", kafkaStream)
    sc.stop()
Note the introduction of the following line in __main__:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars spark-streaming-kafka-assembly_2.10-1.6.0.jar pyspark-shell'
Sources: https://github.com/jupyter/docker-stacks/issues/154
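If you need more than one jar in the notebook, the variable can be built programmatically. This is a minimal sketch with a hypothetical helper name; the only things taken from the answer above are the PYSPARK_SUBMIT_ARGS variable and the trailing pyspark-shell token.

```python
import os


def set_pyspark_submit_args(jars):
    """Set PYSPARK_SUBMIT_ARGS for a list of jar paths and return the value.

    Call this before the first SparkContext is created; the variable is
    read when PySpark launches the JVM, so later changes have no effect.
    """
    args = "--jars {} pyspark-shell".format(",".join(jars))
    os.environ["PYSPARK_SUBMIT_ARGS"] = args
    return args


print(set_pyspark_submit_args(["spark-streaming-kafka-assembly_2.10-1.6.0.jar"]))
```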
To print a DStream, Spark provides the pprint method in Python, so you would use:
kafkaStream.pprint()
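Putting the pieces together, here is a hedged sketch of how the question's script could look. It differs from the original in three ways: the second argument to createStream is the ZooKeeper address (localhost:2181 is assumed here, the ZooKeeper default) rather than the broker address, pprint replaces print, and the streaming context is started and awaited instead of stopping the SparkContext right away. The names record_value and run are hypothetical.

```python
def record_value(record):
    """Kafka records arrive as (key, value) tuples; keep only the value."""
    return record[1]


def run(zk_quorum="localhost:2181", group_id="name", topics=None):
    """Print incoming Kafka messages once per batch until interrupted."""
    # Imported lazily so record_value can be used without Spark installed.
    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    # Topic name -> number of receiver threads, as in the question.
    topics = topics or {"spark-kafka": 1}

    sc = SparkContext(conf=SparkConf().setAppName("Kafka-Spark"))
    ssc = StreamingContext(sc, 1)  # 1-second batches

    # The receiver-based createStream expects the ZooKeeper quorum,
    # not the Kafka broker address.
    kafka_stream = KafkaUtils.createStream(ssc, zk_quorum, group_id, topics)
    kafka_stream.map(record_value).pprint()

    ssc.start()             # nothing is consumed before this call
    ssc.awaitTermination()  # keep the driver alive
```

Calling run() at the end of the script and submitting it with the --jars option shown above should print each batch to the console. When running locally, the receiver occupies one thread, so a master such as local[2] or higher is needed for any output to appear.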

Resources