PySpark Structured Streaming locally with Kafka-Jupyter - apache-spark

After looking at the other answers, I still can't figure it out.
I am able to use KafkaProducer and KafkaConsumer to send and receive messages from within my notebook.
from kafka import KafkaProducer, KafkaConsumer
import json
producer = KafkaProducer(bootstrap_servers=['127.0.0.1:9092'], value_serializer=lambda m: json.dumps(m).encode('ascii'))
consumer = KafkaConsumer('hr', bootstrap_servers=['127.0.0.1:9092'], group_id='abc')
I've tried to connect to the stream with both SparkContext and SparkSession.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext("local[*]", "stream")
ssc = StreamingContext(sc, 1)
This gives me the following error:
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.2 ...
It seems that I needed to add the JAR to my spark-submit call:
!/usr/local/bin/spark-submit --master local[*] /usr/local/Cellar/apache-spark/2.3.0/libexec/jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar pyspark-shell
which returns
Error: No main class set in JAR; please specify one with --class
Run with --help for usage help or --verbose for debug output
What class do I put in?
How do I get PySpark to connect to the consumer?

The command you have is trying to run spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar as the application, and to find pyspark-shell as a Java class inside of it.
As the first error says, you are missing --packages after spark-submit, which means you would do
spark-submit --packages ... someApp.jar com.example.YourClass
If you are just working locally in Jupyter, you may want to try kafka-python, for example, rather than PySpark: less overhead, and no Java dependencies.
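If you do want Structured Streaming from the notebook, a minimal sketch would look like the following. It assumes no SparkSession is already running in the kernel, a Spark 2.3.0 install built for Scala 2.11 (as in the paths above), and the hr topic and broker address from the question; everything else is illustrative. The idea is to load spark-sql-kafka through spark.jars.packages when the session is created:
from pyspark.sql import SparkSession

# The Kafka source package must be on the classpath before the JVM starts,
# so set it while building the session (or via PYSPARK_SUBMIT_ARGS).
spark = (SparkSession.builder
         .master("local[*]")
         .appName("stream")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0")
         .getOrCreate())

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "127.0.0.1:9092")
      .option("subscribe", "hr")
      .load())

# Print the message values to the console as they arrive.
query = (df.selectExpr("CAST(value AS STRING)")
         .writeStream
         .format("console")
         .outputMode("append")
         .start())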

Related

neo4j spark connector doesn't work correctly

I want to integrate Spark GraphX with Neo4j using [1].
I tried to follow the steps in [2], but it doesn't work.
What exactly should I do with the neo4j-connector-apache-spark_2.12-4.0.0.jar file? I put it in the jars folder of the Spark installation.
In bash I write:
C:>Spark\spark-3.1.1-bin-hadoop2.7\bin\spark-shell --jars neo4j-connector-apache-spark_2.12-4.0.0.jar
Any suggestions please?
Update no. 1
I tried this:
C:\Spark\spark-3.1.1-bin-hadoop2.7\bin\spark-shell --packages neo4j-contrib:neo4j-connector-apache-spark_2.12:4.0.0
I think it works, but when I want to write the DataFrame to nodes of type Person in spark-shell:
import org.apache.spark.sql.{SaveMode, SparkSession}
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val df = Seq(
("John Doe"),
("Jane Doe")
).toDF("name")
df.write.format("org.neo4j.spark.DataSource")
.mode(SaveMode.ErrorIfExists)
.option("url", "bolt://localhost:7687")
.option("authentication.basic.username", "neo4j")
.option("authentication.basic.password", "neo4j")
.option("labels", ":Person")
.save()
It raises errors. What should I do?
Update no. 2
I followed the steps in [3] and it gives an error when entering this:
val neo = Neo4j(sc)
as follow:
error: not found: value Neo4j
Use:
$SPARK_HOME\bin\spark-shell --conf spark.neo4j.password=<password> --packages neo4j-contrib:neo4j-spark-connector:2.4.5-M2
instead of:
$SPARK_HOME\bin\spark-shell --conf spark.neo4j.bolt.password=<password> --packages neo4j-contrib:neo4j-spark-connector:2.4.5-M2
Just remove the bolt word.
Update no. 3
Now I want to use the following package:
$SPARK_HOME/bin/spark-shell --packages neo4j-contrib:neo4j-connector-apache-spark_2.12:4.0.1_for_spark_3
As mentioned in [1].
The only one that works is the following (the old version):
$SPARK_HOME/bin/spark-shell --packages neo4j-contrib:neo4j-spark-connector:2.4.5-M2
But using it, Neo4jGraph.saveGraph is not working. The error is: Writing in read access mode not allowed.
Thanks for your help.
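For what it's worth, the same write can also be expressed from PySpark. This is only a rough sketch mirroring the Scala snippet above; the format name, option keys, and the 4.0.1_for_spark_3 package coordinate are taken from the question, everything else is illustrative:
from pyspark.sql import SparkSession

# Pull the connector in at session start instead of copying the JAR by hand.
spark = (SparkSession.builder
         .config("spark.jars.packages",
                 "neo4j-contrib:neo4j-connector-apache-spark_2.12:4.0.1_for_spark_3")
         .getOrCreate())

df = spark.createDataFrame([("John Doe",), ("Jane Doe",)], ["name"])

(df.write
   .format("org.neo4j.spark.DataSource")
   .mode("ErrorIfExists")  # same SaveMode as the Scala example
   .option("url", "bolt://localhost:7687")
   .option("authentication.basic.username", "neo4j")
   .option("authentication.basic.password", "neo4j")
   .option("labels", ":Person")
   .save())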

Spark Streaming - Netcat messages are not received in Spark streaming

I am trying to test Spark Streaming. I have a standalone Cloudera QuickStart VM and started the spark-shell with the following command:
spark-shell --master yarn-client --conf spark.ui.port=23123
In the spark-shell I executed the following statements:
sc.stop()
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
val conf = new SparkConf().setAppName("Spark Streaming")
val ssc = new StreamingContext(conf,org.apache.spark.streaming.Seconds(10))
val lines = ssc.socketTextStream("localhost",44444)
lines.print
In another terminal I started the netcat service with the following command:
nc -lk 44444
In the spark-shell I started the streaming context:
ssc.start()
Till now everything is fine, but whatever messages I type into the netcat session are not received in Spark Streaming. I don't know where it is going wrong.
Try spark-shell --master local[2] --conf spark.ui.port=23123 to see if it works.
If it works, then in your original setup there is only one executor core working: it is busy receiving messages, so none is left to process them.
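The same experiment translated to PySpark, as a rough sketch (the port and the 10-second batch interval are taken from the question; local[2] keeps one thread for the socket receiver and one for processing):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# local[2]: one thread receives from the socket, the other processes batches.
sc = SparkContext("local[2]", "Spark Streaming")
ssc = StreamingContext(sc, 10)  # 10-second batches

lines = ssc.socketTextStream("localhost", 44444)
lines.pprint()  # print each batch to the console

ssc.start()
ssc.awaitTermination()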

Why does Structured Streaming fail with "java.lang.IncompatibleClassChangeError: Implementing class"?

I'd like to run a Spark application using Structured Streaming with PySpark.
I use Spark 2.2 with Kafka 0.10 version.
I fail with the following error:
java.lang.IncompatibleClassChangeError: Implementing class
The spark-submit command used is shown below:
/bin/spark-submit \
--packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 \
--master local[*] \
/home/umar/structured_streaming.py localhost:2181 fortesting
structured_streaming.py code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StructuredStreaming").config("spark.driver.memory", "2g").config("spark.executor.memory", "2g").getOrCreate()
raw_DF = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:2181").option("subscribe", "fortesting").load()
values = raw_DF.selectExpr("CAST(value AS STRING)")
values.writeStream.trigger(processingTime="5 seconds").outputMode("append").format("console").start().awaitTermination()
You need spark-sql-kafka for structured streaming:
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
Also make sure that you use the same versions of Scala (2.11 above) and Spark (2.2.0) as you use on your cluster.
Please refer to the Structured Streaming + Kafka integration guide.
You're using spark-streaming-kafka-0-10, which currently does not support Python.
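To illustrate the version-matching advice above, a small sketch (it assumes the pre-built Spark 2.2.x distribution, which ships with Scala 2.11; adjust the Scala suffix if your build differs):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

scala_binary = "2.11"  # assumption: pre-built Spark 2.2.x artifacts use Scala 2.11
coordinate = "org.apache.spark:spark-sql-kafka-0-10_{}:{}".format(scala_binary, spark.version)
print(coordinate)  # e.g. org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0

# Pass this coordinate to spark-submit via --packages.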

Spark 2.0 - Dataset<Row> Write to Parquet in Java

I want to write a Dataset into a Parquet file in Java. I use:
Dataset<Row> ds = getDataFrame();
ds.write().parquet("data.parquet");
This code is run by the spark-submit command given below:
sudo spark-submit --class getdata --master yarn --num-executors 4 --executor-cores 1 --jars guava-14.0.1.jar,hadoop-common-2.7.3.jar,hbase-client-1.3.0.jar,hbase-common-1.3.0.jar,hbase-protocol-1.3.0.jar,log4j-1.2.17.jar,metrics-core-2.2.0.jar,ojdbc6.jar,spark-core_2.11-2.0.2.jar,spark-assembly.jar,spark-sql_2.11-2.0.2.jar,hive-beeline-1.2.1.spark2.jar,hive-cli-1.2.1.spark2.jar,hive-exec-1.2.1.spark2.jar,hive-jdbc-1.2.1.spark2.jar,hive-metastore-1.2.1.spark2.jar,parquet-column-1.7.0.jar,parquet-common-1.7.0.jar,parquet-encoding-1.7.0.jar,parquet-format-2.3.0-incubating.jar,parquet-generator-1.7.0.jar,parquet-hadoop-1.7.0.jar,parquet-hadoop-bundle-1.6.0.jar,parquet-hive-1.0.1.jar,parquet-jackson-1.7.0.jar,spark-hive_2.11-2.0.2.jar getdata.jar
I get the following exception.
Exception in thread "main" java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.hive.orc.DefaultSource could not be instantiated
What am I missing? Please help.
A ServiceLoader finds DefaultSource implementations on the classpath and invokes a constructor that returns a type which doesn't correspond to the expected return type.
An OrcRelation is returned where a HadoopFsRelation is expected, but OrcRelation doesn't implement HadoopFsRelation. It might be a version conflict, since I can't find HadoopFsRelation in 2.1.0, while it is there in older versions (e.g. 1.6.0).
Do you have multiple Spark versions on your classpath, or mixed Spark/Hive implementations?

PySpark distributed processing on a YARN cluster

I have Spark running on a Cloudera CDH5.3 cluster, using YARN as the resource manager. I am developing Spark apps in Python (PySpark).
I can submit jobs and they run successfully; however, they never seem to run on more than one machine (the local machine I submit from).
I have tried a variety of options, like setting --deploy-mode to cluster and --master to yarn-client and yarn-cluster, yet it never seems to run on more than one server.
I can get it to run on more than one core by passing something like --master local[8], but that obviously doesn't distribute the processing over multiple nodes.
I have a very simple Python script processing data from HDFS, like so:
import simplejson as json
from pyspark import SparkContext
sc = SparkContext("", "Joe Counter")
rrd = sc.textFile("hdfs:///tmp/twitter/json/data/")
data = rrd.map(lambda line: json.loads(line))
joes = data.filter(lambda tweet: "Joe" in tweet.get("text",""))
print joes.count()
And I am running a submit command like:
spark-submit atest.py --deploy-mode client --master yarn-client
What can I do to ensure the job runs in parallel across the cluster?
Can you swap the arguments for the command?
spark-submit --deploy-mode client --master yarn-client atest.py
If you see the help text for the command:
spark-submit
Usage: spark-submit [options] <app jar | python file>
I believe #MrChristine is correct -- the option flags you specify are being passed to your Python script, not to spark-submit. In addition, you'll want to specify --executor-cores and --num-executors, since by default it will run on a single core and use two executors.
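A rough end-to-end sketch of that advice (atest.py, the Joe filter, and yarn-client come from the question; the executor numbers are just an example): leave the master out of the script so spark-submit's flags take effect, and put the options before the file name.
# atest.py - let spark-submit decide the master so YARN can distribute the work
import simplejson as json
from pyspark import SparkContext

sc = SparkContext(appName="Joe Counter")

rdd = sc.textFile("hdfs:///tmp/twitter/json/data/")
data = rdd.map(lambda line: json.loads(line))
joes = data.filter(lambda tweet: "Joe" in tweet.get("text", ""))
print(joes.count())

# Submitted, for example, as:
# spark-submit --master yarn-client --num-executors 4 --executor-cores 1 atest.py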
It's not true that a Python script doesn't run in cluster mode. I am not sure about previous versions, but this executes on Spark 2.2 on a Hortonworks cluster.
Command: spark-submit --master yarn --num-executors 10 --executor-cores 1 --driver-memory 5g /pyspark-example.py
Python code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = (SparkConf()
        .setMaster("yarn")
        .setAppName("retrieve data"))
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
parquetFile = sqlContext.read.parquet("/<hdfs-path>/*.parquet")
parquetFile.createOrReplaceTempView("temp")
df1 = sqlContext.sql("select * from temp limit 5")
df1.show()
df1.write.save('/<hdfs-path>/test.csv', format='csv', mode='append')
sc.stop()
Output: It's big so I am not pasting it, but it runs perfectly.
It seems that PySpark does not run in distributed mode using Spark/YARN - you need to use stand-alone Spark with a Spark Master server. In that case, my PySpark script ran very well across the cluster with a Python process per core/node.
