Spark Streaming - Netcat messages are not received in Spark streaming - apache-spark

I am trying to test Spark Streaming on a standalone Cloudera QuickStart VM. I started the spark-shell with the following command:
spark-shell --master yarn-client --conf spark.ui.port=23123
In the spark-shell I executed the following statements:
sc.stop()
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
val conf = new SparkConf().setAppName("Spark Streaming")
val ssc = new StreamingContext(conf,org.apache.spark.streaming.Seconds(10))
val lines = ssc.socketTextStream("localhost",44444)
lines.print
In another terminal I started netcat with the following command:
nc -lk 44444
Then, in the spark-shell, I started the streaming context:
ssc.start()
Up to this point everything is fine. However, the messages I type into the netcat session are not received by Spark Streaming. I don't know where it is going wrong.

Try spark-shell --master local[2] --conf spark.ui.port=23123 to see if it works.
If it does, then in your original setup only one executor is working: it is occupied by the receiver, which receives the messages, so no executor is left to process them.
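The starvation described above can be illustrated with a rough analogy in plain Python. This is not Spark code: the thread pool stands in for the available executor slots, and `run_stream`, `receiver` and `process_batch` are hypothetical names made up for this sketch. A long-running receiver holds one slot for as long as the stream is active, so a batch task only runs if a second slot exists.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_stream(slots, messages, wait=0.5):
    """Return the messages processed when the pool has `slots` worker threads."""
    inbox = queue.Queue()
    processed = []
    stop = threading.Event()

    def receiver():
        # Long-running task: buffers input, then holds its slot until stopped.
        for message in messages:
            inbox.put(message)
        stop.wait()

    def process_batch():
        # Short task: consumes what the receiver has buffered.
        for _ in messages:
            processed.append(inbox.get())

    pool = ThreadPoolExecutor(max_workers=slots)
    pool.submit(receiver)
    batch = pool.submit(process_batch)
    try:
        batch.result(timeout=wait)
    except FutureTimeout:
        batch.cancel()  # the batch never got a free slot
    stop.set()
    pool.shutdown(wait=True)
    return processed
```

Calling `run_stream(1, ["hello", "world"])` returns an empty list because the only slot is taken by the receiver, while `run_stream(2, ["hello", "world"])` returns both messages. That is the same reason local[2] is suggested rather than local[1].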

Related

How to connect to remote Cassandra server through pyspark for write operation?

I am trying to connect to a remote Cassandra server through PySpark, but the write to Cassandra does not happen when the script runs as a cron job. The same code works on the server in a Jupyter notebook, but not through cron.
`import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[*] pyspark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:2.5.0 --conf spark.cassandra.connection.host=127.0.0.1 pyspark-shell --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions'
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext("local", "keyspace_name")
sqlContext = SQLContext(sc)
Data_to_Write.write.format("org.apache.spark.sql.cassandra").mode('append').options(table="tablename",keyspace="keyspace_name").save()`
I see this error in the Cassandra logs:
ERROR [Messaging-EventLoop-3-3] 2020-08-05 09:24:36,606 OutboundConnectionInitiator.java:373 - Failed to handshake with peer xx.xxx.xxx.xxx:9042(xx.xxx.xxx.xxx:9042) org.apache.cassandra.net.Crc$InvalidCrc

spark-submit works for yarn-cluster mode but SparkLauncher doesn't, with same params

I'm able to submit a Spark job through spark-submit. However, when I try to do the same programmatically using SparkLauncher, nothing happens (I don't even see a Spark job in the UI).
Below is the scenario:
I have a server (say hostname: cr-hdbc101.dev.local:7123) which hosts the HDFS cluster. I push the fat jar that I'm trying to execute to that server.
The following spark-submit works as expected, and a Spark job is submitted in yarn-cluster mode:
spark-submit \
--verbose \
--class com.digital.StartSparkJob \
--master yarn \
--deploy-mode cluster \
--num-executors 2 \
--driver-memory 2g \
--executor-memory 3g \
--executor-cores 4 \
/usr/share/Deployments/Consolidateservice.jar "<arg_to_main>"
However, the following SparkLauncher code doesn't work:
val sparkLauncher = new SparkLauncher()
sparkLauncher
.setSparkHome("/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark")
.setAppResource("/usr/share/Deployments/Consolidateservice.jar")
.setMaster("yarn-cluster")
.setVerbose(true)
.setMainClass("com.digital.StartSparkJob")
.setDeployMode("cluster")
.setConf("spark.driver.cores", "2")
.setConf("spark.driver.memory", "2g")
.setConf("spark.executor.cores", "4")
.setConf("spark.executor.memory", "3g")
.addAppArgs(<arg_to_main>)
.startApplication()
I thought SparkLauncher might not be getting the correct environment variables, so I passed the following to SparkLauncher (basically everything from spark-env.sh), but to no avail:
val env: java.util.Map[String, String] = new java.util.HashMap[String, String]
env.put("SPARK_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn")
env.put("HADOOP_HOME", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/hadoop")
env.put("YARN_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn/yarn-conf")
env.put("SPARK_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark/lib")
env.put("SCALA_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark/lib")
env.put("LD_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/hadoop/lib/native")
env.put("SPARK_DIST_CLASSPATH", "/etc/spark/conf.cloudera.spark_on_yarn/classpath.txt")
val sparkLauncher = new SparkLauncher(env)
sparkLauncher
.setSparkHome("/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark")...
What adds to the frustration is that the same SparkLauncher code works perfectly fine in yarn-client mode.
Can someone please point out what I am missing? I feel like I'm staring at the issue without recognizing it.
NOTE: Both the main class (com.digital.StartSparkJob) and the SparkLauncher code are part of the fat jar I'm pushing to the server. I just call the SparkLauncher code through an external API, which in turn should open a driver JVM on the cluster.
Spark version: 1.6.0, Scala version: 2.10.5
I wasn't even getting logs in the Spark UI, because the Spark app wasn't running at all. Therefore I ran the SparkLauncher as a process (using .launch().waitFor()) so that I could capture the error logs.
I captured the logs using .getInputStream and .getErrorStream and found that the wrong user was being passed to the cluster. My cluster works only for user "abcd".
I had set System.setProperty("HADOOP_USER_NAME", "abcd") and also added "spark.yarn.appMasterEnv.HADOOP_USER_NAME=abcd" to spark-defaults.conf before launching SparkLauncher. However, it looks like neither setting is carried over to the cluster.
I therefore passed HADOOP_USER_NAME to the child process through the SparkLauncher environment map:
val env: java.util.Map[String, String] = new java.util.HashMap[String, String]
env.put("SPARK_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn")
env.put("YARN_CONF_DIR", "/etc/spark/conf.cloudera.spark_on_yarn/yarn-conf")
env.put("HADOOP_USER_NAME", "abcd")
try {
val sparkLauncher = new SparkLauncher(env)...
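The two ideas in this answer, passing a variable through the child's environment and capturing the child's stdout and stderr to see why it failed, can be sketched in plain Python as an analogy. This is not SparkLauncher code: `launch_child` is a hypothetical helper, and a one-line Python script stands in for the spark-submit child process.

```python
import os
import subprocess
import sys


def launch_child(extra_env):
    """Run a child process with extra environment variables and capture its output."""
    child_env = dict(os.environ)   # start from the parent's environment
    child_env.update(extra_env)    # add the variables the child must see
    # Stand-in for spark-submit: reports the user it sees and logs to stderr.
    script = (
        "import os, sys; "
        "print(os.environ.get('HADOOP_USER_NAME')); "
        "print('diagnostic', file=sys.stderr)"
    )
    result = subprocess.run(
        [sys.executable, "-c", script],
        env=child_env,
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout.strip(), result.stderr.strip()
```

A setting made only inside the parent process is invisible to the child unless it travels through the environment or the command line, which matches what the answer observed with System.setProperty.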

Pyspark Structured streaming locally with Kafka-Jupyter

After looking at the other answers I still can't figure it out.
I am able to use KafkaProducer and KafkaConsumer to send and receive messages from within my notebook.
producer = KafkaProducer(bootstrap_servers=['127.0.0.1:9092'],value_serializer=lambda m: json.dumps(m).encode('ascii'))
consumer = KafkaConsumer('hr',bootstrap_servers=['127.0.0.1:9092'],group_id='abc' )
I've tried to connect to the stream with both spark context and spark session.
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext("local[*]", "stream")
ssc = StreamingContext(sc, 1)
Which gives me this error
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.2 ...
It seems that I needed to add the JAR to my spark-submit command:
!/usr/local/bin/spark-submit --master local[*] /usr/local/Cellar/apache-spark/2.3.0/libexec/jars/spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar pyspark-shell
which returns
Error: No main class set in JAR; please specify one with --class
Run with --help for usage help or --verbose for debug output
What class do I put in?
How do I get PySpark to connect to the consumer?
The command you have is trying to run spark-streaming-kafka-0-8-assembly_2.11-2.3.2.jar, and trying to find pyspark-shell as a Java class inside of that.
As the first error says, you are missing --packages after spark-submit, which means you would run something like:
spark-submit --packages ... someApp.jar com.example.YourClass
If you are just working locally in Jupyter, you may want to try Kafka-Python, for example, rather than PySpark: less overhead, and no Java dependencies.
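If you do stay with PySpark inside a notebook, one common way to apply the --packages advice is to set PYSPARK_SUBMIT_ARGS before the first SparkContext is created, with pyspark-shell as the last token. The helper below is a minimal sketch: `build_pyspark_submit_args` is a hypothetical name, and the Kafka coordinate is an assumption that must match the Scala and Spark versions of your own installation.

```python
import os


def build_pyspark_submit_args(packages=(), conf=None, master=None):
    """Build a PYSPARK_SUBMIT_ARGS string with all options before 'pyspark-shell'."""
    parts = []
    if master:
        parts += ["--master", master]
    if packages:
        parts += ["--packages", ",".join(packages)]
    for key, value in (conf or {}).items():
        parts += ["--conf", f"{key}={value}"]
    # 'pyspark-shell' goes last; anything after it is not read as an option.
    parts.append("pyspark-shell")
    return " ".join(parts)


# Assumed coordinate for illustration only; adjust both version numbers.
KAFKA_PACKAGE = "org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.2"

# Must run before any SparkContext or SparkSession is created in the notebook.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_pyspark_submit_args(
    master="local[*]", packages=[KAFKA_PACKAGE]
)
```

With this in place no main class is needed, because the JAR is pulled in as a dependency instead of being submitted as the application.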

saveAsTable ends in failure in Spark-yarn cluster environment

I set up a Spark on YARN cluster environment and tried Spark SQL with spark-shell:
spark-shell --master yarn --deploy-mode client --conf spark.yarn.archive=hdfs://hadoop_273_namenode_ip:namenode_port/spark-archive.zip
One thing to mention is that Spark is running on Windows 7. After spark-shell starts up successfully, I execute the following commands:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val df_mysql_address = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://mysql_db_ip/db").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "ADDRESS").option("user", "root").option("password", "root").load()
scala> df_mysql_address.show
scala> df_mysql_address.write.format("parquet").saveAsTable("address_local")
The "show" command returns the result set correctly, but "saveAsTable" fails. The error message says:
java.io.IOException: Mkdirs failed to create file:/C:/jshen.workspace/programs/spark-2.2.0-bin-hadoop2.7/spark-warehouse/address_local/_temporary/0/_temporary/attempt_20171018104423_0001_m_000000_0 (exists=false, cwd=file:/tmp/hadoop/nm-local-dir/usercache/hduser/appcache/application_1508319604173_0005/container_1508319604173_0005_01_000003)
I expected the table to be saved in the Hadoop cluster, but as you can see, the directory (C:/jshen.workspace/programs/spark-2.2.0-bin-hadoop2.7/spark-warehouse) is a folder on my Windows 7 machine, not in HDFS and not even on the Hadoop Ubuntu machine.
What should I do? Please advise, thanks.
The way to get rid of the problem is to provide the "path" option before the save operation, as shown below:
scala> df_mysql_address.write.option("path", "/spark-warehouse").format("parquet").saveAsTable("address_local")
Thanks @philantrovert.

PySpark distributed processing on a YARN cluster

I have Spark running on a Cloudera CDH5.3 cluster, using YARN as the resource manager. I am developing Spark apps in Python (PySpark).
I can submit jobs and they run successfully, but they never seem to run on more than one machine (the local machine I submit from).
I have tried a variety of options, like setting --deploy-mode to cluster and --master to yarn-client and yarn-cluster, yet the job never seems to run on more than one server.
I can get it to run on more than one core by passing something like --master local[8], but that obviously doesn't distribute the processing over multiple nodes.
I have a very simple Python script processing data from HDFS like so:
import simplejson as json
from pyspark import SparkContext
sc = SparkContext("", "Joe Counter")
rrd = sc.textFile("hdfs:///tmp/twitter/json/data/")
data = rrd.map(lambda line: json.loads(line))
joes = data.filter(lambda tweet: "Joe" in tweet.get("text",""))
print joes.count()
And I am running a submit command like:
spark-submit atest.py --deploy-mode client --master yarn-client
What can I do to ensure the job runs in parallel across the cluster?
Can you swap the arguments for the command?
spark-submit --deploy-mode client --master yarn-client atest.py
If you see the help text for the command:
spark-submit
Usage: spark-submit [options] <app jar | python file>
I believe @MrChristine is correct: the option flags you specify are being passed to your Python script, not to spark-submit. In addition, you'll want to specify --executor-cores and --num-executors, since by default it will run on a single core and use two executors.
It's not true that a Python script doesn't run in cluster mode. I am not sure about previous versions, but this runs on Spark 2.2 on a Hortonworks cluster.
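The effect of argument order can be sketched with a toy parser that follows the usage line quoted above. This is an illustration, not spark-submit's real parser: `split_submit_args` is a hypothetical helper, and it makes the simplifying assumption that every option has the form `--name value` (real spark-submit also has value-less flags such as --verbose).

```python
def split_submit_args(argv):
    """Split a spark-submit argument list into (options, app_file, app_args).

    Options are only recognised before the application file; everything
    after the application file is handed to the application itself.
    """
    options = {}
    i = 0
    while i < len(argv) and argv[i].startswith("--"):
        if i + 1 >= len(argv):
            raise ValueError(f"option {argv[i]} has no value")
        options[argv[i]] = argv[i + 1]
        i += 2
    app_file = argv[i] if i < len(argv) else None
    app_args = argv[i + 1:]
    return options, app_file, app_args


# Order used in the question: the flags end up as arguments to atest.py.
as_asked = split_submit_args(
    "atest.py --deploy-mode client --master yarn-client".split()
)

# Order suggested in the answer: the flags are seen as spark-submit options.
as_fixed = split_submit_args(
    "--deploy-mode client --master yarn-client atest.py".split()
)
```

In the first call the options dictionary is empty and the four tokens land in the application arguments, so the script's own master setting is what takes effect.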
Command : spark-submit --master yarn --num-executors 10 --executor-cores 1 --driver-memory 5g /pyspark-example.py
Python Code :
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = (SparkConf()
.setMaster("yarn")
.setAppName("retrieve data"))
sc = SparkContext(conf = conf)
sqlContext = SQLContext(sc)
parquetFile = sqlContext.read.parquet("/<hdfs-path>/*.parquet")
parquetFile.createOrReplaceTempView("temp")
df1 = sqlContext.sql("select * from temp limit 5")
df1.show()
df1.write.save('/<hdfs-path>/test.csv', format='csv', mode='append')
sc.stop()
Output: It's large, so I am not pasting it, but the job runs perfectly.
It seems that PySpark does not run in distributed mode using Spark/YARN - you need to use stand-alone Spark with a Spark Master server. In that case, my PySpark script ran very well across the cluster with a Python process per core/node.
