Jobs are not shown on Spark WebUI - apache-spark

I am a new user of Spark. I installed Spark, installed PySpark through Anaconda, and then ran the basic code given below in a Jupyter notebook. When I open the Spark WebUI, however, I am unable to see any jobs, either running or completed. Any comments are appreciated.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("NQlabtop") \
    .config('spark.ui.port', '4050') \
    .getOrCreate()
sc = spark.sparkContext

input_file = sc.textFile("C:/Users/nqazi/NQ/anscombe.json")
map = input_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
counts = map.reduceByKey(lambda a, b: a + b)
print("counts", counts)

sc = spark.sparkContext
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
Please see the image of the Spark WebUI below. I am not sure why I cannot see any jobs, as I think it should display the completed ones.

There are two types of operations in PySpark (Spark): transformations and actions. Transformations are evaluated lazily, and PySpark does not run any jobs until you call an action such as show, count, or collect. The snippet above only builds transformations (flatMap, map, reduceByKey) and parallelizes a list, so no job is submitted and nothing appears in the WebUI.
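For example, adding an action to the snippet above would trigger a job that then shows up on the WebUI (a minimal sketch using the question's own variables):

result = counts.collect()   # collect() is an action, so Spark now actually runs a job
print(result)
print(distData.count())     # count() is another action; this submits a second job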

Related

SparkContext conflict with spark udf

Good morning
When running:
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

class ETL:
    def addone(x):
        return x + 1

    def job_run():
        df = spark.sql('SELECT 1 one').withColumn('AddOne', udf_addone(F.col('one')))
        df.show()

if (__name__ == '__main__'):
    udf_addone = F.udf(lambda x: ETL.addone(x), returnType=IntegerType())
    ETL.job_run()
I get the following error message:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I have reviewed the answers given at "ERROR: SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063" and at "Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation", with no success. I'd like to stick to using a Spark UDF in my script.
Any help on this is appreciated.
Many thanks!

NameError: name 'SparkSession' is not defined

I'm new to Cask CDAP and the Hadoop environment.
I'm creating a pipeline and I want to use a PySpark program. I have the whole script of the Spark program, and it works when I test it from the command line, but it doesn't when I copy-paste it into a CDAP pipeline.
It gives me an error in the logs:
NameError: name 'SparkSession' is not defined
My script starts in this way:
from pyspark.sql import *
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import trim, to_date, year, month
sc = SparkContext()
How can I fix it?
Spark connects to the locally running Spark cluster through a SparkContext; a better explanation can be found at https://stackoverflow.com/a/24996767/5671433.
To initialise a SparkSession, a SparkContext has to be initialised first.
One way to do that is to write a function that initialises all your contexts and a Spark session, for example:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession

def init_spark(app_name, master_config):
    """
    :params app_name: Name of the app
    :params master_config: eg. local[4]
    :returns SparkContext, SQLContext, SparkSession:
    """
    conf = (SparkConf().setAppName(app_name).setMaster(master_config))
    sc = SparkContext(conf=conf)
    sc.setLogLevel("ERROR")

    sql_ctx = SQLContext(sc)
    spark = SparkSession(sc)

    return (sc, sql_ctx, spark)
This can then be called as
sc, sql_ctx, spark = init_spark("App_name", "local[4]")
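In more recent PySpark versions you typically let SparkSession.builder create the SparkContext for you; the sketch below is an assumption about the asker's setup, not part of the original answer:

from pyspark.sql import SparkSession

# getOrCreate() creates the underlying SparkContext if one does not exist yet
spark = SparkSession.builder \
    .appName("App_name") \
    .master("local[4]") \
    .getOrCreate()
sc = spark.sparkContext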

Spark Streaming: How to load a Pipeline on a Stream?

I am implementing a lambda architecture system for stream processing.
I have no issue creating a Pipeline with GridSearch in Spark Batch:
pipeline = Pipeline(stages=[data1_indexer, data2_indexer, ..., assembler, logistic_regressor])

paramGrid = (
    ParamGridBuilder()
    .addGrid(logistic_regressor.regParam, (0.01, 0.1))
    .addGrid(logistic_regressor.tol, (1e-5, 1e-6))
    ...etcetera
).build()

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=4)

pipeline_cv = cv.fit(raw_train_df)
model_fitted = pipeline_cv.getEstimator().fit(raw_validation_df)
model_fitted.write().overwrite().save("pipeline")
However, I can't seem to find how to plug the pipeline into the Spark Streaming process. I am using Kafka as the DStream source, and my code as of now is as follows:
import json
from pyspark.ml import PipelineModel
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer", {"kafka_topic": 1})

model = PipelineModel.load('pipeline/')
parsed_stream = kafkaStream.map(lambda x: json.loads(x[1]))

# CODE MISSING GOES HERE

ssc.start()
ssc.awaitTermination()
and now I need to find some way of applying the pipeline there.
Based on the documentation here (even though it looks very outdated), it seems like the model needs to implement the method predict to be usable on an RDD object (and hopefully on a KafkaStream?).
How could I use the pipeline on the streaming context? The reloaded PipelineModel only seems to implement transform.
Does that mean the only way to use batch models in a streaming context is to use pure models, and no pipelines?
I found a way to load a Spark Pipeline into Spark Streaming.
This solution works for Spark v2.0; later versions will probably implement a better approach.
The solution I found transforms the streaming RDDs into DataFrames using the toDF() method, on which you can then apply the pipeline.transform method.
This way of doing things is horribly inefficient, though.
# we load the required libraries
from pyspark.sql.types import (
    StructType, StringType, StructField, LongType
)
from pyspark.sql import Row
from pyspark.ml import PipelineModel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# we specify the dataframe's schema, so Spark does not have to infer it by reflection on the data
pipeline_schema = StructType(
    [
        StructField("field1", StringType(), True),
        StructField("field2", StringType(), True),
        StructField("field3", LongType(), True)
    ]
)

# we load the pipeline saved with Spark batch
pipeline = PipelineModel.load('/pipeline')

# set up the usual Spark context and Spark Streaming context
sc = spark.sparkContext
ssc = StreamingContext(sc, 1)

# in my case I use a Kafka direct stream as the DStream source
directKafkaStream = KafkaUtils.createDirectStream(ssc, suwanpos[QUEUE_NAME], {"metadata.broker.list": "localhost:9092"})

def handler(req_rdd):
    def process_point(p):
        # here goes the logic to run after applying the pipeline
        print(p)
    if req_rdd.count() > 0:
        # here is the gist of it: we turn the RDD into Rows, then into a df with the specified schema
        req_df = req_rdd.map(lambda r: Row(**r)).toDF(schema=pipeline_schema)
        # now we can apply the transform, yaaay
        pred = pipeline.transform(req_df)
        records = pred.rdd.map(lambda p: process_point(p)).collect()
Hope this helps.
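The snippet above stops at the handler definition; it presumably still has to be registered on the DStream and the streaming context started, as in the question's own code. A minimal sketch of that wiring, assuming the names used above (handler, directKafkaStream, ssc):

# run the handler on every micro-batch RDD produced by the direct stream
directKafkaStream.foreachRDD(handler)

ssc.start()
ssc.awaitTermination()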

Kafka integration with spark

I want to set up a streaming application using Apache Kafka and Spark Streaming. Kafka (version 0.9.0.1) is running on a separate Unix machine, and Spark v1.6.1 is part of a Hadoop cluster.
I have started the ZooKeeper and Kafka servers, and I want to stream messages from a log file using the console producer and consume them with a Spark Streaming application using the direct approach (no receivers). I have written the code in Python and am executing it using the command below:
spark-submit --jars spark-streaming-kafka-assembly_2.10-1.6.1.jar streamingDirectKafka.py
I am getting the error below:
/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 152, in createDirectStream
py4j.protocol.Py4JJavaError: An error occurred while calling o38.createDirectStreamWithoutMessageHandler.
: java.lang.ClassCastException: kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
Could you please help?
Thanks!!
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    conf = SparkConf().setAppName("StreamingDirectKafka")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 1)

    topic = ['test']
    kafkaParams = {"metadata.broker.list": "apsrd7102:9092"}

    lines = (KafkaUtils.createDirectStream(ssc, topic, kafkaParams)
             .map(lambda x: x[1]))
    counts = (lines.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
It looks like you are using an incompatible version of Kafka. From the documentation, as of Spark 2.0, Kafka 0.8.x is supported.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
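If you stay on Spark 1.6.1, one thing worth checking (an assumption beyond the original answer) is that no mismatched Kafka client jars end up on the classpath; letting spark-submit resolve the matching integration package is one way to do that:

spark-submit \
  --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.1 \
  streamingDirectKafka.py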

Reading data from HDFS on a cluster

I am trying to read data from HDFS on an AWS EC2 cluster using a Jupyter notebook. The cluster has 7 nodes. I am using HDP 2.4, and my code is below. The table has millions of rows, but the code does not return any rows. "ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com" is the server (the Ambari server).
from pyspark.sql import SQLContext
sqlContext = HiveContext(sc)
demography = sqlContext.read.load("hdfs://ec2-xx-xx-xxx-xx.compute-1.amazonaws.com:8020/tmp/FAERS/demography_2012q4_2016q1_duplicates_removed.csv", format="com.databricks.spark.csv", header="true", inferSchema="true")
demography.printSchema()
demography.cache()
print demography.count()
But using sc.textFile, I get the correct number of rows
data = sc.textFile("hdfs://ec2-xx-xxx-xxx-xx.compute-1.amazonaws.com:8020/tmp/FAERS/demography_2012q4_2016q1_duplicates_removed.csv")
schema = data.map(lambda x: x.split(",")).first()  # get schema
header = data.first()  # extract header
data = data.filter(lambda x: x != header)  # filter out header
data = data.map(lambda x: x.split(","))
data.count()
3641865
The answer given by Indrajit here solved my problem. The problem was with the spark-csv jar.
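For reference, the spark-csv dependency is typically supplied when launching PySpark or submitting the job; a minimal sketch, assuming Scala 2.10 and spark-csv 1.4.0 (adjust the coordinates to your cluster):

pyspark --packages com.databricks:spark-csv_2.10:1.4.0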

Resources