Spark Accumulator not working - apache-spark

I want to get the number of closed orders from this data using accumulators, but it gives me an incorrect answer, just zero (0). What is the problem? I am using the Hortonworks Sandbox and submitting the job with spark-submit. The code is below.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('closedcount')
sc = SparkContext(conf=conf)

rdd = sc.textFile("/tmp/fish/itversity/retail_db/orders/")
N_closed = sc.accumulator(0)

def is_closed(N_closed, line):
    status = (line.split(",")[-1] == "CLOSED")
    if status:
        N_closed.add(1)
    return status

closedRDD = rdd.filter(lambda x: is_closed(N_closed, x))

print('The answer is ' + str(N_closed.value))
But when I submit it, I get zero.
spark-submit --master yarn closedCounter.py
Update:
Now, when I change my code, it works fine. Is this the right way to do it?
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('closedcount')
sc = SparkContext(conf=conf)

rdd = sc.textFile("/tmp/fish/itversity/retail_db/orders/")
N_closed = sc.accumulator(0)

def is_closed(line):
    global N_closed
    status = (line.split(",")[-1] == "CLOSED")
    if status:
        N_closed.add(1)

rdd.foreach(is_closed)

print('The answer is ' + str(N_closed.value))
Second Update:
I understand it now. In a Jupyter Notebook, without YARN, it gives me the correct answer because I called an action (count) before checking the value of the accumulator.

Computations inside transformations are evaluated lazily, so unless an action happens on an RDD, the transformations are not executed. As a result, accumulators used inside functions like map() or filter() won't be updated unless some action happens on the RDD.
https://www.edureka.co/blog/spark-accumulators-explained
(Examples in Scala)
But basically, you need to perform an action on the RDD. For example:
N_closed = sc.accumulator(0)

def is_closed(line):
    status = line.split(",")[-1] == "CLOSED"
    if status:
        N_closed.add(1)
    return status

rdd.foreach(is_closed)

print('The answer is ' + str(N_closed.value))
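Equivalently, you can keep the original filter-based version and simply trigger an action such as count() before reading the accumulator, which is what the second update alludes to. A minimal sketch, reusing the rdd defined above:

N_closed = sc.accumulator(0)

def is_closed(line):
    status = (line.split(",")[-1] == "CLOSED")
    if status:
        N_closed.add(1)
    return status

# filter() is lazy; count() is the action that actually runs it,
# so the accumulator is only populated after this line.
closedRDD = rdd.filter(is_closed)
closedRDD.count()

print('The answer is ' + str(N_closed.value))

Note that accumulator updates made inside transformations such as filter() can be applied more than once if a task is re-executed, so for exact counts Spark's documentation recommends updating accumulators inside actions (like foreach), as in the answer above.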

Related

Using Accumulator inside Pyspark UDF

I want to access an accumulator inside a PySpark UDF:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
accum = spark.sparkContext.accumulator(0)

def prob(g, s):
    if g == 'M':
        accum.add(1)
        return 1
    else:
        accum.add(2)
        return accum.value

convertUDF = udf(lambda g, s: prob(g, s), IntegerType())
The problem I am getting:
raise Exception("Accumulator.value cannot be accessed inside tasks")
Exception: Accumulator.value cannot be accessed inside tasks
Please let me know how to access the accumulator value and how we can change it inside a PySpark UDF.
You cannot access the .value of the accumulator inside the UDF. From the documentation (see this answer too):
Worker tasks on a Spark cluster can add values to an Accumulator with the += operator, but only the driver program is allowed to access its value, using value.
It is unclear why you need to return accum.value in this case. Looking at your if block, I believe you only need to return 2 in the else block:
def prob(g, s):
    if g == 'M':
        accum.add(1)
        return 1
    else:
        accum.add(2)
        return 2
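To actually read the accumulator, apply the UDF and then access accum.value on the driver after an action has run. A minimal sketch; the DataFrame and its columns g and s are hypothetical placeholders, not part of the original question:

# Hypothetical example data, just to illustrate the pattern.
df = spark.createDataFrame([('M', 1), ('F', 2)], ['g', 's'])

result = df.withColumn('p', convertUDF(col('g'), col('s')))
result.collect()    # action: the UDF runs on the executors and updates the accumulator

print(accum.value)  # reading .value is only allowed on the driver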

pySpark --> NoSuchTableException: Table or view 'my_test_table_thread_1' not found in database 'default'

I run a Spark job and it works without any error. In my PySpark code, I run 3 machine learning jobs sequentially. But when I try to run them concurrently in threads, I get an error. It gives the error on this part:
def run(.....):
    ......
    sc = SparkContext.getOrCreate(conf=conf)
    sc.setCheckpointDir("/tmp/ersing/")
    spark = SparkSession(sc)

    temp_name = "my_test_table_thread_" + str(thread_id)
    my_table.createOrReplaceTempView(temp_name)
    print(temp_name + " count(*) --> " + str(my_table.count()))
    print("""spark.catalog.tableExists(""" + temp_name + """) = """ + str(spark._jsparkSession.catalog().tableExists(temp_name)))

    model_sql = """select id from {sample_table_name} where
        id = {id} """.format(id=id, sample_table_name=temp_name)
    my_df = spark.sql(model_sql).select("id",)  # this part gives the error --> no such table
    my_df = broadcast(my_df)
    ......
My main code is:
....
from multiprocessing.pool import ThreadPool
import threading

def run_worker(job):
    returned_sample_table = run('sampling', ...)  # I call the run method twice: the first run gets the df, and the second run does the modeling
    run('modeling', ..., returned_sample_table)

def mp_handler():
    p = ThreadPool(8)
    p.map(run_worker, jobs)
    p.join()
    p.close()

mp_handler()
I run 3 jobs concurrently, and each time only one job's createOrReplaceTempView works. I know this because I logged print("""spark.catalog.tableExists("""+temp_name+""") = """ + str(spark._jsparkSession.catalog().tableExists(temp_name))) and saw that one of the tables exists while the others do not.
So what am I missing?
Thanks in advance.
Finally, I got the solution.
The problem was the Spark context. When one of the threads finished its work and closed the context, the others could no longer find their tables in Spark.
What I did was move the Spark context creation up to the main level (into run_worker), like this:
def run_worker(job):
    sc = SparkContext.getOrCreate(conf=conf)
    sc.setCheckpointDir("/tmp/ersing/")
    spark = SparkSession(sc)

    returned_sample_table = run(spark, 'sampling', ...)  # I call the run method twice: the first run gets the df, and the second run does the modeling
    run(spark, 'modeling', ..., returned_sample_table)

Status of structured streaming query in PySpark

I am following the book Spark: The Definitive Guide, and I was writing a basic program that streams data. The book says that I should use the awaitTermination() method to process the query correctly. When I run the code below, it runs indefinitely until I press Ctrl+C, and then it ends with an exception. My question is: how can I monitor the status of my streaming query so that, as soon as the stream completes, my program exits after showing the output? In the example code below, as soon as it has read all the files and written the output to the console, it should have ended, but it didn't. I also tried inserting activityQuery.stop(), but that didn't work either. How can I achieve this? Any help would be appreciated.
from pyspark import SparkConf
from pyspark.sql import *
from pyspark.sql.functions import *
from time import sleep

conf = SparkConf()
spark = SparkSession.builder.config(conf=conf).appName('testapp').getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 2)
spark.conf.set("spark.sql.streaming.schemaInference", "true")

static = spark.read.format("json").load("/home/scom/.test/spark/Spark-The-Definitive-Guide/data/activity-data/")
dataSchema = static.schema

streaming = spark.readStream.schema(dataSchema).option("maxFilesPerTrigger", 1).json("/home/scom/.test/spark/Spark-The-Definitive-Guide/data/activity-data/")
activityCounts = streaming.groupBy("gt").count()

activityQuery = activityCounts.writeStream.queryName("activity_counts").format("console").outputMode("complete").start()
activityQuery.awaitTermination()

for x in range(5):
    spark.sql("select * from activity_counts").show()
    sleep(1)
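One possible way to do this, sketched here as an illustration rather than a definitive answer, is to replace the blocking awaitTermination() call with a polling loop over the query's status and stop the query once it goes idle (this assumes the input directory is finite, as in the example above):

import time

while activityQuery.isActive:
    status = activityQuery.status  # dict with 'message', 'isDataAvailable', 'isTriggerActive'
    print(status)
    # Rough heuristic: once no new data is available and no trigger is
    # running, assume the finite input has been fully processed.
    if not status["isDataAvailable"] and not status["isTriggerActive"]:
        activityQuery.stop()
    time.sleep(1)

activityQuery.awaitTermination()  # returns quickly once the query has stopped
# The rest of the program (the for loop above) can then run and exit normally.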

Facing Issue while Loading Data in through PySpark and performing joins

My problem is as follows:
I have a large DataFrame called customer_data_pk containing 230M rows, and another one named customer_data_pk_isb containing 200M rows.
Both have a column callID on which I would like to do a left join, with customer_data_pk as the left DataFrame.
What is the best possible way to achieve the join operation?
What have I tried?
The simple join, i.e.

customer_data_pk.join(customer_data_pk_isb,
                      customer_data_pk.btn == customer_data_pk_isb.btn,
                      'left')

gives an out-of-memory error, or just times out with: Error: Removing executor driver with no recent heartbeats: 468990 ms exceeds timeout 120000 ms.
After all this, the join still doesn't work. I am still learning PySpark so I might have misunderstood the fundamentals. If someone could shed light on this, it would be great.
I have tried this as well, but it didn't work and the code gets stuck:
customer_data_pk.persist(StorageLevel.DISK_ONLY)
Furthermore, on the configuration side I am using: --conf spark.sql.shuffle.partitions=5000
My complete code is as follows:
from pyspark import SparkContext
from pyspark import SQLContext
import time
import pyspark

sc = SparkContext("local", "Example")
sqlContext = SQLContext(sc)

customer_data_pk = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://localhost/matchingqueryautomation',
    driver='com.mysql.jdbc.Driver',
    dbtable='customer_pk',
    user='XXXX',
    password='XXXX').load()

customer_data_pk.persist(pyspark.StorageLevel.DISK_ONLY)

customer_data_pk_isb = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://localhost/lookupdb',
    driver='com.mysql.jdbc.Driver',
    dbtable='customer_pk_isb',
    user='XXXX',
    password='XXXX').load()

print('###########################',
      customer_data_pk.join(customer_data_pk_isb, customer_data_pk.btn == customer_data_pk_isb.btn, 'left').count(),
      '###########################')
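One common direction for this kind of problem, sketched here purely as an illustration (it is not an answer from the original thread), is to avoid master "local" for data of this size and to parallelize the JDBC reads so the join's shuffle is spread over many partitions. The partition column and bounds below are hypothetical placeholders:

# Illustrative sketch: read the MySQL table in parallel slices instead of
# through a single connection. 'id' and its bounds are placeholders and
# must be replaced with a real numeric column of the table.
customer_data_pk = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://localhost/matchingqueryautomation',
    driver='com.mysql.jdbc.Driver',
    dbtable='customer_pk',
    user='XXXX',
    password='XXXX',
    partitionColumn='id',
    lowerBound='1',
    upperBound='230000000',
    numPartitions='200').load()

# Repartition both sides on the join key so each shuffle task handles a
# manageable slice, then do the left join on the key column.
joined = (customer_data_pk.repartition(2000, 'btn')
          .join(customer_data_pk_isb.repartition(2000, 'btn'),
                on='btn', how='left'))
joined.count()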

How to print PythonTransformedDStream

I'm trying to run a word count example integrating an AWS Kinesis stream and Apache Spark. Random lines are put into Kinesis at regular intervals.
lines = KinesisUtils.createStream(...)
When I submit my application, lines.pprint() doesn't print any values.
I tried printing the lines object and I see <pyspark.streaming.dstream.TransformedDStream object at 0x7fa235724950>.
How do I print the PythonTransformedDStream object and check whether data is received?
I'm sure there is no credentials issue; if I use wrong credentials I get an access exception.
I've added the code for reference:
import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

if __name__ == "__main__":
    sc = SparkContext(appName="SparkKinesisApp")
    ssc = StreamingContext(sc, 1)

    lines = KinesisUtils.createStream(
        ssc, "SparkKinesisApp", "myStream",
        "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
        InitialPositionInStream.LATEST, 2)

    # lines.saveAsTextFiles('/home/ubuntu/logs/out.txt')
    lines.pprint()

    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
Finally, I got it working.
The example code I referred to at https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/python/examples/streaming/kinesis_wordcount_asl.py has a wrong command for submitting the application.
The correct command with which I got it working is:
$ bin/spark-submit --jars external/spark-streaming-kinesis-asl_2.11-2.1.0.jar --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.1.0 /home/ubuntu/my_pyspark/spark_kinesis.py
Since lines.pprint() doesn't print anything, can you please confirm that you execute:
ssc.start()
ssc.awaitTermination()
as mentioned in the example here: https://github.com/apache/spark/blob/v2.1.0/examples/src/main/python/streaming/network_wordcount.py
pprint() should work when the environment is configured correctly:
http://spark.apache.org/docs/2.1.0/streaming-programming-guide.html#output-operations-on-dstreams
Output Operations on DStreams
print() - Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.
Python API: This is called pprint() in the Python API.
