I am following the book Spark: The Definitive Guide and was writing a basic program that streams data. The book says I should use the awaitTermination() method to process the query correctly. When I run the code below, it runs indefinitely until I press Ctrl+C, and then it ends with an exception. My question is: how can I monitor the status of my streaming query so that as soon as the stream completes, my program exits after showing the output? In the example code below, as soon as it reads all the files and writes the counts to the console, it should have ended, but it didn't. I also tried inserting activityQuery.stop(), but that didn't work either. How can I achieve this? Any help would be appreciated.
from pyspark import SparkConf
from pyspark.sql import *
from pyspark.sql.functions import *
from time import sleep
conf = SparkConf()
spark = SparkSession.builder.config(conf=conf).appName('testapp').getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 2)
spark.conf.set("spark.sql.streaming.schemaInference", "true")
static = spark.read.format("json").load("/home/scom/.test/spark/Spark-The-Definitive-Guide/data/activity-data/")
dataSchema = static.schema
streaming = spark.readStream.schema(dataSchema).option("maxFilesPerTrigger", 1).json("/home/scom/.test/spark/Spark-The-Definitive-Guide/data/activity-data/")
activityCounts = streaming.groupBy("gt").count()
activityQuery = activityCounts.writeStream.queryName("activity_counts").format("console").outputMode("complete").start()
activityQuery.awaitTermination()
for x in range(5):
    spark.sql("select * from activity_counts").show()
    sleep(1)
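For reference, here is a minimal sketch (not from the book) of two ways to make such a query finish on its own; the trigger(once=True) option assumes Spark 2.2+, and the status polling is deliberately simplified:

# Variant 1 (assumes Spark 2.2+): process everything currently available in a single trigger, then stop.
activityQuery = activityCounts.writeStream.queryName("activity_counts") \
    .format("console").outputMode("complete") \
    .trigger(once=True).start()
activityQuery.awaitTermination()  # returns once the single trigger has finished

# Variant 2: poll the query's status dict and stop it once no new data is reported.
import time
activityQuery = activityCounts.writeStream.queryName("activity_counts") \
    .format("console").outputMode("complete").start()
time.sleep(5)  # give the first trigger a chance to start
while activityQuery.status["isDataAvailable"] or activityQuery.status["isTriggerActive"]:
    time.sleep(1)
activityQuery.stop()  # after this, awaitTermination() returns as well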
I run a Spark job and it works without any error. In my pyspark code, I run 3 machine learning jobs sequentially. But when I try to run them concurrently in threads, I get an error. It fails on this part:
def run(.....):
    ......
    sc = SparkContext.getOrCreate(conf=conf)
    sc.setCheckpointDir("/tmp/ersing/")
    spark = SparkSession(sc)

    temp_name = "my_test_table_thread_"+str(thread_id)
    my_table.createOrReplaceTempView(temp_name)

    print(temp_name +" count(*) --> " + str(my_table.count()))
    print("""spark.catalog.tableExists("""+temp_name+""") = """ + str(spark._jsparkSession.catalog().tableExists(temp_name)))

    model_sql = """select id from {sample_table_name} where
        id= {id} """.format(id=id, sample_table_name=temp_name)
    my_df = spark.sql(model_sql).select("id",) #this part gives error --> no such table
    my_df = broadcast(my_df)
    ......
My main code is:
....
from multiprocessing.pool import ThreadPool
import threading

def run_worker(job):
    returned_sample_table = run('sampling',...) # i call run method twice. First run get df and I call second run for modeling
    run('modeling',...,returned_sample_table)

def mp_handler():
    p = ThreadPool(8)
    p.map(run_worker, jobs)
    p.close()  # close() must be called before join()
    p.join()

mp_handler()
I run 3 jobs concurrently, and each time only one job's createOrReplaceTempView works. I know this because I logged print("""spark.catalog.tableExists("""+temp_name+""") = """ + str(spark._jsparkSession.catalog().tableExists(temp_name))) and saw that the table exists for one of the jobs but not for the others.
So what am I missing?
Thanks in advance.
Finally I found the solution.
The problem is the Spark context. When one of the threads finishes its work and closes the context, the others no longer find their tables in Spark.
What I did was move the Spark context creation out of run() and into the worker, like this:
def run_worker(job):
    sc = SparkContext.getOrCreate(conf=conf)
    sc.setCheckpointDir("/tmp/ersing/")
    spark = SparkSession(sc)

    returned_sample_table = run(spark, 'sampling',...) # i call run method twice. First run get df and I call second run for modeling
    run(spark, 'modeling',...,returned_sample_table)
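For completeness, here is a small self-contained sketch of the same pattern (the table and function names are illustrative, not from the original job): a single SparkSession is created once, shared by all threads, and never stopped inside a worker, so each thread's temp view stays visible to its own queries.

from multiprocessing.pool import ThreadPool
from pyspark.sql import SparkSession

# One shared SparkSession for every thread; no worker ever stops it.
spark = SparkSession.builder.appName('thread_demo').getOrCreate()

def work(thread_id):
    df = spark.range(100)  # stand-in for my_table
    temp_name = "my_test_table_thread_" + str(thread_id)
    df.createOrReplaceTempView(temp_name)
    return spark.sql("select count(*) as c from " + temp_name).collect()[0]["c"]

pool = ThreadPool(3)
print(pool.map(work, range(3)))  # [100, 100, 100]
pool.close()
pool.join()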
I'm seeing strange performance results when comparing the two APIs in pyspark 3.2.1 that provide the ability to run a pandas UDF on grouped results of a Spark DataFrame:
df.groupBy().applyInPandas()
ps_df.groupby().apply() - a new way of applying introduced in the pandas API on Spark, AKA Koalas
First I run the following input generator code in local spark mode (Spark 3.2.1):
import pyspark.sql.types as types
from pyspark.sql.functions import col
from pyspark.sql import SparkSession
import pyspark.pandas as ps
spark = SparkSession.builder \
.config("spark.sql.execution.arrow.pyspark.enabled", True) \
.getOrCreate()
ps.set_option("compute.default_index_type", "distributed")
spark.range(1000000).withColumn('group', (col('id') / 10).cast('int')) \
.write.parquet('/tmp/sample_input', mode='overwrite')
Then I test the applyInPandas:
def getsum(pdf):
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf
df = spark.read.parquet(f'/tmp/sample_input')
output_schema = types.StructType(
df.schema.fields + [types.StructField('sum_in_group', types.FloatType())]
)
df.groupBy('group').applyInPandas(getsum, schema=output_schema) \
.write.parquet('/tmp/schematest', mode='overwrite')
And the code executes in under 30 seconds (on an i7-9750H CPU).
Then I try the new API, and while I really appreciate how nice the code looks:
def getsum(pdf) -> ps.DataFrame["id": int, "group": int, "sum_in_group": int]:
    pdf['sum_in_group'] = pdf['id'].sum()
    return pdf
df = ps.read_parquet(f'/tmp/sample_input')
df.groupby('group').apply(getsum) \
.to_parquet('/tmp/schematest', mode='overwrite')
... every time the execution time is at least 1m 40s on the same CPU, so more than 3x slower for this simple operation.
I am aware that adding sum_in_group can be done far more efficiently with no pandas involvement, but this is just meant to be a small minimal example. Any other operation is also at least 3 times slower.
Do you know what the reason for this slowdown could be? Maybe I'm missing some context parameter that would make these execute in a similar time?
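As a side note to the remark above that the per-group sum doesn't need pandas at all, here is a minimal sketch of the same result with a plain window aggregate (reusing the paths from the example; this is just for comparison, not a claim about the benchmark):

from pyspark.sql import Window
import pyspark.sql.functions as F

df = spark.read.parquet('/tmp/sample_input')
df.withColumn('sum_in_group', F.sum('id').over(Window.partitionBy('group'))) \
    .write.parquet('/tmp/schematest', mode='overwrite')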
I want to get the number of closed orders from this data using accumulators, but it is giving me an incorrect answer, just zero (0). What is the problem? I am using the Hortonworks Sandbox. The code is below; I am submitting it with spark-submit.
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('closedcount')
sc = SparkContext(conf=conf)
rdd = sc.textFile("/tmp/fish/itversity/retail_db/orders/")
N_closed = sc.accumulator(0)
def is_closed(N_closed, line):
    status = (line.split(",")[-1] == "CLOSED")
    if status:
        N_closed.add(1)
    return status
closedRDD = rdd.filter(lambda x: is_closed(N_closed, x))
print('The answer is ' + str(N_closed.value))
But when I submit it, I get zero.
spark-submit --master yarn closedCounter.py
Update:
Now, when I change my code, it works fine. Is this the right way to do it?
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('closedcount')
sc = SparkContext(conf=conf)
rdd = sc.textFile("/tmp/fish/itversity/retail_db/orders/")
N_closed = sc.accumulator(0)
def is_closed(line):
    global N_closed
    status = (line.split(",")[-1] == "CLOSED")
    if status:
        N_closed.add(1)
rdd.foreach(is_closed)
print('The answer is ' + str(N_closed.value))
Second Update:
I understand it now. In a Jupyter Notebook, without YARN, it gives me the correct answer because I called an action (count) before checking the value of the accumulator.
Computations inside transformations are evaluated lazily, so unless an action happens on an RDD the transformations are not executed. As a result, accumulators used inside functions like map() or filter() won't be updated unless some action happens on the RDD.
https://www.edureka.co/blog/spark-accumulators-explained
(Examples in Scala)
But basically, you need to perform an action on the RDD.
For example
N_closed = sc.accumulator(0)
def is_closed(line):
    status = line.split(",")[-1] == "CLOSED"
    if status:
        N_closed.add(1)
    return status
rdd.foreach(is_closed)
print('The answer is ' + str(N_closed.value))
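Building on the point in the second update, the original filter-based version also works as long as an action runs before the accumulator is read; a minimal sketch (with the usual caveat that accumulators updated inside transformations can over-count if tasks are retried):

N_closed = sc.accumulator(0)

def is_closed(line):
    status = line.split(",")[-1] == "CLOSED"
    if status:
        N_closed.add(1)
    return status

closedRDD = rdd.filter(is_closed)
closedRDD.count()  # the action forces the filter to run, updating the accumulator
print('The answer is ' + str(N_closed.value))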
I am using Spark 2.0.0 to query a Hive table.
My SQL is:
select * from app.abtestmsg_v limit 10
Yes, I want to get the first 10 records from the view app.abtestmsg_v.
When I run this SQL in spark-shell, it is very fast, taking about 2 seconds.
But the problem comes when I try to implement this query in my Python code.
I am using Spark 2.0.0 and wrote a very simple pyspark program.
Below is my pyspark code:
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
import json
hc = HiveContext(sc)
hc.setConf("hive.exec.orc.split.strategy", "ETL")
hc.setConf("hive.security.authorization.enabled",false)
zj_sql = 'select * from app.abtestmsg_v limit 10'
zj_df = hc.sql(zj_sql)
zj_df.collect()
Below is my scala code:
val hive = new org.apache.spark.sql.hive.HiveContext(sc)
hive.setConf("hive.exec.orc.split.strategy", "ETL")
val df = hive.sql("select * from silver_ep.zj_v limit 10")
df.rdd.collect()
From the info log, I find:
Although I use "limit 10" to tell Spark that I just want the first 10 records, Spark still scans and reads all files of the view (in my case, the source data of this view contains 100 files, each about 1 GB in size). So there are nearly 100 tasks, each task reads one file, and all the tasks are executed serially. It takes nearly 15 minutes to finish these 100 tasks, but all I want is the first 10 records.
So I don't know what to do or what is wrong.
Could anybody give me some suggestions?
I tried to execute my Scala code in the Bluemix Spark service. I can run it and get the right result on my local virtual machine, but when I run it in Bluemix Spark, I don't get any response in the notebook.
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.Matrix
val input = sc.textFile("swift://notebooks.spark/pca.csv")
val header = input.first()
val inputData = input.filter(x => x != header).map(line=>line.split(','))
val inputVector = inputData.map{d=>  // map over the split rows (inputData), not the raw lines
Vectors.dense(
d(1).toDouble, d(2).toDouble, d(3).toDouble, d(4).toDouble, d(5).toDouble, d(6).toDouble,
d(7).toDouble, d(8).toDouble, d(9).toDouble, d(10).toDouble, d(11).toDouble)}
val rowMatrix = new RowMatrix(inputVector)
val pca: Matrix = rowMatrix.computePrincipalComponents(5)
When I execute input.take(2), I get a result, but nothing from executing input.foreach(println). It's strange. How can I get the result?
I have tested it on Bluemix in a Scala notebook.
val input = sc.textFile("swift://notebooks.spark/test.csv")
input.take(1) /** shows the first line */
input.foreach(println) /** nothing is displayed */
If you want to display the contents of an RDD, you can use the following code.
input.take(5).foreach(println) /** shows the first 5 lines */
input.collect().foreach(println) /** shows all lines */
I do not know how your local VM is set up, but I think you have to distinguish between running your code locally and on a cluster: on a cluster, println runs on the executors, so its output ends up in the executor logs rather than in your notebook, which is why you need to bring the data back to the driver with take() or collect() first.
Have a look at this answer for more information: How to print the contents of RDD?