How to do parallel processing in pyspark - apache-spark

I want to do parallel processing in a for loop using PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('yarn').appName('myAppName').getOrCreate()
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

data = ["a", "b", "c"]

for i in data:
    try:
        df = spark.read.parquet('gs://' + i + '-data')
        df.createOrReplaceTempView("people")
        df2 = spark.sql("""select * from people""")
        df2.show()
    except Exception as e:
        print(e)
        continue
The script above works fine, but I want to run these iterations in parallel in PySpark, the way it is possible in Scala.

Spark itself runs jobs in parallel, but if you still want parallel execution in your driver code you can use simple Python code for parallel processing to do it (this was tested on Databricks only).
from multiprocessing.pool import ThreadPool

data = ["a", "b", "c"]
pool = ThreadPool(10)

def fun(x):
    try:
        df = sqlContext.createDataFrame(
            [(1, 2, x), (2, 5, "b"), (5, 6, "c"), (8, 19, "d")],
            ("st", "end", "ani"))
        df.show()
    except Exception as e:
        print(e)

pool.map(fun, data)
I have changed your code a bit, but this is basically how you can run parallel tasks.
If you have some flat files that you want to process in parallel, just make a list of their names and pass it to pool.map(fun, data).
Change the function fun as needed.
For more details on the multiprocessing module, check the documentation.
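Applied to your original parquet loop, a sketch might look like this (the name parts and thread count are placeholders; each thread simply submits its own Spark jobs through the shared SparkSession, and each iteration uses its own temp view name to avoid clashes):
from multiprocessing.pool import ThreadPool

data = ["a", "b", "c"]  # placeholder name parts, as in your loop

def read_and_show(i):
    try:
        # Each call submits independent Spark jobs; the driver threads just issue them concurrently
        df = spark.read.parquet('gs://' + i + '-data')
        df.createOrReplaceTempView("people_" + i)  # unique view name per source
        spark.sql("select * from people_" + i).show()
    except Exception as e:
        print(e)

pool = ThreadPool(3)  # one thread per source is enough here
pool.map(read_and_show, data)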
Similarly, if you want to do it in Scala you will need the following modules
import scala.concurrent.{Future, Await}
For a more detailed understanding check this out.
The code is for Databricks but with a few changes, it will work with your environment.

Here's a parallel loop in PySpark using Azure Databricks.
import socket

def getsock(i):
    # Return the local IP address of the worker that executes this task
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect(("8.8.8.8", 80))
    return s.getsockname()[0]

rdd1 = sc.parallelize(list(range(10)))
parallel = rdd1.map(getsock).collect()
On platforms other than Azure Databricks you may need to create the Spark context sc yourself; on Azure it exists by default.
Coding it up like this only makes sense if the code that is executed in parallel (getsock here) does not itself contain code that is already parallel. For instance, had getsock contained code that goes through a PySpark DataFrame, that code would already be parallel, so it would probably not make sense to also "parallelize" the surrounding loop.

Related

How to use pytest and a common spark session for all tests?

I would like to set up a test suite in my project, but I am facing an issue with Spark.
Consider several tests that all use a spark session to read data, perform some processing, and collect the result as a pandas.DataFrame object.
The problem I am facing is due to conflicts: any time I try to access some columns, I get a KeyError exception and my tests fail.
My conftest.py looks like:
import os
import shutil

import pyspark
import pytest

@pytest.fixture(scope="session")
def spark_session(request):
    # some code to create a spark config `conf`
    spark = pyspark.sql.SparkSession.builder.config(conf=conf).getOrCreate()
    yield spark
    if "spark-warehouse" in os.listdir(os.getcwd()):
        shutil.rmtree("./spark-warehouse", ignore_errors=True)
    spark.stop()
In each test_*.py file, there is a function which loads data with the spark session:
# import some dependencies
from pyspark.sql import SQLContext

def get_data(spark_session):
    spark_context = spark_session.sparkContext
    sql_context = SQLContext(spark_context)
    data = sql_context.read.parquet("data/expected.parquet")
    return data

# some tests
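For illustration, one of those tests might look like this (just a sketch; the column name in the assertion is a placeholder):
def test_expected_data(spark_session):
    # pytest injects the session-scoped fixture by argument name
    df = get_data(spark_session).toPandas()
    assert "some_column" in df.columns  # placeholder column name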
I should add that I don't want to use unittest, only pytest, for various reasons.
Can someone help me find a solution to this issue? I am looking for best practices and simple solutions that make it possible to use one spark session across multiple test files.
I would really appreciate any help ! If you need more context, please let me know.
Thank you.

Databricks : structure stream data assignment and display

I have the following streaming code in a Databricks notebook (Python).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("MyTest") \
    .getOrCreate()

# Create a streaming DataFrame
lines = spark.readStream \
    .format("delta") \
    .table("myschema.streamTest")
In notebook 2, I have
def foreach_batch_function(df, epoch_id):
    test = df
    print(test['simplecolumn'])
    display(test['simplecolumn'])
    test['simplecolumn'].display

lines.writeStream.outputMode("append").foreachBatch(foreach_batch_function).format('console').start()
When I execute the above, where can I see the output from the .display function? I looked inside the cluster driver logs and I don't see anything. I also don't see anything in the notebook itself when it executes, except a successfully initialized and running stream. I do see that the dataframe parameter's data is displayed in the console, but I am trying to confirm that the assignment to test was successful.
I am trying to carry out this manipulation as a precursor to time-series operations over mini batches for real-time model scoring in Python, but I am struggling to get the basics right in the structured streaming world. A working model exists but executes every 10-15 minutes. I would like to make it real-time via streams, hence this question.
You're mixing different things together. I recommend reading the initial parts of the Structured Streaming documentation, or chapter 8 of the Learning Spark, 2nd edition book (freely available from here).
You can use the display function directly on the stream (ideally with checkpointLocation and maybe trigger parameters, as described in the documentation), like this:
display(lines)
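For example, with an explicit checkpoint location (the path below is just a placeholder; a trigger can also be passed, as described in the Databricks documentation):
# A sketch assuming the optional streaming parameters of the Databricks display function
display(lines, checkpointLocation="/tmp/checkpoints/display_lines")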
Regarding the scoring: usually it's done by defining a user-defined function and applying it to the stream with the select or withColumn functions of the dataframe. The easiest way is to register a model in the MLflow Model Registry and then load the model as a UDF with the built-in functions, like:
import mlflow.pyfunc
pyfunc_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)
preds = lines.withColumn("predictions", pyfunc_udf(params...))
Look into that notebook for examples.

How to run parallel threads in AWS Glue PySpark?

I have a Spark job that just pulls data from multiple tables and applies the same transforms. It is basically a for loop that iterates across a list of tables, queries the catalog table, adds a timestamp, and then shoves the result into Redshift (example below).
This job takes around 30 minutes to complete. Is there a way to run these in parallel under the same Spark/Glue context? I don't want to create separate Glue jobs if I can avoid it.
import datetime
import os
import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql.functions import *

# query the runtime arguments
args = getResolvedOptions(
    sys.argv,
    ["JOB_NAME", "redshift_catalog_connection", "target_database", "target_schema"],
)

# build the job session and context
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# set the job execution timestamp
job_execution_timestamp = datetime.datetime.utcnow()

tables = []

for table in tables:
    catalog_table = glueContext.create_dynamic_frame.from_catalog(
        database="test", table_name=table, transformation_ctx=table
    )
    data_set = catalog_table.toDF().withColumn(
        "batchLoadTimestamp", lit(job_execution_timestamp)
    )
    # convert back to glue dynamic frame
    export_frame = DynamicFrame.fromDF(data_set, glueContext, "export_frame")
    # remove null rows from dynamic frame
    non_null_records = DropNullFields.apply(
        frame=export_frame, transformation_ctx="non_null_records"
    )
    temp_dir = os.path.join(args["TempDir"], redshift_table_name)
    stores_redshiftSink = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=non_null_records,
        catalog_connection=args["redshift_catalog_connection"],
        connection_options={
            "dbtable": f"{args['target_schema']}.{redshift_table_name}",
            "database": args["target_database"],
            "preactions": f"truncate table {args['target_schema']}.{redshift_table_name};",
        },
        redshift_tmp_dir=temp_dir,
        transformation_ctx="stores_redshiftSink",
    )
You can do the following things to make this process faster:
Enable concurrent execution of the job.
Allot a sufficient number of DPUs.
Pass the list of tables as a job parameter.
Execute the job in parallel using Glue workflows or Step Functions.
Now suppose you have 100 tables to ingest: you can divide the list into chunks of 10 tables each and run the job concurrently 10 times (see the sketch below).
Since your data will be loaded in parallel, the total Glue job run time will decrease, and hence less cost will be incurred.
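A rough sketch of that pattern: split the table list into chunks and start one concurrent run per chunk (the job name, argument name, and chunk size below are assumptions; this relies on boto3's start_job_run and on concurrent runs being enabled for the job):
import boto3

tables = [f"table_{n}" for n in range(100)]  # placeholder table names
chunk_size = 10
chunks = [tables[i:i + chunk_size] for i in range(0, len(tables), chunk_size)]

glue = boto3.client("glue")
for chunk in chunks:
    # Each run gets its own slice of tables via a job argument, which the job
    # can read with getResolvedOptions and split on ","
    glue.start_job_run(
        JobName="my-ingest-job",                      # placeholder job name
        Arguments={"--table_list": ",".join(chunk)},  # placeholder argument name
    )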
An alternate approach that will be way faster is to use the Redshift COPY command directly.
Create the table in Redshift and default the batchLoadTimestamp column to current_timestamp.
Now create the COPY command and load data into the table directly from S3.
Run the COPY command from a Glue Python shell job using pg8000 (a sketch follows below).
Why will this approach be faster?
Because the Spark Redshift JDBC connector first unloads the Spark dataframe to S3 and then issues a COPY command against the Redshift table. By running the COPY command directly, you remove the overhead of the unload step and of reading the data into a Spark dataframe in the first place.
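A minimal sketch of such a Python shell job, assuming pg8000's DB-API style connect and using placeholder connection details, S3 path, and IAM role:
import pg8000

# All connection details, the S3 path, and the IAM role below are placeholders
conn = pg8000.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    database="mydb",
    user="myuser",
    password="mypassword",
)
cursor = conn.cursor()
cursor.execute("""
    COPY target_schema.target_table
    FROM 's3://my-bucket/my-prefix/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    FORMAT AS PARQUET;
""")
conn.commit()
conn.close()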

How to get the output from console streaming sink in Zeppelin?

I'm struggling to get the console sink working with PySpark Structured Streaming when run from Zeppelin. Basically, I'm not seeing any results printed to the screen, or to any logfiles I've found.
My question: Does anyone have a working example of using PySpark Structured Streaming with a sink that produces output visible in Apache Zeppelin? Ideally it would also use the socket source, as that's easy to test with.
I'm using:
Ubuntu 16.04
spark-2.2.0-bin-hadoop2.7
zeppelin-0.7.3-bin-all
Python3
I've based my code on the structured_network_wordcount.py example. It works when run from the PySpark shell (./bin/pyspark --master local[2]); I see tables per batch.
%pyspark
# structured streaming
from pyspark.sql.functions import *

lines = spark\
    .readStream\
    .format('socket')\
    .option('host', 'localhost')\
    .option('port', 9999)\
    .option('includeTimestamp', 'true')\
    .load()

# Split the lines into words, retaining timestamps
# split() splits each line into an array, and explode() turns the array into multiple rows
words = lines.select(
    explode(split(lines.value, ' ')).alias('word'),
    lines.timestamp
)

# Group the data by window and word and compute the count of each group
windowedCounts = words.groupBy(
    window(words.timestamp, '10 seconds', '1 seconds'),
    words.word
).count().orderBy('window')

# Start running the query that prints the windowed word counts to the console
query = windowedCounts\
    .writeStream\
    .outputMode('complete')\
    .format('console')\
    .option('truncate', 'false')\
    .start()

print("Starting...")
query.awaitTermination(20)
I'd expect to see printouts of results for each batch, but instead I just see Starting..., and then False, the return value of query.awaitTermination(20).
In a separate terminal I'm entering some data into a nc -lk 9999 netcat session while the above is running.
The console sink is not a good choice for an interactive notebook-based workflow. Even in Scala, where the output can be captured, it requires an awaitTermination call (or an equivalent) in the same paragraph, effectively blocking the note.
%spark
spark
  .readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .option("includeTimestamp", "true")
  .load()
  .writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination() // Block execution, to force Zeppelin to capture the output
The chained awaitTermination could be replaced with a standalone call in the same paragraph; that would work as well:
%spark
val query = df
.writeStream
...
.start()
query.awaitTermination()
Without it, Zeppelin has no reason to wait for any output. PySpark just adds another problem on top of that - indirect execution. Because of that, even blocking the query won't help you here.
Moreover, continuous output from the stream can cause rendering issues and memory problems when browsing the note (it might be possible to use the Zeppelin display system via InterpreterContext or the REST API to achieve a bit more sensible behavior, where the output is overwritten or periodically cleared).
A much better choice for testing with Zeppelin is the memory sink. This way you can start a query without blocking:
%pyspark
query = (windowedCounts
    .writeStream
    .outputMode("complete")
    .format("memory")
    .queryName("some_name")
    .start())
and query the result on demand in another paragraph:
%pyspark
spark.table("some_name").show()
It can be coupled with reactive streams or a similar solution to provide interval-based updates, as sketched below.
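A very simple stand-in, if you do not want to pull in a reactive library, is to poll the in-memory table on an interval (the iteration count and sleep interval below are arbitrary):
%pyspark
import time

# Re-query the memory sink table every few seconds for a fixed number of rounds
for _ in range(10):
    spark.table("some_name").show(truncate=False)
    time.sleep(5)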
It is also possible to use StreamingQueryListener with Py4j callbacks to couple rx with onQueryProgress events, although query listeners are not supported in PySpark and it requires a bit of code to glue things together. Scala interface:
package com.example.spark.observer

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

trait PythonObserver {
  def on_next(o: Object): Unit
}

class PythonStreamingQueryListener(observer: PythonObserver)
    extends StreamingQueryListener {
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    observer.on_next(event)
  }
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
}
build a jar, adjusting the build definition to reflect the desired Scala and Spark versions:
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-sql" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion
)
put it on the Spark classpath, patch StreamingQueryManager:
%pyspark
from pyspark.sql.streaming import StreamingQueryManager
from pyspark import SparkContext

def addListener(self, listener):
    jvm = SparkContext._active_spark_context._jvm
    jlistener = jvm.com.example.spark.observer.PythonStreamingQueryListener(
        listener
    )
    self._jsqm.addListener(jlistener)
    return jlistener

StreamingQueryManager.addListener = addListener
start callback server:
%pyspark
sc._gateway.start_callback_server()
and add the listener:
%pyspark
from rx.subjects import Subject

class StreamingObserver(Subject):
    class Java:
        implements = ["com.example.spark.observer.PythonObserver"]

observer = StreamingObserver()
spark.streams.addListener(observer)
Finally you can use subscribe and block execution:
%pyspark
(observer
    .map(lambda p: p.progress().name())
    # .filter() can be used to print only for a specific query
    .subscribe(lambda n: spark.table(n).show() if n else None))

input()  # Block execution to capture the output
The last step should be executed after you started streaming query.
It is also possible to skip rx and use a minimal observer like this:
class StreamingObserver(object):
    class Java:
        implements = ["com.example.spark.observer.PythonObserver"]

    def on_next(self, value):
        try:
            name = value.progress().name()
            if name:
                spark.table(name).show()
        except:
            pass
It gives a bit less control than the Subject (one caveat is that it can interfere with other code printing to stdout, and it can be stopped only by removing the listener; with Subject you can easily dispose of a subscribed observer once you are done), but otherwise it should work more or less the same.
Note that any blocking action will be sufficient to capture the output from the listener and it doesn't have to be executed in the same cell. For example
%pyspark
observer = StreamingObserver()
spark.streams.addListener(observer)
and
%pyspark
import time
time.sleep(42)
would work in a similar way, printing the table for a defined time interval.
For completeness, you can implement StreamingQueryManager.removeListener, for example along these lines:
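A sketch of what that patch could look like, mirroring the addListener patch above (it expects the Java-side listener object that the patched addListener returned):
%pyspark
from pyspark.sql.streaming import StreamingQueryManager

def removeListener(self, jlistener):
    # jlistener is the Java listener returned by the patched addListener above,
    # so keep that return value around if you intend to remove the listener later
    self._jsqm.removeListener(jlistener)

StreamingQueryManager.removeListener = removeListener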
zeppelin-0.7.3-bin-all uses Spark 2.1.0 (so no rate format to test Structured Streaming with unfortunately).
Make sure that when you start a streaming query with socket source nc -lk 9999 has already been started (as the query simply stops otherwise).
Also make sure that the query is indeed up and running.
val lines = spark
  .readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load

val q = lines.writeStream.format("console").start
It's indeed true that you won't be able to see the output in a Zeppelin notebook possibly because:
Streaming queries start on their own threads (that seems to be outside Zeppelin's reach)
console sink writes to standard output (uses Dataset.show operator on that separate thread).
All this makes "intercepting" the output not available in Zeppelin.
So we come to answer the real question:
Where is the standard output written to in Zeppelin?
Well, with a very limited understanding of the internals of Zeppelin, I thought it could be logs/zeppelin-interpreter-spark-[hostname].log, but unfortunately I could not find the output from the console sink there. That is where you can find the logs from Spark (and Structured Streaming in particular) that go through log4j, but the console sink does not use log4j.
It looks as if your only long-term solution is to write your own console-like custom sink that uses a log4j logger. Honestly, that is not as hard as it sounds. Follow the sources of the console sink.

Luigi doesn't work as expected with Spark & Redshift

I'm running an EMR Spark cluster (using YARN) and I'm running Luigi tasks directly from the EMR master. I have a chain of jobs that depends on data in S3 and, after a few SparkSubmitTasks, eventually ends up in Redshift.
import luigi
import luigi.format
from luigi.contrib.spark import SparkSubmitTask
from luigi.contrib.redshift import RedshiftTarget

class SomeSparkTask(SparkSubmitTask):
    # Stored in /etc/luigi/client.cfg
    host = luigi.Parameter(default='host')
    database = luigi.Parameter(default='database')
    user = luigi.Parameter(default='user')
    password = luigi.Parameter(default='password')
    table = luigi.Parameter(default='table')
    <add-more-params-here>

    app = '<app-jar>.jar'
    entry_class = '<path-to-class>'

    def app_options(self):
        return <list-of-options>

    def output(self):
        return RedshiftTarget(host=self.host, database=self.database, user=self.user, password=self.password,
                              table=self.table, update_id=<some-unique-identifier>)

    def requires(self):
        return AnotherSparkSubmitTask(<params>)
I'm running into two main problems:
1) Sometimes luigi isn't able to determine when a SparkSubmitTask is done - for example, I'll see that luigi submits a job, then check YARN, which will say that the application is running, but once it's done, luigi just hangs and isn't able to determine that the job is done.
2) If for whatever reason the SparkSubmitTasks are able to run and the Task I've placed above finishes the Spark Job, the output task is never run and the marker-table is never created nor populated. However, the actual table is created in the Spark Job that's run. Am I misunderstanding how I'm supposed to call the RedshiftTarget?
In the meantime I'm trying to get acquainted with the source code.
Thanks!
I dropped the use of Luigi in my Spark applications because all my data is now streamed into S3, and I only need one large monolithic application to run all my Spark aggregations so I can take advantage of intermediate results/caching.
