Executing local (driver) code iteratively on Spark DataFrame items - apache-spark

I am using Spark, Dataframes and Python.
Let's say I have a quite large dataframe, with every row containing some JPG images as binary data. I want to build some kind of browser to display the images sequentially.
I have a view function that takes a single row as input and does something like this:
def view(row):
    window = popup_window_that_display_image(row.image)
    waitKey()
    destroy_window(window)
The following code works fine with spark-submit option --master local[*]:
df = load_and_compute_dataframe(context, some_arguments)
df.foreach(view)
Obviously, the view function cannot run on remote Spark executors. So the above code fails in yarn-client mode.
I can use the following code to make it work in yarn-client mode:
df = load_and_compute_dataframe(context, some_arguments)
data = df.limit(10).collect()
for x in data:
    view(x)
The drawback is that I can only collect a few items at a time. The data is too large to collect more than 10 or 100 items at once.
So my questions are:
Is there a way to have some DF/RDD operation execute locally on the driver, instead of on the executors?
Is there something that allows me to collect 10 items from a DF, starting from the 11th? Should I try to add an "ID" column to my DF and iterate over it (ugly)? A rough sketch of that idea is shown below.
Any other way to achieve this result?
Thanks for your help!
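For reference, here is a minimal sketch of how the "ID column" idea might look, reusing df and view from the code above. The unpartitioned window funnels all rows through a single task, so treat this as an illustration rather than a recommendation for huge data:

from pyspark.sql import functions as F, Window

# Assign a sequential id, then collect one page of rows at a time.
# monotonically_increasing_id roughly preserves the original row order.
w = Window.orderBy(F.monotonically_increasing_id())
indexed = df.withColumn("row_id", F.row_number().over(w))

page_size, offset = 10, 10   # e.g. items 11 to 20
page = (indexed
        .filter((F.col("row_id") > offset) & (F.col("row_id") <= offset + page_size))
        .collect())
for row in page:
    view(row)

Depending on the Spark version, df.toLocalIterator() is another option for pulling rows to the driver one partition at a time.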

Related

For Loop keeps restarting in EMR (pyspark)

I have a nested for loop that performs operations on a data frame 10 times in the inner loop and joins the resulting 10 data frames into a single data frame once it finishes the inner loop.
UPDATE: I use a dictionary to create a list of dataframes to store each operation in and then union them at the end of the inner loop.
It then writes the result to a parquet file named with the iteration number of the outer loop.
The outer loop has 6 iterations and therefore should result in 6 parquet files.
It goes something like this:
train = 0
for i in range(0, 6):
    train = train + 30
    # For loop to aggregate input and create 10 output dataframes
    dfnames = {}
    for j in range(0, 10):
        ident = "_" + str(j)
        # Load dataframe of around 1M rows
        df = spark.read.parquet("s3://path")
        dfnames['df' + ident] = # Perform aggregations and operations
    # Combine the 10 dataframes into a single df
    df_out = df_1.unionByName(df_2).unionByName(df_3)...unionByName(df_10)
    # Write to output parquet file
    df_out.write.mode('overwrite').parquet("s3://path/" + str(train) + ".parquet")
It seems to be working fine until it finishes the 3rd iteration of the outer loop. Then for some reason, it restarts the loop with another attempt id.
So I get the first 3 files, but instead of going on to the 4th iteration, it restarts and produces the first file all over again. I don't get any failed stages or jobs.
I have tried running the for loops alone with dummy variables and print statements (without loading the large data frames etc) and they work fine to completion.
I am thinking it has something to do with the way the memory is being flushed after a loop.
These are my EMR Spark running conditions:
I am running this on an EMR cluster with 5 executors, 5 driver nodes, and 10 instances with a total of 50 cores. The spark executor and driver memory is 45G each with a total of about 583G.
The typical shuffle read is 250G and shuffle write is 331G.
Some of the pertinent Spark environment variables are shown below:
Is there something I am doing wrong with regards to the loop or memory management?
Any insight would be greatly appreciated!
Try not to combine Python data structures with Spark data structures.
You want to convert the for loops into a map-reduce / foreach style of design.
Along with this, you can create a cache / Spark checkpoint in each iteration to avoid rerunning the entire DAG from scratch.
To cache your data:
df.cache()
For checkpointing:
spark.sparkContext.setCheckpointDir('<some path>')
df.checkpoint()
These will show performance and scale improvements once you use Spark constructs instead of Python constructs. For example, replace your for loop with foreach, and replace the union of a list with a map-reduce.
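As a minimal, self-contained sketch of per-iteration checkpointing (toy data and a local checkpoint directory, not the questioner's actual job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # any reliable storage works

df = spark.range(0, 1000)
for i in range(3):
    # A per-iteration transformation that would otherwise keep growing the plan.
    df = df.withColumn("id", df["id"] + 1)
    df = df.checkpoint()   # materializes df and truncates its lineage
    print(i, df.count())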
How are you getting your df1, df2... before this line?
# Combine the 10 dataframes into a single df
df_out = df_1.unionByName(df_2).unionByName(df_3)...unionByName(df_10)
My guess is, your dataframe plan is growing big and that might be causing issues.
I would suggest creating a list of dataframes in the inner loop and using the reduce method to union them.
Something like below:
from functools import reduce
from pyspark.sql import DataFrame

df_list = []
for j in range(0, 10):
    # Load dataframe of around 1M rows
    df = spark.read.parquet("s3://path")
    transformed_df = # do your transforms
    df_list.append(transformed_df)

final_df = reduce(DataFrame.unionByName, df_list)
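If the growing union plan is indeed the issue, each outer iteration could also truncate lineage before the write. This continuation is purely illustrative: the S3 path and the train variable mirror the question, and a checkpoint directory is assumed to have been set as shown earlier.

# Hypothetical continuation inside the outer loop: break lineage, then write.
final_df = final_df.checkpoint()
final_df.write.mode("overwrite").parquet("s3://path/" + str(train) + ".parquet")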

DropDuplicates in PySpark gives stackoverflowerror

I have a PySpark program which reads a JSON file of around 350-400 MB and creates a dataframe out of it.
In my next step, I create a Spark SQL query using createOrReplaceTempView and select a few columns as required.
Once this is done, I filter my dataframe with some conditions. It was working fine up to this point.
Now, I needed to remove some duplicate values based on a column. So I introduced
dropDuplicates in the next step, and it suddenly started giving me a StackOverflowError.
Below is the sample code:
def create_some_df(initial_df):
    initial_df.createOrReplaceTempView('data')
    original_df = spark.sql('select c1,c2,c3,c4 from data')
    ## Filter out some events
    original_df = original_df.filter(filter1condition)
    original_df = original_df.filter(filter2condition)
    original_df = original_df.dropDuplicates(['c1'])
    return original_df
It worked fine until I added dropDuplicates method.
I am using a 3-node AWS EMR cluster (c5.2xlarge).
I am running PySpark using the spark-submit command in YARN client mode.
What I have tried
I tried adding persist and cache before calling filter, but it didn't help.
EDIT - Some more details
I realise that the error appears when I invoke my write function after multiple transformations, i.e. the first action.
If I have dropDuplicates in my transformations before I write, it fails with the error.
If I do not have dropDuplicates in my transformations, the write works fine.
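Not from the original post, but a low-cost first step would be to print the query plan with and without dropDuplicates before the failing write, to see what changes in the chain of transformations (the names below come from the question's snippet):

# Hypothetical diagnostic: compare the plans with and without dropDuplicates.
df_with_dedup = create_some_df(initial_df)
df_with_dedup.explain(True)   # prints parsed, analyzed, optimized and physical plans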

Parallelize SparkSession in PySpark

I would like to do calculations to get the top 5 keywords in each country. Inside the method that gets the top 5 keywords, is there any way I can parallelize SparkSessions?
Now I am doing:
country_mapping_df.rdd.map(lambda country_tuple: get_top_5_keywords(country_tuple))
def get_top_5_keywords(country_tuple):
    result1 = spark.sql("""sample""")
    result1.write_to_s3
This is not working! Does anyone know how to make it work?
Spark does not support two contexts/sessions running concurrently in the same program, hence you cannot parallelize SparkSessions.
source: https://spark.apache.org/docs/2.4.0/rdd-programming-guide.html#unit-testing
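Not part of the original answer, but to illustrate a single-session way of getting the top 5 keywords per country, a window function can do the ranking without launching queries from the executors. The column names and sample data here are assumptions, not from the question:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("top5-demo").getOrCreate()
keywords_df = spark.createDataFrame(
    [("US", "spark", 10), ("US", "python", 7), ("FR", "spark", 3)],
    ["country", "keyword", "cnt"],
)

# Rank keywords within each country by count, then keep the top 5.
w = Window.partitionBy("country").orderBy(F.desc("cnt"))
top5 = (keywords_df
        .withColumn("rank", F.row_number().over(w))
        .filter(F.col("rank") <= 5))
top5.write.mode("overwrite").parquet("/tmp/top5_keywords")  # or an S3 path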

How to run query for each record in a dataframe?

I have a scenario where I need to run a query for each record of a dataframe. I am running in spark-shell, Spark 1.6. I tried df.rdd.map(row => sqlContext.sql("...")), but it is not working. Any thoughts on this?
Use RDD.collect to collect the data (to the driver) and map over every row to execute an SQL query for each.
df.rdd.collect.map(row => sqlContext.sql("..."))
That may or may not work given the size of the data and the memory available on the driver.
The reason df.rdd.map(row => sqlContext.sql("...")) didn't work is that you were trying to submit a query from the executors as part of map, which won't work since queries have to be executed on the driver.

Not able to set number of shuffle partition in pyspark

I know that by default, the number of partitions for shuffle tasks is set to 200 in Spark. I can't seem to change this. I'm running Jupyter with Spark 1.6.
I'm loading a fairly small table with about 37K rows from Hive using the following in my notebook:
from pyspark.sql.functions import *
sqlContext.sql("set spark.sql.shuffle.partitions=10")
test= sqlContext.table('some_table')
print test.rdd.getNumPartitions()
print test.count()
The output confirms 200 tasks. From the activity log, it's spinning up 200 tasks, which is overkill. It seems like the second line above (the set statement) is ignored. So I tried the following:
test = sqlContext.table('gfcctdmn_work.icgdeskrev_emma_cusip_activity_bw').repartition(5)
and create a new cell:
print test.rdd.getNumPartitions()
print test.count()
The output shows 5 partitions, but the log shows 200 tasks being spun up for the count, with the repartition to 5 taking place afterwards. However, if I convert it first to an RDD, and back to a DataFrame as follows:
test = sqlContext.table('gfcctdmn_work.icgdeskrev_emma_cusip_activity_bw').repartition(5).rdd
and create a new cell:
print test.getNumPartitions()
print test.toDF().count()
The very first time I ran the new cell, it still ran with 200 tasks. However, the second time I ran the new cell, it ran with 5 tasks.
How can I make the code run with 5 tasks the very first time it's running?
Would you mind explaining why it behaves this way (specifying the number of partitions, but still running under default settings)? Is it because the Hive table was created with 200 partitions by default?
At the beginning of your notebook, do something like this:
from pyspark import SparkContext
from pyspark.conf import SparkConf

sc.stop()
conf = SparkConf().setAppName("test")
conf.set("spark.default.parallelism", 10)
sc = SparkContext(conf=conf)
When the notebook starts, a SparkContext has already been created for you, but you can still change the configuration and recreate it.
As for spark.default.parallelism, I understand it is what you need; take a look here:
Default number of partitions in RDDs returned by transformations like
join, reduceByKey, and parallelize when not set by user.
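As a quick sanity check (not in the original answer), the recreated context's settings can be read back; for DataFrame shuffles specifically, spark.sql.shuffle.partitions is the value Spark SQL consults, and a new SQLContext has to be built on top of the recreated SparkContext:

from pyspark.sql import SQLContext

# Sketch: confirm the settings the recreated context and the SQL layer actually use.
sqlContext = SQLContext(sc)
print(sc.getConf().get("spark.default.parallelism"))
print(sqlContext.getConf("spark.sql.shuffle.partitions", "200"))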
