Parallelize SparkSession in PySpark - apache-spark

I would like to compute the top 5 keywords for each country, and inside the method that gets the top 5 keywords I run Spark SQL. Is there any way I can parallelize SparkSessions?
Now I am doing:
country_mapping_df.rdd.map(lambda country_tuple: get_top_5_keywords(country_tuple))

def get_top_5_keywords(country_tuple):
    result = spark.sql("""sample""")
    result.write_to_s3
which is not working! Does anyone know how to make this work?

Spark does not support two contexts/sessions running concurrently in the same program, hence you cannot parallelize SparkSessions. The SparkSession exists only on the driver, so you cannot call spark.sql (or write results) from inside a function that rdd.map runs on the executors.
source: https://spark.apache.org/docs/2.4.0/rdd-programming-guide.html#unit-testing
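If the goal is just "top 5 keywords per country, written to S3", one driver-side alternative is to let Spark do the per-country ranking with a window function instead of calling spark.sql inside rdd.map. A minimal sketch, assuming a keyword_counts table with country, keyword and count columns and an S3 output path (all names here are hypothetical):
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical input: one row per (country, keyword) with an aggregated count
keywords_df = spark.table("keyword_counts")

# rank keywords within each country and keep the top 5
w = Window.partitionBy("country").orderBy(F.col("count").desc())
top5 = (keywords_df
        .withColumn("rank", F.row_number().over(w))
        .filter(F.col("rank") <= 5)
        .drop("rank"))

# one write issued from the driver instead of one write per country inside map()
top5.write.mode("overwrite").partitionBy("country").parquet("s3://your-bucket/top5-keywords/")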

Related

How to use Pyspark's csv reader on every element of Pyspark RDD? (without "reference SparkContext from a broadcast variable")

I want to use PySpark to read in hundreds of CSV files and create a single dataframe that is (roughly) the concatenation of all the CSVs. Since each CSV can fit in memory, but not more than one or two at a time, this seems like a good fit for PySpark. My strategy is not working, and I think it is because I am trying to create a PySpark dataframe inside the kernel function of my map call, which results in an error:
from pyspark.sql import SparkSession

# initiate spark session and other variables
sc = SparkSession.builder.master("local").appName("Test").config(
    "spark.driver.bindAddress", "127.0.0.1").getOrCreate()

file_path_list = [path1, path2]  # list of string path variables

# make an rdd object so i can use .map:
rdd = sc.sparkContext.parallelize(file_path_list)

# make a kernel function for my future .map() application
def kernel_f(path):
    df = sc.read.options(delimiter=",", header=True).csv(path)
    return df

# apply .map
rdd2 = rdd.map(kernel_f)

# see first dataframe (so excited)
rdd2.take(2)[0].show(3)
this throws an error:
PicklingError: Could not serialize object: RuntimeError: It appears
that you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on
the driver, not in code that it run on workers. For more information,
see SPARK-5063.
My next step (supposing no error had appeared) was to use a reduce step to concatenate all the members (dataframes with the same schema) of that rdd2.
It seems related to this post but I don't understand the answer.
Questions:
I think this means that since my kernel_f calls sc. methods, it is against the rules. Is that right?
I (think I) could use plain-old Python (not PySpark) map to apply kernel_f to my file_path_list, then use plain-old functools.reduce to concatenate all of these into a single PySpark dataframe, but that doesn't seem to leverage PySpark much. Does this seem like a good route?
Can you teach me a good, ideally a "tied-for-best" way to do this?
I don't have a definitive answer, just comments that might help. First off, I think the easiest way to do this is to read the CSVs with a wildcard, as shown here.
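For example, a minimal sketch of the wildcard approach (the directory path is an assumption):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

# one read call covers every matching file; Spark parallelizes the work itself
df = spark.read.options(delimiter=",", header=True).csv("/path/to/csvs/*.csv")

# .csv() also accepts a list of paths, so the existing file_path_list works directly:
# df = spark.read.options(delimiter=",", header=True).csv(file_path_list)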
A Spark cluster is composed of the scheduler (the driver) and the workers. You use the SparkSession to pass work to the scheduler. It seems workers are not allowed to send work to the scheduler, presumably because that would be an anti-pattern in a lot of use cases.
The design pattern is also odd here because you are not actually passing a DataFrame back. Spark operations are lazy, unlike Pandas, so that read would not happen immediately. I feel like if it worked, it would pass back a DAG, not data.
That wouldn't be good anyway, because you want the loading of files to be lazy. Given that you can't use Spark to read on a worker, you'd have to use Pandas/plain Python, which evaluates immediately, and you would run out of memory even sooner trying this.
Speaking of memory, Spark lets you perform out-of-core (larger-than-memory) computation, but there are limits to how much larger than the available memory the data can be. You will inevitably run into errors if you really don't have enough memory by a considerable margin.
I think you should use the wildcard as shown above.

Pyspark UDF in Java Spark Program

Is there any way I can use UDFs created in PySpark in a Java Spark job?
I know there is a way to use a Java UDF in PySpark, but I am looking for the other way round.
First, I have to say that I don't recommend doing that. It will add a lot of latency to the UDF, and I really suggest you try to write the UDF in Scala / Java.
If you still want to do that, here is how:
You should write a UDF that creates a Python interpreter (Jython, in the example below) and executes your code.
Here is a Scala code example:
import org.python.util.PythonInterpreter
import org.python.core.PyString

System.setProperty("python.import.site", "false")
val interpreter = new PythonInterpreter
interpreter.exec("from __builtin__ import *")

// get the built-in len function, which takes a string and returns its length
val someFunc = interpreter.get("len")
val result = someFunc.__call__(new PyString("Test!"))
val realResult = result.__tojava__(classOf[Integer]).asInstanceOf[Int]
print(realResult)
This code calls the Python len function on the string "Test!" and returns its result.
I really think it will hurt the performance of your job, and you should reconsider this plan.
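For completeness, the supported direction mentioned in the question (calling a Java UDF from PySpark) looks roughly like this on Spark 2.3+; the class, column and table names are hypothetical, and the JAR has to be shipped to the cluster (e.g. with --jars):
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# com.example.MyJavaUDF is a hypothetical class implementing org.apache.spark.sql.api.java.UDF1
spark.udf.registerJavaFunction("my_java_udf", "com.example.MyJavaUDF", IntegerType())

spark.sql("SELECT my_java_udf(some_column) FROM some_table").show()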

Multiples computations in one iteration with Spark

How can I iterate over a big collection of files producing different results in just one step with Spark? For example:
val tweets : RDD[Tweet] = ...
val topWords : RDD[String] = getTopWords(tweets)
val topHashtags : RDD[String] = getTopHashtags(tweets)
topWords.collect().foreach(println)
topHashtags.collect().foreach(println)
It looks like Spark is going to iterate twice over the tweets dataset. Is there any way to prevent this? Is Spark smart enough to make this kind of optimization?
Thanks in advance,
Spark will keep data it has loaded in memory as long as it can, but that's not something you should rely on, so your best bet is to call tweets.cache so that after the initial load it works off an in-memory store. The only other solution is to combine your two functions into one that returns a tuple of (resultType1, resultType2).
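In PySpark terms, a rough sketch of the cache idea (the loader and the two helpers are stand-ins for your own code):
tweets = load_tweets()                   # hypothetical function returning an RDD/DataFrame of tweets
tweets.cache()                           # materialized by the first action, reused by the second

top_words = get_top_words(tweets)        # assumed helper, mirrors getTopWords
top_hashtags = get_top_hashtags(tweets)  # assumed helper, mirrors getTopHashtags

for word in top_words.collect():
    print(word)
for hashtag in top_hashtags.collect():
    print(hashtag)
This still runs two jobs, but only the first one pays the cost of loading and preprocessing the tweets.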

Not able to set number of shuffle partition in pyspark

I know that by default the number of shuffle partitions is set to 200 in Spark. I can't seem to change this. I'm running Jupyter with Spark 1.6.
I'm loading a fairly small table with about 37K rows from Hive using the following in my notebook:
from pyspark.sql.functions import *
sqlContext.sql("set spark.sql.shuffle.partitions=10")
test = sqlContext.table('some_table')
print test.rdd.getNumPartitions()
print test.count()
The output confirms 200 tasks. From the activity log, it's spinning up 200 tasks, which is overkill. It seems like the second line above (the set statement) is ignored. So I tried the following:
test = sqlContext.table('gfcctdmn_work.icgdeskrev_emma_cusip_activity_bw').repartition(5)
and create a new cell:
print test.rdd.getNumPartitions()
print test.count()
The output shows 5 partitions, but the log shows 200 tasks being spun up for the count, with the repartition to 5 taking place afterwards. However, if I convert it first to an RDD and back to a DataFrame as follows:
test = sqlContext.table('gfcctdmn_work.icgdeskrev_emma_cusip_activity_bw').repartition(5).rdd
and create a new cell:
print test.getNumPartitions()
print test.toDF().count()
The very first time I ran the new cell, it's still running with 200 tasks. However, the second time I ran the new cell, it ran with 5 tasks.
How can I make the code run with 5 tasks the very first time it's running?
Would you mind explaining why it behaves this way (I specify the number of partitions, but it still runs under the default settings)? Is it because the default Hive table was created with 200 partitions?
At the beginning of your notebook, do something like this:
from pyspark import SparkContext
from pyspark.conf import SparkConf

sc.stop()
conf = SparkConf().setAppName("test")
conf.set("spark.default.parallelism", 10)
sc = SparkContext(conf=conf)
When the notebook starts, a SparkContext has already been created for you, but you can still change the configuration and recreate it.
As for spark.default.parallelism, I understand it is what you need; take a look here:
Default number of partitions in RDDs returned by transformations like
join, reduceByKey, and parallelize when not set by user.
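Since the 200 tasks come from DataFrame shuffles, it may also help to set spark.sql.shuffle.partitions (the setting the question already tries) on the configuration before anything runs. A sketch using the Spark 1.6-style API from the question, with a placeholder table name:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sc.stop()  # stop the context the notebook created for you

conf = (SparkConf()
        .setAppName("test")
        .set("spark.default.parallelism", "10")       # partitions for RDD operations
        .set("spark.sql.shuffle.partitions", "10"))   # partitions used by DataFrame shuffles

sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

test = sqlContext.table('some_table')
print test.count()  # check the job in the UI: the shuffle stage should now run far fewer than 200 tasks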

Executing local (driver) code iteratively on Spark DataFrame items

I am using Spark, Dataframes and Python.
Let's say I have quite a huge dataframe, with every row containing some JPG image as binary data. I want to build some kind of browser to display the images sequentially.
I have a view function that takes a single row as input and does something like this:
def view(row):
    window = popup_window_that_display_image(row.image)
    waitKey()
    destroy_window(window)
The following code works fine with spark-submit option --master local[*]:
df = load_and_compute_dataframe(context, some_arguments)
df.foreach(view)
Obviously, the view function cannot run on remote Spark executors. So the above code fails in yarn-client mode.
I can use the following code to work in yarn-client mode:
df = load_and_compute_dataframe(context, some_arguments)
data = df.limit(10).collect()
for x in data:
    view(x)
The drawback is that I can only collect a few items. Data is too huge to get more than 10 or 100 items at once.
So my questions are:
Is there a way to have some DF/RDD operation execute locally on the driver, instead of on the executors?
Is there something that allows me to collect 10 items from a DF, starting from the 11th? Should I try to add an "ID" column to my DF and iterate over it (ugly)?
Any other way to achieve this result?
Thanks for the help!
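A rough sketch of the "ID column" idea from the last question, assuming the rows can be ordered by some existing column (some_key here is a placeholder) so that the numbering is stable:
from pyspark.sql import Window
from pyspark.sql import functions as F

# assign contiguous row numbers; note that a global window pulls all rows through a single partition
w = Window.orderBy("some_key")
indexed = df.withColumn("_row_id", F.row_number().over(w))

batch_size = 10
total = indexed.count()

for start in range(1, total + 1, batch_size):
    batch = indexed.filter(
        (F.col("_row_id") >= start) & (F.col("_row_id") < start + batch_size)
    ).collect()
    for row in batch:
        view(row)
Depending on the Spark version, df.toLocalIterator() may also cover this, since it streams rows to the driver one partition at a time.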
