Pyspark - Loop n times - Each loop gets gradually slower - apache-spark

So basically I want to loop n times over my DataFrame and apply a function in each loop
(perform a join).
My test DataFrame has about 1000 rows, and in each iteration exactly one column is added.
The first three loops run instantly, and from then on it gets really, really slow.
The 10th loop, for example, needs more than 10 minutes.
I don't understand why this happens, because my DataFrame won't grow larger in terms of rows.
If I call my function with n=20, for example, the join runs instantly.
But when I loop iteratively 20 times, it soon gets stuck.
Do you have any idea what could be causing this problem?
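For context, the pattern in question looks roughly like this (a minimal sketch with made-up names, not the asker's actual code):

# Sketch of the loop described above (hypothetical names). Each pass joins a
# small derived frame back onto df, adding one column, so the logical plan
# grows with every iteration even though the row count stays flat.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).withColumnRenamed("id", "key")

for i in range(20):
    extra = df.select("key").withColumn(f"col_{i}", F.lit(i))
    df = df.join(extra, on="key", how="left")  # plan gets deeper every iteration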

Example code from Evaluating Spark DataFrame in loop slows down with every iteration, all work done by controller:
import time
from pyspark import SparkContext

sc = SparkContext()

def push_and_pop(rdd):
    # two transformations: moves the head element to the tail
    first = rdd.first()
    return rdd.filter(
        lambda obj: obj != first
    ).union(
        sc.parallelize([first])
    )

def serialize_and_deserialize(rdd):
    # perform a collect() action to evaluate the rdd and create a new instance
    return sc.parallelize(rdd.collect())

def do_test(serialize=False):
    rdd = sc.parallelize(range(1000))
    for i in range(25):
        t0 = time.time()
        rdd = push_and_pop(rdd)
        if serialize:
            rdd = serialize_and_deserialize(rdd)
        print("%.3f" % (time.time() - t0))

do_test()
# do_test(serialize=True) runs the same loop with the collect()-based workaround

I have fixed this issue by converting the DataFrame to an RDD and back to a DataFrame every n iterations.
The code runs fast now, but I don't fully understand the reason: the explain plan seems to grow very quickly across iterations if I don't do the conversion (a sketch of the workaround follows the quote below).
This fix is also suggested as a workaround in the book "High Performance Spark":
While the Catalyst optimizer is quite powerful, one of the cases where
it currently runs into challenges is with very large query plans.
These query plans tend to be the result of iterative algorithms, like
graph algorithms or machine learning algorithms. One simple workaround
for this is converting the data to an RDD and back to
DataFrame/Dataset at the end of each iteration
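Applied to a DataFrame loop, that workaround boils down to something like the following sketch (assuming a SparkSession `spark`; `do_one_iteration` is just a placeholder for the per-iteration join):

# Sketch of the plan-truncation workaround quoted above. Rebuilding the
# DataFrame from its RDD gives Catalyst a fresh, short lineage instead of one
# that keeps growing with every iteration.
def truncate_plan(df):
    return spark.createDataFrame(df.rdd, df.schema)

for i in range(20):
    df = do_one_iteration(df)   # placeholder for the join done in each loop
    if (i + 1) % 5 == 0:        # every n iterations, as described above
        df = truncate_plan(df)
# df.checkpoint() is an alternative; it requires spark.sparkContext.setCheckpointDir.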

Related

avoid shuffling and long plans on multiple joins in pyspark

I am doing multiple joins with the same DataFrame.
The DataFrames I am joining with are the result of a group by on my original DataFrame.
listOfCols = ["a","b","c",....]
for c in listOfCols:
    means = df.groupby(col(c)).agg(mean(target).alias(f"{c}_mean_encoding"))
    df = df.join(means, c, how="left")
This code produces more than 100,000 tasks and takes forever to finish.
I can see a lot of shuffling happening in the DAG.
How can I optimize this code?
Well, after a LOT of tries and failures, I came up with the fastest solution.
Instead of 1.5 hours, the job now runs in about 5 minutes.
I will put it here so if someone stumbles into it, he/she won't suffer as I did.
The solution was to use Spark SQL; it must be much more optimized internally than using the DataFrame API:
# build one big multi-join SQL statement instead of chaining DataFrame joins
df.registerTempTable("df")
left_join_string = ""
for c in listOfCols:
    means = df.groupby(F.col(c)).agg(F.mean(target).alias(f"{c}_mean_encoding"))
    means.registerTempTable(f"means_{c}")
    left_join_string += f" left join means_{c} on df.{c} = means_{c}.{c}"
df = sqlContext.sql("SELECT * FROM df" + left_join_string)
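For comparison, here is a sketch of how the same loop might be kept in the DataFrame API by broadcasting the small aggregate frames and occasionally checkpointing to cut the lineage. This is an alternative idea, not the accepted fix above; it assumes a SparkSession `spark`, and the checkpoint directory is only an example path:

from pyspark.sql import functions as F

# Each means frame has one row per distinct value of c, so a broadcast hint can
# avoid reshuffling df for every join; an occasional checkpoint keeps the query
# plan from growing without bound.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # example path
for i, c in enumerate(listOfCols):
    means = df.groupby(F.col(c)).agg(F.mean(target).alias(f"{c}_mean_encoding"))
    df = df.join(F.broadcast(means), on=c, how="left")
    if (i + 1) % 10 == 0:
        df = df.checkpoint()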

Spark and isolating time taken for tasks

I recently began to use Spark to process a huge amount of data (~1 TB), and have been able to get the job done. However, I am still trying to understand how it works. Consider the following scenario:
Set reference time (say tref)
Do any one of the following two tasks:
a. Read a large amount of data (~1 TB) from tens of thousands of files into RDDs using SciSpark, OR
b. Read the data as above, do additional preprocessing work, and store the results in a DataFrame
Print the size of the RDD or DataFrame as applicable, and the time difference with respect to tref (i.e., t0a/t0b)
Do some computation
Save the results
In other words, 1b creates a DataFrame after processing RDDs generated exactly as in 1a.
My query is the following:
Is it correct to infer that t0b – t0a = the time required for preprocessing? Where can I find a reliable reference for this?
Edit: explanation added for the origin of the question ...
My suspicion stems from Spark's lazy computation approach and its ability to perform asynchronous jobs. Can/does it initiate subsequent (preprocessing) tasks that are computed while thousands of input files are still being read? The origin of the suspicion is the unbelievable performance I see (with results verified to be okay), which looks too fantastic to be true.
Thanks for any reply.
I believe something like this could assist you (using Scala):
def timeIt[T](op: => T): Float = {
  val start = System.currentTimeMillis
  val res = op
  val end = System.currentTimeMillis
  (end - start) / 1000f
}

def XYZ = {
  val r00 = sc.parallelize(0 to 999999)
  val r01 = r00.map(x => (x, (x, x, x, x, x, x, x)))
  r01.join(r01).count()
}

val time1 = timeIt(XYZ)
// or like this on next line
// val timeN = timeIt(r01.join(r01).count())
println(s"bla bla $time1 seconds.")
You need to be creative and work incrementally with actions that cause actual execution; this has its limitations because of lazy evaluation and the like.
On the other hand, the Spark Web UI records every action, along with the Stage durations for that action.
In general, performance measuring in shared environments is difficult. With dynamic allocation in a noisy cluster you hold on to acquired resources for the duration of a Stage, but on successive runs of the same or the next Stage you may get fewer resources. The numbers are at least indicative, and you can run in a less busy period.
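A rough PySpark equivalent of the timing helper above (a sketch, assuming an existing SparkContext `sc`):

import time

def time_it(action):
    """Run a callable that triggers a Spark action and return the elapsed seconds."""
    start = time.time()
    action()
    return time.time() - start

rdd = sc.parallelize(range(1000000)).map(lambda x: (x, x))
elapsed = time_it(lambda: rdd.join(rdd).count())
print("join + count took %.3f seconds" % elapsed)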

How to optimize this short Spark Apache DataFrame function

The following function is supposed to join two DataFrames and return the number of checkouts per location. It is based on the Seattle Public Library data set.
def topKCheckoutLocations(checkoutDF: DataFrame, libraryInventoryDF: DataFrame, k: Int): DataFrame = {
  checkoutDF
    .join(libraryInventoryDF, "ItemType")
    .groupBy("ItemBarCode", "ItemLocation") // grouping by ItemBarCode and ItemLocation
    .agg(count("ItemBarCode")) // counting number of ItemBarCode for each ItemLocation
    .withColumnRenamed("count(ItemBarCode)", "NumCheckoutItemsAtLocation")
    .select($"ItemLocation", $"NumCheckoutItemsAtLocation")
}
When I run this, it takes ages to finish (40+ minutes), and I'm pretty sure it is not supposed to take more than a couple of minutes. Can I change the order of the calls to decrease computation time?
As I never managed to finish the computation, I never actually got to check whether the output is correct; I assume it is.
checkoutDF has about 3 million rows.
For Spark job performance:
Select only the required columns from each dataset before the join to decrease the data size.
Partition both datasets by the join column ("ItemType") to avoid shuffling.
A sketch combining both points follows below.
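Put together, and expressed in PySpark for consistency with the rest of this page, the two suggestions might look like this sketch (it assumes ItemBarCode lives in checkoutDF and ItemLocation in libraryInventoryDF):

from pyspark.sql import functions as F

# Sketch: prune each DataFrame to the columns actually needed, then repartition
# both sides on the join key before joining. k is unused here, as in the
# original function.
def top_k_checkout_locations(checkout_df, library_inventory_df, k):
    checkouts = checkout_df.select("ItemType", "ItemBarCode").repartition("ItemType")
    inventory = library_inventory_df.select("ItemType", "ItemLocation").repartition("ItemType")
    return (
        checkouts
        .join(inventory, "ItemType")
        .groupBy("ItemBarCode", "ItemLocation")
        .agg(F.count("ItemBarCode").alias("NumCheckoutItemsAtLocation"))
        .select("ItemLocation", "NumCheckoutItemsAtLocation")
    )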

How to reduce time taken by to convert dask dataframe to pandas dataframe

I have a function that reads large CSV files into a Dask dataframe and then converts it to a pandas dataframe, which takes quite a lot of time. The code is:
def t_createdd(Path):
    dataframe = dd.read_csv(Path, sep=chr(1), encoding="utf-16")
    return dataframe

# Get the latest file
Array_EXT = "Export_GTT_Tea2Array_*.csv"
array_csv_files = sorted([file
                          for path, subdir, files in os.walk(PATH)
                          for file in glob(os.path.join(path, Array_EXT))])

latest_Tea2Array = array_csv_files[(len(array_csv_files) - (58 + 25)):
                                   (len(array_csv_files) - 58)]

Tea2Array_latest = t_createdd(latest_Tea2Array)

# keep only the required columns
Tea2Array = Tea2Array_latest[['Parameter_Id', 'Reading_Id', 'X', 'Value']]

P1MI3 = Tea2Array.loc[Tea2Array['parameter_id'] == 168566]
P1MI3 = P1MI3.compute()
P1MJC_main = Tea2Array.loc[Tea2Array['parameter_id'] == 168577]
P1MJC_old = P1MJC_main.compute()
P1MI3 = P1MI3.compute() and P1MJC_old = P1MJC_main.compute() take around 10 and 11 minutes respectively to execute. Is there any way to reduce the time?
I would encourage you to consider, with reference to the Dask documentation, why you would expect the process to be any faster than using Pandas alone.
Consider:
file access may happen from several threads, but there is only one disk-interface bottleneck, and it likely performs much better reading sequentially than trying to read several files in parallel
reading CSVs is CPU-heavy and needs the Python GIL, so the multiple threads will not actually be running in parallel
when you compute, you materialise the whole dataframe. It is true that you appear to be selecting a single row in each case, but Dask has no way to know in which file/part it is.
you call compute twice but could have combined them: Dask works hard to evict data from memory that is not currently needed by any computation, so you do double the work. By calling compute on both outputs at once, you would roughly halve the time.
Further remarks:
obviously you would do much better if you knew which partition contained what
you can get around the GIL using processes, e.g., Dask's distributed scheduler
if you only need certain columns, do not bother to load everything and then sub-select; include those columns right in the read_csv call, saving a lot of time and memory (true for Pandas or Dask); see the combined sketch below
To compute both lazy things at once:
dask.compute(P1MI3, P1MJC_main)
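Putting the remarks together, a sketch of the reworked read: column pruning in read_csv plus a single combined compute. It reuses the `latest_Tea2Array` file list from the question, and the column-name casing follows the column list given there:

import dask
import dask.dataframe as dd

# Read only the columns that are needed, then evaluate both selections in a
# single pass over the files.
cols = ['Parameter_Id', 'Reading_Id', 'X', 'Value']
ddf = dd.read_csv(latest_Tea2Array, sep=chr(1), encoding="utf-16", usecols=cols)

p1mi3_lazy = ddf.loc[ddf['Parameter_Id'] == 168566]
p1mjc_lazy = ddf.loc[ddf['Parameter_Id'] == 168577]
P1MI3, P1MJC_old = dask.compute(p1mi3_lazy, p1mjc_lazy)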

Lazy foreach on a Spark RDD

I have a big RDD of Strings (obtained through a union of several sc.textFile(...)).
I now want to search for a given string in that RDD, and I want the search to stop when a "good enough" match has been found.
I could retrofit foreach, or filter, or map for this purpose, but all of these will iterate through every element in that RDD, regardless of whether a match has been found.
Is there a way to short-circuit this process and avoid iterating through the whole RDD?
I could retrofit foreach, or filter, or map for this purpose, but all of these will iterate through every element in that RDD
Actually, you're wrong. The Spark engine is smart enough to optimize computations if you limit the results (using take or first):
import numpy as np

np.random.seed(323)

acc = sc.accumulator(0)

def good_enough(x, threshold=7000):
    global acc
    acc += 1
    return x > threshold

rdd = sc.parallelize(np.random.randint(0, 10000) for i in range(1000000))
x = rdd.filter(good_enough).first()
Now let's check the accumulator:
>>> print("Checked {0} items, found {1}".format(acc.value, x))
Checked 6 items, found 7109
and just to be sure everything works as expected:
acc = sc.accumulator(0)
rdd.filter(lambda x: good_enough(x, 100000)).take(1)
assert acc.value == rdd.count()
The same thing could be done, probably in a more efficient manner, using DataFrames and a UDF.
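A minimal sketch of that DataFrame variant (it assumes a SparkSession `spark`; the column name and threshold are just placeholders, and a plain column predicate is used instead of a UDF):

from pyspark.sql import functions as F

# filter + limit(1) lets Spark stop scanning once a matching row has been
# produced, much like filter(...).first() on an RDD.
df = spark.range(1000000).withColumn("value", (F.rand(seed=323) * 10000).cast("int"))
match = df.filter(F.col("value") > 7000).limit(1).collect()
print(match)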
Note: In some cases it is even possible to use an infinite sequence in Spark and still get a result. You can check my answer to Spark FlatMap function for huge lists for an example.
Not really. There is no find method, as in the Scala collections that inspired the Spark APIs, which would stop looking once an element is found that satisfies a predicate. Probably your best bet is to use a data source that will minimize excess scanning, like Cassandra, where the driver pushes down some query parameters. You might also look at the more experimental Berkeley project called BlinkDB.
Bottom line, Spark is designed more for scanning data sets, like MapReduce before it, rather than traditional database-like queries.
