Spark driver memory for rdd.saveAsNewAPIHadoopFile and workarounds - apache-spark

I'm having issues with a particular spark method, saveAsNewAPIHadoopFile. The context is that I'm using pyspark, indexing RDDs with 1k, 10k, 50k, 500k, 1m records into ElasticSearch (ES).
For a variety of reasons, the Spark context is quite underpowered with a 2gb driver, and single 2gb executor.
I've had no problem until about 500k, when I'm getting java heap size problems. Increasing the spark.driver.memory to about 4gb, and I'm able to index more. However, there is a limit to how long this will work, and we would like to index in upwards of 500k, 1m, 5m, 20m records.
Also constrained to using pyspark, for a variety of reasons. The bottleneck and breakpoint seems to be a spark stage called take at SerDeUtil.scala:233, that no matter how many partitions the RDD has going in, it drops down to one, which I'm assuming is the driver collecting the partitions and preparing for indexing.
Now - I'm wondering if there is an efficient way to still use an approach like the following, given that constraint:
to_index_rdd.saveAsNewAPIHadoopFile(
path='-',
outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
keyClass="org.apache.hadoop.io.NullWritable",
valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
conf={
"es.resource":"%s/record" % index_name,
"es.nodes":"192.168.45.10:9200",
"es.mapping.exclude":"temp_id",
"es.mapping.id":"temp_id",
}
)
In pursuit of a good solution, I might as well air some dirty laundry. I've got a terribly inefficient workaround that uses zipWithIndex to chunk an RDD, and send those subsets to the indexing function above. Looks a bit like this:
def index_chunks_to_es(spark=None, job=None, kwargs=None, rdd=None, chunk_size_limit=10000):
# zip with index
zrdd = rdd.zipWithIndex()
# get count
job.update_record_count(save=False)
count = job.record_count
# determine number of chunks
steps = count / chunk_size_limit
if steps % 1 != 0:
steps = int(steps) + 1
# evenly distribute chunks, while not exceeding chunk_limit
dist_chunk_size = int(count / steps) + 1
# loop through steps, appending subset to list for return
for step in range(0, steps):
# determine bounds
lower_bound = step * dist_chunk_size
upper_bound = (step + 1) * dist_chunk_size
print(lower_bound, upper_bound)
# select subset
rdd_subset = zrdd.filter(lambda x: x[1] >= lower_bound and x[1] < upper_bound).map(lambda x: x[0])
# index to ElasticSearch
ESIndex.index_job_to_es_spark(
spark,
job=job,
records_df=rdd_subset.toDF(),
index_mapper=kwargs['index_mapper']
)
It's slow, if I'm understanding correctly, because that zipWithIndex, filter, and map are evaluated for each resulting RDD subset. However, it's also memory efficient in that 500k, 1m, 5m, etc. records are never sent to saveAsNewAPIHadoopFile, instead, these smaller RDDs that a relatively small spark driver can handle.
Any suggestions for different approaches would be greatly appreciated. Perhaps that means now using the Elasticsearch-Hadoop connector, but instead sending raw JSON to ES?
Update:
Looks like I'm still getting java heap space errors with this workaround, but leaving here to demonstrate thinking for a possible workaround. Had not anticipated that zipWithIndex would collect everything on the driver (which I'm assuming is the case here)
Update #2
Here is a debug string of the RDD I'ma attempting to run through saveAsNewAPIHadoopFile:
(32) PythonRDD[6] at RDD at PythonRDD.scala:48 []
| MapPartitionsRDD[5] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[4] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| ShuffledRowRDD[3] at javaToPython at NativeMethodAccessorImpl.java:-2 []
+-(1) MapPartitionsRDD[2] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[1] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| JDBCRDD[0] at javaToPython at NativeMethodAccessorImpl.java:-2 []
Update #3
Below is a DAG visualization for the take at SerDeUtil.scala:233 that appears to run on driver/localhost:
And a DAG for the saveAsNewAPIHadoopFile for a much smaller job (around 1k rows), as the 500k rows attempts never actually fire as the SerDeUtil stage above is what appears to trigger the java heap size problem for larger RDDs:

I'm still a bit confused as to why this addresses the problem, but it does. When reading rows from MySQL with spark.jdbc.read, by passing bounds, the resulting RDD appears to be partitioned in such a way that saveAsNewAPIHadoopFile is successful for large RDDs.
Have a Django model for the DB rows, so can get first and last column IDs:
records = records.order_by('id')
start_id = records.first().id
end_id = records.last().id
Then, pass those to spark.read.jdbc:
sqldf = spark.read.jdbc(
settings.COMBINE_DATABASE['jdbc_url'],
'core_record',
properties=settings.COMBINE_DATABASE,
column='id',
lowerBound=bounds['lowerBound'],
upperBound=bounds['upperBound'],
numPartitions=settings.SPARK_REPARTITION
)
The debug string for the RDD shows that the originating RDD now has 10 partitions:
(32) PythonRDD[11] at RDD at PythonRDD.scala:48 []
| MapPartitionsRDD[10] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[9] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| ShuffledRowRDD[8] at javaToPython at NativeMethodAccessorImpl.java:-2 []
+-(10) MapPartitionsRDD[7] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[6] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| JDBCRDD[5] at javaToPython at NativeMethodAccessorImpl.java:-2 []
Where my understanding breaks down, is that you can see there is a manual/explicit repartitioning to 32, both in the debug string from the question, and this one above, which I thought would be enough to ease memory pressure on the saveAsNewAPIHadoopFile call, but apparently the Dataframe (turned into an RDD) from the original spark.jdbc.read matters even downstream.

Related

Why is UDF not running in parallel on available executors?

I have a tiny spark Dataframe that essentially pushes a string into a UDF. I'm expecting, because of .repartition(3), which is the same length as targets, for the processing inside run_sequential to be applied on available executors - i.e. applied to 3 different executors.
The issue is that only 1 executor is used. How can I parallelise this processing to force my pyspark script to assign each element of target to a different executor?
import pandas as pd
import pyspark.sql.functions as F
def run_parallel(config):
def run_sequential(target):
#process with target variable
pass
return F.udf(run_sequential)
targets = ["target_1", "target_2", "target_3"]
config = {}
pdf = spark.createDataFrame(pd.DataFrame({"targets": targets})).repartition(3)
pdf.withColumn(
"apply_udf", run_training_parallel(config)("targets")
).collect()
The issue here is that repartitioning a DataFrame does not guarantee that all the created partitions will be of the same size. With such a small number of records there is a pretty high chance that some of them will map into the same partition. Spark is not meant to process such small datasets and its algorithms are tailored to work efficiently with large amounts of data - if your dataset has 3 million records and you split it in 3 partitions of approximately 1 million records each, a difference of several records per partition will be insignificant in most cases. This is obviously not the case when repartitioning 3 records.
You can use df.rdd.glom().map(len).collect() to examine the size of the partitions before and after repartitioning to see how the distribution changes.
$ pyspark --master "local[3]"
...
>>> pdf = spark.createDataFrame([("target_1",), ("target_2",), ("target_3",)]).toDF("targets")
>>> pdf.rdd.glom().map(len).collect()
[1, 1, 1]
>>> pdf.repartition(3).rdd.glom().map(len).collect()
[0, 2, 1]
As you can see, the resulting partitioning is uneven and the first partition in my case is actually empty. The irony here is that the original dataframe has the desired property and that one is getting destroyed by repartition().
While your particular case is not what Spark typically targets, it is still possible to forcefully distribute three records in three partitions. All you need to do is to provide an explicit partition key. RDDs have the zipWithIndex() method that extends each record with its ID. The ID is the perfect partition key since its value starts with 0 and increases by 1.
>>> new_df = (pdf
.coalesce(1) # not part of the solution - see below
.rdd # Convert to RDD
.zipWithIndex() # Append ID to each record
.map(lambda x: (x[1], x[0])) # Make record ID come first
.partitionBy(3) # Repartition
.map(lambda x: x[1]) # Remove record ID
.toDF()) # Turn back into a dataframe
>>> new_df.rdd.glom().map(len).collect()
[1, 1, 1]
In the above code, coalesce(1) is added only to demonstrate that the final partitioning is not influenced by the fact that pdf initially has one record in each partition.
A DataFrame-only solution is to first coalesce pdf to a single partition and then use repartition(3). With no partitioning column(s) provided, DataFrame.repartition() uses the round-robin partitioner and hence the desired partitioning will be achieved. You cannot simply do pdf.coalesce(1).repartition(3) since Catalyst (the Spark query optimisation engine) optimises out the coalesce operation, so a partitioning-dependent operation must be inserted in between. Adding a column containing F.monotonically_increasing_id() is a good candidate for such an operation.
>>> new_df = (pdf
.coalesce(1)
.withColumn("id", F.monotonically_increasing_id())
.repartition(3))
>>> new_df.rdd.glom().map(len).collect()
[1, 1, 1]
Note that, unlike in the RDD-based solution, coalesce(1) is required as part of the solution.

Avoid repartition costs when filtering and then coalescing

I am implementing a range query on an RDD of (x,y) points in pyspark. I partitioned the xy space into a 16*16 grid (256 cells) and assigned each point in my RDD to one of these cells.
The gridMappedRDD is a PairRDD: (cell_id, Point object)
I partitioned this RDD to 256 partitions, using:
gridMappedRDD.partitionBy(256)
The range query is a rectangular box. I have a method for my Grid object which can return the list of cell ids which overlap with the query range. So, I used this as a filter to prune the unrelated cells:
filteredRDD = gridMappedRDD.filter(lambda x: x[0] in candidateCells)
But the problem is that when running the query and then collecting the results, all the 256 partitions are evaluated; A task is created for each partition.
To avoid this problem, I tried coalescing the filteredRDD to the length of candidateCell list and I hoped this could solve the problem.
filteredRDD.coalesce(len(candidateCells))
In fact the resulting RDD has len(candidateCells) partitions but the partitions are not the same as gridMappedRDD.
As stated in the coalesce documentation, the shuffle parameter is False and no shuffle should be performed among partitions but I can see (with the help of glom()) that this is not the case.
For example after a coalesce(4) with candidateCells=[62, 63, 78, 79] the partitions are like this:
[[(62, P), (62, P) .... , (63, P)],
[(78, P), (78, P) .... , (79, P)],
[], []
]
Actually, by coalescing, I have a shuffle read which equals to the size of my whole dataset for every task, which takes a significant time. What I need is an RDD with only partitions related to cells in candidateCells, without any shuffles.
So, my question is that is it possible to filter only some partitions without reshuffling? For the above example, my filteredRDD would have 4 partitions with exactly the same data as originalRDD's 62, 63, 78, 79th partitions. Doing so, the query could be directed to affecting partitions only.
You made a few incorrect assumptions here:
The shuffle is not related to coalesce (nor coalesce is useful here). It is caused by partitionBy. Partitioning by definition requires shuffle.
Partitioning cannot be used to optimize filter. Spark knows nothing about the function you use (it is a black box).
Partitioning doesn't uniquely map keys to partitions. Multiple keys can be placed on the same partition - How does HashPartitioner work?
What can you do:
If resulting subset is small repartition and apply lookup for each key:
from itertools import chain
partitionedRDD = gridMappedRDD.partitionBy(256)
chain.from_iterable(
((c, x) for x in partitionedRDD.lookup(c))
for c in candidateCells
)
If data is large you can try to skip scanning partitions (number of tasks won't change, but some task can be short circuited):
candidatePartitions = [
partitionedRDD.partitioner.partitionFunc(c) for c in candidateCells
]
partitionedRDD.mapPartitionsWithIndex(
lambda i, xs: (x for x in xs if x[0] in candidateCells) if i in candidatePartitions else []
)
This two methods make sense only if you perform multiple "lookups". If it is one-off operation, it is better to perform linear filter:
It is cheaper than shuffle and repartitioning.
If initial data is uniformly distributed downstream processing will be able to better utilize available resources.

Is Dataset.rdd an action or transformation?

one of the way to evaluate if a dataframe is empty or not is to do df.rdd.isEmpty(), however, I see rdd at mycode.scala:123 in sparkUI executions. which makes me wonder if this rdd() function is actually an action is instead of a transformation.
I know that isEmpty() is an action, but I do see a separate stage where isEmpty() at mycode.scala:234, so I think they are different actions?
rdd is generated to represent a structured query in "RDD terms" so Spark can execute it. It is an RDD of JVM objects of your type T. If used incorrectly can cause memory problems since:
Transfers internally-managed optimized rows that live outside JVM to the memory space in JVM
Transforms the binary rows to your business objects (the JVM "true" representation)
The first will increase the JVM memory required for the computation while the latter is an extra transformation step.
For such a simple calculation where you count the number of rows, you'd rather stick to count as the optimized and fairly cheap computation (that can avoid copying objects and applying schema).
Internally, Dataset keeps rows in their InternalRow. That decreases JVM memory requirement for your Spark application. The RDD (from rdd) is computed to represent the Spark transformations that are going to be executed once a Spark action is executed.
Please note that executing rdd creates a RDD and does require some calculations too.
So, yes, rdd might be considered an action as it "executes" the query (i.e. the physical plan of the Dataset that sits behind), but in the end it just gives an RDD (so it can't be an action by definition since Spark actions return a non-RDD value).
As you can see in the code:
lazy val rdd: RDD[T] = {
val objectType = exprEnc.deserializer.dataType
val deserialized = CatalystSerde.deserialize[T](logicalPlan) // <-- HERE see explanation below
sparkSession.sessionState.executePlan(deserialized).toRdd.mapPartitions { rows =>
rows.map(_.get(0, objectType).asInstanceOf[T])
}
}
rdd is computed lazily and only once.
one of the way to evaluate if a dataframe is empty or not is to do df.rdd.isEmpty()
I wonder where did you find it. I'd just count:
count(): Long Returns the number of rows in the Dataset.
toRdd Lazy Value
If you insist on going fairly low-level to check whether your Dataset is empty or not, I'd rather use Dataset.queryExecution.toRdd instead. That's almost like rdd without this extra copying and applying schema.
df.queryExecution.toRdd.isEmpty
Compare the following RDD lineages and think which may seem better.
val dataset = spark.range(5).withColumn("group", 'id % 2)
scala> dataset.rdd.toDebugString
res1: String =
(8) MapPartitionsRDD[8] at rdd at <console>:26 [] // <-- extra deserialization step
| MapPartitionsRDD[7] at rdd at <console>:26 []
| MapPartitionsRDD[6] at rdd at <console>:26 []
| MapPartitionsRDD[5] at rdd at <console>:26 []
| ParallelCollectionRDD[4] at rdd at <console>:26 []
// Compare with a more memory-optimized alternative
// Avoids copies and has no schema
scala> dataset.queryExecution.toRdd.toDebugString
res2: String =
(8) MapPartitionsRDD[11] at toRdd at <console>:26 []
| MapPartitionsRDD[10] at toRdd at <console>:26 []
| ParallelCollectionRDD[9] at toRdd at <console>:26 []
From Spark perspective, the transformations are fairly cheap since they don't cause any shuffles, but given the memory requirements change between the computation I'd use the latter (with toRdd).
rdd Lazy Value
rdd represents the content of the Dataset as a (lazily-created) RDD with rows of the JVM type T.
rdd: RDD[T]
As you can see in the source code (pasted above), requesting rdd in the end will trigger one extra computation just to get the RDD.
Creates a new logical plan to deserialize the Dataset’s logical plan, i.e. you get extra deserialization from internal binary row format that is managed outside JVM to its corresponding representation as JVM objects living inside JVM (think of GC that you should avoid at all cost)

Apache Spark OutOfMemoryError (HeapSpace)

I have a dataset with ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If this is found, the entire column is set to missing (null), for that group.
df = spark.read.parquet('path/to/parquet/')
check_columns = {'col1': ..., 'col2': ..., ...} # currently len(check_columns) = 8
for col, _ in check_columns.items():
total = (df
.groupBy('groupID').count()
.toDF('groupID', 'n_total')
)
missing = (df
.where(F.col(col).isNull())
.groupBy('groupID').count()
.toDF('groupID', 'n_missing')
)
# count_missing = count_missing.persist() # PERSIST TRY 1
# print('col {} found {} missing'.format(col, missing.count())) # missing.count() is b/w 1k-5k
poor_df = (total
.join(missing, 'groupID')
.withColumn('freq', F.col('n_missing') / F.col('n_total'))
.where(F.col('freq') > 0.5)
.select('groupID')
.toDF('poor_groupID')
)
df = (df
.join(poor_df, df['groupID'] == poor_df['poor_groupID'], 'left_outer')
.withColumn(col, (F.when(F.col('poor_groupID').isNotNull(), None)
.otherwise(df[col])
)
)
.select(df.columns)
)
stats = (missing
.withColumnRenamed('n_missing', 'cnt')
.collect() # FAIL 1
)
# df = df.persist() # PERSIST TRY 2
print(df.count()) # FAIL 2
I initially assigned 1G of spark.driver.memory and 4G of spark.executor.memory, eventually increasing the spark.driver.memory up to 10G.
Problem(s):
The loop runs like a charm during the first iterations, but towards the end,
around the 6th or 7th iteration I see my CPU utilization dropping (using 1
instead of 6 cores). Along with that, execution time for one iteration
increases significantly.
At some point, I get an OutOfMemory Error:
spark.driver.memory < 4G: at collect() (FAIL 1)
4G <= spark.driver.memory < 10G: at the count() step (FAIL 2)
Stack Trace for FAIL 1 case (relevant part):
[...]
py4j.protocol.Py4JJavaError: An error occurred while calling o1061.collectToPython.
: java.lang.OutOfMemoryError: Java heap space
[...]
The executor UI does not reflect excessive memory usage (it shows a <50k used
memory for the driver and <1G for the executor). The Spark metrics system
(app-XXX.driver.BlockManager.memory.memUsed_MB) does not either: it shows
600M to 1200M of used memory, but always >300M remaining memory.
(This would suggest that 2G driver memory should do it, but it doesn't.)
It also does not matter which column is processed first (as it is a loop over
a dict(), it can be in arbitrary order).
My questions thus:
What causes the OutOfMemory Error and why are not all available CPU cores
used towards the end?
And why do I need 10G spark.driver.memory when I am transferring only a few kB from the executors to the driver?
A few (general) questions to make sure I understand things properly:
If I get an OOM error, the right place to look at is almost always the driver
(b/c the executor spills to disk)?
Why would count() cause an OOM error - I thought this action would only
consume resources on the exector(s) (delivering a few bytes to the driver)?
Are the memory metrics (metrics system, UI) mentioned above the correct
places to look at?
BTW: I run Spark 2.1.0 in standalone mode.
UPDATE 2017-04-28
To drill down further, I enabled a heap dump for the driver:
cfg = SparkConfig()
cfg.set('spark.driver.extraJavaOptions', '-XX:+HeapDumpOnOutOfMemoryError')
I ran it with 8G of spark.driver.memory and I analyzed the heap dump with
Eclipse MAT. It turns out there are two classes of considerable size (~4G each):
java.lang.Thread
- char (2G)
- scala.collection.IndexedSeqLike
- scala.collection.mutable.WrappedArray (1G)
- java.lang.String (1G)
org.apache.spark.sql.execution.ui.SQLListener
- org.apache.spark.sql.execution.ui.SQLExecutionUIData
(various of up to 1G in size)
- java.lang.String
- ...
I tried to turn off the UI, using
cfg.set('spark.ui.enabled', 'false')
which made the UI unavailable, but didn't help on the OOM error. Also, I tried
to have the UI to keep less history, using
cfg.set('spark.ui.retainedJobs', '1')
cfg.set('spark.ui.retainedStages', '1')
cfg.set('spark.ui.retainedTasks', '1')
cfg.set('spark.sql.ui.retainedExecutions', '1')
cfg.set('spark.ui.retainedDeadExecutors', '1')
This also did not help.
UPDATE 2017-05-18
I found out about Spark's pyspark.sql.DataFrame.checkpoint method. This is like persist but gets rid of the dataframe's lineage. Thus it helps to circumvent the above mentioned issues.

How does Spark manage stages?

I am trying to understand how jobs and stages are defined in spark, and for that I am now using the code that I found here and spark UI. In order to see it on spark UI I had to copy and paste the text on the files several times so it takes more time to process.
Here is the output of spark UI:
Now, I understand that there are three jobs because there are three actions and also that the stages are generated because of shuffle actions, but what I don't understand is why in the Job 1 stages 4, 5 and 6 are the same as stages 0, 1 and 2 of Job 0 and the same happens for Job 2.
How can I know what stages will be in more than a job only seeing the java code (before executing anything)? And also, why are stage 4 and 9 skipped and how can I know it will happen before executing?
I understand that there are three jobs because there are three actions
I'd even say that there could have been more Spark jobs but the minimum number is 3. It all depends on the implementation of transformations and the action used.
I don't understand is why in the Job 1 stages 4, 5 and 6 are the same as stages 0, 1 and 2 of Job 0 and the same happens for Job 2.
Job 1 is the result of some action that ran on a RDD, finalRdd. That RDD was created using (going backwards): join, textFile, map, and distinct.
val people = sc.textFile("people.csv").map { line =>
val tokens = line.split(",")
val key = tokens(2)
(key, (tokens(0), tokens(1))) }.distinct
val cities = sc.textFile("cities.csv").map { line =>
val tokens = line.split(",")
(tokens(0), tokens(1))
}
val finalRdd = people.join(cities)
Run the above and you'll see the same DAG.
Now, when you execute leftOuterJoin or rightOuterJoin actions, you'll get the other two DAGs. You're using the previously-used RDDs to run new Spark jobs and so you'll see the same stages.
why are stage 4 and 9 skipped
Often, Spark will skip execution of some stages. The grayed-out stages are ones already computed so Spark will reuse them and so make performance better.
How can I know what stages will be in more than a job only seeing the java code (before executing anything)?
That's what RDD lineage (graph) offers.
scala> people.leftOuterJoin(cities).toDebugString
res15: String =
(3) MapPartitionsRDD[99] at leftOuterJoin at <console>:28 []
| MapPartitionsRDD[98] at leftOuterJoin at <console>:28 []
| CoGroupedRDD[97] at leftOuterJoin at <console>:28 []
+-(2) MapPartitionsRDD[81] at distinct at <console>:27 []
| | ShuffledRDD[80] at distinct at <console>:27 []
| +-(2) MapPartitionsRDD[79] at distinct at <console>:27 []
| | MapPartitionsRDD[78] at map at <console>:24 []
| | people.csv MapPartitionsRDD[77] at textFile at <console>:24 []
| | people.csv HadoopRDD[76] at textFile at <console>:24 []
+-(3) MapPartitionsRDD[84] at map at <console>:29 []
| cities.csv MapPartitionsRDD[83] at textFile at <console>:29 []
| cities.csv HadoopRDD[82] at textFile at <console>:29 []
As you can see yourself, you will end up with 4 stages since there are 3 shuffle dependencies (the edges with the numbers of partitions).
Numbers in the round brackets are the number of partitions that DAGScheduler will eventually use to create task sets with the exact number of tasks. One TaskSet per stage.

Resources