PySpark Block Matrix multiplication fails with OOM - apache-spark

I'm trying to compute a matrix multiplication chain on a 67584*67584 matrix using PySpark, but it consistently fails with an out-of-memory (OOM) error. Here are the details:
The input is a MATLAB file (.mat) that contains the whole matrix in a single file. I load it with scipy's loadmat, split it into multiple files of block size 1024*1024, and store them back in .mat format.
A mapper then loads each block file from a file list and creates an RDD of blocks.
filelist = sc.textFile(BLOCKS_DIR + 'filelist.txt',minPartitions=200)
blocks_rdd = filelist.map(MapperLoadBlocksFromMatFile).cache()
MapperLoadBlocksFromMatFile is defined as follows:
from scipy import sparse
from scipy.io import loadmat
import numpy as np
from pyspark.mllib.linalg import Matrices

def MapperLoadBlocksFromMatFile(filename):
    data = loadmat(filename)
    G = data['G']
    id = data['block_id'].flatten()
    n = G.shape[0]
    if not isinstance(G, sparse.csc_matrix):
        # block was stored dense
        sub_matrix = Matrices.dense(n, n, G.transpose().flatten())
    else:
        # block was stored sparse; densify it
        sub_matrix = Matrices.dense(n, n, np.array(G.todense()).transpose().flatten())
    return ((id[0], id[1]), sub_matrix)
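If the blocks are genuinely sparse, a possible variation (just a sketch, not tested here) is to keep them as MLlib SparseMatrix blocks instead of densifying them; BlockMatrix.multiply still produces dense output blocks, so this only trims the memory of the inputs. MapperLoadSparseBlock is a hypothetical name for that variant:

from scipy import sparse
from scipy.io import loadmat
from pyspark.mllib.linalg import Matrices

def MapperLoadSparseBlock(filename):
    # hypothetical variant of MapperLoadBlocksFromMatFile that keeps sparse blocks sparse
    data = loadmat(filename)
    G = data['G']
    id = data['block_id'].flatten()
    n = G.shape[0]
    if isinstance(G, sparse.csc_matrix):
        # CSC storage maps directly onto MLlib's SparseMatrix (colPtrs, rowIndices, values)
        sub_matrix = Matrices.sparse(n, n, G.indptr, G.indices, G.data)
    else:
        sub_matrix = Matrices.dense(n, n, G.transpose().flatten())
    return ((id[0], id[1]), sub_matrix)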
Once I have this RDD, I create a BlockMatrix from it and do a matrix multiplication with it.
adjacency_mat = BlockMatrix(blocks_rdd, block_size, block_size, adj_mat.shape[0], adj_mat.shape[1])
I'm using the multiply method from the BlockMatrix implementation, and it runs out of memory every single time.
Result = adjacency_mat.multiply(adjacency_mat)
Below are the cluster configuration details:
50 nodes, each with 64 GB of memory and a 20-core processor.
worker -> 60 GB and 16 cores
executors -> 15 GB and 4 cores each
driver.memory -> 60 GB and maxResultSize -> 10 GB
I even tried rdd.compress. In spite of having enough memory and cores, I run out of memory every time, and each time a different node is the one that fails; I don't have the option of using VisualVM on the cluster. What am I doing wrong? Is the way the BlockMatrix is created wrong, or am I not accounting for enough memory?
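As a rough back-of-the-envelope check (a sketch only, using the figures from the question), the dense block representation of a single 67584*67584 matrix is already sizeable, and BlockMatrix.multiply shuffles and replicates blocks of both operands, so peak memory is a multiple of this:

# rough memory estimate for the dense blocks described above (not a profiler)
n = 67584                                   # matrix dimension
block = 1024                                # block size
blocks_per_dim = n // block                 # 66
num_blocks = blocks_per_dim ** 2            # 4356 blocks per matrix
bytes_per_block = block * block * 8         # float64 dense block ~ 8 MiB
total_gib = num_blocks * bytes_per_block / 1024 ** 3
print(blocks_per_dim, num_blocks, round(total_gib, 1))   # 66 4356 34.0

With roughly 34 GiB per matrix copy before any shuffle replication, the 15 GB executors leave little headroom, which is consistent with a different node failing on each run.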
OOM Error Stacktrace

Related

KilledWorker error in dask when doing embarrassingly parallel data concatenation

I have an embarrassingly parallel workload where I read a group of parquet files, concatenate them into bigger parquet files, and then write them back to disk. I am running this on a distributed cluster (with a distributed filesystem) with ~300 workers, each having 20 GB of RAM. Each individual piece of work should only consume 2-3 GB of RAM, but somehow the workers are crashing with a memory error (a distributed.scheduler.KilledWorker exception). I can see the following in the worker's output log:
Memory use is high but worker has no data to store to disk. Perhaps
some other process is leaking memory. Process memory: 18.20 GB
with open('ts_files_list.txt', 'r') as f:
    all_files = f.readlines()

# There are about 500K files
all_files = [f.strip() for f in all_files]

# grouping them into groups of 50.
# The concatenated df should be about 1GB in memory
npart = 10000
file_pieces = np.array_split(all_files, npart)

def read_and_combine(filenames, group_name):
    dfs = [pd.read_parquet(f) for f in filenames]
    grouped_df = pd.concat(dfs)
    grouped_df.to_parquet(f, engine='pyarrow')

group_names = [f'group{i} for i in range(npart)]
delayed_func = dask.delayed(read_and_combine)

# the following line shouldn't have resulted in a memory error, but it does
dask.compute(map(delayed_func, file_pieces, group_names))
Am I missing something obvious here?
Thanks!
Dask version: 2021.01.0, pyarrow version: 2.0.0, distributed version: 2021.01.0
There are a couple of syntactic errors, but overall the workflow seems reasonable.
import dask
import numpy as np
import pandas as pd

with open('ts_files_list.txt', 'r') as f:
    all_files = f.readlines()
all_files = [f.strip() for f in all_files]

npart = 10000
file_pieces = np.array_split(all_files, npart)

def read_and_combine(filenames, group_name):
    grouped_df = pd.concat(pd.read_parquet(f) for f in filenames)
    grouped_df.to_parquet(group_name, engine='pyarrow')
    del grouped_df  # this is optional (in principle dask should clean up these objects)

group_names = [f'group{i}' for i in range(npart)]
delayed_func = dask.delayed(read_and_combine)

# unpack the map so dask.compute receives the delayed objects themselves
dask.compute(*map(delayed_func, file_pieces, group_names))
One more thing to keep in mind is that parquet files are compressed by default, so when unpacked they could occupy much more memory than their compressed file size. Not sure if this applies to your workflow, but something to keep in mind when experiencing memory problems.
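A quick way to gauge that effect (a sketch with a hypothetical file name; substitute one of your group files) is to compare the on-disk size with the in-memory footprint after loading:

import os
import pandas as pd

path = 'group0'                  # hypothetical: one of the written group files
on_disk = os.path.getsize(path)
in_memory = pd.read_parquet(path).memory_usage(deep=True).sum()
print(f'on disk: {on_disk / 1e6:.1f} MB, in memory: {in_memory / 1e6:.1f} MB')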

Compute size of Spark dataframe - SizeEstimator gives unexpected results

I am trying to find a reliable way to compute the size (in bytes) of a Spark dataframe programmatically.
The reason is that I would like to have a method to compute an "optimal" number of partitions ("optimal" could mean different things here: it could mean having an optimal partition size, or resulting in an optimal file size when writing to Parquet tables - but both can be assumed to be some linear function of the dataframe size). In other words, I would like to call coalesce(n) or repartition(n) on the dataframe, where n is not a fixed number but rather a function of the dataframe size.
Other topics on SO suggest using SizeEstimator.estimate from org.apache.spark.util to get the size in bytes of the dataframe, but the results I'm getting are inconsistent.
First of all, I'm persisting my dataframe to memory:
df.cache().count
The Spark UI shows a size of 4.8GB in the Storage tab. Then, I run the following command to get the size from SizeEstimator:
import org.apache.spark.util.SizeEstimator
SizeEstimator.estimate(df)
This gives a result of 115'715'808 bytes =~ 116MB. However, applying SizeEstimator to different objects leads to very different results. For instance, I try computing the size separately for each row in the dataframe and sum them:
df.map(row => SizeEstimator.estimate(row.asInstanceOf[AnyRef])).reduce(_+_)
This results in a size of 12'084'698'256 bytes =~ 12GB. Or, I can try to apply SizeEstimator to every partition:
df.mapPartitions(
  iterator => Seq(SizeEstimator.estimate(
    iterator.toList.map(row => row.asInstanceOf[AnyRef]))).toIterator
).reduce(_+_)
which results again in a different size of 10'792'965'376 bytes =~ 10.8GB.
I understand there are memory optimizations / memory overhead involved, but after performing these tests I don't see how SizeEstimator can be used to get a sufficiently good estimate of the dataframe size (and consequently of the partition size, or resulting Parquet file sizes).
What is the appropriate way (if any) to apply SizeEstimator in order to get a good estimate of a dataframe size or of its partitions? If there isn't any, what is the suggested approach here?
Unfortunately, I was not able to get reliable estimates from SizeEstimator, but I could find another strategy - if the dataframe is cached, we can extract its size from queryExecution as follows:
df.cache.foreach(_ => ())
val catalyst_plan = df.queryExecution.logical
val df_size_in_bytes = spark.sessionState.executePlan(
  catalyst_plan).optimizedPlan.stats.sizeInBytes
For the example dataframe, this gives exactly 4.8GB (which also corresponds to the file size when writing to an uncompressed Parquet table).
This has the disadvantage that the dataframe needs to be cached, but it is not a problem in my case.
EDIT: Replaced df.cache.foreach(_=>_) by df.cache.foreach(_ => ()), thanks to @DavidBenedeki for pointing it out in the comments.
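Since several questions on this page are PySpark-based: roughly the same trick can be reproduced from Python through the py4j internals. This is only a sketch under assumptions — it relies on non-public accessors (_jdf, _jsparkSession), assumes a Spark 2.3/2.4-style sessionState.executePlan(plan) and parameterless stats, and the input path is hypothetical:

import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('path/to/parquet/')     # hypothetical input

df.cache().count()                              # materialize the cache first
catalyst_plan = df._jdf.queryExecution().logical()
df_size_in_bytes = int(
    spark._jsparkSession.sessionState().executePlan(catalyst_plan)
         .optimizedPlan().stats().sizeInBytes().toString())

# one possible use, per the original goal: derive a partition count from the size
target_bytes_per_partition = 128 * 1024 * 1024
n = max(1, math.ceil(df_size_in_bytes / target_bytes_per_partition))
df = df.repartition(n)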
SizeEstimator returns the number of bytes an object takes up on the JVM heap. This includes objects referenced by that object, so the actual data size will almost always be much smaller than the estimate.
The discrepancies in the sizes you've observed arise because creating new objects on the JVM adds references and object overhead of its own, and that memory is counted too.
Check out the docs here 🤩
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.util.SizeEstimator$
Apart from SizeEstimator, which you have already tried (good insight), below is another option:
RDDInfo[] getRDDStorageInfo()
Return information about what RDDs are cached, whether they are in memory, on disk, or both, how much space they take, etc.
The Spark UI's Storage tab actually uses this (Spark docs).
Below is the implementation from Spark:
/**
 * :: DeveloperApi ::
 * Return information about what RDDs are cached, if they are in mem or on disk, how much space
 * they take, etc.
 */
@DeveloperApi
def getRDDStorageInfo: Array[RDDInfo] = {
  getRDDStorageInfo(_ => true)
}

private[spark] def getRDDStorageInfo(filter: RDD[_] => Boolean): Array[RDDInfo] = {
  assertNotStopped()
  val rddInfos = persistentRdds.values.filter(filter).map(RDDInfo.fromRdd).toArray
  rddInfos.foreach { rddInfo =>
    val rddId = rddInfo.id
    val rddStorageInfo = statusStore.asOption(statusStore.rdd(rddId))
    rddInfo.numCachedPartitions = rddStorageInfo.map(_.numCachedPartitions).getOrElse(0)
    rddInfo.memSize = rddStorageInfo.map(_.memoryUsed).getOrElse(0L)
    rddInfo.diskSize = rddStorageInfo.map(_.diskUsed).getOrElse(0L)
  }
  rddInfos.filter(_.isCached)
}
RDD's yourRDD.toDebugString also uses this (code here).
General note:
In my opinion, to get an optimal number of records per partition, and to check that your repartitioning is correct and uniformly distributed, it is more sensible to try something like the snippet below, then adjust your repartition number and measure the partition sizes accordingly:
yourdf.rdd.mapPartitionsWithIndex { case (index, rows) => Iterator((index, rows.size)) }
  .toDF("PartitionNumber", "NumberOfRecordsPerPartition")
  .show
Or use existing Spark functions (depending on your Spark version; a PySpark equivalent is sketched after the snippet below):
import org.apache.spark.sql.functions._
df.withColumn("partitionId", sparkPartitionId()).groupBy("partitionId").count.show
My suggestion is:
from sys import getsizeof

def compare_size_two_object(one, two):
    '''compare the size of two objects in bytes'''
    print(getsizeof(one), 'versus', getsizeof(two))

Spark Pipe function utilizes CPU only 50%

I have legacy C++ code that takes a file path on HDFS as input, runs, and writes its output to the local HDD.
Following is how I call it:
val trainingRDD = pathsRdd.pipe(command = commandSeq, env = Map(), printPipeContext = _ => (),
  printRDDElement = (kV, printFn) => {
    val hdfsPath = kV._2
    printFn(hdfsPath)
  }, separateWorkingDir = false)
I see CPU utilization of around 50% on Ganglia. The spark.task.cpus setting is equal to 1, so each task gets 1 core. My question is: when I call the binary with pipe, does that binary get all the cores available on the host, just like any other executable, or is it restricted to the cores assigned to the pipe task? So far, increasing spark.task.cpus to 2 didn't increase usage.
Spark can read files in parallel, and it also has a partitioning mechanism: more partitions = more parallelism. If you want to increase your CPU utilization, you could configure your SparkContext with
val sc = new SparkContext(new SparkConf().setMaster("local[*]"))
More details: https://spark.apache.org/docs/latest/configuration.html
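To illustrate the "more partitions = more parallelism" point for pipe specifically, here is a minimal sketch (PySpark shown for brevity; the path list and the 'cat' command are placeholders for the real HDFS paths and the C++ binary):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# placeholder list of input paths; in practice this would be the real HDFS paths
paths_rdd = sc.parallelize(['hdfs:///data/part-%05d' % i for i in range(1000)])

# one pipe task runs per partition, so more partitions means more copies of the
# external binary running at once (bounded by the total executor cores)
piped = paths_rdd.repartition(sc.defaultParallelism).pipe('cat')
print(piped.count())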

Apache Spark OutOfMemoryError (HeapSpace)

I have a dataset with ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If this is found, the entire column is set to missing (null), for that group.
import pyspark.sql.functions as F

df = spark.read.parquet('path/to/parquet/')
check_columns = {'col1': ..., 'col2': ..., ...}  # currently len(check_columns) = 8

for col, _ in check_columns.items():
    total = (df
             .groupBy('groupID').count()
             .toDF('groupID', 'n_total')
             )
    missing = (df
               .where(F.col(col).isNull())
               .groupBy('groupID').count()
               .toDF('groupID', 'n_missing')
               )
    # count_missing = count_missing.persist()  # PERSIST TRY 1
    # print('col {} found {} missing'.format(col, missing.count()))  # missing.count() is b/w 1k-5k
    poor_df = (total
               .join(missing, 'groupID')
               .withColumn('freq', F.col('n_missing') / F.col('n_total'))
               .where(F.col('freq') > 0.5)
               .select('groupID')
               .toDF('poor_groupID')
               )
    df = (df
          .join(poor_df, df['groupID'] == poor_df['poor_groupID'], 'left_outer')
          .withColumn(col, (F.when(F.col('poor_groupID').isNotNull(), None)
                            .otherwise(df[col])
                            )
                      )
          .select(df.columns)
          )
    stats = (missing
             .withColumnRenamed('n_missing', 'cnt')
             .collect()  # FAIL 1
             )
    # df = df.persist()  # PERSIST TRY 2
    print(df.count())  # FAIL 2
I initially assigned 1G of spark.driver.memory and 4G of spark.executor.memory, eventually increasing the spark.driver.memory up to 10G.
Problem(s):
The loop runs like a charm during the first iterations, but towards the end,
around the 6th or 7th iteration I see my CPU utilization dropping (using 1
instead of 6 cores). Along with that, execution time for one iteration
increases significantly.
At some point, I get an OutOfMemory Error:
spark.driver.memory < 4G: at collect() (FAIL 1)
4G <= spark.driver.memory < 10G: at the count() step (FAIL 2)
Stack Trace for FAIL 1 case (relevant part):
[...]
py4j.protocol.Py4JJavaError: An error occurred while calling o1061.collectToPython.
: java.lang.OutOfMemoryError: Java heap space
[...]
The executor UI does not reflect excessive memory usage (it shows a <50k used
memory for the driver and <1G for the executor). The Spark metrics system
(app-XXX.driver.BlockManager.memory.memUsed_MB) does not either: it shows
600M to 1200M of used memory, but always >300M remaining memory.
(This would suggest that 2G driver memory should do it, but it doesn't.)
It also does not matter which column is processed first (as it is a loop over
a dict(), it can be in arbitrary order).
My questions thus:
What causes the OutOfMemory Error and why are not all available CPU cores
used towards the end?
And why do I need 10G spark.driver.memory when I am transferring only a few kB from the executors to the driver?
A few (general) questions to make sure I understand things properly:
If I get an OOM error, the right place to look at is almost always the driver
(b/c the executor spills to disk)?
Why would count() cause an OOM error - I thought this action would only
consume resources on the executor(s) (delivering a few bytes to the driver)?
Are the memory metrics (metrics system, UI) mentioned above the correct
places to look at?
BTW: I run Spark 2.1.0 in standalone mode.
UPDATE 2017-04-28
To drill down further, I enabled a heap dump for the driver:
cfg = SparkConf()
cfg.set('spark.driver.extraJavaOptions', '-XX:+HeapDumpOnOutOfMemoryError')
I ran it with 8G of spark.driver.memory and I analyzed the heap dump with
Eclipse MAT. It turns out there are two classes of considerable size (~4G each):
java.lang.Thread
- char (2G)
- scala.collection.IndexedSeqLike
- scala.collection.mutable.WrappedArray (1G)
- java.lang.String (1G)
org.apache.spark.sql.execution.ui.SQLListener
- org.apache.spark.sql.execution.ui.SQLExecutionUIData
(various of up to 1G in size)
- java.lang.String
- ...
I tried to turn off the UI, using
cfg.set('spark.ui.enabled', 'false')
which made the UI unavailable, but didn't help with the OOM error. I also tried
to have the UI keep less history, using
cfg.set('spark.ui.retainedJobs', '1')
cfg.set('spark.ui.retainedStages', '1')
cfg.set('spark.ui.retainedTasks', '1')
cfg.set('spark.sql.ui.retainedExecutions', '1')
cfg.set('spark.ui.retainedDeadExecutors', '1')
This also did not help.
UPDATE 2017-05-18
I found out about Spark's pyspark.sql.DataFrame.checkpoint method. It is like persist, but it gets rid of the dataframe's lineage, and thus helps to circumvent the above-mentioned issues.
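A minimal sketch of how that could look for the loop above (the checkpoint directory is a hypothetical path; checkpoint() is eager by default, so it materializes the dataframe and truncates its lineage on each iteration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir('hdfs:///tmp/checkpoints')   # hypothetical directory

df = spark.read.parquet('path/to/parquet/')
check_columns = {'col1': None, 'col2': None}     # stand-in for the real column dict

for col, _ in check_columns.items():
    # ... the same per-column null-fraction logic as in the question ...
    df = df.checkpoint()    # cut the lineage so later iterations start from materialized data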

GraphX does not work with relatively big graphs

I cannot process a graph with 230M edges.
I cloned apache/spark, built it, and then tried it on the cluster.
I use Spark Standalone Cluster:
-5 machines (each has 12 cores/32GB RAM)
-'spark.executor.memory' == 25g
-'spark.driver.memory' == 3g
The graph has 231359027 edges, and its file weighs 4,524,716,369 bytes.
Graph is represented in text format:
sourceVertexId destinationVertexId
My code:
object Canonical {

  def main(args: Array[String]) {
    val numberOfArguments = 3
    require(args.length == numberOfArguments, s"""Wrong argument number. Should be $numberOfArguments .
            |Usage: <path_to_grpah> <partiotioner_name> <minEdgePartitions> """.stripMargin)
    var graph: Graph[Int, Int] = null
    val nameOfGraph = args(0).substring(args(0).lastIndexOf("/") + 1)
    val partitionerName = args(1)
    val minEdgePartitions = args(2).toInt
    val sc = new SparkContext(new SparkConf()
      .setSparkHome(System.getenv("SPARK_HOME"))
      .setAppName(s" partitioning | $nameOfGraph | $partitionerName | $minEdgePartitions parts ")
      .setJars(SparkContext.jarOfClass(this.getClass).toList))
    graph = GraphLoader.edgeListFile(sc, args(0), false, edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK, minEdgePartitions = minEdgePartitions)
    graph = graph.partitionBy(PartitionStrategy.fromString(partitionerName))
    println(graph.edges.collect.length)
    println(graph.vertices.collect.length)
  }
}
After I ran it, I encountered a number of java.lang.OutOfMemoryError: Java heap space errors and, of course, did not get a result.
Do I have a problem in the code? Or in the cluster configuration?
It works fine for relatively small graphs, but for this graph it never worked. (And I do not think that 230M edges is too big a dataset.)
Thank you for any advice!
RESOLVED
I did not give the driver program enough memory.
I've changed cluster configuration to:
-4 workers (each has 12 cores/32GB RAM)
-1 master with driver program (each has 12 cores/32GB RAM)
-'spark.executor.memory' == 25g
-'spark.driver.memory' == 25g
Also, it was not a good idea to collect all vertices and edges just to count them. It is easy to do just this: graph.vertices.count and graph.edges.count.
What I suggest is you do a binary search to find the maximal size of data the cluster can handle. Take 50% of the graph and see if that works. If it does, try 75%. Etc.
My rule of thumb is you need 20–30× the memory for a given size of input. For 4.5 GB this suggests the limit would be around 100 GB. You have exactly that amount. I have no experience with GraphX: it probably adds another multiplier to the memory use. In my opinion you simply don't have enough memory.
