I have an embarrassingly parallel workload where I am reading a group of parquet files, concatenating them into bigger parquet files, and then writing them back to disk. I am running this on a distributed cluster (with a distributed filesystem) with ~300 workers, each having 20GB of RAM. Each individual piece of work should only consume 2-3 GB of RAM, but somehow the workers are crashing with a memory error (I get a distributed.scheduler.KilledWorker exception). I can see the following in the worker's output log:
Memory use is high but worker has no data to store to disk. Perhaps
some other process is leaking memory. Process memory: 18.20 GB
with open('ts_files_list.txt', 'r') as f:
    all_files = f.readlines()
# There are about 500K files
all_files = [f.strip() for f in all_files]
# grouping them into groups of 50.
# The concatenated df should be about 1GB in memory
npart = 10000
file_pieces = np.array_split(all_files, npart)
def read_and_combine(filenames, group_name):
    dfs = [pd.read_parquet(f) for f in filenames]
    grouped_df = pd.concat(dfs)
    grouped_df.to_parquet(f, engine='pyarrow')
group_names = [f'group{i} for i in range(npart)]
delayed_func = dask.delayed(read_and_combine)
# the following line shouldn't have resulted in a memory error, but it does
dask.compute(map(delayed_func, file_pieces, group_names))
Am I missing something obvious here?
Thanks!
Dask version: 2021.01.0, pyarrow version: 2.0.0, distributed version: 2021.01.0
There are a couple of syntactic errors, but overall the workflow seems reasonable.
import dask
import numpy as np
import pandas as pd

with open('ts_files_list.txt', 'r') as f:
    all_files = f.readlines()

all_files = [f.strip() for f in all_files]
npart = 10000
file_pieces = np.array_split(all_files, npart)

def read_and_combine(filenames, group_name):
    grouped_df = pd.concat(pd.read_parquet(f) for f in filenames)
    grouped_df.to_parquet(group_name, engine='pyarrow')
    del grouped_df  # this is optional (in principle dask should clean up these objects)

group_names = [f'group{i}' for i in range(npart)]
delayed_func = dask.delayed(read_and_combine)
# unpack the map so dask.compute receives the delayed objects themselves
dask.compute(*map(delayed_func, file_pieces, group_names))
One more thing to keep in mind is that parquet files are compressed by default, so when loaded into memory they can occupy much more space than their on-disk size. Not sure if this applies to your workflow, but it is worth checking when you run into memory problems.
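If you want to sanity-check how much a group actually occupies once decompressed, here is a minimal sketch (the path is just a placeholder for one of the group files written above):

import os
import pandas as pd

path = 'group0'  # placeholder: one of the files written above
df = pd.read_parquet(path)
on_disk = os.path.getsize(path)
in_memory = df.memory_usage(deep=True).sum()
print(f'on disk: {on_disk / 1e9:.2f} GB, in memory: {in_memory / 1e9:.2f} GB')

A large gap between the two numbers is a good hint that compression, not the file count, is what is driving the memory use.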
Related
I'm having issues with a particular Spark method, saveAsNewAPIHadoopFile. The context is that I'm using PySpark, indexing RDDs with 1k, 10k, 50k, 500k, 1m records into ElasticSearch (ES).
For a variety of reasons, the Spark context is quite underpowered, with a 2gb driver and a single 2gb executor.
I've had no problem until about 500k, when I get java heap size problems. After increasing spark.driver.memory to about 4gb, I'm able to index more. However, there is a limit to how long this will work, and we would like to index upwards of 500k, 1m, 5m, 20m records.
I'm also constrained to using PySpark, for a variety of reasons. The bottleneck and breakpoint seem to be a Spark stage called take at SerDeUtil.scala:233: no matter how many partitions the RDD has going in, it drops down to one, which I'm assuming is the driver collecting the partitions and preparing for indexing.
Now - I'm wondering if there is an efficient way to still use an approach like the following, given that constraint:
to_index_rdd.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        "es.resource": "%s/record" % index_name,
        "es.nodes": "192.168.45.10:9200",
        "es.mapping.exclude": "temp_id",
        "es.mapping.id": "temp_id",
    }
)
In pursuit of a good solution, I might as well air some dirty laundry. I've got a terribly inefficient workaround that uses zipWithIndex to chunk an RDD and sends those subsets to the indexing function above. It looks a bit like this:
def index_chunks_to_es(spark=None, job=None, kwargs=None, rdd=None, chunk_size_limit=10000):
    # zip with index
    zrdd = rdd.zipWithIndex()
    # get count
    job.update_record_count(save=False)
    count = job.record_count
    # determine number of chunks, rounding up so the last partial chunk is included
    steps = count // chunk_size_limit
    if count % chunk_size_limit != 0:
        steps += 1
    # evenly distribute chunks, while not exceeding chunk_limit
    dist_chunk_size = int(count / steps) + 1
    # loop through steps, appending subset to list for return
    for step in range(0, steps):
        # determine bounds
        lower_bound = step * dist_chunk_size
        upper_bound = (step + 1) * dist_chunk_size
        print(lower_bound, upper_bound)
        # select subset
        rdd_subset = zrdd.filter(lambda x: x[1] >= lower_bound and x[1] < upper_bound).map(lambda x: x[0])
        # index to ElasticSearch
        ESIndex.index_job_to_es_spark(
            spark,
            job=job,
            records_df=rdd_subset.toDF(),
            index_mapper=kwargs['index_mapper']
        )
It's slow, if I'm understanding correctly, because the zipWithIndex, filter, and map are re-evaluated for each resulting RDD subset. However, it's also memory efficient in that 500k, 1m, 5m, etc. records are never sent to saveAsNewAPIHadoopFile all at once; only the smaller RDDs, which a relatively small Spark driver can handle, are.
Any suggestions for different approaches would be greatly appreciated. Perhaps that means not using the Elasticsearch-Hadoop connector, but instead sending raw JSON to ES?
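To make that alternative concrete, something like the following hypothetical sketch is what I have in mind. record_to_doc is a made-up placeholder for whatever turns a record into a JSON-serializable dict, and the client setup and doc type assume an older ES version, so treat this as an assumption rather than working code:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

def index_partition(records):
    # one client per partition, so nothing is ever collected on the driver
    es = Elasticsearch(['192.168.45.10:9200'])
    actions = (
        {'_index': index_name, '_type': 'record', '_source': record_to_doc(r)}
        for r in records
    )
    bulk(es, actions)

to_index_rdd.foreachPartition(index_partition)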
Update:
Looks like I'm still getting java heap space errors with this workaround, but I'm leaving it here to demonstrate my thinking about a possible workaround. I had not anticipated that zipWithIndex would collect everything on the driver (which I'm assuming is the case here).
Update #2
Here is a debug string of the RDD I'm attempting to run through saveAsNewAPIHadoopFile:
(32) PythonRDD[6] at RDD at PythonRDD.scala:48 []
| MapPartitionsRDD[5] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[4] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| ShuffledRowRDD[3] at javaToPython at NativeMethodAccessorImpl.java:-2 []
+-(1) MapPartitionsRDD[2] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[1] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| JDBCRDD[0] at javaToPython at NativeMethodAccessorImpl.java:-2 []
Update #3
Below is a DAG visualization for the take at SerDeUtil.scala:233 stage, which appears to run on the driver/localhost:
And here is a DAG for the saveAsNewAPIHadoopFile for a much smaller job (around 1k rows); the 500k-row attempts never actually fire, because the SerDeUtil stage above is what appears to trigger the java heap size problem for larger RDDs:
I'm still a bit confused as to why this addresses the problem, but it does. When reading rows from MySQL with spark.read.jdbc and passing bounds, the resulting RDD appears to be partitioned in such a way that saveAsNewAPIHadoopFile succeeds for large RDDs.
I have a Django model for the DB rows, so I can get the first and last row IDs:
records = records.order_by('id')
start_id = records.first().id
end_id = records.last().id
Then, pass those to spark.read.jdbc:
sqldf = spark.read.jdbc(
    settings.COMBINE_DATABASE['jdbc_url'],
    'core_record',
    properties=settings.COMBINE_DATABASE,
    column='id',
    lowerBound=bounds['lowerBound'],
    upperBound=bounds['upperBound'],
    numPartitions=settings.SPARK_REPARTITION
)
The debug string for the RDD shows that the originating RDD now has 10 partitions:
(32) PythonRDD[11] at RDD at PythonRDD.scala:48 []
| MapPartitionsRDD[10] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[9] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| ShuffledRowRDD[8] at javaToPython at NativeMethodAccessorImpl.java:-2 []
+-(10) MapPartitionsRDD[7] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| MapPartitionsRDD[6] at javaToPython at NativeMethodAccessorImpl.java:-2 []
| JDBCRDD[5] at javaToPython at NativeMethodAccessorImpl.java:-2 []
Where my understanding breaks down is that there is a manual/explicit repartitioning to 32, both in the debug string from the question and in the one above, which I thought would be enough to ease memory pressure on the saveAsNewAPIHadoopFile call. But apparently the partitioning of the DataFrame (turned into an RDD) coming out of the original spark.read.jdbc matters even downstream.
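A quick way to see where the partitioning actually ends up is to print the partition counts at both ends of the pipeline. This is just a diagnostic sketch; sqldf and to_index_rdd are the names used above:

# partition count of the DataFrame straight from the JDBC read
print(sqldf.rdd.getNumPartitions())

# partition count of the RDD that is eventually passed to saveAsNewAPIHadoopFile
print(to_index_rdd.getNumPartitions())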
I have legacy code in C++ that takes a file path on HDFS as input, runs, and writes its output to the local HDD.
Following is how I call it:
val trainingRDD = pathsRdd.pipe(command = commandSeq, env = Map(), printPipeContext = _ => (),
  printRDDElement = (kV, printFn) => {
    val hdfsPath = kV._2
    printFn(hdfsPath)
  }, separateWorkingDir = false)
I see CPU utilization around 50% on Ganglia. The spark.task.cpus setting is equal to 1, so each task gets 1 core. But my question is: when I call the binary with pipe, does that binary get all the cores available on the host, just like any other executable, or is it restricted to however many cores the pipe task has? So far, increasing spark.task.cpus to 2 hasn't increased usage.
Spark can read files in parallel, and it also has a partitioning mechanism: more partitions means more parallelism. If you want to increase your CPU utilization, you could configure your SparkContext with
val conf = new SparkConf().setMaster("local[*]")
val sc = new SparkContext(conf)
More details: https://spark.apache.org/docs/latest/configuration.html
I have a dataset with ~5M rows x 20 columns, containing a groupID and a rowID. My goal is to check whether (some) columns contain more than a fixed fraction (say, 50%) of missing (null) values within a group. If this is found, the entire column is set to missing (null), for that group.
df = spark.read.parquet('path/to/parquet/')
check_columns = {'col1': ..., 'col2': ..., ...}  # currently len(check_columns) = 8

for col, _ in check_columns.items():

    total = (df
        .groupBy('groupID').count()
        .toDF('groupID', 'n_total')
    )

    missing = (df
        .where(F.col(col).isNull())
        .groupBy('groupID').count()
        .toDF('groupID', 'n_missing')
    )

    # count_missing = count_missing.persist()  # PERSIST TRY 1
    # print('col {} found {} missing'.format(col, missing.count()))  # missing.count() is b/w 1k-5k

    poor_df = (total
        .join(missing, 'groupID')
        .withColumn('freq', F.col('n_missing') / F.col('n_total'))
        .where(F.col('freq') > 0.5)
        .select('groupID')
        .toDF('poor_groupID')
    )

    df = (df
        .join(poor_df, df['groupID'] == poor_df['poor_groupID'], 'left_outer')
        .withColumn(col, (F.when(F.col('poor_groupID').isNotNull(), None)
                           .otherwise(df[col])
                          )
        )
        .select(df.columns)
    )

    stats = (missing
        .withColumnRenamed('n_missing', 'cnt')
        .collect()  # FAIL 1
    )

    # df = df.persist()  # PERSIST TRY 2
    print(df.count())  # FAIL 2
I initially assigned 1G of spark.driver.memory and 4G of spark.executor.memory, eventually increasing the spark.driver.memory up to 10G.
Problem(s):
The loop runs like a charm during the first iterations, but towards the end,
around the 6th or 7th iteration I see my CPU utilization dropping (using 1
instead of 6 cores). Along with that, execution time for one iteration
increases significantly.
At some point, I get an OutOfMemory Error:
spark.driver.memory < 4G: at collect() (FAIL 1)
4G <= spark.driver.memory < 10G: at the count() step (FAIL 2)
Stack Trace for FAIL 1 case (relevant part):
[...]
py4j.protocol.Py4JJavaError: An error occurred while calling o1061.collectToPython.
: java.lang.OutOfMemoryError: Java heap space
[...]
The executor UI does not reflect excessive memory usage (it shows a <50k used
memory for the driver and <1G for the executor). The Spark metrics system
(app-XXX.driver.BlockManager.memory.memUsed_MB) does not either: it shows
600M to 1200M of used memory, but always >300M remaining memory.
(This would suggest that 2G driver memory should do it, but it doesn't.)
It also does not matter which column is processed first (as it is a loop over
a dict(), it can be in arbitrary order).
My questions thus:
What causes the OutOfMemory Error and why are not all available CPU cores
used towards the end?
And why do I need 10G spark.driver.memory when I am transferring only a few kB from the executors to the driver?
A few (general) questions to make sure I understand things properly:
If I get an OOM error, the right place to look at is almost always the driver
(b/c the executor spills to disk)?
Why would count() cause an OOM error - I thought this action would only
consume resources on the executor(s) (delivering just a few bytes to the driver)?
Are the memory metrics (metrics system, UI) mentioned above the correct
places to look at?
BTW: I run Spark 2.1.0 in standalone mode.
UPDATE 2017-04-28
To drill down further, I enabled a heap dump for the driver:
cfg = SparkConf()
cfg.set('spark.driver.extraJavaOptions', '-XX:+HeapDumpOnOutOfMemoryError')
I ran it with 8G of spark.driver.memory and I analyzed the heap dump with
Eclipse MAT. It turns out there are two classes of considerable size (~4G each):
java.lang.Thread
  - char (2G)
  - scala.collection.IndexedSeqLike
  - scala.collection.mutable.WrappedArray (1G)
  - java.lang.String (1G)
org.apache.spark.sql.execution.ui.SQLListener
  - org.apache.spark.sql.execution.ui.SQLExecutionUIData
    (various of up to 1G in size)
  - java.lang.String
  - ...
I tried to turn off the UI, using
cfg.set('spark.ui.enabled', 'false')
which made the UI unavailable but didn't help with the OOM error. Also, I tried
to have the UI keep less history, using
cfg.set('spark.ui.retainedJobs', '1')
cfg.set('spark.ui.retainedStages', '1')
cfg.set('spark.ui.retainedTasks', '1')
cfg.set('spark.sql.ui.retainedExecutions', '1')
cfg.set('spark.ui.retainedDeadExecutors', '1')
This also did not help.
UPDATE 2017-05-18
I found out about Spark's pyspark.sql.DataFrame.checkpoint method. It is like persist, but it discards the DataFrame's lineage, which helps to circumvent the above-mentioned issues.
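For reference, a rough sketch of how checkpoint slots into the loop above (the checkpoint directory is just a placeholder path):

# a checkpoint directory must be set once before calling checkpoint()
spark.sparkContext.setCheckpointDir('/tmp/spark-checkpoints')  # placeholder

for col, _ in check_columns.items():
    # ... same per-column logic as above, producing a new df ...
    # truncate the lineage so the query plan does not grow with every iteration
    df = df.checkpoint()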
I want to load a 12GB csv file into python and then do analysis.
I attempted to use this method
file_input_to_system = pd.read_csv(usrinput)
, but it failed because it consumed all my RAM.
My goal now is to read the file from the hard disk without loading it all into RAM at once. I googled it and found this sample:
import csv
import pandas as pd

f = open("file_path", "r")
for row in csv.reader(f):
    df = pd.DataFrame(row)
    print(df)
f.close()
But I am not sure how to modify it so that it reads the csv and parses it into a dataframe.
When I try the following, it can read the file without consuming all my memory.
However, when I concatenate the chunks into one dataframe, all my memory is consumed.
chunksize = 100
df = pd.read_csv("C:/Users/user/Documents/GitHub/MyfirstRep/export_lage.csv",iterator=True,chunksize=chunksize)
df = pd.concat(df, ignore_index=True)
print(df)
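For completeness, here is a rough sketch of processing the file chunk by chunk without ever concatenating everything, assuming the analysis can be expressed as a per-chunk reduction (the per-column sum here is just an illustrative placeholder):

import pandas as pd

chunksize = 100000
totals = None
reader = pd.read_csv("C:/Users/user/Documents/GitHub/MyfirstRep/export_lage.csv", chunksize=chunksize)
for chunk in reader:
    # reduce each chunk to something small before reading the next one
    partial = chunk.sum(numeric_only=True)
    totals = partial if totals is None else totals.add(partial, fill_value=0)
print(totals)

Only the running aggregate is kept in memory, so peak usage stays close to the size of a single chunk.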
I'm trying to do a matrix multiplication chain of size 67584 x 67584 using PySpark, but it constantly runs out of memory (OOM errors). Here are the details:
The input is a MATLAB file (.mat) that contains the whole matrix in a single file. I load it using scipy's loadmat, split it into multiple files of block size 1024 x 1024, and store them back in .mat format.
The mapper then loads each file from the file list and creates an RDD of blocks.
filelist = sc.textFile(BLOCKS_DIR + 'filelist.txt',minPartitions=200)
blocks_rdd = filelist.map(MapperLoadBlocksFromMatFile).cache()
MapperLoadBlocksFromMatFile is a function as below:
# imports assumed from context
from scipy.io import loadmat
from scipy import sparse
import numpy as np
from pyspark.mllib.linalg import Matrices

def MapperLoadBlocksFromMatFile(filename):
    data = loadmat(filename)
    G = data['G']
    id = data['block_id'].flatten()
    n = G.shape[0]
    if not isinstance(G, sparse.csc_matrix):
        sub_matrix = Matrices.dense(n, n, G.transpose().flatten())
    else:
        sub_matrix = Matrices.dense(n, n, np.array(G.todense()).transpose().flatten())
    return ((id[0], id[1]), sub_matrix)
Now once I have this RDD, I create a BlockMatrix from it and do a matrix multiplication with it.
adjacency_mat = BlockMatrix(blocks_rdd, block_size, block_size, adj_mat.shape[0], adj_mat.shape[1])
I'm using the multiply method from the BlockMatrix implementation, and it runs out of memory every single time.
Result = adjacency_mat.multiply(adjacency_mat)
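For a sense of scale, here is a back-of-the-envelope estimate of the dense sizes involved (my own arithmetic, assuming dense float64 blocks):

n = 67584                                # matrix dimension
block = 1024                             # block size
blocks_per_dim = n // block              # 66
num_blocks = blocks_per_dim ** 2         # 4356 blocks per matrix
bytes_per_block = block * block * 8      # 8 MiB per dense float64 block
total_bytes = num_blocks * bytes_per_block
print(total_bytes / 1024**3)             # ~34 GiB for one dense copy of the matrix

So each dense copy of the matrix is roughly 34 GiB, before counting the block pairs that the multiply shuffles around.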
Below are the cluster configuration details:
50 nodes with 64 GB of memory and 20 cores each
worker -> 60 GB and 16 cores
executors -> 15 GB and 4 cores each
driver.memory -> 60 GB and maxResultSize -> 10 GB
I even tried rdd.compress. In spite of having enough memory and cores, I run out of memory every time. Every time a different node runs out of memory, and I don't have the option of using VisualVM in the cluster. What am I doing wrong? Is the way the BlockMatrix is created wrong? Or am I not accounting for enough memory?
OOM Error Stacktrace