I'm trying to train an MLlib random forest regression model using the RandomForest.trainRegressor API.
After training, when I save the model, the resulting model folder has a size of 6.5 MB on disk, but there are 1120 small parquet files in the data folder that seem unnecessary and are slow to upload/download to S3.
Is this the expected behavior? I am certainly repartitioning the labeledPoints to a single partition, but this happens regardless.
Repartitioning with rdd.repartition(1) before training does not help much. It can make training slower, because all parallel operations become effectively sequential: the whole parallelism machinery is based on partitions.
Instead, I came up with a simple hack and set spark.default.parallelism to 1, since the save procedure uses sc.parallelize to create the output to save.
Keep in mind that this setting affects countless places in your app, such as groupBy and join. My suggestion is to extract the train-and-save step into a separate application and run it in isolation, roughly as sketched below.
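A minimal sketch of that workaround, assuming MLlib's RDD-based API; the paths and hyperparameters are made up for illustration. The point is the isolated application with spark.default.parallelism forced to 1 just for training and saving.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

val conf = new SparkConf()
  .setAppName("rf-train-and-save")
  .set("spark.default.parallelism", "1")   // the hack: save() parallelizes its output with this value
val sc = new SparkContext(conf)

// Hypothetical input in LibSVM format; any RDD[LabeledPoint] works here.
val labeledPoints = MLUtils.loadLibSVMFile(sc, "s3://my-bucket/training-data")

val model = RandomForest.trainRegressor(
  labeledPoints,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 50,
  featureSubsetStrategy = "auto",
  impurity = "variance",
  maxDepth = 5,
  maxBins = 32)

model.save(sc, "s3://my-bucket/models/rf")   // should now produce far fewer small parquet files
sc.stop()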
Related
We usually use Spark as the processing engine for data stored on S3 or HDFS, on the Databricks and EMR platforms.
One of the issues I frequently face is that when the task size grows, job performance degrades severely. For example, say I read data from five tables with different levels of transformation (filtering, exploding, joins, etc.), union subsets of the data from these transformations, then do further processing (e.g., remove some rows based on criteria that require window functions), then some other processing stages, and finally save the output to a destination S3 path. If we run this job without staging, it takes a very long time. However, if we save (stage) temporary intermediate dataframes to S3 and use the saved dataframes for the next steps of the query, the job finishes faster. Does anyone have similar experience? Is there a better way to handle these long task lineages other than checkpointing?
What is even stranger is that for longer lineages Spark throws an unexpected error such as "column not found", while the same code works if intermediate results are temporarily staged.
Writing the intermediate data by saving the dataframe, or using a checkpoint, is the only way to fix it. You're probably running into an issue where the optimizer is taking a really long time to generate the plan. The quickest/most efficient way to fix this is to use localCheckpoint, which materializes a checkpoint locally:
val checkpointed = df.localCheckpoint()   // eagerly materializes df and truncates its lineage
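For completeness, the other option mentioned above, staging the intermediate dataframe to S3 and reading it back, can be wrapped in a small helper. This is only a sketch; the helper name and path are made up.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: materialize an intermediate DataFrame and read it back,
// which truncates the lineage/plan the optimizer has to analyze in later stages.
def stage(df: DataFrame, path: String)(implicit spark: SparkSession): DataFrame = {
  df.write.mode("overwrite").parquet(path)
  spark.read.parquet(path)
}

// usage (names and path are illustrative):
// val staged = stage(joinedDf, "s3://my-bucket/tmp/stage1")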
I'm very new to PySpark. I am building a TF-IDF representation and want to store it on disk as an intermediate result. The IDF scoring gives me a SparseVector representation.
However, when trying to save it as Parquet, I'm getting an OOM. I'm not sure if it internally converts the SparseVector to dense, in which case it would lead to some 25k columns, and according to this thread, saving such wide data in a columnar format can lead to OOM.
So, any idea what the cause could be? My executor memory is 8g and I'm operating on a 2g CSV file.
Should I try increasing the memory, or save it as CSV instead of Parquet? Any help is appreciated. Thanks in advance.
Update 1
As pointed out, since Spark performs lazy evaluation the error could be caused by an upstream stage, so I tried a show and a collect before the write. They seemed to run fine without throwing errors. So is it still some issue related to Parquet, or do I need some other debugging?
Parquet doesn't provide native support for Spark ML / MLlib Vectors, nor are these first-class citizens in Spark SQL.
Instead, Spark represents Vectors using structs with the following fields:
type - ByteType
size - IntegerType (optional, only for SparseVectors)
indices - ArrayType(IntegerType) (optional, only for SparseVectors)
values - ArrayType(DoubleType)
and uses metadata to distinguish these from plain structs, and UDT wrappers to map back to external types. No conversion between the sparse and dense representations is needed. Nonetheless, depending on the data, such a representation might require memory comparable to a full dense array.
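As an illustration, a sketch assuming a SparkSession named spark; the values and output path are made up. A SparseVector column written to Parquet is stored with exactly those struct fields, while Spark itself still reports the column as a vector:

import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

val df = Seq(
  (0L, Vectors.sparse(5, Array(1, 3), Array(0.4, 0.9)))
).toDF("id", "features")

df.printSchema()   // `features` is shown as the vector UDT
// On disk, the Parquet schema for `features` is the struct
// (type: tinyint, size: int, indices: array<int>, values: array<double>)
df.write.mode("overwrite").parquet("/tmp/tfidf_features")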
Please note that OOMs on write are not necessarily related to the writing process itself. Since Spark is in general lazy, the exception can be caused by any of the upstream stages.
I have a job that reads CSV files, converts them into data frames and writes them out as Parquet. I am using append mode while writing, so each write generates a separate Parquet file. My questions are:
1) If a new file gets appended every time I write data to the Parquet schema, will it impact read performance (as the data is now distributed across Parquet files of varying size)?
2) Is there a way to generate the Parquet partitions purely based on the size of the data?
3) Do we need to think about a custom partitioning strategy to implement point 2?
I am using Spark 2.3.
1) It will affect read performance if spark.sql.parquet.mergeSchema=true. In that case, Spark needs to visit each file and grab the schema from it. In other cases, I believe it does not affect read performance much.
2) There is no way to generate files purely based on data size. You may use repartition or coalesce; the latter will create uneven output files, but is more performant. You also have the config spark.sql.files.maxRecordsPerFile, or the writer option maxRecordsPerFile, to prevent overly large files, but usually this is not an issue (see the sketch after this list).
3) Yes, I think Spark has no built-in API to evenly distribute by data size. Column statistics and SizeEstimator may help with this.
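A sketch of those two knobs, assuming df is the dataframe built from the CSV files; the file count, record cap and destination path are placeholders:

df.coalesce(8)                            // fewer (possibly uneven) output files, no shuffle
  .write
  .option("maxRecordsPerFile", 1000000)   // cap rows per Parquet file (Spark 2.2+)
  .mode("append")
  .parquet("s3://my-bucket/output/")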
My platform is Spark 2.1.0, using Python.
I have about 100 random forest multi-class classification models, which I have saved in HDFS. There are 100 datasets saved in HDFS too.
I want to score each dataset using the corresponding model. If the models and datasets were cached in memory, the prediction would be more than 10 times faster.
But I do not know how to cache the models, because a model is not an RDD or a DataFrame.
Thanks!
TL;DR Just cache the data if it is ever reused outside the prediction process; if not, you can even skip that.
RandomForestModel is a local object not backed by distributed data structures; there is no DAG to recompute, and the prediction process is a simple, map-only job. Therefore the model cannot be cached, and even if it could be, the operation would be meaningless.
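A Scala sketch of that advice (the question uses PySpark, but the idea is the same; the paths and the CSV parsing are made up):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.RandomForestModel

// The loaded model is a plain JVM object on the driver; there is nothing to cache.
val model = RandomForestModel.load(sc, "hdfs:///models/rf_001")

// Cache the features only if they are reused beyond this one prediction pass.
val features = sc.textFile("hdfs:///datasets/ds_001.csv")
  .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
  .cache()

val predictions = model.predict(features)   // a simple map-only job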
See also (Why) do we need to call cache or persist on a RDD
In CouchDB and system designs like Incoop, there's a concept called "Incremental MapReduce" where results from previous executions of a MapReduce algorithm are saved and used to skip over sections of input data that haven't been changed.
Say I have 1 million rows divided into 20 partitions. If I run a simple MapReduce over this data, I could cache/store the result of reducing each separate partition before they're combined and reduced again to produce the final result. If I only change data in the 19th partition, then I only need to run the map & reduce steps on the changed section of the data, and then combine the new result with the saved reduce results from the unchanged partitions to get an updated result. Using this sort of caching, I'd be able to skip almost 95% of the work for re-running a MapReduce job on this hypothetical dataset.
Is there any good way to apply this pattern to Spark? I know I could write my own tool for splitting up input data into partitions, checking if I've already processed those partitions before, loading them from a cache if I have, and then running the final reduce to join all the partitions together. However, I suspect that there's an easier way to approach this.
I've experimented with checkpointing in Spark Streaming, and that is able to store results between restarts, which is almost what I'm looking for, but I want to do this outside of a streaming job.
RDD caching/persisting/checkpointing almost looks like something I could build off of - it makes it easy to keep intermediate computations around and reference them later, but I think cached RDDs are always removed once the SparkContext is stopped, even if they're persisted to disk. So caching wouldn't work for storing results between restarts. Also, I'm not sure if/how checkpointed RDDs are supposed to be loaded when a new SparkContext is started... They seem to be stored under a UUID in the checkpoint directory that's specific to a single instance of the SparkContext.
Both use cases suggested by the article (incremental log processing and incremental query processing) can generally be solved by Spark Streaming.
The idea is that you have incremental updates coming in via the DStreams abstraction. You can then process the new data and join it with previous calculations, either using time-window-based processing or using arbitrary stateful operations as part of Structured Streaming. The results of the calculation can later be dumped to some sort of external sink such as a database or a file system, or they can be exposed as an SQL table.
If you're not building an online data processing system, regular Spark can be used as well. It's just a matter of how incremental updates get into the process and how the intermediate state is saved. For example, incremental updates can appear under some path on a distributed file system, while the intermediate state containing the previous computation joined with the new data computation can be dumped, again, to the same file system.
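For the batch case, a rough sketch of that last idea; every path, the schema, and the aggregation itself are assumptions for illustration. Partial results live under a known path, only the newly arrived data is aggregated, and the two are merged:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.sum

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

val newInputPath = "hdfs:///incoming/batch=2024-01-02"   // only the changed/new slice of data
val statePath    = "hdfs:///state/partial_counts"        // previously saved partial results

// Aggregate just the new slice.
val newAgg = spark.read.parquet(newInputPath).groupBy("key").count()

// Merge with the saved partial results, if any exist.
val merged =
  if (fs.exists(new Path(statePath)))
    spark.read.parquet(statePath)
      .union(newAgg)
      .groupBy("key")
      .agg(sum("count").as("count"))
  else
    newAgg

merged.write.mode("overwrite").parquet(statePath + "_next")   // write, then swap directories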