Understanding why smaller executors fail and larger ones succeed in Spark

I have a job that parses approximately a terabyte of JSON-formatted data split into 20 MB files (essentially, each minute produces a 1 GB dataset).
The job parses, filters, and transforms this data and writes it back out to another path. However, whether it runs successfully depends on the Spark configuration.
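For context, here is a minimal sketch of the kind of pipeline involved; the paths, filter condition, and transform are hypothetical stand-ins rather than the actual job code:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("json-transform").getOrCreate()

# Read the minute-level 20 MB JSON files, filter, transform, and write back out
df = spark.read.json("s3://bucket/input/")                  # hypothetical input path
out = (df
       .filter(F.col("event_type") == "rewrite")            # hypothetical filter
       .withColumn("day", F.to_date(F.col("timestamp"))))   # hypothetical transform
out.write.mode("overwrite").json("s3://bucket/output/")     # hypothetical output path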
The cluster consists of 46 nodes, each with 96 cores and 768 GB of memory. The driver has the same specs.
I submit the job in standalone mode and:
Using 22 GB and 3 cores per executor, the job fails with GC overhead and OOM errors:
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError
19/04/13 01:35:32 WARN TransportChannelHandler: Exception in connection from /10.0.118.151:34014
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.sun.security.sasl.digest.DigestMD5Base$DigestIntegrity.getHMAC(DigestMD5Base.java:1060)
at com.sun.security.sasl.digest.DigestMD5Base$DigestPrivacy.unwrap(DigestMD5Base.java:1470)
at com.sun.security.sasl.digest.DigestMD5Base.unwrap(DigestMD5Base.java:213)
at org.apache.spark.network.sasl.SparkSaslServer.unwrap(SparkSaslServer.java:150)
at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:126)
at org.apache.spark.network.sasl.SaslEncryption$DecryptionHandler.decode(SaslEncryption.java:101)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
: An error occurred while calling o54.json.
: java.lang.OutOfMemoryError: GC overhead limit exceeded
Using 120 GB and 15 cores per executor, the job succeeds.
Why would the job fail on the smaller memory/core setup?
Notes:
There is an explode operation that may be related as well. Edit: unrelated. I tested the code with a bare spark.read.json().count() and it still hit GC overhead and OOM.
My current pet theory is that the large number of small files results in high shuffle overhead. Is this what's going on, and is there a way around it (outside of re-aggregating the files separately)?
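If the problem really is the sheer number of 20 MB input splits, one workaround to try (a sketch, assuming the same JSON path as above; all values are illustrative) is to pack more bytes into each read partition and/or coalesce right after the read, so downstream stages see fewer, larger partitions:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("small-files-workaround")
         # Pack more small files into each input partition (default is 128 MB)
         .config("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
         # Treat each extra file as cheaper to open, so more files get grouped together
         .config("spark.sql.files.openCostInBytes", str(1 * 1024 * 1024))
         .getOrCreate())

df = spark.read.json("s3://path/to/file")
df = df.coalesce(2000)   # illustrative partition count; tune to the cluster
print(df.count())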
Code as requested:
Launcher
./bin/spark-submit --master spark://0.0.0.0:7077 \
--conf "spark.executor.memory=90g" \
--conf "spark.executor.cores=12" \
--conf 'spark.default.parallelism=7200' \
--conf 'spark.sql.shuffle.partitions=380' \
--conf "spark.network.timeout=900s" \
--conf "spark.driver.extraClassPath=$LIB_JARS" \
--conf "spark.executor.extraClassPath=$LIB_JARS" \
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
launcher.py
Code
spark = SparkSession.builder \
.appName('Rewrites by Frequency') \
.getOrCreate()
spark.read.json("s3://path/to/file").count()

Related

Spark app failing with error org.apache.spark.shuffle.FetchFailedException

I am running Spark 2.4.0 on EMR, trying to process a large dataset (1 TB) using 100 nodes with 122 GB of memory and 16 cores each. I am getting the exceptions below after some time. Here are the parameters I've set:
--executor-memory 80g
--executor-cores 4
--driver-memory 80g
--driver-cores 1
spark = (SparkSession
.builder
.master("yarn")
.config("spark.shuffle.service.enabled","true")
.config("spark.dynamicAllocation.shuffleTracking.enabled","true")
.config("spark.dynamicAllocation.enabled", "true")
.config("spark.dynamicAllocation.minExecutors","50")
#.config("spark.dynamicAllocation.maxExecutors", "500")
.config("spark.dynamicAllocation.executorIdleTimeout","2m")
.config("spark.driver.maxResultSize", "16g")
.config("spark.kryoserializer.buffer.max", "2047")
.config("spark.rpc.message.maxSize", "2047")
.config("spark.memory.offHeap.enabled","true")
.config("spark.memory.offHeap.size","50g")
.config("spark.sql.autoBroadcastJoinThreshold", "-1")
.config("spark.sql.broadcastTimeout","1200")
.config("spark.sql.shuffle.partitions","200")
.config("spark.memory.storageFraction","0.3")
.config("spark.yarn.executor.memoryOverhead","2g")
.enableHiveSupport()
.getOrCreate())
Here are the three types of executor failures I've been getting, each eventually causing the corresponding stage to rerun. Sometimes the rerun succeeds and sometimes the retries go on forever.
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 8
org.apache.spark.shuffle.FetchFailedException: Failure while fetching StreamChunkId{streamId=45963765394, chunkIndex=0}: java.lang.RuntimeException: Executor is not registered (appId=application_1625085506598_0885, execId=137)
org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-10-40-6-235.ap-south-1.compute.internal/10.40.6.235:7337
Attaching a screenshot of the Spark DAG.
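Not a confirmed fix, but fetch failures and unregistered executors like these are often attacked by splitting the shuffle into more, smaller tasks and giving the shuffle layer more patience; a sketch of the kind of extra settings commonly tried (all values illustrative):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")
         # More shuffle tasks so each fetch moves less data
         .config("spark.sql.shuffle.partitions", "2000")
         # More tolerance for slow or recovering shuffle servers
         .config("spark.network.timeout", "600s")
         .config("spark.shuffle.io.maxRetries", "10")
         .config("spark.shuffle.io.retryWait", "30s")
         .getOrCreate())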

java.lang.OutOfMemoryError: Java heap space spark streaming job

I have a Spark Streaming job that runs in 10-minute batches. The driver machine is an m4X4x (64 GB) EC2 instance.
The job stalled after 18 hours and crashes with the following exception. From reading other posts, it seems the driver may have run out of memory. How can I check this? My PySpark config is as follows.
Also, how do I check memory usage in the Spark UI? I only see the 11 task nodes I have, not the driver.
export PYSPARK_SUBMIT_ARGS='--master yarn --deploy-mode client
--driver-memory 10g
--executor-memory 10g
--executor-cores 4
--conf spark.driver.cores=5
--packages "org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2"
--conf spark.driver.maxResultSize=2g
--conf spark.shuffle.spill=true
--conf spark.yarn.driver.memoryOverhead=2048
--conf spark.yarn.executor.memoryOverhead=2048
--conf "spark.broadcast.blockSize=512M"
--conf "spark.memory.storageFraction=0.5"
--conf "spark.kryoserializer.buffer.max=1024"
--conf "spark.default.parallelism=600"
--conf "spark.sql.shuffle.partitions=600"
--driver-java-options -Dlog4j.configuration=file:///usr/lib/spark/conf/log4j.properties pyspark-shell'
[Stage 3507:> (0 + 0) / 600]Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$4.apply(TorrentBroadcast.scala:231)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$4.apply(TorrentBroadcast.scala:231)
at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
at net.jpountz.lz4.LZ4BlockOutputStream.finish(LZ4BlockOutputStream.java:235)
at net.jpountz.lz4.LZ4BlockOutputStream.close(LZ4BlockOutputStream.java:175)
at java.io.ObjectOutputStream$BlockDataOutputStream.close(ObjectOutputStream.java:1828)
at java.io.ObjectOutputStream.close(ObjectOutputStream.java:742)
at org.apache.spark.serializer.JavaSerializationStream.close(JavaSerializer.scala:57)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$1.apply$mcV$sp(TorrentBroadcast.scala:238)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1319)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:237)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:107)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:86)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1387)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1012)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:933)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:936)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:935)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:935)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:873)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1630)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
[Stage 3507:> (0 + 0) / 600]18/02/23 12:59:33 ERROR TransportRequestHandler: Error sending result RpcResponse{requestId=8388437576763608177, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /172.23.56.231:58822; closing connection
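Regarding the "how can I check this" part: one rough way to watch driver heap usage from inside the job itself (a sketch, assuming a live SparkContext named sc) is to query the driver JVM through the py4j gateway:
def driver_heap_mb(sc):
    """Return (used MB, max MB) for the driver JVM heap via the py4j gateway."""
    rt = sc._jvm.java.lang.Runtime.getRuntime()
    used = (rt.totalMemory() - rt.freeMemory()) / (1024.0 * 1024.0)
    max_mb = rt.maxMemory() / (1024.0 * 1024.0)
    return used, max_mb

used, max_mb = driver_heap_mb(sc)
print("driver heap: %.0f MB used of %.0f MB max" % (used, max_mb))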

Spark java.io.IOException due to unknown reason

I'm getting the following exception while running a Spark job. The job gets stuck at the same stage every time; the stage is a SQL query. I don't see any other exception in either the driver or executor logs.
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:748)
This exception is wrapped between these errors:
ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from hostname.domain.com/ip is closed
The only thing I could find in the executor logs was:
INFO memory.TaskMemoryManager: Memory used in task 12302
INFO memory.TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter#462e08e3: 32.0 MB
INFO memory.TaskMemoryManager: Acquired by org.apache.spark.unsafe.map.BytesToBytesMap#41bed570: 2.4 GB
INFO memory.TaskMemoryManager: 0 bytes of memory were used by task 12302 but are not associated with specific consumers
INFO memory.TaskMemoryManager: 2634274570 bytes of memory are used for execution and 1826540 bytes of memory are used for storage
INFO sort.UnsafeExternalSorter: Thread 197 spilling sort data of 512.0 MB to disk (0 time so far)
But I don't believe this is a memory issue. The job completes successfully in a different environment with the same amount of data.
Here's my spark-submit :
spark-submit --master yarn-cluster \
--conf spark.speculation=true \
--conf spark.default.parallelism=200 \
--conf spark.executor.memory=16G \
--conf spark.memory.storageFraction=0.3 \
--conf spark.executor.cores=5 \
--conf spark.driver.memory=2G \
--conf spark.driver.cores=4 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.initialExecutors=10 \
--conf spark.yarn.executor.memoryOverhead=1638 \
--conf spark.driver.maxResultSize=1G \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--class com.test.TestClass Test.jar
I did read some articles here and there about a similar exception, which point toward increasing the heartbeat interval and network timeout, but I couldn't find a definitive answer.
How can I run this job successfully?
This was caused by an issue with the data.
The driving table for all the left joins had the empty string '' as data in one of the columns used to join to another table. Similarly, the other table also had a lot of empty strings in that particular column.
This effectively produced a cross join, and since the number of matching rows was so large, the job hung indefinitely.
Adding a filter to the right table solved the issue:
SELECT *
FROM LEFT_TABLE LT
LEFT JOIN (
    SELECT *
    FROM RIGHT_TABLE
    WHERE LENGTH(TRIM(PROBLEMATIC_COLUMN)) <> 0
) RT
    ON LT.PROBLEMATIC_COLUMN = RT.PROBLEMATIC_COLUMN
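The same fix expressed with the DataFrame API (a sketch; the table and column names are taken from the query above, and an active SparkSession named spark is assumed): drop blank join keys from the right side before the left join.
from pyspark.sql import functions as F

left_df = spark.table("LEFT_TABLE")
right_df = spark.table("RIGHT_TABLE")

# Filter out rows whose join key is empty or blank before joining
filtered_right = right_df.filter(F.length(F.trim(F.col("PROBLEMATIC_COLUMN"))) != 0)
result = left_df.join(filtered_right, on="PROBLEMATIC_COLUMN", how="left")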

Spark job fails due to stackoverflow error

My Spark job uses MLlib to train a LogisticRegression model on some data, but it fails with a StackOverflowError. Here is the error message shown in spark-shell:
java.lang.StackOverflowError
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.generic.GenericCompanion.apply(GenericCompanion.scala:48)
...
When I check the Spark UI, there is no failed stage or job! This is how I run spark-shell:
spark-shell --num-executors 100 --driver-memory 20g --conf spark.driver.maxResultSize=5g --executor-memory 8g --executor-cores 3
I even tried to increase the stack size by adding the following option when running spark-shell, but it didn't help:
--conf "spark.driver.extraJavaOptions='-XX:ThreadStackSize=81920'"
What is the issue?
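A StackOverflowError during iterative MLlib training is often a symptom of a very long RDD lineage rather than of data size, so raising the thread stack size tends not to help much. A hedged sketch (written in PySpark for consistency with the rest of this page, assuming a training RDD of LabeledPoints named points) of truncating the lineage with a checkpoint before training:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   # hypothetical directory

points.persist()       # keep the materialized data around
points.checkpoint()    # cut the lineage so DAG traversal stays shallow
points.count()         # action to force the checkpoint to be written

model = LogisticRegressionWithLBFGS.train(points)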

Java Out-of-Memory Error in pyspark

My problem is fairly simple: the JVM runs out of memory when I run RandomForest.trainRegressor in PySpark. I'm using about 3 GB of training data with 77 features. If I set numTrees to 15 or more, the training fails with an out-of-memory error (like below):
error info:
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},numTrees=15, featureSubsetStrategy="sqrt",impurity='variance', maxDepth=20, maxBins=14) #variance
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 412, in trainRegressor
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 270, in _train
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 130, in callMLlibFunc
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 123, in callJavaFunc
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o81.trainRandomForestModel.
: java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
at org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:625)
at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235)
at org.apache.spark.mllib.tree.RandomForest$.trainRegressor(RandomForest.scala:380)
at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainRandomForestModel(PythonMLLibAPI.scala:744)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
I'm using Spark 1.5 and my parameters are:
spark-submit --master yarn-client --conf spark.cassandra.connection.host=x.x.x.x \
--jars /home/retail/packages/spark_cassandra_tool-1.0.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hive/lib/HiveAuthHook.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hbase/hbase-common.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hbase/hbase-client.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hbase/hbase-server.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/lib/spark-examples.jar \
--num-executors 30 --executor-cores 4 \
--executor-memory 20000M --driver-memory 200g \
--conf spark.yarn.executor.memoryOverhead=5000 \
--conf spark.kryoserializer.buffer.max=2000m \
--conf spark.kryoserializer.buffer=40m \
--conf spark.driver.extraJavaOptions=\"-Xms2048m -Xmx2048m -XX:+DisableExplicitGC -Dcom.sun.management.jmxremote -XX:PermSize=512m -XX:MaxPermSize=2048m -XX:MaxDirectMemorySize=5g\" \
--conf spark.driver.maxResultSize=10g \
--conf spark.port.maxRetries=100
In my opinion, the trees should be trained sequentially, so why does increasing their number create a higher memory load?
If I want to train 300 trees successfully, how should I set the Spark parameters to use the RandomForest.trainRegressor function properly?
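Not a definitive recipe, but a sketch of the adjustments that typically matter more than raw executor memory here, assuming the same trainingData RDD: persist the input so it is not recomputed on every iteration, and keep maxDepth modest, because the per-tree node statistics grow very quickly with depth.
from pyspark import StorageLevel
from pyspark.mllib.tree import RandomForest

# Avoid recomputing the input on each training iteration
trainingData.persist(StorageLevel.MEMORY_AND_DISK)

# Illustrative settings: depth-20 trees are far heavier than depth-12 trees
model = RandomForest.trainRegressor(
    trainingData,
    categoricalFeaturesInfo={},
    numTrees=300,
    featureSubsetStrategy="sqrt",
    impurity="variance",
    maxDepth=12,
    maxBins=14,
)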
