Efficient grouping by key using mapPartitions or partitioner in Spark

Efficient grouping by key using mapPartitions or partitioner in Spark - apache-spark

So, I have a data like the following,
[ (1, data1), (1, data2), (2, data3), (1, data4), (2, data5) ]
which I want to convert to the following, for further processing.
[ (1, [data1, data2, data4]), (2, [data3, data5]) ]
I used groupByKey and reduceByKey, but due to really large amount of data it fails. The data is not tall but wide. In other words, keys are from 1 upto 10000, but, value list ranges from 100k to 900k.
I am struggling with this issue and plan to apply mapPartitions or (Hash)partitioner.
So, if one of these may work, I'd like to know
Using mapPartions, could you please give some code snippet?
Using (Hash)partitioner, could you please give some example how to control partitions by some element like key.. e.g. is there a way to create each partition based on key (i.e. 1,2,.. above) with no need to shuffle.
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 9 (flatMap at TSUMLR.scala:209) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:542)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:538)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:538)
at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:155)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:47)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

None of the proposed method would work. Partitioner by definition have to shuffle the data and will suffer from the same limitations as groupByKey. mapPartitions cannot move data to another partition so it is completely useless. Since your description of the problem is rather vague it is hard to give a specific advice but in general I would try following steps:
try to rethink the problem. Do you really need all the values at once? How do you plan to utilize these? Can you obtain the same results without collecting to a single partition?
is it possible to reduce the traffic? How many unique values do you expect? Is it possible to compress the data before the shuffle (for example count values or use RLE)?
consider using larger executors. Spark has to keep in memory only the values for a single key and can spill processed keys to disk.
split your data by key:
val keys = rdd.keys.distinct.collect
val rdds = keys.map(k => rdd.filter(_._1 == k))
and process each RDD separatelly.

Related

Is there any parameter partitioning when Spark reads RDBMS through JDBC?

When I run the spark application for table synchronization, the error message is as follows:
19/10/16 01:37:40 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 51)
com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:590)
at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:57)
at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:1606)
at com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:633)
at com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:347)
at com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:219)
at org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:63)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:272)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I think this is caused by the large amount of data in the table. I used the parameters related to the mongo partition before,such as:spark.mongodb.input.partitioner,spark.mongodb.input.partitionerOptions.partitionSizeMB
I want to know if Spark has similar parameters for partitioning when reading RDBMS via JDBC?

Below are the parameters along with their description which we can use while reading the RDBMS table using spark jdbc.
partitionColumn, lowerBound, upperBound -These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading.
numPartitions-The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.
fetchsize - The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers which default to low fetch size (eg. Oracle with 10 rows). This option applies only to reading.
Please note that all the above parameters should be used in together. Below is an example:-
spark.read.format("jdbc").
option("driver", driver).
option("url",url ).
option("partitionColumn",column name).
option("lowerBound", 10).
option("upperBound", 10000).
option("numPartitions", 10).
option("fetchsize",1000).
option("dbtable", query).
option("user", user).
option("password",password).load()

Encounter SparkException "Cannot broadcast the table that is larger than 8GB"

I am using Spark 2.2.0 to do data processing. I am using Dataframe.join to join 2 dataframes together, however I encountered this stack trace:
18/03/29 11:27:06 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
18/03/29 11:27:09 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:126)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
...........
Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 10 GB
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:86)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:73)
at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:103)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I searched on Internet for this error, but didn't get any hint or solution how to fix this.
Does Spark automatically broadcast Dataframe as part of the join? I am very surprise with this 8GB limit because I would have thought Dataframe supports "big data" and 8GB is not very big at all.
Thank you very much in advance for your advice on this.
Linh

After some reading, I've tried to disable the auto-broadcast and it seemed to work. Change Spark config with:
'spark.sql.autoBroadcastJoinThreshold': '-1'

Currently it is a hard limit in spark that the broadcast variable size should be less than 8GB. See here.
The 8GB size is generally big enough. If you consider that you re running a job with 100 executors, spark driver needs to send the 8GB data to 100 Nodes resulting 800GB network traffic. This cost will be much less if you don't broadcast and use simple join.

Apache Beam - GroupbyKeyOnly Method causes OutofMemory Exception

We are using Apache Beam which is executed on Spark runner. Our Case is the following. Both the 2 use cases causes OutofMemory error.
1) Join - 2 Big Tables using Apache Beam - One table of size 120GB and the other is of 60GB. This causes OutofMemory error when groupByKeyOnly() is called internally in GroupCombineFunctions.java.
2) GroupByKey - We are grouping the dataset based on a key like the following.
PCollection>> costBasisRecords = masterDataResult.apply(GroupByKey.create());
This GroupbyKey operation also causes OutOfmemory errors.
Could you please give us suggestions such that we can achieve result.
From online, We saw that reduceByKey method - Could you please guide us how we can implement that functionality for Spark runners.
Error Message:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.reflect.Array.newInstance(Array.java:75)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1897)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1529)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:45)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:89)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

If possible, I would definitely recommend using a Combine.perKey as Lukasz suggests.
If you are unable to do that or if you still run into OOMs, try to decrease partition size by increasing the number of partitions. You can increase the number of shuffle partitions by manually setting the spark.default.parallelism configuration. This is explicitly used to determine the partitioning scheme for groupByKeyOnly shuffles.
It looks like the way to plumb configurations through is via a manually-constructed SparkContextOptions. There's a test case that shows how to do this. Note that this requires your pipeline program to directly link against Spark. For example:
SparkConf conf = new SparkConf().set("spark.default.parallelism", parallelism);
JavaSparkContext jsc = new JavaSparkContext(conf);
SparkContextOptions options = PipelineOptionsFactory.as(SparkContextOptions.class);
options.setUsesProvidedSparkContext(true);
options.setProvidedSparkContext(jsc);
Pipeline p = Pipeline.create(options);
// ...
NOTE: Spark has its own limitation that all grouped values for a given key must fit in memory on the machine processing that key. If this does not hold for your datasets (i.e., you have very strong key skew), then you will need to combine rather than group by key.

reduceByKey in Spark is similar to Combine.perKey in Apache Beam, see the Programming Guide for examples.
Note that reduceByKey and Combine.perKey will only work if there is a reduction per key otherwise your just going to hit the same out of memory problem. For example, combining all integers per key into a list will not reduce the amount of memory usage but summing the integers per key will.

FileNotFoundException in apache spark(1.6) job during shuffle files

I am working on spark 1.6, it is failing my job with following error
java.io.FileNotFoundException: /data/05/dfs/dn/yarn/nm/usercache/willir31/appcache/application_1413512480649_0108/spark-local-20141028214722-43f1/26/shuffle_0_312_0.index (No such file or directory)
java.io.FileOutputStream.open(Native Method)
java.io.FileOutputStream.(FileOutputStream.java:221)
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:192)
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:733)
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4$$anonfun$apply$2.apply(ExternalSorter.scala:732)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
org.apache.spark.util.collection.ExternalSorter$IteratorForPartition.foreach(ExternalSorter.scala:790)
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:732)
org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$4.apply(ExternalSorter.scala:728)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:728)
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
I am performing join operations. When i carefully look into the error and check my code i found it is failing while it is writing back to CSV from dataFrame. But i am not able to get rid of it. I am not using hdp, i have separate installation for all components.

This types of errors typically occur when there are deeper problems with some tasks, like significant data skew. Since you don't provide enough details (please be sure to read How To Ask and How to create a Minimal, Complete, and Verifiable example) and job statistics the only approach that I can think off is to significantly increase number of shuffle partitions:
sqlContext.setConf("spark.sql.shuffle.partitions", 2048)

Running Spark Mllib Decision Tree gives block size error

I'm running a decision tree on a dataframe of about 2000 points and 500 features. Maxbins is 182. No matter how i increase the shuffling block size from 200 up to 4000 i keep getting a failure at stage 3 of the decision tree training saying "max integer reached" referring to Spark block size shuffling size. Note my dataframes are not rdds but spark sql dataframes.
Here is the error:
...
Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1206)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:522)
at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:312)
at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58)
at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
...
Here is the code producing it:
val assembled = assembler.transform(features)
val dt = new DecisionTreeClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setImpurity(impurity).setMaxBins(maxBins).setMaxDepth(maxDepth)
val pipeline = new Pipeline().setStages(Array(labelIndexer, dt))
val model = pipeline.fit(assembled)
Thank you for any pointers on what might be causing this and how to fix it.
Thank you.

Try to increase number of partitions - try repartition() method.
The reason for this error is that spark uses memory mapped files to handle partition data blocks and it's currently not possible to memory mapped something that's more, than 2GB (Integer.MAX_VALUE) - not the Spark issue, though.
The workaround would be increase number of partitions. That would reduce block size of particular partition and might help with the issue.
And there is also some activity to workaround that in Spark itself - to process partitions blocks in chunks.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Efficient grouping by key using mapPartitions or partitioner in Spark - apache-spark

Related

Is there any parameter partitioning when Spark reads RDBMS through JDBC?

Encounter SparkException "Cannot broadcast the table that is larger than 8GB"

Apache Beam - GroupbyKeyOnly Method causes OutofMemory Exception

FileNotFoundException in apache spark(1.6) job during shuffle files

Running Spark Mllib Decision Tree gives block size error

Categories

Resources