dataset groupByKey mapGroups only using 2 executors out of 50 assigned - apache-spark

I have a job that loads some data from Hive, does some processing, and finishes by writing data to Cassandra. At some point it was working fine, but then all of a sudden one of the Spark operations developed a bottleneck where only 2 cores are used, even though the partition count is set to 2000 across the pipeline. I am running Spark version: spark-core_2.11-2.0.0
My Spark configuration is as follows:
spark.executor.instances = "50"
spark.executor.cores = "4"
spark.executor.memory = "6g"
spark.driver.memory = "8g"
spark.memory.offHeap.enabled = "true"
spark.memory.offHeap.size = "4g"
spark.yarn.executor.memoryOverhead = "6096"
hive.exec.dynamic.partition.mode = "nonstrict"
spark.sql.shuffle.partitions = "3000"
spark.unsafe.sorter.spill.reader.buffer.size = "1m"
spark.file.transferTo = "false"
spark.shuffle.file.buffer = "1m"
spark.shuffle.unsafe.file.output.buffer = "5m"
When I do a thread dump of the executor that is running I see:
com.*.MapToSalaryRow.buildSalaryRow(SalaryTransformer.java:110)
com.*.MapToSalaryRow.call(SalaryTransformer.java:126)
com.*.MapToSalaryRow.call(SalaryTransformer.java:88)
org.apache.spark.sql.KeyValueGroupedDataset$$anonfun$mapGroups$1.apply(KeyValueGroupedDataset.scala:220)
A simplified version of the code that is having the problem is:
sourceDs.createOrReplaceTempView("salary_ds");
// Note: repartition returns a new Dataset; the result is discarded here.
sourceDs.repartition(2000);
System.out.println("sourceDs dataset partition count = " + sourceDs.rdd().getNumPartitions());

Dataset<Row> salaryDs = sourceDs
        .groupByKey(keyByUserIdFunction, Encoders.LONG())
        .mapGroups(new MapToSalaryRow(props), RowEncoder.apply(getSalarySchema()))
        .filter((FilterFunction<Row>) row -> row != null);

salaryDs.persist(StorageLevel.MEMORY_ONLY_SER());
// Same caveat: this repartition result is also discarded.
salaryDs.repartition(2000);
System.out.println("salaryDs dataset partition count = " + salaryDs.rdd().getNumPartitions());
Both of the above print statements show the partition count as 2000.
The relevant code of the MapGroups function is:
class MapToSalaryRow implements MapGroupsFunction<Long, Row, Row> {
    private final Properties props;

    MapToSalaryRow(Properties props) {
        this.props = props;
    }

    @Override
    public Row call(Long userId, Iterator<Row> iterator) throws Exception {
        return buildSalaryRow(userId, iterator, props);
    }
}
If anybody can point out where the problem might be, it would be highly appreciated.
Thanks

The problem was that one of the columns was of array type, and for a few rows this array was enormous. Even though the partition counts were about the same, two of the partitions were roughly 40 times the size of the others, and the tasks that got those rows took significantly longer.
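For anyone hitting something similar: a quick way to confirm this kind of skew is to count rows per grouping key before the groupByKey and look for outliers. A minimal sketch, assuming the key lives in a column named user_id (that column name and the static imports are assumptions, not from the original code):

import static org.apache.spark.sql.functions.*;

// Count rows per key and show the heaviest keys first. If a handful of
// keys dominate, the tasks that receive them will run far longer than
// the rest, no matter how many partitions the pipeline has.
Dataset<Row> keySizes = sourceDs
        .groupBy(col("user_id"))              // assumed key column
        .agg(count(lit(1)).alias("rows"))
        .orderBy(desc("rows"));
keySizes.show(20, false);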

Related

Spark-Cassandra: repartitionByCassandraReplica or converting dataset to JavaRDD and back do not maintain number of partitions?

So, I have a 16-node cluster where every node has Spark and Cassandra installed, with a replication factor of 3 and spark.sql.shuffle.partitions of 96. I am using the Spark-Cassandra Connector 3.0.0 and I am trying to join a dataset with a Cassandra table on the partition key, while also using .repartitionByCassandraReplica.
However, repartitionByCassandraReplica is implemented only on RDDs, so I am converting my dataset to a JavaRDD, doing the repartitionByCassandraReplica, then converting it back to a dataset and doing a direct join with the Cassandra table. It seems, though, that in the process the number of partitions is "changing" or is not as expected.
I am doing a PCA on 4 partition keys which have some thousands of rows and for which I know the nodes where they are stored according to nodetool getendpoints. It looks like not only is the number of partitions changing, but the nodes where data are pulled from are not the ones that actually have the data. Below is the code.
// FYI experimentlist is a List<String> which is converted to a Dataset, then a JavaRDD,
// then partitioned according to repartitionByCassandraReplica, and then back to a Dataset.
// The table with which I want to join it is called experiment.
List<ExperimentForm> tempexplist = experimentlist.stream()
        .map(s -> { ExperimentForm p = new ExperimentForm(); p.setExperimentid(s); return p; })
        .collect(Collectors.toList());

Encoder<ExperimentForm> ExpEncoder = Encoders.bean(ExperimentForm.class);
Dataset<ExperimentForm> dfexplistoriginal = sp.createDataset(tempexplist, Encoders.bean(ExperimentForm.class));
// Below prints DATASET: PartNum 4
System.out.println("DATASET: PartNum " + dfexplistoriginal.rdd().getNumPartitions());

JavaRDD<ExperimentForm> predf = CassandraJavaUtil.javaFunctions(dfexplistoriginal.javaRDD())
        .repartitionByCassandraReplica("mdb", "experiment", experimentlist.size(),
                CassandraJavaUtil.someColumns("experimentid"),
                CassandraJavaUtil.mapToRow(ExperimentForm.class));
// Below prints RDD: PartNum 64
System.out.println("RDD: PartNum " + predf.getNumPartitions());

Dataset<ExperimentForm> newdfexplist = sp.createDataset(predf.rdd(), ExpEncoder);
Dataset<Row> readydfexplist = newdfexplist.as(Encoders.STRING()).toDF("experimentid");
// Below prints DATASET: PartNum 64
System.out.println("DATASET: PartNum " + readydfexplist.rdd().getNumPartitions());

// And finally the direct join, which for some reason is not shown as DirectJoin
// in the DAGs like other times.
Dataset<Row> metlistinitial = sp.read().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {{
            put("keyspace", "mdb");
            put("table", "experiment");
        }})
        .load()
        .select(col("experimentid"), col("description"), col("intensity"))
        .join(readydfexplist, "experimentid");
Is the code wrong? Below are also some images from the Spark UI Stages tab with the DAGs. At first I have 4 tasks/partitions, and after repartitionByCassandraReplica I get 64 or more. Why?
[Spark UI screenshots: Stage 0 DAG, Stage 0 metrics, Stage 1 DAG, Stage 1 metrics]
Looks like the code I wrote above is not entirely correct! I managed to get repartitionByCassandraReplica working by converting the dataset to an RDD, performing the repartitionByCassandraReplica, doing the join with joinWithCassandraTable, and THEN converting back to a dataset! Now it is indeed repartitioned onto the nodes that actually have the data, and the partitions are maintained between these conversions!
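For reference, a minimal sketch of that working order, reusing the names from the question. The exact joinWithCassandraTable overload and the ExperimentRow result bean are assumptions, not the asker's code; check them against the connector 3.0.0 Javadoc:

// 1. Dataset -> JavaRDD, repartitioned so each Spark partition is
//    colocated with a replica that owns its Cassandra partition keys.
JavaRDD<ExperimentForm> repartitioned = CassandraJavaUtil
        .javaFunctions(dfexplistoriginal.javaRDD())
        .repartitionByCassandraReplica("mdb", "experiment", experimentlist.size(),
                CassandraJavaUtil.someColumns("experimentid"),
                CassandraJavaUtil.mapToRow(ExperimentForm.class));

// 2. Join while still an RDD, so the replica-local placement is kept.
//    ExperimentRow is a hypothetical bean matching the selected columns.
JavaPairRDD<ExperimentForm, ExperimentRow> joined = CassandraJavaUtil
        .javaFunctions(repartitioned)
        .joinWithCassandraTable("mdb", "experiment",
                CassandraJavaUtil.someColumns("experimentid", "description", "intensity"),
                CassandraJavaUtil.someColumns("experimentid"),
                CassandraJavaUtil.mapRowTo(ExperimentRow.class),
                CassandraJavaUtil.mapToRow(ExperimentForm.class));

// 3. Only now convert back to a Dataset for further processing.
Dataset<ExperimentRow> result =
        sp.createDataset(joined.values().rdd(), Encoders.bean(ExperimentRow.class));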

Dataproc spark job not able to scan records from bigtable

We are using newAPIHadoopRDD to scan a Bigtable and add the records to an RDD. The RDD gets populated fine for a smaller Bigtable (say, fewer than 100K records). However, it fails to load records into the RDD from a larger (say, 6M records) Bigtable.
SparkConf sparkConf = new SparkConf().setAppName("mc-bigtable-sample-scan");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);

Configuration hbaseConf = HBaseConfiguration.create();
hbaseConf.set(TableInputFormat.INPUT_TABLE, "listings");

Scan scan = new Scan();
scan.addColumn(COLUMN_FAMILY_BASE, COLUMN_COL1);
hbaseConf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));

JavaPairRDD<ImmutableBytesWritable, Result> source = jsc.newAPIHadoopRDD(hbaseConf,
        TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
System.out.println("source count " + source.count());
The count shows correctly for the smaller table, but it shows zero for the larger table.
I tried many different configuration options, like increasing driver memory, the number of executors, and the number of workers, but nothing worked.
Could someone help, please?
My bad, I found the issue in my code. The column COLUMN_COL1 which I was trying to scan was not present in the bigger Bigtable, hence my count came out as 0.
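A hedged way to catch this earlier is to scan by column family first, so rows come back even when a particular qualifier is missing, and only narrow down to a single column once you know it exists. A minimal sketch reusing the variables from the question:

// Scan the whole family instead of one column; rows are returned even
// when COLUMN_COL1 is absent, which makes a missing qualifier obvious.
Scan familyScan = new Scan();
familyScan.addFamily(COLUMN_FAMILY_BASE);
hbaseConf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(familyScan));

JavaPairRDD<ImmutableBytesWritable, Result> familySource = jsc.newAPIHadoopRDD(hbaseConf,
        TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
System.out.println("family scan count " + familySource.count());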

Spark RDD does not get processed in multiple nodes

I have a use case wherein I create an RDD from a Hive table. I wrote business logic that operates on every row in the Hive table. My assumption was that when I create the RDD and run a map over it, it would utilize all of my Spark executors. But what I see in my log is that only one node processes the RDD while the rest of my 5 nodes sit idle. Here is my code:
val flow = hiveContext.sql("select * from humsdb.t_flow")
var x = flow.rdd.map { row =>
  < do some computation on each row >
}
Any clue where I went wrong?
As specified here by @jaceklaskowski:
By default, a partition is created for each HDFS partition, which by default is 64MB (from Spark's Programming Guide).
If your input data is less than 64MB (and you are using HDFS), then by default only one partition will be created.
Spark will use all the nodes when the data is big enough.
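As a small illustration of that default (the JavaSparkContext jsc and the path here are assumptions for the sketch, not part of the answer): a file smaller than one block yields a single partition unless you explicitly request a minimum:

// Reading a small file gives one partition by default; the second
// argument to textFile requests a minimum number of partitions.
JavaRDD<String> lines = jsc.textFile("hdfs:///tmp/small-input.txt", 8);
System.out.println("partitions = " + lines.getNumPartitions());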
Could there be a possibility that your data is skewed?
To rule out this possibility, do the following and rerun the code.
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(200)
var x = flow.rdd.map { row =>
  < do some computation on each row >
}
Further, if your map logic depends on a particular column, you can do the following:
val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(col("yourColumnName"))
var x = flow.rdd.map { row =>
  < do some computation on each row >
}
A good partition column could be a date column.

Converting Dataframe to RDD reduces partitions

In our code, the DataFrame was created as:
DataFrame DF = hiveContext.sql("select * from table_instance");
When I convert my DataFrame to an RDD and try to get its number of partitions with
RDD<Row> newRDD = DF.rdd();
System.out.println(newRDD.getNumPartitions());
it reduces the number of partitions to 1 (1 is printed in the console). Originally my DataFrame had 102 partitions.
UPDATE:
While reading, I repartitioned the DataFrame:
DataFrame DF = hiveContext.sql("select * from table_instance").repartition(200);
and then converted it to an RDD, and it gave me exactly 200 partitions.
Does JavaSparkContext have a role to play in this? When we convert a DataFrame to an RDD, is the default minimum partitions flag also considered at the Spark context level?
UPDATE:
I made a separate sample program in which I read the exact same table into a DataFrame and converted it to an RDD. No extra stage was created for the RDD conversion and the partition count was also correct. I am now wondering what I am doing differently in my main program.
Please let me know if my understanding is wrong here.
It basically depends on the implementation of hiveContext.sql(). Since I am new to Hive, my guess is that hiveContext.sql doesn't know how, or is not able, to split the data present in the table.
For example, when you read a text file from HDFS, the Spark context considers the number of blocks used by that file to determine the partitions.
What you did with repartition is the obvious solution for these kinds of problems. (Note: repartition may cause a shuffle operation if a proper partitioner is not used; a hash partitioner is used by default.)
Coming to your doubt, hiveContext may consider the default minimum-partition property. But relying on the default property is not going to solve all your problems; for instance, if your Hive table's size increases, your program will still use the default number of partitions.
Update: Avoid shuffle during repartition
Define your custom partitioner:
public class MyPartitioner extends HashPartitioner {
    private final int partitions;

    public MyPartitioner(int partitions) {
        super(partitions); // HashPartitioner has no no-arg constructor
        this.partitions = partitions;
    }

    @Override
    public int numPartitions() {
        return this.partitions;
    }

    @Override
    public int getPartition(Object key) {
        if (key instanceof String) {
            return super.getPartition(key);
        } else if (key instanceof Integer) {
            return (Integer.valueOf(key.toString()) % this.partitions);
        } else if (key instanceof Long) {
            return (int) (Long.valueOf(key.toString()) % this.partitions);
        }
        // TODO ... add more types
        return super.getPartition(key); // fallback so every path returns
    }
}
Use your custom partitioner:
JavaPairRDD<Long, SparkDatoinDoc> pairRdd = hiveContext.sql("select * from table_instance")
        .toJavaRDD()
        .mapToPair(/* TODO ... expose the column as key */);
pairRdd = pairRdd.partitionBy(new MyPartitioner(200));
// ... rest of processing

Why is huge data shuffling in Spark when using union()/coalesce(1,false) on DataFrame?

I have a Spark job which does some processing on ORC data and stores the ORC data back using the DataFrameWriter save() API introduced in Spark 1.4.0. I have the following piece of code, which uses heavy shuffle memory. How do I optimize it? Is there anything wrong with it? It works as expected; it is just slow because of GC pauses, and it shuffles lots of data, so it hits memory issues. I am new to Spark.
JavaRDD<Row> updatedDsqlRDD = orderedFrame.toJavaRDD().coalesce(1, false)
        .map(new Function<Row, Row>() {
            @Override
            public Row call(Row row) throws Exception {
                List<Object> rowAsList;
                Row row1 = null;
                if (row != null) {
                    rowAsList = iterate(JavaConversions.seqAsJavaList(row.toSeq()));
                    row1 = RowFactory.create(rowAsList.toArray());
                }
                return row1;
            }
        }).union(modifiedRDD);
DataFrame updatedDataFrame = hiveContext.createDataFrame(updatedDsqlRDD, renamedSourceFrame.schema());
updatedDataFrame.write().mode(SaveMode.Append).format("orc").partitionBy("entity", "date").save("baseTable");
Edit
As per the suggestion I tried to convert the above code into the following using mapPartitionsWithIndex(), but I still see data shuffling. It is better than the code above, but it still fails by hitting the GC limit, throwing OOM, or going into long GC pauses that time out so that YARN kills the executor.
I am using spark.storage.memoryFraction as 0.5 and spark.shuffle.memoryFraction as 0.4; I tried the defaults and changed many combinations, but nothing helped.
JavaRDD<Row> indexedRdd = sourceRdd.cache().mapPartitionsWithIndex(
        new Function2<Integer, Iterator<Row>, Iterator<Row>>() {
            @Override
            public Iterator<Row> call(Integer ind, Iterator<Row> rowIterator) throws Exception {
                List<Row> rowList = new ArrayList<>();
                while (rowIterator.hasNext()) {
                    Row row = rowIterator.next();
                    List<Object> rowAsList = iterate(JavaConversions.seqAsJavaList(row.toSeq()));
                    Row updatedRow = RowFactory.create(rowAsList.toArray());
                    rowList.add(updatedRow);
                }
                return rowList.iterator();
            }
        }, true).coalesce(200, true);
Coalescing an RDD or DataFrame to a single partition means that all your processing happens on a single machine. This is not a good thing for a variety of reasons: all of the data has to be shuffled across the network, there is no more parallelism, and so on. Instead you should look at other operators like reduceByKey, mapPartitions, or really pretty much anything besides coalescing the data to a single machine.
Note: looking at your code, I don't see why you are bringing it down to a single machine; you can probably just remove that part.
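A minimal sketch of that advice, reusing sourceRdd and iterate() from the edited code and simply dropping both coalesce calls so the map stays fully parallel (this uses the Spark 1.x signature, where FlatMapFunction returns an Iterable):

// Same per-row rewrite, but across all existing partitions, so every
// executor participates instead of one machine doing all the work.
JavaRDD<Row> updatedRows = sourceRdd.mapPartitions(
        new FlatMapFunction<Iterator<Row>, Row>() {
            @Override
            public Iterable<Row> call(Iterator<Row> rowIterator) throws Exception {
                List<Row> out = new ArrayList<>();
                while (rowIterator.hasNext()) {
                    List<Object> rowAsList =
                            iterate(JavaConversions.seqAsJavaList(rowIterator.next().toSeq()));
                    out.add(RowFactory.create(rowAsList.toArray()));
                }
                return out; // Spark 2.x changed this return type to Iterator<Row>
            }
        });
// Downstream writes keep whatever partitioning sourceRdd already has,
// so no single machine becomes the bottleneck.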
