How partitions are created in Spark - apache-spark

I'm looking for detailed description on how partitions are created in Spark. I assume its created based on the number of available cores in the cluster. But take for example if I have 512 MB file that needs to be processed and this file will be stored in my storage (which can be HDFS or S3 bucket) with block size of either 64 MB or 128 MB. For this case we can assume my cluster cores is 8. But when the file is getting processed by spark program how the partitions will happen on this. Hope 512 MB file will be divided into 8 different partitions and executed in 8 cores. Pls provide suggestions on this.

I find something in source code of FilePartition.scala .
It seems the number of partitions is related to the configuration parameters "maxSplitBytes" and "filesOpenCostInBytes"
def getFilePartitions(
sparkSession: SparkSession,
partitionedFiles: Seq[PartitionedFile],
maxSplitBytes: Long): Seq[FilePartition] = {
val partitions = new ArrayBuffer[FilePartition]
val currentFiles = new ArrayBuffer[PartitionedFile]
var currentSize = 0L
/** Close the current partition and move to the next. */
def closePartition(): Unit = {
if (currentFiles.nonEmpty) {
// Copy to a new Array.
val newPartition = FilePartition(partitions.size, currentFiles.toArray)
partitions += newPartition
}
currentFiles.clear()
currentSize = 0
}
val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
// Assign files to partitions using "Next Fit Decreasing"
partitionedFiles.foreach { file =>
if (currentSize + file.length > maxSplitBytes) {
closePartition()
}
// Add the given file to the current partition.
currentSize += file.length + openCostInBytes
currentFiles += file
}
closePartition()
partitions.toSeq
}

Related

Write >1 files (limited by size) from a spark partition

I am fetching an RDBMS table using JDBC with some 10-20 partitions using ROW_NUM. Then from each of these partitions I want to process/format the data, and write one or more files out to file storage based on the file size Each file must be less 500MB. How do I write multiple files out from a single partition? This spark config property 'spark.sql.files.maxRecordsPerFile' wont work for me because each row can be of different size as there is blob data in the row. The size of this blob may vary anywhere from a few hundred bytes to 50MB. So I cannot really limit the write by maxRecordsPerFile.
How do I further split each DB partition, into smaller partitions and then write out the files?
If I do a repartition, it does shuffle across all executors. I am trying to keep all data within the same executor to avoid shuffle. Is it possible to repartition within the same executor core (repartition the current partition), and then write a single file from each?
I tried this to atleats calculate the total size of my payload and then repartition. but this fails with "Local variable numOfPartitions defined in an enclosing scope must be final or effectively final". what am I doing wrong. How can I fix this code?
...
int numOfPartitions = 1;
JavaRDD<String> tgg = xmlDataSet.toJavaRDD().mapPartitions ( xmlRows -> {
long totalSize = 0;
List<String> strLst = new ArrayList<String>();
while (xmlRows.hasNext()) {
String xmlString = xmlString = blobToString(xmlRows.next());
totalSize = totalSize + xmlString.getBytes().length;
strLst.add(xmlString);
if (totalSize > 10000) {
numOfPartitions++;
}
}
return strLst.iterator();
});
...

dataset groupByKey mapGroups only using 2 executors out 50 assigned

I have a job that loads some data from Hive and then it does some processing and ends writing data to Cassandra. At some point it was working fine but then all of the sudden one of the Spark operations has a bottleneck where only 2 cores are used even though the partition count it is set to 2000 across the pipeline. I am running Spark version: spark-core_2.11-2.0.0
My Spark configuration is as follows:
spark.executor.instances = "50"
spark.executor.cores = "4"
spark.executor.memory = "6g"
spark.driver.memory = "8g"
spark.memory.offHeap.enabled = "true"
spark.memory.offHeap.size = "4g"
spark.yarn.executor.memoryOverhead = "6096"
hive.exec.dynamic.partition.mode = "nonstrict"
spark.sql.shuffle.partitions = "3000"
spark.unsafe.sorter.spill.reader.buffer.size = "1m"
spark.file.transferTo = "false"
spark.shuffle.file.buffer = "1m"
spark.shuffle.unsafe.file.ouput.buffer = "5m"
When I do a thread dump of the executor that is running I see:
com.*.MapToSalaryRow.buildSalaryRow(SalaryTransformer.java:110)
com.*.MapToSalaryRow.call(SalaryTransformer.java:126)
com.*.MapToSalaryRow.call(SalaryTransformer.java:88)
org.apache.spark.sql.KeyValueGroupedDataset$$anonfun$mapGroups$1.apply(KeyValueGroupedDataset.scala:220)
A simplified version of the code that is having the problem is:
sourceDs.createOrReplaceTempView("salary_ds")
sourceDs.repartition(2000);
System.out.println("sourceDsdataset partition count = "+sourceDs.rdd().getNumPartitions());
Dataset<Row> salaryDs = sourceDs.groupByKey(keyByUserIdFunction, Encoders.LONG()).mapGroups(
new MapToSalaryRow( props), RowEncoder.apply(getSalarySchema())).
filter((FilterFunction<Row>) (row -> row != null));
salaryDs.persist(StorageLevel.MEMORY_ONLY_SER());
salaryDs.repartition(2000);
System.out.println("salaryDs dataset partition count = "+salaryDs.rdd().getNumPartitions());
Both of the above print statements show the partition count being 2000
The relevant code of the function MapGroups is:
class MapToSalaryInsightRow implements MapGroupsFunction<Long, Row, Row> {
private final Properties props;
#Override
public Row call(Long userId, Iterator<Row> iterator) throws Exception {
return buildSalaryRow(userId, iterator, props);
}
}
If anybody can point where the problem might be is highly appreciated.
Thanks
The problem was that there was a column of type array and for one of the rows this array was humongous therefore. Even though the partitions sizes were about the same two of them had 40 times the size. In those cases the task that got those rows would take significantly longer.

Spark binaryRecords() giving less performance as compare to textFile()

I have spark job code as below. Which works fine with below configuration on cluster.
String path = "/tmp/one.txt";
JavaRDD<SomeClass> jRDD = spark.read()
.textFile(path)
.javaRDD()
.map(line -> {
return new SomeClass(line);
});
Dataset responseSet = sparkSession.createDataFrame(jRDD, SomeClass.class);
responseSet.write()
.format("text")
.save(path + "processed");
Whereas, If I want to read binary file(same size as text) it takes much more time.
String path = "/tmp/one.txt";
JavaRDD<SomeClass> jRDD = sparkContext
.binaryRecords(path, 10000, new Configuration())
.toJavaRDD()
.map(line -> {
return new SomeClass(line);
});
Dataset responseSet = spark.createDataFrame(jRDD, SomeClass.class);
responseSet.write()
.format("text")
.save(path + "processed");
Below is my configuration.
driver-memory 8g
executor-memory 6g
num-executors 16
Time taken by first code with 150 MB file is 1.30 mins.
Time taken by second code with 150 MB file is 4 mins.
Also, first code was able to run on all 16 executors whereas second uses only one.
ny suggestions why it is slow?
I found the issue. The textFile()method was creating 16 partitions(you can checknumOfPartitions using getNumPartitions() method on RDD) whereas binaryRecords() created only 1(Java binaryRecords doesn't provide overloaded method which specifies num of partitions to be created).
I increased numOfPartitions on RDD created by binaryRecords() by using repartition(NUM_OF_PARTITIONS) method on RDD.

How to optimize Spark Job processing S3 files into Hive Parquet Table

I am new to Spark distributed development. I'm attempting to optimize my existing Spark job which takes up to 1 hour to complete.
Infrastructure:
EMR [10 instances of r4.8xlarge (32 cores, 244GB)]
Source Data: 1000 .gz files in S3 (~30MB each)
Spark Execution Parameters [Executors: 300, Executor Memory: 6gb, Cores: 1]
In general, the Spark job performs the following:
private def processLines(lines: RDD[String]): DataFrame = {
val updatedLines = lines.mapPartitions(row => ...)
spark.createDataFrame(updatedLines, schema)
}
// Read S3 files and repartition() and cache()
val lines: RDD[String] = spark.sparkContext
.textFile(pathToFiles, numFiles)
.repartition(2 * numFiles) // double the parallelism
.cache()
val numRawLines = lines.count()
// Custom process each line and cache table
val convertedLines: DataFrame = processLines(lines)
convertedRows.createOrReplaceTempView("temp_tbl")
spark.sqlContext.cacheTable("temp_tbl")
val numRows = spark.sql("select count(*) from temp_tbl").collect().head().getLong(0)
// Select a subset of the data
val myDataFrame = spark.sql("select a, b, c from temp_tbl where field = 'xxx' ")
// Define # of parquet files to write using coalesce
val numParquetFiles = numRows / 1000000
var lessParts = myDataFrame.rdd.coalesce(numParquetFiles)
var lessPartsDataFrame = spark.sqlContext.createDataFrame(lessParts, myDataFrame.schema)
lessPartsDataFrame.createOrReplaceTempView('my_view')
// Insert data from view into Hive parquet table
spark.sql("insert overwrite destination_tbl
select * from my_view")
lines.unpersist()
The app reads all S3 files => repartitions to twice the amount of files => caches the RDD => custom processes each line => creates a temp view/cache table => counts the num rows => selects a subset of the data => decrease the amount of partitions => creates a view of the subset of data => inserts to hive destination table using the view => unpersist the RDD.
I am not sure why it takes a long time to execute. Are the spark execution parameters incorrectly set or is there something being incorrectly invoked here?
Before looking at the metrics, I would try the following change to your code.
private def processLines(lines: DataFrame): DataFrame = {
lines.mapPartitions(row => ...)
}
val convertedLinesDf = spark.read.text(pathToFiles)
.filter("field = 'xxx'")
.cache()
val numLines = convertedLinesDf.count() //dataset get in memory here, it takes time
// Select a subset of the data, but it will be fast if you have enough memory
// Just use Dataframe API
val myDataFrame = convertedLinesDf.transform(processLines).select("a","b","c")
//coalesce here without converting to RDD, experiment what best
myDataFrame.coalesce(<desired_output_files_number>)
.write.option(SaveMode.Overwrite)
.saveAsTable("destination_tbl")
Caching is useless if you don't count the number of rows. And it will take some memory and add some GC pressure
Caching table may consume more memory and add more GC pressure
Converting Dataframe to RDD is costly as it implies ser/deser operations
Not sure what you trying to do with : val numParquetFiles = numRows / 1000000 and repartition(2 * numFiles). With your setup, 1000 files of 30MB each will give you 1000 partitions. It could be fine like this. Calling repartition and coalesce may trigger a shuffling operation which is costly. (Coalesce may not trigger a shuffle)
Tell me if you get any improvements !

Is there a better way to load a huge tar file in Spark while avoiding OutOfMemoryError?

I have a single tar file mytar.tar that is 40 GB in size. Inside this tar file are 500 tar.gz files, and inside each one of these tar.gz files are a bunch of JSON files. I have written up the code to process this single tar file and attempt to get the list of JSON string contents. My code looks like the following.
val isRdd = sc.binaryFiles("/mnt/mytar.tar")
.flatMap(t => {
val buf = scala.collection.mutable.ListBuffer.empty[TarArchiveInputStream]
val stream = t._2
val is = new TarArchiveInputStream(stream.open())
var entry = is.getNextTarEntry()
while (entry != null) {
val name = entry.getName()
val size = entry.getSize.toInt
if (entry.isFile() && size > -1) {
val content = new Array[Byte](size)
is.read(content, 0, content.length)
val tgIs = new TarArchiveInputStream(new GzipCompressorInputStream(new ByteArrayInputStream(content)))
buf += tgIs
}
entry = is.getNextTarEntry()
}
buf.toList
})
.cache
val byteRdd = isRdd.flatMap(is => {
val buf = scala.collection.mutable.ListBuffer.empty[Array[Byte]]
var entry = is.getNextTarEntry()
while (entry != null) {
val name = entry.getName()
val size = entry.getSize.toInt
if (entry.isFile() && name.endsWith(".json") && size > -1) {
val data = new Array[Byte](size)
is.read(data, 0, data.length)
buf += data
}
entry = is.getNextTarEntry()
}
buf.toList
})
.cache
val jsonRdd = byteRdd
.map(arr => getJson(arr))
.filter(_.length > 0)
.cache
jsonRdd.count //action just to execute the code
When I execute this code, I get an OutOfMemoryError (OOME).
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 24.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 24.0 (TID 137, 10.162.224.171, executor 13):
java.lang.OutOfMemoryError: Java heap space
My EC2 cluster has 1 driver and 2 worker nodes of type i3.xlarge (30.5 GB Memory, 4 Cores). From looking at the logs and just thinking about it, I believe the OOME happens during the creation of the isRDD (input stream RDD).
Is there anything else in the code or creation of my Spark cluster that I can do to mitigate this problem? Should I select an EC2 instance with more memory (e.g. a memory optimized instance like R5.2xlarge)? FWIW, I upgraded to an R5.2xlarge cluster setting and still saw the OOME.
One thing I have thought about doing was to untar mytar.tar and instead start with the .tar.gz files inside. I am thinking that each .tar.gz inside the tar file will have to be less than 30 GB to avoid the OOME (on the i3.xlarge).
Any tips or advice is appreciated.

Resources