Spark shuffle writes growing out of control - apache-spark

I'm using Spark 1.6.1 to process some archives from CommonCrawl. They come as gzipped text files, and I've read that Spark has to load such compressed files into an RDD of a single partition. However, I'm running it on a cluster of ten nodes with 4 CPUs each, so I need to repartition in order to process the data in parallel. These repartition steps are taking what seems like an unacceptably long time, and when I look at the web UI, the shuffle write for any repartition step grows to over 40 GB, even though a single .gz archive is only around 100 MB. Here is the relevant portion of the code I'm running:
final WhitelistFilter<String> filter =
        new WhitelistFilter<String>(WHITELIST_THRESHOLD, termWeights, regex);

//Each URL points to a gzip-compressed text file in the CommonCrawl S3 bucket,
//with ~40,000 pages per archive
int counter = 0;
for (String s : ccURL) {
    //Obtain the WET file from the URL, and filter down to pages of English text
    JavaPairRDD<LongWritable, Text> raw =
            sc.newAPIHadoopFile("sample.wet.gz",
                    org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class,
                    LongWritable.class, Text.class, s3Conf)
              .repartition(sc.defaultParallelism() * 3);

    JavaRDD<String> pages = JavaPairRDD.fromJavaRDD(
            raw.filter(new WETFilter())
               .map(new WETTransformerWithURL())
               .filter(filter))
            .keys();

    pages.saveAsTextFile("pages-" + (counter++) + ".txt");
}
The various functions called in the filter and map steps are just basic text processing - the most complicated thing is assigning a score based on term frequencies and filtering out anything below a threshold. If I remove the call to repartition(), the entire thing will finish quickly, but without any parallelism. What about the repartitioning could be causing it to be so incredibly slow, and also make the block manager write tens of gigabytes to the disk?
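For context, a hedged sketch (Scala API, not the code above) of one way to sidestep the repartition entirely: a .gz file is never splittable, so each archive always becomes exactly one partition no matter what, but parallelism can instead come from reading many archives in a single call.
// Sketch only: assumes ccURL is a Scala collection holding the S3 paths iterated over above.
// Each gzip archive still becomes one partition, but with dozens of archives in one RDD
// the filtering and mapping run in parallel across files, and there is no repartition()
// and therefore no shuffle write at all.
val allArchives = ccURL.mkString(",")     // textFile accepts a comma-separated list of paths
val rawLines = sc.textFile(allArchives)   // one partition per .gz archive
println(rawLines.getNumPartitions)        // roughly the number of archives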

Related

Number of files saved by parquet writer in pyspark

How many files does a pyspark parquet write generate? I have read that the output is one file per in-memory partition. However, this does not always seem to be true.
I am running a cluster with 6 executors and 6G of executor memory per executor. Everything else (pyspark memory, overhead, off-heap) is 2G.
using the following data:
dummy_data = spark.createDataFrame(pd.DataFrame({'a':np.random.choice([1,2,3,4,5,6,7,8,9,10],100000)}))
The following code, where I repartition without specifying a column to repartition by, always produces a number of files equal to the number of in-memory partitions:
df_dummy = dummy_data.repartition(200)
df_dummy.rdd.getNumPartitions()
df_dummy.write.format("parquet").save("gs://monsoon-credittech.appspot.com/spark_datasets/test_writes/df_dummy_repart_wo_id")
#files generated 200
However, the following code, where I do specify the column to repartition the data by, produces some random number of files:
df_dummy = dummy_data.repartition(200,'a')
df_dummy.rdd.getNumPartitions()
df_dummy.write.format("parquet").save("gs://monsoon-credittech.appspot.com/spark_datasets/test_writes/df_dummy_repart_w_id")
#files generated 11
Can you help me understand the number of output files that get generated by the pyspark parquet writer?
This answer does not explain everything you're noticing, but it probably contains enough useful information that it would be a pity not to share it.
The reason you're seeing a different number of output files is the ordering of your data after those two repartition calls.
dummy_data.repartition(200) repartitions your individual rows using round-robin partitioning.
The result is that your data has a random ordering, because your input data has a random ordering.
dummy_data.repartition(200, 'a') uses hash partitioning on column a's values.
The result is that your data is chopped up in a very specific way: hashing the column values puts all rows where a == 1 into the same partition.
Since the number of distinct values of a (10) is smaller than the number of partitions (200), each non-empty partition contains exactly one distinct value of a, and most of the 200 partitions receive no rows at all.
Now, there is a pattern in the amount of output part-files you receive:
In the case of dummy_data.repartition(200), you simply get the same number of part-files as partitions. 200 in your example.
In the other case, you get 11 part-files. If you have a look at the content of those part-files, you will see that there is 1 empty file + 10 filled files. 1 for each distinct value of your original dataset. So this leads to the conclusion that while writing your files, something is being smart and merging those minuscule and identical files. I'm not sure whether this is Spark, or the PARQUET_OUTPUT_COMMITTER_CLASS, or something else.
Conclusion
In general, you get the same amount of part-files as the amount of partitions.
In your specific case, when you're repartitioning by the column (which is the only value in the Row), your parquet part-files will contain a bunch of the same values. It seems that something (I don't know what) is being smart and merging files with the same values.
In your case, you got 11 part-files because there is 1 empty file plus 10 files, one for each distinct value in your dataframe. Try changing [1,2,3,4,5,6,7,8,9,10] to [1,2,3,4,5,6,7,8] in the np.random.choice call and you will see that you get 9 part-files (8 + 1).
Most likely, the reason you see 11 files being written after you do .repartition(200, 'a') is that your first partition (with partition id = 0) ends up empty. Spark lets the task working on that empty partition proceed with the write, but suppresses the write for every other empty partition. This behavior can be traced back to the changes made for JIRA SPARK-21435 "Empty files should be skipped while write to file", and the corresponding code in FileFormatWriter.scala:
  :
  val dataWriter =
    if (sparkPartitionId != 0 && !iterator.hasNext) {
      // In case of empty job, leave first partition to save meta for file format like parquet.
      new EmptyDirectoryDataWriter(description, taskAttemptContext, committer)
    } else if (description.partitionColumns.isEmpty && description.bucketSpec.isEmpty) {
  :
So, if you repartition your dataset such that partition 0 becomes non-empty, you would not see any empty files written.
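As a quick sanity check, here is a hedged sketch (using the Scala API, with dummyData standing in for the question's dummy_data) that counts rows per partition after the hash repartition; only partitions that actually receive a hash bucket show up, which lines up with the 10 + 1 part-files observed above.
import org.apache.spark.sql.functions.{col, spark_partition_id}

// dummyData is assumed to be the same single-column DataFrame as dummy_data above.
val byColumn = dummyData.repartition(200, col("a"))
byColumn
  .groupBy(spark_partition_id().alias("partition_id"))
  .count()
  .orderBy("partition_id")
  .show(200)
// Expect about 10 non-empty partitions (one hash bucket per distinct value of a);
// the other ~190 partitions are empty, and the writer skips them all except partition 0.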

Spark and isolating time taken for tasks

I recently began to use Spark to process a huge amount of data (~1 TB), and I have been able to get the job done. However, I am still trying to understand how it works. Consider the following scenario:
Set a reference time (say tref)
Do any one of the following two tasks:
1a. Read a large amount of data (~1 TB) from tens of thousands of files using SciSpark into RDDs, OR
1b. Read the data as above, do additional preprocessing work, and store the results in a DataFrame
Print the size of the RDD or DataFrame as applicable, and the time difference with respect to tref (i.e., t0a/t0b)
Do some computation
Save the results
In other words, 1b creates a DataFrame after processing RDDs generated exactly as in 1a.
My query is the following:
Is it correct to infer that t0b – t0a = the time required for preprocessing? Where can I find a reliable reference for this?
Edit: explanation added for the origin of the question ...
My suspicion stems from Spark's lazy evaluation and its ability to run jobs asynchronously. Can/does it start subsequent (preprocessing) tasks while thousands of input files are still being read? The suspicion comes from the unbelievable performance I'm seeing (with results verified as correct), which looks too good to be true.
Thanks for any reply.
I believe something like this could assist you (using Scala):
def timeIt[T](op: => T): Float = {
  val start = System.currentTimeMillis
  val res = op
  val end = System.currentTimeMillis
  (end - start) / 1000f
}

def XYZ = {
  val r00 = sc.parallelize(0 to 999999)
  val r01 = r00.map(x => (x, (x, x, x, x, x, x, x)))
  r01.join(r01).count()
}

val time1 = timeIt(XYZ)
// or like this on the next line
//val timeN = timeIt(r01.join(r01).count())
println(s"bla bla $time1 seconds.")
You need to be creative and work incrementally with actions that force actual execution. This approach has its limitations, though, because of lazy evaluation and the like.
On the other hand, the Spark web UI records every action, and records the stage durations for that action.
In general, measuring performance in shared environments is difficult. With dynamic allocation in a noisy cluster, you hold on to acquired resources for the duration of a stage, but on successive runs of the same or the next stage you may get fewer resources. The numbers are at least indicative, though, and you can run in a less busy period.
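Building on timeIt above, here is a hedged sketch of how read time could be separated from preprocessing time despite lazy evaluation: force and cache the raw data first (assuming it fits in the cache), then time only the transformation over the cached RDD. The input path and the preprocess function are placeholders, not anything from the question.
val raw = sc.textFile("hdfs:///path/to/input/*")   // placeholder for the SciSpark read
raw.cache()
val tRead = timeIt { raw.count() }                 // first action materializes and caches the input
val tPrep = timeIt { raw.map(preprocess).count() } // preprocess: your own (placeholder) function
println(s"read: $tRead s, preprocessing: $tPrep s")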

Spark window function on dataframe with large number of columns

I have an ML dataframe which I read from csv files. It contains three types of columns:
ID Timestamp Feature1 Feature2...Feature_n
where n is ~500 (500 features, in ML parlance). The total number of rows in the dataset is ~160 million.
As this is the result of a previous full join, there are many features which do not have values set.
My aim is to run a "fill" function (fillna-style, as in Python pandas), where each empty feature value gets set to the previously available value for that column, per ID and date.
I am trying to achieve this with the following spark 2.2.1 code:
val rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)

val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(-50000, -1)

val columns = Array(...) //first 30 columns initially, just to see it working

val rawDataSetFilled = columns.foldLeft(rawDataset) { (originalDF, columnToFill) =>
  originalDF.withColumn(columnToFill,
    coalesce(col(columnToFill), last(col(columnToFill), ignoreNulls = true).over(window)))
}
I am running this job on 4 m4.large instances on Amazon EMR, with Spark 2.2.1 and dynamic allocation enabled.
The job runs for over 2h without completing.
Am I doing something wrong, at the code level? Given the size of the data, and the instances, I would assume it should finish in a reasonable amount of time? And I haven't even tried with the full 500 columns, just with about 30!
Looking in the container logs, all I see are many logs like this:
INFO codegen.CodeGenerator: Code generated in 166.677493 ms
INFO execution.ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
I have tried setting the parameter spark.sql.windowExec.buffer.spill.threshold to something larger, without any impact. Is there some other setting I should know about? Those two lines are the only ones I see in any container log.
In Ganglia, I see most of the CPU cores peaking around full usage, but the memory usage is lower than the maximum available. All executors are allocated and are doing work.
I have managed to rewrite the foldLeft logic without using withColumn calls. Apparently withColumn can be very slow for a large number of columns, and I was also getting StackOverflowErrors because of it.
I would be curious to know why there is such a massive difference, and what exactly happens behind the scenes in query plan execution that makes repeated withColumn calls so slow.
Links which proved very helpful: Spark Jira issue and this stackoverflow question
var rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)

val window = Window.partitionBy("ID").orderBy("DATE")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

rawDataset = rawDataset.select(rawDataset.columns.map(column =>
  coalesce(col(column), last(col(column), ignoreNulls = true).over(window)).alias(column)): _*)

rawDataset.write.option("header", "true").csv(outputLocation)
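As for the "why", here is a small hedged sketch (reusing the rawDataset and window names from the snippets above, taking rawDataset as the freshly read DataFrame before the reassignment) that makes the difference visible: each withColumn call creates a new Dataset and re-runs analysis over the whole, ever-growing plan on the driver, so the per-column cost keeps increasing, whereas a single select builds and analyzes the plan once.
// Sketch only: assumes rawDataset and window as defined in the snippets above.
val viaWithColumn = rawDataset.columns.foldLeft(rawDataset) { (df, c) =>
  df.withColumn(c, coalesce(col(c), last(col(c), ignoreNulls = true).over(window)))
}
viaWithColumn.explain(true)   // plan assembled through one round of analysis per column

val viaSelect = rawDataset.select(rawDataset.columns.map(c =>
  coalesce(col(c), last(col(c), ignoreNulls = true).over(window)).alias(c)): _*)
viaSelect.explain(true)       // an equivalent plan, built and analyzed in one pass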

How to access this kind of data in Spark

The data is stored in the following forms:
data/file1_features.mat
data/file1_labels.txt
data/file2_features.mat
data/file2_labels.txt
...
data/file100_features.mat
data/file100_labels.txt
Each data/file*_features.mat stores the features of some samples, with one sample per row. Each data/file*_labels.txt stores the labels of those samples, one number per row (e.g., 1, 2, 3, ...). Across the 100 files there are about 80 million samples in total.
In Spark, how to access this data set?
I have checked the spark-2.0.0-preview/examples/src/main/python/mllib/random_forest_classification_example.py. It has the following lines:
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
(trainingData, testData) = data.randomSplit([0.7, 0.3])
I ran this example in ./bin/pyspark, and it shows that the data object is a PythonRDD:
PythonRDD[32] at RDD at PythonRDD.scala:48
The data/mllib/sample_libsvm_data.txt is just one file; in my case there are many files. Is there an RDD in Spark that handles this case conveniently? Do I need to merge all 100 files into one big file and process it as in the example? I want to use the Spark engine to scale the data set (mean-std normalization or min-max normalization).
Simply point Spark at the directory:
dir = "<path_to_data>/data"
sc.textFile(dir)
Spark automatically picks up all of the files inside that directory.
If you want to load only a specific file type, you can use a glob pattern when loading files into an RDD:
dir = "data/*.txt"
sc.textFile(dir)
Spark will then load all files ending with the .txt extension.
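Going one step further, here is a hedged sketch (Scala API) of how the 100 feature/label file pairs could be read in bulk and joined by file and row index. parseMatRows is a hypothetical helper you would have to implement with a MAT-file library; Spark has no built-in .mat reader.
// Sketch under assumptions: parseMatRows(bytes) is a hypothetical function that turns
// the raw bytes of one .mat file into its rows (e.g. Array[Array[Double]]).
val labels = sc.wholeTextFiles("data/file*_labels.txt")      // (path, file contents)
  .flatMap { case (path, contents) =>
    val fileId = path.split("/").last.stripSuffix("_labels.txt")
    contents.split("\n").filter(_.nonEmpty).zipWithIndex.map {
      case (line, row) => ((fileId, row), line.trim.toInt)
    }
  }

val features = sc.binaryFiles("data/file*_features.mat")     // (path, PortableDataStream)
  .flatMap { case (path, stream) =>
    val fileId = path.split("/").last.stripSuffix("_features.mat")
    parseMatRows(stream.toArray()).zipWithIndex.map {
      case (row, i) => ((fileId, i), row)
    }
  }

// One record per sample: ((fileId, rowIndex), (featureVector, label))
val samples = features.join(labels)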

Spark: Cut down no. of output files

I wrote a Spark program that mimics functionality of an existing Map Reduce job. The MR job takes about 50 minutes every day, but the Spark job took only 9 minutes! That’s great!
When I looked at the output directory, I noticed that it created 1,020 part files. The MR job uses only 20 reducers, so it creates only 20 files. We need to cut down the number of output files; otherwise our namespace would fill up in no time.
I am trying to figure out how to reduce the number of output files in Spark. It seems 1,020 tasks are getting triggered and each one creates a part file. Is this correct? Do I have to change the level of parallelism to cut down the number of tasks, and thereby the number of output files? If so, how do I set it? I am afraid cutting down the number of tasks will slow down this process, but I can test that!
Cutting down the number of reduce tasks will slow down the process for sure. However, it still should be considerably faster than Hadoop MapReduce for your use case.
In my opinion, the best method to limit the number of output files is using the coalesce(numPartitions) transformation. Below is an example:
JavaSparkContext ctx = new JavaSparkContext(/*your configuration*/);
JavaRDD<String> myData = ctx.textFile("path/to/my/file.txt");

//Consider we have 1020 partitions and thus 1020 map tasks
JavaRDD<String> mappedData = myData.map( /* your map function */ );

//Consider we need 20 output files
JavaRDD<String> newData = mappedData.coalesce(20);
newData.saveAsTextFile("output path");
In this example, coalesce(20) with the default shuffle = false creates a narrow dependency, so Spark pipelines it into the same stage as the map: the job runs with 20 tasks, each processing about 51 of the original 1,020 partitions and writing one part file. In that case, 20 output files are saved at the end of the program. If you want the map itself to keep running with 1,020 tasks and only reduce the partition count for the write, use coalesce(20, true) or repartition(20), at the cost of a full shuffle.
As mentioned earlier, take into account that this method will be slower than writing 1,020 output files, because the data has to be concentrated into far fewer partitions (from 1,020 down to 20).
Note: please also take a look at the repartition transformation.
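For completeness, a minimal sketch of the trade-off (Scala RDD API here, with mappedData assumed to be the same 1,020-partition RDD as in the Java example; the Java method names are identical):
// Narrow coalesce: no shuffle, but the whole stage (including the map) runs with 20 tasks.
val narrow = mappedData.coalesce(20)
narrow.saveAsTextFile("output-narrow")       // 20 files, map parallelism limited to 20

// Shuffled coalesce (equivalent to repartition(20)): the map keeps its 1,020 tasks,
// then one extra shuffle redistributes the results into 20 partitions for the write.
val shuffled = mappedData.coalesce(20, shuffle = true)
shuffled.saveAsTextFile("output-shuffled")   // 20 files, full map parallelism plus one shuffle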
