Apache Spark write to multiple outputs [different parquet schemas] without caching - apache-spark

I want to transform my input data (XML files) and produce 3 different outputs.
Each output will be in parquet format and will have a different schema/number of columns.
Currently in my solution, the data is stored in an RDD[Row], where each Row belongs to one of three types and has a different number of fields. What I'm doing now is caching the RDD, then filtering it (using a field that tells me the record type) and saving the data with the following method:
var resultDF_1 = sqlContext.createDataFrame(filtered_data_1, schema_1)
resultDF_1.write.parquet(output_path_1)
...
// the same for filtered_data_2 and filtered_data_3
Is there any way to do this better, for example without caching the entire data in memory?
In MapReduce we have the MultipleOutputs class and can do it this way:
MultipleOutputs.addNamedOutput(job, "data_type_1", DataType1OutputFormat.class, Void.class, Group.class);
MultipleOutputs.addNamedOutput(job, "data_type_2", DataType2OutputFormat.class, Void.class, Group.class);
MultipleOutputs.addNamedOutput(job, "data_type_3", DataType3OutputFormat.class, Void.class, Group.class);
...
MultipleOutputs<Void, Group> mos = new MultipleOutputs<>(context);
mos.write("data_type_1", null, myRecordGroup1, filePath1);
mos.write("data_type_2", null, myRecordGroup2, filePath2);
...

We had exactly this problem. To re-iterate: we read thousands of datasets into one RDD, all with different schemas (we used a nested Map[String, Any]), and wanted to write those thousands of datasets to different Parquet partitions in their respective schemas, all in a single embarrassingly parallel Spark stage.
Our initial approach did indeed do the hacky thing of caching, but this meant (a) thousands of passes over the cached data and (b) hitting a lot of memory issues!
For a long time I've wanted to bypass Spark's provided .parquet methods, go down to the lower-level underlying libraries, and wrap that in a nice functional signature. Recently we did exactly this!
The code is too long to copy and paste all of it here, so I will just paste the main crux of the code to explain how it works. We intend to make this code open source in the next year or two.
val successFiles: List[String] = successFilePaths(tableKeyToSchema, tableKeyToOutputKey, tableKeyToOutputKeyNprs)

// MUST happen first
info("Deleting success files")
successFiles.foreach(S3Utils.deleteObject(bucket, _))

if (saveMode == SaveMode.Overwrite) {
  info("Deleting past files as in Overwrite mode")
  parDeleteDirContents(bucket, allDirectories(tableKeyToOutputKey, tableKeyToOutputKeyNprs, partitions, continuallyRunStartTime))
} else {
  info("Not deleting past files as in Append mode")
}

rdd.mapPartitionsWithIndex {
  case (index, records) =>
    records.toList.groupBy(_._1).mapValues(_.map(_._2)).foreach {
      case (regularKey: RegularKey, data: List[NotProcessableRecord Either UntypedStruct]) =>
        val (nprs: List[NotProcessableRecord], successes: List[UntypedStruct]) =
          Foldable[List].partitionEither(data)(identity)

        val filename = s"part-by-partition-index-$index.snappy.parquet"

        Parquet.writeUntypedStruct(
          data = successes,
          schema = toMessageType(tableKeyToSchema(regularKey.tableKey)),
          fsMode = fs,
          path = s3 / bucket / tableKeyToOutputKey(regularKey.tableKey) / regularKey.partition.pathSuffix /?
            continuallyRunStartTime.map(hourMinutePathSuffix) / filename
        )

        Parquet.writeNPRs(
          nprs = nprs,
          fsMode = fs,
          path = s3 / bucket / tableKeyToOutputKeyNprs(regularKey.tableKey) / regularKey.partition.pathSuffix /?
            continuallyRunStartTime.map(hourMinutePathSuffix) / filename
        )
    } pipe Iterator.single
}.count() // Just some action to force execution

info("Writing _SUCCESS files")
successFiles.foreach(S3Utils.uploadFileContent(bucket, "", _))
Of course this code cannot be copied and pasted, as many methods and values are not provided. The key points are:
We hand-crank the deleting of _SUCCESS files, and of previous files when overwriting.
Each Spark partition results in one or many output files (many when multiple data schemas land in the same partition).
We hand-crank the writing of _SUCCESS files.
Notes:
UntypedStruct is our nested representation of an arbitrary schema. It's a little bit like Row in Spark but much better, as it's based on Map[String, Any].
NotProcessableRecords are essentially just dead letters.
Parquet.writeUntypedStruct is the crux of the logic for writing a parquet file, so we'll explain this in more detail. Firstly
val toMessageType: StructType => MessageType = new org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter().convert
This should be self-explanatory. Next, fsMode contains within it the com.amazonaws.auth.AWSCredentials; inside writeUntypedStruct we use that to construct an org.apache.hadoop.conf.Configuration, setting fs.s3a.access.key and fs.s3a.secret.key.
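As a minimal sketch of that step (assuming fsMode simply exposes the credentials; the helper name below is illustrative, not our real API):
import com.amazonaws.auth.AWSCredentials
import org.apache.hadoop.conf.Configuration

// Build a Hadoop Configuration carrying the S3A credentials (sketch only)
def s3aConfig(credentials: AWSCredentials): Configuration = {
  val conf = new Configuration()
  conf.set("fs.s3a.access.key", credentials.getAWSAccessKeyId)
  conf.set("fs.s3a.secret.key", credentials.getAWSSecretKey)
  conf
}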
writeUntypedStruct basically just calls out to:
def writeRaw(
  data: List[UntypedStruct],
  schema: MessageType,
  config: Configuration,
  path: Path,
  compression: CompressionCodecName = CompressionCodecName.SNAPPY
): Unit =
  Using.resource(
    ExampleParquetWriter.builder(path)
      .withType(schema)
      .withConf(config)
      .withCompressionCodec(compression)
      .withValidation(true)
      .build()
  )(writer => data.foreach(struct => writer.write(transpose(struct, new SimpleGroup(schema)))))
where SimpleGroup comes from org.apache.parquet.example.data.simple, and ExampleParquetWriter extends ParquetWriter<Group>. The method transpose is a tedious hand-written recursion through the UntypedStruct that populates a Group (an ugly, mutable, low-level Java type).
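For a flavour of what transpose does, here is a minimal sketch that assumes the UntypedStruct is just a Map[String, Any] and handles only a few primitive types (the real code covers many more types, plus nesting and repetition rules):
import org.apache.parquet.example.data.Group

// Recursively copy a Map[String, Any] into a mutable parquet Group (sketch)
def transpose(struct: Map[String, Any], group: Group): Group = {
  struct.foreach {
    case (field, v: Int)    => group.add(field, v)
    case (field, v: Long)   => group.add(field, v)
    case (field, v: String) => group.add(field, v)
    case (field, v: Map[String, Any] @unchecked) =>
      transpose(v, group.addGroup(field)) // nested structs become nested Groups
    case (field, v) =>
      sys.error(s"Unsupported type for field $field: $v")
  }
  group
}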
Credit must go to https://github.com/davidainslie for figuring out how these underlying libraries work and labouring out the code, which, like I said, we intend to make open source soon!

AFAIK, there is no way to split one RDD into multiple RDDs per se. That is just how Spark's DAG works: only child RDDs pull data from parent RDDs.
We can, however, have multiple child RDDs read from the same parent RDD. To avoid recomputing the parent RDD, there is no other way but to cache it. I assume you want to avoid caching because you're afraid of insufficient memory. We can avoid Out Of Memory (OOM) issues by persisting the RDD with StorageLevel.MEMORY_AND_DISK, so that a large RDD spills to disk if and when needed.
Let's begin with your original data:
import org.apache.spark.sql.Row
import org.apache.spark.storage.StorageLevel

val allDataRDD = sc.parallelize(Seq(Row(1, 1, 1), Row(2, 2, 2), Row(3, 3, 3)))
We can persist this in memory first, but allow it to spill over to disk in case of insufficient memory:
allDataRDD.persist(StorageLevel.MEMORY_AND_DISK)
We then create the 3 RDD outputs:
val filtered_data_1 = allDataRDD.filter(_.get(0) == 1) //
val filtered_data_2 = allDataRDD.filter(_.get(1) == 1) // use your own filter functions here
val filtered_data_3 = allDataRDD.filter(_.get(2) == 1) //
We then write the outputs:
var resultDF_1 = sqlContext.createDataFrame(filtered_data_1, schema_1)
resultDF_1.write.parquet(output_path_1)
var resultDF_2 = sqlContext.createDataFrame(filtered_data_2, schema_2)
resultDF_2.write.parquet(output_path_2)
var resultDF_3 = sqlContext.createDataFrame(filtered_data_3, schema_3)
resultDF_3.write.parquet(output_path_3)
If you truly want to avoid multiple passes, there is a workaround using a custom partitioner. You can repartition your data into 3 partitions, and each partition will have its own task and hence its own output file/part. The caveat is that parallelism is heavily reduced to 3 threads/tasks, and there's also the risk of >2GB of data being stored in a single partition (Spark has a 2GB limit per partition). I am not providing detailed code for this method because I don't think it can write parquet files with different schemas; the sketch below covers only the routing part.
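For completeness, a minimal sketch of just the routing step, assuming (hypothetically) that the record type is encoded as an Int from 1 to 3 in field 0 of each Row:
import org.apache.spark.Partitioner
import org.apache.spark.sql.Row

// Send each record type to its own partition (routing only; no parquet writing)
class RecordTypePartitioner extends Partitioner {
  override def numPartitions: Int = 3
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] - 1
}

val routed = allDataRDD
  .map(row => (row.getInt(0), row))       // key each Row by its record-type field
  .partitionBy(new RecordTypePartitioner)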

Related

How can you get around the 2GB buffer limit when using Dataset.groupByKey?

When using Dataset.groupByKey(_.key).mapGroups or Dataset.groupByKey(_.key).cogroup in Spark, I've run into a problem where one of the groupings results in more than 2GB of data.
I need to normalize the data by group before I can start to reduce it, and I would like to split up the groups into smaller subgroups so they distribute better. For example, here's one way I've attempted to split the groups:
val groupedInputs = inputData.groupByKey(_.key).mapGroups {
  case (key, inputSeries) => inputSeries.grouped(maxGroupSize).map(group => (key, group))
}
But unfortunately, however I try to work around it, my jobs always die with an error like this: java.lang.UnsupportedOperationException: Cannot grow BufferHolder by size 23816 because the size after growing exceeds size limitation 2147483632. When using Kryo serialization I get a different error, Kryo serialization failed: Buffer overflow, recommending that I increase spark.kryoserializer.buffer.max, but I've already increased it to the 2GB limit.
One solution that occurs to me is to add a random value to the keys before grouping them. This isn't ideal since it'll split up every group (not just the large ones), but I'm willing to sacrifice "ideal" for the sake of "working". That code would look something like this:
import java.util.concurrent.ThreadLocalRandom

val splitInputs = inputData.map(record => (record, ThreadLocalRandom.current.nextInt(splitFactor)))
val groupedInputs = splitInputs.groupByKey { case (record, split) => (record.key, split) }.mapGroups {
  case ((key, _), inputSeries) => inputSeries.grouped(maxGroupSize).map(group => (key, group.map(_._1)))
}
Add a salt key, do the groupBy on your key plus the salt key, and aggregate later:
import scala.util.Random

val start = 1
val end = 5
val randUdf = udf({ () => start + Random.nextInt((end - start) + 1) })
val saltGroupBy = skewDF.withColumn("salt_key", randUdf())
  .groupBy(col("name"), col("salt_key"))
This way all the skewed data doesn't go to one executor and hit the 2GB limit.
But you have to develop logic to aggregate the above result and finally remove the salt key at the end.
When you use groupBy, all the records with the same key reach one executor and a bottleneck occurs.
The above is one method to mitigate it.
For this case, where the dataset had a lot of skew and it was important to group the records into regularly-sized groups, I decided to process the dataset in two passes. First I used a window function to number the rows by key, and converted that to a "group index" based on a configurable maxGroupSize:
// The "orderBy" doesn't seem necessary here,
// but the row_number function requires it.
val partitionByKey = Window.partitionBy(key).orderBy(key)
val indexedData = inputData.withColumn("groupIndex",
    (row_number.over(partitionByKey) / maxGroupSize).cast(IntegerType))
  .as[(Record, Int)]
Then I can group by key and index, producing groups that are consistently sized: keys with a lot of records get split up more, and keys with few records may not be split up at all.
indexedData.groupByKey { case (record, groupIndex) => (record.key, groupIndex) }
  .mapGroups { case ((key, _), recordGroup) =>
    // Remove the index values before returning the groups
    (key, recordGroup.map(_._1))
  }

Spark cached Dataset not reused in join

After some complex transformations I call cache on a dataset.
val complexDs = (some joins and transformations).cache
complexDs.count //invoke action and store dataset in memory
Then I define a method like this:
def getCountForType(input: Dataset[Row], tp: String): Dataset[Row] = {
  input
    .groupBy($"column1", $"column2")
    .agg(sum(when($"typeColumn" === tp, 1).otherwise(0)).as("some_column_name"))
}
Now I create several new datasets by executing this method.
val newDs1 = getCountForType(complexDs, "aType")
val newDs2 = getCountForType(complexDs, "bType")
And here comes the funny part. When I call an action on each of these new datasets separately, I get results instantaneously, and on the Spark UI you can tell there is an in-memory table scan. But when I try to do a join like this:
val finalDs = newDs1.as("a").join(newDs2.as("b"), $"a.column1" ===
  $"b.column1" && $"a.column2" === $"b.column2")
display(finalDs)
I can tell something is wrong, as execution gets terribly slow and, on the Spark UI, one dataset is read from cache while the other one is generated from the beginning. Does anyone have an idea what is going on?
UPDATE
When I try (just for testing purposes) to union newDs1 and newDs2, I see there are 2 in-memory table scans and the results are instantaneous. Here are 2 screenshots of the DAGs. Union reads the already-calculated data from memory; Join calculates one dataset from the beginning.

Spark write only to one hbase region server

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark.rdd.PairRDDFunctions

def bulkWriteToHBase(sparkSession: SparkSession, sparkContext: SparkContext, jobContext: Map[String, String], sinkTableName: String, outRDD: RDD[(ImmutableBytesWritable, Put)]): Unit = {
  val hConf = HBaseConfiguration.create()
  hConf.set("hbase.zookeeper.quorum", jobContext("hbase.zookeeper.quorum"))
  hConf.set("zookeeper.znode.parent", jobContext("zookeeper.znode.parent"))
  hConf.set(TableInputFormat.INPUT_TABLE, sinkTableName)
  val hJob = Job.getInstance(hConf)
  hJob.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, sinkTableName)
  hJob.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
  outRDD.saveAsNewAPIHadoopDataset(hJob.getConfiguration())
}
What I have found when using this HBase bulk insertion is that Spark only ever writes to one single HBase region server, which becomes the bottleneck.
However, when I use almost the same approach for reading from HBase, it uses multiple executors to do parallel reads:
import org.apache.hadoop.hbase.client.Result

def bulkReadFromHBase(sparkSession: SparkSession, sparkContext: SparkContext, jobContext: Map[String, String], sourceTableName: String) = {
  val hConf = HBaseConfiguration.create()
  hConf.set("hbase.zookeeper.quorum", jobContext("hbase.zookeeper.quorum"))
  hConf.set("zookeeper.znode.parent", jobContext("zookeeper.znode.parent"))
  hConf.set(TableInputFormat.INPUT_TABLE, sourceTableName)
  val inputRDD = sparkContext.newAPIHadoopRDD(hConf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
  inputRDD
}
Can anyone please explain why this could happen? Or maybe I have used the wrong approach for Spark-HBase bulk I/O?
Question: Have I used the wrong approach for Spark-HBase bulk I/O?
No, your approach is right, although you need to pre-split regions beforehand and create the table with pre-split regions, for example:
create 'test_table', 'f1', SPLITS => ['1', '2', '3', '4', '5', '6', '7', '8', '9']
The table above is pre-split at nine points; design a good rowkey that starts with 1-9 so writes are spread across those regions.
You can use a Guava murmur hash like below:
import com.google.common.base.Charsets;
import com.google.common.hash.HashCode;
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;

/**
 * getMurmurHash.
 *
 * @param content
 * @return HashCode
 */
public static HashCode getMurmurHash(String content) {
    final HashFunction hf = Hashing.murmur3_128();
    final HashCode hc = hf.newHasher().putString(content, Charsets.UTF_8).hash();
    return hc;
}
final long hash = getMurmurHash(Bytes.toString(yourRowkey)).asLong();
final int prefix = Math.abs((int) hash % 9);
Now prepend this prefix to your rowkey.
For example:
1rowkey1 // will go into the first region
2rowkey2 // will go into the second region
3rowkey3 // will go into the third region
...
9rowkey9 // will go into the ninth region
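Putting it together, a minimal Scala sketch of salting the rowkeys before the bulk write (assuming the Java getMurmurHash helper above is accessible from your Scala code; names are illustrative):
import org.apache.hadoop.hbase.util.Bytes

// Prepend the murmur-derived prefix so writes spread across the pre-split regions
def saltedRowkey(rowkey: String): Array[Byte] = {
  val hash = getMurmurHash(rowkey).asLong()
  val prefix = math.abs(hash % 9)
  Bytes.toBytes(s"$prefix$rowkey")
}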
If you are pre-splitting and want to manually manage region splits, you can also disable automatic region splits by setting hbase.hregion.max.filesize to a high number and setting the split policy to ConstantSizeRegionSplitPolicy. However, you should use a safeguard value of around 100GB, so that regions do not grow beyond a region server's capabilities. You can consider disabling automated splitting and relying on the initial set of regions from pre-splitting, for example if you are using uniform hashes for your key prefixes and you can ensure that the read/write load to each region, as well as its size, is uniform across the regions in the table.
1) Please ensure that you pre-split the table before loading data into it. 2) Design a good rowkey, as explained above, using murmur hashing or some other hashing technique to ensure uniform distribution across the regions.
Also look at http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
Question: Can anyone please explain why this could happen?
The reason is quite obvious and simple: HOT SPOTTING of data into one specific region, because of a poor rowkey for that table.
Consider a HashMap in Java whose elements all have hashcode 1234; it will put all the elements in one bucket, won't it? If the HashMap elements are distributed across different, well-spread hashcodes, it will put the elements in different buckets. The same is the case with HBase: here your rowkey plays the role of the hashcode.
Furthermore, what happens if I already have a table and I want to split the regions across...
The RegionSplitter class provides several utilities to help in the administration lifecycle for developers who choose to manually split regions instead of having HBase handle that automatically.
The most useful utilities are:
Create a table with a specified number of pre-split regions
Execute a rolling split of all regions on an existing table
Example :
$ hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f f1
where -c 10 specifies the requested number of regions as 10, and -f specifies the column families you want in the table, separated by ":". The tool will create a table named "test_table" with 10 regions:
13/01/18 18:49:32 DEBUG hbase.HRegionInfo: Current INFO from scan results = {NAME => 'test_table,,1358563771069.acc1ad1b7962564fc3a43e5907e8db33.', STARTKEY => '', ENDKEY => '19999999', ENCODED => acc1ad1b7962564fc3a43e5907e8db33,}
13/01/18 18:49:32 DEBUG hbase.HRegionInfo: Current INFO from scan results = {NAME => 'test_table,19999999,1358563771096.37ec12df6bd0078f5573565af415c91b.', STARTKEY => '19999999', ENDKEY => '33333332', ENCODED => 37ec12df6bd0078f5573565af415c91b,}
...
As discussed in the comments, you found that your final RDD, right before writing into HBase, only has 1 partition, which indicates that only one executor was holding the entire data (you are still trying to find out why).
Also, check spark.default.parallelism: it defaults to the total number of cores on all machines. The parallelize API has no parent RDD to determine the number of partitions, so it uses spark.default.parallelism. You can increase partitions by repartitioning.
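For instance, a one-line sketch (the partition count of 32 is illustrative):
// Spread the write across more tasks, and hence potentially more region servers,
// assuming the rowkeys themselves are well distributed:
outRDD.repartition(32).saveAsNewAPIHadoopDataset(hJob.getConfiguration())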
NOTE: I observed that in MapReduce the number of partitions of the regions/input splits = the number of mappers launched. Similarly, in your case it may be the same situation: all the data was loaded into one particular region, which is why only one executor was launched. Please verify that as well.
Though you have not provided example data or enough explanation, this is mostly not due to your code or configuration.
It is happening due to a non-optimal rowkey design.
The data you are writing has keys (HBase rowkeys) that are improperly structured (maybe monotonically increasing or something else), so all writes go to one of the regions. You can prevent that through various recommended rowkey-design practices, such as salting, inverting, and other techniques.
For reference you can see http://hbase.apache.org/book.html#rowkey.design
In case you are wondering whether the write is done in parallel for all regions or one by one (it's not clear from the question), look at this:
http://hbase.apache.org/book.html#_bulk_load.

How to process tab-separated files in Spark?

I have a file which is tab-separated. The third column should be my key and the entire record should be my value (as per the MapReduce concept).
val cefFile = sc.textFile("C:\\text1.txt")
val cefDim1 = cefFile.filter { line => line.startsWith("1") }
val joinedRDD = cefFile.map(x => x.split("\\t"))
joinedRDD.first().foreach { println }
I am able to get the value of the first column but not the third. Can anyone suggest how I could accomplish this?
After you've done the split with x.split("\\t"), your RDD (which in your example you called joinedRDD, but I'm going to call parsedRDD since we haven't joined it with anything yet) is going to be an RDD of arrays. We can turn this into an RDD of key/value tuples by doing parsedRDD.map(r => (r(2), r)). That being said, you aren't limited to just map and reduce operations in Spark, so it's possible that another data structure might be better suited. Also, for tab-separated files, you could use spark-csv along with Spark DataFrames, if that is a good fit for the eventual problem you are looking to solve.
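A minimal sketch of that suggestion, reusing the file path from the question and assuming every line has at least three columns:
val cefFile = sc.textFile("C:\\text1.txt")
val parsedRDD = cefFile.map(_.split("\\t"))  // RDD of Array[String]
val keyedRDD = parsedRDD.map(r => (r(2), r)) // third column (index 2) as key
keyedRDD.take(5).foreach(println)            // peek at a few key/record pairs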

Ordered union on spark RDDs

I am trying to do a sort on key of key-record pairs using Apache Spark. The key is 10 bytes long and the value is about 90 bytes long. In other words, I am trying to replicate the sort benchmark Databricks used to break the sorting record. One of the things I noticed from the documentation is that they sorted on key-line-number pairs, as opposed to key-record pairs, probably to be cache/TLB friendly. I tried to replicate this approach but have not found a suitable solution. Here is what I have tried:
var keyValueRDD_1 = input.map(x => (x.substring(0, 10), x.substring(12, 13)))
var keyValueRDD_2 = input.map(x => (x.substring(0, 10), x.substring(14, 98)))
var result = keyValueRDD_1.sortByKey(true, 1) // assume partitions = 1
var unionResult = result.union(keyValueRDD_2)
var finalResult = unionResult.foldByKey("")(_+_)
When I do a union on the result RDD and the keyValueRDD_2 RDD and print the output of the unionResult RDD, the result and keyValueRDD_2 contents are not interleaved. In other words, it looks like the unionResult RDD has the keyValueRDD_2 contents followed by the result RDD contents. However, when I do a foldByKey operation, which combines the values of the same key into a single key-value pair, the sorted order is destroyed. I need to do a fold-by-key operation in order to save the result as the original key-record pairs. Is there an alternative RDD function that could be used to achieve this?
Any tips or suggestions would be quite useful.
Thanks
The union method just puts two RDDs one after the other, except if they have the same partitioner, in which case it merges the partitions.
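A quick illustration of that behaviour (assuming a live SparkContext sc):
val a = sc.parallelize(Seq("a1", "a2"), 1)
val b = sc.parallelize(Seq("b1", "b2"), 1)
// union concatenates its operands' partitions; it never interleaves records
a.union(b).collect() // Array(a1, a2, b1, b2)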
What you want to do is impossible.
When you have one sorted RDD (keyValueRDD_1) and another unsorted RDD with the same keys (keyValueRDD_2), the only way to get the second RDD sorted is to sort it.
The existence of the sorted RDD does not help us sort the second RDD.
The Databricks article talks about an optimization that happens locally on the executors. After the shuffle step, the records are roughly sorted: each partition covers a range of keys, but within each partition the records are unsorted.
Now you have to sort each partition locally, and this is where the prefix optimization helps with cache locality.
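A rough sketch of that local step, assuming string keys as in the question (shuffledRDD is an illustrative name; the real implementation works on raw bytes at a much lower level): sort each partition on (short prefix, full key), so that most comparisons touch only the small, cache-friendly prefix.
val locallySorted = shuffledRDD.mapPartitions { records =>
  records.toArray
    .sortBy { case (key, _) => (key.substring(0, 4), key) } // prefix first; full key breaks ties
    .iterator
}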
