What is an efficient way to partition by column but maintain a fixed partition count? - apache-spark

What is the best way to partition the data by a field into predefined partition count?
I am currently partitioning the data by specifying the partionCount=600. The count 600 is found to give best query performance for my dataset/cluster setup.
val rawJson = sqlContext.read.json(filename).coalesce(600)
rawJson.write.parquet(filenameParquet)
Now I want to partition this data by the column 'eventName' but still keep the count 600. The data currently has around 2000 unique eventNames, plus the number of rows in each eventName is not uniform. Around 10 eventNames have more than 50% of the data causing data skew. Hence if I do the partitioning like below, its not very performant. The write is taking 5x more time than without.
val rawJson = sqlContext.read.json(filename)
rawJson.write.partitionBy("eventName").parquet(filenameParquet)
What is a good way to partition the data for these scenarios? Is there a way to partition by eventName but spread this into 600 partitions?
My schema looks like this:
{
"eventName": "name1",
"time": "2016-06-20T11:57:19.4941368-04:00",
"data": {
"type": "EventData",
"dataDetails": {
"name": "detailed1",
"id": "1234",
...
...
}
}
}
Thanks!

This is a common problem with skewed data and there are several approaches you can take.
List bucketing works if the skew remains stable over time, which may or may not be the case, especially if new values of the partitioning variable are introduced. I have not researched how easy it is to adjust list bucketing over time and, as your comment states, you can't use that anyway because it is a Spark 2.0 feature.
If you are on 1.6.x, the key observation is that you can create your own function that maps each event name into one of 600 unique values. You can do this as a UDF or as a case expression. Then, you simply create a column using that function and then partition by that column using repartition(600, 'myPartitionCol) as opposed to coalesce(600).
Because we deal with very skewed data at Swoop, I've found the following workhorse data structure to be quite useful for building partitioning-related tools.
/** Given a key, returns a random number in the range [x, y) where
* x and y are the numbers in the tuple associated with a key.
*/
class RandomRangeMap[A](private val m: Map[A, (Int, Int)]) extends Serializable {
private val r = new java.util.Random() // Scala Random is not serializable in 2.10
def apply(key: A): Int = {
val (start, end) = m(key)
start + r.nextInt(end - start)
}
override def toString = s"RandomRangeMap($r, $m)"
}
For example, here is how we build a partitioner for a slightly different case: one where the data is skewed and the number of keys is small so we have to increase the number of partitions for the skewed keys while sticking with 1 as the minimum number of partitions per key:
/** Partitions data such that each unique key ends in P(key) partitions.
* Must be instantiated with a sequence of unique keys and their Ps.
* Partition sizes can be highly-skewed by the data, which is where the
* multiples come in.
*
* #param keyMap maps key values to their partition multiples
*/
class ByKeyPartitionerWithMultiples(val keyMap: Map[Any, Int]) extends Partitioner {
private val rrm = new RandomRangeMap(
keyMap.keys
.zip(
keyMap.values
.scanLeft(0)(_+_)
.zip(keyMap.values)
.map {
case (start, count) => (start, start + count)
}
)
.toMap
)
override val numPartitions =
keyMap.values.sum
override def getPartition(key: Any): Int =
rrm(key)
}
object ByKeyPartitionerWithMultiples {
/** Builds a UDF with a ByKeyPartitionerWithMultiples in a closure.
*
* #param keyMap maps key values to their partition multiples
*/
def udf(keyMap: Map[String, Int]) = {
val partitioner = new ByKeyPartitionerWithMultiples(keyMap.asInstanceOf[Map[Any, Int]])
(key:String) => partitioner.getPartition(key)
}
}
In your case, you have to merge several event names into a single partition, which would require changes but I hope the code above gives you an idea how to approach the problem.
One final observation is that if the distribution of event names values a lot in your data over time, you can perform a statistics gathering pass over some part of the data to compute a mapping table. You don't have to do this all the time, just when it is needed. To determine that, you can look at the number of rows and/or size of output files in each partition. In other words, the entire process can be automated as part of your Spark jobs.

Related

Reading guarantees for full table scan while updating the table?

Given schema:
CREATE TABLE keyspace.table (
key text,
ckey text,
value text
PRIMARY KEY (key, ckey)
)
...and Spark pseudocode:
val sc: SparkContext = ...
val connector: CassandraConnector = ...
sc.cassandraTable("keyspace", "table")
.mapPartitions { partition =>
connector.withSessionDo { session =>
partition.foreach { row =>
val key = row.getString("key")
val ckey = Random.nextString(42)
val value = row.getString("value")
session.execute(s"INSERT INTO keyspace.table (key, ckey, value)" +
" VALUES ($key, $ckey, $value)")
}
}
}
Is it possible for a code like this to read an inserted value within a single application (Spark job) run? More generalized version of my question would be whether a token range scan CQL query can read newly inserted values while iterating over rows.
Yes, it is possible exactly as Alex wrote
but I don't think it's possible with above code
So per data model the table is ordered by ckey in ascending order
The funny part however is the page size and how many pages are prefetched and since this is by default 1000 (spark.cassandra.input.fetch.sizeInRows), then the only problem could occur, if you wouldn't use 42, but something bigger and/or the executor didn't page yet
Also I think you use unnecessary nesting, so the code to achieve what you want might be simplified (after all cassandraTable will give you a data frame).
(I hope I understand that you want to read per partition (note a partition in your case is all rows under one primary key - "key") and for every row (distinguished by ckey) in this partition generate new one (with new ckey that will just duplicate value with new ckey) - use case for such code is a mystery for me, but I hope it has some sense:-))

How would you limit the number of records to process per grouped key in spark? (for skewed data)

I have two large datasets. There are multiple groupings of the same ids. Each group has a score. I'm trying to broadcast the score to each id in each group. But I have a nice constraint that I don't care about groups with more than 1000 ids.
Unfortunately, Spark keeps reading the full grouping. I can't seem to figure out a way to push down the limit so that Spark only reads up to 1000 records, and if there are any more gives up.
So far I've tried this:
def run: Unit = {
// ...
val scores: RDD[(GroupId, Score)] = readScores(...)
val data: RDD[(GroupId, Id)] = readData(...)
val idToScore: RDD[(Id, Score)] = scores.cogroup(data)
.flatMap(maxIdsPerGroupFilter(1000))
// ...
}
def maxIdsPerGroupFilter(maxIds: Int)(t: (GroupId, (Iterable[Score], Iterable[Id]))): Iterator[(Id, Score)] = {
t match {
case (groupId: GroupId, (scores: Iterable[Score], ids: Iterable[Id])) =>
if (!scores.iterator.hasNext) {
return Iterator.empty
}
val score: Score = scores.iterator.next()
val iter = ids.iterator
val uniqueIds: mutable.HashSet[Id] = new mutable.HashSet[Id]
while (iter.hasNext) {
uniqueIds.add(iter.next())
if (uniqueIds.size > maxIds) {
return Iterator.empty
}
}
uniqueIds.map((_, score)).iterator
}
}
(Even with variants where the filter function just returns empty iterators, Spark still is insistent on reading all the data)
The side effect of this is that because some groups have too many ids, I have a lot of skew in the data and the job can never finish when processing the full scale of data.
I want the reduce-side to only read in the data it needs, and not crap out because of data skew.
I have a feeling that somehow I need to create a transform that is able to push down a limit or take clause, but I can't figure out how.
Can't we just filter out those groups which have records more than 1k using count() in grouped data?
or if you want to have those groups also which have more than 1k records but only to pick upto 1k records then in spark sql query you can use ROW_NUMBER() OVER (PARTITION BY id ORDER BY someColumn DESC) AS rn and then put condition rn<=1000.

How can you get around the 2GB buffer limit when using Dataset.groupByKey?

When using Dataset.groupByKey(_.key).mapGroups or Dataset.groupByKey(_.key).cogroup in Spark, I've run into a problem when one of the groupings results in more than 2GB of data.
I need to normalize the data by group before I can start to reduce it, and I would like to split up the groups into smaller subgroups so they distribute better. For example, here's one way I've attempted to split the groups:
val groupedInputs = inputData.groupByKey(_.key).mapGroups {
case(key, inputSeries) => inputSeries.grouped(maxGroupSize).map(group => (key, group))
}
But unfortunately however I try to work around it, my jobs always die with an error like this: java.lang.UnsupportedOperationException: Cannot grow BufferHolder by size 23816 because the size after growing exceeds size limitation 2147483632. When using Kryo serialization I get a different Kryo serialization failed: Buffer overflow error recommending I increase spark.kryoserializer.buffer.max, but I've already increased it to the 2GB limit.
One solution that occurs to me is to add a random value to the keys before grouping them. This isn't ideal since it'll split up every group (not just the large ones), but I'm willing to sacrifice "ideal" for the sake of "working". That code would look something like this:
val splitInputs = inputData.map( record => (record, ThreadLocalRandom.current.nextInt(splitFactor)))
val groupedInputs = splitInputs.groupByKey{ case(record, split) => (record.key, split)).mapGroups {
case((key, _), inputSeries) => inputSeries.grouped(maxGroupSize).map(group => (key, group.map(_._1)))
}
Add a salt key and do groupBy on your key and the salt key and later
import scala.util.Random
val start = 1
val end = 5
val randUdf = udf({() => start + Random.nextInt((end - start) + 1)})
val saltGroupBy=skewDF.withColumn("salt_key", randUdf())
.groupBy(col("name"), col("salt_key"))
So your all the skew data doesn't go into one executor and cause the 2GB Limit.
But you have to develop a logic to aggregate the above result and finally remove the salt key at the end.
When you use groupBy all the records with the same key will reach one executor and bottle neck occur.
The above is one of the method to mitigate it.
For this case, where the dataset had a lot of skew and it was important to group the records into regularly-sized groups, I decided to process the dataset in two passes. First I used a window function to number the rows by key, and converted that to a "group index," based on a configurable "maxGroupSize":
// The "orderBy" doesn't seem necessary here,
// but the row_number function requires it.
val partitionByKey = Window.partitionBy(key).orderBy(key)
val indexedData = inputData.withColumn("groupIndex",
(row_number.over(partitionByKey) / maxGroupSize).cast(IntegerType))
.as[(Record, Int)]
Then I can group by key and index, and produce groups that are consistently sized--the keys with a lot of records get split up more, and the keys with few records may not be split up at all.
indexedData.groupByKey{ case (record, groupIndex) => (record.key, groupIndex) }
.mapGroups{ case((key, _), recordGroup) =>
// Remove the index values before returning the groups
(key, recordGroup.map(_._1))
}

Spark write only to one hbase region server

import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark.rdd.PairRDDFunctions
def bulkWriteToHBase(sparkSession: SparkSession, sparkContext: SparkContext, jobContext: Map[String, String], sinkTableName: String, outRDD: RDD[(ImmutableBytesWritable, Put)]): Unit = {
val hConf = HBaseConfiguration.create()
hConf.set("hbase.zookeeper.quorum", jobContext("hbase.zookeeper.quorum"))
hConf.set("zookeeper.znode.parent", jobContext("zookeeper.znode.parent"))
hConf.set(TableInputFormat.INPUT_TABLE, sinkTableName)
val hJob = Job.getInstance(hConf)
hJob.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, sinkTableName)
hJob.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
outRDD.saveAsNewAPIHadoopDataset(hJob.getConfiguration())
}
what I have found by using this hbase bulk insertion is that, every time spark will only write into one single region server from hbase, which becomes the bottleneck.
however when I use almost the same approach but reading from hbase, it is using multiple executors to do parallel reading .
def bulkReadFromHBase(sparkSession: SparkSession, sparkContext: SparkContext, jobContext: Map[String, String], sourceTableName: String) = {
val hConf = HBaseConfiguration.create()
hConf.set("hbase.zookeeper.quorum", jobContext("hbase.zookeeper.quorum"))
hConf.set("zookeeper.znode.parent", jobContext("zookeeper.znode.parent"))
hConf.set(TableInputFormat.INPUT_TABLE, sourceTableName)
val inputRDD = sparkContext.newAPIHadoopRDD(hConf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
inputRDD
}
can anyone please explain why this could happen? or maybe I have
used the wrong way for spark-hbase bulk I/O ?
Question : I have used the wrong way for spark-hbase bulk I/O ?
No your way is right, although, you need to pre-split regions before hand & create table with presplit regions.
for example create 'test_table', 'f1', SPLITS=> ['1', '2', '3', '4', '5', '6', '7', '8', '9']
Above table occupies 9 regions..
design good rowkey with will starts with 1-9
you can use guava murmur hash like below.
import com.google.common.hash.HashCode;
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
/**
* getMurmurHash.
*
* #param content
* #return HashCode
*/
public static HashCode getMurmurHash(String content) {
final HashFunction hf = Hashing.murmur3_128();
final HashCode hc = hf.newHasher().putString(content, Charsets.UTF_8).hash();
return hc;
}
final long hash = getMurmur128Hash(Bytes.toString(yourrowkey as string)).asLong();
final int prefix = Math.abs((int) hash % 9);
now append this prefix to your rowkey
For example
1rowkey1 // will go in to first region
2rowkey2 // will go in to
second region
3rowkey3 // will go in to third region
...
9rowkey9 //
will go in to ninth region
If you are doing pre-splitting, and want to manually manage region splits, you can also disable region splits, by setting hbase.hregion.max.filesize to a high number and setting the split policy to ConstantSizeRegionSplitPolicy. However, you should use a safeguard value of like 100GB, so that regions does not grow beyond a region server’s capabilities. You can consider disabling automated splitting and rely on the initial set of regions from pre-splitting for example, if you are using uniform hashes for your key prefixes, and you can ensure that the read/write load to each region as well as its size is uniform across the regions in the table
1) please ensure that you can presplit the table before loading data in to hbase table 2) Design good rowkey as Explained below using murmurhash or some other hashing technique. to ensure uniform distribution across the regions.
Also look at http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
Question : can anyone please explain why this could happen?
reason is quite obvious and simple HOT SPOTTING of data in to one specific reason becuase of poor rowkey for that table...
Consider a hashmap in java which has elements with hashcode 1234. then it will fill all the elements in one bucket isntit ? If hashmap elements are distributed across different good hashcode then it will put elements in different buckets. same is the case with hbase. here your hashcode is just like your rowkey...
Further more,
What happens if I already have a table and I want to split the regions
across...
The RegionSplitter class provides several utilities to help in the administration lifecycle for developers who choose to manually split regions instead of having HBase handle that automatically.
The most useful utilities are:
Create a table with a specified number of pre-split regions
Execute a rolling split of all regions on an existing table
Example :
$ hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f f1
where -c 10, specifies the requested number of regions as 10, and -f specifies the column families you want in the table, separated by “:”. The tool will create a table named “test_table” with 10 regions:
13/01/18 18:49:32 DEBUG hbase.HRegionInfo: Current INFO from scan results = {NAME => 'test_table,,1358563771069.acc1ad1b7962564fc3a43e5907e8db33.', STARTKEY => '', ENDKEY => '19999999', ENCODED => acc1ad1b7962564fc3a43e5907e8db33,}
13/01/18 18:49:32 DEBUG hbase.HRegionInfo: Current INFO from scan results = {NAME => 'test_table,19999999,1358563771096.37ec12df6bd0078f5573565af415c91b.', STARTKEY => '19999999', ENDKEY => '33333332', ENCODED => 37ec12df6bd0078f5573565af415c91b,}
...
as discussed in comment, you found that my final RDD right before writing into hbase only has 1 partition! which indicates that there
was only one executor holding the entire data... I am still trying to
find out why.
Also, Check
spark.default.parallelism defaults to the number of all cores on all
machines. The parallelize api has no parent RDD to determine the
number of partitions, so it uses the spark.default.parallelism.
So You can increase partitions by repartitioning.
NOTE : I observed that, In Mapreduce The number of partitions of the regions/input split = number of mappers launched.. Similarly in your case it may be the same situation where data loaded in to one particular region thats why one executor lauched. please verify that as well
Though you have not provided example data or enough explanation,this is mostly not due to your code or configuration.
It is happening so,due to non-optimal rowkey design.
The data you are writing is having keys(hbase rowkey) improperly structured(maybe monotonically increasing or something else).So, write to one of the regions is happening.You can prevent that thro' various ways(various recommended practices for rowkey design like salting,inverting,and other techniques).
For reference you can see http://hbase.apache.org/book.html#rowkey.design
In case,if you are wondering whether the write is done in parallel for all regions or one by one(not clear from question) look at this :
http://hbase.apache.org/book.html#_bulk_load.

Spark - convert string IDs to unique integer IDs

I have a dataset which looks like this, where each user and product ID is a string:
userA, productX
userA, productX
userB, productY
with ~2.8 million products and 300 million users; about 2.1 billion user-product associations.
My end goal is to run Spark collaborative filtering (ALS) on this dataset. Since it takes int keys for users and products, my first step is to assign a unique int to each user and product, and transform the dataset above so that users and products are represented by ints.
Here's what I've tried so far:
val rawInputData = sc.textFile(params.inputPath)
.filter { line => !(line contains "\\N") }
.map { line =>
val parts = line.split("\t")
(parts(0), parts(1)) // user, product
}
// find all unique users and assign them IDs
val idx1map = rawInputData.map(_._1).distinct().zipWithUniqueId().cache()
// find all unique products and assign IDs
val idx2map = rawInputData.map(_._2).distinct().zipWithUniqueId().cache()
idx1map.map{ case (id, idx) => id + "\t" + idx.toString
}.saveAsTextFile(params.idx1Out)
idx2map.map{ case (id, idx) => id + "\t" + idx.toString
}.saveAsTextFile(params.idx2Out)
// join with user ID map:
// convert from (userStr, productStr) to (productStr, userIntId)
val rev = rawInputData.cogroup(idx1map).flatMap{
case (id1, (id2s, idx1s)) =>
val idx1 = idx1s.head
id2s.map { (_, idx1)
}
}
// join with product ID map:
// convert from (productStr, userIntId) to (userIntId, productIntId)
val converted = rev.cogroup(idx2map).flatMap{
case (id2, (idx1s, idx2s)) =>
val idx2 = idx2s.head
idx1s.map{ (_, idx2)
}
}
// save output
val convertedInts = converted.map{
case (a,b) => a.toInt.toString + "\t" + b.toInt.toString
}
convertedInts.saveAsTextFile(params.outputPath)
When I try to run this on my cluster (40 executors with 5 GB RAM each), it's able to produce the idx1map and idx2map files fine, but it fails with out of memory errors and fetch failures at the first flatMap after cogroup. I haven't done much with Spark before so I'm wondering if there is a better way to accomplish this; I don't have a good idea of what steps in this job would be expensive. Certainly cogroup would require shuffling the whole data set across the network; but what does something like this mean?
FetchFailed(BlockManagerId(25, ip-***.ec2.internal, 48690), shuffleId=2, mapId=87, reduceId=25)
The reason I'm not just using a hashing function is that I'd eventually like to run this on a much larger dataset (on the order of 1 billion products, 1 billion users, 35 billion associations), and number of Int key collisions would become quite large. Is running ALS on a dataset of that scale even close to feasible?
I looks like you are essentially collecting all lists of users, just to split them up again. Try just using join instead of cogroup, which seems to me to do more like what you want. For example:
import org.apache.spark.SparkContext._
// Create some fake data
val data = sc.parallelize(Seq(("userA", "productA"),("userA", "productB"),("userB", "productB")))
val userId = sc.parallelize(Seq(("userA",1),("userB",2)))
val productId = sc.parallelize(Seq(("productA",1),("productB",2)))
// Replace userName with ID's
val userReplaced = data.join(userId).map{case (_,(prod,user)) => (prod,user)}
// Replace product names with ID's
val bothReplaced = userReplaced.join(productId).map{case (_,(user,prod)) => (user,prod)}
// Check results:
bothReplaced.collect()) // Array((1,1), (1,2), (2,2))
Please drop a comments on how well it performs.
(I have no idea what FetchFailed(...) means)
My platform version : CDH :5.7, Spark :1.6.0/StandAlone;
My Test Data Size:31815167 all data; 31562704 distinct user strings, 4140276 distinct product strings .
First idea:
My first idea is to use collectAsMap action and then use the map idea to change the user/product string to int . With driver memory up to 12G , i got OOM or GC overhead exception (the exception is limited by driver memory).
But this idea can only use on a small data size, with bigger data size , you need a bigger driver memory .
Second idea :
Second idea is to use join method, as Tobber proposaled. Here is some test result:
Job setup:
driver: 2G , 2 cpu;
executor : (8G , 4 cpu) * 7;
I follow the steps:
1) find unique user strings and zipWithIndexes;
2) join the original data;
3) save the encoded data;
The job take about 10 minutes to finish.

Resources