Why do all data end up in one partition after reduceByKey? - apache-spark
I have a Spark application that contains the following segment:
val repartitioned = rdd.repartition(16)
val filtered: RDD[(MyKey, myData)] = MyUtils.filter(repartitioned, startDate, endDate)
val mapped: RDD[(DateTime, myData)] = filtered.map(kv => (kv._1.processingTime, kv._2))
val reduced: RDD[(DateTime, myData)] = mapped.reduceByKey(_+_)
When I run this with some logging, this is what I see:
repartitioned ======> [List(2536, 2529, 2526, 2520, 2519, 2514, 2512, 2508, 2504, 2501, 2496, 2490, 2551, 2547, 2543, 2537)]
filtered ======> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
mapped ======> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
reduced ======> [List(0, 0, 0, 0, 0, 0, 922, 0, 0, 0, 0, 0, 0, 0, 0, 0)]
My logging is done using these two lines:
val sizes: RDD[Int] = rdd.mapPartitions(iter => Array(iter.size).iterator, true)
log.info(s"rdd ======> [${sizes.collect().toList}]")
My question is: why does my data end up in a single partition after the reduceByKey? After the filter and map the data is still evenly distributed across partitions, but the reduceByKey puts everything into one partition.
I am guessing all your processing times are the same. Alternatively, their hashCode values (from the DateTime class) are the same. Is that a custom class?
I will answer my own question, since I figured it out. My DateTimes were all without seconds and milliseconds, since I wanted to group data belonging to the same minute. The hashCode() values of Joda DateTimes that are one minute apart differ by a constant:
scala> val now = DateTime.now
now: org.joda.time.DateTime = 2015-11-23T11:14:17.088Z
scala> now.withSecondOfMinute(0).withMillisOfSecond(0).hashCode - now.minusMinutes(1).withSecondOfMinute(0).withMillisOfSecond(0).hashCode
res42: Int = 60000
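Spark's default HashPartitioner assigns a key to a partition by taking the key's hashCode modulo the number of partitions. Since 60000 is divisible by 16, every minute-truncated DateTime in my 16-partition RDD maps to the same partition. A minimal sketch of that arithmetic (sparkLikePartition is only an illustration of the hashCode-modulo logic, not Spark's actual class):

import org.joda.time.DateTime

// Illustration only: mimics the key.hashCode-modulo-numPartitions assignment
def sparkLikePartition(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

val base = DateTime.now.withSecondOfMinute(0).withMillisOfSecond(0)
(0 until 5).map(i => sparkLikePartition(base.plusMinutes(i), 16))
// all five keys land in the same partition, because 60000 % 16 == 0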
As can be seen from this example, keys whose hashCode values are spaced by a multiple of the partition count all end up in the same partition:
scala> val nums = for(i <- 0 to 1000000) yield ((i*20 % 1000), i)
nums: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((0,0), (20,1), (40,2), (60,3), (80,4), (100,5), (120,6), (140,7), (160,8), (180,9), (200,10), (220,11), (240,12), (260,13), (280,14), (300,15), (320,16), (340,17), (360,18), (380,19), (400,20), (420,21), (440,22), (460,23), (480,24), (500,25), (520,26), (540,27), (560,28), (580,29), (600,30), (620,31), (640,32), (660,33), (680,34), (700,35), (720,36), (740,37), (760,38), (780,39), (800,40), (820,41), (840,42), (860,43), (880,44), (900,45), (920,46), (940,47), (960,48), (980,49), (0,50), (20,51), (40,52), (60,53), (80,54), (100,55), (120,56), (140,57), (160,58), (180,59), (200,60), (220,61), (240,62), (260,63), (280,64), (300,65), (320,66), (340,67), (360,68), (380,69), (400,70), (420,71), (440,72), (460,73), (480,74), (500...
scala> val rddNum = sc.parallelize(nums)
rddNum: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:23
scala> val reducedNum = rddNum.reduceByKey(_+_)
reducedNum: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[1] at reduceByKey at <console>:25
scala> reducedNum.mapPartitions(iter => Array(iter.size).iterator, true).collect.toList
res2: List[Int] = List(50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
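The same modulo arithmetic explains this result: assuming the shuffle above ran with the 20 partitions visible in the output, every key is a multiple of 20, so key.hashCode % 20 is 0 for all of them and everything lands in partition 0. A one-line check:

(0 until 1000 by 20).map(_ % 20).distinct   // Vector(0) -- every key maps to partition 0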
To distribute my data more evenly across the partitions, I created my own custom Partitioner:
import org.apache.spark.Partitioner
import org.joda.time.DateTime

class JodaPartitioner(rddNumPartitions: Int) extends Partitioner {
  def numPartitions: Int = rddNumPartitions
  def getPartition(key: Any): Int = {
    key match {
      case dateTime: DateTime =>
        val sum = dateTime.getYear + dateTime.getMonthOfYear + dateTime.getDayOfMonth + dateTime.getMinuteOfDay + dateTime.getSecondOfDay
        sum % numPartitions
      case _ => 0
    }
  }
}
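For completeness, a sketch of how this might plug into the original pipeline; reduceByKey has an overload that accepts a Partitioner explicitly (mapped and myData are the names from the question):

val reduced: RDD[(DateTime, myData)] = mapped.reduceByKey(new JodaPartitioner(16), _ + _)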
Related
Iterative RDD/Dataframe processing in Spark
My ADLA solution is being transitioned to Spark. I'm trying to find the right replacement for the U-SQL REDUCE expression to enable:
- Read a logical partition and store information in a list/dictionary/vector or other in-memory data structure
- Apply logic that requires multiple iterations
- Output results as additional columns together with the original data (the original rows might be partially eliminated or duplicated)

Example of a possible task:
- The input dataset has sales and return transactions with their IDs and attributes
- The solution is supposed to find the most likely sale for each return
- A return transaction must happen after the sales transaction and be as similar to the sales transaction as possible (best available match)
- A return transaction must be linked to exactly one sales transaction; a sales transaction can be linked to one or no return transaction - the link is supposed to be captured in a new column LinkedTransactionId

The solution could probably be achieved with groupByKey, but I'm failing to identify how to apply the logic across multiple rows. All examples I've managed to find are some variation of an in-line function (usually an aggregate - e.g. .map(t => (t._1, t._2.sum))) which doesn't require information about individual records from the same partition. Can anyone share an example of a similar solution or point me in the right direction?
Here is one possible solution - feedback, suggestions for a different approach, or examples of iterative Spark/Scala solutions are greatly appreciated:
- The example reads Sales and Credit transactions for each customer (CustomerId) and processes each customer as a separate partition (the outer mapPartitions loop)
- Each Credit is matched to the sale with the closest score (i.e. the smallest score difference, using the foreach inner loop inside each partition)
- The mutable map trnMap prevents double-assignment of each transaction and captures updates from the process
- Results are output through an iterator into the final dataset dfOut2

Note: in this particular case the same result could have been achieved using window functions without an iterative solution, but the purpose is to test the iterative logic itself.

import org.apache.spark.sql.SparkSession
import org.apache.spark._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.api.java.JavaRDD

case class Person(name: String, var age: Int)
case class SalesTransaction(
  CustomerId: Int,
  TransactionId: Int,
  Score: Int,
  Revenue: Double,
  Type: String,
  Credited: Double = 0.0,
  LinkedTransactionId: Int = 0,
  IsProcessed: Boolean = false
)
case class TransactionScore(TransactionId: Int, Score: Int)
case class TransactionPair(SalesId: Int, CreditId: Int, ScoreDiff: Int)

object ExampleDataFramePartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Example Combiner")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    import spark.implicits._

    val df = Seq(
      (1, 1, 123, "Sales", 100),
      (1, 2, 122, "Credit", 100),
      (1, 3, 99, "Sales", 70),
      (1, 4, 101, "Sales", 77),
      (1, 5, 102, "Credit", 75),
      (1, 6, 98, "Sales", 71),
      (2, 7, 200, "Sales", 55),
      (2, 8, 220, "Sales", 55),
      (2, 9, 200, "Credit", 50),
      (2, 10, 205, "Sales", 50)
    ).toDF("CustomerId", "TransactionId", "TransactionAttributesScore", "TransactionType", "Revenue")
      .withColumn("Revenue", $"Revenue".cast(DoubleType))
      .repartition(2, $"CustomerId")
    df.show()

    val dfOut2 = df.mapPartitions(p => {
      println(p)
      val trnMap = scala.collection.mutable.Map[Int, SalesTransaction]()
      val trnSales = scala.collection.mutable.ArrayBuffer.empty[TransactionScore]
      val trnCredits = scala.collection.mutable.ArrayBuffer.empty[TransactionScore]
      val trnPairs = scala.collection.mutable.ArrayBuffer.empty[TransactionPair]
      p.foreach(row => {
        val trnKey: Int = row.getAs[Int]("TransactionId")
        val trnValue: SalesTransaction = new SalesTransaction(row.getAs("CustomerId")
          , trnKey
          , row.getAs("TransactionAttributesScore")
          , row.getAs("Revenue")
          , row.getAs("TransactionType")
        )
        trnMap += (trnKey -> trnValue)
        if (trnValue.Type == "Sales") { trnSales += new TransactionScore(trnKey, trnValue.Score) }
        else { trnCredits += new TransactionScore(trnKey, trnValue.Score) }
      })
      if (trnCredits.size > 0 && trnSales.size > 0) {
        // define transaction pairs
        trnCredits.foreach(cr => {
          trnSales.foreach(sl => {
            trnPairs += new TransactionPair(cr.TransactionId, sl.TransactionId, math.abs(cr.Score - sl.Score))
          })
        })
      }
      trnPairs.sortBy(t => t.ScoreDiff)
        .foreach(t => {
          if (!trnMap(t.CreditId).IsProcessed && !trnMap(t.SalesId).IsProcessed) {
            trnMap(t.SalesId) = new SalesTransaction(trnMap(t.SalesId).CustomerId
              , trnMap(t.SalesId).TransactionId
              , trnMap(t.SalesId).Score
              , trnMap(t.SalesId).Revenue
              , trnMap(t.SalesId).Type
              , math.min(trnMap(t.CreditId).Revenue, trnMap(t.SalesId).Revenue)
              , t.CreditId
              , true
            )
            trnMap(t.CreditId) = new SalesTransaction(trnMap(t.CreditId).CustomerId
              , trnMap(t.CreditId).TransactionId
              , trnMap(t.CreditId).Score
              , trnMap(t.CreditId).Revenue
              , trnMap(t.CreditId).Type
              , math.min(trnMap(t.CreditId).Revenue, trnMap(t.SalesId).Revenue)
              , t.SalesId
              , true
            )
          }
        })
      trnMap.map(m => m._2).toIterator
    })
    dfOut2.show()
    spark.stop()
  }
}
I keep getting an error about the wrong number of arguments
I'm a student, and I'm trying to figure out what I have done wrong. Could anyone spot what I might have done?

The error:

super(Fighter, self).__init__(modelPath, parentNode, nodeName, 0, 0, 0, 3.0, )
TypeError: __init__() takes at most 3 arguments (8 given)

Code:

class Fighter(ShowBase, object):
    fighterCount = 0

    def __init__(self, modelPath, parentNode, nodeName, posVec, traverser, scaleVec = 1.0):
        super(Fighter, self).__init__(modelPath, parentNode, nodeName, 0, 0, 0, 3.0, )
        self.modelNode.setScale(scaleVec)
        self.modelNode.setPos(posVec)
        self.trav = traverser
        self.origin = render.attachNewNode("origin")
        self.origin.setPos(0, 0, 0)
        self.origin.hide()
        self.setKeyBindings()
        self.hud = Hud("./Tools/Hud.x", self.modelNode, "Hud", (0, 10, 0))
Spark GroupBy Aggregate functions
case class Step(Id: Long, stepNum: Long, stepId: Int, stepTime: java.sql.Timestamp)

I have a Dataset[Step] and I want to perform a groupBy operation on the "Id" column. My output should look like Dataset[(Long, List[Step])]. How do I do this?

Let's say the variable "inquiryStepMap" is of type Dataset[Step]; then we can do this with RDDs as follows:

val inquiryStepGrouped: RDD[(Long, Iterable[Step])] = inquiryStepMap.rdd.groupBy(x => x.Id)
It seems you need groupByKey.

Sample:

import java.sql.Timestamp

// note: this deprecated Timestamp constructor adds 1900 to the year, hence the 3917 dates in the output below
val t = new Timestamp(2017, 5, 1, 0, 0, 0, 0)
val ds = Seq(Step(1L, 21L, 1, t), Step(1L, 20L, 2, t), Step(2L, 10L, 3, t)).toDS()

groupByKey and then mapGroups:

ds.groupByKey(_.Id).mapGroups((Id, Vals) => (Id, Vals.toList))
// res18: org.apache.spark.sql.Dataset[(Long, List[Step])] = [_1: bigint, _2: array<struct<Id:bigint,stepNum:bigint,stepId:int,stepTime:timestamp>>]

And the result looks like:

ds.groupByKey(_.Id).mapGroups((Id, Vals) => (Id, Vals.toList)).show()

+---+--------------------+
| _1|                  _2|
+---+--------------------+
|  1|[[1,21,1,3917-06-...|
|  2|[[2,10,3,3917-06-...|
+---+--------------------+
how to convert mix of text and numerical data to feature data in apache spark
I have a CSV with both textual and numerical data, and I need to convert it into feature vector data in Spark (Double values). Is there any way to do that? I have seen examples where each keyword is mapped to some double value, which is then used for the conversion; however, if there are many keywords it is difficult to do it this way. Is there any other way out? I see Spark provides extractors that convert data into feature vectors. Could someone please give an example?

48, Private, 105808, 9th, 5, Widowed, Transport-moving, Unmarried, White, Male, 0, 0, 40, United-States, >50K
42, Private, 169995, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K
Finally I did it this way. I iterate through the distinct values of each column and build a map keyed by those values, incrementing a Double counter:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

def createMap(data: RDD[String]): Map[String, Double] = {
  var mapData: Map[String, Double] = Map()
  var counter = 0.0
  data.collect().foreach { item =>
    counter = counter + 1
    mapData += (item -> counter)
  }
  mapData
}

def getLabelValue(input: String): Int = input match {
  case "<=50K" => 0
  case ">50K" => 1
}

val census = sc.textFile("/user/cloudera/census_data.txt")

val orgTypeRdd = census.map(line => line.split(", ")(1)).distinct
val gradeTypeRdd = census.map(line => line.split(", ")(3)).distinct
val marStatusRdd = census.map(line => line.split(", ")(5)).distinct
val jobTypeRdd = census.map(line => line.split(", ")(6)).distinct
val familyStatusRdd = census.map(line => line.split(", ")(7)).distinct
val raceTypeRdd = census.map(line => line.split(", ")(8)).distinct
val genderTypeRdd = census.map(line => line.split(", ")(9)).distinct
val countryRdd = census.map(line => line.split(", ")(13)).distinct
val salaryRange = census.map(line => line.split(", ")(14)).distinct

val orgTypeMap = createMap(orgTypeRdd)
val gradeTypeMap = createMap(gradeTypeRdd)
val marStatusMap = createMap(marStatusRdd)
val jobTypeMap = createMap(jobTypeRdd)
val familyStatusMap = createMap(familyStatusRdd)
val raceTypeMap = createMap(raceTypeRdd)
val genderTypeMap = createMap(genderTypeRdd)
val countryMap = createMap(countryRdd)
val salaryRangeMap = createMap(salaryRange)

val featureVector = census.map { line =>
  val fields = line.split(", ")
  LabeledPoint(getLabelValue(fields(14)),
    Vectors.dense(fields(0).toDouble, orgTypeMap(fields(1)), fields(2).toDouble,
      gradeTypeMap(fields(3)), fields(4).toDouble, marStatusMap(fields(5)),
      jobTypeMap(fields(6)), familyStatusMap(fields(7)), raceTypeMap(fields(8)),
      genderTypeMap(fields(9)), fields(10).toDouble, fields(11).toDouble,
      fields(12).toDouble, countryMap(fields(13)), salaryRangeMap(fields(14))))
}
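Since the question also asks about Spark's built-in feature transformers: with the DataFrame-based ML API, StringIndexer and VectorAssembler can do the same categorical-to-double mapping without hand-built lookup maps. A rough sketch, assuming the CSV has already been loaded into a DataFrame df with named columns (the column names "workclass" and "age" here are illustrative):

import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// index one textual column into a numeric column
val workclassIndexer = new StringIndexer()
  .setInputCol("workclass")
  .setOutputCol("workclassIndex")
val indexed = workclassIndexer.fit(df).transform(df)

// combine numeric columns and the indexed categorical column into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "workclassIndex"))
  .setOutputCol("features")
val features = assembler.transform(indexed)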
how to filter () a pairRDD according to two conditions
How can I filter my pair RDD if I have two conditions to filter on, one testing the key and the other testing the value? (A code snippet would be appreciated.) I used this, and sadly it didn't work:

JavaPairRDD filtering = pairRDD1.filter((x, y) -> (x._1.equals(y._1)) && (x._2.equals(y._2)));
You can't use a regular filter for this, because filter checks one item at a time. You have to compare multiple items to each other and decide which ones to keep. Here's an example which only keeps items that are repeated:

val items = List(1, 2, 5, 6, 6, 7, 8, 10, 12, 13, 15, 16, 16, 19, 20)
val rdd = sc.parallelize(items)

// count how many times each item occurs
val mapped = rdd.map { case (x) => (x, 1) }
val reduced = mapped.reduceByKey { case (x, y) => x + y }
val filtered = reduced.filter { case (item, count) => count > 1 }

// Now print out the results:
filtered.collect().foreach { case (item, count) =>
  println(s"Keeping $item because it occurred $count times.")
}

It's probably not the most performant code for this, but it should give you an idea of the approach.
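If, on the other hand, the goal is simply to keep pairs whose key satisfies one condition and whose value satisfies another, a single filter over each (key, value) tuple is enough. A minimal Scala sketch; the concrete conditions are just placeholders:

// keep only pairs whose key and value each pass their own (illustrative) test
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val kept = pairs.filter { case (key, value) => key == "a" && value > 1 }
// kept.collect() => Array(("a", 3))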