Why do all data end up in one partition after reduceByKey? - apache-spark
I have a Spark application that contains the following segment:
val repartitioned = rdd.repartition(16)
val filtered: RDD[(MyKey, myData)] = MyUtils.filter(repartitioned, startDate, endDate)
val mapped: RDD[(DateTime, myData)] = filtered.map(kv => (kv._1.processingTime, kv._2))
val reduced: RDD[(DateTime, myData)] = mapped.reduceByKey(_+_)
When I run this with some logging, this is what I see:
repartitioned ======> [List(2536, 2529, 2526, 2520, 2519, 2514, 2512, 2508, 2504, 2501, 2496, 2490, 2551, 2547, 2543, 2537)]
filtered ======> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
mapped ======> [List(2081, 2063, 2043, 2040, 2063, 2050, 2081, 2076, 2042, 2066, 2032, 2001, 2031, 2101, 2050, 2068)]
reduced ======> [List(0, 0, 0, 0, 0, 0, 922, 0, 0, 0, 0, 0, 0, 0, 0, 0)]
My logging is done using these two lines:
val sizes: RDD[Int] = rdd.mapPartitions(iter => Array(iter.size).iterator, true)
log.info(s"rdd ======> [${sizes.collect().toList}]")
My question is: why does my data end up in a single partition after the reduceByKey? After the filter and map the data is still evenly distributed across partitions, but the reduceByKey puts everything into one partition.
I am guessing all your processing times are the same. Alternatively, their hashCode values (from the DateTime class) are the same. Is that a custom class?
I will answer my own question, since I figured it out. My DateTimes were all without seconds and milliseconds, since I wanted to group data belonging to the same minute. The hashCode() values of Joda DateTimes that are one minute apart differ by a constant:
scala> val now = DateTime.now
now: org.joda.time.DateTime = 2015-11-23T11:14:17.088Z
scala> now.withSecondOfMinute(0).withMillisOfSecond(0).hashCode - now.minusMinutes(1).withSecondOfMinute(0).withMillisOfSecond(0).hashCode
res42: Int = 60000
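Spark's default HashPartitioner assigns a key to a partition by taking the key's hashCode modulo the number of partitions. Since 60000 is divisible by 16, every minute-truncated DateTime in my 16-partition RDD maps to the same partition. A minimal sketch of that arithmetic (sparkLikePartition is only an illustration of the hashCode-modulo logic, not Spark's actual class):

import org.joda.time.DateTime

// Illustration only: mimics the key.hashCode-modulo-numPartitions assignment
def sparkLikePartition(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

val base = DateTime.now.withSecondOfMinute(0).withMillisOfSecond(0)
(0 until 5).map(i => sparkLikePartition(base.plusMinutes(i), 16))
// all five keys land in the same partition, because 60000 % 16 == 0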
As can be seen from this example, keys whose hashCode values are spaced by a multiple of the partition count all end up in the same partition:
scala> val nums = for(i <- 0 to 1000000) yield ((i*20 % 1000), i)
nums: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((0,0), (20,1), (40,2), (60,3), (80,4), (100,5), (120,6), (140,7), (160,8), (180,9), (200,10), (220,11), (240,12), (260,13), (280,14), (300,15), (320,16), (340,17), (360,18), (380,19), (400,20), (420,21), (440,22), (460,23), (480,24), (500,25), (520,26), (540,27), (560,28), (580,29), (600,30), (620,31), (640,32), (660,33), (680,34), (700,35), (720,36), (740,37), (760,38), (780,39), (800,40), (820,41), (840,42), (860,43), (880,44), (900,45), (920,46), (940,47), (960,48), (980,49), (0,50), (20,51), (40,52), (60,53), (80,54), (100,55), (120,56), (140,57), (160,58), (180,59), (200,60), (220,61), (240,62), (260,63), (280,64), (300,65), (320,66), (340,67), (360,68), (380,69), (400,70), (420,71), (440,72), (460,73), (480,74), (500...
scala> val rddNum = sc.parallelize(nums)
rddNum: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:23
scala> val reducedNum = rddNum.reduceByKey(_+_)
reducedNum: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[1] at reduceByKey at <console>:25
scala> reducedNum.mapPartitions(iter => Array(iter.size).iterator, true).collect.toList
res2: List[Int] = List(50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
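The same modulo arithmetic explains this result: assuming the shuffle above ran with the 20 partitions visible in the output, every key is a multiple of 20, so key.hashCode % 20 is 0 for all of them and everything lands in partition 0. A one-line check:

(0 until 1000 by 20).map(_ % 20).distinct   // Vector(0) -- every key maps to partition 0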
To distribute my data more evenly across the partitions, I created my own custom Partitioner:
import org.apache.spark.Partitioner
import org.joda.time.DateTime

class JodaPartitioner(rddNumPartitions: Int) extends Partitioner {
  def numPartitions: Int = rddNumPartitions
  def getPartition(key: Any): Int = {
    key match {
      case dateTime: DateTime =>
        val sum = dateTime.getYear + dateTime.getMonthOfYear + dateTime.getDayOfMonth + dateTime.getMinuteOfDay + dateTime.getSecondOfDay
        sum % numPartitions
      case _ => 0
    }
  }
}
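For completeness, a sketch of how this might plug into the original pipeline; reduceByKey has an overload that accepts a Partitioner explicitly (mapped and myData are the names from the question):

val reduced: RDD[(DateTime, myData)] = mapped.reduceByKey(new JodaPartitioner(16), _ + _)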
Related
Iterative RDD/Dataframe processing in Spark
My ADLA solution is being transitioned to Spark. I'm trying to find the right replacement for the U-SQL REDUCE expression to enable:
- Read a logical partition and store information in a list/dictionary/vector or other in-memory data structure
- Apply logic that requires multiple iterations
- Output results as additional columns together with the original data (the original rows might be partially eliminated or duplicated)

Example of a possible task:
- The input dataset has sales and return transactions with their IDs and attributes
- The solution is supposed to find the most likely sale for each return
- A return transaction must happen after the sales transaction and be as similar to the sales transaction as possible (best available match)
- A return transaction must be linked to exactly one sales transaction; a sales transaction can be linked to one or no return transaction - the link is supposed to be captured in a new column LinkedTransactionId

The solution could probably be achieved with groupByKey, but I'm failing to identify how to apply the logic across multiple rows. All examples I've managed to find are some variation of an in-line function (usually an aggregate - e.g. .map(t => (t._1, t._2.sum))) which doesn't require information about individual records from the same partition. Can anyone share an example of a similar solution or point me in the right direction?
Here is one possible solution - feedback, suggestions for a different approach, or examples of iterative Spark/Scala solutions are greatly appreciated:
- The example reads Sales and Credit transactions for each customer (CustomerId) and processes each customer as a separate partition (the outer mapPartitions loop)
- Each Credit is matched to the sale with the closest score (i.e. the smallest score difference, using the foreach inner loop inside each partition)
- The mutable map trnMap prevents double-assignment of each transaction and captures updates from the process
- Results are output through an iterator into the final dataset dfOut2

Note: in this particular case the same result could have been achieved using window functions without an iterative solution, but the purpose is to test the iterative logic itself.

import org.apache.spark.sql.SparkSession
import org.apache.spark._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.api.java.JavaRDD

case class Person(name: String, var age: Int)
case class SalesTransaction(
  CustomerId: Int,
  TransactionId: Int,
  Score: Int,
  Revenue: Double,
  Type: String,
  Credited: Double = 0.0,
  LinkedTransactionId: Int = 0,
  IsProcessed: Boolean = false
)
case class TransactionScore(TransactionId: Int, Score: Int)
case class TransactionPair(SalesId: Int, CreditId: Int, ScoreDiff: Int)

object ExampleDataFramePartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Example Combiner")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    import spark.implicits._

    val df = Seq(
      (1, 1, 123, "Sales", 100),
      (1, 2, 122, "Credit", 100),
      (1, 3, 99, "Sales", 70),
      (1, 4, 101, "Sales", 77),
      (1, 5, 102, "Credit", 75),
      (1, 6, 98, "Sales", 71),
      (2, 7, 200, "Sales", 55),
      (2, 8, 220, "Sales", 55),
      (2, 9, 200, "Credit", 50),
      (2, 10, 205, "Sales", 50)
    ).toDF("CustomerId", "TransactionId", "TransactionAttributesScore", "TransactionType", "Revenue")
      .withColumn("Revenue", $"Revenue".cast(DoubleType))
      .repartition(2, $"CustomerId")
    df.show()

    val dfOut2 = df.mapPartitions(p => {
      println(p)
      val trnMap = scala.collection.mutable.Map[Int, SalesTransaction]()
      val trnSales = scala.collection.mutable.ArrayBuffer.empty[TransactionScore]
      val trnCredits = scala.collection.mutable.ArrayBuffer.empty[TransactionScore]
      val trnPairs = scala.collection.mutable.ArrayBuffer.empty[TransactionPair]
      p.foreach(row => {
        val trnKey: Int = row.getAs[Int]("TransactionId")
        val trnValue: SalesTransaction = new SalesTransaction(row.getAs("CustomerId")
          , trnKey
          , row.getAs("TransactionAttributesScore")
          , row.getAs("Revenue")
          , row.getAs("TransactionType")
        )
        trnMap += (trnKey -> trnValue)
        if (trnValue.Type == "Sales") { trnSales += new TransactionScore(trnKey, trnValue.Score) }
        else { trnCredits += new TransactionScore(trnKey, trnValue.Score) }
      })
      if (trnCredits.size > 0 && trnSales.size > 0) {
        // define transaction pairs
        trnCredits.foreach(cr => {
          trnSales.foreach(sl => {
            trnPairs += new TransactionPair(cr.TransactionId, sl.TransactionId, math.abs(cr.Score - sl.Score))
          })
        })
      }
      trnPairs.sortBy(t => t.ScoreDiff)
        .foreach(t => {
          if (!trnMap(t.CreditId).IsProcessed && !trnMap(t.SalesId).IsProcessed) {
            trnMap(t.SalesId) = new SalesTransaction(trnMap(t.SalesId).CustomerId
              , trnMap(t.SalesId).TransactionId
              , trnMap(t.SalesId).Score
              , trnMap(t.SalesId).Revenue
              , trnMap(t.SalesId).Type
              , math.min(trnMap(t.CreditId).Revenue, trnMap(t.SalesId).Revenue)
              , t.CreditId
              , true
            )
            trnMap(t.CreditId) = new SalesTransaction(trnMap(t.CreditId).CustomerId
              , trnMap(t.CreditId).TransactionId
              , trnMap(t.CreditId).Score
              , trnMap(t.CreditId).Revenue
              , trnMap(t.CreditId).Type
              , math.min(trnMap(t.CreditId).Revenue, trnMap(t.SalesId).Revenue)
              , t.SalesId
              , true
            )
          }
        })
      trnMap.map(m => m._2).toIterator
    })
    dfOut2.show()
    spark.stop()
  }
}
I keep getting an error about the wrong number of arguments
I'm a student, and I'm trying to figure out what I have done wrong. Could anyone spot what I might have done?

The error:

super(Fighter, self).__init__(modelPath, parentNode, nodeName, 0, 0, 0, 3.0, )
TypeError: __init__() takes at most 3 arguments (8 given)

Code:

class Fighter(ShowBase, object):
    fighterCount = 0

    def __init__(self, modelPath, parentNode, nodeName, posVec, traverser, scaleVec = 1.0):
        super(Fighter, self).__init__(modelPath, parentNode, nodeName, 0, 0, 0, 3.0, )
        self.modelNode.setScale(scaleVec)
        self.modelNode.setPos(posVec)
        self.trav = traverser
        self.origin = render.attachNewNode("origin")
        self.origin.setPos(0, 0, 0)
        self.origin.hide()
        self.setKeyBindings()
        self.hud = Hud("./Tools/Hud.x", self.modelNode, "Hud", (0, 10, 0))
Spark GroupBy Aggregate functions
case class Step(Id: Long, stepNum: Long, stepId: Int, stepTime: java.sql.Timestamp)

I have a Dataset[Step] and I want to perform a groupBy operation on the "Id" column. My output should look like Dataset[(Long, List[Step])]. How do I do this?

Let's say the variable "inquiryStepMap" is of type Dataset[Step]; then we can do this with RDDs as follows:

val inquiryStepGrouped: RDD[(Long, Iterable[Step])] = inquiryStepMap.rdd.groupBy(x => x.Id)
It seems you need groupByKey.

Sample:

import java.sql.Timestamp

// note: this deprecated Timestamp constructor adds 1900 to the year, hence the 3917 dates in the output below
val t = new Timestamp(2017, 5, 1, 0, 0, 0, 0)
val ds = Seq(Step(1L, 21L, 1, t), Step(1L, 20L, 2, t), Step(2L, 10L, 3, t)).toDS()

groupByKey and then mapGroups:

ds.groupByKey(_.Id).mapGroups((Id, Vals) => (Id, Vals.toList))
// res18: org.apache.spark.sql.Dataset[(Long, List[Step])] = [_1: bigint, _2: array<struct<Id:bigint,stepNum:bigint,stepId:int,stepTime:timestamp>>]

And the result looks like:

ds.groupByKey(_.Id).mapGroups((Id, Vals) => (Id, Vals.toList)).show()

+---+--------------------+
| _1|                  _2|
+---+--------------------+
|  1|[[1,21,1,3917-06-...|
|  2|[[2,10,3,3917-06-...|
+---+--------------------+
how to convert mix of text and numerical data to feature data in apache spark
I have a CSV with both textual and numerical data, and I need to convert it into feature vector data in Spark (Double values). Is there any way to do that? I have seen examples where each keyword is mapped to some double value, which is then used for the conversion; however, if there are many keywords it is difficult to do it this way. Is there any other way out? I see Spark provides extractors that convert data into feature vectors. Could someone please give an example?

48, Private, 105808, 9th, 5, Widowed, Transport-moving, Unmarried, White, Male, 0, 0, 40, United-States, >50K
42, Private, 169995, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K
Finally I did it this way. I iterate through the distinct values of each column and build a map keyed by those values, incrementing a Double counter:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

def createMap(data: RDD[String]): Map[String, Double] = {
  var mapData: Map[String, Double] = Map()
  var counter = 0.0
  data.collect().foreach { item =>
    counter = counter + 1
    mapData += (item -> counter)
  }
  mapData
}

def getLabelValue(input: String): Int = input match {
  case "<=50K" => 0
  case ">50K" => 1
}

val census = sc.textFile("/user/cloudera/census_data.txt")

val orgTypeRdd = census.map(line => line.split(", ")(1)).distinct
val gradeTypeRdd = census.map(line => line.split(", ")(3)).distinct
val marStatusRdd = census.map(line => line.split(", ")(5)).distinct
val jobTypeRdd = census.map(line => line.split(", ")(6)).distinct
val familyStatusRdd = census.map(line => line.split(", ")(7)).distinct
val raceTypeRdd = census.map(line => line.split(", ")(8)).distinct
val genderTypeRdd = census.map(line => line.split(", ")(9)).distinct
val countryRdd = census.map(line => line.split(", ")(13)).distinct
val salaryRange = census.map(line => line.split(", ")(14)).distinct

val orgTypeMap = createMap(orgTypeRdd)
val gradeTypeMap = createMap(gradeTypeRdd)
val marStatusMap = createMap(marStatusRdd)
val jobTypeMap = createMap(jobTypeRdd)
val familyStatusMap = createMap(familyStatusRdd)
val raceTypeMap = createMap(raceTypeRdd)
val genderTypeMap = createMap(genderTypeRdd)
val countryMap = createMap(countryRdd)
val salaryRangeMap = createMap(salaryRange)

val featureVector = census.map { line =>
  val fields = line.split(", ")
  LabeledPoint(getLabelValue(fields(14)),
    Vectors.dense(fields(0).toDouble, orgTypeMap(fields(1)), fields(2).toDouble,
      gradeTypeMap(fields(3)), fields(4).toDouble, marStatusMap(fields(5)),
      jobTypeMap(fields(6)), familyStatusMap(fields(7)), raceTypeMap(fields(8)),
      genderTypeMap(fields(9)), fields(10).toDouble, fields(11).toDouble,
      fields(12).toDouble, countryMap(fields(13)), salaryRangeMap(fields(14))))
}
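Since the question also asks about Spark's built-in feature transformers: with the DataFrame-based ML API, StringIndexer and VectorAssembler can do the same categorical-to-double mapping without hand-built lookup maps. A rough sketch, assuming the CSV has already been loaded into a DataFrame df with named columns (the column names "workclass" and "age" here are illustrative):

import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// index one textual column into a numeric column
val workclassIndexer = new StringIndexer()
  .setInputCol("workclass")
  .setOutputCol("workclassIndex")
val indexed = workclassIndexer.fit(df).transform(df)

// combine numeric columns and the indexed categorical column into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "workclassIndex"))
  .setOutputCol("features")
val features = assembler.transform(indexed)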
how to filter () a pairRDD according to two conditions
How can I filter my pair RDD if I have two conditions to filter on, one testing the key and the other testing the value? (A code snippet would be appreciated.) I used this, and sadly it didn't work:

JavaPairRDD filtering = pairRDD1.filter((x, y) -> (x._1.equals(y._1)) && (x._2.equals(y._2)));
You can't use a regular filter for this, because filter checks one item at a time. You have to compare multiple items to each other and decide which ones to keep. Here's an example which only keeps items that are repeated:

val items = List(1, 2, 5, 6, 6, 7, 8, 10, 12, 13, 15, 16, 16, 19, 20)
val rdd = sc.parallelize(items)

// count how many times each item occurs
val mapped = rdd.map { case (x) => (x, 1) }
val reduced = mapped.reduceByKey { case (x, y) => x + y }
val filtered = reduced.filter { case (item, count) => count > 1 }

// Now print out the results:
filtered.collect().foreach { case (item, count) =>
  println(s"Keeping $item because it occurred $count times.")
}

It's probably not the most performant code for this, but it should give you an idea of the approach.
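If, on the other hand, the goal is simply to keep pairs whose key satisfies one condition and whose value satisfies another, a single filter over each (key, value) tuple is enough. A minimal Scala sketch; the concrete conditions are just placeholders:

// keep only pairs whose key and value each pass their own (illustrative) test
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val kept = pairs.filter { case (key, value) => key == "a" && value > 1 }
// kept.collect() => Array(("a", 3))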