first of all, sorry if this is a dump question, I'm kinda new with Spark.
I am trying to do some group operations in Spark and I'm trying to avoid extra shuffle when modifying the key of my RDD.
Original RDDs are json Strings
Simplifying the logic my code looks like this:
case class Key1 (a: String, b: String)
val grouped1: RDD[(Key1, String)] = rdd1.keyBy(generateKey1(_))
val grouped2: RDD[(Key1, String)] = rdd2.keyBy(generateKey2(_))
val joined: RDD[(Key1, (String, String)) = groped1.join(grouped2)
Now I want to include a new field in the key and do some reduce operations. So I have something like:
case class key2 (a: String, b: String, c: String)
val withNewKey: RDD[Key2, (String, String)] = joined.map{ case (key, (val1, val2)) => {
val newKey = Key2(key.a, key.b, extractWhatever(val2))
(newKey, (val1, val2))
}}
withNewKey.reduceByKey.....
If I'm not wrong, as the Key has changed the partition is lost, so the reduce operation will probably shuffle the data, but it doesn't make sense, as the key was extended and no shuffle would be needed.
Am I missing something? How can I avoid that shuffle?
Thanks
You can use mapPartitions with preservesPartitioning set to true:
joined.mapPartitions(
_.map{ case (key, (val1, val2)) => ... },
true
)
Related
DataSet#foreach(f) applies the function f to each row in the dataset. In a clustered environment, the data is split across the cluster. How can the results from each of these functions be collected?
For example, say the function would count the number of characters stored in each row. How can you create a DataSet or RDD that contains the results of each of these functions applied to each row?
The definition for foreach looks something like :
final def foreach(f: (A) ⇒ Unit): Unit
f : The function that is applied for its side-effect to every element.
The result of function f is discarded
foreach in Scala is generally used to denote the usage of a function that involves a side-effect, e.g. printing to STDOUT.
If you want to return something by applying a particular function, you'll have to use map
final def map[B](f: (A) ⇒ B): List[B]
I copied the syntax from the documentation for List but it'll be something similar for RDDs as well.
As you can see, it works the function f on datatype A and returns a collection of datatype B where A and B can be the same data type as well.
val rdd = sc.parallelize(Array(
"String1",
"String2",
"String3" ))
scala> rdd.foreach(x => (x, x.length) )
// Nothing happens
rdd.map(x => (x, x.length) ).collect
// Array[(String, Int)] = Array((String1,7), (String2,7), (String3,7))
I have an input file which is like this
A,1
B,2
C,3
val data = sc.textFile("myfile.txt")
How can i make this RDD to be in this format
data: RDD[(String, Int)]
I tried this but didnt work
case class foo (a: String, b: Int)
val data = sc.textFile("myfile.txt").map(
c => foo(c(0).toString, c(1).toInt))
If you want an rdd of type RDD[(String,Int)] you should map your input to Tuple2[String, Int] instead of foo. Like this
val data = sc.textFile("myfile.txt")
.map(line => line.split(","))
.map(s => (s(0), s(1).toInt))
(I added a map for splitting your data by ",", which I think you probably forgot to add to your example).
I think the most readable form would be:
sc.textFile("myfile.txt")
.map { line =>
val Array(first, second) = line.split(",")
Foo(first, second.toInt)
}
This doesn't handle errors though, both Array(...) and toInt can fail.
I'm looking for a way to split an RDD into two or more RDDs. The closest I've seen is Scala Spark: Split collection into several RDD? which is still a single RDD.
If you're familiar with SAS, something like this:
data work.split1, work.split2;
set work.preSplit;
if (condition1)
output work.split1
else if (condition2)
output work.split2
run;
which resulted in two distinct data sets. It would have to be immediately persisted to get the results I intend...
It is not possible to yield multiple RDDs from a single transformation*. If you want to split a RDD you have to apply a filter for each split condition. For example:
def even(x): return x % 2 == 0
def odd(x): return not even(x)
rdd = sc.parallelize(range(20))
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If you have only a binary condition and computation is expensive you may prefer something like this:
kv_rdd = rdd.map(lambda x: (x, odd(x)))
kv_rdd.cache()
rdd_odd = kv_rdd.filter(lambda kv: kv[1]).keys()
rdd_even = kv_rdd.filter(lambda kv: not kv[1]).keys()
It means only a single predicate computation but requires additional pass over all data.
It is important to note that as long as an input RDD is properly cached and there no additional assumptions regarding data distribution there is no significant difference when it comes to time complexity between repeated filter and for-loop with nested if-else.
With N elements and M conditions number of operations you have to perform is clearly proportional to N times M. In case of for-loop it should be closer to (N + MN) / 2 and repeated filter is exactly NM but at the end of the day it is nothing else than O(NM). You can see my discussion** with Jason Lenderman to read about some pros-and-cons.
At the very high level you should consider two things:
Spark transformations are lazy, until you execute an action your RDD is not materialized
Why does it matter? Going back to my example:
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If later I decide that I need only rdd_odd then there is no reason to materialize rdd_even.
If you take a look at your SAS example to compute work.split2 you need to materialize both input data and work.split1.
RDDs provide a declarative API. When you use filter or map it is completely up to Spark engine how this operation is performed. As long as the functions passed to transformations are side effects free it creates multiple possibilities to optimize a whole pipeline.
At the end of the day this case is not special enough to justify its own transformation.
This map with filter pattern is actually used in a core Spark. See my answer to How does Sparks RDD.randomSplit actually split the RDD and a relevant part of the randomSplit method.
If the only goal is to achieve a split on input it is possible to use partitionBy clause for DataFrameWriter which text output format:
def makePairs(row: T): (String, String) = ???
data
.map(makePairs).toDF("key", "value")
.write.partitionBy($"key").format("text").save(...)
* There are only 3 basic types of transformations in Spark:
RDD[T] => RDD[T]
RDD[T] => RDD[U]
(RDD[T], RDD[U]) => RDD[W]
where T, U, W can be either atomic types or products / tuples (K, V). Any other operation has to be expressed using some combination of the above. You can check the original RDD paper for more details.
** https://chat.stackoverflow.com/rooms/91928/discussion-between-zero323-and-jason-lenderman
*** See also Scala Spark: Split collection into several RDD?
As other posters mentioned above, there is no single, native RDD transform that splits RDDs, but here are some "multiplex" operations that can efficiently emulate a wide variety of "splitting" on RDDs, without reading multiple times:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.multiplex.MuxRDDFunctions
Some methods specific to random splitting:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions
Methods are available from open source silex project:
https://github.com/willb/silex
A blog post explaining how they work:
http://erikerlandson.github.io/blog/2016/02/08/efficient-multiplexing-for-spark-rdds/
def muxPartitions[U :ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[U],
persist: StorageLevel): Seq[RDD[U]] = {
val mux = self.mapPartitionsWithIndex { case (id, itr) =>
Iterator.single(f(id, itr))
}.persist(persist)
Vector.tabulate(n) { j => mux.mapPartitions { itr => Iterator.single(itr.next()(j)) } }
}
def flatMuxPartitions[U :ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[TraversableOnce[U]],
persist: StorageLevel): Seq[RDD[U]] = {
val mux = self.mapPartitionsWithIndex { case (id, itr) =>
Iterator.single(f(id, itr))
}.persist(persist)
Vector.tabulate(n) { j => mux.mapPartitions { itr => itr.next()(j).toIterator } }
}
As mentioned elsewhere, these methods do involve a trade-off of memory for speed, because they operate by computing entire partition results "eagerly" instead of "lazily." Therefore, it is possible for these methods to run into memory problems on large partitions, where more traditional lazy transforms will not.
One way is to use a custom partitioner to partition the data depending upon your filter condition. This can be achieved by extending Partitioner and implementing something similar to the RangePartitioner.
A map partitions can then be used to construct multiple RDDs from the partitioned RDD without reading all the data.
val filtered = partitioned.mapPartitions { iter => {
new Iterator[Int](){
override def hasNext: Boolean = {
if(rangeOfPartitionsToKeep.contains(TaskContext.get().partitionId)) {
false
} else {
iter.hasNext
}
}
override def next():Int = iter.next()
}
Just be aware that the number of partitions in the filtered RDDs will be the same as the number in the partitioned RDD so a coalesce should be used to reduce this down and remove the empty partitions.
If you split an RDD using the randomSplit API call, you get back an array of RDDs.
If you want 5 RDDs returned, pass in 5 weight values.
e.g.
val sourceRDD = val sourceRDD = sc.parallelize(1 to 100, 4)
val seedValue = 5
val splitRDD = sourceRDD.randomSplit(Array(1.0,1.0,1.0,1.0,1.0), seedValue)
splitRDD(1).collect()
res7: Array[Int] = Array(1, 6, 11, 12, 20, 29, 40, 62, 64, 75, 77, 83, 94, 96, 100)
I am trying to convert a list of strings to a vector of char vectors:
import collection.breakOut
def stringsToCharVectors(xs: List[String]) =
xs.map(stringToCharVector)(breakOut) : Vector[Vector[Char]]
def stringToCharVector(x: String) =
x.map(a => a)(breakOut) : Vector[Char]
Is there a way to implement stringToCharVector that does not involve mapping with the identity function? Generally, are there shorter/better ways to implement stringsToCharVectors?
You can pass a String directly to the varargs constructor for Vector:
def stringToCharVector(x: String) = Vector(x: _*)
at which point having a separate method seems kind of silly. breakOut is for optimization; if you just want to convert, you can
Vector(xs.map(x => Vector(x: _*)): _*)
at the relatively modest expense of one extra object per list element. (All the chars will most likely be the memory-intensive part.)
In Scala 2.10:
scala> val xs = List("hello")
xs: List[String] = List(hello)
scala> xs.map(_.to[Vector]).to[Vector]
res0: Vector[Vector[Char]] = Vector(Vector(h, e, l, l, o))
The other way is just to add all the elements to an empty Vector; this is what happens behind the scenes anyway when you call a conversion method:
def stringsToCharVectors(xs: List[String]) =
Vector() ++ xs.map(Vector() ++ _)
With too many arguments, String.format easily gets too confusing. Is there a more powerful way to format a String. Like so:
"This is #{number} string".format("number" -> 1)
Or is this not possible because of type issues (format would need to take a Map[String, Any], I assume; don’t know if this would make things worse).
Or is the better way doing it like this:
val number = 1
<plain>This is { number } string</plain> text
even though it pollutes the name space?
Edit:
While a simple pimping might do in many cases, I’m also looking for something going in the same direction as Python’s format() (See: http://docs.python.org/release/3.1.2/library/string.html#formatstrings)
In Scala 2.10 you can use string interpolation.
val height = 1.9d
val name = "James"
println(f"$name%s is $height%2.2f meters tall") // James is 1.90 meters tall
Well, if your only problem is making the order of the parameters more flexible, this can be easily done:
scala> "%d %d" format (1, 2)
res0: String = 1 2
scala> "%2$d %1$d" format (1, 2)
res1: String = 2 1
And there's also regex replacement with the help of a map:
scala> val map = Map("number" -> 1)
map: scala.collection.immutable.Map[java.lang.String,Int] = Map((number,1))
scala> val getGroup = (_: scala.util.matching.Regex.Match) group 1
getGroup: (util.matching.Regex.Match) => String = <function1>
scala> val pf = getGroup andThen map.lift andThen (_ map (_.toString))
pf: (util.matching.Regex.Match) => Option[java.lang.String] = <function1>
scala> val pat = "#\\{([^}]*)\\}".r
pat: scala.util.matching.Regex = #\{([^}]*)\}
scala> pat replaceSomeIn ("This is #{number} string", pf)
res43: String = This is 1 string
You can easily implement a richer formatting yourself (with the "enhance my library" approach):
scala> implicit def RichFormatter(string: String) = new {
| def richFormat(replacement: Map[String, Any]) =
| (string /: replacement) {(res, entry) => res.replaceAll("#\\{%s\\}".format(entry._1), entry._2.toString)}
| }
RichFormatter: (string: String)java.lang.Object{def richFormat(replacement: Map[String,Any]): String}
scala> "This is #{number} string" richFormat Map("number" -> 1)
res43: String = This is 1 string
Or on more recent Scala versions since the original answer:
implicit class RichFormatter(string: String) {
def richFormat(replacement: Map[String, Any]): String =
replacement.foldLeft(string) { (res, entry) =>
res.replaceAll("#\\{%s\\}".format(entry._1), entry._2.toString)
}
}
Maybe the Scala-Enhanced-Strings-Plugin can help you. Look here:
Scala-Enhanced-Strings-Plugin Documentation
This the answer I came here looking for:
"This is %s string".format(1)
If you're using 2.10 then go with built-in interpolation. Otherwise, if you don't care about extreme performance and are not afraid of functional one-liners, you can use a fold + several regexp scans:
val template = "Hello #{name}!"
val replacements = Map( "name" -> "Aldo" )
replacements.foldLeft(template)((s:String, x:(String,String)) => ( "#\\{" + x._1 + "\\}" ).r.replaceAllIn( s, x._2 ))
You might also consider the use of a template engine for really complex and long strings. On top of my head I have Scalate which implements amongst others the Mustache template engine.
Might be overkill and performance loss for simple strings, but you seem to be in that area where they start becoming real templates.