I am working with Spark. My objective is to read lines from a file and sort them based on a hash. I understand that the file is read as an RDD of lines. Is there a way to iterate over this RDD so that I can read it line by line? In other words, I want to be able to convert it to an Iterator.
Or am I limited to applying some transformation function to it, in keeping with Spark's lazy execution model?
So far I have tried the following transformation code:
SparkConf sparkConf = new SparkConf().setAppName("Sorting1");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
JavaRDD<String> lines = ctx.textFile("hdfs://localhost:9000/hash-example-output/part-r-00000", 1);
lines = lines.filter(new Function<String, Boolean>()
{
@Override
public Boolean call(String s) {
String str[] = COMMA.split(s);
unsortedArray1[i] = Long.parseLong(str[str.length-1]);
i++;
return s.contains("error");
}
});
lines.count();
ctx.stop();
sort(unsortedArray1);
If you want to sort the strings in an RDD, you could use the takeOrdered function:
java.util.List<T> takeOrdered(int num, java.util.Comparator<T> comp)
Returns the first K elements from this RDD as defined by the specified Comparator[T] and maintains the order.
Parameters: num - the number of top elements to return; comp - the comparator that defines the order
Returns: an array of top elements
or
java.util.List<T> takeOrdered(int num)
Returns the first K elements from this RDD using the natural ordering for T while maintaining the order.
Parameters: num - the number of top elements to return
Returns: an array of top elements
So you could do:
// count() returns a long, but takeOrdered expects an int
List<String> sortedLines = lines.takeOrdered((int) lines.count());
ctx.stop();
Since RDDs are distributed and may be shuffled by later transformations, sorting while the data is still in RDD form is of limited use: when the sorted RDD is transformed again, it may be reshuffled (correct me if I'm wrong).
But take a look at JavaPairRDD.sortByKey().
Try collect():
List<String> list = lines.collect();
Collections.sort(list);
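Collections.sort(list) uses the natural String ordering. For the original goal of sorting by the hash in the last comma-separated field, a comparator on that field can be used instead. Below is a minimal plain-Java sketch (no Spark needed once the lines have been collected), assuming every line ends with a numeric field as in the question's code. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByTrailingHash {
    // Parses the last comma-separated field of a line as a long.
    // Assumption: every line has a numeric last field.
    static long trailingHash(String line) {
        String[] fields = line.split(",");
        return Long.parseLong(fields[fields.length - 1].trim());
    }

    // Returns a new list sorted by the trailing hash; the input list is left untouched.
    static List<String> sortByHash(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingLong(SortByTrailingHash::trailingHash));
        return sorted;
    }

    public static void main(String[] args) {
        // Stand-in for the result of lines.collect()
        List<String> lines = Arrays.asList("b,30", "a,10", "c,20");
        System.out.println(sortByHash(lines)); // prints [a,10, c,20, b,30]
    }
}
```

This only makes sense when the whole file fits in driver memory, which is also the precondition for collect() itself.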
DataSet#foreach(f) applies the function f to each row in the dataset. In a clustered environment, the data is split across the cluster. How can the results from each of these functions be collected?
For example, say the function would count the number of characters stored in each row. How can you create a DataSet or RDD that contains the results of each of these functions applied to each row?
The definition of foreach looks something like this:
final def foreach(f: (A) ⇒ Unit): Unit
f : The function that is applied for its side-effect to every element.
The result of function f is discarded
foreach in Scala is generally used for functions that are called for their side effects, e.g. printing to STDOUT.
If you want to get a result back from applying a function, you'll have to use map:
final def map[B](f: (A) ⇒ B): List[B]
I copied the signature from the documentation for List, but it is similar for RDDs as well.
As you can see, it applies the function f to elements of type A and returns a collection of type B, where A and B may be the same type.
val rdd = sc.parallelize(Array(
"String1",
"String2",
"String3" ))
scala> rdd.foreach(x => (x, x.length) )
// Nothing happens
rdd.map(x => (x, x.length) ).collect
// Array[(String, Int)] = Array((String1,7), (String2,7), (String3,7))
I have 2 JavaRDDs. The first one is
JavaRDD<CustomClass> data
and the second one is
JavaRDD<Vector> features
My Custom class has 2 fields, (String) text and (int) label.
I have 1000 instances of CustomClass in my JavaRDD data and 1000 instances of Vector in the JavaRDD features.
I have computed these 1000 vectors by using the JavaRDD data and applying a map function on it.
Now, I want to have a new JavaRDD of the form
JavaRDD<LabeledPoint>
Since the constructor of a LabeledPoint requires a label and a vector, I am unable to apply a map function which has both CustomClass and the Vector as an argument to the call function since it accepts only one argument.
Can someone please tell me how to combine these two JavaRDDs and get the new JavaRDD<LabeledPoint>?
Here are some snippets from the code I wrote :
class CustomClass {
String text; int label;
}
JavaRDD<CustomClass> data = getDataFromFile(filename);
final HashingTF hashingTF = new HashingTF();
final IDF idf = new IDF();
final JavaRDD<Vector> td2 = data.map(
new Function<CustomClass, Vector>() {
@Override
public Vector call(CustomClass cd) throws Exception {
Vector v = new DenseVector(hashingTF.transform(Arrays.asList(cd.getText().split(" "))).toArray());
return v;
}
}
);
final JavaRDD<Vector> features = idf.fit(td2).transform(td2);
You can use JavaRDD#zip:
Zips this RDD with another one, returning key-value pairs with the
first element in each RDD, second element in each RDD, etc. Assumes
that the two RDDs have the same number of partitions and the same
number of elements in each partition (e.g. one was made through a map
on the other).
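JavaPairRDD<CustomClass, Vector> dataAndFeatures = data.zip(features);
// Map each (CustomClass, Vector) pair to a LabeledPoint; getLabel() is assumed to exist alongside getText().
JavaRDD<LabeledPoint> labeledPoints = dataAndFeatures.map(
    pair -> new LabeledPoint(pair._1().getLabel(), pair._2()));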
The assumption quoted from the docs (same number of partitions and same number of elements in each partition) holds here, since you create td2 by a simple map over data. features is then the result of transform on an IDFModel instance, which also keeps the values aligned.
I'm looking for a way to split an RDD into two or more RDDs. The closest I've seen is "Scala Spark: Split collection into several RDD?", which still yields a single RDD.
If you're familiar with SAS, something like this:
data work.split1 work.split2;
set work.preSplit;
if (condition1) then
output work.split1;
else if (condition2) then
output work.split2;
run;
which results in two distinct data sets. It would have to be immediately persisted to get the results I intend...
It is not possible to yield multiple RDDs from a single transformation*. If you want to split an RDD, you have to apply a filter for each split condition. For example:
def even(x): return x % 2 == 0
def odd(x): return not even(x)
rdd = sc.parallelize(range(20))
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If you have only a binary condition and computation is expensive you may prefer something like this:
kv_rdd = rdd.map(lambda x: (x, odd(x)))
kv_rdd.cache()
rdd_odd = kv_rdd.filter(lambda kv: kv[1]).keys()
rdd_even = kv_rdd.filter(lambda kv: not kv[1]).keys()
This means only a single predicate computation, but it requires an additional pass over all the data.
It is important to note that as long as the input RDD is properly cached and there are no additional assumptions regarding data distribution, there is no significant difference in time complexity between repeated filter and a for-loop with nested if-else.
With N elements and M conditions, the number of operations you have to perform is clearly proportional to N times M. For a for-loop it should be closer to (N + MN) / 2, while repeated filter is exactly NM, but at the end of the day both are O(NM). You can see my discussion** with Jason Lenderman to read about some pros and cons.
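To make the two counts concrete, here is a small plain-Java sketch (not Spark code) that counts predicate evaluations for both strategies. It assumes mutually exclusive conditions that split the data evenly, which is where the (N + MN) / 2 figure comes from. The class and method names are hypothetical.

```java
import java.util.function.IntPredicate;

public class PredicateCount {
    // One full pass per condition, as with one filter per split: exactly N * M tests.
    static long repeatedFilter(int n, IntPredicate[] conditions) {
        long tests = 0;
        for (IntPredicate c : conditions) {
            for (int x = 0; x < n; x++) {
                c.test(x);
                tests++;
            }
        }
        return tests;
    }

    // Single pass with an if / else-if chain: stop at the first condition that matches.
    static long singlePass(int n, IntPredicate[] conditions) {
        long tests = 0;
        for (int x = 0; x < n; x++) {
            for (IntPredicate c : conditions) {
                tests++;
                if (c.test(x)) {
                    break;
                }
            }
        }
        return tests;
    }

    public static void main(String[] args) {
        IntPredicate[] oddEven = { x -> x % 2 != 0, x -> x % 2 == 0 };
        System.out.println(repeatedFilter(20, oddEven)); // prints 40, i.e. N * M
        System.out.println(singlePass(20, oddEven));     // prints 30, i.e. (N + M * N) / 2
    }
}
```

Both counts grow linearly in N and in M, which is why the choice between them rarely matters asymptotically.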
At a very high level you should consider two things:
1. Spark transformations are lazy: until you execute an action, your RDD is not materialized.
Why does it matter? Going back to my example:
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If I later decide that I need only rdd_odd, there is no reason to materialize rdd_even.
If you take a look at your SAS example, to compute work.split2 you need to materialize both the input data and work.split1.
2. RDDs provide a declarative API. When you use filter or map, it is completely up to the Spark engine how the operation is performed. As long as the functions passed to transformations are free of side effects, this creates multiple possibilities to optimize the whole pipeline.
At the end of the day this case is not special enough to justify its own transformation.
This map-with-filter pattern is actually used in core Spark. See my answer to "How does Sparks RDD.randomSplit actually split the RDD" and the relevant part of the randomSplit method.
If the only goal is to split the input when writing it out, it is possible to use the partitionBy clause of DataFrameWriter with the text output format:
def makePairs(row: T): (String, String) = ???
data
.map(makePairs).toDF("key", "value")
.write.partitionBy("key").format("text").save(...)
* There are only 3 basic types of transformations in Spark:
RDD[T] => RDD[T]
RDD[T] => RDD[U]
(RDD[T], RDD[U]) => RDD[W]
where T, U, W can be either atomic types or products / tuples (K, V). Any other operation has to be expressed using some combination of the above. You can check the original RDD paper for more details.
** https://chat.stackoverflow.com/rooms/91928/discussion-between-zero323-and-jason-lenderman
*** See also Scala Spark: Split collection into several RDD?
As other posters mentioned above, there is no single, native RDD transform that splits RDDs, but here are some "multiplex" operations that can efficiently emulate a wide variety of "splitting" on RDDs, without reading multiple times:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.multiplex.MuxRDDFunctions
Some methods specific to random splitting:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions
The methods are available from the open source silex project:
https://github.com/willb/silex
A blog post explaining how they work:
http://erikerlandson.github.io/blog/2016/02/08/efficient-multiplexing-for-spark-rdds/
def muxPartitions[U :ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[U],
persist: StorageLevel): Seq[RDD[U]] = {
val mux = self.mapPartitionsWithIndex { case (id, itr) =>
Iterator.single(f(id, itr))
}.persist(persist)
Vector.tabulate(n) { j => mux.mapPartitions { itr => Iterator.single(itr.next()(j)) } }
}
def flatMuxPartitions[U :ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[TraversableOnce[U]],
persist: StorageLevel): Seq[RDD[U]] = {
val mux = self.mapPartitionsWithIndex { case (id, itr) =>
Iterator.single(f(id, itr))
}.persist(persist)
Vector.tabulate(n) { j => mux.mapPartitions { itr => itr.next()(j).toIterator } }
}
As mentioned elsewhere, these methods do involve a trade-off of memory for speed, because they operate by computing entire partition results "eagerly" instead of "lazily." Therefore, it is possible for these methods to run into memory problems on large partitions, where more traditional lazy transforms will not.
One way is to use a custom partitioner to partition the data according to your filter condition. This can be achieved by extending Partitioner and implementing something similar to RangePartitioner.
mapPartitions can then be used to construct multiple RDDs from the partitioned RDD without reading all the data.
val filtered = partitioned.mapPartitions { iter =>
  new Iterator[Int]() {
    // Only partitions listed in rangeOfPartitionsToKeep yield elements; all others appear empty.
    override def hasNext: Boolean =
      rangeOfPartitionsToKeep.contains(TaskContext.get().partitionId) && iter.hasNext
    override def next(): Int = iter.next()
  }
}
Just be aware that the filtered RDDs will have the same number of partitions as the partitioned RDD, so coalesce should be used to reduce this and remove the empty partitions.
If you split an RDD using the randomSplit API call, you get back an array of RDDs.
If you want 5 RDDs returned, pass in 5 weight values.
e.g.
val sourceRDD = sc.parallelize(1 to 100, 4)
val seedValue = 5
val splitRDD = sourceRDD.randomSplit(Array(1.0,1.0,1.0,1.0,1.0), seedValue)
splitRDD(1).collect()
res7: Array[Int] = Array(1, 6, 11, 12, 20, 29, 40, 62, 64, 75, 77, 83, 94, 96, 100)
I am looking for a Spark RDD operation like top or takeOrdered, but that returns another RDD, not an Array, that is, does not collect the full result to RAM.
It can be a sequence of operations, but ideally, in no step trying to collect the full result into the memory of a single node.
Let's say you want to have the top 50% of an RDD.
def top50(rdd: RDD[(Double, String)]) = {
val sorted = rdd.sortByKey(ascending = false)
val partitions = sorted.partitions.size
// Throw away the contents of the lower partitions.
sorted.mapPartitionsWithIndex { (pid, it) =>
if (pid <= partitions / 2) it else Iterator.empty
}
}
This is an approximation: you may get more or less than 50%. You could do better, but it would cost an extra evaluation of the RDD. For the use cases I have in mind this would not be worth it.
Take a look at
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
val rdd: RDD[(String, Int)] = ??? // the first element (String) is the key, the second (Int) is the value
val topByKey:RDD[(String, Array[Int])] = rdd.topByKey(n)
Or use aggregate with BoundedPriorityQueue.
What am I doing wrong? I need to create a HashMap<Integer, ArrayList<String>> that maps each word length to the words of that length in the test data. Then I need to display the test String, iterate through the keys, and display the words ordered by length. Should I use an iterator as opposed to a for-each loop, or would that not make much of a difference?
This is my code thus far:
HashMap<Integer,String>map = new HashMap<Integer,String>();
// adds values
map.put(2, "The");
map.put(8, "Superbowl");
Instead of a HashMap, you should use a SortedMap: "A Map that further provides a total ordering on its keys" (JDK 1.7 API doc). A TreeMap would do.
A sorted map will return its keys in sorted order. Since your keys are Integer (hence Comparable), their natural ordering is used.
To list the words by increasing length:
SortedMap<Integer, List<String>> wordsByLength; // filled somewhere
// Iterates over entries in increasing key order
Set<Map.Entry<Integer, List<String>>> entries = wordsByLength.entrySet();
for ( Map.Entry<Integer, List<String>> entry : entries ) {
int length = entry.getKey();
List<String> words = entry.getValue();
// do what you need to do
}
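The snippet above leaves the map "filled somewhere". One possible way to fill it from a test string is sketched below. It assumes Java 8 or later (for computeIfAbsent) and whitespace-separated words; the class name, method name, and sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class WordsByLength {
    // Groups the whitespace-separated words of text by their length.
    // TreeMap keeps the keys (lengths) in ascending order.
    static SortedMap<Integer, List<String>> group(String text) {
        SortedMap<Integer, List<String>> wordsByLength = new TreeMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when the input is blank
            }
            // Create the list for this length on first use, then append the word.
            wordsByLength.computeIfAbsent(word.length(), k -> new ArrayList<>()).add(word);
        }
        return wordsByLength;
    }

    public static void main(String[] args) {
        System.out.println(group("The Superbowl is the best game"));
        // prints {2=[is], 3=[The, the], 4=[best, game], 9=[Superbowl]}
    }
}
```

Storing a List per key is what prevents words of equal length from overwriting each other, which is the problem with a plain Map<Integer, String>.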
Since order is not guaranteed in a HashMap, you would need to pull out the keys and sort them, then use the sorted key list to loop through the values, like so:
Map<Integer,String>map = new HashMap<Integer,String>();
map.put(2, "The");
map.put(8, "Superbowl");
List<Integer> keyList = new ArrayList<Integer>(map.keySet());
Collections.sort(keyList);
for (Integer key : keyList) {
System.out.println(map.get(key));
}
However, in your use case, words of the same length would overwrite previous entries in the Map, which is not what you want. A better solution would be to store the words in a List and sort them using a Comparator that compares String lengths, like this:
List<String> words = new ArrayList<String>();
// Loop through adding words if they don't already exist in the List
// sort the List by word length
Collections.sort(words, new Comparator<String>() {
public int compare(String s1, String s2) {
return Integer.valueOf(s1.length()).compareTo(s2.length());
}
});
//Now loop through sorted List and print words
for (String word : words) {
System.out.println(word);
}