Spark - does sampleByKeyExact duplicate the data? - apache-spark

When performing sampleByKeyExact on a JavaPairRDD, does Spark save an actual copy of the data or just pointers into the JavaPairRDD?
Meaning, if I take 100 bootstrap samples of the original dataset, does it keep 100 copies of the original RDD, or does it keep 100 different sets of indices pointing into it?
UPDATE:
JavaPairRDD<String, String> dataPairs = ... // Load the data
boolean withReplacement = true;
double testFraction = 0.2;
long seed = 0;
Map<String, Double> classFractions = new HashMap<>();
classFractions.put("1", 1 - testFraction);
classFractions.put("0", 1 - testFraction);
dataPairs.cache();

for (int i = 0; i < 100; i++)
{
    PredictionAlgorithm algo = new Algo();
    JavaPairRDD<String, String> trainStratifiedData = dataPairs.sampleByKeyExact(withReplacement, classFractions, seed);
    algo.fit(trainStratifiedData);
}
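One quick way to convince yourself is to inspect the sampled RDD's lineage: transformations are lazy, so each sample should only record a dependency on the cached parent rather than a materialized copy. A tiny check, assuming the loop above:
// Inside the loop: the debug string shows the sampled RDD's lineage,
// i.e. a dependency chain back to the cached dataPairs, not a duplicated dataset.
System.out.println(trainStratifiedData.toDebugString());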

Related

how to write >1 file from a partition

First I wanted to split a partition by a prefixed size, so I can update the file system. For example, if a partition's data size is 200MB (each row in the partition RDD can be a different size), I want to write 4 files in that partition, each file 50MB, while trying to avoid a shuffle. Is it possible to do that without a repartition or a coalesce, which would cause a shuffle? Since I don't have a fixed row size, I can't really use the maxRecordsPerFile Spark config.
The next option is to repartition the entire dataset, causing a shuffle. To calculate the size I did the following, but it fails with: "Local variable numOfPartitions defined in an enclosing scope must be final or effectively final". What am I doing wrong, and how can I fix this code?
...
int numOfPartitions = 1;
JavaRDD<String> tgg = xmlDataSet.toJavaRDD().mapPartitions(xmlRows -> {
    long totalSize = 0;
    List<String> strLst = new ArrayList<String>();
    while (xmlRows.hasNext()) {
        String xmlString = blobToString(xmlRows.next());
        totalSize = totalSize + xmlString.getBytes().length;
        strLst.add(xmlString);
        if (totalSize > 10000) {
            numOfPartitions++;   // this is the line the compiler rejects
        }
    }
    return strLst.iterator();
});
...
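Regarding the "effectively final" error: a Java lambda cannot mutate a local variable of the enclosing method, which is exactly what numOfPartitions++ tries to do. One common workaround is to compute the sizes as part of a job and derive the partition count on the driver afterwards. A rough sketch under that assumption (blobToString and xmlDataSet are the names from the snippet above; the 50MB target is only illustrative):
import org.apache.spark.api.java.JavaRDD;

// Convert each row to its XML string once, and cache so it can be reused.
JavaRDD<String> xmlStrings = xmlDataSet.toJavaRDD()
        .map(row -> blobToString(row));   // blobToString: the poster's helper
xmlStrings.cache();

// Compute the total size on the cluster instead of mutating a local counter
// from inside the lambda (which is what triggers the "effectively final" error).
long totalBytes = xmlStrings
        .map(s -> (long) s.getBytes().length)
        .reduce(Long::sum);

// Derive the number of partitions on the driver, e.g. one partition per ~50 MB.
long targetPartitionBytes = 50L * 1024 * 1024;
int numOfPartitions = (int) Math.max(1, totalBytes / targetPartitionBytes);

JavaRDD<String> repartitioned = xmlStrings.repartition(numOfPartitions);
This still shuffles (it is the "repartition the entire dataset" option from the question), but it avoids the compiler error because nothing inside a lambda writes to numOfPartitions.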

How to calculate standard deviation of an RDD in Spark?

I am a new Apache Spark user and am confused about the way Spark runs programs.
For example, I have a large RDD of ints distributed over 10 nodes and want to run Scala code on the driver to calculate the average and standard deviation of each partition (it is important to have these values for each partition, not for all of the data).
Is this possible in Spark, and could anyone give me a sample?
Here is sample Java code for the Spark map operation:
List<List<Integer>> listOfLists = new ArrayList<List<Integer>>();
for (int j = 0; j < 10; j++) {
    List<Integer> l = new ArrayList<Integer>();
    for (int i = 0; i < 50; i++) {
        l.add(i);
    }
    listOfLists.add(l);
}
SparkConf conf = new SparkConf().setAppName("STD Application"); // set other params
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<List<Integer>> rddNumbers = sc.parallelize(listOfLists);
JavaRDD<Double> partialSTD = rddNumbers.map(x -> Utils.std(x)); // Utils.std: the poster's helper, assumed to return a double
for (Double value : partialSTD.collect()) {
    System.out.println(value);
}
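If the goal is one mean/standard deviation per partition rather than per inner list, mapPartitions together with Spark's org.apache.spark.util.StatCounter is a natural fit. A rough sketch, reusing sc and listOfLists from the snippet above and assuming the Spark 2.x Java API (where flatMap/mapPartitions functions return an Iterator):
import java.util.Collections;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.util.StatCounter;

// Flatten the lists into a single RDD of ints; the partition layout is whatever
// Spark chose, so the number of results equals the number of partitions.
JavaRDD<Integer> numbers = sc.parallelize(listOfLists).flatMap(List::iterator);

// One StatCounter per partition: it accumulates count, mean and stdev in a single pass.
JavaRDD<StatCounter> perPartitionStats = numbers.mapPartitions(it -> {
    StatCounter stats = new StatCounter();
    while (it.hasNext()) {
        stats.merge(it.next().doubleValue()); // merge(double) updates count, mean and variance
    }
    return Collections.singletonList(stats).iterator();
});

for (StatCounter stats : perPartitionStats.collect()) {
    System.out.println("mean=" + stats.mean() + ", stdev=" + stats.stdev());
}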

How to compute the distance matrix in Spark?

I have tried pairing the samples, but it costs a huge amount of memory, as 100 samples lead to 9,900 pairs, which is costly. What would be a more effective way of computing a distance matrix in a distributed environment in Spark?
Here is a snippet of the pseudo-code I'm trying:
val input = sc.textFile("AirPassengers.csv", numPartitions / 2)
val i = input.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = i.zipWithIndex() // include the index of each sample
val indexedData = indexed.map { case (k, v) => (v, k) }

val pairedSamples = indexedData.cartesian(indexedData)

val filteredSamples = pairedSamples.filter { case (x, y) =>
  x._1.toInt > y._1.toInt // consider only the upper or lower triangle
}
filteredSamples.cache
filteredSamples.count
The above code creates the pairs, but even if my dataset contains only 100 samples, pairing them (filteredSamples above) results in 4,950 pairs, which could be very costly for big data.
I recently answered a similar question.
Basically, it comes down to computing n(n-1)/2 pairs, which would be 4950 computations in your example. However, what makes this approach different is that I use joins instead of cartesian. With your code, the solution would look like this:
val input = (sc.textFile("AirPassengers.csv",(numPartitions/2)))
val i = input.map(s => (Vectors.dense(s.split(',').map(_.toDouble))))
val indexed = i.zipWithIndex()
// including the index of each sample
val indexedData = indexed.map { case (k,v) => (v,k) }
// prepare indices
val count = i.count
val indices = sc.parallelize(for(i <- 0L until count; j <- 0L until count; if i > j) yield (i, j))
val joined1 = indices.join(indexedData).map { case (i, (j, v)) => (j, (i,v)) }
val joined2 = joined1.join(indexedData).map { case (j, ((i,v1),v2)) => ((i,j),(v1,v2)) }
// after that, you can then compute the distance using your distFunc
val distRDD = joined2.mapValues{ case (v1, v2) => distFunc(v1, v2) }
Try this method and compare it with the one you already posted. Hopefully, this can speed up your code a bit.
As far as I can see from checking various sources and the Spark MLlib clustering documentation, Spark doesn't currently support distance or pdist matrices.
In my opinion, 100 samples will always produce at least 4950 distance values, so manually creating a distributed matrix solver using a transformation (like .map) is the best solution.
This can serve as the Java version of jtitusj's answer:
public JavaPairRDD<Tuple2<Long, Long>, Double> getDistanceMatrix(Dataset<Row> ds, String vectorCol) {
    JavaRDD<Vector> rdd = ds.toJavaRDD().map(new Function<Row, Vector>() {
        private static final long serialVersionUID = 1L;

        public Vector call(Row row) throws Exception {
            return row.getAs(vectorCol);
        }
    });
    // Note: this collects all vectors to the driver and computes the pairs there.
    List<Vector> vectors = rdd.collect();
    long count = ds.count();
    List<Tuple2<Tuple2<Long, Long>, Double>> distanceList = new ArrayList<Tuple2<Tuple2<Long, Long>, Double>>();
    for (long i = 0; i < count; i++) {
        for (long j = 0; j < count && i > j; j++) {
            Tuple2<Long, Long> indexPair = new Tuple2<Long, Long>(i, j);
            double d = DistanceMeasure.getDistance(vectors.get((int) i), vectors.get((int) j));
            distanceList.add(new Tuple2<Tuple2<Long, Long>, Double>(indexPair, d));
        }
    }
    // Parallelize the result so the method actually returns a JavaPairRDD as declared.
    JavaSparkContext jsc = JavaSparkContext.fromSparkContext(ds.sparkSession().sparkContext());
    return jsc.parallelizePairs(distanceList);
}
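DistanceMeasure.getDistance in the snippet above is whatever distance helper its author had in mind; it is not defined in the post. If a plain Euclidean distance between two ML vectors is enough, a minimal stand-in could look like this (the helper name euclidean is made up here; Vectors.sqdist returns the squared distance):
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;

// Hypothetical replacement for DistanceMeasure.getDistance above:
// Euclidean distance as the square root of the squared distance.
static double euclidean(Vector v1, Vector v2) {
    return Math.sqrt(Vectors.sqdist(v1, v2));
}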

Spark - operate a non-serializable method on every element in a JavaPairRDD

Is it possible to go over the elements in a JavaPairRDD but perform some calculation f() on one machine only?
Each element (x, y) in the JavaPairRDD is a record (features and label), and I would like to call some calculation f(x, y) which can't be serialized.
f() is a self-implemented algorithm that updates some weights $w_i$.
The weights $w_i$ should be the same object for the data from all partitions (which are broken up into tasks, each of which is operated on by an executor).
Update:
What I'm trying to do is run the "algo" f() 100 times, in such a way that every iteration of the algo is single-threaded but the whole 100 iterations run in parallel on different nodes. At a high level, the code looks something like this:
JavaPairRDD<String, String> dataPairs = ... // Load the data
boolean withReplacement = true;
double testFraction = 0.2;
long seed = 0;
Map<String, Double> classFractions = new HashMap<>();
classFractions.put("1", 1 - testFraction);
classFractions.put("0", 1 - testFraction);
dataPairs.cache();

for (int i = 0; i < 100; i++)
{
    PredictionAlgorithm algo = new Algo();
    JavaPairRDD<String, String> trainStratifiedData = dataPairs.sampleByKeyExact(withReplacement, classFractions, seed);
    algo.fit(trainStratifiedData);
}
I would like algo.fit() to update the same weights $w_i$ instead of fitting separate weights for every partition.
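For the "can't be serialized" part on its own, the usual pattern is to construct the non-serializable object inside a partition-level function, so it is created on the executor and never shipped from the driver. This does not give a single shared $w_i$ across partitions (executors cannot mutate shared driver-side state, short of accumulators), but as a rough sketch using the question's own placeholder names (algo.update here is hypothetical):
import org.apache.spark.api.java.JavaPairRDD;

import scala.Tuple2;

// The algorithm is instantiated inside foreachPartition, i.e. on the executor,
// so it never has to be serializable. Each partition fits its own weights.
JavaPairRDD<String, String> data = trainStratifiedData;
data.foreachPartition(records -> {
    PredictionAlgorithm algo = new Algo();      // created per partition, never serialized
    while (records.hasNext()) {
        Tuple2<String, String> record = records.next();
        algo.update(record._1(), record._2());  // hypothetical per-record update
    }
});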

Apache Spark Streaming: Median of windowed PairDStream by key

I want to calculate the median of the values of a PairDStream for each key.
I have already tried the following, which is very inefficient:
JavaPairDStream<String, Iterable<Float>> groupedByKey = pairDstream.groupByKey();

JavaPairDStream<String, Float> medianPerPlug1h = groupedByKey.transformToPair(
        new Function<JavaPairRDD<String, Iterable<Float>>, JavaPairRDD<String, Float>>() {
    public JavaPairRDD<String, Float> call(JavaPairRDD<String, Iterable<Float>> v1) throws Exception {
        return v1.mapValues(new Function<Iterable<Float>, Float>() {
            public Float call(Iterable<Float> v1) throws Exception {
                List<Float> buffer = new ArrayList<Float>();
                long count = 0L;

                Iterator<Float> iterator = v1.iterator();
                while (iterator.hasNext()) {
                    buffer.add(iterator.next());
                    count++;
                }

                float[] values = new float[(int) count];
                for (int i = 0; i < buffer.size(); i++) {
                    values[i] = buffer.get(i);
                }
                Arrays.sort(values);

                float median;
                int startIndex;
                if (count % 2 == 0) {
                    startIndex = (int) (count / 2 - 1);
                    float a = values[startIndex];
                    float b = values[startIndex + 1];
                    median = (a + b) / 2.0f;
                } else {
                    startIndex = (int) (count / 2);
                    median = values[startIndex];
                }
                return median;
            }
        });
    }
});
medianPerPlug1h.print();
Can somebody help me with a more efficient transformation? I have about 1950 different keys, and each can accumulate up to 3600 values (1 data point per second, window of 1 hour) for which to find the median.
Thank you!
The first thing is that I don't know why you are using Spark for this kind of task. It doesn't seem related to big data, considering you have just a few thousand values, and it may make things more complicated. But let's assume you're planning to scale up to bigger datasets.
I would try to use a more optimized algorithm for finding the median than just sorting the values; sorting an array runs in O(n log n) time.
You could think about using a linear-time selection algorithm like the median of medians.
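For reference, a quickselect (expected linear time) in plain Java could look roughly like this; it returns the k-th smallest element without fully sorting the array. For an even count you would call it twice (k = n/2 - 1 and k = n/2) and average the two results:
import java.util.Random;

// Quickselect: returns the k-th smallest element (0-based) of values in expected O(n) time.
// The array is partially reordered in place.
static float quickSelect(float[] values, int k) {
    Random rnd = new Random();
    int lo = 0, hi = values.length - 1;
    while (lo < hi) {
        // choose a random pivot and partition around it (Lomuto scheme)
        int pivotIndex = lo + rnd.nextInt(hi - lo + 1);
        float pivot = values[pivotIndex];
        swap(values, pivotIndex, hi);
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (values[i] < pivot) {
                swap(values, i, store++);
            }
        }
        swap(values, store, hi);
        if (store == k) {
            return values[store];
        } else if (store < k) {
            lo = store + 1;
        } else {
            hi = store - 1;
        }
    }
    return values[lo];
}

static void swap(float[] a, int i, int j) {
    float tmp = a[i];
    a[i] = a[j];
    a[j] = tmp;
}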
1) Avoid using groupByKey; reduceByKey is more efficient than groupByKey.
2) reduceByKeyAndWindow(Function2, windowDuration, slideDuration) can serve you better.
Example:
JavaPairDStream<String, String> merged = yourRDD.reduceByKeyAndWindow(new Function2<String, String, String>() {
    public String call(String arg0, String arg1) throws Exception {
        return arg0 + "," + arg1;
    }
}, Durations.seconds(windowDur), Durations.seconds(slideDur));
Assume the output from this stream will look like this:
(key, "1,2,3,4,5,6,7")
(key, "1,2,3,4,5,6,7")
Now, for each key, you can parse this string; you will have the count of the values, so for example: (1+2+3+4+5+6+7)/count.
Note: I used strings just to concatenate the values.
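A minimal sketch of that parsing step, reusing the merged stream from the snippet above but taking the middle element of the sorted values rather than the sum/count, since the question asks for the median:
import java.util.Arrays;

import org.apache.spark.streaming.api.java.JavaPairDStream;

// Parse the comma-concatenated values back into floats and pick the median per key.
JavaPairDStream<String, Float> medianPerKey = merged.mapValues(concatenated -> {
    String[] parts = concatenated.split(",");
    float[] values = new float[parts.length];
    for (int i = 0; i < parts.length; i++) {
        values[i] = Float.parseFloat(parts[i]);
    }
    Arrays.sort(values); // or use a selection algorithm, as suggested above
    int mid = values.length / 2;
    return values.length % 2 == 0
            ? (values[mid - 1] + values[mid]) / 2.0f
            : values[mid];
});
medianPerKey.print();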
I hope it helps :)
