apache spark: Need inputs to optimize my code

I am building an application which reads 1.5 GB of data and transforms it. My code skeleton is as follows.
// Here I pass a list of IDs to read 4000 files, form a union RDD of all the records,
// and return it as unionbioSetId
run() {
    JavaRDD<String> unionbioSetId = readDirectory(ctx, groupAID, groupBID);
    JavaRDD<String> temp = unionbioSetId.coalesce(6, false);

    JavaPairRDD<String, Tuple3<Double, Double, Double>> flatRDD = temp
        .flatMapToPair(new PairFlatMapFunction<String, String, String>() {
            public Iterable<Tuple2<String, String>> call(String record) {
                // parse each record into a (key, value) pair
                return Arrays.asList(new Tuple2<String, String>(key, value));
            }
        })
        .groupByKey()
        .mapToPair(new PairFunction<Tuple2<String, Iterable<String>>, // input
                String,                          // K
                Tuple3<Double, Double, Double>   // V
                >() {
            public Tuple2<String, Tuple3<Double, Double, Double>> call(
                    Tuple2<String, Iterable<String>> value) {
                // aggregate the grouped values into a Tuple3
            }
        })
        .filter(new Function<Tuple2<String, Tuple3<Double, Double, Double>>, Boolean>() {
            // keep only the records of interest
        }); // group by key, map to pair, sort by key

    String hadoopOutputPathAsString = directory;
    flatRDD.saveAsTextFile(hadoopOutputPathAsString);
}
///////////////
num of executors: 9
driver memory: 2g
executor memory: 6g
executor cores: 12
My program runs slower than the equivalent map/reduce job (same code skeleton). Can anyone help me optimize the code skeleton above to make it faster?

Don't call coalesce. You don't need fewer partitions, you need more. You have 108 worker cores (9 executors × 12 cores), but you only use 6 of them if you go with 6 partitions. A rule of thumb is that you want at least 3 × num_executors × cores_per_executor = 324 partitions.
JavaRDD<String> temp = unionbioSetId.repartition(350);
Or just don't change the number of partitions at all. When the files are read, the data is partitioned by Hadoop splits. In many cases this gives a good layout, and you would avoid the cost of repartitioning.
Read the files at once instead of reading them separately and then taking their union: sc.textFile("file1,file2,file3,...") or sc.textFile("dir/*"). This may also make a performance difference.
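For example, a minimal sketch of building the RDD from all files in one call (here sc is the same Spark context the question calls ctx, and filePaths is a hypothetical List<String> holding the same 4000 paths that readDirectory resolves today):
// Join the paths into one comma-separated string and read everything in a single call.
String allPaths = String.join(",", filePaths);
JavaRDD<String> unionbioSetId = sc.textFile(allPaths);
// Or, if all the files live under one directory, use a glob instead:
// JavaRDD<String> unionbioSetId = sc.textFile("hdfs:///path/to/input/*");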

Related

how to write >1 file from a partition

First I wanted to split a partition by a predefined size, so I can update the file system. For example, if a partition's data size is 200 MB (each row in the partition RDD can be a different size), I want to write 4 files from that partition, each about 50 MB, while trying to avoid a shuffle. Is it possible to do that without a repartition or a coalesce, which would cause a shuffle? Since I don't have a fixed row size, I can't really use the maxRecordsPerFile Spark config.
The next option is to repartition the entire dataset, causing a shuffle. To calculate the size I did the following, but it fails with: "Local variable numOfPartitions defined in an enclosing scope must be final or effectively final". What am I doing wrong, and how can I fix this code?
...
int numOfPartitions = 1;
JavaRDD<String> tgg = xmlDataSet.toJavaRDD().mapPartitions(xmlRows -> {
    long totalSize = 0;
    List<String> strLst = new ArrayList<String>();
    while (xmlRows.hasNext()) {
        String xmlString = blobToString(xmlRows.next());
        totalSize = totalSize + xmlString.getBytes().length;
        strLst.add(xmlString);
        if (totalSize > 10000) {
            numOfPartitions++;  // capturing and mutating this local is what the compiler rejects
        }
    }
    return strLst.iterator();
});
...
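One possible way around the compiler error (a sketch, not from the original post: it totals the bytes with a LongAccumulator instead of mutating a captured local, at the cost of an extra pass over the data; sparkSession, xmlDataSet, blobToString and outputDir are assumed from the surrounding code, and targetBytesPerFile is a made-up name):
// requires org.apache.spark.util.LongAccumulator
// Tasks add to the accumulator; only the driver reads its value.
LongAccumulator totalBytes = sparkSession.sparkContext().longAccumulator("totalBytes");
xmlDataSet.toJavaRDD().foreach(row -> totalBytes.add(blobToString(row).getBytes().length));

// Derive the partition count on the driver, then repartition (this shuffles, as noted above).
long targetBytesPerFile = 50L * 1024 * 1024;  // aim for roughly 50 MB per output file
int numOfPartitions = (int) Math.max(1, totalBytes.value() / targetBytesPerFile);
JavaRDD<String> tgg = xmlDataSet.toJavaRDD()
        .map(row -> blobToString(row))
        .repartition(numOfPartitions);
tgg.saveAsTextFile(outputDir);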

Optimization (in terms of speed)

Is there any other way to optimize this code? Can anyone come up with a better approach? This part is taking a lot of time in my main code. Thanks a lot ;)
HashMap<String, Integer> hmap = new HashMap<String, Integer>();
List<String> dup = new ArrayList<String>();
List<String> nondup = new ArrayList<String>();

// first pass: count case-insensitive occurrences of every entry
for (String num : nums) {
    String result = num.toLowerCase();
    if (hmap.containsKey(result)) {
        hmap.put(result, hmap.get(result) + 1);
    } else {
        hmap.put(result, 1);
    }
}

// second pass: split the original entries into duplicates and non-duplicates
for (String num : nums) {
    int count = hmap.get(num.toLowerCase());
    if (count == 1) {
        nondup.add(num);
    } else {
        dup.add(num);
    }
}
output:
[A/tea, C/SEA.java, C/clock, aep, aeP, C/SEA.java]
Dups: [C/SEA.java, aep, aeP, C/SEA.java]
nondups: [A/tea, C/clock]
How much time is "a lot of time"? Is your input bigger than what you've actually shown us?
You could parallelize this with something like Arrays.parallelStream(nums).collect(Collectors.groupingByConcurrent(k -> k, Collectors.counting())), which would get you a Map<String, Long>, but that would only speed up your code if you have a lot of input, which it doesn't look like you have right now.
You could parallelize the next step, if you liked, like so:
Map<String, Long> counts = Arrays.parallelStream(nums)
        .collect(Collectors.groupingByConcurrent(k -> k, Collectors.counting()));
Map<Boolean, List<String>> hasDup =
        counts.entrySet().parallelStream()
              .collect(Collectors.partitioningBy(
                      entry -> entry.getValue() > 1,
                      Collectors.mapping(Map.Entry::getKey, Collectors.toList())));
List<String> dup = hasDup.get(true);
List<String> nodup = hasDup.get(false);
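For reference, here is a small self-contained variant of that approach (sequential streams, and the classifier lowercased so the grouping matches the case-insensitive counting in the question; the class name and the inline sample data are just for illustration):
import java.util.*;
import java.util.stream.Collectors;

public class DupFinder {
    public static void main(String[] args) {
        String[] nums = {"A/tea", "C/SEA.java", "C/clock", "aep", "aeP", "C/SEA.java"};

        // count occurrences of each entry, ignoring case
        Map<String, Long> counts = Arrays.stream(nums)
                .collect(Collectors.groupingBy(k -> k.toLowerCase(), Collectors.counting()));

        // partition the original entries by whether their lowercased form occurs more than once
        Map<Boolean, List<String>> partitioned = Arrays.stream(nums)
                .collect(Collectors.partitioningBy(n -> counts.get(n.toLowerCase()) > 1));

        System.out.println("Dups: " + partitioned.get(true));      // [C/SEA.java, aep, aeP, C/SEA.java]
        System.out.println("nondups: " + partitioned.get(false));  // [A/tea, C/clock]
    }
}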
The algorithms in the other answers can speed up execution by using multiple threads.
This can theoretically reduce the processing time by a factor of M, where M is the maximum number of threads your system can run concurrently. However, since M is a constant, this does not change the order of complexity, which therefore remains O(N).
At a glance, I do not see a way to solve your problem in less than O(N), I'm afraid.

does Spark unpersist() have different strategy?

I just did some experiments with Spark's unpersist() and I'm confused about what it actually does. I googled a lot, and almost everyone says that unpersist() will immediately evict the RDD from the executor's memory. But in this test we can see that it's not always true. See the simple test below:
private static int base = 0;

public static Integer[] getInts() {
    Integer[] res = new Integer[5];
    for (int i = 0; i < 5; i++) {
        res[i] = base++;
    }
    System.out.println("number generated:" + res[0] + " to " + res[4] + "---------------------------------");
    return res;
}

public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder().appName("spark test").getOrCreate();
    JavaSparkContext spark = new JavaSparkContext(sparkSession.sparkContext());

    JavaRDD<Integer> first = spark.parallelize(Arrays.asList(getInts()));
    System.out.println("first: " + Arrays.toString(first.collect().toArray())); // action
    first.unpersist();
    System.out.println("first is unpersisted");

    System.out.println("compute second ========================");
    JavaRDD<Integer> second = first.map(i -> {
        System.out.println("double " + i);
        return i * 2;
    }).cache(); // transform
    System.out.println("second: " + Arrays.toString(second.collect().toArray())); // action
    second.unpersist();

    System.out.println("compute third ========================");
    JavaRDD<Integer> third = second.map(i -> i + 100); // transform
    System.out.println("third: " + Arrays.toString(third.collect().toArray())); // action
}
the output is:
number generated:0 to 4---------------------------------
first: [0, 1, 2, 3, 4]
first is unpersisted
compute second ========================
double 0
double 1
double 2
double 3
double 4
second: [0, 2, 4, 6, 8]
compute third ========================
double 0
double 1
double 2
double 3
double 4
third: [100, 102, 104, 106, 108]
As we can see, unpersisting 'first' has no effect: it is not recalculated.
But unpersisting 'second' does trigger a recalculation.
Can anyone help me figure out why unpersisting 'first' does not trigger a recalculation? If I want to force 'first' to be evicted from memory, how should I do it? Is there anything special about RDDs created from the parallelize() or textFile() APIs?
Thanks!
This behavior has nothing to do with caching and unpersisting. In fact, first is not even persisted, although it wouldn't make much difference here.
When you parallelize, you pass a local, non-distributed object. parallelize takes its argument by value, and its life cycle is completely outside Spark's scope. As a result, Spark has no reason to recompute it at all once the ParallelCollectionRDD has been initialized. If you want to distribute a different collection, just create a new RDD.
It is also worth noting that unpersist can be called in both blocking and non-blocking mode, depending on the blocking argument.
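A minimal sketch of both points (assuming a JavaSparkContext named sc; the collections and values are made up):
// parallelize copies the local list by value, so to distribute different data you simply
// build a new RDD; mutating the original list afterwards does not change the existing RDD.
JavaRDD<Integer> first = sc.parallelize(Arrays.asList(0, 1, 2, 3, 4));
JavaRDD<Integer> replacement = sc.parallelize(Arrays.asList(5, 6, 7, 8, 9));

// unpersist(true) blocks until the cached blocks are actually removed,
// while unpersist(false) returns immediately and removes them asynchronously.
JavaRDD<Integer> second = first.map(i -> i * 2).cache();
second.count();          // action: materializes and caches 'second'
second.unpersist(true);  // blocking removal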

Spark - operate a non serialized method on every element in a JavaPairRDD

Is it possible to go over the elements in a JavaPairRDD but perform some calculation f() on one machine only?
Each element (x, y) in the JavaPairRDD is a record (features and label), and I would like to call some calculation f(x, y) which can't be serialized.
f() is a self-implemented algorithm that updates some weights $w_i$.
The weights $w_i$ should be the same object for data from all partitions (rather than broken up into tasks, each of which is operated on by an executor).
Update:
What I'm trying to do is run the "algo" f() 100 times, in such a way that every iteration of the algo is single-threaded but the whole 100 iterations run in parallel on different nodes. At a high level, the code is something like this:
JavaPairRDD<String, String> dataPairs = ... // Load the data

boolean withReplacement = true;
double testFraction = 0.2;
long seed = 0;

Map<String, Double> classFractions = new HashMap<>();
classFractions.put("1", 1 - testFraction);
classFractions.put("0", 1 - testFraction);

dataPairs.cache();

for (int i = 0; i < 100; i++) {
    PredictionAlgorithm algo = new Algo();
    JavaPairRDD<String, String> trainStratifiedData =
            dataPairs.sampleByKeyExact(withReplacement, classFractions, seed);
    algo.fit(trainStratifiedData);
}
I would like algo.fit() to update the same weights $w_i$ instead of fitting separate weights for every partition.

Spark - does sampleByKeyExact duplicate the data?

When performing sampleByKeyExact on a JavaPairRDD, does Spark store an actual copy of the sampled data, or just pointers into the original JavaPairRDD?
Meaning, if I perform 100 bootstrap samplings of the original dataset, does it keep 100 copies of the original RDD, or 100 different sets of indices pointing into it?
UPDATE:
JavaPairRDD<String, String> dataPairs = ... // Load the data

boolean withReplacement = true;
double testFraction = 0.2;
long seed = 0;

Map<String, Double> classFractions = new HashMap<>();
classFractions.put("1", 1 - testFraction);
classFractions.put("0", 1 - testFraction);

dataPairs.cache();

for (int i = 0; i < 100; i++) {
    PredictionAlgorithm algo = new Algo();
    JavaPairRDD<String, String> trainStratifiedData =
            dataPairs.sampleByKeyExact(withReplacement, classFractions, seed);
    algo.fit(trainStratifiedData);
}
