how to write >1 file from a partition - apache-spark

First I wanted to split a partition by a prescribed size, so I can control the file layout on the file system. For example, if a partition's data size is 200MB (each row in the partition RDD can be a different size), I want to write 4 files of roughly 50MB each from that partition, while trying to avoid a shuffle. Is it possible to do that without a repartition or a coalesce, which would cause a shuffle? Since I don't have a fixed row size, I can't really use the maxRecordsPerFile Spark config.
The next option is to repartition the entire dataset, causing a shuffle. To calculate the size I did the following, but it fails with: "Local variable numOfPartitions defined in an enclosing scope must be final or effectively final". What am I doing wrong, and how can I fix this code?
...
int numOfPartitions = 1;
JavaRDD<String> tgg = xmlDataSet.toJavaRDD().mapPartitions(xmlRows -> {
    long totalSize = 0;
    List<String> strLst = new ArrayList<String>();
    while (xmlRows.hasNext()) {
        String xmlString = blobToString(xmlRows.next());
        totalSize = totalSize + xmlString.getBytes().length;
        strLst.add(xmlString);
        if (totalSize > 10000) {
            numOfPartitions++;
        }
    }
    return strLst.iterator();
});
...
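The compiler error comes from the Java language rule that a lambda may only capture local variables that are final or effectively final; `numOfPartitions++` violates that. The usual workaround is a one-element array (or `AtomicLong`) holder. Note the caveat for Spark specifically: the increment inside `mapPartitions` runs on executors, so a driver-side counter would not be updated in a real job anyway; the size would have to be aggregated (e.g. via an accumulator or a `map`/`reduce` over partition sizes). A minimal non-Spark sketch of the holder workaround, with a made-up threshold and `blobToString` replaced by a plain string:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

public class PartitionCounter {
    public static void main(String[] args) {
        // numOfPartitions cannot be a plain local int: the lambda below
        // captures it, and captured locals must be effectively final.
        // A single-element array is the standard mutable-holder workaround.
        final long[] numOfPartitions = {1};

        // Simulates the mapPartitions body on one partition's rows.
        Function<Iterator<String>, Iterator<String>> partitionFn = xmlRows -> {
            long totalSize = 0;
            List<String> strLst = new ArrayList<>();
            while (xmlRows.hasNext()) {
                String xmlString = xmlRows.next();
                totalSize += xmlString.getBytes().length;
                strLst.add(xmlString);
                if (totalSize > 10) {   // tiny threshold for the demo
                    totalSize = 0;      // start counting the next chunk
                    numOfPartitions[0]++;
                }
            }
            return strLst.iterator();
        };

        List<String> rows = Arrays.asList("aaaaaa", "bbbbbb", "cccccc", "dddddd");
        partitionFn.apply(rows.iterator());
        System.out.println(numOfPartitions[0]); // prints 3
    }
}
```

One further detail: the original snippet never resets `totalSize` after crossing the threshold, so it would increment the counter on every subsequent row; the reset above is an assumed fix.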

Related

Spark/GraphX program does not utilize CPU and memory

I have a function that takes the neighbors of a node (for the neighbors I use a broadcast variable) and the id of the node itself, and it calculates the closeness centrality for that node. I map each node of the graph to the result of that function. When I open the task manager, the CPU is not utilized at all, as if it is not working in parallel; the same goes for memory. But every node executes the function in parallel, and the data is large and takes time to complete, so it's not that the job doesn't need the resources. Any help is truly appreciated, thank you.
For loading the graph I use val graph = GraphLoader.edgeListFile(sc, path).cache
object ClosenessCentrality {

  case class Vertex(id: VertexId)

  def run(graph: Graph[Int, Float], sc: SparkContext): Unit = {
    // Have to reverse edges and make the graph undirected because it is bipartite
    val neighbors = CollectNeighbors.collectWeightedNeighbors(graph).collectAsMap()
    val bNeighbors = sc.broadcast(neighbors)

    val result = graph.vertices.map(f => shortestPaths(f._1, bNeighbors.value))
    //result.coalesce(1)
    result.count()
  }

  def shortestPaths(source: VertexId, neighbors: Map[VertexId, Map[VertexId, Float]]): Double = {
    val predecessors = new mutable.HashMap[VertexId, ListBuffer[VertexId]]()
    val distances = new mutable.HashMap[VertexId, Double]()
    val q = new FibonacciHeap[Vertex]
    val nodes = new mutable.HashMap[VertexId, FibonacciHeap.Node[Vertex]]()

    distances.put(source, 0)
    for (w <- neighbors) {
      if (w._1 != source)
        distances.put(w._1, Int.MaxValue)
      predecessors.put(w._1, ListBuffer[VertexId]())
      val node = q.insert(Vertex(w._1), distances(w._1))
      nodes.put(w._1, node)
    }

    while (!q.isEmpty) {
      val u = q.minNode
      val node = u.data.id
      q.removeMin()
      // discover paths
      //println("Current node is:" + node + " " + neighbors(node).size)
      for (w <- neighbors(node).keys) {
        //print("Neighbor is " + w)
        val alt = distances(node) + neighbors(node)(w)
        // if (distances(w) > alt) {
        //   distances(w) = alt
        //   q.decreaseKey(nodes(w), alt)
        // }
        // if (distances(w) == alt)
        //   predecessors(w) += node
        if (alt < distances(w)) {
          distances(w) = alt
          predecessors(w) += node
          q.decreaseKey(nodes(w), alt)
        }
      } // for
    }
    distances.values.sum
  }
}
To provide somewhat of an answer to your original question, I suspect that your RDD has only a single partition, and thus uses a single core for processing.
The edgeListFile method has an argument to specify the minimum number of partitions you want.
Also, you can use repartition to get more partitions.
You mentioned coalesce, but by default that only reduces the number of partitions; see this question: Spark Coalesce More Partitions
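As an aside, the shortestPaths method in the question is essentially Dijkstra's algorithm over the broadcast neighbor map. A minimal single-threaded Java sketch (using the JDK PriorityQueue with lazy deletion instead of a Fibonacci heap's decreaseKey; the three-vertex graph is a made-up example) that can be used to sanity-check the per-vertex sums:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class DijkstraSum {

    // Sum of shortest-path distances from `source` to every vertex.
    static double shortestPathSum(Map<Integer, Map<Integer, Double>> neighbors, int source) {
        Map<Integer, Double> dist = new HashMap<>();
        for (Integer v : neighbors.keySet()) {
            dist.put(v, Double.POSITIVE_INFINITY);
        }
        dist.put(source, 0.0);

        // Lazy-deletion Dijkstra: instead of decreaseKey, re-insert the
        // vertex and skip stale entries when they are popped.
        PriorityQueue<double[]> pq = new PriorityQueue<>(Comparator.comparingDouble(e -> e[1]));
        pq.add(new double[]{source, 0.0});
        while (!pq.isEmpty()) {
            double[] top = pq.poll();
            int u = (int) top[0];
            if (top[1] > dist.get(u)) continue; // stale entry, already improved
            for (Map.Entry<Integer, Double> e : neighbors.get(u).entrySet()) {
                double alt = dist.get(u) + e.getValue();
                if (alt < dist.get(e.getKey())) {
                    dist.put(e.getKey(), alt);
                    pq.add(new double[]{e.getKey(), alt});
                }
            }
        }
        double sum = 0.0;
        for (double d : dist.values()) sum += d;
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical weighted undirected triangle: 0-1 (1.0), 1-2 (2.0), 0-2 (4.0).
        Map<Integer, Map<Integer, Double>> g = new HashMap<>();
        g.put(0, new HashMap<>(Map.of(1, 1.0, 2, 4.0)));
        g.put(1, new HashMap<>(Map.of(0, 1.0, 2, 2.0)));
        g.put(2, new HashMap<>(Map.of(0, 4.0, 1, 2.0)));
        System.out.println(shortestPathSum(g, 0)); // 0 + 1 + 3 = 4.0
    }
}
```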

Spark - operate a non-serializable method on every element in a JavaPairRDD

Is it possible to iterate over the elements in the JavaPairRDD but perform some calculation f() on one machine only?
Each element (x,y) in the JavaPairRDD is a record (features and a label), and I would like to call some calculation f(x,y) which can't be serialized.
f() is a self-implemented algorithm that updates some weights $w_i$.
The weights $w_i$ should be the same object for data from all partitions (rather than being broken up into tasks, each of which is operated on by an executor).
Update:
What I'm trying to do is run the algorithm f() 100 times, in such a way that every iteration of the algorithm is single-threaded but the 100 iterations run in parallel on different nodes. At a high level, the code is something like this:
JavaPairRDD<String, String> dataPairs = ... // Load the data

boolean withReplacement = true;
double testFraction = 0.2;
long seed = 0;
Map<String, Double> classFractions = new HashMap<>();
classFractions.put("1", 1 - testFraction);
classFractions.put("0", 1 - testFraction);

dataPairs.cache();

for (int i = 0; i < 100; i++) {
    PredictionAlgorithm algo = new Algo();
    JavaPairRDD<String, String> trainStratifiedData =
            dataPairs.sampleByKeyExact(withReplacement, classFractions, seed);
    algo.fit(trainStratifiedData);
}
I would like algo.fit() to update the same weights $w_i$ instead of fitting separate weights for every partition.
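Outside of Spark, the pattern described above (many independent, single-threaded fits running in parallel, all publishing into one shared weight vector) can be sketched with a plain thread pool. Note this only works inside a single JVM; across a cluster the shared state would need driver-side aggregation or an external store. `fit` and its update rule here are made-up stand-ins:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLongArray;

public class ParallelFits {
    // Shared weights w_i; an atomic array keeps concurrent updates safe.
    static final AtomicLongArray weights = new AtomicLongArray(4);

    // Hypothetical single-threaded "fit": bumps each weight once per run.
    static void fit(int run) {
        for (int i = 0; i < weights.length(); i++) {
            weights.incrementAndGet(i);
        }
    }

    public static void main(String[] args) throws Exception {
        // 100 independent runs, executed in parallel on a bounded pool;
        // each run itself is single-threaded.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int run = 0; run < 100; run++) {
            final int r = run;
            pool.submit(() -> fit(r));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(weights.get(0)); // each of the 100 fits incremented it once
    }
}
```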

Spark - does sampleByKeyExact duplicate the data?

When performing sampleByKeyExact on a JavaPairRDD, does Spark store an actual copy of the data, or pointers into the original JavaPairRDD?
Meaning, if I perform 100 bootstrap samplings of the original dataset, does it keep 100 copies of the original RDD, or 100 different sets of indices pointing into it?
UPDATE:
JavaPairRDD<String, String> dataPairs = ... // Load the data

boolean withReplacement = true;
double testFraction = 0.2;
long seed = 0;
Map<String, Double> classFractions = new HashMap<>();
classFractions.put("1", 1 - testFraction);
classFractions.put("0", 1 - testFraction);

dataPairs.cache();

for (int i = 0; i < 100; i++) {
    PredictionAlgorithm algo = new Algo();
    JavaPairRDD<String, String> trainStratifiedData =
            dataPairs.sampleByKeyExact(withReplacement, classFractions, seed);
    algo.fit(trainStratifiedData);
}

How to ensure garbage collection of unused accumulators?

I have run into a problem where an Accumulator in Spark cannot be garbage-collected.
def newIteration(lastParams: Accumulable[Params, (Int, Int, Int)], lastChosens: RDD[Document], i: Int): Params = {
  if (i == maxIteration)
    return lastParams.value

  val size1: Int = 100
  val size2: Int = 1000

  // each iteration generates a new accumulator
  val params = sc.accumulable(Params(size1, size2))

  // there is a map operation here
  // if I only use lastParams, the result is not updated,
  // but params can solve this problem
  val chosen = data.map {
    case Document(docID, content) =>
      lastParams += (docID, content, -1)
      val newContent = lastParams.localValue.update(docID, content)
      lastParams += (docID, newContent, 1)
      params += (docID, newContent, 1)
      Document(docID, newContent)
  }.cache()
  chosen.count()
  lastChosens.unpersist()

  newIteration(params, chosen, i + 1)
}
The problem is that the memory it allocates keeps growing until it hits the memory limit. It seems that lastParams is not garbage-collected. The RDD and Broadcast classes have an unpersist() method, but I cannot find anything like it for accumulators in the documentation.
Why can't an Accumulable be GC'd automatically, or is there a better solution?
UPDATE (April 22nd, 2016): SPARK-3885 Provide mechanism to remove accumulators once they are no longer used is now resolved.
There's ongoing work to add support for automatically garbage-collecting accumulators once they are no longer referenced. See SPARK-3885 for tracking progress on this feature. Spark PR #4021, currently under review, is a patch for this feature. I expect this to be included in Spark 1.3.0.

apache spark: Need inputs to optimize my code

I am building an application which reads 1.5 GB of data and translates it. My code skeleton is as follows.
// here I pass a list of id's to read 4000 files, form a union RDD of all the records,
// and return it as unionbioSetId
run() {
    JavaRDD<String> unionbioSetId = readDirectory(ctx, groupAID, groupBID);
    JavaRDD<String> temp = unionbioSetId.coalesce(6, false);

    JavaPairRDD<String, Tuple3<Double, Double, Double>> flatRDD = temp
        .flatMapToPair(new PairFlatMapFunction<String, String, String>() {
            public Iterable<Tuple2<String, String>> call(String s) {
                return Arrays.asList(new Tuple2<String, String>(key, value));
            }
        })
        .groupByKey()
        .mapToPair(new PairFunction<Tuple2<String, Iterable<String>>, // input
                String, // K
                Tuple3<Double, Double, Double> // V
                >() {
            public Tuple2<String, Tuple3<Double, Double, Double>> call(
                    Tuple2<String, Iterable<String>> value) {
                // ...
            }
        })
        .filter(new Function<Tuple2<String, Tuple3<Double, Double, Double>>, Boolean>() {
            // ...
        }); // group by key and map to pair, sort by key

    String hadoopOutputPathAsString = directory;
    flatRDD.saveAsTextFile(hadoopOutputPathAsString);
}
///////////////
num of executors: 9
driver memory: 2g
executor memory: 6g
executor cores: 12
My program is running slower than the equivalent map/reduce job (same code skeleton). Can anyone help me optimize the above code skeleton to make it faster?
Don't call coalesce. You don't need fewer partitions, you need more. You have 108 worker cores, but you only use 6 of them if you go with 6 partitions. A rule of thumb is that you want at least 3 * num_executors * cores_per_executor = 324 partitions.
JavaRDD<String> temp = unionbioSetId.repartition(350);
Or just don't change the number of partitions at all. When the files are read, the data is partitioned according to the Hadoop input splits. In many cases this gives a good layout, and you avoid the cost of repartitioning.
Read the files at once instead of reading them separately and then taking their union: sc.textFile("file1,file2,file3,...") or sc.textFile("dir/*"). This may also make a performance difference.
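The rule of thumb from the answer, checked against the posted configuration (9 executors, 12 cores each; the factor of 3 is a heuristic, not a Spark constant):

```java
public class PartitionRule {
    // Heuristic: aim for ~3 tasks per available core, so stragglers
    // don't leave cores idle at the end of a stage.
    static int suggestedPartitions(int numExecutors, int coresPerExecutor) {
        return 3 * numExecutors * coresPerExecutor;
    }

    public static void main(String[] args) {
        int totalCores = 9 * 12;                        // 108 worker cores
        System.out.println(totalCores);                 // prints 108
        System.out.println(suggestedPartitions(9, 12)); // prints 324
    }
}
```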
