Combining RDD's in a loop - apache-spark

Here is my scenario:
I have a list of RDDs called fileNamesList.
List<JavaRDD<Tuple2<String, String>>> fileNamesList = new ArrayList<JavaRDD<Tuple2<String, String>>>();
fileNamesList.add(newRDD); // adding RDDs to the list
I am adding multiple newRDDs to the list inside a loop, so the list grows to a maximum of 10 entries.
I want to combine (union) all the RDDs inside the list fileNamesList. Is it possible to do it like below?
JavaPairRDD<String, String> finalFileNames;
for (int j = 0; j < IdList.size() - 1; j++) {
    finalFileNames = JavaPairRDD.fromJavaRDD(fileNamesList.get(j))
            .union(JavaPairRDD.fromJavaRDD(fileNamesList.get(j + 1)));
}
Or what other option could I use?

Use SparkContext.union or JavaSparkContext.union. It can union many RDDs at once, which results in a much simpler DAG. See RDD.union vs SparkContext.union.
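A minimal sketch of that approach, assuming a JavaSparkContext named jsc and a recent Spark version where JavaSparkContext.union takes varargs (older releases take a first RDD plus a List instead); the variable name and version are assumptions, not from the question:
// Union the whole list in one call instead of pairwise in a loop.
@SuppressWarnings("unchecked")
JavaRDD<Tuple2<String, String>> combined =
        jsc.union(fileNamesList.toArray(new JavaRDD[0]));
// Convert to a pair RDD once, after the union.
JavaPairRDD<String, String> finalFileNames = JavaPairRDD.fromJavaRDD(combined);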

Related

Looking for an Excel COUNTA equivalent for a DataTable VB.Net

I'm trying to find an equivalent to the Excel CountA function for a DataTable.
'This code works for searching through a range of columns in Excel
If xlApp.WorksheetFunction.CountA(WS.Range("A" & i & ":G" & i)) > 0 Then
    DataExists = True
End If
'This is the code I need help with for searching through a DataTable
If DataTbl.Rows(i).Item(0:6).ToString <> "" Then
    DataExists = True
End If
Hoping someone can help with this.
I think you simply need a for-each loop.
internal static int CountForEach(this DataTable? dt)
{
    if (dt == null)
        return 0;

    int count = 0;
    foreach (DataRow row in dt.Rows)
        foreach (object? o in row.ItemArray)
            if (o != DBNull.Value)
                count++;
    return count;
}
Usage:
DataTable dt = GetYourDataTable();
int countValues = dt.CountForEach();
This is also doable with LINQ but I think it would be slower -- I'll run some benchmarks later and update my answer.
EDIT
I added these two LINQ methods:
internal static int CountLinqList(this DataTable? dt)
{
    int count = 0;
    dt?.Rows.Cast<DataRow>().ToList()
        .ForEach(row => count += row.ItemArray.Where(g => g != DBNull.Value).Count());
    return count;
}
internal static int CountLinqParallel(this DataTable? dt)
{
    ConcurrentBag<int> ints = new();
    dt?.AsEnumerable().AsParallel()
        .ForAll(row => ints.Add(row.ItemArray.Where(g => g != DBNull.Value).Count()));
    int count = ints.Sum();
    return count;
}
The benchmarks were run with BenchmarkDotNet on a pseudo-randomly generated DataTable of around 5.5 million rows and three columns. These results may change with larger DataTables, but for smaller ones (around 500k rows and less) the fastest method will probably be the simple for-each loop.
Fastest methods, from fastest to slowest:
1. for-each loop
2. LINQ parallel
3. LINQ list + ForEach
I'm certainly not a LINQ guru, but I'd like to be, so if someone has a better LINQ implementation please let me know. That said, I don't think this is a typical LINQ use case anyway.

how to write >1 file from a partition

First I wanted to split a partition by a fixed size so I can update the file system. For example, if a partition's data size is 200 MB (each row in the partition RDD can be a different size), I want to write 4 files from that partition, each about 50 MB, while avoiding a shuffle. Is it possible to do that without a repartition or a coalesce, which would cause a shuffle? Since I don't have a fixed row size, I can't really use the maxRecordsPerFile Spark config.
The next option is to repartition the entire dataset, causing a shuffle. To calculate the size I did the following, but it fails with: "Local variable numOfPartitions defined in an enclosing scope must be final or effectively final". What am I doing wrong? How can I fix this code?
...
int numOfPartitions = 1;
JavaRDD<String> tgg = xmlDataSet.toJavaRDD().mapPartitions(xmlRows -> {
    long totalSize = 0;
    List<String> strLst = new ArrayList<String>();
    while (xmlRows.hasNext()) {
        String xmlString = blobToString(xmlRows.next());
        totalSize = totalSize + xmlString.getBytes().length;
        strLst.add(xmlString);
        if (totalSize > 10000) {
            numOfPartitions++;
        }
    }
    return strLst.iterator();
});
...
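The compiler error is plain Java: a local variable captured by a lambda must be effectively final, so numOfPartitions cannot be incremented inside mapPartitions. A minimal sketch of one workaround, computing the total size with a separate action and deriving the partition count on the driver (the 50 MB target, the variable names, and the output path are illustrative assumptions; blobToString is the question's own helper; this still shuffles, like the second option in the question):
// Compute the total size first, then derive the number of partitions/files.
JavaRDD<String> xmlStrings = xmlDataSet.toJavaRDD()
        .map(row -> blobToString(row));          // blobToString from the question

long totalBytes = xmlStrings
        .map(s -> (long) s.getBytes().length)
        .reduce(Long::sum);                      // no mutable captured local needed

long targetBytesPerFile = 50L * 1024 * 1024;     // ~50 MB per output file (assumption)
int numOfPartitions = (int) Math.max(1, totalBytes / targetBytesPerFile);

xmlStrings.repartition(numOfPartitions)
        .saveAsTextFile("/path/to/output");      // placeholder path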

Spark Java union/concat Multiple Dataframe/sql in loop

I have a requirement where I want to union/concat multiple dataframes. Overall we have around 14,000 such dataframes/SQL queries, which we generate at run time and then union all before writing to Hive. I tried two ways, but both are very slow. Is there any way to optimize the code below or run the queries in parallel?
Note: I need the solution in Spark Java only.
Pseudo code
1st way:
Dataset<Row> dfunion = null;
for (int i = 0; i <= 14000; i++) {
    String somesql = "select columns from table where conditions(depending on each loop)";
    if (i == 0) {
        dfunion = spark.sql(somesql);
    } else {
        dfunion = dfunion.union(spark.sql(somesql));
    }
}
dfunion.writetohive
2nd way:
for (int i = 0; i <= 14000; i++) {
String somesql = "select columns from table where conditions(depending on each loop)"
if (i == 1)
spark.sql(somesql).write.mode(overwrite).parquet;
else {
spark.sql(somesql).write.mode(append).parquet;
}
}
Dataset dfread = spark.read.parquet().writetohive;
Any help would be appreciated.
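For reference, a minimal sketch of the first approach restructured as a list plus a single reduce; buildSql(i), the SparkSession variable spark, and the target table name are placeholders, not from the question. Note this produces the same union-chain plan, so it mainly tidies the loop rather than guaranteeing a speedup:
List<Dataset<Row>> parts = new ArrayList<>();
for (int i = 0; i < 14000; i++) {
    parts.add(spark.sql(buildSql(i)));   // buildSql(i) stands for the per-loop SQL
}

Dataset<Row> all = parts.stream()
        .reduce(Dataset::union)          // same logical plan as the loop-based union
        .orElseThrow(IllegalStateException::new);

all.write().mode(SaveMode.Overwrite).saveAsTable("target_hive_table"); // placeholder table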

How to compute the distance matrix in spark?

I have tried pairing the samples, but it costs a huge amount of memory: 100 samples already lead to 9,900 pairs, which is costly. What would be a more effective way of computing a distance matrix in a distributed environment in Spark?
Here is a snippet of the pseudo code I'm trying:
val input = sc.textFile("AirPassengers.csv", numPartitions / 2)
val i = input.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = i.zipWithIndex() // including the index of each sample
val indexedData = indexed.map { case (k, v) => (v, k) }
val pairedSamples = indexedData.cartesian(indexedData)
val filteredSamples = pairedSamples.filter { case (x, y) =>
  x._1.toInt > y._1.toInt // consider only the upper or lower triangle
}
filteredSamples.cache
filteredSamples.count
The above code creates the pairs, but even though my dataset contains only 100 samples, pairing them in filteredSamples (above) results in 4,950 pairs, which could be very costly for big data.
I recently answered a similar question.
Basically, it comes down to computing n(n-1)/2 pairs, which would be 4,950 computations in your example. However, what makes this approach different is that I use joins instead of cartesian. With your code, the solution would look like this:
val input = (sc.textFile("AirPassengers.csv",(numPartitions/2)))
val i = input.map(s => (Vectors.dense(s.split(',').map(_.toDouble))))
val indexed = i.zipWithIndex()
// including the index of each sample
val indexedData = indexed.map { case (k,v) => (v,k) }
// prepare indices
val count = i.count
val indices = sc.parallelize(for(i <- 0L until count; j <- 0L until count; if i > j) yield (i, j))
val joined1 = indices.join(indexedData).map { case (i, (j, v)) => (j, (i,v)) }
val joined2 = joined1.join(indexedData).map { case (j, ((i,v1),v2)) => ((i,j),(v1,v2)) }
// after that, you can then compute the distance using your distFunc
val distRDD = joined2.mapValues{ case (v1, v2) => distFunc(v1, v2) }
Try this method and compare it with the one you already posted. Hopefully, this can speed up your code a bit.
As far as I can see from checking various sources and the Spark mllib clustering site, Spark doesn't currently support the distance or pdist matrices.
In my opinion, 100 samples will always produce at least 4,950 values, so manually creating a distributed matrix solver using a transformation (like .map) would be the best solution.
This can serve as the Java version of jtitusj's answer:
public JavaPairRDD<Tuple2<Long, Long>, Double> getDistanceMatrix(Dataset<Row> ds, String vectorCol) {
    JavaRDD<Vector> rdd = ds.toJavaRDD().map(new Function<Row, Vector>() {
        private static final long serialVersionUID = 1L;
        public Vector call(Row row) throws Exception {
            return row.getAs(vectorCol);
        }
    });

    // Note: collect() pulls all vectors to the driver, so this only works for
    // datasets that fit in driver memory.
    List<Vector> vectors = rdd.collect();
    long count = ds.count();

    List<Tuple2<Tuple2<Long, Long>, Double>> distanceList =
            new ArrayList<Tuple2<Tuple2<Long, Long>, Double>>();
    for (long i = 0; i < count; i++) {
        for (long j = 0; j < count && i > j; j++) {
            Tuple2<Long, Long> indexPair = new Tuple2<Long, Long>(i, j);
            // Euclidean distance; substitute any other distance measure here.
            double d = Math.sqrt(Vectors.sqdist(vectors.get((int) i), vectors.get((int) j)));
            distanceList.add(new Tuple2<Tuple2<Long, Long>, Double>(indexPair, d));
        }
    }

    // The declared return type is an RDD, so parallelize the driver-side list.
    JavaSparkContext jsc = JavaSparkContext.fromSparkContext(ds.sparkSession().sparkContext());
    return jsc.parallelizePairs(distanceList);
}

faster min and max of different array components with CouchDb map/reduce?

I have a CouchDB database with a view whose values are paired numbers of the form [x,y]. For documents with the same key, I need (simultaneously) to compute the minimum of x and the maximum of y. The database I am working with contains about 50000 documents. Building the view takes several hours, which seems somewhat excessive. (The keys are themselves length-three arrays.) I show the map and reduce functions below, but the basic question is: how can I speed up this process?
Note that the builtin functions won't work because the values have to be numbers, not length-two arrays. It is possible that I could make two different views (one for min(x) and one for max(y)), but it is unclear to me how to combine them to get both results simultaneously.
My current map function looks basically like
function(doc) {
  emit([doc.a, doc.b, doc.c], [doc.x, doc.y]);
}
and my reduce function looks like
function(keys, values) {
  var x = null;
  var y = null;
  for (i = 0; i < values.length; i++) {
    if (values[i][0] == null) break;
    if (values[i][1] == null) break;
    if (x == null) x = values[i][0];
    if (y == null) y = values[i][1];
    if (values[i][0] < x) x = values[i][0];
    if (values[i][1] > y) y = values[i][1];
  }
  emit([x, y]);
}
Just two more notes. Using Math.min() and Math.max() should be a little faster.
function(keys, values) {
  var x = Infinity,
      y = -Infinity;
  for (var i = 0, v; v = values[i]; i++) {
    x = Math.min(x, v[0]);
    y = Math.max(y, v[1]);
  }
  return [x, y];
}
And if CouchDB is treating the values as strings, it is because you are storing them as strings in the document.
Hope it helps.
This turned out to be a combination of two factors. One is obvious in the code posted above, which uses "emit" where it should use "return".
The other factor is less obvious and was only found by making a smaller version of the database and logging the steps in the reduce function. Although the entries in "values" were meant to be integers, they were being treated by CouchDB as character strings. Using the parseInt function corrected that problem.
After those two fixes, the entire build of the reduced view took about five minutes, so the speed problem evaporated.
Please check http://www.geeksforgeeks.org/archives/4583. This may be extended to your application.
