I have a requirement where I need to union/concat multiple DataFrames. Overall we generate around 14,000 such DataFrames/SQL queries at run time and union them all before writing to Hive. I tried two ways, but both are very slow. Is there any way to optimize the code below or run the queries in parallel?
Note: I need the solution in Spark Java only.
Pseudo code:
1st way:
Dataset<Row> dfUnion = null;
for (int i = 0; i < 14000; i++) {
    String someSql = "select columns from table where conditions (depending on each loop)";
    if (i == 0) {
        dfUnion = spark.sql(someSql);
    } else {
        dfUnion = dfUnion.union(spark.sql(someSql));
    }
}
// pseudo: write the combined result to Hive
dfUnion.write().mode(SaveMode.Overwrite).saveAsTable(...);
2nd way:
for (int i = 0; i < 14000; i++) {
    String someSql = "select columns from table where conditions (depending on each loop)";
    if (i == 0) {
        spark.sql(someSql).write().mode(SaveMode.Overwrite).parquet(...);
    } else {
        spark.sql(someSql).write().mode(SaveMode.Append).parquet(...);
    }
}
// pseudo: read the parquet output back and write it to Hive
Dataset<Row> dfRead = spark.read().parquet(...);
dfRead.write().saveAsTable(...);
Any help would be appreciated.
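For reference, here is a minimal Spark Java sketch of the first approach that builds the per-iteration Datasets into a list and folds them with union; the app name, SQL strings, and Hive table name are placeholders, not the actual queries:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UnionAllExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("union-all-example")        // placeholder app name
                .enableHiveSupport()
                .getOrCreate();

        // Build each per-iteration Dataset first, then combine them in one pass.
        List<Dataset<Row>> parts = new ArrayList<>();
        for (int i = 0; i < 14000; i++) {
            // placeholder query; the real conditions depend on the loop variable
            String someSql = "select columns from table where condition_id = " + i;
            parts.add(spark.sql(someSql));
        }

        // Fold the list with union; every Dataset must have the same schema.
        Dataset<Row> combined = parts.stream()
                .reduce(Dataset::union)
                .orElseThrow(() -> new IllegalStateException("no datasets to union"));

        // placeholder Hive table name
        combined.write().mode("overwrite").saveAsTable("target_db.target_table");
    }
}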
Here is my scenario: I have a list of RDDs called fileNamesList.
List<JavaRDD<Tuple2<String, String>>> fileNamesList = new ArrayList<JavaRDD<Tuple2<String, String>>>();
fileNamesList.add(newRDD); // adding RDDs to the list
I am adding multiple newRDDs to the list inside a loop, so the list grows to at most 10 entries.
I want to combine (union) all the RDDs in fileNamesList. Is it possible to do it like below?
JavaPairRDD<String, String> finalFileNames;
for (int j = 0; j < IdList.size() - 1; j++) {
    finalFileNames = JavaPairRDD.fromJavaRDD(fileNamesList.get(j))
            .union(JavaPairRDD.fromJavaRDD(fileNamesList.get(j + 1)));
}
Or what other option could I use?
Use SparkContext.union or JavaSparkContext.union. It can union many RDDs in a single call, which gives you a much simpler DAG. See RDD.union vs SparkContext.union.
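For example, a minimal sketch applied to the list from the question, assuming a JavaSparkContext named sc and the union(first, rest) overload (the exact Java signature varies across Spark versions, so adjust the call if your version exposes a varargs overload instead):

import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class UnionListOfRdds {
    public static JavaPairRDD<String, String> unionAll(
            JavaSparkContext sc, List<JavaRDD<Tuple2<String, String>>> fileNamesList) {

        // Union the whole list in one call, then convert the result once.
        JavaRDD<Tuple2<String, String>> merged = sc.union(
                fileNamesList.get(0),
                fileNamesList.subList(1, fileNamesList.size()));

        return JavaPairRDD.fromJavaRDD(merged);
    }
}

Unlike a loop of pairwise union calls, this creates a single union over all the RDDs at once.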
I have tried pairing the samples, but it costs a huge amount of memory: 100 samples already lead to 9,900 pairs, which is too costly. What could be a more effective way of computing a distance matrix in a distributed environment in Spark?
Here is a snippet of pseudo code of what I'm trying:
val input = (sc.textFile("AirPassengers.csv",(numPartitions/2)))
val i = input.map(s => (Vectors.dense(s.split(',').map(_.toDouble))))
val indexed = i.zipWithIndex() //Including the index of each sample
val indexedData = indexed.map{case (k,v) => (v,k)}
val pairedSamples = indexedData.cartesian(indexedData)
val filteredSamples = pairedSamples.filter { case (x, y) =>
  x._1.toInt > y._1.toInt // consider only the upper or lower triangle
}
filteredSamples.cache
filteredSamples.count
The above code creates the pairs, but even though my dataset contains only 100 samples, the pairing in filteredSamples results in 4,950 pairs, which could be very costly for big data.
I recently answered a similar question.
Basically, it comes down to computing n(n-1)/2 pairs, which would be 4,950 computations in your example. What makes this approach different, however, is that I use joins instead of cartesian. With your code, the solution would look like this:
val input = (sc.textFile("AirPassengers.csv",(numPartitions/2)))
val i = input.map(s => (Vectors.dense(s.split(',').map(_.toDouble))))
val indexed = i.zipWithIndex()
// including the index of each sample
val indexedData = indexed.map { case (k,v) => (v,k) }
// prepare indices
val count = i.count
val indices = sc.parallelize(for(i <- 0L until count; j <- 0L until count; if i > j) yield (i, j))
val joined1 = indices.join(indexedData).map { case (i, (j, v)) => (j, (i,v)) }
val joined2 = joined1.join(indexedData).map { case (j, ((i,v1),v2)) => ((i,j),(v1,v2)) }
// after that, you can then compute the distance using your distFunc
val distRDD = joined2.mapValues{ case (v1, v2) => distFunc(v1, v2) }
Try this method and compare it with the one you already posted. Hopefully, this can speed up your code a bit.
As far as I can see from checking various sources and the Spark MLlib clustering documentation, Spark does not currently support distance or pdist matrices.
In my opinion, 100 samples will always produce at least 4,950 distance values, so manually building a distributed matrix solver using a transformation (like .map) would be the best solution.
This can serve as the Java version of jtitusj's answer:
public JavaPairRDD<Tuple2<Long, Long>, Double> getDistanceMatrix(Dataset<Row> ds, String vectorCol) {
    JavaRDD<Vector> rdd = ds.toJavaRDD().map(new Function<Row, Vector>() {
        private static final long serialVersionUID = 1L;

        public Vector call(Row row) throws Exception {
            return row.getAs(vectorCol);
        }
    });

    // Collect the vectors to the driver (only feasible for small datasets).
    List<Vector> vectors = rdd.collect();
    long count = ds.count();

    List<Tuple2<Tuple2<Long, Long>, Double>> distanceList =
            new ArrayList<Tuple2<Tuple2<Long, Long>, Double>>();
    for (long i = 0; i < count; i++) {
        for (long j = 0; j < i; j++) { // lower triangle only
            Tuple2<Long, Long> indexPair = new Tuple2<Long, Long>(i, j);
            double d = DistanceMeasure.getDistance(vectors.get((int) i), vectors.get((int) j));
            distanceList.add(new Tuple2<Tuple2<Long, Long>, Double>(indexPair, d));
        }
    }

    // Parallelize the result so the method actually returns a JavaPairRDD.
    JavaSparkContext jsc = JavaSparkContext.fromSparkContext(ds.sparkSession().sparkContext());
    return jsc.parallelizePairs(distanceList);
}
When performing sampleByKeyExact on a JavaPairRDD, does Spark save an actual copy of the data or pointers to the JavaPairRDD?
Meaning, if I perform bootstrap sampling of the original dataset 100 times, does Spark keep 100 copies of the original RDD, or 100 different sets of indices pointing into it?
UPDATE:
JavaPairRDD<String, String> dataPairs = ... // load the data

boolean withReplacement = true;
double testFraction = 0.2;
long seed = 0;

Map<String, Double> classFractions = new HashMap<String, Double>();
classFractions.put("1", 1 - testFraction);
classFractions.put("0", 1 - testFraction);

dataPairs.cache();

for (int i = 0; i < 100; i++) {
    PredictionAlgorithm algo = new Algo(); // pseudo: some prediction algorithm
    JavaPairRDD<String, String> trainStratifiedData =
            dataPairs.sampleByKeyExact(withReplacement, classFractions, seed);
    algo.fit(trainStratifiedData);
}
I have a pretty simple question to which I can't find an answer on the Internet or on Stack Overflow:
Is the number of parameters in the IN operator in Cassandra limited?
I have made some tests with a simple table with integer keys from 1 to 100,000. If I put the keys from 0 to 1,000 in my IN operator (like SELECT * FROM test.numbers WHERE id IN (0,..,1000)) I get the correct number of rows back. But, for example, for 0 to 100,000 I always get only 34,464 rows back, and for 0 to 75,000 it's 9,464.
I am using the DataStax Java Driver 2.0, and the relevant code parts look like the following:
String query = "SELECT * FROM test.numbers WHERE id IN ?;";
PreparedStatement ps = iot.getSession().prepare(query);
bs = new BoundStatement(ps);
List<Integer> ints = new ArrayList<Integer>();
for (int i = 0; i < 100000; i++) {
ints.add(i);
}
bs.bind(ints);
ResultSet rs = iot.getSession().execute(bs);
int rowCount = 0;
for (Row row : rs) {
rowCount++;
}
System.out.println("counted rows: " + rowCount);
It's also possible that I'm binding the list of Integers the wrong way; if so, I would appreciate any hints on that too.
I am using Cassandra 2.0.7 with CQL 3.1.1.
This is not a real limitation but a PreparedStatement one.
Using a BuiltStatement and QueryBuilder, I didn't have any of these problems.
Try it yourself:
List<Integer> l = new ArrayList<>();
for (int i = 0; i < 100000; i++) {
    l.add(i);
}

// in(...) is statically imported from QueryBuilder
BuiltStatement bs = QueryBuilder.select().column("id")
        .from("test.numbers")
        .where(in("id", l.toArray()));

ResultSet rs = Cassandra.DB.getSession().execute(bs);
System.out.println("counted rows: " + rs.all().size());
HTH,
Carlo