Apache Spark supports sparse data.
For example, we can use MLUtils.loadLibSVMFile(...) to load data into an RDD.
I was wondering how Spark deals with those missing values.
Spark creates an RDD of LabeledPoints, and each labeled point has a label and a vector of features. Note that this is a Spark Vector, which does support sparse data (currently, sparse vectors are represented by an array of the indices of the non-zero entries and a second array of doubles holding each of the non-zero values).
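For concreteness, here is a minimal sketch (the object name, file path, and master setting are placeholders, not from the original post): loadLibSVMFile produces an RDD[LabeledPoint], and a sparse vector stores only the indices and values of its non-zero entries, so missing (zero) features simply never appear in those arrays.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SparseLoadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sparse-load").setMaster("local[*]"))

    // "data/sample.libsvm" is a placeholder path.
    val points: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/sample.libsvm")
    points.take(1).foreach(p => println(s"${p.label} -> ${p.features}"))

    // A sparse vector keeps an array of non-zero indices and an array of their values;
    // everything not listed is implicitly 0.0.
    val sv = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
    println(sv)                          // (5,[1,3],[2.0,4.0])
    println(Vectors.dense(sv.toArray))   // [0.0,2.0,0.0,4.0,0.0]

    sc.stop()
  }
}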
I want to understand the behavior of DF.intersect().
So the question came to mind, especially when we have complex Rows with complex, deeply nested fields.
If we are talking about the DataFrame intersect transformation then, according to the Dataset documentation and source, the comparison is done directly on the encoded content, which is as deep as it can possibly go.
def intersect(other: Dataset[T]): Dataset[T]

Returns a new Dataset containing rows only in both this Dataset and another Dataset. This is equivalent to INTERSECT in SQL.

Since: 1.6.0

Note: Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
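As a hedged illustration of that "deep" comparison (the case class names and values below are invented for the example), intersect on a Dataset with nested fields keeps a row only when every field, including the nested ones, matches a row in the other Dataset:

import org.apache.spark.sql.SparkSession

case class Inner(x: Int, y: String)
case class Outer(id: Long, inner: Inner)

object IntersectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("intersect-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val left  = Seq(Outer(1L, Inner(1, "a")), Outer(2L, Inner(2, "b"))).toDS()
    val right = Seq(Outer(2L, Inner(2, "b")), Outer(3L, Inner(3, "c"))).toDS()

    // Equality is checked on the encoded rows, so only Outer(2L, Inner(2, "b"))
    // survives: the id and every nested field have to match.
    left.intersect(right).show()

    spark.stop()
  }
}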
This is a follow-up to my previous question.
A Row is an ordered set of key-value pairs. A DataFrame is a collection of Rows.
What data structure is a DataFrame, actually? Is it a list, a set, or some other "collection"? Is it a relation as in SQL?
It's an abstraction over an RDD[Row] (or Dataset[Row] in Spark 2), with a defined schema set through a series of Column classes.
Is it a list, set, or other "collection" ?
Not in the Java sense of those words. It is similar to how an RDD is none of those, but rather a "lazy collection".
Is it a relation as in SQL ?
You're welcome to run Spark SQL over a DataFrame, but it's a table. Relations are optional.
Although a DataFrame is an abstraction over an RDD, the internal representation of a DataFrame is quite different from that of an RDD.
An RDD is represented as Java objects and uses the JVM for all operations. A DataFrame, however, is represented in Tungsten's binary format.
Here is an excellent article which elaborates on how DataFrames are represented in Tungsten.
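To make the "RDD[Row] plus a schema" point concrete, here is a small sketch (the column names and values are invented for the example): createDataFrame glues an RDD[Row] to a StructType schema, and df.rdd exposes the underlying RDD[Row] again.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object DataFrameFromRowsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-from-rows").master("local[*]").getOrCreate()

    // The rows themselves carry no column names; the schema supplies them.
    val rows = spark.sparkContext.parallelize(Seq(Row(1, "alice"), Row(2, "bob")))
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true)))

    val df = spark.createDataFrame(rows, schema)  // a Dataset[Row] with the given schema
    df.printSchema()
    df.show()

    println(df.rdd.first())                       // back to a plain Row
    spark.stop()
  }
}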
Is it possible to create a distributed BlockMatrix containing single-precision entries in Spark?
From what I gather from the documentation, the Scala/Java implementation of BlockMatrix requires an mllib.Matrix object, which holds the values as doubles.
Is there any way around this limitation?
Background:
I'm using GPUs to accelerate Spark's distributed matrix multiplication routines, and my GPU is about 20 times slower when multiplying double-precision matrices than single-precision matrices.
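For reference, here is a minimal sketch of the construction path the question refers to (the block layout and values are arbitrary): every block of a BlockMatrix is an mllib.Matrix, and the mllib.linalg factory methods only accept Array[Double], which is where the double-precision constraint comes from.

import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix
import org.apache.spark.{SparkConf, SparkContext}

object BlockMatrixSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("blockmatrix-sketch").setMaster("local[*]"))

    // Each block is an mllib.Matrix backed by Array[Double]; mllib.linalg offers
    // no single-precision (Float) variant of these factory methods.
    val blocks = sc.parallelize(Seq(
      ((0, 0), Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))),
      ((1, 0), Matrices.dense(2, 2, Array(5.0, 6.0, 7.0, 8.0)))))

    val mat = new BlockMatrix(blocks, rowsPerBlock = 2, colsPerBlock = 2)
    println(s"${mat.numRows()} x ${mat.numCols()}")
    sc.stop()
  }
}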
Is there any function or method that calculates a dissimilarity matrix for a given data set? I've found all-pairs similarity via DIMSUM, but it looks like it works for sparse data only. Mine is really dense.
Even though the original DIMSUM paper talks about a matrix where:
each dimension is sparse with at most L nonzeros per row
and whose values satisfy:
the entries of A have been scaled to be in [−1, 1]
This is not a requirement, and you can run it on a dense matrix. In fact, if you check the sample code by the DIMSUM author on the Databricks blog, you'll notice that the RowMatrix is created from an RDD of dense vectors:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Load and parse the data file.
val rows = sc.textFile(filename).map { line =>
  val values = line.split(' ').map(_.toDouble)
  Vectors.dense(values)
}
val mat = new RowMatrix(rows)
Similarly, the comment in the CosineSimilarity Spark example describes the input as a dense matrix which is not scaled.
Be aware that the only available method is columnSimilarities(), which calculates similarities between columns. Hence, if your input data file is structured so that one record corresponds to one row, you will have to transpose the matrix first and then run the similarity. To answer your question: no, there is no transpose on RowMatrix; other matrix types in MLlib do have that feature, so you would have to do some conversions first.
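As a hedged sketch of that transpose route (using a tiny hard-coded matrix and assuming record = row): one way is to go IndexedRowMatrix -> BlockMatrix -> transpose -> back to a RowMatrix, and then call columnSimilarities() on the result.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
import org.apache.spark.{SparkConf, SparkContext}

object RowSimilaritySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("row-sim").setMaster("local[*]"))

    // One record per row of the dense data matrix.
    val rows = sc.parallelize(Seq(
      IndexedRow(0, Vectors.dense(1.0, 2.0, 3.0)),
      IndexedRow(1, Vectors.dense(4.0, 5.0, 6.0)),
      IndexedRow(2, Vectors.dense(7.0, 8.0, 9.0))))

    // RowMatrix has no transpose, but BlockMatrix does:
    // IndexedRowMatrix -> BlockMatrix -> transpose -> back to a RowMatrix.
    val transposed: RowMatrix = new IndexedRowMatrix(rows)
      .toBlockMatrix()
      .transpose
      .toIndexedRowMatrix()
      .toRowMatrix()

    // After the transpose, columnSimilarities() effectively compares the original rows.
    transposed.columnSimilarities().entries.collect().foreach(println)
    sc.stop()
  }
}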
Row similarity is in the works but, unfortunately, did not make it into the newest release, Spark 1.5.
As for other options, you would have to implement them yourself. The naive brute-force solution, which requires O(mL^2) shuffles, is very easy to implement (cartesian plus your similarity measure of choice) but performs very badly (speaking from experience).
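A rough sketch of that brute-force approach (cosine is just one possible choice of measure, and the helper names are made up for the example):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object NaiveAllPairsSketch {
  // Plain cosine similarity between two vectors.
  def cosine(a: Vector, b: Vector): Double = {
    val (xs, ys) = (a.toArray, b.toArray)
    val dot = xs.zip(ys).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(xs.map(x => x * x).sum) * math.sqrt(ys.map(y => y * y).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Exact all-pairs similarity via cartesian: trivially correct, but the shuffle
  // grows quadratically with the number of rows, which is why it performs so badly.
  def allPairs(rows: RDD[(Long, Vector)]): RDD[((Long, Long), Double)] =
    rows.cartesian(rows)
      .filter { case ((i, _), (j, _)) => i < j }  // keep each unordered pair once
      .map { case ((i, a), (j, b)) => ((i, j), cosine(a, b)) }
}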
You can also have a look at a different algorithm by the same author, called DISCO, but it's not implemented in Spark (and the paper also assumes L-sparsity).
Finally, be advised that both DIMSUM and DISCO compute estimates (although extremely good ones).
In Spark, it is possible to compose multiple RDDs into one using zip, union, join, etc.
Is it possible to decompose an RDD efficiently, namely without performing multiple passes over the original RDD? What I am looking for is something similar to:
val rdd: RDD[T] = ...
val grouped: Map[K, RDD[T]] = rdd.specialGroupBy(...)
One of the strengths of RDDs is that they enable performing iterative computations efficiently. In some (machine learning) use cases I encountered, we need to perform iterative algorithms on each of the groups separately.
The current possibilities I am aware of are:
GroupBy: groupBy returns an RDD[(K, Iterable[T])], which does not give you the RDD benefits on the group itself (the Iterable).
Aggregations: reduceByKey, foldByKey, etc. perform only one "iteration" over the data and do not have the expressive power to implement iterative algorithms.
Creating separate RDDs using the filter method, with multiple passes over the data (one pass per key), which is not feasible when the number of keys is not very small (a sketch follows this list).
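A hypothetical sketch of that filter-based decomposition (splitByKey is an invented helper name); note that every returned RDD still triggers its own pass over the cached parent when an action runs on it:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

object FilterPerKeySketch {
  // One filter pass per key; only reasonable when `keys` is small.
  def splitByKey[K: ClassTag, T: ClassTag](rdd: RDD[(K, T)], keys: Seq[K]): Map[K, RDD[T]] = {
    rdd.cache()  // avoid recomputing the parent lineage on every pass
    keys.map(k => k -> rdd.filter { case (key, _) => key == k }.map(_._2)).toMap
  }
}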
Some of the use cases I am considering, given a very large (tabular) dataset, are:
We wish to execute some iterative algorithm on each of the columns separately, for example some automated feature extraction. A natural way to do so would have been to decompose the dataset such that each column is represented by a separate RDD.
We wish to decompose the dataset into disjoint datasets (for example, a dataset per day) and execute some machine learning modeling on each of them.
I think the best option is to write out the data in a single pass to one file per key (see Write to multiple outputs by key Spark - one Spark job) and then load the per-key files into one RDD each.
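Here is a hedged sketch of that approach, following the MultipleTextOutputFormat pattern from the linked answer (the paths, keys, and class names are placeholders): each key's records land in their own directory in a single save, and each directory can then be read back as its own RDD.

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

// Routes each record to an output directory named after its key.
class KeyAsDirOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    s"${key.toString}/$name"
  // Drop the key from the written line so only the value is stored.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

object DecomposeByKeySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("decompose-by-key").setMaster("local[*]"))
    val data = sc.parallelize(Seq(("a", "1"), ("b", "2"), ("a", "3")))

    // Single pass over the data: one sub-directory per key under /tmp/by-key.
    data.saveAsHadoopFile("/tmp/by-key", classOf[String], classOf[String], classOf[KeyAsDirOutput])

    // Reload each key's files as an independent RDD.
    val grouped: Map[String, RDD[String]] =
      Seq("a", "b").map(k => k -> sc.textFile(s"/tmp/by-key/$k")).toMap

    grouped("a").collect().foreach(println)  // "1" and "3" (order not guaranteed)
    sc.stop()
  }
}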