Are GraphFrames compatible with typed Datasets?

We currently use typed Datasets in our work, and we are exploring GraphFrames.
However, GraphFrames appear to be based on DataFrame, which is Dataset[Row]. Would GraphFrames be compatible with a typed Dataset, e.g. Dataset[Person]?

GraphFrames support only DataFrames. To use a statically typed Dataset you have to convert it to a DataFrame, apply the graph operations, and convert the result back to the statically typed structure.
You can follow this issue: https://github.com/graphframes/graphframes/issues/133

Related

Spark Datasets available in Python?

Here, it is stated:
..you can create Datasets within a Scala or Python..
while here, the following is stated:
Python does not have the support for the Dataset API
Are Datasets available in Python?
Perhaps the question is about Typed Spark Datasets.
If so, then the answer is no.
The typed Spark Datasets mentioned above are only available in Scala and Java.
In the Python implementation of Spark (PySpark) you have to choose between DataFrames, the preferred option, and RDDs.
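A quick sketch of the two options, purely for illustration (there is no PySpark equivalent of a typed Dataset[Person]):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# A PySpark DataFrame is always an untyped collection of Row objects;
# there is no .as[Person] step like in Scala
df = spark.createDataFrame([Row(name="Alice", age=34), Row(name="Bob", age=45)])

# The lower-level alternative is a plain RDD of Rows
rdd = df.rdd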
Reference:
RDD vs. DataFrame vs. Dataset

Is it possible to load data on Spark workers directly into Apache Arrow in-memory format without first loading it into Spark's in-memory format?

We have a use case for doing a large number of vector multiplications and summing the results such that the input data typically will not fit into the RAM of a single host, even if using 0.5 TB RAM EC2 instances (fitting OLS regression models). Therefore we would like to:
Leverage PySpark for Spark's traditional capabilities (distributing the data, handling worker failures transparently, etc.)
But also leverage C/C++-based numerical computing for doing the actual math on the workers
The leading path seems to be to leverage Apache Arrow with PySpark, and use Pandas functions backed by NumPy (in turn written in C) for the vector products. However, I would like to load the data directly to Arrow format on Spark workers. The existing PySpark/Pandas/Arrow documentation seems to imply that the data is in fact loaded into Spark's internal representation first, then converted into Arrow when Pandas UDFs are called: https://spark.apache.org/docs/3.0.1/sql-pyspark-pandas-with-arrow.html#setting-arrow-batch-size
I found one related paper in which the authors developed a zero-copy Arrow-based interface for Spark, so I take it that this is highly custom and not currently supported in Spark: https://users.soe.ucsc.edu/~carlosm/dev/publication/rodriguez-arxiv-21/rodriguez-arxiv-21.pdf
I would like to ask if anyone knows of a simple way other than what is described in this paper. Thank you!
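For reference, the standard path I am describing above looks roughly like this (a minimal sketch with made-up column names): the data is materialised in Spark's own in-memory format first, and only converted to Arrow record batches when the pandas UDF is invoked.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# The data is first materialised in Spark's own internal representation...
df = spark.range(1_000_000).selectExpr("rand() as x", "rand() as y")

@pandas_udf("double")
def product(x: pd.Series, y: pd.Series) -> pd.Series:
    # ...and only arrives here as Arrow-backed pandas Series,
    # so the multiplication itself runs in NumPy/C on each worker
    return x * y

df.select(product("x", "y").alias("xy")).agg({"xy": "sum"}).show()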

How to convert csv to RDD and use RDD in pyspark for some detection?

I'm currently working on research into heart disease detection and want to use Spark to process big data, as it is part of my solution. But I'm having difficulty using Spark with Python: I can convert a CSV file to an RDD, but then I don't understand how to work with the RDD to implement classification algorithms like kNN, logistic regression, etc.
I would really appreciate it if anyone could help me in any way.
I have tried to learn PySpark from the internet, but there are very few code examples available, and those that exist are either too simple or too hard to understand. I cannot find a proper example of classification in PySpark.
To read the CSV into a DataFrame you can just call spark.read.option('header', 'true').csv('path/to/csv').
The DataFrame will contain the columns and rows of your CSV, and you can convert it into an RDD of Rows with df.rdd.
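For example, a minimal end-to-end sketch (heart.csv and the column names age, chol and target are placeholders for your own file and columns):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Read the CSV into a DataFrame; inferSchema turns numeric columns into numbers
df = spark.read.option('header', 'true').option('inferSchema', 'true').csv('heart.csv')

# The same data as an RDD of Row objects, if you really need the RDD API
rows = df.rdd

# For classification it is usually easier to stay with DataFrames and pyspark.ml
assembler = VectorAssembler(inputCols=['age', 'chol'], outputCol='features')
train = assembler.transform(df).select('features', df['target'].cast('double').alias('label'))
model = LogisticRegression(maxIter=40).fit(train)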

Can Spark and the ScalaNLP library Breeze be used together?

I'm developing a Scala-based extreme learning machine in Apache Spark. My model has to be a Spark Estimator and use the Spark framework in order to fit into the machine learning pipeline. Does anyone know if Breeze can be used in tandem with Spark? All of my data is in Spark DataFrames, and conceivably I could import it using Breeze, use Breeze DenseVectors as the data structure, then convert back to a DataFrame for the Estimator part. The advantage of Breeze is that it has a function pinv for the Moore-Penrose pseudo-inverse, which is an inverse for a non-square matrix. There is no equivalent function in Spark MLlib, as far as I can see. I have no idea whether it's possible to convert Breeze tensors to Spark DataFrames, so if anyone has experience of this it would be really useful. Thanks!
Breeze can be used with Spark. In fact, it is used internally by many MLlib functions, but the required conversions are not exposed as public API. You can add your own conversions and use Breeze to process individual records.
For example, for Vectors you can find the conversion code in:
SparseVector.asBreeze
DenseVector.asBreeze
Vector.fromBreeze
For Matrices, see asBreeze / fromBreeze in Matrices.scala.
It cannot, however, be used on distributed data structures. Breeze objects use low-level libraries that cannot be used for distributed processing. Therefore DataFrame-to-Breeze conversions are possible only if you collect the data to the driver, and they are limited to scenarios where the data fits in driver memory.
There are other libraries, like Apache SystemML, that integrate with Spark and provide more comprehensive linear algebra routines on distributed objects.

Does spark support matrices?

Most algorithms that use matrix operations in Spark have to use either Vectors or store their data in a different way. Is there support for building matrices directly in Spark?
Apache recently released Spark 1.0, which has support for creating matrices in Spark, a really appealing idea. Right now it is in an experimental phase and supports only a limited set of operations on the matrices you create, but this is sure to grow in future releases. The idea of matrix operations being performed at the speed of Spark is amazing.
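For illustration, in current PySpark the matrix types live in pyspark.mllib.linalg; a minimal sketch, assuming sc is an existing SparkContext:

from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import RowMatrix

# Local dense matrix (values are listed in column-major order)
local = Matrices.dense(2, 2, [1.0, 2.0, 3.0, 4.0])

# Distributed matrix backed by an RDD of rows
rows = sc.parallelize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mat = RowMatrix(rows)
print(mat.numRows(), mat.numCols())  # 3 2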
The way I use matrices in Spark is through Python with NumPy and SciPy. I pull the data into the matrices from a CSV file and use them as needed, treating the matrices the same as I would in plain Python/SciPy. It is how you parallelize the data that makes it slightly different.
Something like this:
from pyspark.mllib.regression import LabeledPoint

data = []
for i in range(na + 2):  # na, b, A and wa are the NumPy arrays loaded from the CSV
    data.append(LabeledPoint(b[i], A[i, :]))
model = WhatYouDo.train(sc.parallelize(data), iterations=40, step=0.01, initialWeights=wa)
The pain was getting NumPy and SciPy onto the Spark nodes. I found the best way to make sure all the other libraries and files needed were included was to use:
sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose
