Spark: convert DataFrame column into vector - apache-spark

I have a DataFrame df with a column named column, and I would like to convert column into a vector (e.g. a DenseVector) so that I can use it in vector and matrix products.
Beware: I don't need a column of vectors; I need a vector object.
How to do this?
I found the VectorAssembler transformer (link), but it doesn't help here: it assembles several DataFrame columns into a single vector column, which is still a DataFrame column; my desired output should instead be a vector object.
About the goal of this question: why am I trying to convert a DF column into a vector? Assume I have a DF with a numerical column and I need to compute a product between a matrix and this column. How can I achieve this? (The same could hold for a numerical DF row.) Any alternative approach is welcome.

How to do it:
from pyspark.ml.linalg import DenseVector  # the import needed for DenseVector
DenseVector(df.select("column_name").rdd.map(lambda x: x[0]).collect())
but it doesn't make sense in any practical scenario.
Spark Vectors are not distributed, so they are only applicable when the data fits in the memory of a single (driver) node. And if that is the case, you wouldn't be using a Spark DataFrame for the processing in the first place.
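To round out the matrix-product goal stated in the question, here is a minimal sketch under the same assumption that the column fits in driver memory (column_name and the matrix shape are illustrative):
import numpy as np
from pyspark.ml.linalg import DenseVector

# Collect the numeric column to the driver and wrap it in a local DenseVector.
values = [row[0] for row in df.select("column_name").collect()]
v = DenseVector(values)

# Illustrative matrix whose number of columns matches the vector length.
m = np.random.rand(3, len(values))

# DenseVector wraps a NumPy array, so the product is a plain local NumPy operation.
product = m.dot(v.toArray())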

Related

How to efficiently perform a lookup from a column in a large spark dataframe into a small (broadcastable) array

I have a smallish (a couple of thousand entries) list/array of pairs of doubles and a very large (> 100 million rows) Spark dataframe. The large dataframe has an integer column which I want to use to index into the smaller list. I want to return a dataframe with all the original columns plus the two related values from the list.
I could obviously create a dataframe from the list and do an inner join, but that seems inefficient: the optimiser doesn't know it only needs to fetch a single pair from the small list, and that it could index directly into the list using the integer column from the large dataframe.
What's the most efficient way of doing this? Happy for answers using any API - Scala, PySpark, SQL, DataFrame or RDD.
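One common pattern for this kind of lookup (just a sketch, not from the original thread; spark, large_df, the idx column, and the pair schema are illustrative assumptions) is to broadcast the small list and index into it from a UDF:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, DoubleType

# Hypothetical small lookup list of (double, double) pairs, broadcast to all executors.
lookup_list = [(0.1, 0.2), (0.3, 0.4)]  # a couple of thousand entries in practice
bc = spark.sparkContext.broadcast(lookup_list)

pair_schema = StructType([
    StructField("first", DoubleType(), False),
    StructField("second", DoubleType(), False),
])

@F.udf(returnType=pair_schema)
def lookup(i):
    # Index directly into the broadcast list with the integer column value.
    return bc.value[i]

result = large_df.withColumn("pair", lookup(F.col("idx"))) \
                 .select("*", "pair.first", "pair.second")
The other standard option is a broadcast join against a small dataframe built from the list; which of the two is faster depends on the data and the Spark version, so it is worth benchmarking both.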

Is there a way to slice dataframe based on index in pyspark?

In Python or R, there are ways to slice a DataFrame by index.
For example, in pandas:
df.iloc[5:10,:]
Is there a similar way in pyspark to slice data based on location of rows?
Short Answer
If you already have an index column (suppose it was called 'id') you can filter using pyspark.sql.Column.between:
from pyspark.sql.functions import col
df.where(col("id").between(5, 10))
If you don't already have an index column, you can add one yourself and then use the code above. You should have some ordering built into your data, based on some other column(s) (orderBy("someColumn")).
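A sketch of adding such an index with row_number over a window (someColumn is assumed to carry the desired ordering):
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Assign a 1-based position to each row based on an existing ordering column.
# Note: a window with no partitioning moves all rows through a single partition,
# so this is only practical for moderately sized data.
w = Window.orderBy("someColumn")
df_indexed = df.withColumn("id", row_number().over(w))

df_indexed.where(col("id").between(5, 10))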
Full Explanation
No, it is not easily possible to slice a Spark DataFrame by index, unless the index is already present as a column.
Spark DataFrames are inherently unordered and do not support random access. (There is no concept of a built-in index as there is in pandas). Each row is treated as an independent collection of structured data, and that is what allows for distributed parallel processing. Thus, any executor can take any chunk of the data and process it without regard for the order of the rows.
Now obviously it is possible to perform operations that do involve ordering (lead, lag, etc.), but these will be slower because they require Spark to shuffle data between the executors. (Shuffling data is typically one of the slowest parts of a Spark job.)
Related/Further Reading
PySpark DataFrames - way to enumerate without converting to Pandas?
PySpark - get row number for each row in a group
how to add Row id in pySpark dataframes
You can convert your Spark dataframe to a Koalas dataframe.
Koalas is a dataframe library by Databricks that provides an almost pandas-like interface to Spark dataframes. See https://pypi.org/project/koalas/
import databricks.koalas as ks
kdf = ks.DataFrame(your_spark_df)
kdf[0:500] # your indexes here

Does it make sense to do vecterindex after stringindex on categorical features?

Say I have a bunch of categorical string columns in my dataframe. Then I do the transform below:
1. StringIndex the columns,
2. then use VectorAssembler to assemble all the transformed columns into one vector feature column,
3. do VectorIndexer on the new vector feature column.
Question: for step 3, does it make sense, or is it duplicated effort? I think step 1 already did the indexing.
Yes, it makes sense if you're going to use a Spark tree-based algorithm (RandomForestClassifier or GBTClassifier) and you have high-cardinality features.
E.g. for the Criteo dataset, StringIndexer would convert the values in a categorical column to integers in the range 1 to 65000. It saves this in the metadata as a NominalAttribute. Then RFClassifier would extract it from the metadata as categorical features.
For tree-based algorithms you have to specify the maxBins parameter, which
Must be >= 2 and >= number of categories in any categorical feature.
A maxBins value that is too high leads to slow performance. To avoid this, you can use VectorIndexer with .setMaxCategories(64), for example. This will treat as categorical only those features that have < 64 unique values.
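For reference, a minimal sketch of the three steps plus setMaxCategories (column names and the threshold are illustrative):
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer

cat_cols = ["cat_a", "cat_b"]  # hypothetical categorical string columns

# 1. StringIndex each categorical column.
indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in cat_cols]

# 2. Assemble the indexed columns into a single vector column.
assembler = VectorAssembler(inputCols=[c + "_idx" for c in cat_cols],
                            outputCol="features_raw")

# 3. VectorIndexer flags low-cardinality features as categorical in the metadata.
vector_indexer = VectorIndexer(inputCol="features_raw", outputCol="features",
                               maxCategories=64)

pipeline = Pipeline(stages=indexers + [assembler, vector_indexer])
features_df = pipeline.fit(df).transform(df)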

Using apply on pandas dataframe with strings without looping over series

I have a pandas DataFrame filled with strings. I would like to apply a string operation to all entries, for example capitalize(). I know that for a Series we can use series.str.capitalize(). I also know that I can loop over the columns of the DataFrame and do this for each column. But I want something more efficient and elegant, without looping. Thanks
use stack + unstack
stack turns a DataFrame with a single-level column index into a Series. You can then apply str.capitalize() and unstack to get back the original shape.
df.stack().str.capitalize().unstack()
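A tiny illustration with made-up data:
import pandas as pd

df = pd.DataFrame({"a": ["foo", "bar"], "b": ["baz", "qux"]})

# stack -> one Series of all values, apply the string method, unstack -> original shape.
print(df.stack().str.capitalize().unstack())
#      a    b
# 0  Foo  Baz
# 1  Bar  Qux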

A more efficient way of getting the nlargest values of a Pyspark Dataframe

I am trying to get the top 5 values of a column of my dataframe.
A sample of the dataframe is given below. In fact the original dataframe has thousands of rows.
Row(item_id=u'2712821', similarity=5.0)
Row(item_id=u'1728166', similarity=6.0)
Row(item_id=u'1054467', similarity=9.0)
Row(item_id=u'2788825', similarity=5.0)
Row(item_id=u'1128169', similarity=1.0)
Row(item_id=u'1053461', similarity=3.0)
The solution I came up with is to sort the whole dataframe and then take the first 5 values (the code below does that).
items_of_common_users.sort(items_of_common_users.similarity.desc()).take(5)
I am wondering if there is a faster way of achieving this.
Thanks
You can use the RDD.top method with a key:
from operator import attrgetter
df.rdd.top(5, attrgetter("similarity"))
There is significant overhead in the DataFrame-to-RDD conversion, but it should be worth it.
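A self-contained usage sketch (the sample rows are taken from the question; spark is assumed to be an existing SparkSession):
from operator import attrgetter
from pyspark.sql import Row

rows = [Row(item_id='2712821', similarity=5.0),
        Row(item_id='1728166', similarity=6.0),
        Row(item_id='1054467', similarity=9.0),
        Row(item_id='2788825', similarity=5.0),
        Row(item_id='1128169', similarity=1.0),
        Row(item_id='1053461', similarity=3.0)]
df = spark.createDataFrame(rows)

# top() keeps only the k largest elements per partition and merges them on the
# driver, instead of performing a full global sort.
top5 = df.rdd.top(5, key=attrgetter("similarity"))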

Resources