I used CountVectorizerModel in Spark ML and got the TF-IDF of the data.
The output column of the DataFrame looks like:
(63709,[0,1,2,3,6,7,8,10,11,13],[0.6095235999680518,0.9946971867717818,0.5151611294911758,0.4371112749198506,3.4968901993588046,0.06806241719930584,1.1156025996012633,3.0425756717399217,0.3760235829400124])
I want to get the top n words that are mapped to this ranking.
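A minimal sketch of one way to recover the top n words, assuming the fitted model is called cv_model (a CountVectorizerModel) and the TF-IDF column is named "tfidf"; those names, and n, are assumptions for illustration:

import heapq

# Vocabulary built by CountVectorizer: position i in the sparse vector
# corresponds to the word vocab[i].
vocab = cv_model.vocabulary

def top_n_words(sparse_vec, n=10):
    # Pair each active index with its weight, keep the n largest weights,
    # and translate the indices back into words via the vocabulary.
    pairs = zip(sparse_vec.indices, sparse_vec.values)
    top = heapq.nlargest(n, pairs, key=lambda p: p[1])
    return [(vocab[int(i)], float(w)) for i, w in top]

# Example for a single row, collected to the driver:
row = df.select("tfidf").first()
print(top_n_words(row["tfidf"], n=5))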
Related
I have a smallish (a couple of thousand entries) list/array of pairs of doubles and a very large (> 100 million rows) Spark DataFrame. The large DataFrame has a column containing an integer which I want to use to index into the smaller list. I want to return a DataFrame with all the original columns plus the two related values from the list.
I could obviously create a DataFrame from the list and do an inner join, but that seems inefficient, as the optimiser doesn't know it only needs to fetch a single pair from the small list and that it can index directly into the list using the integer column from the large DataFrame.
What's the most efficient way of doing this? I'm happy for answers using any API: Scala, PySpark, SQL, DataFrame, or RDD.
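One possible approach, sketched below under assumed names (the small list is pairs, the large DataFrame is large_df, and the integer column is "idx"), is to broadcast the list and look it up in a UDF so no join is needed:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StructType, StructField, DoubleType

# Broadcast the small list of (double, double) tuples to every executor.
pairs_bc = spark.sparkContext.broadcast(pairs)

pair_schema = StructType([
    StructField("first", DoubleType(), False),
    StructField("second", DoubleType(), False),
])

@udf(pair_schema)
def lookup_pair(i):
    # Index directly into the executor-local copy of the list.
    a, b = pairs_bc.value[i]
    return (a, b)

result = (large_df
          .withColumn("pair", lookup_pair(col("idx")))
          .select("*", "pair.first", "pair.second")
          .drop("pair"))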
I'm using this dataset on crop agriculture. To use it for building a neural network, I preprocessed the data with MinMaxScaler, which scales the data between 0 and 1. But my dataset also contains categorical columns, which caused an error during preprocessing. So I tried encoding the categorical columns with OneHotEncoder and LabelEncoder, but I don't understand what to do with them after that.
My aim is to predict "Crop_Damage".
How do I proceed?
Link to the dataset -
https://www.kaggle.com/aniketng21600/crop-damage-information-in-india
You have several options.
You may use one-hot encoding and pass your categorical variable to the network as a one-hot vector.
You may take inspiration from NLP. One-hot vectors are sparse and can get really huge (depending on the number of unique values of your categorical variable). Have a look at techniques such as Word2vec (cat2vec) or GloVe; both aim to turn a categorical element into a dense, meaningful numeric vector.
Besides these two, Keras offers another way to obtain such a numeric vector: the embedding layer. For example, consider a variable Crop_Damage with these values:
Huge
Medium
Little
First you assign unique integer for every unique value of your categorical variable.
Huge = 0
Medium = 1
Little = 2
Then you pass the translated categorical values (the unique integers) to the embedding layer. The embedding layer takes a sequence of unique integers as input and produces a sequence of dense vectors. The values of these vectors are random at first, but during training they are optimized like the regular weights of the neural network. So we can say that during training the neural network builds a vector representation of the categories according to the loss function.
For me, the embedding layer is the easiest way to obtain a good enough vector representation of categorical variables, as in the sketch below. But you can try one-hot encoding first if the accuracy satisfies you.
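A minimal sketch of such an embedding layer in TensorFlow/Keras for a single integer-encoded categorical feature; the sizes, names, and toy data below are illustrative assumptions, not taken from the crop dataset:

import numpy as np
import tensorflow as tf

num_categories = 3   # e.g. Huge = 0, Medium = 1, Little = 2
embedding_dim = 2    # length of the dense vector learned for each category

# Integer codes go in; the embedding layer maps each code to a trainable dense vector.
cat_input = tf.keras.Input(shape=(1,), dtype="int32", name="category")
embedded = tf.keras.layers.Embedding(input_dim=num_categories,
                                     output_dim=embedding_dim)(cat_input)
flat = tf.keras.layers.Flatten()(embedded)
output = tf.keras.layers.Dense(1, activation="sigmoid")(flat)

model = tf.keras.Model(inputs=cat_input, outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Toy training data: category codes and a binary label.
x = np.array([[0], [1], [2], [1]])
y = np.array([1, 0, 0, 0])
model.fit(x, y, epochs=2, verbose=0)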
Here is a one-hot encoder. df is the data frame you are working with, and column is the name of the column you want to encode. prefix is a string that gets prepended to the column names created by pandas get_dummies. The new dummy columns are created and appended to the data frame as new columns, and the original column is then deleted.
There is an excellent series of videos on encoding data frames and other topics on YouTube here.
import pandas as pd

def onehot_encode(df, column, prefix):
    # Work on a copy so the caller's data frame is not modified in place.
    df = df.copy()
    dummies = pd.get_dummies(df[column], prefix=prefix)
    # Append the dummy columns and drop the original categorical column.
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df
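For example, it might be called like this (the column name "Crop_Type" and the prefix are assumptions, not taken from the dataset):

# Hypothetical usage with an assumed column name.
df = onehot_encode(df, "Crop_Type", prefix="crop")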
In Python or R, there are ways to slice a DataFrame by index.
For example, in pandas:
df.iloc[5:10,:]
Is there a similar way in pyspark to slice data based on location of rows?
Short Answer
If you already have an index column (suppose it was called 'id') you can filter using pyspark.sql.Column.between:
from pyspark.sql.functions import col
df.where(col("id").between(5, 10))
If you don't already have an index column, you can add one yourself and then use the code above; your data should have some ordering built in based on one or more other columns (orderBy("someColumn")), as in the sketch below.
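A minimal sketch of adding such an index column, assuming the data can be ordered by a column called "someColumn" (the name is just a placeholder):

from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# Note: a window with no partitionBy moves all rows into a single partition,
# so this can be slow on very large data.
w = Window.orderBy("someColumn")
df_indexed = df.withColumn("id", row_number().over(w) - 1)  # 0-based index

# Rows 5..10 (inclusive) can now be selected by position:
df_indexed.where(col("id").between(5, 10)).show()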
Full Explanation
No, it is not easily possible to slice a Spark DataFrame by index unless the index is already present as a column.
Spark DataFrames are inherently unordered and do not support random access. (There is no concept of a built-in index as there is in pandas). Each row is treated as an independent collection of structured data, and that is what allows for distributed parallel processing. Thus, any executor can take any chunk of the data and process it without regard for the order of the rows.
Now, it is obviously possible to perform operations that do involve ordering (lead, lag, etc.), but these will be slower because they require Spark to shuffle data between the executors. (Shuffling data is typically one of the slowest parts of a Spark job.)
Related/Further Reading
PySpark DataFrames - way to enumerate without converting to Pandas?
PySpark - get row number for each row in a group
how to add Row id in pySpark dataframes
You can convert your Spark DataFrame to a Koalas DataFrame.
Koalas is a DataFrame library by Databricks that gives an almost pandas-like interface to Spark DataFrames. See https://pypi.org/project/koalas/
import databricks.koalas as ks
kdf = ks.DataFrame(your_spark_df)
kdf[0:500] # your indexes here
I have a DataFrame df with a column column and I would like to convert column into a vector (e.g. a DenseVector) so that I can use it in vector and matrix products.
Beware: I don't need a column of vectors; I need a vector object.
How to do this?
I found the VectorAssembler transformer (link), but it doesn't help me, as it converts some DataFrame columns into a vector column, which is still a DataFrame column; my desired output should instead be a vector.
About the goal of this question: why am I trying to convert a DF column into a vector? Assume I have a DF with a numerical column and I need to compute a product between a matrix and this column. How can I achieve this? (The same could hold for a DF numerical row.) Any alternative approach is welcome.
How:
from pyspark.ml.linalg import DenseVector
DenseVector(df.select("column_name").rdd.map(lambda x: x[0]).collect())
but it doesn't make sense in any practical scenario.
Spark Vectors are not distributed, so they are applicable only if the data fits in the memory of a single (driver) node. If that is the case, you wouldn't be using a Spark DataFrame for processing in the first place.
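If the data does fit on the driver, a sketch of the matrix product mentioned in the question could look like this; the column name "column_name" and the matrix M are assumptions for illustration:

import numpy as np
from pyspark.ml.linalg import DenseVector

# Collect the column into a local (non-distributed) vector on the driver.
v = DenseVector(df.select("column_name").rdd.map(lambda x: x[0]).collect())

# Multiply with an arbitrary local matrix whose width matches the vector length.
M = np.random.rand(3, len(v))
product = M @ v.toArray()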
I am running a logistic regression on a data frame, and since the logistic regression function in Spark does not take in categorical variables, I am transforming them.
I am using the StringIndexer transformer.
indexer=StringIndexer(inputCol="classname",outputCol="ClassCategory")
I want to append this transformed column back to the dataframe.
df.withColumn does not let me do that because the indexer object is not a column.
Is there a way to transform and append?
As can be seen in the examples of the Spark ML Documentation, you can try the following:
from pyspark.ml.feature import StringIndexer

# Original data is in "df"
indexer = StringIndexer(inputCol="classname", outputCol="ClassCategory")
indexed = indexer.fit(df).transform(df)
indexed.show()
The indexed object will be a DataFrame with a new column called "ClassCategory" (the name passed as outputCol).
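As a follow-up sketch, the indexed column could then be assembled into a feature vector and passed to logistic regression; the feature column "someNumericCol" and the label column "label" are assumptions, and in practice the index is often one-hot encoded (e.g. with OneHotEncoder) before being fed to a linear model:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assemble the indexed category with an assumed numeric feature column.
assembler = VectorAssembler(inputCols=["ClassCategory", "someNumericCol"],
                            outputCol="features")
train = assembler.transform(indexed)

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(train)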