How to pass SparseVectors to `mllib` in pyspark - python-3.x

I am using pyspark 1.6.3 through Zeppelin with python 3.5.
I am trying to implement Latent Dirichlet Allocation using the pyspark CountVectorizer and LDA functions. First, the problem: here is the code I am using. Let df be a spark dataframe with tokenized text in a column 'tokenized'
vectors = 'vectors'
cv = CountVectorizer(inputCol = 'tokenized', outputCol = vectors)
model = cv.fit(df)
df = model.transform(df)
corpus = df.select(vectors).rdd.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
ldaModel = LDA.train(corpus, k=25)
This code is taken more or less from the pyspark api docs.
On the call to LDA I get the following error:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
The internet tells me that this is due to a type mismatch.
So lets look at the types for LDA and from CountVectorizer. From spark docs here is another example of a sparse vector going into LDA:
>>> from pyspark.mllib.linalg import Vectors, SparseVector
>>> data = [
... [1, Vectors.dense([0.0, 1.0])],
... [2, SparseVector(2, {0: 1.0})],
... ]
>>> rdd = sc.parallelize(data)
>>> model = LDA.train(rdd, k=2, seed=1)
I implement this myself and this is what rdd looks like:
>> testrdd.take(2)
[[1, DenseVector([0.0, 1.0])], [2, SparseVector(2, {0: 1.0})]]
On the other hand, if I go to my original code and look at corpus the rdd with the output of CountVectorizer, I see (edited to remove extraneous bits):
>> corpus.take(3)
[[0, Row(vectors=SparseVector(130593, {0: 30.0, 1: 13.0, ...
[1, Row(vectors=SparseVector(130593, {0: 52.0, 1: 44.0, ...
[2, Row(vectors=SparseVector(130593, {0: 14.0, 1: 6.0, ...
]
So the example I used (from the docs!) doesn't produce a tuple of (index, SparseVector), but a (index, Row(SparseVector))... or something?
Questions:
Is the Row wrapper around the SparseVector what is causing this error?
If so, how do I get rid of the Row object? Row is a property of a df, but I used df.rdd to convert to an rdd; what else would I need to do?

It maybe the problem. Just extract vectors from the Row object.
corpus = df.select(vectors).rdd.zipWithIndex().map(lambda x: [x[1], x[0]['vectors']]).cache()

Related

appending pytorch object in pandas DataFrame

I want to append tensor objects c, s in empty dataframe df_data1_cluster
df_data1_cluster = pd.DataFrame(columns = ["cluster", "text"])
label, center = detect_clusters(torch.as_tensor(embeddings), 50)
for c, s in zip(label, phrases):
df_data1_cluster.append(c,s)
It is resulting in error.
TypeError: cannot concatenate object of type '<class 'torch.Tensor'>'; only Series and DataFrame objs are valid
Hi this is a pandas problem, the append command requires u append a dataframe, there are ways to insert to a dataframe
import pandas as pd
import torch
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']})
df['numbers'].loc[-1] = torch.tensor([2, 3, 4]) # adding a row

How to drop original columns in a spark ML transformer

When I run a spark ml transformer, we provide input and output columns. The transformed data set contains both types of columns, i.e. old columns and transformed columns
e.g.
from pyspark.ml.feature import Imputer
df = spark.createDataFrame([
(1.0, float("nan")),
(2.0, float("nan")),
(float("nan"), 3.0),
(4.0, 4.0),
(5.0, 5.0)
], ["a", "b"])
imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
model = imputer.fit(df)
model.transform(df).columns
This will print out
['a','b','out_a','out_b']
Is it possible to ask the transformer to spit out the transformed column only?
I want this to happen inside the transformer, and do not want to remove the columns using the drop method in spark dataframe

FPGrowth: Input data is not cached pyspark

I am trying to run following example code. Even-though I have cached my data, I am getting "Input data is not cached pyspark" warning. Because of this issue, I am not able to use fp growth algorithm for large datasets.
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import SparkSession
"""
An example demonstrating FPGrowth.
Run with:
bin/spark-submit examples/src/main/python/ml/fpgrowth_example.py
"""
if __name__ == "__main__":
spark = SparkSession\
.builder\
.appName("FPGrowthExample")\
.getOrCreate()
# $example on$
df = spark.createDataFrame([
(0, [1, 2, 5]),
(1, [1, 2, 3, 5]),
(2, [1, 2])
], ["id", "items"])
df = df.cache()
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
# Display frequent itemsets.
model.freqItemsets.show()
# Display generated association rules.
model.associationRules.show()
# transform examines the input items against all the association rules and summarize the
# consequents as prediction
model.transform(df).show()
spark.stop()
Why:
Because ml.fpm.FPGrowth converts data to RDD and runs mllib.fpm.FPGrowth on this RDD. RDD is not cached and this causes the warning in mllib code.
What can you do about it:
In your code nothing. If you think this is a big issue (shouldn't be) open a JIRA ticket and create a pull request.
Because of this issue, I am not able to use fp growth algorithm for large datasets.
It can cause unnecessary allocation and slowdown, but shouldn't be limiting. If you experience failure it is possible that parameters require tuning.

Column features must be of type org.apache.spark.ml.linalg.VectorUDT

I want to run this code in pyspark (spark 2.1.1):
from pyspark.ml.feature import PCA
bankPCA = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
pcaModel = bankPCA.fit(bankDf)
pcaResult = pcaModel.transform(bankDF).select("label", "pcaFeatures")
pcaResult.show(truncate= false)
But I get this error:
requirement failed: Column features must be of type
org.apache.spark.ml.linalg.Vect orUDT#3bfc3ba7 but was actually
org.apache.spark.mllib.linalg.VectorUDT#f71b0bce.
Example that you can find here:
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data, ["features"])
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
... other code ...
As you can see above, df is a dataframe which contains Vectors.sparse() and Vectors.dense() that are imported from pyspark.ml.linalg.
Probably, your bankDf contains Vectors imported from pyspark.mllib.linalg.
So you have to set that Vectors in your dataframes are imported
from pyspark.ml.linalg import Vectors
instead of:
from pyspark.mllib.linalg import Vectors
Maybe you can find interesting this stackoverflow question.

How to access element of a VectorUDT column in a Spark DataFrame?

I have a dataframe df with a VectorUDT column named features. How do I get an element of the column, say first element?
I've tried doing the following
from pyspark.sql.functions import udf
first_elem_udf = udf(lambda row: row.values[0])
df.select(first_elem_udf(df.features)).show()
but I get a net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict(for numpy.dtype) error. Same error if I do first_elem_udf = first_elem_udf(lambda row: row.toArray()[0]) instead.
I also tried explode() but I get an error because it requires an array or map type.
This should be a common operation, I think.
Convert output to float:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf
def ith_(v, i):
try:
return float(v[i])
except ValueError:
return None
ith = udf(ith_, DoubleType())
Example usage:
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
(1, Vectors.dense([1, 2, 3])),
(2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.select(ith("features", lit(1))).show()
## +-----------------+
## |ith_(features, 1)|
## +-----------------+
## | 2.0|
## | 9.0|
## +-----------------+
Explanation:
Output values have to be reserialized to equivalent Java objects. If you want to access values (beware of SparseVectors) you should use item method:
v.values.item(0)
which return standard Python scalars. Similarly if you want to access all values as a dense structure:
v.toArray().tolist()
If you prefer using spark.sql, you can use the follow custom function 'to_array' to convert the vector to array. Then you can manipulate it as an array.
from pyspark.sql.types import ArrayType, DoubleType
def to_array_(v):
return v.toArray().tolist()
from pyspark.sql import SQLContext
sqlContext=SQLContext(spark.sparkContext, sparkSession=spark, jsqlContext=None)
sqlContext.udf.register("to_array",to_array_, ArrayType(DoubleType()))
example
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
(1, Vectors.dense([1, 2, 3])),
(2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.createOrReplaceTempView("tb")
spark.sql("""select * , to_array(features)[1] Second from tb """).toPandas()
output
id features Second
0 1 [1.0, 2.0, 3.0] 2.0
1 2 (0.0, 9.0, 0.0) 9.0
I ran into the same problem with not being able to use explode(). One thing you can do is use VectorSlice from the pyspark.ml.feature library. Like so:
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import Row
slicer = VectorSlicer(inputCol="features", outputCol="features_one", indices=[0])
output = slicer.transform(df)
output.select("features", "features_one").show()
For anyone trying to split the probability columns generated after training a PySpark ML model into usable columns. This does not use UDF or numpy. And this will only work for binary classification. Here lr_pred is the dataframe which has the predictions from the Logistic Regression Model.
prob_df1=lr_pred.withColumn("probability",lr_pred["probability"].cast("String"))
prob_df =prob_df1.withColumn('probabilityre',split(regexp_replace("probability", "^\[|\]", ""), ",")[1].cast(DoubleType()))
Since Spark 3.0.0 this can be done without using UDF.
from pyspark.ml.functions import vector_to_array
https://discuss.dizzycoding.com/how-to-split-vector-into-columns-using-pyspark/
Why is Vector[Double] is used in the results? That's not a very nice data type.

Resources