When I run a Spark ML transformer, I provide input and output columns. The transformed dataset then contains both kinds of columns, i.e. the original columns and the transformed ones.
e.g.
from pyspark.ml.feature import Imputer
df = spark.createDataFrame([
(1.0, float("nan")),
(2.0, float("nan")),
(float("nan"), 3.0),
(4.0, 4.0),
(5.0, 5.0)
], ["a", "b"])
imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
model = imputer.fit(df)
model.transform(df).columns
This will print out
['a','b','out_a','out_b']
Is it possible to ask the transformer to output only the transformed columns?
I want this to happen inside the transformer, and I don't want to remove the original columns afterwards with the DataFrame drop method.
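One possible workaround (a sketch, not a built-in option of Imputer as far as I know): transformers don't expose a switch to drop their input columns, but the column pruning can still live inside the pipeline by chaining an SQLTransformer that selects only the output columns:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer, SQLTransformer

# Chain the imputer with an SQLTransformer that keeps only the imputed columns,
# so the selection happens inside the pipeline rather than via df.drop().
imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
selector = SQLTransformer(statement="SELECT out_a, out_b FROM __THIS__")
pipeline = Pipeline(stages=[imputer, selector])
pipeline.fit(df).transform(df).columns  # ['out_a', 'out_b']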
Related
I need to build a method that receives a pyspark.sql.Column 'c' and returns a new pyspark.sql.Column containing True/False depending on whether the values in the column are null/NaN.
PySpark has the Column method c.isNotNull(), which handles the null case. It also has pyspark.sql.functions.isnan, which takes a pyspark.sql.Column and handles NaN (but does not work with datetime/bool columns).
I'm trying to build a function that looks like this:
from pyspark.sql import functions as F

def notnull(c):
    return c.isNotNull() & ~F.isnan(c)
I then want to use that function on any column type in my DataFrame to find out whether the column contains non-null/non-NaN values. But this fails when the provided column is of bool or datetime type:
import datetime
import numpy as np
import pandas as pd
from pyspark import SparkConf
from pyspark.sql import SparkSession
# Building SparkSession 'spark'
conf = (SparkConf().setAppName("example")
.setMaster("local[*]")
.set("spark.sql.execution.arrow.enabled", "true"))
spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
# Data input and initialazing pd_df
data = {
'string_col': ['1', '1', '1', None],
'bool_col': [True, True, True, False],
'datetime_col': [
datetime.datetime(2018, 12, 9),
datetime.datetime(2018, 12, 9),
datetime.datetime(2018, 12, 9),
pd.NaT],
'float_col': [1.0, 1.0, 1.0, np.nan]
}
pd_df = pd.DataFrame(data)
# Creating spark_df from pd_df
spark_df = spark.createDataFrame(pd_df)
# This should return a new dataframe with the column 'notnulls' added
# Note: This works fine with 'float_col' and 'string_col' but does not
# work with 'bool_col' or 'datetime_col'
spark_df.withColumn('notnulls', notnull(spark_df['datetime_col'])).collect()
Running this snippet (using 'datetime_col') will throw the following exception:
pyspark.sql.utils.AnalysisException: "cannot resolve 'isnan(`datetime_col`)'
due to data type mismatch: argument 1 requires (double or float) type, however,
'`datetime_col`' is of timestamp type.;;\n'Project [category#217,
float_col#218, string_col#219, bool_col#220, CASE WHEN isnan(datetime_col#221)
THEN NOT isnan(datetime_col#221) ELSE isnotnull(datetime_col#221) END AS
datetime_col#231]\n+- LogicalRDD [category#217, float_col#218, string_col#219,
bool_col#220, datetime_col#221], false\n"
I understand this is because the isnan function cannot be applied to 'datetime_col', since it is not of float/double type. Since 'c' is a pyspark.sql.Column object, I can't access its dtype to behave differently based on the column type. I want to avoid using a pandas_udf to solve this, but I haven't been able to find any other way to do it.
I'm using the following dependencies:
numpy==1.19.1
pandas==1.0.4
pyarrow==1.0.0
pyspark==2.4.5
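A possible workaround (a sketch, not the only approach): the Column object doesn't carry its data type, but the DataFrame schema does, so you can pass the DataFrame and the column name instead of a bare Column and only apply isnan to float/double columns:
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, FloatType

def notnull_col(df, colname):
    # True where the column is neither null nor NaN; isnan() is only applied
    # to float/double columns, since it fails on types such as timestamp or bool.
    c = df[colname]
    if isinstance(df.schema[colname].dataType, (FloatType, DoubleType)):
        return c.isNotNull() & ~F.isnan(c)
    return c.isNotNull()

# Usage against the example above:
# spark_df.withColumn('notnulls', notnull_col(spark_df, 'datetime_col')).collect()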
I am using pyspark 1.6.3 through Zeppelin with python 3.5.
I am trying to implement Latent Dirichlet Allocation using the pyspark CountVectorizer and LDA functions. First, the problem: here is the code I am using. Let df be a spark dataframe with tokenized text in a column 'tokenized'
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.clustering import LDA

vectors = 'vectors'
cv = CountVectorizer(inputCol='tokenized', outputCol=vectors)
model = cv.fit(df)
df = model.transform(df)
corpus = df.select(vectors).rdd.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
ldaModel = LDA.train(corpus, k=25)
This code is taken more or less from the pyspark api docs.
On the call to LDA I get the following error:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.sql.types._create_row)
The internet tells me that this is due to a type mismatch.
So let's look at the types expected by LDA and produced by CountVectorizer. From the Spark docs, here is another example of a sparse vector going into LDA:
>>> from pyspark.mllib.linalg import Vectors, SparseVector
>>> data = [
... [1, Vectors.dense([0.0, 1.0])],
... [2, SparseVector(2, {0: 1.0})],
... ]
>>> rdd = sc.parallelize(data)
>>> model = LDA.train(rdd, k=2, seed=1)
I implemented this myself, and this is what the RDD looks like:
>> testrdd.take(2)
[[1, DenseVector([0.0, 1.0])], [2, SparseVector(2, {0: 1.0})]]
On the other hand, if I go back to my original code and look at corpus, the RDD built from the output of CountVectorizer, I see (edited to remove extraneous bits):
>> corpus.take(3)
[[0, Row(vectors=SparseVector(130593, {0: 30.0, 1: 13.0, ...
[1, Row(vectors=SparseVector(130593, {0: 52.0, 1: 44.0, ...
[2, Row(vectors=SparseVector(130593, {0: 14.0, 1: 6.0, ...
]
So my code, based on the example from the docs, doesn't produce pairs of (index, SparseVector), but pairs of (index, Row(SparseVector))... or something?
Questions:
Is the Row wrapper around the SparseVector what is causing this error?
If so, how do I get rid of the Row object? Row is a property of a df, but I used df.rdd to convert to an rdd; what else would I need to do?
That may be the problem. Just extract the vector from the Row object:
corpus = df.select(vectors).rdd.zipWithIndex().map(lambda x: [x[1], x[0]['vectors']]).cache()
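Equivalently (a sketch along the same lines), you can pull the vector out of the Row before indexing, so each RDD element is already a SparseVector:
corpus = (df.select(vectors).rdd
          .map(lambda row: row[0])   # Row(vectors=SparseVector(...)) -> SparseVector
          .zipWithIndex()
          .map(lambda x: [x[1], x[0]])
          .cache())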
I use VectorAssembler to create a vector of features from >2000 columns so that I can run a PCA on it. I would normally explicitly state which columns need to be included in the feature vector:
val dataset = (spark.createDataFrame(
Seq((0, 1.2, 1.3, 1.7, 1.9), (1, 2.2, 2.3, 2.7, 2.9), (2, 3.2, 3.3, 3.5, 3.7))
).toDF("id", "f1", "f2", "f3", "f4"))
val assembler = (new VectorAssembler()
.setInputCols(Array("f2", "f3"))
.setOutputCol("featureVec"))
But in case of more than 2000 columns how can I specify that all columns except for "id" and "f1" should be included?
Any help appreciated!
One of the easiest ways is to get all the column names, convert them to a set, subtract the columns you don't need, and use the result as an array:
val datasetColumnsToBeUsed = (dataset.columns.toSet - "id" - "f1").toArray
import org.apache.spark.ml.feature.VectorAssembler
val assembler = (new VectorAssembler()
.setInputCols(Array(datasetColumnsToBeUsed: _*))
.setOutputCol("featureVec"))
Another easy way is to use filter on the column names:
val columnNames = dataset.columns
val datasetColumnsToBeUsed = columnNames.filterNot(x => Array("id", "f1").contains(x))
And use it as above
I have the following dataframe (assume that it is already a DataFrame):
val df = sc.parallelize(Seq((1, 2, 10), (3, 4, 11), (5, 6, 12)))
.toDF("a", "b", "c")
and I want to combine some of the columns (not all) into one column and turn it into an RDD of Array[Double]. I am doing the following:
import org.apache.spark.ml.feature.VectorAssembler
val colSelected = List("a","b")
val assembler = new VectorAssembler()
.setInputCols(colSelected.toArray)
.setOutputCol("features")
val output = assembler.transform(df).select("features").rdd
Up to here it is fine. Now output is an RDD[spark.sql.Row]. I am unable to transform this into an RDD[Array[Double]]. Is there a way?
I have tried something like the following but with no success:
output.map { case Row(a: Vector[Double]) => a.getAs[Array[Double]]("features")}
The correct solution (this assumes Spark 2.0+, in 1.x use o.a.s.mllib.linalg.Vector):
import org.apache.spark.ml.linalg.Vector
output.map(_.getAs[Vector]("features").toArray)
The ml/mllib Vector created by VectorAssembler is not the same as scala.collection.Vector.
Row.getAs should be used with the expected type. It doesn't perform any type conversions, and o.a.s.ml(lib).linalg.Vector is not an Array[Double].
I have a dataframe df with a VectorUDT column named features. How do I get an element of the column, say the first element?
I've tried doing the following
from pyspark.sql.functions import udf
first_elem_udf = udf(lambda row: row.values[0])
df.select(first_elem_udf(df.features)).show()
but I get a net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype) error. I get the same error if I use udf(lambda row: row.toArray()[0]) instead.
I also tried explode() but I get an error because it requires an array or map type.
This should be a common operation, I think.
Convert output to float:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf

def ith_(v, i):
    try:
        return float(v[i])
    except ValueError:
        return None

ith = udf(ith_, DoubleType())
Example usage:
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
(1, Vectors.dense([1, 2, 3])),
(2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.select(ith("features", lit(1))).show()
## +-----------------+
## |ith_(features, 1)|
## +-----------------+
## | 2.0|
## | 9.0|
## +-----------------+
Explanation:
Output values have to be reserialized to equivalent Java objects. If you want to access values (beware of SparseVectors), you should use the item method:
v.values.item(0)
which returns a standard Python scalar. Similarly, if you want to access all values as a dense structure:
v.toArray().tolist()
If you prefer using spark.sql, you can use the following custom function to_array to convert the vector to an array. Then you can manipulate it as an array.
from pyspark.sql.types import ArrayType, DoubleType

def to_array_(v):
    return v.toArray().tolist()

from pyspark.sql import SQLContext
sqlContext = SQLContext(spark.sparkContext, sparkSession=spark, jsqlContext=None)
sqlContext.udf.register("to_array", to_array_, ArrayType(DoubleType()))
Example:
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
(1, Vectors.dense([1, 2, 3])),
(2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.createOrReplaceTempView("tb")
spark.sql("""select * , to_array(features)[1] Second from tb """).toPandas()
Output:
   id         features  Second
0   1  [1.0, 2.0, 3.0]     2.0
1   2  (0.0, 9.0, 0.0)     9.0
I ran into the same problem of not being able to use explode(). One thing you can do is use VectorSlicer from the pyspark.ml.feature module. Like so:
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import Row
slicer = VectorSlicer(inputCol="features", outputCol="features_one", indices=[0])
output = slicer.transform(df)
output.select("features", "features_one").show()
For anyone trying to split the probability column generated after training a PySpark ML model into usable columns: this does not use a UDF or numpy, and it only works for binary classification. Here lr_pred is the dataframe which has the predictions from the logistic regression model.
from pyspark.sql.functions import split, regexp_replace
from pyspark.sql.types import DoubleType

prob_df1 = lr_pred.withColumn("probability", lr_pred["probability"].cast("String"))
prob_df = prob_df1.withColumn(
    "probabilityre",
    split(regexp_replace("probability", r"^\[|\]", ""), ",")[1].cast(DoubleType()))
Since Spark 3.0.0 this can be done without using UDF.
from pyspark.ml.functions import vector_to_array
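For example, a short sketch of the Spark 3.0+ approach (assuming a vector column named 'features' as in the examples above):
from pyspark.sql.functions import col
from pyspark.ml.functions import vector_to_array

# Convert the vector column to array<double>, then index it like a normal array.
(df.withColumn("features_arr", vector_to_array(col("features")))
   .select(col("features_arr")[1].alias("second"))
   .show())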
Why is Vector[Double] used in the results? That's not a very nice data type.