I have a double-typed column in a dataframe that holds the class label for a Random Forest training set.
I would like to manually attach metadata to the column so that I don't have to pass the dataframe into a StringIndexer as suggested in another question.
The easiest method of doing this seems to be by using the as method of Column.
However, this method is not available in Python.
Is there an easy workaround?
If there is no easy workaround and the best approach is a Python port of as, then why has the method not been ported to Python?
Is there a difficult technical reason, or is it simply that as is a reserved keyword in Python and no one has volunteered to port it?
I looked at the source code and found that the alias method in Python internally calls the as method in Scala.
import json
from pyspark.sql.column import Column

def add_meta(col, metadata):
    meta = sc._jvm.org.apache.spark.sql.types \
        .Metadata.fromJson(json.dumps(metadata))
    return Column(getattr(col._jc, "as")('', meta))

# sample invocation
df.withColumn('label',
              add_meta(df.classification,
                       {"ml_attr": {
                           "name": "label",
                           "type": "nominal",
                           "vals": ["0.0", "1.0"]
                       }})) \
  .show()
This solution involves calling the as(alias: String, metadata: Metadata) Scala method from Python. It can be retrieved with getattr(col._jc, "as"), where col is a DataFrame column (Column object).
The returned function must then be called with two arguments. The first argument is just a string, and the second argument is a Metadata object. That object is created by calling Metadata.fromJson(), which expects a JSON string as its parameter; the fromJson method is reached via the _jvm attribute of the Spark context.
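To check that the metadata actually landed on the column, you can inspect the schema of the result; a small sketch reusing df and add_meta from above (the variable name labeled_df is illustrative):
labeled_df = df.withColumn('label',
                           add_meta(df.classification,
                                    {"ml_attr": {"name": "label",
                                                 "type": "nominal",
                                                 "vals": ["0.0", "1.0"]}}))
# the attached metadata is exposed as a plain dict on the StructField
print(labeled_df.schema['label'].metadata)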
Spark 3.3+
df.withMetadata("col_name", meta_dict)
Spark 2.2+
df.withColumn("col_name", df.col_name.alias("", metadata=meta_dict))
meta_dict can be a complex dictionary, as provided in the other answer:
meta_dict = {
    "ml_attr": {
        "name": "label",
        "type": "nominal",
        "vals": ["0.0", "1.0"]
    }
}
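To confirm that either variant worked, you can read the metadata back from the schema; a small sketch, assuming Spark 3.3+ and a DataFrame df with a column col_name:
df2 = df.withMetadata("col_name", meta_dict)
print(df2.schema["col_name"].metadata)   # prints the contents of meta_dict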
I have a Dataset<Row> that comes from reading a Parquet file. One of its columns, InfoMap, is of type Map.
Now I want to update this column, but when I use withColumn, it tells me that I cannot put a HashMap inside because it's not a literal.
What is the correct way to update a column of type Map in a Dataset?
Try using typedLit instead of lit
typedLit
"...The difference between this function and lit() is that this
function can handle parameterized scala types e.g.: List, Seq and Map"
data.withColumn("dictionary", typedLit(Map("foo" -> 1, "bar" -> 2)))
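For reference, in PySpark a comparable map literal can be built with create_map; a minimal sketch, assuming a DataFrame named data as above:
from pyspark.sql import functions as F

# build a literal map column from alternating key/value literals
data.withColumn("dictionary",
                F.create_map(F.lit("foo"), F.lit(1), F.lit("bar"), F.lit(2)))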
For a simple grouping operation, apparently the returned type is no longer a DataFrame?
val itemsQtyDf = pkgItemsDf.groupBy($"packageid").withColumn("totalqty",sum("qty"))
However, we cannot invoke DataFrame operations after the groupBy, since the result is a GroupedData:
Error:(26, 55) value withColumn is not a member of org.apache.spark.sql.GroupedData
So how do I get my DataFrame back after a grouping? Is it necessary to use DataFrame.agg() instead?
Grouping without an aggregate function suggests you may want the distinct() function instead, which does return a DataFrame. But your example shows you want sum("qty"), so just change your code to this:
pkgItemsDf.groupBy($"packageid").agg(sum("qty").alias("totalqty"))
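For reference, the equivalent in PySpark would look like this (a sketch, assuming the same pkgItemsDf):
from pyspark.sql import functions as F

# agg() on the grouped data returns a DataFrame again
pkgItemsDf.groupBy("packageid").agg(F.sum("qty").alias("totalqty"))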
I have a PySpark DataFrame consisting of three columns, whose structure is shown below.
In[1]: df.take(1)
Out[1]:
[Row(angle_est=-0.006815859163590619, rwsep_est=0.00019571401752467945, cost_est=34.33651951754235)]
What I want to do is retrieve each value of the first column (angle_est) and pass it as the parameter xMisallignment to a defined function that sets a particular property of a class object. The defined function is:
import warnings
import numpy as np

def setMisAllignment(self, xMisallignment):
    if np.abs(xMisallignment) > 0.8:
        warnings.warn('You might set misallignment angle too large.')
    self.MisAllignment = xMisallignment
I am trying to select the first column, convert it to an RDD, and apply the above function inside map(), but it does not seem to work; MisAllignment did not change.
df.select(df.angle_est).rdd.map(lambda row: model0.setMisAllignment(row))
In[2]: model0.MisAllignment
Out[2]: 0.00111511718224
Does anyone have ideas to help me make that function work? Thanks in advance!
You can register your function as a Spark UDF, similar to the following:
spark.udf.register("misallign", setMisAllignment)
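As a minimal sketch of the suggested pattern (the names clip_angle and angle_checked are illustrative, not from the question): wrap a plain function, rather than a bound method, as a UDF and apply it to the column.
from pyspark.sql.types import DoubleType

def clip_angle(x):
    # plain function; runs on the executors, not on the driver-side object
    return float(min(abs(x), 0.8))

spark.udf.register("misallign", clip_angle, DoubleType())
df.selectExpr("misallign(angle_est) AS angle_checked").show()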
You can find many examples of creating and registering UDFs in this test suite:
https://github.com/apache/spark/blob/master/sql/core/src/test/java/test/org/apache/spark/sql/JavaUDFSuite.java
Hope this answers your question.
The Apache Spark Dataset API has two methods, head(n: Int) and take(n: Int).
The Dataset.scala source contains
def take(n: Int): Array[T] = head(n)
I couldn't find any difference in the execution code between these two functions. Why does the API have two different methods that yield the same result?
The reason, in my view, is that the Apache Spark Dataset API is trying to mimic the Pandas DataFrame API, which contains head: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html.
I have experimented and found that head(n) and take(n) give exactly the same output. Both produce output in the form of Row objects only.
DF.head(2)
[Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]
DF.take(2)
[Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]
package org.apache.spark.sql
/* ... */
def take(n: Int): Array[T] = head(n)
I think this is because the Spark developers tend to provide a rich API; there are also the two methods where and filter, which do exactly the same thing.
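For example, these two PySpark lines are interchangeable (a sketch, assuming a DataFrame df with a numeric column qty):
df.where(df.qty > 0).show()
df.filter(df.qty > 0).show()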
I want to develop a custom estimator for Spark which also handles persistence via the great Pipeline API. But as How to Roll a Custom Estimator in PySpark mllib put it, there is not a lot of documentation out there (yet).
I have some data-cleansing code written in Spark and would like to wrap it in a custom estimator. It includes some NA substitutions, column deletions, filtering, and basic feature generation (e.g. birthdate to age).
transformSchema will use the case class of the dataset: ScalaReflection.schemaFor[MyClass].dataType.asInstanceOf[StructType]
fit will only fit, e.g., the mean age as an NA substitute
What is still pretty unclear to me:
transform in the custom pipeline model will be used to apply the "fitted" Estimator to new data. Is this correct? If yes, how should I transfer the fitted values, e.g. the mean age from above, into the model?
how to handle persistence? I found some generic loadImpl method within private Spark components, but am unsure how to transfer my own parameters, e.g. the mean age, into the MLReader / MLWriter which are used for serialization.
It would be great if you could help me with a custom estimator - especially with the persistence part.
First of all, I believe you're mixing up two different things:
Estimators - which represent stages that can be fitted. An Estimator's fit method takes a Dataset and returns a Transformer (model).
Transformers - which represent stages that can transform data.
When you fit a Pipeline, it fits all the Estimators and returns a PipelineModel. The PipelineModel can transform data by sequentially calling transform on all the Transformers in the model.
how should I transfer the fitted values
There is no single answer to this question. In general you have two options:
Pass parameters of the fitted model as the arguments of the Transformer.
Make parameters of the fitted model Params of the Transformer.
The first approach is typically used by the built-in Transformers, but the second one should work in some simple cases, as sketched below.
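A minimal PySpark sketch of the first option, where the fitted value travels as a constructor argument (the names MeanFillEstimator, MeanFillModel and the age column are purely illustrative):
from pyspark.ml import Estimator, Transformer

class MeanFillModel(Transformer):
    def __init__(self, mean_age):
        super().__init__()
        self.mean_age = mean_age          # fitted value held by the model

    def _transform(self, dataset):
        # replace missing ages with the mean computed during fit
        return dataset.fillna({"age": self.mean_age})

class MeanFillEstimator(Estimator):
    def _fit(self, dataset):
        mean_age = dataset.selectExpr("avg(age)").first()[0]
        return MeanFillModel(mean_age)
Note that this sketch does not handle persistence at all, which is exactly where the next point comes in.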
how to handle persistence
If a Transformer is defined only by its Params, you can extend DefaultParamsReadable.
If you need more complex arguments, you should extend MLWritable and implement an MLWriter that makes sense for your data. There are multiple examples in the Spark source which show how to implement data and metadata reading / writing.
If you're looking for an easy-to-comprehend example, take a look at CountVectorizer(Model), where:
Estimator and Transformer share common Params.
Model vocabulary is a constructor argument, model parameters are inherited from the parent.
Metadata (parameters) is written and read using DefaultParamsWriter / DefaultParamsReader.
Custom implementation handles data (vocabulary) writing and reading.
The following uses the Scala API but you can easily refactor it to Python if you really want to...
First things first:
Estimator: implements .fit() that returns a Transformer
Transformer: implements .transform() and manipulates the DataFrame
Serialization/Deserialization: do your best to use the built-in Params and leverage the simple DefaultParamsWritable trait plus a companion object extending DefaultParamsReadable[T]. In other words, stay away from MLReader / MLWriter and keep your code simple.
Parameter passing: use a common trait extending Params and share it between your Estimator and Model (a.k.a. Transformer).
Skeleton code:
// Common Parameters
trait MyCommonParams extends Params {
final val inputCols: StringArrayParam = // usage: new MyMeanValueStuff().setInputCols(...)
new StringArrayParam(this, "inputCols", "doc...")
def setInputCols(value: Array[String]): this.type = set(inputCols, value)
def getInputCols: Array[String] = $(inputCols)
final val meanValues: DoubleArrayParam =
new DoubleArrayParam(this, "meanValues", "doc...")
// more setters and getters
}
// Estimator
class MyMeanValueStuff(override val uid: String) extends Estimator[MyMeanValueStuffModel]
with DefaultParamsWritable // Enables Serialization of MyCommonParams
with MyCommonParams {
override def copy(extra: ParamMap): Estimator[MeanValueFillerModel] = defaultCopy(extra) // deafult
override def transformSchema(schema: StructType): StructType = schema // no changes
override def fit(dataset: Dataset[_]): MyMeanValueStuffModel = {
// your logic here. I can't do all the work for you! ;)
this.setMeanValues(meanValues)
copyValues(new MyMeanValueStuffModel(uid + "_model").setParent(this))
}
}
// Companion object enables deserialization of MyCommonParams
object MyMeanValueStuff extends DefaultParamsReadable[MyMeanValueStuff]
// Model (Transformer)
class MyMeanValueStuffModel(override val uid: String) extends Model[MyMeanValueStuffModel]
with DefaultParamsWritable // Enables Serialization of MyCommonParams
with MyCommonParams {
override def copy(extra: ParamMap): MyMeanValueStuffModel = defaultCopy(extra) // default
override def transformSchema(schema: StructType): StructType = schema // no changes
override def transform(dataset: Dataset[_]): DataFrame = {
// your logic here: zip inputCols and meanValues, toMap, replace nulls with NA functions
// you have access to both inputCols and meanValues here!
}
}
// Companion object enables deserialization of MyCommonParams
object MyMeanValueStuffModel extends DefaultParamsReadable[MyMeanValueStuffModel]
With the code above you can serialize/deserialize a Pipeline containing a MyMeanValueStuff stage.
Want to look at a really simple implementation of an Estimator? MinMaxScaler! (My example is actually simpler, though...)