Class[_] equivalent in python - python-3.x

I want to translate the following Scala code into Python 3:
class xyz {
  def abc(): Unit = {
    val clazz: Class[_] = this.getClass
    var fields: List[String] = getFields(clazz)
    val methods = clazz.getDeclaredMethods
    val methodNames = methods.map(_.getName)
    val supper = clazz.getSuperclass
    println(clazz)
    println(fields)
    println(methods)
  }
}

Class[_] is a static type. Python doesn't have static types, so there is no equivalent to the static type Class[_] in Python.
I want to translate the following Scala code into Python 3:
class xyz {
  def abc(): Unit = {
    val clazz: Class[_] = this.getClass
    var fields: List[String] = getFields(clazz)
    val methods = clazz.getDeclaredMethods
    val methodNames = methods.map(_.getName)
    val supper = clazz.getSuperclass
  }

  def mno(): Unit = {
    println("hello")
  }
}
abc is simply a NO-OP(*). mno just prints to stdout. So, the equivalent in Python is
class xyz:
    def abc(self):
        pass

    def mno(self):
        print("hello")
Note that I made abc and mno instance methods, even though it makes no sense. (But that's the same for the Scala version.)
(*) Someone who knows more about corner cases and side effects of Java reflection can correct me here. Maybe this triggers some kind of classloader refresh or something like that?

You can't get one-to-one correspondence simply because Python classes are organized very differently from JVM classes.
The equivalent of getClass() is type;
there is no equivalent to Class#getFields because fields aren't necessarily defined on a class in Python, but see How to list all fields of a class (and no methods)?.
Similarly for getSuperclass(): Python classes can have more than one superclass, so __bases__ returns a tuple of base classes instead of just one.

Related

Scala interning: how does different initialisation affect comparison?

I am new to Scala but I know Java. Thus, as far as I understand, the difference is that == in Scala acts like .equals in Java, which means we are comparing values; and eq in Scala acts like == in Java, which means we are comparing reference identity rather than values.
However, after running the code below:
val greet_one_v1 = "Hello"
val greet_two_v1 = "Hello"
println(
  (greet_one_v1 == greet_two_v1),
  (greet_one_v1 eq greet_two_v1)
)

val greet_one_v2 = new String("Hello")
val greet_two_v2 = new String("Hello")
println(
  (greet_one_v2 == greet_two_v2),
  (greet_one_v2 eq greet_two_v2)
)
I get the following output:
(true,true)
(true,false)
My theory is that the initialisation of these strings differs. Hence, how is val greet_one_v1 = "Hello" different from val greet_one_v2 = new String("Hello")? Or, if my theory is incorrect, why do I have different outputs?
As correctly answered by Luis Miguel Mejía Suárez, the answer lies in string interning, which the JVM (Java Virtual Machine) performs automatically for string literals. To get a distinct String object it has to be created explicitly with new, as in my example above; otherwise the JVM reuses the same memory for identical literal values as an optimisation.
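For example, a small snippet that makes the interning visible (the variable names are just illustrative):
val literal = "Hello"                  // string literals are interned in the JVM string pool
val explicit = new String("Hello")     // new always allocates a fresh object on the heap

println(literal == explicit)           // true  -- == in Scala compares values (like equals in Java)
println(literal eq explicit)           // false -- eq compares references (like == in Java)
println(literal eq explicit.intern())  // true  -- intern() returns the pooled instance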

Calling UDF on Dataframe with Serialization Issue

I was looking at some examples of UDFs on blogs that appear to work, but in fact when I run them they give the infamous Task not serializable error.
I find it strange that this gets published with no mention of the issue. Running Spark 2.4.
The code is pretty straightforward; has something changed in Spark?
def lowerRemoveAllWhitespace(s: String): String = {
  s.toLowerCase().replaceAll("\\s", "")
}

import org.apache.spark.sql.functions.udf
val lowerRemoveAllWhitespaceUDF = udf[String, String](lowerRemoveAllWhitespace)

import org.apache.spark.sql.functions.col

val df = sc.parallelize(Seq(
  ("r1 ", 1, 1, 3, -2),
  ("r 2", 6, 4, -2, -2),
  ("r 3", 4, 1, 1, 0),
  ("r4", 1, 2, 4, 5)
)).toDF("ID", "a", "b", "c", "d")

df.select(lowerRemoveAllWhitespaceUDF(col("ID"))).show(false)
returns:
org.apache.spark.SparkException: Task not serializable
From this blog that I find good: https://medium.com/#mrpowers/spark-user-defined-functions-udfs-6c849e39443b
Something must have changed???
I looked at the top-voted item here, with an object that extends Serializable, but no joy either. Puzzled.
EDIT
Things seem to have changed; this format is needed:
val squared = udf((s: Long) => s * s)
I am still interested in why the object approach failed.
I couldn't reproduce the error (tried on Spark 1.6, 2.3, and 2.4), but I do remember facing this kind of error (a long time ago). I'll put in my best guess.
The problem happens due to the difference between a method and a function in Scala, as described in detail here.
The short version is that when you write def, it is equivalent to a method in Java, i.e. part of a class, and it can be invoked using an instance of that class.
When you write udf((s: Long) => s * s) it creates an instance of the trait Function1. For this to happen, an anonymous class implementing Function1 is generated whose apply method is something like def apply(s: Long): Long = { s * s }, and an instance of this class is passed as the parameter to udf.
However, when you write udf[String, String](lowerRemoveAllWhitespace), the method lowerRemoveAllWhitespace needs to be converted to a Function1 instance and passed to udf. This is where the serialization fails, since the apply method on this instance will try to invoke lowerRemoveAllWhitespace on an instance of another object (which cannot be serialized and sent to the worker JVM process), causing the exception.
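If that guess is right, it also explains why the asker's fix works: define the logic as a function value rather than a method. A minimal sketch (lowerRemoveAllWhitespaceFn is a name introduced here just for illustration):
import org.apache.spark.sql.functions.udf

// Problematic: eta-expanding a method of a non-serializable enclosing object.
// The generated Function1 keeps a reference to that object, so Spark has to serialize it too.
// val lowerRemoveAllWhitespaceUDF = udf[String, String](lowerRemoveAllWhitespace)

// Usually fine: a function value whose body does not touch the enclosing instance,
// so only the small Function1 object itself is shipped to the executors.
val lowerRemoveAllWhitespaceFn: String => String = s => s.toLowerCase().replaceAll("\\s", "")
val lowerRemoveAllWhitespaceUDF = udf(lowerRemoveAllWhitespaceFn)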
The example that was posted was from a reputable source, but I cannot get it to run without a serialization error in Spark 2.4; trying objects etc. did not help either.
I solved the issue as follows, using the udf((... approach, which appears to only allow an inline anonymous function; indeed that worked, and voilà, no serialization error. A slightly different example, though, using primitives:
val sumContributionsPlus = udf((n1: Int, n2: Int, n3: Int, n4: Int) => Seq(n1, n2, n3, n4).foldLeft(0)((acc, a) => if (a > 0) acc + a else acc))
On a final note, the whole discussion of UDFs, Spark-native functions and Column functions is confusing when things no longer appear to work.

Pyspark applying foreach

I'm a newbie in PySpark and I intend to play a bit with a couple of functions to better understand how I could use them in more realistic scenarios. For a while I have been trying to apply a specific function to each number coming in an RDD. My problem is basically that when I try to print what I grabbed from my RDD, the result is None.
My code:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('test')
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")

changed = []

def div_two(n):
    opera = n / 2
    return opera

numbers = [8, 40, 20, 30, 60, 90]
numbersRDD = sc.parallelize(numbers)
changed.append(numbersRDD.foreach(lambda x: div_two(x)))
# result = numbersRDD.map(lambda x: div_two(x))

for i in changed:
    print(i)
I would appreciate a clear explanation of why this comes out as None in the list, and what the right approach would be to achieve this using foreach, if it's possible.
Thanks.
Your definition of div_two seems fine, though it can be reduced to
def div_two(n):
    return n / 2
And you have converted the list of integers to an RDD, which is good too.
The main issue is that you are trying to add RDDs to the list changed by using the foreach function. But if you look at the definition of foreach
def foreach(self, f)  # inferred type: (self: RDD, f: Any) -> None
you can see that the return type is None, and that is what is getting printed.
You don't need a list variable for printing the changed elements of an RDD. You can simply write a printing function and call it from foreach:
def printing(x):
    print(x)

numbersRDD.map(div_two).foreach(printing)
You should get the results printed.
You can still add the RDD to a list variable, but an RDD is itself a distributed collection and a list is a collection too. So if you add an RDD to a list you get a collection of collections, which means you have to write two loops:
changed.append(numbersRDD.map(div_two))

def printing(x):
    print(x)

for i in changed:
    i.foreach(printing)
The main difference between your code and mine is that I have used map (which is a transformation) instead of foreach (which is an action) when adding the RDD to the changed variable, and I have used two loops for printing the elements of the RDD.

Spark custom estimator including persistence

I want to develop a custom estimator for Spark which also handles persistence via the great Pipeline API. But as How to Roll a Custom Estimator in PySpark mllib put it, there is not a lot of documentation out there (yet).
I have some data cleansing code written in Spark and would like to wrap it in a custom estimator. Some NA substitutions, column deletions, filtering and basic feature generation are included (e.g. birthdate to age).
transformSchema will use the case class of the dataset: ScalaReflection.schemaFor[MyClass].dataType.asInstanceOf[StructType]
fit will only fit, e.g., the mean age as the NA substitute.
What is still pretty unclear to me:
transform in the custom pipeline model will be used to apply the "fitted" estimator to new data. Is this correct? If yes, how should I transfer the fitted values, e.g. the mean age from above, into the model?
How to handle persistence? I found some generic loadImpl method within private Spark components, but I am unsure how to transfer my own parameters, e.g. the mean age, into the MLReader / MLWriter which are used for serialization.
It would be great if you could help me with a custom estimator - especially with the persistence part.
First of all, I believe you're mixing up two slightly different things:
Estimators - which represent stages that can be fitted. An Estimator's fit method takes a Dataset and returns a Transformer (model).
Transformers - which represent stages that can transform data.
When you fit a Pipeline, it fits all Estimators and returns a PipelineModel. A PipelineModel can transform data by sequentially calling transform on all Transformers in the model.
how should I transfer the fitted values
There is no single answer to this question. In general you have two options:
Pass parameters of the fitted model as the arguments of the Transformer.
Make parameters of the fitted model Params of the Transformer.
The first approach is typically used by the built-in Transformers (a sketch follows below), but the second one should work in some simple cases.
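A minimal sketch of the first option, using the asker's mean-age example (the class and column names are made up here, and persistence is deliberately left out; see the notes on MLWritable below):
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{avg, coalesce, col, lit}
import org.apache.spark.sql.types.StructType

// Estimator: fit() computes the statistic and hands it to the model through its constructor.
class MeanAgeImputer(override val uid: String) extends Estimator[MeanAgeImputerModel] {
  def this() = this(Identifiable.randomUID("meanAgeImputer"))

  override def fit(dataset: Dataset[_]): MeanAgeImputerModel = {
    val meanAge = dataset.select(avg(col("age"))).head.getDouble(0)
    new MeanAgeImputerModel(uid, meanAge).setParent(this) // fitted value as a constructor argument
  }
  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): MeanAgeImputer = defaultCopy(extra)
}

// Model (Transformer): uses the value computed during fit to transform new data.
class MeanAgeImputerModel(override val uid: String, val meanAge: Double)
  extends Model[MeanAgeImputerModel] {

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("age", coalesce(col("age"), lit(meanAge)))
  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): MeanAgeImputerModel =
    new MeanAgeImputerModel(uid, meanAge).setParent(parent)
}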
how to handle persistence
If Transformer is defined only by its Params you can extend DefaultParamsReadable.
If you use more complex arguments you should extend MLWritable and implement MLWriter that makes sense for your data. There are multiple examples in Spark source which show how to implement data and metadata reading / writing.
If you're looking for an easy-to-comprehend example, take a look at the CountVectorizer(Model), where:
Estimator and Transformer share common Params.
Model vocabulary is a constructor argument, model parameters are inherited from the parent.
Metadata (parameters) is written and read using DefaultParamsWriter / DefaultParamsReader.
Custom implementation handles data (vocabulary) writing and reading.
The following uses the Scala API but you can easily refactor it to Python if you really want to...
First things first:
Estimator: implements .fit() that returns a Transformer
Transformer: implements .transform() and manipulates the DataFrame
Serialization/Deserialization: Do your best to use built-in Params and leverage the simple DefaultParamsWritable trait plus a companion object extending DefaultParamsReadable[T]. In other words, stay away from MLReader / MLWriter and keep your code simple.
Parameter passing: Use a common trait extending Params and share it between your Estimator and Model (a.k.a. Transformer).
Skeleton code:
// Common Parameters
trait MyCommonParams extends Params {
  final val inputCols: StringArrayParam = // usage: new MyMeanValueStuff().setInputCols(...)
    new StringArrayParam(this, "inputCols", "doc...")
  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
  def getInputCols: Array[String] = $(inputCols)

  final val meanValues: DoubleArrayParam =
    new DoubleArrayParam(this, "meanValues", "doc...")
  // more setters and getters
}

// Estimator
class MyMeanValueStuff(override val uid: String) extends Estimator[MyMeanValueStuffModel]
  with DefaultParamsWritable // Enables Serialization of MyCommonParams
  with MyCommonParams {

  override def copy(extra: ParamMap): Estimator[MyMeanValueStuffModel] = defaultCopy(extra) // default
  override def transformSchema(schema: StructType): StructType = schema // no changes
  override def fit(dataset: Dataset[_]): MyMeanValueStuffModel = {
    // your logic here. I can't do all the work for you! ;)
    this.setMeanValues(meanValues)
    copyValues(new MyMeanValueStuffModel(uid + "_model").setParent(this))
  }
}
// Companion object enables deserialization of MyCommonParams
object MyMeanValueStuff extends DefaultParamsReadable[MyMeanValueStuff]

// Model (Transformer)
class MyMeanValueStuffModel(override val uid: String) extends Model[MyMeanValueStuffModel]
  with DefaultParamsWritable // Enables Serialization of MyCommonParams
  with MyCommonParams {

  override def copy(extra: ParamMap): MyMeanValueStuffModel = defaultCopy(extra) // default
  override def transformSchema(schema: StructType): StructType = schema // no changes
  override def transform(dataset: Dataset[_]): DataFrame = {
    // your logic here: zip inputCols and meanValues, toMap, replace nulls with the na functions
    // you have access to both inputCols and meanValues here!
  }
}
// Companion object enables deserialization of MyCommonParams
object MyMeanValueStuffModel extends DefaultParamsReadable[MyMeanValueStuffModel]
With the code above you can Serialize/Deserialize a Pipeline containing a MyMeanValueStuff stage.
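For instance, a save/load round trip could look like the sketch below, assuming the fit/transform gaps above are filled in (df here is just a stand-in DataFrame, and spark is the active SparkSession):
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}

val df = spark.range(5).toDF("age") // stand-in input data

val stage = new MyMeanValueStuff("my_mean_value_stuff").setInputCols(Array("age"))
val pipeline = new Pipeline().setStages(Array[PipelineStage](stage))

val fitted = pipeline.fit(df) // a PipelineModel containing a MyMeanValueStuffModel
fitted.write.overwrite().save("/tmp/my_mean_value_pipeline")

// Loading works because every stage is DefaultParamsReadable
val reloaded = PipelineModel.load("/tmp/my_mean_value_pipeline")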
Want to look at a really simple implementation of an Estimator? MinMaxScaler! (My example is actually simpler, though...)

Sorting a table by date in Slick 3.1.x

I have the following Slick class that includes a date:
import java.sql.Date
import java.time.LocalDate

class ReportDateDB(tag: Tag) extends Table[ReportDateVO](tag, "report_dates") {
  def reportDate = column[LocalDate]("report_date")(localDateColumnType)

  def * = (reportDate) <> (ReportDateVO.apply, ReportDateVO.unapply)

  implicit val localDateColumnType = MappedColumnType.base[LocalDate, Date](
    d => Date.valueOf(d),
    d => d.toLocalDate
  )
}
When I attempt to sort the table by date:
val query = TableQuery[ReportDateDB]
val action = query.sortBy(_.reportDate).result
I get the following compilation error
not enough arguments for method sortBy: (implicit evidence$2: slick.lifted.Rep[java.time.LocalDate] ⇒
slick.lifted.Ordered)slick.lifted.Query[fdic.ReportDateDB,fdic.ReportDateDB#TableElementType,Seq].
Unspecified value parameter evidence$2.
No implicit view available from slick.lifted.Rep[java.time.LocalDate] ⇒ slick.lifted.Ordered.
How to specify the implicit default order?
You need to make your implicit val localDateColumnType available where you run the query. For example, this will work:
implicit val localDateColumnType = MappedColumnType.base[LocalDate, Date](
  d => Date.valueOf(d),
  d => d.toLocalDate)

val query = TableQuery[ReportDateDB]
val action = query.sortBy(_.reportDate).result
I'm not sure where the best place to put this is, but I usually put all these conversions in a package object.
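For instance, a sketch of that approach (the package name fdic is taken from the error message above; the profile import should match whichever Slick profile you actually use, and the mapping would then be removed from the table class itself):
package object fdic {
  import java.sql.Date
  import java.time.LocalDate
  import slick.driver.MySQLDriver.api._

  // In implicit scope both where the table is defined and where queries (e.g. sortBy) are compiled.
  implicit val localDateColumnType: BaseColumnType[LocalDate] =
    MappedColumnType.base[LocalDate, Date](
      d => Date.valueOf(d),
      d => d.toLocalDate)
}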
It should work as described here:
implicit def localDateOrdering: Ordering[LocalDate] = Ordering.fromLessThan(_ isBefore _)
Try adding this line to your import list:
import slick.driver.MySQLDriver.api._
