What is imported with spark.implicits._? - apache-spark

What is imported with import spark.implicits._? Does "implicits" refer to some package? If so, why could I not find it in the Scala Api documentation on https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package?

Scala allows you to import things "dynamically" into scope. You can also do something like this:
final case class Greeting(hi: String)

def greet(greeting: Greeting): Unit = {
  import greeting._ // everything in greeting is now available in scope
  println(hi)
}
The SparkSession instance carries along some implicits that you import into your scope with that import statement. The most important things you get are the Encoders necessary for a lot of operations on DataFrames and Datasets. It also brings into scope the StringContext extension necessary for you to use the $"column_name" notation.
The implicits member is an instance of SQLImplicits, whose source code (for version 2.3.1) you can view here.

Importing through an object is a Scala language feature, so the API documentation does not describe it separately. In the Apache Spark source code, implicits is an object defined inside the SparkSession class, and it extends SQLImplicits like this:
object implicits extends org.apache.spark.sql.SQLImplicits with scala.Serializable
SQLImplicits provides functionality such as:
Converting Scala objects to a Dataset (via toDS).
Converting Scala objects to a DataFrame (via toDF).
Converting "$name" into a Column.
By importing these implicits with import spark.implicits._, where spark is a SparkSession object, all of this functionality is brought into scope implicitly.

Related

Pattern to add functions to existing Python classes

I'm writing a helper library for pandas.
Similarly to Scala implicits, I would like to add my custom functions to all instances of an existing Python class (pandas.DataFrame in this case) over which I have no control: I cannot modify it, and I cannot extend it and ask users to use my extension instead of the original class.
import pandas as pd
df = pd.DataFrame(...)
df.my_function()
What's the suggested pattern to achieve this with Python 3.6+?
If exactly this is not achievable, what's the most common, robust, clear and least-surprising pattern used in Python for a similar goal? Can we get anything better by requiring Python 3.7+, 3.8+ or 3.9+?
I know it's possible to patch at runtime single instances or classes to add methods. This is not what I would like to do: I would prefer a more elegant and robust solution, applicable to a whole class and not single instances, IDE-friendly so code completion can suggest my_function.
My case is specific to pandas.DataFrame, so a solution applicable only to this class could be also fine, as long as it uses documented, official APIs of pandas.
In the code below I am creating a function with a single self argument.
This function is then assigned to an attribute of the pd.DataFrame class, after which it is callable as a method.
import pandas as pd

def my_new_method(self):
    print(type(self))
    print(self)

pd.DataFrame.new_method = my_new_method

df = pd.DataFrame({'col1': [1, 2, 3]})
df.new_method()
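If you want to stay on pandas' documented extension API rather than assigning attributes on pd.DataFrame directly, registering a custom accessor is the officially supported route. A minimal sketch, where the accessor name helper and the method my_function are illustrative choices, not part of pandas:
import pandas as pd

# Register a "helper" namespace that becomes available on every DataFrame.
@pd.api.extensions.register_dataframe_accessor("helper")
class HelperAccessor:
    def __init__(self, pandas_obj):
        self._df = pandas_obj

    def my_function(self):
        # Example helper: summarise the shape of the wrapped DataFrame.
        return f"rows={self._df.shape[0]}, cols={self._df.shape[1]}"

df = pd.DataFrame({'col1': [1, 2, 3]})
print(df.helper.my_function())
The methods then live under the accessor namespace (df.helper.my_function()) rather than directly on the DataFrame, which is the trade-off pandas makes to avoid clashes with its own method names.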

How do you project geometries from one EPSG to another with Spark/Geomesa?

I am "translating" some Postgis code to Geomesa and I have some Postgis code like this:
select ST_Transform(ST_SetSRID(ST_Point(longitude, latitude), 4326), 27700)
which converts a point geometry from 4326 to 27700 for example.
On Geomesa-Spark-sql documentation https://www.geomesa.org/documentation/user/spark/sparksql_functions.html I can see ST_Point but I cannot find any equivalent ST_Transform function. Any idea?
I have used the Sedona library for geoprocessing; it has an st_transform function which I have used and which works fine, so you can use it if you want. The official documentation is here: https://sedona.apache.org/api/sql/GeoSparkSQL-Function/#st_transform
GeoMesa now supports the function as well:
https://www.geomesa.org/documentation/3.1.2/user/spark/sparksql_functions.html#st-transform
For GeoMesa 1.x, 2.x, and the upcoming 3.0 release, there is no ST_Transform at present. One could write their own UDF using GeoTools (or another library) to do the transformation.
Admittedly, this would require some work.
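For what it's worth, here is a rough sketch of that write-your-own-UDF idea in PySpark, using pyproj in place of GeoTools; the function name, column names, and sample coordinates are illustrative assumptions, not GeoMesa or Spark API:
from pyproj import Transformer
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

spark = SparkSession.builder.getOrCreate()

@udf(returnType=ArrayType(DoubleType()))
def transform_4326_to_27700(lon, lat):
    # Building the Transformer inside the UDF keeps the sketch self-contained;
    # always_xy=True keeps (longitude, latitude) ordering on both sides.
    transformer = Transformer.from_crs("EPSG:4326", "EPSG:27700", always_xy=True)
    x, y = transformer.transform(lon, lat)
    return [float(x), float(y)]

df = spark.createDataFrame([(-0.1278, 51.5074)], ["longitude", "latitude"])
df.withColumn("bng", transform_4326_to_27700("longitude", "latitude")).show(truncate=False)
In real code you would want to reuse one Transformer per executor (for example via a pandas UDF) instead of rebuilding it per row.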
I recently ran into the same issue on Azure Databricks. I was able to do it by manually installing the JAR library from here,
and then running the following Scala code.
%scala
import org.locationtech.jts.geom._
import org.locationtech.geomesa.spark.jts._
import org.locationtech.geomesa.spark.geotools._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._

spark.withJTS // registers the JTS geometry types and spatial functions on this session

// data_points is assumed to be an existing DataFrame with LONGITUDE and LATITUDE columns
val projected = data_points
  .withColumn("geom", st_makePoint(col("LONGITUDE"), col("LATITUDE")))
  .withColumn("geom_5347", st_transform(col("geom"), lit("EPSG:4326"), lit("EPSG:5347")))

display(projected)
Good luck.

About pandas python 3

I am a beginner to Python 3 and I have a question. In pandas, is read_csv() a class or a method?
I suspect read_csv() is a class, because after you call data = pd.read_csv()
you can subsequently call data.head(), an action that seems possible only with a class, given the many methods available on data.
For example:
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(strategy='median')
imp_mean.fit(impute_num)
imputed_num = imp_mean.transform(impute_num)
imputed_num
As shown above, with the SimpleImputer class you first create an object and then call methods on that same object. It appears to be just the same with pd.read_csv(), so I think read_csv() must be a class.
I just checked the documentation for read_csv(), which says it returns a DataFrame. But if it is a method, why can you keep calling other methods on the result of read_csv()?
From my understanding so far, a method should only return a value, not something you can keep calling methods on.
Is it necessary to differentiate what type something is when using a new function, method or class? Or should I just think of them all as objects, because everything in Python is an object?
It's not a class or a method. It's a function. The resulting DataFrame is just the return value of read_csv(), not an instance of the read_csv class or something like that.
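A quick check in the interpreter makes the distinction concrete (a small illustrative snippet; exact type representations can vary slightly between pandas versions):
import pandas as pd

print(type(pd.read_csv))    # typically <class 'function'>: a module-level function
print(type(pd.DataFrame))   # <class 'type'>: DataFrame is a class

df = pd.DataFrame({'a': [1, 2]})
print(isinstance(df, pd.DataFrame))  # True: you get back a DataFrame instance
print(type(df.head))                 # <class 'method'>: head is bound to the df instance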
Note that pd here is the pandas module, not a class, and read_csv() is a function defined in that module.
The function returns an instance of the pandas.DataFrame class (in your case data).
As data is an instance of that class, it has all the member methods of pandas.DataFrame, such as head().

Make import module return a variable

I would like to make an imported module behave like an object, i.e. a dictionary.
E.g.
import module
print(module['key'])
and in module.py
return {'key':'access'}
This is very easy for a class by inheriting from dict, but how do I do this at the module level?
In particular, I want to dynamically build the dictionary in module and return it when module is imported.
I know that there are other solutions, such as defining the dict as a variable in the module namespace and accessing it via module.var, but I am interested in whether something like this is possible.
As you point out, you can do this with a class, but not with a module, as modules are not subscriptable. Now I'm not going to ask why you want to do this with an import, but it can be done.
What you do is create a class that does what you want, and then have the module replace itself in sys.modules with an instance of that class when it is imported. This is of course a 'bit of a hack' (tm). Here I'm using UserDict as it gives easy access to the dict via the class attribute data, but you could do anything you like in this class:
# module.py
from collections import UserDict
import sys
import types

class ModuleDict(types.ModuleType, UserDict):
    data = {'key': 'access'}

sys.modules[__name__] = ModuleDict(__name__)
Then you can import the module and use it as desired:
# code.py
import module

print(module['key'])
# access
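A variation on the same hack, assuming a reasonably recent Python 3: instead of replacing the sys.modules entry, you can reassign the module's __class__ in place to a ModuleType subclass that defines __getitem__. Only a sketch:
# module.py (alternative: keep the module object, change its class)
import sys
import types

class DictModule(types.ModuleType):
    _data = {'key': 'access'}

    def __getitem__(self, key):
        return self._data[key]

sys.modules[__name__].__class__ = DictModule
After import module, subscripting with module['key'] works, and any ordinary names defined in module.py stay accessible as attributes.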

Mutual imports; difference between import's standard, "from" and "as" syntax

Given this simple folder structure
/main.py
/project/a.py
/project/b.py
main.py is executed by the python interpreter and contains a single line, import project.a.
a and b are modules; they need to import each other. A way to achieve this would be
import project.[a|b]
When working with deeper nested folder structures you don't want to write the entire path every time you use a module, e.g.
import project.foo.bar
project.foo.bar.set_flag(project.foo.bar.SUPER)
Both from project import [a|b] and import project.[a|b] as [a|b] result in an import error (when used in both a and b).
What is the difference between the standard import syntax and the from or as syntax? Why does only the standard syntax work for mutual imports?
And more importantly, is there a simple and clean way to import modules that allows mutual imports and assigning shorter names to them (ideally the module's basename, e.g. bar in the case of project.foo.bar)?
When you do either import project.a or from project import a, the following happens:
The module object for project.a is placed into sys.modules. This is a dictionary that maps each module name to its module object, so you'll have sys.modules = {..., 'project.a': <module 'project.a' from '.../project/a.py'>, ...}.
The code for the module is executed.
The a attribute is added to project.
Now, here is the difference between import project.a and from project import a:
import project.a just looks for sys.modules['project.a']. If it exists, it binds the name project using sys.modules['project'].
from project import a looks for sys.modules['project'] and then checks if the project module has an a attribute.
You can think of from project import a as an equivalent to the following two lines:
import project.a # not problematic
a = project.a # causes an error
That is why you are seeing an exception only when doing from project import a: sys.modules['project.a'] exists, but project does not yet have an a attribute.
The quickest solution would be to simply avoid circular imports. But if you can't, then the usual strategies are:
Import as late as possible. Suppose that your a.py looks like this:
from project import b

def something():
    return b.something_else()
Rewrite it as follows:
def something():
    from project import b
    return b.something_else()
Of course, you would have to repeat imports in all your functions.
Use lazy imports. Lazy imports are not a standard feature of Python, but you can find many implementations around. They work by using the "import as late as possible" principle, but they add some syntactic sugar to let you write less code (a minimal helper is sketched at the end of this answer).
Cheat, and use sys.modules, like this:
import sys
import project.a
a = sys.modules['project.a']
Very un-pythonic, but works.
Obviously, whatever solution you choose, you won't be able to access the attributes from a or b until the modules have been fully loaded.
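For the lazy-import option mentioned above, the standard library documents a small recipe built on importlib.util.LazyLoader; here is a sketch of a helper along those lines (the name lazy_import and the usage with project.b are just for illustration):
import importlib.util
import sys

def lazy_import(name):
    # Create the module object now, but defer executing its code until the
    # first attribute access.
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)
    return module

b = lazy_import("project.b")  # b.something_else() triggers the real import later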
