I was wondering whether LOESS (locally estimated scatterplot smoothing) regression is built into Spark/PySpark (I'm more interested in the PySpark answer, but both would be interesting).
I did some research and couldn't find one, so I decided to code it myself using pandas UDFs. While doing so, I displayed a scatter plot of the synthetic data I had created to test my algorithm, and Azure Databricks (where I'm coding) offered to automatically compute and display the LOESS of my dataset.
So maybe there is indeed a built-in LOESS that I just couldn't find? If not (and Databricks alone is responsible for this), is there any way to access the result of Databricks's LOESS computation, or the function Databricks uses to do it?
Thank you in advance :)
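For reference, here is a minimal sketch of the kind of local regression I have in mind, in plain Python with no Spark involved. The function name loess and the frac parameter are my own choices for illustration, and this is not what Databricks computes for its plot: it fits a weighted straight line around each point, using tricube weights over the nearest frac share of the data.

```python
import math


def loess(xs, ys, frac=0.5):
    """Return smoothed y values using local linear fits with tricube weights.

    Hypothetical helper for illustration; `frac` is the share of points
    used in each local fit.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    k = max(2, min(n, math.ceil(frac * n)))  # neighbours per local fit
    fitted = []
    for x0 in xs:
        # The distance to the k-th nearest point sets the window width.
        h = sorted(abs(x - x0) for x in xs)[k - 1]
        if h == 0.0:
            # All neighbours share the same x: fall back to their mean.
            same = [y for x, y in zip(xs, ys) if x == x0]
            fitted.append(sum(same) / len(same))
            continue
        # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
        w = [(1.0 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Evaluate the local weighted line at x0.
        fitted.append(my + slope * (x0 - mx))
    return fitted


xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
smoothed = loess(xs, ys, frac=0.5)
```

The idea would be to call something like this on each group's data inside a pandas UDF, but I have not benchmarked it, and the loop is quadratic in the number of points.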
Related
I am working on something where I already have SQL code in place. We are now migrating to Azure, so I created an Azure Databricks workspace for the transformation piece and used the same SQL code with some minor changes.
I want to know: is there any recommended way or best practice for working with Azure Databricks?
Should we rewrite the code in PySpark for better performance?
Note: the end results from the previous SQL code have no bugs. It's just that we are migrating to Azure. Instead of spending time rewriting the code, I reused the same SQL code. Now I am looking for suggestions to understand the best practices and what difference they would make.
Looking for your help.
Thanks !
Expecting: along with the migration from on-prem to Azure, I am looking for some best practices for better performance.
Under the hood, all of the code (SQL/Python/Scala, if written correctly) is executed by the same execution engine. You can always compare the execution plans of SQL and Python (EXPLAIN <query> for SQL, and dataframe.explain() for Python) and see that they are the same for the same operations.
So if your SQL code is working already you may continue to use it:
You can trigger SQL queries/dashboards/alerts from Databricks Workflows
You can use SQL operations in Delta Live Tables (DLT)
You can use dbt together with Databricks Workflows
But often you can get more flexibility or functionality when using Python. For example (this is not a full list):
You can programmatically generate DLT tables that are performing the same transformations but on different tables
You can use streaming sources (SQL support for streaming isn't very broad yet)
You can integrate your code with third-party libraries
But really, on Databricks you can usually mix and match SQL and Python code. For example, you can expose Python code as a user-defined function and call it from SQL (small example of a DLT pipeline that does that), etc.
You asked a lot of questions there but I'll address the one you asked in the title:
Any benefits of using Pyspark code over SQL?
Yes.
PySpark is easier to test. For example, a transformation written in PySpark can be abstracted into a Python function, which can then be executed in isolation within a test, so you can use any of the myriad Python testing frameworks (personally I'm a fan of pytest). This isn't as easy with SQL, where a transformation exists within the confines of the entire SQL statement and can't be abstracted without views or user-defined functions, which are physical database objects that need to be created.
PySpark is more composable. One can pull together custom logic from different places (perhaps written by different people) to define an end-to-end ETL process.
PySpark's lazy evaluation is a beautiful thing. It allows you to compose an ETL process in an exploratory fashion, making changes as you go. It really is what makes PySpark (and Spark in general) a great thing, and the benefits of lazy evaluation can't really be explained; they have to be experienced.
Don't get me wrong, I love SQL, and for ad-hoc exploration it can't be beaten. There are good, justifiable reasons for using SQL over PySpark, but that wasn't your question.
These are just my opinions, others may beg to differ.
After getting help on the posted question and doing some research, I came up with the response below.
It does not matter which language you choose (SQL or Python). Both run on a Spark cluster, so Spark distributes the work across the cluster. Which one to use depends on the specific use case.
Intermediate results of both SQL and PySpark DataFrame operations are held in memory.
In the same notebook we can use both languages, depending on the situation.
Use Python - for heavy transformations (more complex data processing) or for analytical / machine learning purposes
Use SQL - when dealing with a relational data source (focused on querying and manipulating structured data stored in a relational database)
Note: there may be optimization techniques in both languages that we can use to improve performance.
Summary: choose the language based on the use case. Both get distributed processing because they run on a Spark cluster.
Thank you !
Is there a way to manually create a OneHotEncoderModel without learning it?
This is quite a simple model, and the only learned parameter (as far as I understand) is "categorySizes", which can be accessed using the _java_obj. But I cannot find a way to set it without using OneHotEncoder.fit(...) on a real dataset!
Sample code for what I want to achieve
model=OneHotEncoderModel(input='dayOfWeek',output='dayOfWeek_1hot',categorySizes=[7])
model.transform(data)
Is it possible to save a Spark ML pipeline to a database (Cassandra for example)? From the documentation I can only see the save to path option:
myMLWritable.save(toPath);
Is there a way to somehow wrap or change the myMLWritable.write() MLWriter instance and redirect the output to the database?
It is not possible (or at least not supported) at this moment. The ML writer is not extensible and depends on Parquet files and a directory structure to represent models.
Technically speaking, you can extract individual components and use the internal private API to recreate models from scratch, but that is likely the only option.
Spark 2.0.0+
At first glance, all Transformers and Estimators implement MLWritable. If you use Spark <= 1.6.0 and experience issues with model saving, I would suggest switching versions.
Spark >= 1.6
Since Spark 1.6 it's possible to save your models using the save method, because almost every model implements the MLWritable interface. For example, LogisticRegressionModel has it, so you can save your model to the desired path using it.
Spark < 1.6
Some operations on DataFrames can be optimized, which translates to improved performance compared to plain RDDs. DataFrames provide efficient caching, and the SQL-ish API is arguably easier to comprehend than the RDD API.
ML Pipelines are extremely useful, and tools like the cross-validator or the different evaluators are simply must-haves in any machine learning pipeline. Even if none of the above is particularly hard to implement on top of the low-level MLlib API, it is much better to have a ready-to-use, universal and relatively well-tested solution.
I believe that at the end of the day what you get by using ML over MLlib is a quite elegant, high-level API. One thing you can do is combine both to create a custom multi-step pipeline:
use ML to load, clean and transform data,
extract required data (see for example the extractLabeledPoints method) and pass it to an MLlib algorithm,
add custom cross-validation / evaluation,
save the MLlib model using a method of your choice (Spark model or PMML)
A temporary solution is also provided in Jira.
Spark 1.5+ has windowing functions. I believe there was comprehensive documentation for SQL somewhere, but I have been unable to find it.
Here are the docs for Spark DataFrame and SQL; they do NOT have the content sought:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
I have googled in a number of different ways and have been unable to find a comprehensive guide to the available SQL functions. The closest I could find is "spark 1.5 new Dataframe operations" here:
https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html
Update: I am looking specifically for a SQL reference, not an API (/scaladoc) reference, i.e. a reference showing the provided SQL functions, what their arguments are, their semantics, and maybe example usage.
There is a page about Windowing and analytics in the Wiki which covers the window specification, aggregate functions, and it contains some examples.
How about this? This is for Spark 2.4.0:
https://spark.apache.org/docs/2.4.0/api/sql/index.html#last_value
Databricks had a good introduction to window functions at https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html but the definitive documentation should always be the API docs; scroll right to the bottom:
Dataframe functions API documentation
What is the best way to write Google Cloud Dataflow output to Cassandra?
I don't seem to find many people doing it. After searching for a while, the only thing I found was: https://github.com/benjumanji/cassandra-dataflow which has only 3 commits and is 4 months old.
In general, is it a good idea to write Dataflow's output to Cassandra?
One possible approach would be to implement a custom sink (for batch): https://cloud.google.com/dataflow/model/custom-io#creating-sinks.