Snowflake Snowpark Python - Clarifications - python-3.x

Have a few questions regarding SnowPark with Python.
Why do we need Snowpark when we already have the Snowflake Python connector (which is free) and can use it to connect a Python Jupyter notebook to the Snowflake DW?
If we use Snowpark and connect from a local Jupyter notebook to run an ML model, does it use our local machine's computing power or Snowflake's computing power? If it is our local machine's computing power, how can we use Snowflake's computing power to run the ML model?

Snowpark for Python allows you to treat a Snowflake table like a Spark DataFrame. This means you can write Spark-style dataframe code against Snowflake tables without the need to pull the data out of Snowflake, and the compute is Snowflake compute, not your local machine, which is fully elastic.
As long as you are executing Snowpark dataframe logic in Python, the compute will be on the Snowflake side. If you pull that data back to your machine to execute other logic (pandas, for example), then Snowpark will pull the data back to your local machine and the compute will happen there as normal.
I recommend starting here to learn more:
https://docs.snowflake.com/en/developer-guide/snowpark/index.html
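As a rough illustration of where the compute happens, here is a minimal sketch (assuming snowflake-snowpark-python is installed; the connection parameters, MY_TABLE and CATEGORY are placeholders, not from the original question):

```python
# A minimal sketch, assuming snowflake-snowpark-python is installed;
# the connection parameters, MY_TABLE and CATEGORY are placeholders.
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Everything below is translated to SQL and runs inside a Snowflake
# warehouse -- no table rows are transferred to the local machine.
df = session.table("MY_TABLE")
counts = df.group_by("CATEGORY").count()
counts.show()

# Only an explicit conversion to pandas pulls the (already aggregated)
# rows down to the client; any pandas/scikit-learn work on them then
# uses local compute and memory.
local_pdf = counts.to_pandas()
```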

A couple of things to keep in mind: we are talking about multiple things here, so some clarification could be useful.
Snowpark is a library that you install through pip/conda and it is a dataframe library, meaning you can define a dataframe object that points to data in Snowflake (there are also ways to get data into Snowflake using it). It does not pull the data back to the client unless you explicitly tell it to, and all computation is done on the Snowflake side.
When you do operations on a Snowpark dataframe, you are writing Python code that generates SQL which is executed in Snowflake, using the same mechanism as if you wrote your own SQL. The execution of the generated SQL is triggered by action methods such as .show(), .collect(), save_as_table() and so on.
More information here
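A small hedged sketch of that mechanism, reusing the session object from the sketch above (the table and column names are placeholders):

```python
from snowflake.snowpark.functions import col

# Building the dataframe does not execute anything yet; Snowpark only
# records the operations and the SQL they will translate to.
df = (session.table("MY_TABLE")
             .filter(col("AMOUNT") > 100)
             .select("ID", "AMOUNT"))

# Inspect the SQL that Snowpark has generated so far.
print(df.queries)

# Only an action method triggers execution in Snowflake.
rows = df.collect()                            # run the SQL, return Row objects
df.show()                                      # run it and print a sample
df.write.save_as_table("MY_FILTERED_TABLE")    # run it and store the result
```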
As part of the Snowflake Python support there are also Python UDFs and Python Stored Procedures. You do not need Snowpark to create or use those, since you can do that with SQL using CREATE FUNCTION/CREATE PROCEDURE, but you can use Snowpark as well.
With Python UDFs and Python Stored Procedures you can bring Python code into Snowflake that will be executed on Snowflake compute; it is not translated into SQL but runs in Python sandboxes on the compute nodes.
In order to use Python Stored Procedures or Python UDFs you do not have to install anything; they are there like any other built-in feature of Snowflake.
More information about Python UDFs and information about Python Stored Procedures.
The Snowflake Python Connector allows you to write SQL that is executed on Snowflake, and the result is pulled back to the client to be used there, using the client's memory and so on. If you want your manipulation to be executed in Snowflake, you need to write SQL for it.
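To make the contrast concrete, here is a hedged connector sketch: the CREATE FUNCTION statement shows that a Python UDF can be created with plain SQL, no Snowpark needed, and the SELECT shows the query running in Snowflake while the result rows are fetched back into the client's memory. All names and credentials are placeholders:

```python
import snowflake.connector

# Placeholder connection parameters.
conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)
cur = conn.cursor()

# A Python UDF created with plain SQL -- no Snowpark required.
cur.execute("""
CREATE OR REPLACE FUNCTION add_one_py(x INT)
RETURNS INT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
HANDLER = 'add_one'
AS $$
def add_one(x):
    return x + 1
$$
""")

# The SQL executes in Snowflake; fetchall() then pulls the result rows
# back into this process, so any further work on them happens locally.
cur.execute("SELECT add_one_py(41)")
print(cur.fetchall())
```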

Using the existing Snowflake Python Connector you bring the Snowflake data to the system that is executing the Python program, limiting you to the compute and memory of that system. With Snowpark for Python, you are bringing your Python code to Snowflake to leverage the compute and memory of the cloud platform.

Snowpark Python provides the following benefits, which are not there with the Snowflake Python connector:
Users can bring their custom Python client code into Snowflake in the form of a UDF (user-defined function) and use these functions on a DataFrame (see the sketch after this list).
It allows data engineers, data scientists and data developers to code in their familiar way, with their language of choice, and to execute pipelines, ML workflows and data apps faster and more securely, in a single platform.
Users can build and work with queries using the familiar syntax of the DataFrame API (DataFrame-style programming).
Users can use popular libraries from Anaconda's curated channel; hundreds of open-source Python packages are pre-installed and available.
Snowpark operations are executed lazily on the server, which reduces the amount of data transferred between your client and the Snowflake database.
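A rough sketch of the first and last points above, registering a client-side Python function as a UDF and applying it lazily to a dataframe (the session, table and column names are placeholders):

```python
from snowflake.snowpark.functions import col, udf
from snowflake.snowpark.types import IntegerType

# Register a client-side Python function as a Snowflake UDF; the code is
# shipped to Snowflake and executed there, not on the client.
add_one = udf(lambda x: x + 1,
              return_type=IntegerType(),
              input_types=[IntegerType()])

# The transformation is lazy; nothing runs until the action method is
# called, and only the final result set comes back to the client.
df = session.table("MY_TABLE").select(add_one(col("AMOUNT")).alias("AMOUNT_PLUS_ONE"))
df.show()
```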
For more details, please refer to the documentation

I think that understanding Snowpark is complex. I think Mats' answer is really good. I created a blog post that I think provides some high-level guidance: https://www.mobilize.net/blog/lost-in-the-snowpark

Related

What does the Snowpark pushdown feature mean? Can I run my code on a separate cluster without moving it to Snowflake?

I am trying to understand what exactly the Snowpark pushdown feature does. From the documentation it already looks like the code is executed directly on Snowflake rather than on external clusters. Also, is it possible for me to run the code on my own cluster instead of Snowflake using Snowpark?
Snowpark supports pushdown for all operations, including Snowflake UDFs.
That means the data operations (applying filters, transformations etc.) are pushed down to the Snowflake engine; the database handles this workload and Snowpark just deals with the rest.
Snowpark does not require a separate cluster outside of Snowflake for computations. All of the computations are done within Snowflake.
Snowpark runs on Snowflake warehouses. You don't need to run your code on an external cluster of your own, such as EMR. You can create a local development environment, but Snowpark is designed to run on Snowflake:
https://docs.snowflake.com/en/developer-guide/snowpark/python/setup.html
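As a hedged illustration of the pushdown behaviour, assuming a Snowpark session has already been created as described in the setup guide (table and column names are placeholders):

```python
from snowflake.snowpark.functions import col, sum as sum_

# The filter and aggregation below are not executed locally; they are
# folded into a single SQL statement that Snowflake's engine runs.
df = (session.table("SALES")
             .filter(col("REGION") == "EMEA")
             .group_by("PRODUCT")
             .agg(sum_(col("AMOUNT")).alias("TOTAL_AMOUNT")))

# explain() prints the generated SQL / query plan that gets pushed down.
df.explain()

# The warehouse only does the work when an action method is called.
df.show()
```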

Read a Databricks table via the Databricks API in Python?

Using Python 3, I am trying to compare an Excel (xlsx) sheet to an identical Spark table in Databricks. I want to avoid doing the compare in Databricks. So I am looking for a way to read the Spark table via the Databricks API. Is this possible? How can I go on to read a table: DB.TableName?
There is no way to read the table from the DB API as far as I am aware unless you run it as a job as LaTreb already mentioned. However, if you really wanted to, you could use either the ODBC or JDBC drivers to get the data through your databricks cluster.
Information on how to set this up can be found here.
Once you have the DSN set up you can use pyodbc to connect to databricks and run a query. At this time the ODBC driver will only allow you to run Spark-SQL commands.
All that being said, it will probably still be easier to just load the data into Databricks, unless you have some sort of security concern.
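If you do go the ODBC route, a minimal pyodbc sketch might look like the following; the DSN name, table and filter are placeholders, and the DSN must already be configured for your Databricks cluster as described in the linked docs:

```python
import pyodbc

# "Databricks" is whatever name you gave the DSN when configuring the
# Spark ODBC driver for your cluster.
conn = pyodbc.connect("DSN=Databricks", autocommit=True)
cursor = conn.cursor()

# Only Spark SQL is supported through this driver, which is enough to
# pull the table (or a subset of it) back for a local comparison.
cursor.execute("SELECT * FROM DB.TableName WHERE load_date = '2020-01-01'")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()
```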
I can recommend writing pyspark code in a notebook, calling the notebook from a previously defined job, and establishing a connection between your local machine and the Databricks workspace.
You could perform the comparison directly in Spark, or convert the data frames to pandas if you wish. When the notebook finishes the comparison, it can return the result from that particular job. I think that sending back entire Databricks tables could be impossible because of API limitations; you have the Spark cluster to perform complex operations, and the API should be used to send small messages.
Official documentation:
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs#--runs-get-output
Retrieve the output and metadata of a run. When a notebook task returns a value through the dbutils.notebook.exit() call, you can use this endpoint to retrieve that value. Azure Databricks restricts this API to return the first 5 MB of the output. For returning a larger result, you can store job results in a cloud storage service.
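A hedged sketch of that flow: the notebook ends with dbutils.notebook.exit() returning a small result, and the client polls the runs/get-output endpoint. The workspace URL, token and run_id below are placeholders:

```python
import requests

# Placeholders: your workspace URL, a personal access token, and the
# run_id returned when you triggered the job.
host = "https://<workspace>.azuredatabricks.net"
token = "<personal-access-token>"
run_id = 12345

# Inside the notebook, the last cell would end with something like:
#   dbutils.notebook.exit(json.dumps({"rows_compared": 1000, "mismatches": 3}))

resp = requests.get(
    f"{host}/api/2.0/jobs/runs/get-output",
    headers={"Authorization": f"Bearer {token}"},
    params={"run_id": run_id},
)
resp.raise_for_status()

# "notebook_output" holds whatever the notebook passed to dbutils.notebook.exit(),
# truncated by Databricks to the first 5 MB.
print(resp.json().get("notebook_output", {}).get("result"))
```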

Jupyter as Zeppelin replacement: multi-lingual Spark

My team is trying to transition from Zeppelin to Jupyter for an application we've built, because Jupyter seems to have more momentum, more opportunities for customization, and to be generally more flexible. However, there are a couple of things in Zeppelin that we haven't been able to find equivalents for in Jupyter.
The main one is to have multi-lingual Spark support - is it possible in Jupyter to create a Spark data frame that's accessible via R, Scala, Python, and SQL, all within the same notebook? We've written a Scala Spark library to create data frames and hand them back to the user, and the user may want to use various languages to manipulate/interrogate the data frame once they get their hands on it.
Is Livy a solution to this in the Jupyter context, i.e. will it allow multiple connections (from the various language front-ends) to a common Spark back-end so they can manipulate the same data objects? I can't quite tell from Livy's web site whether a given connection only supports one language, or whether each session can have multiple connections to it.
If Livy isn't a good solution, can BeakerX fill this need? The BeakerX website says two of its main selling points are:
Polyglot magics and autotranslation, allowing you to access multiple languages in the same notebook, and seamlessly communicate between them;
Apache Spark integration including GUI configuration, status, progress, interrupt, and tables;
However, we haven't been able to use BeakerX to connect to anything other than a local Spark cluster, so we've been unable to verify how the polyglot implementation actually works. If we can get a connection to a Yarn cluster (e.g. an EMR cluster in AWS), would the polyglot support give us access to the same session using different languages?
Finally, if neither of those work, would a custom Magic work? Maybe something that would proxy requests through to other kernels, e.g. spark and pyspark and sparkr kernels? The problem I see with this approach is that I think each of those back-end kernels would have their own Spark context, but is there a way around that I'm not thinking of?
(I know SO questions aren't supposed to ask for opinions or recommendations, so what I'm really asking for here is whether a possible path to success actually exists for the three alternatives above, not necessarily which of them I should choose.)
Another possibility is the SoS (Script of Scripts) polyglot notebook: https://vatlab.github.io/sos-docs/index.html#documentation.
It supports multiple Jupyter kernels in one notebook. SoS has several natively supported languages (R, Ruby, Python 2 & 3, MATLAB, SAS, etc.). Scala is not supported natively, but it's possible to pass information to the Scala kernel and capture output. There's also a seemingly straightforward way to add a new language (one that already has a Jupyter kernel); see https://vatlab.github.io/sos-docs/doc/documentation/Language_Module.html
I am using Livy in my application. The way it works is that any user can connect to an already established Spark session using REST (asynchronous calls). We have a cluster to which Livy sends Scala code for execution. It is up to you whether you want to close the session after sending the Scala code or not. If the session is open, then anyone having access can send Scala code once again to do further processing. I have not tried sending different languages in the same session created through Livy, but I know that Livy supports three languages in interactive mode, i.e. R, Python and Scala. So, theoretically, you would be able to send code in any of those languages for execution.
Hope it helps to some extent.
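For reference, a hedged sketch of how that can look over Livy's REST API. The host is a placeholder, and using a shared session with a per-statement language requires a reasonably recent Livy (0.5+), so treat that detail as an assumption to verify against your version:

```python
import time

import requests

livy = "http://<livy-host>:8998"   # placeholder Livy endpoint
headers = {"Content-Type": "application/json"}

# Create an interactive session. With newer Livy versions you can use a
# shared session and choose the language per statement.
session = requests.post(f"{livy}/sessions", json={"kind": "shared"}, headers=headers).json()
session_id = session["id"]

# Wait until the session is idle before submitting statements.
while requests.get(f"{livy}/sessions/{session_id}", headers=headers).json()["state"] != "idle":
    time.sleep(2)

# Submit a PySpark statement...
requests.post(f"{livy}/sessions/{session_id}/statements",
              json={"kind": "pyspark", "code": "df = spark.range(10); df.count()"},
              headers=headers)

# ...and a Scala statement against the same session (and the same SparkSession).
requests.post(f"{livy}/sessions/{session_id}/statements",
              json={"kind": "spark", "code": "spark.range(10).count()"},
              headers=headers)
```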

What specific benefits can we get by using SparkSQL to access Hive tables compared to using JDBC to read tables from SQL server?

I just got this question while designing the storage part of a Hadoop-based platform. If we want data scientists to have access to tables which are already stored in a relational database (e.g. SQL Server on an Azure Virtual Machine), will there be any particular benefit in importing the tables from SQL Server into HDFS (e.g. WASB) and creating Hive tables on top of them?
In other words, since Spark allows users to read data from other databases using JDBC, is there any performance improvement if we persist the tables from the database in an appropriate format (Avro, Parquet, etc.) in HDFS and use Spark SQL to access them using HQL?
I am sorry if this question has been asked, I have done some research but could not get a comparison between the two methodologies.
I think there will be a big performance improvement, as the data is local (assuming Spark is running on the same Hadoop cluster where the data is stored in HDFS). With JDBC, if the actions/processing performed are interactive, the user has to wait for the data to be loaded through JDBC from another machine (network latency and I/O throughput), whereas if that is done upfront the user (data scientist) can concentrate on performing the actions straight away.
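To make the two access paths concrete, a hedged sketch (connection details, table names and paths are placeholders; the JDBC path also needs the SQL Server driver on the Spark classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Path 1: read over JDBC -- every action pulls data across the network
# from SQL Server through the Spark executors.
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
           .option("dbtable", "dbo.SalesTable")
           .option("user", "<user>")
           .option("password", "<password>")
           .load())

# Path 2: data already landed in HDFS as Parquet with a Hive table on top --
# reads stay local to the cluster and benefit from columnar pruning.
hive_df = spark.sql("SELECT * FROM analytics.sales_table")
# or, reading the files directly:
parquet_df = spark.read.parquet("hdfs:///warehouse/analytics/sales_table")
```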

Big Data Analytics using Redshift vs Spark, Oozie Workflow Scheduler with Redshift Analytics

We want to do Big Data Analytics on our data stored in Amazon Redshift (currently in Terabytes, but will grow with time).
Currently, it seems that all our Analytics can be done through Redshift queries (and hence, no distributed processing might be required at our end) but we are not sure if that will remain to be the case in future.
In order to build a generic system that should be able to cater our future needs as well, we are looking to use Apache Spark for data analytics.
I know that data can be read into Spark RDDs from HDFS, HBase and S3, but does it support data reading from Redshift directly?
If not, we can look to transfer our data to S3 and then read it in Spark RDDs.
My question is if we should carry out our Data Analytics through Redshift's queries directly or should we look to go with the approach above and do analytics through Apache Spark (Problem here is that Data Locality optimization might not be available)?
In case we do analytics through Redshift queries directly, can anyone please suggest a good Workflow Scheduler to write our Analytics jobs with. Our requirement is to be able to execute jobs as a DAG (Job2 should execute only if Job1 succeeds, etc) and be able to schedule our workflows through the proposed Workflow Engine.
Oozie seems like a good fit for our requirements but it turns out that Oozie cannot be used without Hadoop. Does it make sense to set up Hadoop on our machines and then use Oozie Workflow Scheduler to schedule our Data Analysis jobs through Redshift queries?
You cannot access data stored on Redshift nodes directly (e.g. via Spark), only via SQL queries submitted to the cluster as a whole.
My suggestion would be to use Redshift as long as possible and only take on the complexity of Spark/Hadoop when you absolutely need it.
If, in the future, you move to Hadoop then Cascading Lingual gives you the option of running your existing Redshift analytics more or less unchanged.
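If you do move to Spark, a hedged sketch of the two common access patterns, both consistent with the constraint above (endpoints, credentials and paths are placeholders; the JDBC option needs the Redshift or PostgreSQL driver on the classpath, and the S3 option needs the S3A filesystem configured):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option A: let Spark submit SQL to the Redshift cluster over JDBC
# (the cluster as a whole answers the query; Spark never touches the nodes directly).
redshift_df = (spark.read.format("jdbc")
               .option("url", "jdbc:redshift://<cluster-endpoint>:5439/<database>")
               .option("dbtable", "public.events")
               .option("user", "<user>")
               .option("password", "<password>")
               .load())

# Option B: UNLOAD the table to S3 from Redshift first, e.g.
#   UNLOAD ('SELECT * FROM public.events')
#   TO 's3://<my-bucket>/events/'
#   IAM_ROLE '<role-arn>' FORMAT AS PARQUET;
# and then read the files into Spark from S3.
s3_df = spark.read.parquet("s3a://<my-bucket>/events/")
```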
Regarding workflow, Oozie is not a good fit for Redshift. I would suggest you look at Azkaban (a true DAG scheduler) or Luigi (which uses a Python DSL).

Resources