Using Python 3, I am trying to compare an Excel (xlsx) sheet to an identical Spark table in Databricks. I want to avoid doing the comparison in Databricks, so I am looking for a way to read the Spark table via the Databricks API. Is this possible? How would I go about reading a table like DB.TableName?
As far as I am aware, there is no way to read the table through the Databricks API unless you run it as a job, as LaTreb already mentioned. However, if you really wanted to, you could use either the ODBC or JDBC drivers to get the data through your Databricks cluster.
Information on how to set this up can be found here.
Once you have the DSN set up you can use pyodbc to connect to Databricks and run a query. At this time the ODBC driver will only allow you to run Spark SQL commands.
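For example, here is a minimal sketch, assuming you have already created an ODBC DSN named "Databricks" for the Simba Spark driver (the DSN name and the database/table names are placeholders):

import pyodbc

# Connect through the DSN configured for the Databricks ODBC driver.
conn = pyodbc.connect("DSN=Databricks", autocommit=True)
cursor = conn.cursor()

# Only Spark SQL statements are supported through this driver.
cursor.execute("SELECT * FROM DB.TableName")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()

From there you could load the rows into pandas and do the comparison against the Excel sheet locally.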
All that being said, it will probably still be easier to just load the data into Databricks, unless you have some sort of security concern.
I can recommend writing PySpark code in a notebook, calling that notebook from a previously defined job, and establishing a connection between your local machine and the Databricks workspace.
You could perform the comparison directly in Spark, or convert the DataFrames to pandas if you wish. When the notebook finishes the comparison, it can return the result from that particular job run. Sending whole Databricks tables back is probably not feasible because of API limits; you have a Spark cluster to perform the complex operations, and the API should only be used to send small messages.
Official documentation:
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs#--runs-get-output
Retrieve the output and metadata of a run. When a notebook task
returns a value through the dbutils.notebook.exit() call, you can use
this endpoint to retrieve that value. Azure Databricks restricts this
API to return the first 5 MB of the output. For returning a larger
result, you can store job results in a cloud storage service.
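As a rough sketch of that flow (the workspace URL, token, and run id are placeholders, and the notebook is assumed to end with a dbutils.notebook.exit() call that returns the comparison result):

import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"
run_id = 12345  # the run id of the job run that executed the notebook

# In the notebook itself, the result would be returned with
# dbutils.notebook.exit(json.dumps(result)).
response = requests.get(
    workspace_url + "/api/2.0/jobs/runs/get-output",
    headers={"Authorization": "Bearer " + token},
    params={"run_id": run_id},
)
response.raise_for_status()

# The value passed to dbutils.notebook.exit() (first 5 MB only).
notebook_output = response.json().get("notebook_output", {})
print(notebook_output.get("result"))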
I have a few questions regarding Snowpark with Python.
Why do we need Snowpark when we already have the (free) Snowflake Python connector, which we can use to connect to the Snowflake DW from a Python Jupyter notebook?
If we use Snowpark and connect from a local Jupyter notebook to run an ML model, does it use our local machine's compute or Snowflake's compute? If it is our local machine's compute, how can we use Snowflake's compute to run the ML model?
Snowpark with Python allows you to treat a Snowflake table like a Spark DataFrame. This means you can run PySpark-style code against Snowflake tables without needing to pull the data out of Snowflake, and the compute is Snowflake compute, which is fully elastic, not your local machine.
As long as you are executing Spark DataFrame logic in Python, the compute will be on the Snowflake side. If you pull that data back to your machine to execute other logic (pandas, for example), then Snowpark will pull the data back to your local machine and the compute will happen there as normal.
I recommend starting here to learn more:
https://docs.snowflake.com/en/developer-guide/snowpark/index.html
A couple of things to keep in mind here, since we are talking about multiple things and some clarification could be helpful.
Snowpark is a library that you install through pip/conda. It is a DataFrame library, meaning you can define a DataFrame object that points to data in Snowflake (there are also ways to get data into Snowflake using it). It does not pull the data back to the client unless you explicitly tell it to, and all computation is done on the Snowflake side.
When you do operations on a Snowpark DataFrame you are using Python code that generates SQL which is executed in Snowflake, using the same mechanism as if you wrote your own SQL. The execution of the generated SQL is triggered by action methods such as .show(), .collect(), .save_as_table() and so on.
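As a minimal sketch (the connection parameters and the database, schema, and table names are placeholders):

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "MY_DB",
    "schema": "MY_SCHEMA",
}
session = Session.builder.configs(connection_parameters).create()

# Nothing is executed yet; this only builds up a SQL query.
df = (
    session.table("MY_DB.MY_SCHEMA.MY_TABLE")
    .filter(col("STATUS") == "ACTIVE")
    .select("ID", "AMOUNT")
)

# Action methods trigger execution in Snowflake; only the results come back.
df.show()
rows = df.collect()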
More information here
As part of the Snowflake Python support there are also Python UDFs and
Python Stored Procedures. You do not need Snowpark to create or use those, since you can do that with SQL using CREATE FUNCTION / CREATE PROCEDURE, but you can use Snowpark as well.
With Python UDFs and Python Stored Procedures you can bring Python code into Snowflake that will be executed on the Snowflake compute; it will not be translated into SQL, but will run in Python sandboxes on the compute nodes.
In order to use Python Stored Procedures or Python UDFs you do not have to set up anything extra; they are there like any other built-in feature of Snowflake.
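For illustration, a minimal sketch of registering a Python UDF through Snowpark (reusing the session from the sketch above; the function, column, and table names are made up):

from snowflake.snowpark.functions import udf, col
from snowflake.snowpark.types import FloatType

# The body of this function runs in a Python sandbox on the Snowflake compute,
# not on the client.
@udf(name="add_tax", replace=True, input_types=[FloatType()], return_type=FloatType())
def add_tax(amount: float) -> float:
    return amount * 1.25

df = session.table("MY_DB.MY_SCHEMA.MY_TABLE")
df.select(add_tax(col("AMOUNT")).alias("AMOUNT_WITH_TAX")).show()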
More information about Python UDFs and information about Python Stored Procedures.
The Snowflake Python connector allows you to write SQL that is executed in Snowflake, with the result pulled back to the client to be used there, using the client's memory and so on. If you want your manipulation to be executed in Snowflake, you need to write the SQL for it.
Using the existing Snowflake Python Connector you bring the Snowflake data to the system that is executing the Python program, limiting you to the compute and memory of that system. With Snowpark for Python, you are bringing your Python code to Snowflake to leverage the compute and memory of the cloud platform.
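For contrast, a minimal sketch with the Snowflake Python connector, where the result set is pulled back into the client's memory (the connection values and table name are placeholders):

import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="MY_DB",
    schema="MY_SCHEMA",
)
cur = conn.cursor()

# The SQL runs in Snowflake, but the rows are fetched into local memory.
cur.execute("SELECT ID, AMOUNT FROM MY_TABLE WHERE STATUS = 'ACTIVE'")
rows = cur.fetchall()
# Or pull the result into a local pandas DataFrame:
# df = cur.fetch_pandas_all()

cur.close()
conn.close()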
Snowpark Python provides the following benefits, which are not available with the Snowflake Python connector:
Users can bring their custom Python client code into Snowflake in the form of a UDF (user-defined function) and use these functions on DataFrames.
It allows data engineers, data scientists and data developers to code in their familiar way with their language of choice, and execute pipelines, ML workflows and data apps faster and more securely, in a single platform.
Users can build and work with queries using the familiar syntax of the DataFrame API (a DataFrame style of programming).
Users can use the popular libraries from Anaconda, which come pre-installed, giving access to hundreds of curated, open-source Python packages.
Snowpark operations are executed lazily on the server, which reduces the amount of data transferred between your client and the Snowflake database.
For more details, please refer to the documentation
I think that understanding Snowpark is complex. @Mats' answer is really good. I created a blog post that I think provides some high-level guidance: https://www.mobilize.net/blog/lost-in-the-snowpark
Pretty new to Databricks.
I've got a requirement to access data in the Lakehouse using a JDBC driver. This works fine.
I now want to stub the Lakehouse using a Docker image for some tests I want to write. Is it possible to get a Databricks / Spark Docker image with a database in it? I would also want to bootstrap the database on startup to create a bunch of tables.
No - Databricks is not a database but a hosted service (PaaS). Theoretically you can use OSS Spark with a Thrift server started on it, but the connection strings and other functionality would be very different, so it makes no sense to spend time on it (IMHO). The real solution would depend on the type of tests that you want to do.
Regarding bootstrapping the database and creating a bunch of tables: just issue those commands, such as CREATE DATABASE IF NOT EXISTS or CREATE TABLE IF NOT EXISTS, when your application starts up (see the documentation for the exact syntax).
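For example, a minimal sketch of such a bootstrap step in PySpark (the database, table, and column names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bootstrap-tests").getOrCreate()

# Idempotent setup: safe to run on every application start.
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS test_db.customers (
        id BIGINT,
        name STRING
    )
""")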
I'm currently using Databricks, and in order to test my Databricks code I'm using Databricks Connect in VS Code. Since yesterday it has suddenly started behaving strangely: when I submit code from VS Code, databricks-connect takes on the access of the person who created the Databricks cluster. He doesn't have adequate access to all the resources, so my test cases/code are failing because of access issues.
Some more input: I have a function that does an update operation on a Delta table, which means I can't use a normal Hive or temp table, as they don't support update operations.
I have tried Delta Lake locally, but that doesn't seem to work and is also not that convenient, as I wouldn't be able to access my ADLS location (unless I specifically make the configuration change).
So, my question is: how are you doing Spark-specific testing? (I'm using pytest.)
I found that there isn't much material on testing Databricks code on the internet. Any help?
I have been looking, with no success, for how to read an Azure Synapse table from Scala Spark. On https://learn.microsoft.com I found connectors for other Azure databases with Spark, but nothing for the new Azure Data Warehouse.
Does anyone know if it is possible?
It is now directly possible, and with trivial effort (there is even a right-click option added in the UI for this), to read data from a DEDICATED SQL pool in Azure Synapse (the new Analytics workspace, not just the DWH) for Scala (and unfortunately, ONLY Scala right now).
Within Synapse workspace (there is of course a write API as well):
val df = spark.read.sqlanalytics("<DBName>.<Schema>.<TableName>")
If you are outside of the integrated notebook experience, you need to add the imports:
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._
It sounds like they are working on expanding to SERVERLESS SQL pool, as well as other SDKs (e.g. Python).
Read top portion of this article as reference: https://learn.microsoft.com/en-us/learn/modules/integrate-sql-apache-spark-pools-azure-synapse-analytics/5-transfer-data-between-sql-spark-pool
Maybe I misunderstood your question, but normally you would use a JDBC connection in Spark to read data from a remote database.
check this doc
https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html
Keep in mind that Spark would have to ingest the data from the Synapse tables into memory for processing and perform the transformations there, so it is not going to push operations down into Synapse.
Normally, you want to run a SQL query against the source database and only bring the results of that SQL into the Spark DataFrame.
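For example, a rough sketch in PySpark using the Azure Synapse connector described in that doc (the JDBC URL, storage account, and query are placeholders, and spark is the session available in a Databricks notebook):

df = (
    spark.read
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>;user=<user>;password=<password>")
    .option("tempDir", "abfss://<container>@<storage-account>.dfs.core.windows.net/tmp")
    .option("forwardSparkAzureStorageCredentials", "true")
    # Run the heavy lifting in Synapse and only load the query result into Spark.
    .option("query", "SELECT col1, col2 FROM dbo.MyTable WHERE col1 > 0")
    .load()
)
df.show()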
I am currently trying to retrieve information from the EPA into our web app, which needs to utilize IBM Bluemix and Apache Spark. The information that we are gathering from the EPA is this:
https://aqs.epa.gov/api and ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/
But we are not only gathering historical data; we also want to update the data by inserting new data into the web app every hour. Concerning this, I have a few questions:
1) Do we need to open an HDFS to store all the data? Or could we just retrieve the data by its URL and store it in a DataFrame? IBM Bluemix says it provides 5 GB of storage, so how would one use that to store the historical data and the data that is updated every hour?
2) If we are going to update the data every hour by inserting new data into the data storage / DataFrame, should we still use Spark Streaming? If yes, how would we use Spark Streaming for URL data? A lot of the resources I see online are only useful if one has an HDFS / formal database.
What we are doing currently is pulling the data in from the URL:
url = "https://aqs.epa.gov/api/rawData?user=sogun3#gmail.com&pw=baycrane57&format=JSON¶m=44201&bdate=20110501&edate=20110501&state=37&county=063"
import urllib2
content = urllib2.urlopen(url).read()
print content
However, if we use this method, it means that Spark needs to be running 24/7 to ensure that the most up-to-date data is utilized. How does one configure Spark to run 24/7? Or is there a better method to process all the data and put it nicely in a DataFrame so that it can be accessed easily later?
Also, in a web app, can one still use IPython for data processing? Or is IPython just for interacting with the data and understanding it experimentally?
Thanks a lot!
You have options ;-) If you need to read the source EPA data and then process it before you use it in your web app, then you can use the Spark service to ETL (Extract, Transform, Load) the source data from the EPA web site, manipulate or wrangle the data into the shape and size you want, and then save it into a storage service like Bluemix Object Storage. Your web app would then read the data, in the format you want, directly from object storage.

However, if the source EPA data is largely in a format you already want to use in the web app, then you most certainly can create RDDs directly from the web site and pull in the data as and when you need it (see the sketch at the end of this answer). These datasets look small from my quick peek, so I don't think you need to worry about Spark pulling them directly into memory for you to work on; i.e. there is no need to try to store the data locally with Spark in the Bluemix service cluster. Besides, there is currently no HDFS provided by the Spark service, so as mentioned earlier, you would use an external storage service.

Re: "IBM bluemix said it would provide 5 GB of storage": that storage is intended for your personal and third-party Spark libraries and such.
re: "spark needs to be running 24-7". The spark service runs 24x7. Your spark code running on the service will run for as long as you program it to run ;-)
IPython (or Jupyter notebooks) is intended as a REPL for the web. So yes, it is interactive. In your case, you can certainly write your Spark code in an IPython notebook and have it run for as long as necessary, pulling and processing the EPA data for the web app and storing it in, say, object storage. The web app can then pull the data it needs from object storage. It is said that in the future APIs will be provided for the Spark service, at which point your web app could talk directly to the Spark service; in the meantime, you can certainly make something work with notebooks.
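To make the "pull it in directly" option concrete, here is a rough sketch of loading the EPA JSON straight into a Spark DataFrame inside a notebook (sc and sqlContext are the contexts predefined in the Bluemix notebook, and the credentials in the URL are placeholders):

import urllib2  # Python 2, matching the snippet in the question

url = ("https://aqs.epa.gov/api/rawData?user=<user>&pw=<password>"
       "&format=JSON&param=44201&bdate=20110501&edate=20110501&state=37&county=063")
content = urllib2.urlopen(url).read()

# Parallelize the raw JSON string and let Spark infer the schema.
df = sqlContext.read.json(sc.parallelize([content]))
df.show()

From here the DataFrame can be reshaped and, if needed, written out to object storage for the web app to read.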