Databricks: Export data profiling report

Databricks can create a data profiling report after using display(dataframe_name).
I have created a data profiling report using Azure Databricks, but I do not know how to export it.
Can you please suggest how to export/download this report to my local system?

There is no direct option to download the data profiling report from Azure Databricks to a local machine in tabular format.
Data profiling itself is a new feature that was introduced to reduce the manual work needed to summarize the statistics of our dataframes.
And as specified in this official Microsoft documentation, we can only add the data profile to our dashboard.
There are also no other APIs that can be used to download this data in tabular format.
As a possible workaround, you could compute the required statistics manually using the pandas or pandas-on-Spark API.
In general, some of these stats can be obtained directly with df.describe(), as shown below. Here df is a PySpark dataframe:
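A minimal sketch (the /dbfs/FileStore output path is only an assumption; any writable path works, and files under /FileStore can be downloaded through the workspace):

# Summary statistics (count, mean, stddev, min, max) for all columns of the PySpark dataframe
summary_df = df.describe()
summary_df.show()

# The summary is small, so convert it to pandas and write a CSV you can download locally
summary_df.toPandas().to_csv("/dbfs/FileStore/profile_summary.csv", index=False)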

Related

Writing data to a datastore using a Jupyter notebook on Azure ML Studio

Hi, I have prepared some data from a saved table from a Datastore in a Jupyter notebook on Azure ML Studio. Now, I want to write the prepared data back to a datastore using the same notebook.
Please help me with some examples.
Note: Here I have connected my ADLS Gen2 to the datastore.
Integration work includes enabling all datastore types to be consumable by data prep/dataset. This is very important, as data prep/dataset is the engine that powers the data ingestion story for Azure ML, and being able to support all datastore types is crucial in making this a reality. This covers runs that involve reading from and writing to a datastore using data prep/dataset.
The table below presents what we currently support.
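As a hedged sketch (the datastore name "adls_datastore", the target folder, and the dataframe name prepared_df are assumptions), one way to write the prepared pandas dataframe back to a registered ADLS Gen2 datastore with the azureml-core SDK is:

from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()                      # uses the workspace config available to the notebook
datastore = Datastore.get(ws, "adls_datastore")   # assumed name of the registered ADLS Gen2 datastore

# Upload the prepared pandas dataframe to the datastore and register it as a tabular dataset
Dataset.Tabular.register_pandas_dataframe(
    dataframe=prepared_df,                        # your prepared pandas dataframe
    target=(datastore, "prepared-data/"),         # assumed folder on the datastore
    name="prepared_data",
)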

Read a Databricks table via the Databricks API in Python?

Using Python 3, I am trying to compare an Excel (xlsx) sheet to an identical Spark table in Databricks. I want to avoid doing the compare in Databricks. So I am looking for a way to read the Spark table via the Databricks API. Is this possible? How can I go about reading a table: DB.TableName?
There is no way to read the table from the DB API as far as I am aware, unless you run it as a job as LaTreb already mentioned. However, if you really wanted to, you could use either the ODBC or JDBC drivers to get the data through your Databricks cluster.
Information on how to set this up can be found here.
Once you have the DSN set up, you can use pyodbc to connect to Databricks and run a query. At this time the ODBC driver will only allow you to run Spark SQL commands.
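For example, a minimal sketch with pyodbc (the DSN name "Databricks" and the table name are assumptions; use whatever you configured when setting up the driver):

import pyodbc

# Connect through the ODBC DSN configured for the Databricks cluster
conn = pyodbc.connect("DSN=Databricks", autocommit=True)
cursor = conn.cursor()

# Run a Spark SQL query against the table and fetch the rows locally
cursor.execute("SELECT * FROM DB.TableName")
for row in cursor.fetchall()[:10]:
    print(row)

conn.close()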
All that being said, it will probably still be easier to just load the data into Databricks, unless you have some sort of security concern.
I can recommend that you write PySpark code in a notebook, call the notebook from a previously defined job, and establish a connection between your local machine and the Databricks workspace.
You could perform the comparison directly in Spark, or convert the data frames to pandas if you wish. When the notebook finishes the comparison, it can return the result from that particular job. I think that sending whole Databricks tables could be impossible because of API limitations; you have a Spark cluster to perform complex operations, and the API should be used to send small messages.
Official documentation:
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs#--runs-get-output
Retrieve the output and metadata of a run. When a notebook task returns a value through the dbutils.notebook.exit() call, you can use this endpoint to retrieve that value. Azure Databricks restricts this API to return the first 5 MB of the output. For returning a larger result, you can store job results in a cloud storage service.
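As an illustrative sketch (the workspace URL, token, and run ID are placeholders): the notebook returns its result with dbutils.notebook.exit(), and the output can then be retrieved from your local machine through the runs get-output endpoint:

# Inside the Databricks notebook: return the comparison result as a string
# dbutils.notebook.exit(json.dumps({"rows_compared": 1000, "mismatches": 0}))

# On the local machine: fetch the returned value through the Jobs API
import requests

workspace_url = "https://<databricks-instance>"   # assumption: your workspace URL
token = "<personal-access-token>"                 # assumption: a valid personal access token
run_id = 12345                                    # assumption: the run id of the job run

resp = requests.get(
    workspace_url + "/api/2.0/jobs/runs/get-output",
    headers={"Authorization": "Bearer " + token},
    params={"run_id": run_id},
)
print(resp.json().get("notebook_output"))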

Ingest JDBC/ODBC data to Snowflake

Does Snowflake support JDBC data sources, and if so, how? I'm using Netsuite Analytics as a data source and would like to load that into a Snowflake warehouse. The examples I'm finding for Snowflake are file readers; I realise I can convert my Netsuite data to a file and then ingest that, but I'd rather remove that additional step.
Snowflake has both ODBC and JDBC drivers that you can use. However, if you are loading a lot of data from Netsuite Analytics, most of the Snowflake drivers will actually generate files, PUT them to S3, and execute a COPY INTO statement to get the data into Snowflake for you. While it is more seamless, it is still executing that "additional step". The reason is...that's the most efficient way to get data into Snowflake, and it's not even close.
https://docs.snowflake.com/en/user-guide/odbc.html
https://docs.snowflake.com/en/user-guide/jdbc.html
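For reference, a hedged sketch of that driver-managed path using the Python connector's write_pandas helper, which stages files and runs COPY INTO for you (the connection parameters, the netsuite_df dataframe, and the table name are all assumptions):

import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

conn = snowflake.connector.connect(
    account="my_account",       # assumption: your Snowflake account identifier
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)

# netsuite_df is an assumed pandas dataframe of exported Netsuite rows;
# the target table is assumed to exist already (newer connector versions can auto-create it)
success, nchunks, nrows, _ = write_pandas(conn, netsuite_df, "NETSUITE_DATA")
print(success, nrows)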
No, Snowflake doesn't offer tools for loading data from JDBC or ODBC data sources. This is because Snowflake is a database platform, and the functionality you're describing is that of a data integration or ETL tool. There are plenty of third-party tools available that can handle this, such as Matillion or Talend. Snowflake has a list of recommended technology partners on their website.
If you don't have access to an ETL tool then, as you mentioned, you can create a process yourself to export data from Netsuite to files that are uploaded to cloud storage such as AWS S3. You can then set up this storage area as an "external stage" and use Snowflake's COPY statement to load the data into Snowflake.
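A hedged sketch of that external-stage flow, again via the Python connector (the stage name, S3 URL, credentials, and target table are all assumptions):

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# Point an external stage at the S3 location holding the exported Netsuite files
cur.execute("""
    CREATE STAGE IF NOT EXISTS netsuite_stage
    URL = 's3://my-bucket/netsuite-exports/'
    CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# Bulk load the staged files into the target table
cur.execute("COPY INTO netsuite_data FROM @netsuite_stage")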

Bluemix Apache Spark Metrics

I have been looking for a way to monitor performance in Spark on Bluemix. I know in the Apache Spark project, they provide a metrics service based on the Coda Hale Metrics Library. This allows users to report Spark metrics to a variety of sinks including HTTP, JMX, and CSV files. Details here: http://spark.apache.org/docs/latest/monitoring.html
Does anyone know of any way to do this in the Bluemix Spark service? Ideally, I would like to save the metrics to a csv file in Object Storage.
Appreciate the help.
Thanks
Saul
Currently, I do not see an option to use the Coda Hale Metrics Library, report the job history, or access the information via a REST API.
However, on the main page of the Spark history server, you can see the event log directory. It refers to the following user directory: file:/gpfs/fs01/user/USER_ID/events/
There I saw JSON-formatted (or JSON-like) files.
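If those event log files are enough for your purposes, here is a small, hedged sketch of pulling a few task metrics out of them (it assumes each entry under events/ is a plain file with one JSON event per line, which may vary by Spark version):

import glob
import json
import os

for path in glob.glob("/gpfs/fs01/user/USER_ID/events/*"):
    if not os.path.isfile(path):
        continue
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("Event") == "SparkListenerTaskEnd":
                metrics = event.get("Task Metrics", {})
                print(metrics.get("Executor Run Time"), metrics.get("JVM GC Time"))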

Web URL information to Apache Spark in a web app

We are currently trying to retrieve information from the EPA into our web app, which needs to use IBM Bluemix and Apache Spark. The information that we are gathering from the EPA is this:
https://aqs.epa.gov/api and ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/
We are not only gathering historical data; we also want to update the data by inserting new data into the web app every hour. Concerning this, I have a few questions:
1) Do we need to open an HDFS to store all the data? Or could we just retrieve the data by its URL and store it in a dataframe? IBM Bluemix said it would provide 5 GB of storage, so how would one use that to store the historical data and the hourly updates?
2) If we are going to update the data every hour by inserting new data into the data storage / data frame, should we still use Spark Streaming? If yes, how would we use Spark Streaming for URL data? A lot of the resources I see online are only useful if one has an HDFS / formal database.
What we are doing currently is fetching the URL contents with urllib2:
import urllib2

url = "https://aqs.epa.gov/api/rawData?user=sogun3#gmail.com&pw=baycrane57&format=JSON&param=44201&bdate=20110501&edate=20110501&state=37&county=063"
content = urllib2.urlopen(url).read()
print content
However, if we use this method, it means that Spark needs to be running 24/7 to ensure that the most up-to-date data is used. How does one configure Spark to run 24/7? Or is there a better method to process all the data and put it nicely into a dataframe so that the data can be accessed easily later?
Also, in a web app, can one still use IPython for data processing? Or is IPython just for interacting with the data and understanding it experimentally?
Thanks a lot!
You have options ;-) If you need to read the source EPA data and then process it before you use it in your web app, then you can use the Spark service to ETL (Extract, Transform, Load) the source data from the EPA web site, manipulate or wrangle the data into the shape and size you want, and then save it into a storage service like Bluemix Object Storage. Your web app would then read the data in the format you want directly from Object Storage.
However, if the source EPA data is largely in a format you want to use in the web app, then you most certainly can create RDDs directly from the web site and pull in the data as and when you need it. These datasets look small from my quick peek, so I don't think you need to worry about Spark pulling them directly into memory for you to work on; i.e. there is no need to try to store them locally with Spark in the Bluemix service cluster. Besides, there is currently no HDFS provided by the Spark service, so as mentioned earlier, you would use an external storage service.
Re: "IBM bluemix said it would provide 5 GB of storage": that is intended for storing your personal and third-party Spark libraries and such.
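As a hedged sketch of pulling the EPA data straight into Spark from the notebook (it assumes the API response parses to a list of measurement records; the variable names are illustrative):

import json
import urllib2

# The same EPA endpoint shown in the question
url = "https://aqs.epa.gov/api/rawData?user=sogun3#gmail.com&pw=baycrane57&format=JSON&param=44201&bdate=20110501&edate=20110501&state=37&county=063"
raw = urllib2.urlopen(url).read()

# Assumption: the response parses to a list of measurement dicts
records = json.loads(raw)

# sqlContext is provided by the Bluemix/Jupyter Spark notebook environment
df = sqlContext.createDataFrame(records)
df.printSchema()
df.show(5)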
re: "spark needs to be running 24-7". The spark service runs 24x7. Your spark code running on the service will run for as long as you program it to run ;-)
IPython (or Jupyter notebooks) is intended as a REPL for the web. So, yes, interactive. In your case, you can certainly write your spark code in an IPython notebook and have that run for as long as necessary, pulling and processing the EPA data for the web app, storing it in say object storage. The web app can then pull the data it needs from object storage. It is said that in the future APIs we will provided for the spark service, at which point your web app could talk directly to the spark service; in the meantime, you can certainly make something work with notebooks.
