Pyspark Display not showing chart in Jupyter - apache-spark

I have the following line of code:
display(df2.groupBy("TransactionDate").sum("Amount").orderBy("TransactionDate"))
Which according to this document:
https://docs.databricks.com/user-guide/visualizations/index.html#visualizations-in-python
Should give me a chart in Jupyter. Instead I get the following output:
DataFrame[TransactionDate: timestamp, sum(Amount): double]
How come?

Which according to this document
...
Should give me a chart in Jupyter.
It should not. display is a feature of the proprietary Databricks platform, not a feature of Spark itself, so unless you use Databricks' own notebook environment (which is not Jupyter), it won't be available to you.
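In a plain Jupyter notebook the usual workaround is to collect the (small) aggregated result to pandas and plot it locally. A minimal sketch, assuming the df2 from the question and that matplotlib is installed:

import matplotlib.pyplot as plt

# Collect the aggregated result to the driver as a pandas DataFrame
pdf = (df2.groupBy("TransactionDate")
          .sum("Amount")
          .orderBy("TransactionDate")
          .toPandas())

# Plot locally; "sum(Amount)" is the column name Spark generates for the aggregate
pdf.plot(x="TransactionDate", y="sum(Amount)", kind="line")
plt.show()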

Related

How to display markdown output in databricks notebook from a python cell

With IPython/Jupyter it's possible to output markdown using the IPython display module and its Markdown class.
Question
How can I accomplish this with Azure Databricks?
What I tried
Databricks display
Tried using Databricks' display with the IPython Markdown class:
from IPython.display import Markdown
display(Markdown('*some markdown* test'))
but this results in the following error:
Exception: Cannot call display(<class 'IPython.core.display.Markdown'>)
IPython display
I then tried to use IPython's display:
from IPython.display import display, Markdown
display(Markdown('*some markdown* test'))
but this just displays the text:
<IPython.core.display.Markdown object>
IPython display_markdown
Tried using IPython's display_markdown:
from IPython.display import display_markdown
display_markdown('# Markdown is here!\n*some markdown*\n- and\n- some\n- more')
but this results in nothing showing up.
Looking up documentation
Also tried checking the Azure Databricks documentation. I started at https://www.databricks.com/databricks-documentation, which leads to https://learn.microsoft.com/en-ca/azure/databricks/, but I wasn't able to find anything relevant by searching or following the links, and I usually find Microsoft documentation quite good.
Checking Databricks' display source
As Saideep Arikontham mentioned in the comments, Databricks Runtime 11 and above uses the IPython kernel, so I dug a bit deeper.
According to Databricks' source for the display function, it will readily render any object that implements _repr_html_().
However, I'm having a hard time getting the raw HTML output that I assumed IPython.display.Markdown would be able to produce. I can only find _repr_markdown_() and _data_and_metadata(), where the former just calls the latter, and the output, at least in Databricks, is just the original raw markdown string.
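A quick check makes the mismatch visible (a minimal sketch; the attribute set is what I see in stock IPython, so treat it as an assumption for other versions):

from IPython.display import Markdown

m = Markdown('*some markdown* test')
print(hasattr(m, '_repr_html_'))      # False: Markdown does not provide an HTML repr
print(hasattr(m, '_repr_markdown_'))  # True
print(m._repr_markdown_())            # just echoes the raw markdown string

So Databricks' display(), which looks for _repr_html_(), has nothing to render and raises the exception above.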
Markdown and display_markdown do not give the desired output when used in Azure Databricks. I have done the following on the Databricks 11.1 runtime.
Taking inputs from the question, I understood that when a class implements _repr_html_(), display() is able to output the desired result. But when this method is absent from the class, it just returns an object.
So, to make Markdown work, I have written my own Markdown class using Python's markdown library.
from IPython.display import TextDisplayObject

class Markdown(TextDisplayObject):
    def __init__(self, text):
        import markdown as md
        # convert the markdown string to HTML
        self.html = md.markdown(text)

    def _repr_html_(self):
        return self.html
Now, this class is not exactly the same as IPython.display.Markdown. I have reformatted your sample markdown
'# Markdown is here!\n*some markdown*\n- and\n- some\n- more' as follows to get the desired result.
Markdown('''# Markdown is here!\n
*some markdown*\n
- and\n
- some\n
- more''')
NOTE:
For display_markdown() to display output, we must pass the additional argument raw=True (display_markdown(<markdown_str>, raw=True)). However, in Databricks it still returns nothing (NoneType).
Please install the markdown library first by running %pip install markdown in a Databricks cell.
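For reference, this is how the raw=True call looks in a stock Jupyter kernel (a sketch; as noted above, on Databricks it rendered nothing for me):

from IPython.display import display_markdown

# raw=True publishes the string as markdown data instead of printing an object repr
display_markdown('# Markdown is here!\n*some markdown*\n- and\n- some\n- more', raw=True)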

double percent spark sql in jupyter notebook

I'm using a Jupyter Notebook on a Spark EMR cluster and want to learn more about a certain command, but I don't know which technology stack to search. Is it Spark? Python? Jupyter-specific syntax? PySpark?
When I try to google it, I get only a couple of results and none of them actually include the content I quoted. It's like it ignores the %%.
What does "%%spark_sql" do, what does it originate from, and what are arguments you can pass to it like -s and -n?
An example might look like
%%spark_sql -s true
select
*
from df
These are called magic commands/functions. Try running %pinfo %%spark_sql or %pinfo2 %%spark_sql in a Jupyter cell and see if it gives you detailed information about %%spark_sql.
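If %pinfo doesn't turn anything up, a couple of generic ways to inspect an unknown magic (a sketch; it assumes the extension that provides %%spark_sql, e.g. sparkmagic on EMR, is already loaded in the kernel):

# list every line and cell magic currently registered in this kernel
%lsmagic

# look up the Python callable implementing the cell magic (None if it is not registered)
fn = get_ipython().find_cell_magic('spark_sql')
if fn is not None:
    print(fn.__doc__)   # the docstring, if any, typically documents flags such as -s and -n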

%run magic using get_ipython().run_line_magic() in Databricks

I am trying to import other modules inside an Azure Databricks notebook. For instance, I want to import the module called 'mynbk.py' that is at the same level as my current Databricks notebook called 'myfile'
To do so, inside 'myfile', in a cell, I use the magic command:
%run ./mynbk
And that works fine.
Now, I would like to achieve the same result, but using get_ipython().run_line_magic().
I thought, this is what I needed to type:
get_ipython().run_line_magic('run', './mynbk')
Unfortunately, that does not work. The error I get is:
Exception: File `'./mynbk.py'` not found.
Any help is appreciated.
It won't work on Databricks because IPython's commands don't know about the Databricks-specific implementation: IPython's %run expects a file to execute, but Databricks notebooks aren't files on disk; they are data stored in a database. So %run from IPython can't find the notebook, and you get the error.
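For contrast, a sketch of the two approaches (the second assumes dbutils is available in the notebook; dbutils.notebook.run is Databricks' own API for running another workspace notebook, and unlike %run it executes in a separate context rather than importing the other notebook's variables):

# IPython's %run resolves a filename on the driver's filesystem, so on Databricks it fails:
get_ipython().run_line_magic('run', './mynbk')   # Exception: File `'./mynbk.py'` not found.

# Databricks-specific alternative: run the workspace notebook by path (timeout in seconds).
result = dbutils.notebook.run('./mynbk', 600)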

Invalid date: Error while importing CSV to Cassandra using pySpark

I'm using a Jupyter Notebook to run pySpark code that imports a CSV file into Cassandra v3.11.3, and I'm getting the error below.
[screenshot of the error; the pasted stack trace ends with "... 1 more"]
The pySpark code is attached as a picture:
[screenshot: pyspark_code]
Any inputs...
Without the full trace it's hard to know exactly where this is failing. The method you pasted is just the py4j wrapper method, and we really would need to see the underlying Java exception.
From what I can tell, it looks like you are also attempting to use some options on the C* write that are unsupported. For example, "MODE" - "DROPMALFORMED" is not a valid C* connector option. DataFrameWriter and DataFrameReader options are source-specific, so you are unfortunately unable to mix and match them.
This makes me think that the data being written actually has a malformed date string or two, and that this code is dying when attempting to write the broken record. One way around this would be to do the date casting on the CSV read, which I believe does support DROPMALFORMED-style parsing options.
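A rough sketch of that idea (the path, schema, column names, and keyspace/table names below are made up for illustration; adjust the timestamp format to whatever the file actually uses):

# Handle malformed rows at read time, where the CSV reader supports mode=DROPMALFORMED
df = (spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")           # silently drop rows that fail to parse
      .option("timestampFormat", "yyyy-MM-dd")   # assumption: match the file's real format
      .schema("id INT, event_date TIMESTAMP, amount DOUBLE")
      .csv("/path/to/input.csv"))

# Then write only clean, correctly typed rows to Cassandra
(df.write
   .format("org.apache.spark.sql.cassandra")
   .options(keyspace="my_keyspace", table="my_table")
   .mode("append")
   .save())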

Writing into a Jupyter Notebook from Python

Is it possible for a Python script to write into an iPython Notebook?
with open("my_notebook.ipynb", "w") as jup:
jup.write("print(\"Hello there!\")")
If there's some package for doing so, can I also control the way cells are split in the notebook?
I'm designing a software tool (that carries out some optimization) to prepare an iPython notebook that can be run on some server performing scientific computations.
I understand that a related solution is to output to a Python script and load it within an iPython Notebook using %load my_python_script.py. However, that requires the user to type things that I would ideally like to avoid.
Look at the nbformat repo on GitHub. The reference implementation is shown there.
From their docs
Jupyter (né IPython) notebook files are simple JSON documents, containing text, source code, rich media output, and metadata. Each segment of the document is stored in a cell.
It also sounds like you want to create the notebook programmatically, so you should use the NotebookNode object.
For the code, something like the following should get you what you need. new_code_cell should be used if you have code cells, versus just plain text cells; text cells should use the existing markdown formatting.
import nbformat

# create an empty v4 notebook
notebook = nbformat.v4.new_notebook()

# add a markdown cell containing the text
text = """Hello There"""
notebook['cells'] = [nbformat.v4.new_markdown_cell(text)]

# write the notebook to disk
nbformat.write(notebook, 'filename.ipynb')
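If you also want an executable cell (to match the print from the question), the same v4 API has new_code_cell; a small sketch:

# append a code cell and re-write the file
notebook['cells'].append(nbformat.v4.new_code_cell('print("Hello there!")'))
nbformat.write(notebook, 'filename.ipynb')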
