double percent spark sql in jupyter notebook - apache-spark

I'm using a Jupyter Notebook on a Spark EMR cluster and want to learn more about a certain command, but I don't know which technology stack is the right one to search: is it Spark? Python? Special Jupyter syntax? PySpark?
When I try to Google it, I get only a couple of results, and none of them actually include the content I quoted. It's as if the search ignores the %%.
What does "%%spark_sql" do, where does it originate from, and what are the arguments you can pass to it, like -s and -n?
An example might look like
%%spark_sql -s true
select
*
from df

These are called magic commands/functions. Try running %pinfo %%spark_sql or %pinfo2 %%spark_sql in a Jupyter cell and see if it gives you detailed information about %%spark_sql.
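If the kernel's help is sparse, it helps to know that %%spark_sql is not part of Spark or plain Python: it is a cell magic registered with IPython, typically by an installed extension (on EMR, something like sparkmagic or a custom kernel extension). As a rough sketch only, a cell magic with that name could be registered like this; the flag handling below is illustrative, not the real implementation:
# Run in a Jupyter cell. Illustrative sketch: registers a cell magic named
# "spark_sql" so that %%spark_sql becomes available; the real one on EMR is
# provided by an installed extension and parses flags such as -s and -n itself.
from IPython.core.magic import register_cell_magic

@register_cell_magic("spark_sql")
def spark_sql(line, cell):
    # `line` holds everything after the magic name (e.g. "-s true"),
    # `cell` holds the SQL body of the cell.
    print("flags:", line.split())
    print("query:", cell)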

Related

Is it possible to install a Databricks notebook into a cluster similarly to a library?

I want the outputs/functions/definitions of a notebook to be available for use by other notebooks in the same cluster, without always having to run the original one over and over...
For instance, I want to avoid:
definitions_file: has multiple commands, functions etc...
notebook_1
#invoking definitions file
%run ../../0_utilities/definitions_file
notebook_2
#invoking definitions file
%run ../../0_utilities/definitions_file
.....
Therefore, I want definitions_file to be available to all other notebooks running in the same cluster.
I am using Azure Databricks.
Thank you!
No, there is no such thing as a "shared notebook" that is implicitly imported. The closest you can get is to package your code as a Python library or put it into a Python file inside Repos, but you will still need to write from my_cool_package import * in every notebook.
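As a rough sketch of what the Repos approach looks like in practice (file, function, and table names below are hypothetical):
# Hypothetical Repo layout (plain .py files checked into the Repo, not notebooks):
#   my_repo/
#     utils/
#       __init__.py
#       cleaning.py
#
# utils/cleaning.py might contain:
#     def drop_empty_rows(df):
#         """Drop rows that are entirely null."""
#         return df.dropna(how="all")
#
# In any notebook inside the Repo, a regular import then replaces %run:
from utils.cleaning import drop_empty_rows

cleaned = drop_empty_rows(spark.table("raw_events"))  # placeholder table name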

%run magic using get_ipython().run_line_magic() in Databricks

I am trying to import other modules inside an Azure Databricks notebook. For instance, I want to import the module called 'mynbk.py' that is at the same level as my current Databricks notebook called 'myfile'
To do so, inside 'myfile', in a cell, I use the magic command:
%run ./mynbk
And that works fine.
Now, I would like to achieve the same result, but using get_ipython().run_line_magic().
I thought this is what I needed to type:
get_ipython().run_line_magic('run', './mynbk')
Unfortunately, that does not work. The error I get is:
Exception: File `'./mynbk.py'` not found.
Any help is appreciated.
It won't work on Databricks because IPython commands don't know about the Databricks-specific implementation. IPython's %run expects a file to execute, but Databricks notebooks aren't files on disk; they are data stored in a database, so IPython's %run can't find the notebook, and you get that error.
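To illustrate the difference (paths below are placeholders): in plain IPython, run_line_magic('run', ...) resolves an actual file on disk, whereas on Databricks the shared code has to exist as a real .py file (for example in a Repo) before a normal import can reach it.
# In plain Jupyter/IPython this works, provided mynbk.py is a real file on disk:
get_ipython().run_line_magic("run", "./mynbk.py")

# On Databricks, %run is handled by the notebook runtime rather than IPython,
# so the call above fails. If the shared code is kept as a plain .py file in a
# Repo (hypothetical path below), a regular import works instead:
import sys
sys.path.append("/Workspace/Repos/me@example.com/my_repo")  # placeholder path
import mynbk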

Invalid date: Error while importing CSV to Cassandra using pySpark

I'm using a Jupyter Notebook to run pySpark code that imports a CSV file into Cassandra v3.11.3, and I'm getting the error below.
(The full error traceback and the pySpark code were attached as screenshots and are not reproduced here.)
Any inputs...
Without the full trace it's hard to know exactly where this is failing. The method you pasted is just the py4j wrapper method, and we would really need to see the underlying Java exception.
From what I can tell, it looks like you are also attempting to use some options on the C* write that are unsupported. For example, "mode" = "DROPMALFORMED" is not a valid C* connector option. DataFrameWriter and DataFrameReader options are source-specific, so you are unfortunately unable to mix and match them.
This makes me think that the data being written actually has a malformed date string or two, and this code is dying when attempting to write the broken record. One way around this would be to do the date casting on the CSV read, which I believe does support DROPMALFORMED-style parsing options.
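A minimal sketch of that split, with placeholder paths, schema, keyspace, and table names (and assuming the spark-cassandra-connector package is on the classpath):
# Parse dates, and drop malformed rows, during the CSV read where DROPMALFORMED
# is a valid option; then write to Cassandra without any reader-only options.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType, DoubleType

spark = SparkSession.builder.appName("csv-to-cassandra").getOrCreate()

schema = StructType([
    StructField("id", StringType()),
    StructField("event_date", DateType()),   # parsed as a date on read
    StructField("amount", DoubleType()),
])

df = (spark.read
      .schema(schema)
      .option("header", "true")
      .option("mode", "DROPMALFORMED")       # drop rows whose fields fail to parse
      .option("dateFormat", "yyyy-MM-dd")
      .csv("/path/to/input.csv"))

(df.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="my_keyspace", table="my_table")
 .mode("append")
 .save())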

Writing into a Jupyter Notebook from Python

Is it possible for a Python script to write into an IPython Notebook?
with open("my_notebook.ipynb", "w") as jup:
    jup.write("print(\"Hello there!\")")
If there's some package for doing so, can I also control the way cells are split in the notebook?
I'm designing a software tool (that carries out some optimization) to prepare an IPython notebook that can be run on some server performing scientific computations.
I understand that a related solution is to output to a Python script and load it within an IPython Notebook using %load my_python_script.py. However, that requires the user to type things that I would ideally like to avoid.
Look at the nbformat repo on GitHub. The reference implementation is shown there.
From their docs:
Jupyter (né IPython) notebook files are simple JSON documents, containing text, source code, rich media output, and metadata. Each segment of the document is stored in a cell.
It also sounds like you want to create the notebook programmatically, so you should use the NotebookNode object.
For the code, something like the following should get you what you need. new_code_cell should be used for code cells, as opposed to plain text cells; text cells should use the existing Markdown formatting.
import nbformat

# Build a v4 notebook object and attach cells programmatically.
notebook = nbformat.v4.new_notebook()
text = """Hello There """
notebook['cells'] = [
    nbformat.v4.new_markdown_cell(text),                  # a Markdown (text) cell
    nbformat.v4.new_code_cell('print("Hello there!")'),   # a code cell
]
nbformat.write(notebook, 'filename.ipynb')
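If the generated notebook then needs to be run on the server without anyone opening it, one option (a sketch going slightly beyond the question, using the nbclient package) is:
# Execute a programmatically created notebook and save the executed copy.
import nbformat
from nbclient import NotebookClient

nb = nbformat.read("filename.ipynb", as_version=4)
NotebookClient(nb).execute()
nbformat.write(nb, "filename_executed.ipynb")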

Pyspark Display not showing chart in Jupyter

I have the following line of code:
display(df2.groupBy("TransactionDate").sum("Amount").orderBy("TransactionDate"))
Which according to this document:
https://docs.databricks.com/user-guide/visualizations/index.html#visualizations-in-python
Should give me a chart in Jupyter. Instead I get the following output:
DataFrame[TransactionDate: timestamp, sum(Amount): double]
How come?
Which according to this document
...
Should give me a chart in Jupyter.
It should not. display is a feature of the proprietary Databricks platform, not a feature of Spark, so unless you use their notebook flavor (based on Zeppelin, not Jupyter), it won't be available to you.
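In plain Jupyter, a common workaround (a sketch, assuming matplotlib is installed and using the column names from the question) is to collect the small aggregate to pandas and plot it locally:
import matplotlib.pyplot as plt

# Collect the aggregated (small) result to the driver and plot it there,
# instead of relying on Databricks' display().
pdf = (df2.groupBy("TransactionDate")
          .sum("Amount")
          .orderBy("TransactionDate")
          .toPandas())

pdf.plot(x="TransactionDate", y="sum(Amount)", kind="line")
plt.show()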
