Running Scala with pixiedust in jupyter notebook - python-3.x

I'm trying to run some Scala code in Python 3 in a Jupyter notebook. I installed pixiedust to make this easier, and importing it works. However, according to the tutorial (https://pixiedust.github.io/pixiedust/scalabridge.html), I should be able to use %scala and then run Scala code. This is not working for me; I get this error: UsageError: Cell magic %%scala not found.
I have tried both %scala and %%scala, but neither works.
Does anyone know of a different syntax or how this could work?
Thanks!

Related

Pyspark -No module named coverage_daemon

I am trying to execute this simple code in my dataframe:
import ast
rddAlertsRdd = df.rdd.map(lambda message: ast.literal_eval(message['value']))
rddAlerts = rddAlertsRdd.collect()
But I'm getting the error below:
Versions:
Spark: 3.3.1
Hadoop: 2.7
Python: 3.7
Pyspark: 3.3.1
Py4j: 0.10.9.5
OpenJDK: 8
Can it be a problem related to compatibility versions? Appreciate your help!
In order to solve the problem I tried to change Spark environment variables in my Dockerfile.
This is what I have in my Dockerfile:
tl;dr I don't know exactly what is wrong, but here is a little more about the possible cause, based on reading the source code. Hope this helps.
The only place with coverage_daemon is python/test_coverage/conf/spark-defaults.conf which (as you may've guessed already) is for test coverage and does not seem to be used in production.
It appears that for some reason python/run-tests-with-coverage got executed.
It looks as if you're using a Jupyter environment that is misconfigured.
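One way to test the misconfiguration theory above is to check which Python daemon module the active Spark configuration names. The sketch below is only an illustration. It assumes the relevant setting is spark.python.daemon.module and that the configuration text comes from a spark-defaults.conf file using whitespace-separated keys and values. The helper name find_daemon_module and the sample text are hypothetical.

```python
def find_daemon_module(conf_text):
    """Return the value of spark.python.daemon.module in conf_text, or None.

    Assumption: entries are written as "key value" separated by whitespace.
    """
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "spark.python.daemon.module":
            return parts[1].strip()
    return None


# Hypothetical configuration text, for illustration only.
sample_conf = """
# example spark-defaults.conf content
spark.master local[2]
spark.python.daemon.module coverage_daemon
"""

print(find_daemon_module(sample_conf))
```

If a check like this reports coverage_daemon for the configuration your notebook actually loads, the environment is probably picking up the test-coverage configuration rather than a production one.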

Is it possible to install a Databricks notebook into a cluster similarly to a library?

I want the outputs/functions/definitions of a notebook to be available to other notebooks in the same cluster without having to run the original one over and over...
For instance, I want to avoid:
definitions_file: has multiple commands, functions etc...
notebook_1
#invoking definitions file
%run ../../0_utilities/definitions_file
notebook_2
#invoking definitions file
%run ../../0_utilities/definitions_file
.....
Therefore, I want definitions_file to be available to all other notebooks running in the same cluster.
I am using Azure Databricks.
Thank you!
No, there is no such thing as a "shared notebook" that is implicitly imported. The closest thing you can do is to package your code as a Python library, or put it into a Python file inside Repos, but you will still need to write from my_cool_package import * in every notebook.

%run magic using get_ipython().run_line_magic() in Databricks

I am trying to import other modules inside an Azure Databricks notebook. For instance, I want to import the module called 'mynbk.py' that is at the same level as my current Databricks notebook called 'myfile'
To do so, inside 'myfile', in a cell, I use the magic command:
%run ./mynbk
And that works fine.
Now, I would like to achieve the same result using get_ipython().run_line_magic().
I thought this is what I needed to type:
get_ipython().run_line_magic('run', './mynbk')
Unfortunately, that does not work. The error I get is:
Exception: File `'./mynbk.py'` not found.
Any help is appreciated.
This won't work on Databricks because IPython knows nothing about the Databricks-specific implementation. IPython's %run expects a file on disk to execute, but Databricks notebooks aren't files on disk; they are data stored in a database. IPython's %run therefore can't find the notebook, and you get the error.

Breusch-Pagan_test in Python 3

I am trying to run the Breusch-Pagan test in Python 3 using the code below. It works perfectly in Python 2.7, but when I run it in Anaconda with Python 3.6 instead, I get the following error: "module 'statsmodels.stats.api' has no attribute 'het_breuschpagan'".
I have looked at the statsmodel documentation at this link, https://www.statsmodels.org/devel/generated/statsmodels.stats.diagnostic.het_breuschpagan.html, and know that I am running the right code.
import statsmodels.stats.api as sms
breuschpagan_test = sms.het_breuschpagan(model_run.resid, model.model.exog)
Does anyone know a solution to this or a different way to call this statsmodel function in python 3?
Also, due to limitations at work, I cannot uninstall/re-install or update my statsmodel library at the moment either.
Thanks in advance!
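One possibility, offered as an assumption rather than a confirmed diagnosis: some older statsmodels releases exposed this function under the misspelled name het_breushpagan, so the installed version may simply lack the corrected spelling. The sketch below looks up whichever name is present without touching the installed library. The helper name resolve_breusch_pagan is hypothetical, and the stand-in namespace exists only so the example runs without statsmodels.

```python
from types import SimpleNamespace


def resolve_breusch_pagan(module):
    """Return the Breusch-Pagan function from module under either spelling."""
    # Try the current spelling first, then the older misspelled one (assumed).
    for name in ("het_breuschpagan", "het_breushpagan"):
        func = getattr(module, name, None)
        if func is not None:
            return func
    raise AttributeError("no Breusch-Pagan function found on %r" % (module,))


# Stand-in for an older statsmodels.stats.api, for illustration only.
fake_old_api = SimpleNamespace(het_breushpagan=lambda resid, exog: "called")

bp_test = resolve_breusch_pagan(fake_old_api)
print(bp_test([], []))
```

With the real library, the call would be resolve_breusch_pagan(sms) after import statsmodels.stats.api as sms, and the returned function takes the same arguments as in the question.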

Setting PYSPARK_SUBMIT_ARGS causes creating SparkContext to fail

A little backstory to my problem: I've been working on a Spark project and recently switched my OS to Debian 9. After the switch, I reinstalled Spark version 2.2.0 and started getting the following error when running pytest:
E Exception: Java gateway process exited before sending the driver its port number
After googling for a little while, it looks like people have been seeing this cryptic error in two situations: 1) when trying to use spark with java 9; 2) when the environment variable PYSPARK_SUBMIT_ARGS is set.
It looks like I'm in the second scenario, because I'm using java 1.8. I have written a minimal example
from pyspark import SparkContext
import os
def test_whatever():
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages graphframes:graphframes:0.5.0-spark2.1-s_2.11,com.databricks:spark-avro_2.11:3.2.0 pyspark-shell'
    sc = SparkContext.getOrCreate()
It fails with said error, but when the fourth line is commented out, the test is fine (I invoke it with pytest file_name.py).
Removing this env variable is not a solution to this problem (at least I don't think it is), because it passes some important information to SparkContext. I can't find any documentation on this and am completely lost.
I would appreciate any hints on this.
Putting this at the top of my jupyter notebook works for me:
import os
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64/'
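The two settings discussed above can be prepared together before any SparkContext is created. This is a minimal sketch, not a confirmed fix: the helper names build_submit_args and configure_env are hypothetical, and the JAVA_HOME path is the one from the answer, which may differ on your machine. The example writes into a plain dict so that running it does not alter the real environment.

```python
import os


def build_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value that ends with 'pyspark-shell'."""
    return "--packages {} pyspark-shell".format(",".join(packages))


def configure_env(java_home, packages, env=None):
    """Set JAVA_HOME and PYSPARK_SUBMIT_ARGS on env (os.environ by default)."""
    env = os.environ if env is None else env
    env["JAVA_HOME"] = java_home
    env["PYSPARK_SUBMIT_ARGS"] = build_submit_args(packages)
    return env


# Demonstration on a plain dict instead of the real environment.
demo_env = configure_env(
    "/usr/lib/jvm/java-8-openjdk-amd64/",
    [
        "graphframes:graphframes:0.5.0-spark2.1-s_2.11",
        "com.databricks:spark-avro_2.11:3.2.0",
    ],
    env={},
)
print(demo_env["PYSPARK_SUBMIT_ARGS"])
```

In a notebook or test, calling configure_env without the env argument would apply the values to os.environ, and it needs to happen before SparkContext.getOrCreate() is called.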
