Combining PyCharm, Spark and Jupyter - apache-spark

In my current setup I use a Jupyter notebook server that has a pyspark profile to use Spark. This all works great. However, I'm working on a pretty big project and the notebook environment is lacking a bit for me. I found out that PyCharm allows you to run notebooks inside the IDE, giving you more of the advantages of a full IDE than Jupyter does.
In the best-case scenario I would run PyCharm locally rather than over remote desktop on the gateway, but using the gateway would be an acceptable alternative.
I'm first trying to get it to work on the gateway. With my (Spark) Jupyter server running and the address set correctly to 127.0.0.1:8888, I create an .ipynb file; as soon as I enter a line and press Enter (not running it, just adding a newline), I get the following error in the terminal I started PyCharm from:
ERROR - pplication.impl.LaterInvocator - Not a stub type: Py:IPNB_TARGET in class org.jetbrains.plugins.ipnb.psi.IpnbPyTargetExpression
Googling doesn't get me anywhere.

I was able to get all three working by installing Spark via the terminal on OS X. Then I added the following packages to the PyCharm project interpreter: findspark and pyspark.
I tested it out with:
import findspark
findspark.init()  # make the local Spark installation visible to this interpreter
import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    # draw a random point and check whether it falls inside the unit quarter-circle
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
outputting: 3.14160028
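If SPARK_HOME is not picked up automatically, findspark can also be pointed at the install location explicitly. A minimal sketch, where the path is only a hypothetical example of where a terminal install might land:
import findspark
# Hypothetical install location; replace with wherever Spark was actually installed.
findspark.init("/usr/local/opt/apache-spark/libexec")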

Related

Multiprocessing on Jupyter notebook (Windows)

I'm trying to run the following simple code using multiprocessing in a Jupyter notebook on Windows:
from multiprocessing import Process

def worker():
    """worker function"""
    print('Worker')
    return

p = Process(target=worker)
p.start()
I could not get it working. As suggested in a few other posts, I tried to call the worker function from another file and it still didn't work.
I came across a few posts related to this but I could not find any working solution.
Am I doing something wrong here? The same code works fine on a Linux distribution.
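For context (my addition, not from the question): on Windows, multiprocessing starts child processes with spawn rather than fork, so the worker normally has to live in an importable module and the launch has to sit behind an entry-point guard. A minimal sketch of that pattern, which the asker reports having tried, using a hypothetical worker_module.py:
# worker_module.py (hypothetical file)
def worker():
    """worker function"""
    print('Worker')

# notebook cell / main script
from multiprocessing import Process
from worker_module import worker

if __name__ == '__main__':
    p = Process(target=worker)
    p.start()
    p.join()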

Why is ipython returning NameError while decorating interact?

I was assigned to use @interact in a Sage project. The code provided in the class notes is:
@interact
def show_crank(angle = slider(0,2*pi,pi/20,pi/10,label='angle')):
    center = (0,0)
    endpnt = (cos(angle),sin(angle))
    pltcnt = point(center, size = 50)
    pltend = point(endpnt, size = 50)
    crank = line([center,endpnt])
    (pltcnt + crank + pltend).show(xmin=-1,xmax=1,ymin=-1,ymax=1)
However, when I attempt to run this,
NameError: name 'interact' is not defined
is returned. I'm currently using Jupyter Lab. Using the Python 3 or Sage kernels results in the same issue.
I've read articles SO suggested to me, though no solution jumped out at me. Here are the articles for posterity:
NameError (from a function) while using iPython Notebook
Ipython notebook : Name error for Imported script function
Functions NameError
Python NameError: name is not defined
NameErrors and functions in python
Make sure ipywidgets is installed.
Activate the widgets extension with: jupyter nbextension enable --py widgetsnbextension
For Jupyter Lab, use: jupyter labextension install @jupyter-widgets/jupyterlab-manager
Finally, import ipywidgets as widgets and from ipywidgets import *.
This solves one part of the problem, but not another one that arose:
def show_crank(angle = slider(0,2*pi,pi/20,pi/10,label='angle')):
now prompts NameError: name 'slider' is not defined. When using Jupyter Lab with ipywidgets, the correct call is FloatSlider.
EDIT: Credit for this goes to Will Koehrsen.
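To illustrate the FloatSlider suggestion, here is a minimal sketch using plain ipywidgets; as an assumption on my part, the Sage-specific plotting calls (point, line, show) are replaced by a simple printout of the crank endpoint:
import math
from ipywidgets import interact, FloatSlider

@interact(angle=FloatSlider(min=0, max=2*math.pi, step=math.pi/20,
                            value=math.pi/10, description='angle'))
def show_crank(angle):
    # Sage's point/line/show plotting is omitted; just report the endpoint.
    endpnt = (math.cos(angle), math.sin(angle))
    print('crank endpoint:', endpnt)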

Package Cell Issue on Databricks Community Edition

I followed https://docs.databricks.com/notebooks/package-cells.html
On Community Edition (latest release, Spark 3.x):
A.1. Created the package with an object as per the example.
A.2. Ran it in the same notebook in a different cell without a cluster re-start. No issues; it runs fine.
package x.y.z
object Utils {
  val aNumber = 5 // works!
  def functionThatWillWork(a: Int): Int = a + 1
}
import x.y.z.Utils
Utils.functionThatWillWork(Utils.aNumber)
B.1. Ran this in a different notebook without a cluster re-start. Error.
import x.y.z.Utils
Utils.functionThatWillWork(Utils.aNumber)
C.1. Re-started the cluster and ran the import. Error.
import x.y.z.Utils
Utils.functionThatWillWork(Utils.aNumber)
Question
Is this an issue with Community Edition? I don't think so, but I cannot place it. My observations contradict the official docs.

Why my first Spark/YARN app doesn't start (spark-submit error)

I am a newbie in distributed systems and big data. I recently started with Hadoop/YARN and Spark (Spark on the YARN platform) for my graduation project, and for now I am stuck.
I want to start my first Spark application, but I don't know what the issue is. When I use spark-submit to start the Python script
#!/usr/bin/env python
from pyspark import SparkContext
from numpy import array

sc = SparkContext("local[*]", appName="app")
data = sc.textFile("test.txt")
print(data.collect())
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
print(parsedData.collect())
this error shows up (unable to load Hadoop library ...).
If someone can help me, please.
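One detail worth flagging (my observation, not from the original post): the script hardcodes the master as local[*], and a master set directly in the code takes precedence over the --master flag passed to spark-submit, so the app will not actually run on YARN. A minimal sketch that leaves the choice of master to spark-submit:
#!/usr/bin/env python
# Sketch only: omit the hardcoded master so that
# spark-submit --master yarn decides where the app runs.
from pyspark import SparkContext

sc = SparkContext(appName="app")
data = sc.textFile("test.txt")
print(data.collect())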

Setting PYSPARK_SUBMIT_ARGS causes creating SparkContext to fail

A little backstory to my problem: I've been working on a Spark project and recently switched my OS to Debian 9. After the switch, I reinstalled Spark version 2.2.0 and started getting the following error when running pytest:
E Exception: Java gateway process exited before sending the driver its port number
After googling for a little while, it looks like people have been seeing this cryptic error in two situations: 1) when trying to use Spark with Java 9; 2) when the environment variable PYSPARK_SUBMIT_ARGS is set.
It looks like I'm in the second scenario, because I'm using Java 1.8. I have written a minimal example:
from pyspark import SparkContext
import os

def test_whatever():
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages graphframes:graphframes:0.5.0-spark2.1-s_2.11,com.databricks:spark-avro_2.11:3.2.0 pyspark-shell'
    sc = SparkContext.getOrCreate()
It fails with said error, but when the line setting PYSPARK_SUBMIT_ARGS is commented out, the test is fine (I invoke it with pytest file_name.py).
Removing this env variable is not a solution to the problem (at least I don't think it is), because it passes important information to SparkContext. I can't find any documentation in this regard and am completely lost.
I would appreciate any hints on this.
Putting this at the top of my Jupyter notebook works for me:
import os
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64/'
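As a side note (my own suggestion, not part of the original answer), the same package list can also be handed to Spark through SparkConf rather than the PYSPARK_SUBMIT_ARGS environment variable; a minimal sketch:
from pyspark import SparkConf, SparkContext

# Sketch only: pass the packages via spark.jars.packages instead of
# setting PYSPARK_SUBMIT_ARGS in the environment.
conf = SparkConf().setAppName("test").set(
    "spark.jars.packages",
    "graphframes:graphframes:0.5.0-spark2.1-s_2.11,com.databricks:spark-avro_2.11:3.2.0",
)
sc = SparkContext.getOrCreate(conf)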
