Can I run pyspark locally without installing spark on windows 10? - apache-spark

I need to create a proof of concept using pyspark, and I was wondering if there is a way to install it via pip and use it without having to install and configure Spark itself. I've read a few answers suggesting that newer versions of pyspark let you run in standalone mode without needing a full Spark installation, but when I try that I get the following error:
Traceback (most recent call last):
File "C:\Users\320181940\PycharmProjects\meetup\main.py", line 8, in <module>
sc = SparkContext("local", "meetup_etl")
File "C:\Users\320181940\PycharmProjects\meetup\venv\lib\site-packages\pyspark\context.py", line 144, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "C:\Users\320181940\PycharmProjects\meetup\venv\lib\site-packages\pyspark\context.py", line 331, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "C:\Users\320181940\PycharmProjects\meetup\venv\lib\site-packages\pyspark\java_gateway.py", line 101, in launch_gateway
proc = Popen(command, **popen_kwargs)
File "C:\Python310\lib\subprocess.py", line 966, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Python310\lib\subprocess.py", line 1435, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified
I installed pyspark 3.1.3 using pip, and I'm trying to run this on Windows 10. Any help would be much appreciated.

You need to install Java and set JAVA_HOME in your environment variables.
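If you don't want to touch the system settings, the variable can also be set from the script itself before the SparkContext is created. This is only a minimal sketch; the JDK path below is a placeholder for wherever Java is installed on your machine:
import os

# Placeholder path: point this at your actual JDK installation.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_281"

from pyspark import SparkContext

# spark-submit is launched as a subprocess, so it inherits this environment.
sc = SparkContext("local", "meetup_etl")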

Start a Python interpreter, create a Spark session, and run your code. Here's an example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([
    ["I'm ready!"],
    ["If I could put into words how much I love waking up at 6 am on Mondays I would."],
]).toDF("text")
df.show()
Also make sure to set up HADOOP_HOME as specified in this gist.
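For completeness, HADOOP_HOME can be set the same way; on Windows it is typically the directory whose bin folder contains winutils.exe, as described in the gist. The path here is again a placeholder:
import os

# Placeholder layout: %HADOOP_HOME%\bin is expected to contain winutils.exe.
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] += os.pathsep + os.path.join(os.environ["HADOOP_HOME"], "bin")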

Related

problems installing sqlite and python on freebsd

This is driving me crazy! Any help you can provide will be most welcome!
I have a python3/sqlite application (running in a virtual environment) that is working fine on debian. I need to install it on freebsd (running in a virtual environment). I have installed python3 and sqlite (I can open a .sqlite file from the command line).
When I try to run the python script, I get the following error:
(venv) [jordan#webServer ~/crons/powerwall]$ python3 main.py
Traceback (most recent call last):
File "/usr/home/jordan/crons/powerwall/main.py", line 78, in <module>
run()
File "/usr/home/jordan/crons/powerwall/main.py", line 33, in run
database.load_db(config_mgr=config)
File "/usr/home/jordan/crons/powerwall/database.py", line 108, in load_db
db = PowerWallDb(cfg_mgr=config_mgr)
File "/usr/home/jordan/crons/powerwall/database.py", line 94, in __init__
super().__init__(cfg_mgr=cfg_mgr, section=section)
File "/usr/home/jordan/crons/powerwall/venv/lib/python3.9/site-packages/thompcoutils/db_utils.py", line 77, in __init__
self._connect_sqlite(self.sqlite_file, check_same_thread=False)
File "/usr/home/jordan/crons/powerwall/venv/lib/python3.9/site-packages/thompcoutils/db_utils.py", line 99, in _connect_sqlite
self._connect_uri(uri, **kwargs)
File "/usr/home/jordan/crons/powerwall/venv/lib/python3.9/site-packages/thompcoutils/db_utils.py", line 95, in _connect_uri
self.connection = sqlobject.sqlhub.processConnection = sqlobject.connectionForURI(uri, **kwargs)
File "/usr/home/jordan/crons/powerwall/venv/lib/python3.9/site-packages/sqlobject/dbconnection.py", line 1105, in connectionForURI
conn = connCls.connectionFromURI(uri)
File "/usr/home/jordan/crons/powerwall/venv/lib/python3.9/site-packages/sqlobject/dbconnection.py", line 154, in connectionFromURI
return cls._connectionFromParams(*cls._parseURI(uri))
File "/usr/home/jordan/crons/powerwall/venv/lib/python3.9/site-packages/sqlobject/sqlite/sqliteconnection.py", line 122, in _connectionFromParams
return cls(filename=path, **args)
File "/usr/home/jordan/crons/powerwall/venv/lib/python3.9/site-packages/sqlobject/sqlite/sqliteconnection.py", line 64, in __init__
raise ImportError(
ImportError: Cannot find an SQLite driver, tried supersqlite,pysqlite2,sqlite3,sqlite
Exception ignored in: <function DBAPI.__del__ at 0x8029c1310>
Traceback (most recent call last):
File "/usr/home/jordan/crons/powerwall/venv/lib/python3.9/site-packages/sqlobject/dbconnection.py", line 704, in __del__
self.close()
File "/usr/home/jordan/crons/powerwall/venv/lib/python3.9/site-packages/sqlobject/sqlite/sqliteconnection.py", line 217, in close
if self._memory:
AttributeError: 'SQLiteConnection' object has no attribute '_memory'
You have to install py-sqlite3 which is the "Standard Python binding to the SQLite3 library"
Install From Ports:
cd /usr/ports/databases/py-sqlite3/ && make install clean
Install From pkg:
pkg install databases/py-sqlite3
Basically, it looks like the standard Python bindings for SQLite are a separate package on FreeBSD (perhaps on all *nix/*BSD). So, in general, there are three components you need: Python, SQLite (this one may not actually be necessary for Python), and the standard Python bindings for SQLite. I have not worked with FreeBSD, but based on Googling, have you tried installing this: https://pkgs.org/download/py39-sqlite3?
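Once the package is installed, a quick sanity check from inside the virtual environment confirms the binding is importable (per the error message, sqlobject tries supersqlite, pysqlite2, sqlite3, and sqlite in turn, so a working sqlite3 module is enough):
# If this runs without an ImportError, sqlobject can find the sqlite3 driver.
import sqlite3

print(sqlite3.sqlite_version)  # version of the linked SQLite library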

Jupyter notebook Python kernel - FileNotFoundError: [Errno 2] No such file or directory python3

Question
In Jupyter Notebook, how to solve the issue of the Python interpreter not being found.
Environment
Ubuntu 18.04
Anaconda environment with Python 3.7
Problem
Start a Jupyter notebook, create a notebook with the Python 3 kernel, and get the error below. nlp_in_tensorflow was a conda environment that had already been removed.
Traceback (most recent call last):
File "/home/user/conda/envs/cs231n/lib/python3.7/site-packages/tornado/web.py", line 1704, in _execute
result = await result
File "/home/user/conda/envs/cs231n/lib/python3.7/site-packages/tornado/gen.py", line 769, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/home/user/conda/envs/cs231n/lib/python3.7/site-packages/notebook/services/sessions/handlers.py", line 72, in post
type=mtype))
File "/home/user/conda/envs/cs231n/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/home/user/conda/envs/cs231n/lib/python3.7/site-packages/tornado/gen.py", line 769, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/home/user/conda/envs/cs231n/lib/python3.7/site-packages/notebook/services/sessions/sessionmanager.py", line 88, in create_session
kernel_id = yield self.start_kernel_for_session(session_id, path, name, type, kernel_name)
File "/home/user/conda/envs/cs231n/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/home/user/conda/envs/cs231n/lib/python3.7/site-packages/tornado/gen.py", line 769, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/home/user/conda/envs/cs231n/lib/python3.7/site-packages/notebook/services/sessions/sessionmanager.py", line 101, in start_kernel_for_session
self.kernel_manager.start_kernel(path=kernel_path, kernel_name=kernel_name)
File "/home/user/conda/envs/cs231n/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/home/user/conda/envs/cs231n/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py", line 176, in start_kernel
kernel_id = await maybe_future(self.pinned_superclass.start_kernel(self, **kwargs))
File "/home/user/conda/envs/cs231n/lib/python3.7/site-packages/jupyter_client/multikernelmanager.py", line 185, in start_kernel
km.start_kernel(**kwargs)
File "/home/user/conda/envs/cs231n/lib/python3.7/site-packages/jupyter_client/manager.py", line 313, in start_kernel
self.kernel = self._launch_kernel(kernel_cmd, **kw)
File "/home/user/conda/envs/cs231n/lib/python3.7/site-packages/jupyter_client/manager.py", line 220, in _launch_kernel
return launch_kernel(kernel_cmd, **kw)
File "/home/user/conda/envs/cs231n/lib/python3.7/site-packages/jupyter_client/launcher.py", line 131, in launch_kernel
proc = Popen(cmd, **kwargs)
File "/home/user/conda/envs/cs231n/lib/python3.7/subprocess.py", line 800, in __init__
restore_signals, start_new_session)
File "/home/user/conda/envs/cs231n/lib/python3.7/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/home/user/conda/envs/nlp_in_tensorflow/bin/python3': '/home/user/conda/envs/nlp_in_tensorflow/bin/python3'
Cause
The kernel.json file of the Python 3 kernel was pointing to the deleted environment.
$ jupyter kernelspec list
Available kernels:
  python3    /home/oonisim/.local/share/jupyter/kernels/python3
$ cat ~/.local/share/jupyter/kernels/python3/kernel.json
{
  "argv": [
    "/home/user/conda/envs/nlp_in_tensorflow/bin/python3",   <----- Referring to the deleted environment
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "Python 3",
  "language": "python"
}
Resource
What to do when things go wrong - Python Environments
Multiple python environments, whether based on Anaconda or Python
Virtual environments, are often the source of reported issues. In many
cases, these issues stem from the Notebook server running in one
environment, while the kernel and/or its resources, derive from
another environment.
Another thing to check is the kernel.json file that will be located in
the aforementioned kernel specs directory identified by running
jupyter kernelspec list. This file will contain an argv stanza that
includes the actual command to run when launching the kernel.
Oftentimes, when reinstalling python environments, a previous
kernel.json will reference a python executable from an old or
non-existent location. As a result, it’s always a good idea when
encountering kernel startup issues to validate the argv stanza to
ensure all file references exist and are appropriate.
Fix
Removed ~/.local/share/jupyter/kernels/python3/kernel.json.
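Before removing the file, a short check like the following (a sketch; the path matches the listing above) confirms that the argv stanza points at a missing interpreter:
import json
import os

spec_path = os.path.expanduser("~/.local/share/jupyter/kernels/python3/kernel.json")
with open(spec_path) as f:
    argv = json.load(f)["argv"]
# argv[0] is the interpreter the kernel launches; here it names the deleted env.
print(argv[0], "exists" if os.path.exists(argv[0]) else "is missing")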
Related issues
Jupyter Notebook is loading incorrect Python kernel #2563
Running Jupyter with multiple Python and IPython paths - 3. How Jupyter knows what Python to use
Jupyter is set-up to be able to use a wide range of "kernels", or
execution engines for the code. These can be Python 2, Python 3, R,
Julia, Ruby... there are dozens of possible kernels to use. But in
order for this to happen, Jupyter needs to know where to look for the
associated executable: that is, it needs to know which path the python
sits in.
These paths are specified in jupyter's kernelspec, and it's possible
for the user to adjust them to their desires.

PyArrow OSError: [WinError 193] %1 is not a valid win32 application

My OS is Windows 10 64-bit and I use Anaconda 3.8 64-bit. I am trying to develop a Hadoop File System 3.3 client with the PyArrow module. Installing PyArrow with conda on Windows 10 is successful:
> conda install -c conda-forge pyarrow
But connecting to HDFS 3.3 with pyarrow throws the errors below:
import pyarrow as pa
fs = pa.hdfs.connect(host='localhost', port=9000)
The errors are
Traceback (most recent call last):
File "C:\eclipse-workspace\PythonFredProj\com\aaa\fred\hdfs3-test.py", line 14, in <module>
fs = pa.hdfs.connect(host='localhost', port=9000)
File "C:\Python-3.8.3-x64\lib\site-packages\pyarrow\hdfs.py", line 208, in connect
fs = HadoopFileSystem(host=host, port=port, user=user,
File "C:\Python-3.8.3-x64\lib\site-packages\pyarrow\hdfs.py", line 38, in __init__
_maybe_set_hadoop_classpath()
File "C:\Python-3.8.3-x64\lib\site-packages\pyarrow\hdfs.py", line 136, in _maybe_set_hadoop_classpath
classpath = _hadoop_classpath_glob(hadoop_bin)
File "C:\Python-3.8.3-x64\lib\site-packages\pyarrow\hdfs.py", line 163, in _hadoop_classpath_glob
return subprocess.check_output(hadoop_classpath_args)
File "C:\Python-3.8.3-x64\lib\subprocess.py", line 411, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "C:\Python-3.8.3-x64\lib\subprocess.py", line 489, in run
with Popen(*popenargs, **kwargs) as process:
File "C:\Python-3.8.3-x64\lib\subprocess.py", line 854, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Python-3.8.3-x64\lib\subprocess.py", line 1307, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
OSError: [WinError 193] %1 is not a valid win32 application
I installed Visual C++ 2015 on Windows 10, but the same errors are still shown.
This is my solution:
1. Before starting with pyarrow, Hadoop 3 has to be installed on your Windows 10 64-bit machine, and the installation path has to be added to Path.
2. Install pyarrow 3.0 (the version is important, it has to be 3.0):
pip install pyarrow==3.0
3. Create a PyDev module in the Eclipse PyDev perspective. The sample code looks like this:
from pyarrow import fs
hadoop = fs.HadoopFileSystem("localhost", port=9000)
print(hadoop.get_file_info('/'))
4. Select your created PyDev module and click [Properties] (Alt + Enter).
5. Click [Run/Debug Settings], choose the PyDev module, and click the [Edit] button.
6. In the [Edit Configuration] window, select the [Environment] tab.
7. Click the [Add] button. You have to create two environment variables, CLASSPATH and LD_LIBRARY_PATH:
- CLASSPATH: in a command prompt, execute hdfs classpath --glob, then copy the returned value into the Value text field (the returned value is one long string, but copy all of it).
- LD_LIBRARY_PATH: insert the path of the libhdfs.so file in Hadoop 3 (in my case, C:\hadoop-3.3.0\lib\native) into the Value text field.
OK! The pyarrow 3.0 configuration is set. You can now connect to Hadoop 3.3 on Windows 10 from Eclipse PyDev.
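If you prefer to avoid the Eclipse dialogs, roughly the same setup can be done from Python itself. This is only a sketch mirroring the steps above; it assumes hdfs is on PATH and that Hadoop lives at C:\hadoop-3.3.0 (adjust both for your machine):
import os
import subprocess

# Mirror the two [Environment] entries configured in Eclipse above.
# shell=True lets Windows resolve hdfs.cmd from PATH.
os.environ["CLASSPATH"] = subprocess.check_output(
    "hdfs classpath --glob", shell=True, text=True
).strip()
os.environ["LD_LIBRARY_PATH"] = r"C:\hadoop-3.3.0\lib\native"  # directory holding libhdfs

# Import after the environment is set so the variables are picked up.
from pyarrow import fs

hadoop = fs.HadoopFileSystem("localhost", port=9000)
print(hadoop.get_file_info("/"))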

Using pyspark on Windows not working- py4j

I installed Zeppelin on Windows using this tutorial and this.
I also installed java 8 to avoid problems.
I'm now able to start the Zeppelin server, and I'm trying to run this code -
%pyspark
a=5*4
print("value = %i" % (a))
sc.version
I'm getting this error, related to py4j. I had other problems with this library before (same as here), and to avoid them I replaced the py4j library in both Zeppelin and Spark on my computer with the latest version, py4j 0.10.7.
This is the error I get:
Traceback (most recent call last):
File "C:\Users\SHIRM~1.ARG\AppData\Local\Temp\zeppelin_pyspark-1240802621138907911.py", line 309, in <module>
sc = _zsc_ = SparkContext(jsc=jsc, gateway=gateway, conf=conf)
File "C:\Users\SHIRM.ARGUS\spark-2.3.2\spark-2.3.2-bin-hadoop2.7\python\pyspark\context.py", line 118, in __init__
conf, jsc, profiler_cls)
File "C:\Users\SHIRM.ARGUS\spark-2.3.2\spark-2.3.2-bin-hadoop2.7\python\pyspark\context.py", line 189, in _do_init
self._javaAccumulator = self._jvm.PythonAccumulatorV2(host, port, auth_token)
File "C:\Users\SHIRM.ARGUS\Documents\zeppelin-0.8.0-bin-all\interpreter\spark\pyspark\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1525, in __call__
File "C:\Users\SHIRM.ARGUS\Documents\zeppelin-0.8.0-bin-all\interpreter\spark\pyspark\py4j-0.10.7-src.zip\py4j\protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonAccumulatorV2. Trace:
I googled it, but couldn't find anyone this had happened to.
Does anyone have an idea of how I can solve this?
Thanks
I suspect you have installed Java 9 or 10. Uninstall whichever of those you have and install a fresh copy of Java 8 from here: https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
And set JAVA_HOME inside hadoop_env.cmd (open it with any text editor).
Note: Java 8 or 7 are the stable versions to use, so uninstall any other existing versions of Java. Make sure you point JAVA_HOME at the JDK (not the JRE).
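To verify which Java the system picks up after the reinstall (Java 8 reports version 1.8.x), a quick check from Python:
import subprocess

# java -version prints to stderr, so redirect it to capture the output.
print(subprocess.check_output(
    ["java", "-version"], stderr=subprocess.STDOUT, text=True
))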
I faced the same problem today, and I fixed it by adding PYTHONPATH to the system environment variables, like:
%SPARK_HOME%\python\lib\py4j;%SPARK_HOME%\python\lib\pyspark
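If you can't change the system environment, roughly the same effect can be had from inside the script. A sketch, assuming SPARK_HOME is set; note that the py4j zip under %SPARK_HOME%\python\lib is versioned (e.g. py4j-0.10.7-src.zip), so it's safer to glob for it:
import glob
import os
import sys

spark_home = os.environ["SPARK_HOME"]
# Make the pyspark package and the bundled py4j zip importable.
sys.path.append(os.path.join(spark_home, "python"))
sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))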

Environment Munging in Anaconda, PyCharm, and Jupyter Notebooks? No such file or directory: 'conda'

Regular PyCharm, Anaconda, and Jupyter user, but for the first time I'm starting a project that would benefit from having them all play together:
I have a correctly configured Anaconda environment running within PyCharm. However, when I try to launch ipynb notebooks (which work perfectly well in jupyter notebook) from within PyCharm, I get...
/home/bolster/anaconda3/bin/python3.5 /home/bolster/anaconda3/bin/jupyter-notebook --no-browser --ip 127.0.0.1 --port 8888
[W 12:33:12.515 NotebookApp] Unrecognized JSON config file version, assuming version 1
[W 12:33:12.519 NotebookApp] Config option `matplotlib` not recognized by `NotebookApp`.
[W 12:33:12.521 NotebookApp] Config option `matplotlib` not recognized by `NotebookApp`.
Traceback (most recent call last):
File "/home/bolster/anaconda3/bin/jupyter-notebook", line 6, in <module>
sys.exit(notebook.notebookapp.main())
File "/home/bolster/anaconda3/lib/python3.5/site-packages/jupyter_core/application.py", line 267, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "/home/bolster/anaconda3/lib/python3.5/site-packages/traitlets/config/application.py", line 595, in launch_instance
app.initialize(argv)
File "<decorator-gen-7>", line 2, in initialize
File "/home/bolster/anaconda3/lib/python3.5/site-packages/traitlets/config/application.py", line 74, in catch_config_error
return method(app, *args, **kwargs)
File "/home/bolster/anaconda3/lib/python3.5/site-packages/notebook/notebookapp.py", line 1069, in initialize
self.init_configurables()
File "/home/bolster/anaconda3/lib/python3.5/site-packages/notebook/notebookapp.py", line 837, in init_configurables
parent=self,
File "/home/bolster/anaconda3/lib/python3.5/site-packages/nb_conda_kernels/manager.py", line 19, in __init__
specs = self.find_kernel_specs() or {}
File "/home/bolster/anaconda3/lib/python3.5/site-packages/nb_conda_kernels/manager.py", line 129, in find_kernel_specs
self.conda_info = self._conda_info()
File "/home/bolster/anaconda3/lib/python3.5/site-packages/nb_conda_kernels/manager.py", line 29, in _conda_info
p = subprocess.check_output(["conda", "info", "--json"]
File "/home/bolster/anaconda3/lib/python3.5/subprocess.py", line 629, in check_output
**kwargs).stdout
File "/home/bolster/anaconda3/lib/python3.5/subprocess.py", line 696, in run
with Popen(*popenargs, **kwargs) as process:
File "/home/bolster/anaconda3/lib/python3.5/subprocess.py", line 950, in __init__
restore_signals, start_new_session)
File "/home/bolster/anaconda3/lib/python3.5/subprocess.py", line 1544, in _execute_child
raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'conda'
I have a hunch that this is because PyCharm isn't attempting to launch the notebook in a "real" user environment and as such isn't getting my $PATH imports (in .profile), however as this isn't a "Run Configuration" I can't see any way to "point" the IDE to look in the right path for the conda executable.
However, if I add a link to the conda executable in /usr/bin/, it works; but this is a level of hackery on a collaborative project that I'm not exactly happy with.
Is there a way to force PyCharm to look in the right place or at least update the internal-global environment variables to avoid seriously telling collaborators they need to link from their userland environment to the root bins?
Any application started from the terminal inherits all the properties of the terminal. If you start PyCharm in a non-terminal way, the $PATH defined in .profile won't be inherited, and you are left with the default $PATH.
I started PyCharm from the shell, and the $PATH from .profile was inherited. Now PyCharm is able to find conda on the path.
Another way is to create a .sh file in the /etc/profile.d folder that sets the PATH variable. Variables set there are system-wide, so there is no need to start PyCharm from the terminal.
The reason for this behavior, and alternative solutions, are given in this StackOverflow post.
Hope this helps!
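As a quick way to see what PATH the IDE actually inherited, run this from the Python console inside PyCharm; it mirrors the conda lookup that fails in the traceback:
import os
import shutil

# If this prints None, the environment PyCharm started with has no conda on PATH,
# which is exactly why subprocess.check_output(["conda", "info", "--json"]) fails.
print(shutil.which("conda"))
print(os.environ.get("PATH"))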
