Problem installing pyspark version 2.3 - apache-spark

I have been trying to install pyspark 2.3 for the last couple of days, but so far I have only found versions 3.0.1 and 2.4.7. I am trying to run code implemented in pyspark 2.3 as part of my project. Is that version still available? If so, please point me to the resources needed to install pyspark 2.3, as it seems difficult for me to port that code to version 3.0.1.

Pyspark 2.3 should still be available via Conda-Forge.
Please check out https://anaconda.org/conda-forge/pyspark/files?version=2.3.2
There you will find the following packages, among others, for direct download:
linux-64/pyspark-2.3.2-py36_1000.tar.bz2
win-64/pyspark-2.3.2-py36_1000.tar.bz2
If you don't want the raw packages, you can also install it via conda:
conda install -c conda-forge pyspark=2.3.2
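After installation, you can verify which version is active from Python (a quick sanity check, not part of the conda-forge instructions):
import pyspark
print(pyspark.__version__)  # should print 2.3.2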

Related

How to install Apache Spark on Windows 10?

I am trying to install spark-3.3.1-bin-hadoop3 "Prebuilt for Apache Hadoop 3.3 and later", which I downloaded from https://spark.apache.org/downloads.html , on my Windows 10 machine.
But when I search for winutils, the latest version I can find is 3.3.1, from here: https://github.com/kontext-tech/winutils. There are also other winutils builds with lower versions on other GitHub pages.
Since the latest version of Hadoop is 3.3.4, I don't know what to do. Do I even need to install Hadoop before installing Spark? And which GitHub page is the official page for winutils?
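For what it's worth, a full Hadoop install is usually not required to run Spark locally on Windows; you typically only need a winutils.exe that roughly matches your Hadoop build, placed in a bin folder under a directory pointed to by HADOOP_HOME. A minimal sketch (the C:\hadoop path is only a placeholder, not an official location):
import os

# Assumes winutils.exe (and hadoop.dll, if needed) have been copied to C:\hadoop\bin
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = os.environ["HADOOP_HOME"] + r"\bin;" + os.environ["PATH"]

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("winutils-check").getOrCreate()
print(spark.version)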

Unable to run Rasa

I have been trying to install and use Rasa for one of my assignments for 2 days. I have tried everything: Python 3.9, Python 3.7.x, and currently I have installed Python 3.6.8, for which the error message is shown below. I have tried to find a solution via the GitHub discussion sections, but nothing has helped me yet. Can anyone tell me how I can resolve this?
Installed versions:
Rasa Version Installed: 2.3
Python version 3.6.8
Pip version 18.1
tensorflow version 2.3.1
The install steps can be found here, along with the supported Python versions; Python 3.9 is not supported.
This is a TensorFlow loading issue which is probably related to your system configuration. There is more information here.
I faced this problem and solved it by installing tensorflow using conda.
Another option is to use a virtual environment, as sketched below.
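For example, a fresh conda environment along these lines (a sketch only; the environment name is a placeholder, and the exact versions you can pin depend on your conda channel and the Rasa release you need):
conda create -n rasa_env python=3.8
conda activate rasa_env
conda install tensorflow
pip install rasa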

PyArrow >= 0.8.0 must be installed; however, it was not found

I am on the Cloudera platform and I am trying to use pandas UDFs in pyspark. I am getting the error below:
PyArrow >= 0.8.0 must be installed; however, it was not found.
Installing pyarrow 0.8.0 on the platform will take time.
Is there any workaround to use pandas UDFs without installing pyarrow?
I can install on my personal anaconda environment, is it possible to export conda and use it in pyspark?
No, you can't simply install it on your own machine and use it, since pyspark is distributed across the cluster.
But you can pack your virtualenv and ship it to the pyspark workers, without installing custom packages like pyarrow on every machine of your platform.
To do this, simply follow the venv-pack package's instructions:
https://jcristharif.com/venv-pack/spark.html
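Roughly, the workflow looks like the following (adapted from the venv-pack docs; the environment and script names are placeholders):
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install pyarrow venv-pack
venv-pack -o pyspark_venv.tar.gz
PYSPARK_PYTHON=./environment/bin/python spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python --master yarn --deploy-mode cluster --archives pyspark_venv.tar.gz#environment your_script.py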

How to update OpenCV 3.4.2 to OpenCV 4 or latest stable version?

I have installed OpenCV 3.4.2 successfully by following the tutorial given here:
https://www.pyimagesearch.com/2015/07/20/install-opencv-3-0-and-python-3-4-on-ubuntu/
Now, I would like to update to OpenCV 4 or the latest stable version.
Do I need to uninstall 3.4.2 first?
If so, how should I uninstall it?
I am afraid that creating another virtual environment and installing version 4 or the master package from GitHub by following the same steps might create conflicts. Please advise.
Working on Ubuntu 16.04 LTS, python 3.5
For the Python interface, I guess you can try something like pip install opencv-python==4.0.0.21. Note that you might need to run pip3 install opencv-python==4.0.0.21 instead, depending on your pip setup.
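Once installed, you can confirm which version is active from Python (a quick check; cv2 is the import name of the opencv-python package):
import cv2
print(cv2.__version__)  # should report 4.0.0 after the upgrade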

best_score_ parameter from GridSearchCV from spark-sklearn doesn't work with version 0.2.3

I am looking to use the best_score_ parameter from the GridSearchCV function, but it looks like it is not present in the latest version of the spark-sklearn library (version 0.2.3). When I try to uninstall the latest version and reinstall an older version (0.2.0) with the command
pip install spark-sklearn-0.2.0
it does not work. How can I install older versions of the spark-sklearn library in my cluster environments? The best_score_ parameter seems to work fine in version 0.2.0.
Thanks
There is a known issue with spark-sklearn version 0.2.3 where GridSearchCV no longer exposes the best_score_ parameter. The issue can be found here:
https://github.com/databricks/spark-sklearn/issues/73
To install older version of the library, use the following command:
pip install spark-sklearn==0.2.0
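With 0.2.0 installed, best_score_ should be accessible again. A minimal sketch, assuming a running SparkContext (the iris dataset and SVC parameter grid are only illustrative):
from pyspark import SparkContext
from sklearn import datasets, svm
from spark_sklearn import GridSearchCV

sc = SparkContext.getOrCreate()
iris = datasets.load_iris()
param_grid = {"kernel": ("linear", "rbf"), "C": [1, 10]}
search = GridSearchCV(sc, svm.SVC(gamma="auto"), param_grid)  # spark-sklearn's distributed grid search
search.fit(iris.data, iris.target)
print(search.best_score_)  # available in 0.2.0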
