PyArrow >= 0.8.0 must be installed; however, it was not found


I am on the Cloudera platform and am trying to use a pandas UDF in PySpark. I am getting the error below:
PyArrow >= 0.8.0 must be installed; however, it was not found.
Installing pyarrow 0.8.0 on the platform will take time.
Is there any workaround to use pandas UDFs without installing pyarrow?
I can install it in my personal Anaconda environment; is it possible to export the conda environment and use it in PySpark?

"I can install it in my personal Anaconda environment; is it possible to export the conda environment and use it in PySpark?"
No, you can't simply install it on your own machine and use it, because PySpark is distributed and the Python workers run on the cluster nodes.
But you can pack your virtual environment and ship it to the PySpark workers, without installing custom packages like pyarrow on every machine of your platform.
To do this with virtualenv, follow the venv-pack package's instructions:
https://jcristharif.com/venv-pack/spark.html
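A minimal sketch of the shipping step, loosely following the venv-pack guide linked above. It assumes the virtual environment (with pyarrow installed in it) has already been packed into an archive. The archive name environment.tar.gz, the alias environment, the script name script.py and the YARN cluster-mode settings are all assumptions, not details from the question. The helper only builds the spark-submit command line; it does not run anything.

```python
import shlex


def build_spark_submit(archive="environment.tar.gz", alias="environment",
                       script="script.py"):
    """Return (env, argv) for a spark-submit call that ships a packed venv.

    All default names are illustrative assumptions; adjust them for your cluster.
    """
    # Path of the Python interpreter inside the unpacked archive on each node.
    python_in_archive = f"./{alias}/bin/python"
    env = {"PYSPARK_PYTHON": python_in_archive}
    argv = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--conf", f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={python_in_archive}",
        # "archive#alias" asks YARN to unpack the archive under ./alias.
        "--archives", f"{archive}#{alias}",
        script,
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_spark_submit()
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    print(prefix, " ".join(shlex.quote(arg) for arg in argv))
```

Printing the command instead of executing it makes it easy to review against your platform's settings before submitting a job.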

Related

Importing Tensorflow on Python 3.6, 3.7, 3.8

I have a problem with importing TensorFlow. I have tried multiple versions of Numpy, Python, and TensorFlow and I still get the following error:
struct_pb2.TypeSpecProto.NDARRAY_SPEC
AttributeError: NDARRAY_SPEC
I have tried using conda and pip for installation and neither one works. I have no idea what might be the cause of this problem and it started happening about a week ago before that TensorFlow was working fine!

I believe you are using Windows and either have an incompatible version of TensorFlow installed or are missing a dependency. First, make sure the correct version of the Visual C++ redistributable is installed for your version of Windows:
https://support.microsoft.com/en-us/topic/the-latest-supported-visual-c-downloads-2647da03-1eea-4433-9aff-95f26a218cc0
https://aka.ms/vs/16/release/vc_redist.x64.exe is the direct link.
If it still doesn't work, enable long paths:
https://superuser.com/questions/1119883/windows-10-enable-ntfs-long-paths-policy-option-missing
If you are having a clash with other packages, create a new conda environment first if you haven't already, and install TensorFlow like this:
conda create -n tfenv
conda activate tfenv
conda install tensorflow
Then try to import tensorflow as tf again.
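To see exactly what fails when you retry the import in the new environment, a small check like the one below can help. It is only a sketch: the helper name try_import is made up for illustration, and it catches any exception because a broken install can fail with errors other than ImportError (such as the AttributeError in the question).

```python
import importlib


def try_import(module_name):
    """Import a module and return (ok, detail) instead of raising."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # ImportError, AttributeError, DLL load errors, ...
        return False, f"{type(exc).__name__}: {exc}"
    # Not every module defines __version__, so fall back to a placeholder.
    return True, str(getattr(module, "__version__", "unknown version"))


if __name__ == "__main__":
    ok, detail = try_import("tensorflow")
    print("tensorflow import", "succeeded:" if ok else "failed:", detail)
```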

'xlrd' installed, but getting the error: "Missing optional dependency 'xlrd'..."

I'm using Python 3.7 and I recently upgraded to Spyder 4.2.0 from Spyder 4.1.5. Now when I run my code (which was working fine before) I get the following error:
ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 1.0.0 for Excel support Use pip or conda to install xlrd.
So apparently Spyder thinks 'xlrd' ('Excel Reader'?) is not installed. So I went to the Anaconda prompt and tried pip install xlrd, but it replied with
Requirement already satisfied: xlrd in c:\users\michael\anaconda3\lib\site-packages (2.0.1)
I tried uninstalling and reinstalling xlrd anyways, using pip, but it didn't change anything. How do I resolve this error?
Also, I'm not sure if this matters or not, but I originally installed Spyder via Anaconda, whereas now I just downloaded Spyder 4.2.0 by itself, through this link: https://github.com/spyder-ide/spyder/releases.
Also, on the linked github page, it says: "If you are new to Python or the Scientific Python ecosystem, we strongly recommend you to install and use Anaconda. It comes with Spyder and all its dependencies, along with the most important Python scientific libraries (i.e. Numpy, Pandas, Matplotlib, IPython, etc) in a single, easy to use environment."
I had at first assumed this was meant for people downloading Python/Anaconda for the very first time, but now I'm thinking this applies to a semi newbie at Python such as me? As someone who is not very familiar with how packages and dependencies work, should I be downloading Anaconda every time I want to update Python or Spyder?
Apologies for the (probably) silly newbie question...

This sounds like you needed to re-start Spyder for it to pick up the package you installed.
However, as the author of xlrd, I would suggest you do the following:
Stop Spyder
conda install openpyxl
Start Spyder.
Change your pandas code to be pd.read_excel(..., engine='openpyxl')
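If the error persists after a restart, one possible cause (not confirmed in this thread) is that the standalone Spyder runs a different Python interpreter than the Anaconda one pip installed into. A quick diagnostic, run from the Spyder console, shows which interpreter is active and whether it can see the packages. The helper name locate is made up for this sketch.

```python
import importlib.util
import sys


def locate(package):
    """Return the file the current interpreter would import, or None if absent."""
    spec = importlib.util.find_spec(package)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Compare this path with the Anaconda install that pip reported.
    print("interpreter:", sys.executable)
    for name in ("xlrd", "openpyxl", "pandas"):
        print(f"{name}: {locate(name) or 'not found by this interpreter'}")
```

If the interpreter path is not under the Anaconda directory where pip installed xlrd, the package was installed for a different Python than the one Spyder is running.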

Problem updating joblib library from GitHub repo in IBM Watson Studio

In my program, I need to use some joblib functions. However, when I run the program, I get the error message: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23.
Apparently the library has been updated in this GitHub repo, but I did not succeed in installing it with the pip install command.
As a test, I tried to install just the setup file:
pip install https://github.com/dsxuser/scikit-learn/setup.py/0.20.x.zip
but I got a 404 error.
What I need is to update all the joblib library in that branch.
Does anyone know how to properly install it?

That's not an error, that's a warning. It tells you that you shouldn't use sklearn.externals.joblib anymore, if you want your code to be compatible with later versions of scikit-learn. Should means that you still can, as long as you do NOT upgrade scikit-learn to 0.23 or later.
The way to make your code ready for later versions of scikit-learn is to not use the deprecated sklearn.externals.joblib, but to use joblib directly instead. It's not pre-installed, so you can do one of these:
conda install joblib
pip install joblib
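On the code side, the switch can be sketched as an import with a fallback. The helper name import_first is made up for illustration. The fallback to sklearn.externals.joblib only helps on older scikit-learn versions that still ship it.

```python
import importlib


def import_first(*module_names):
    """Return the first module in module_names that imports successfully."""
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            errors.append(f"{name}: {exc}")
    raise ImportError("none of the modules could be imported: " + "; ".join(errors))


if __name__ == "__main__":
    try:
        # Prefer the standalone package; the deprecated path is only a fallback.
        joblib = import_first("joblib", "sklearn.externals.joblib")
        print("using", joblib.__name__)
    except ImportError as exc:
        print(exc)
```

Once joblib is installed everywhere the code runs, a plain import joblib is enough and the fallback can be dropped.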
You didn't mention what part of Watson Studio you are using. If it's notebooks without Spark, the preferred way to install packages is with conda. You can define a custom environment with this customization:
dependencies:
- joblib=0.13.2
or else you can call conda from a notebook cell:
!conda install joblib=0.13.2
If you're using some other part of Watson Studio, give conda a try, and if it doesn't work, switch to pip. Note that pip expects == instead of = before the version number. Specifying the version number protects you from surprises when new versions of joblib are released.
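The difference between the two pin syntaxes can be illustrated with a tiny converter. This is a sketch that handles only the simple name=version form used above; the function name conda_pin_to_pip is made up. Note that conda's single = matches a version prefix while pip's == is exact, so the two are only roughly equivalent.

```python
import re


def conda_pin_to_pip(spec):
    """Convert a simple conda pin 'name=version' to pip's 'name==version'."""
    match = re.fullmatch(r"([A-Za-z0-9_.\-]+)=([^=<>!~,\s]+)", spec.strip())
    if match is None:
        raise ValueError(f"not a simple conda pin: {spec!r}")
    name, version = match.groups()
    return f"{name}=={version}"


if __name__ == "__main__":
    print(conda_pin_to_pip("joblib=0.13.2"))
```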

Conda keeps trying to install all optional dependencies?

When installing Theano, Anaconda automatically tries to install pygpu, despite this being an optional dependency. I have deleted the .theanorc file from my Windows user directory.
Also, when running my application, Theano tries to load from the GPU. It's like it remembers somehow?
conda install theano
Fetching package metadata .............
Solving package specifications: .
Package plan for installation in environment
C:\Users\zebco\Miniconda3\envs\py35:
The following NEW packages will be INSTALLED:
libgpuarray: 0.6.9-vc14_0
pygpu: 0.6.9-py36_0
theano: 0.9.0-py36_0
Proceed ([y]/n)?
As you can see, I've only asked to install theano, yet conda wants to install everything, including optional dependencies.

Your assumption that pygpu is optional is dependent on the package manager you are using.
Regular Python (pip)
If you are using a direct Python install (obtained via brew or from the Python website), you would use pip to install theano. The package comes from
https://pypi.python.org/pypi/Theano/1.0.0
If you download the file, unzip it, and open setup.py, you will see the line below:
install_requires=['numpy>=1.9.1', 'scipy>=0.14', 'six>=1.9.0'],
These are the declared dependencies of the package, which means that when you install theano with pip you also get numpy, scipy and six.
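For a package that is already installed with pip, the same information can be read without unpacking anything. This sketch uses the standard-library importlib.metadata module (Python 3.8 or later); the helper name declared_requirements is made up. The returned list can also contain requirements that apply only to optional extras.

```python
from importlib import metadata


def declared_requirements(dist_name):
    """Return the requirement strings a distribution declares, or None if it is not installed."""
    try:
        # metadata.requires() returns None when nothing is declared.
        return metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(declared_requirements("Theano"))
```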
Anaconda Python (conda)
Now for Anaconda Python. Anaconda doesn't use the package format that PyPI and pip use; it has its own. With Anaconda you should use conda, not pip, to install the packages you need.
Conda has channels, which are simply repositories of available packages. You can install a package from any channel with
conda install -c <channel-name> <package-name>
The default channel is defaults; conda-forge is a widely used community channel. If you look at the theano package on conda-forge
https://anaconda.org/conda-forge/theano/files
and download and extract it, you will find an info/recipe/meta.yaml file with the content below:
requirements:
  build:
    - ca-certificates 2017.7.27.1 0
    - certifi 2017.7.27.1 py36_0
    - ncurses 5.9 10
    - openssl 1.0.2l 0
    - python 3.6.2 0
    - readline 6.2 0
    - setuptools 36.3.0 py36_0
    - sqlite 3.13.0 1
    - tk 8.5.19 2
    - xz 5.2.3 0
    - zlib 1.2.11 0
  run:
    - python
    - setuptools
    - six >=1.9.0
    - numpy >=1.9.1
    - scipy >=0.14
    - pygpu >=0.6.5,<0.7
This specifies that pygpu is also one of the package's run-time dependencies. So conda downloads pygpu as a dependency, even though you thought it was optional (which is probably true if you were using regular Python and pip).
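The difference between the two dependency lists can be made explicit with a few lines of Python. The two lists are copied from setup.py and from the run section of the conda recipe quoted in this answer; the helper names are made up for this sketch.

```python
import re

# From setup.py (pip) and from the "run" section of the conda recipe.
PIP_REQUIRES = ["numpy>=1.9.1", "scipy>=0.14", "six>=1.9.0"]
CONDA_RUN = ["python", "setuptools", "six >=1.9.0", "numpy >=1.9.1",
             "scipy >=0.14", "pygpu >=0.6.5,<0.7"]


def names(specs):
    """Strip version constraints, keeping only the package names."""
    return {re.split(r"[\s<>=!~]", spec.strip(), maxsplit=1)[0].lower()
            for spec in specs}


def conda_only(pip_specs, conda_specs):
    """Packages the conda recipe requires that the pip metadata does not list."""
    return sorted(names(conda_specs) - names(pip_specs))


print(conda_only(PIP_REQUIRES, CONDA_RUN))
# → ['pygpu', 'python', 'setuptools']
```

python and setuptools show up only because conda also manages the interpreter itself; pygpu is the one real difference between the two installs.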

Update:
Usually, 'optional dependency' is an oxymoron. Something optional is not a dependency; a dependency is a software package that another piece of software depends on in order to function.
One may get by without a dependency if it is used only by one isolated feature that you never use. As a beginner, I would suggest you not take this path.
I am not super familiar with Theano, but Theano can use the system's GPU to speed up its computations, and it seems to me that pygpu and libgpuarray are what enable this functionality, which means they are not optional for that feature.
I believe pygpu is 'optional' if you do not wish to use the GPU for speeding up computation (only done if the GPU is powerful enough to be useful for this).
The --no-deps switch shown below allows you to install a package without its dependencies, but that is rarely wise unless you really know what you are doing. As a beginner, I would not recommend going down this path yet. Conda was designed specifically to ensure scientific packages are easily managed, with everything necessary installed without any fuss. pip is a general Python package manager, but it is not built specifically for scientific packages.
If you wish to install theano without installing its dependencies, then you have one of three options:
use conda install theano --no-deps.
Install it using pip instead of conda, using pip install theano. This will install theano, numpy, scipy and six but not pygpu and libgpuarray.
Create a custom conda build file for Theano. Documentation is at:
https://conda.io/docs/user-guide/tasks/build-packages/index.html
Original Answer:
You probably know this already but, use this command instead:
conda install theano --no-deps
This does not install dependencies of the package. If you already have the essential dependencies installed, as it would seem, this should work out for you.
libgpuarray is a dependency of pygpu. With this command switch neither will be installed.
Can you share the .yaml file that you edited?

Installing requirements

I'm new to Python and Anaconda. I have some code that needs a lot of requirements to run. How can I install those packages? The requirements are:
Python 3.3 or later
numexpr
numpy 1.9 or later
pandas 0.15.2 or later
scikit-learn 0.16
scipy 0.15 or later
six
C/C++ compiler
ipython (optional)
seaborn (optional)
Thanks

It's a good practice to first create a virtual environment using the virtualenv command, and then activate it with source <environment_path>/bin/activate.
After loading the virtual environment, you should be able to use pip install -r requirements.txt to install the requirements listed in the file.
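If the code did not come with a requirements.txt, one can be written from the list in the question. The sketch below renders such a file. Reading "scikit-learn 0.16" as a pin to the 0.16 series is an interpretation, not something the question states, and the C/C++ compiler is left out because it is not a pip package. The function name is made up.

```python
import sys

# Transcribed from the question. "scikit-learn==0.16.*" is an interpretation.
REQUIRED = ["numexpr", "numpy>=1.9", "pandas>=0.15.2",
            "scikit-learn==0.16.*", "scipy>=0.15", "six"]
OPTIONAL = ["ipython", "seaborn"]


def render_requirements(include_optional=False):
    """Return the text of a requirements.txt for the listed packages."""
    lines = REQUIRED + (OPTIONAL if include_optional else [])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    if sys.version_info < (3, 3):
        raise SystemExit("Python 3.3 or later is required")
    print(render_requirements(), end="")
```

Saving the printed text as requirements.txt gives pip install -r requirements.txt something to work with.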

Anaconda comes with its own package manager conda. You should probably use that to install extra packages.
With the possible exception of a C/C++ compiler (although it includes cython, which needs a compiler IIRC), all packages that you need come with Anaconda. You'll find a list of included packages here.
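To check which of the listed packages your Anaconda install already provides, you can ask the interpreter directly. This is a sketch with a made-up function name; the mapping is needed because scikit-learn is imported as sklearn.

```python
import importlib.util

# Distribution name -> import name; only scikit-learn differs.
IMPORT_NAMES = {"numexpr": "numexpr", "numpy": "numpy", "pandas": "pandas",
                "scikit-learn": "sklearn", "scipy": "scipy", "six": "six"}


def missing_packages(import_names=IMPORT_NAMES):
    """Return the distributions whose module this interpreter cannot find."""
    return sorted(dist for dist, module in import_names.items()
                  if importlib.util.find_spec(module) is None)


if __name__ == "__main__":
    print("missing:", missing_packages() or "nothing")
```

Anything reported as missing can then be installed with conda install <package-name>.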
