Install a JAR package related to PySpark into Foundry

We would like to install Spark-Alchemy to use it with PySpark inside Foundry (specifically, we want to use its HyperLogLog functions). While I know how to install a pip package, I am not sure what is needed to install this kind of package.
Any help or alternative solutions related to the use of HyperLogLog with PySpark will be appreciated, thanks!

PySpark Transform repositories in Foundry are connected to Conda. You can use the conda_recipe/meta.yml to pull packages into your transforms. If a package you want is not available in your channels, I would recommend you reach out to your administrators to ask if it's possible to add it. Adding a custom JAR that extends Spark is something that needs to be reviewed by your platform administrators, since it can represent a security risk.
I ran a $ conda search spark-alchemy and couldn't find anything related, and reading through these instructions https://github.com/swoop-inc/spark-alchemy/wiki/Spark-HyperLogLog-Functions#python-interoperability leads me to guess that there isn't a conda package available.
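For reference, pulling a conda package into a transforms repository usually comes down to adding it to the run requirements of the recipe. A rough sketch of what that edit typically looks like in a conda-build style meta.yml (the exact file in your repository may differ, and spark-alchemy is only a placeholder here, since as noted above there does not appear to be a conda package for it):
requirements:
  build:
    - python
  run:
    - python
    - pyspark
    - spark-alchemy   # placeholder: only resolvable if a channel you can reach actually provides it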

I can't comment on the use of this specific library, but in general Foundry supports Conda channels: if you have a Conda repository and configure Foundry to connect to that channel, you can add this library or others and reference them in your code.

Related

How can I create my own conda environment on an HPC system without access to the internet?

I am quite new to working with High-Performance Computing (HPC). On the system I am working with, I cannot install any packages from the internet; I don't even have access to the internet.
The support I get when I contact them is very limited. I am developing a small application and I need the latest version of pandas. They have told me that to get the latest version I need to do this:
module load lang/Anaconda3
sources/resources/conda/miniconda3/bin/activate
conda activate SigProfilerExtractor
This works, but I want to create my own environment. I know how to do this on my own computer, but I do not know how to do it on the HPC system. I know that the packages must be stored somewhere, but how can I install them into my environment from wherever they live?
And a second question: they have tools and packages located in many different environments, so it is very difficult to find out where a tool I will need lives. Not all environments have useful names, and the manual they provided is not up to date. If I need the tool mytool, how can I find it?
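To make the question concrete, the kind of thing I am hoping is possible (assuming the packages live in some local channel directory on the cluster; the paths below are just guesses) would be:
conda create --prefix ~/envs/myenv --offline --channel file:///path/to/local/channel pandas
conda env list                                     # list every environment available on the cluster
conda list -p /path/to/some/env | grep -i mytool   # check whether a given environment ships "mytool"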

Should I create a virtual environment for a framework?

Guys, I'm about to install "scrapy" and I was wondering if it would be a good idea to create a virtual environment first. I'm not an expert at these types of things, so I would have to research it before doing anything... but my question still stands: should I create one, or should I just install it directly with "pip3 install scrapy"? I ask this because I read somewhere it can conflict with other frameworks; correct me if I'm wrong please.
Yes, you should try to create virtual environments if you have multiple frameworks.
PEP 405 proposes to add to Python a mechanism for lightweight "virtual environments" with their own site directories, optionally isolated from system site directories. Each virtual environment has its own Python binary (allowing creation of environments with various Python versions) and can have its own independent set of installed Python packages in its site directories, but shares the standard library with the base installed Python.
For more information, check https://docs.python.org/3/library/venv.html and
https://www.python.org/dev/peps/pep-0405/
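For example, a typical workflow looks like this (a minimal sketch; the environment name is arbitrary):
python3 -m venv scrapy-env           # create an isolated environment
source scrapy-env/bin/activate       # activate it (on Windows: scrapy-env\Scripts\activate)
pip3 install scrapy                  # scrapy is now installed only inside this environment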

How to specify pytorch as a package requirement on windows?

I have a Python package which depends on PyTorch and which I'd like Windows users to be able to install via pip (the specific package is: https://github.com/mindsdb/lightwood, but I don't think this is very relevant to my question).
What are the best practices for going about this?
Are there some projects I could use as examples?
It seems like the PyPI-hosted versions of torch & torchvision aren't Windows compatible, and the "getting started" section suggests installing from the custom PyTorch repository, but beyond that I'm not sure what the ideal solution would be for incorporating this into a setup script.
What are the best practices for going about this?
If your project depends on other projects that are not distributed through PyPI then you have to inform the users of your project one way or another. I recommend the following combination:
clearly specify (in your project's documentation pages, or in the project's long description, or in the README, or anything like this) which dependencies are not available through PyPI (and possibly the reason why, with the appropriate links) as well as the possible locations to get them from;
to facilitate the user experience, publish alongside your project a pre-prepared requirements.txt file with the appropriate --find-links options.
The reason why (or the main reason; there are others) is that anyone using pip assumes that, by default, everything will be downloaded from PyPI and nowhere else. In other words, anyone using pip puts some trust into pypi.org as a source for Python project distributions. If pip were suddenly to download artifacts from other sources, it would breach this trust. It should be the user's decision to download from other sources.
So you could provide in your project's documentation an example of a requirements.txt file like the following:
# ...
torch===1.4.0 --find-links https://download.pytorch.org/whl/torch_stable.html
torchvision===0.5.0 --find-links https://download.pytorch.org/whl/torch_stable.html
# ...
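Users would then install those pinned dependencies with the usual command (the file name is whatever you publish it as):
python -m pip install -r requirements.txt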
Update
The best solution would be to help the maintainers of the projects in question to publish Windows wheels on PyPI directly:
https://github.com/pytorch/pytorch/issues/24310
https://github.com/pytorch/vision/issues/1774
https://pypi.org/help/#file-size-limit

Why are some Python packages not supported in Azure Functions V2?

I'm trying to publish my app to Azure Functions from Visual Studio Code, and the following are my dependencies:
pyodbc==4.0.26
pandas==0.25.0
numpy==1.16.4
azure-eventhub==1.3.1
and when I publish my app I get the following error:
ERROR: cannot install cryptography-2.7 dependency: binary dependencies without wheels are not supported. Use the --build-native-deps option to automatically build and configure the dependencies using a Docker container. More information at https://aka.ms/func-python-publish
This is a limitation of the way Azure Functions uses pip to download wheels. cryptography uploads an abi3 manylinux wheel, but this command can't successfully download it. For more information (and a workaround) see: https://github.com/Azure/azure-functions-core-tools/issues/1150
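If you just need to unblock publishing, the option mentioned in the error message is passed to the Core Tools publish command, roughly like this (the function app name is a placeholder, and building native dependencies this way requires Docker locally):
func azure functionapp publish <APP_NAME> --build-native-deps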
The link in the error message does answer your exact question:
If you're using a package that requires a compiler and does not support the installation of manylinux-compatible wheels from PyPI, publishing to Azure will fail
If you ask "why was it designed this way?" - that's a different question and out of scope for Stack Overflow. You might want to try asking on the Functions GitHub.

Identifying the most suitable dependent RPM packages

Not sure SO is the best place to ask this, but it is development-related so maybe someone can help.
I've written an app (in Python, but that's not important) which parses a Yum repo database to collate RPM packages and their dependencies. The problem I have is that I am pulling in too many packages when a dependency is met by more than one of them.
Specific example: I am seeking the list of packages which meet the dependencies of Java-1.8.0 and hit a dependency on libjli.so()(64bit). My code correctly works out that this is provided by multiple -devel packages from the Java 1.8, 1.7 and 1.6 streams. Unfortunately all three versions (and their dependencies) then get included in my list.
I guess my question is: given a list of packages meeting a requirement, what is the best way to identify the most appropriate package to include? i.e. when resolving the dependencies for Java-1.8.0, only include the -devel package for 1.8.0 and not suck in the -devel packages for 1.6 and 1.7 as well.
I know this is a problem with my code, I'm just not sure what facilities are provided by the yum ecosystem to help me identify which package would be best to include from the list of multiple.
It is hard to tell without seeing your code.
Yum is dead. If you are developing something new, you should build on top of DNF. DNF uses a SAT-solver algorithm (https://doc.opensuse.org/projects/satsolver/11.4/index.html), and you can use libdnf https://github.com/rpm-software-management/libdnf (formerly known as libhif, and before that as libhawkey).
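As a rough illustration of letting DNF's resolver do this work instead of hand-rolling it (the capability name is taken from the question; java-1.8.0-openjdk is only a guess at the actual RPM name and may need adjusting for your repos):
dnf repoquery --whatprovides 'libjli.so()(64bit)'        # every candidate package providing the capability
dnf repoquery --requires --resolve java-1.8.0-openjdk    # requirements resolved to the providing packages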
