Is there a way to OCR images in PySpark? - apache-spark

I cannot find an open source solution for OCRing images in PySpark. I know solutions like pytesseract exist, but I am not sure they will play nicely with PySpark, since tesseract-ocr needs to be installed on the Linux machines. Are there any open source OCR solutions that would play nicely with PySpark?

I could not find a pure Python library. pytesseract calls a Linux library called tesseract-ocr, which I was able to install on a Spark cluster. It is fairly easy to install on your own Spark cluster as well, and it works well.
Here's an answer on how to install it on Databricks. I used global init scripts to install it:
How to install Tesseract OCR on Databricks
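Once tesseract-ocr is on every worker, the pytesseract call can be wrapped in a UDF. The following is only a minimal sketch, assuming pytesseract and Pillow are also installed on the workers; the helper names ocr_image_bytes and add_ocr_column are made up for illustration and are not part of any library.

```python
import io


def ocr_image_bytes(content):
    """Return the text Tesseract finds in one encoded image, or None for nulls."""
    if content is None:
        return None
    # Imported inside the function, so they are resolved on the worker that
    # runs the task. Assumption: pytesseract, Pillow and the tesseract-ocr
    # binary are installed on every worker node.
    import pytesseract
    from PIL import Image

    with Image.open(io.BytesIO(content)) as image:
        return pytesseract.image_to_string(image)


def add_ocr_column(df, content_col="content", text_col="text"):
    """Append a string column holding the OCR text of a binary image column."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    ocr_udf = udf(ocr_image_bytes, StringType())
    return df.withColumn(text_col, ocr_udf(df[content_col]))
```

On Spark 3.0 or later, the images could be loaded with spark.read.format("binaryFile").load(...), which puts the raw file bytes in a column named content, and then passed to add_ocr_column. Each image is processed in a separate Python call, so throughput depends mostly on how many executor cores run Tesseract in parallel.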

Related

install JAR package related to pyspark into foundry

We would like to install Spark-Alchemy to use it with PySpark inside Foundry (we want to use its HyperLogLog functions). While I know how to install a pip package, I am not sure what is needed to install this kind of package.
Any help or alternative solutions related to the use of hyperloglog with pyspark will be appreciated, thanks!
PySpark Transform repositories in Foundry are connected to conda. You can use conda_recipe/meta.yaml to pull packages into your transforms. If a package you want is not available in your channels, I would recommend reaching out to your administrators to ask whether it can be added. Adding a custom JAR that extends Spark needs to be reviewed by your platform administrators, since it can represent a security risk.
I ran $ conda search spark-alchemy and could not find anything related. Reading through these instructions, https://github.com/swoop-inc/spark-alchemy/wiki/Spark-HyperLogLog-Functions#python-interoperability, my guess is that no conda package is available.
I can't comment on the use of this specific library, but in general Foundry supports Conda channels. If you have a Conda repo and configure Foundry to connect to that channel, you can add this library or others and reference them in your code.

How to specify pytorch as a package requirement on windows?

I have a Python package which depends on PyTorch and which I'd like Windows users to be able to install via pip (the specific package is https://github.com/mindsdb/lightwood, but I don't think this is very relevant to my question).
What are the best practices for going about this?
Are there any projects I could use as examples?
It seems the PyPI-hosted versions of torch and torchvision aren't Windows-compatible, and the "getting started" section suggests installing from the custom PyTorch repository, but beyond that I'm not sure what the ideal way would be to incorporate this into a setup script.
What are the best practices for going about this?
If your project depends on other projects that are not distributed through PyPI then you have to inform the users of your project one way or another. I recommend the following combination:
clearly specify (in your project's documentation pages, or in the project's long description, or in the README, or anything like this) which dependencies are not available through PyPI (and possibly the reason why, with the appropriate links) as well as the possible locations to get them from;
to facilitate the user experience, publish alongside your project a pre-prepared requirements.txt file with the appropriate --find-links options.
The main reason (there are others) is that anyone using pip assumes that, by default, everything will be downloaded from PyPI and nowhere else. In other words, anyone using pip puts some trust in pypi.org as a source for Python project distributions. If pip were suddenly to download artifacts from other sources, it would breach this trust. It should be the user's decision to download from other sources.
So you could provide in your project's documentation an example of requirements.txt file like the following:
# ...
# --find-links applies to the whole file, so it goes on its own line
--find-links https://download.pytorch.org/whl/torch_stable.html
torch===1.4.0
torchvision===0.5.0
# ...
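Alongside the requirements.txt example, the documentation could also ship a small helper that prints the right pip command for the user's platform and leaves the decision to run it with the user. This is only a sketch: the function name pip_install_args is hypothetical, and the pinned versions and URL are simply the ones from the example.

```python
import sys

# URL and pins copied from the requirements.txt example.
PYTORCH_WHEEL_INDEX = "https://download.pytorch.org/whl/torch_stable.html"
PINNED = ["torch===1.4.0", "torchvision===0.5.0"]


def pip_install_args(platform=sys.platform):
    """Build the pip command line; add --find-links only on Windows."""
    args = [sys.executable, "-m", "pip", "install", *PINNED]
    if platform.startswith("win"):
        # The extra source is added only where the PyPI wheels are missing.
        args += ["--find-links", PYTORCH_WHEEL_INDEX]
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so the user opts in explicitly.
    print(" ".join(pip_install_args()))
```

Printing rather than executing keeps the trust argument intact: nothing is downloaded from outside PyPI unless the user copies and runs the command.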
Update
The best solution would be to help the maintainers of the projects in question to publish Windows wheels on PyPI directly:
https://github.com/pytorch/pytorch/issues/24310
https://github.com/pytorch/vision/issues/1774
https://pypi.org/help/#file-size-limit

Why does an HDInsight cluster not come with Scala pre-installed?

On HDInsight's master node, $ scala -version returns an error. Scala is easily installed via
$ apt-get install scala
but shouldn't Scala be installed there by default?
Thank you for the suggestion. What is the scenario where you need Scala installed directly on the node? For example, in Spark there are a couple of other common scenarios that already work:
Running Spark commands on the command line. This is done through spark-shell, which has a built-in Scala interpreter.
Building a Spark project. This is usually done through a Maven or sbt project definition file. Those tools automatically download the correct Scala version and compiler based on the project dependencies.
As you said, it's not hard to preinstall Scala, but we would like to understand the need for it. This has not come up before in our discussions with customers.

how to activate python3 in Google-cloud-data

I am using Python 3 for my projects. However, Google Cloud Datalab runs Python 2.7.x by default. How do I change to Python 3?
Unfortunately, Datalab only supports Python 2 for now.
One thing you can try is installing the pydatalab library, a Jupyter extension that adds support for a number of Google Cloud Platform services to your Jupyter notebooks. That library supports Python 3.
It looks like they support Python 3 now:
https://github.com/googledatalab/datalab/issues/902 .

Remote Python Program As Local Program

I am new to Python, so please pardon my mistakes and ignorance.
I have a GUI app script that I use to copy some folders from another machine to my machine and to do some other processing with the files in those folders.
Now I would like to keep this script on my machine and let other people (with no Python installed on their machines) execute it. I want it to behave as if it were running on their machine. That is, I don't want to see errors such as "access denied" when the script on my machine makes changes to their files. It should treat the D:\ drive as theirs, not mine.
Is it possible somehow in python?
Thanks in advance.
I don't know of a way to do such a thing. But maybe you can use a tool such as py2exe to convert Python scripts into Windows .exe applications.
As its introduction says:
It is a utility based on Distutils that allows you to run applications written in Python on a Windows computer without requiring the user to install Python. It is an excellent option when you need to distribute a program to the end user as a standalone application. py2exe currently only works in Python 2.x.
If you use Python 3.0 or 3.1, this question is helpful.

Resources