Is there a straightforward way to use SQLAlchemy in Julia?

In connection with the question on ORMs for Julia, I am wondering how to go about using SQLAlchemy in Julia, given that SQLAlchemy uses a lot of object/type magic to load and persist data. Do you have any hints on how to use Julia structures in the context of SQLAlchemy?
(I am sorry, I am new to Julia and just looking around at this point, so I am currently unable to come up with starter code as an MCVE.)

The package PyCall.jl lets you load and use arbitrary Python packages, including SQLAlchemy.
julia> using PyCall
julia> @pyimport sqlalchemy as sql
julia> sql.__version__
"1.1.9"
Please see its documentation for further details.
As of now, some arguably inconvenient syntax mappings are necessary. Concretely, you must access Python object fields and methods as object[:field] instead of the object.field you would use in Python. Nevertheless, since my pull request has been merged this week, this is going to change once PyCall 2.0 is out! (Of course, you can check out the master branch via ] add PyCall#master and get this feature already now.)

Related

How can I include fast.ai functionality when primarily using PyTorch?

I am using PyTorch to carry out vision tasks, but would like to use some of what fast.ai provides since it has a lot of useful functionality. I'd prefer to work mostly in PyTorch since it's easier for me to understand what's going on, it's easier for me to find information on it online, and I want to maintain flexibility.
In https://docs.fast.ai/migrating_pytorch it's written that after I use the following imports: from fastai.vision.all import * and from migrating_pytorch import *, I should be able to start "Incrementally adding fastai goodness to your PyTorch models", which sounds great.
But when I run the second import I get ModuleNotFoundError: No module named 'migrating_pytorch'. Searching in https://github.com/fastai/fastai I also don't find any code mention of migrating_pytorch.py, nor did I manage to find something online.
(I'm using fast.ai version 2.3.1)
I'd like to know if this is indeed the way to go, and if so, how to get it working. Or, if there's a better way, how I should use that approach instead.
As an example, it would be nice if I could use the EarlyStoppingCallback, SaveModelCallback, and add some metrics from fast.ai instead of writing them myself, while still having everything in mostly "native" PyTorch.
Preferably the solution isn't specific to vision only, but that's my current need.
migrating_pytorch is an example script. It's in the fast.ai repo at: https://github.com/fastai/fastai/blob/master/nbs/examples/migrating_pytorch.py
The notebook that shows how to use it is at: https://github.com/fastai/fastai/blob/827e7cc0fad2db06c40df393c9569309377efac0/nbs/examples/migrating_pytorch.ipynb
For the callback example, your training code would end up looking something like:
from fastai.vision.all import *     # Learner, EarlyStoppingCallback, SaveModelCallback, etc.
import torch.nn.functional as F     # plain PyTorch loss; dls comes from your own data setup

cbs = [EarlyStoppingCallback(), SaveModelCallback()]
learner = Learner(dls, simple_cnn(), loss_func=F.cross_entropy, cbs=cbs)
learner.fit(1)
Those two callbacks probably need some arguments, e.g. save path, etc.
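If it helps, here's a hedged sketch of what those arguments and a couple of metrics might look like (the monitor, patience, and fname values, and the use of accuracy, are my assumptions; check the fastai callback and metrics docs for your version):
cbs = [
    EarlyStoppingCallback(monitor='valid_loss', patience=2),      # stop if valid_loss hasn't improved for 2 epochs
    SaveModelCallback(monitor='valid_loss', fname='best_model'),  # keeps the best weights under learner.path/models/
]
learner = Learner(dls, simple_cnn(), loss_func=F.cross_entropy, metrics=[accuracy], cbs=cbs)
learner.fit(10)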

storing compiled python 3.7 regular expressions in a database

I'm using AWS Lambda for some text-mining tasks in a serverless environment. Since it is serverless, there is no way to keep the runtime environment alive, and a cold start takes around 10 minutes to compile all the regular expressions.
Therefore I would love to store a bunch (more than 10k) of serialized, compiled regular expressions in a database, for fast reuse when needed.
Does anyone have any pointers for me?
Something along the lines of:
import psycopg2
import re
r=re.compile(r"\w+")
cursor.execute("update regex set compiled=%s where id=%s", (r, 1))
with 'compiled' being of type bytea and
cursor.execute("select compiled from regex where id=%s", (1,))
r=cursor.fetchone()[0]
r.search("somestring")
I believe you are talking about storing the object returned by re.compile(r"\w+").
You can store the string r"\w+" in a NoSQL database like DynamoDB and retrieve the string to compile it using re.compile.
Like this:
cursor.execute("select compiled from regex where id=%s", (1,))
s=cursor.fetchone()[0]
r=re.compile(s)
r.search("somestring")
...
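If you go the DynamoDB route mentioned above, a rough boto3 sketch could look like this (the table name, key, and attribute names are hypothetical; only the raw pattern string is stored):
import re
import boto3

table = boto3.resource("dynamodb").Table("regex")     # hypothetical table with key "id"
table.put_item(Item={"id": 1, "pattern": r"\w+"})     # store the raw pattern string
item = table.get_item(Key={"id": 1})["Item"]
r = re.compile(item["pattern"])                       # recompile on each cold start
r.search("somestring")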
Another option is to use Python pickle to serialize your object, but I don't think it's practical to save that in the database; instead, you can upload the resulting pickle file to S3 and retrieve it from there.
With Lambda warm starts you could then keep this object in memory, loading it via pickle + S3 whenever needed, but the first executions will have high latency.
I think the solution to this is not to use a serverless architecture in situations where it is ineffective.
Python does not seem to offer an efficient way to serialize compiled regexes; all you get out is the code needed to recompile them.
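A minimal illustration of that point (assuming CPython; pickling a compiled pattern only records the pattern string and flags, so unpickling pays the full re.compile cost again):
import pickle
import re

r = re.compile(r"\w+")
blob = pickle.dumps(r)           # this is all you could put in a bytea column
r2 = pickle.loads(blob)          # re.compile effectively runs again here
print(r2.search("somestring"))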
If anyone is running into the same problem, the solution is to prevent cold starts of the Lambda.
Far from elegant, but currently the only solution.

Is there any alternative for pandas.DataFrame function for Python?

I am developing an application for Android with Kivy and packaging it with Buildozer. The core of my application uses pandas, and specifically the DataFrame class. Packaging failed with Buildozer even though I had put pandas in the requirements, so I want to use another library that works with Buildozer. Does anyone know of a good alternative to pandas.DataFrame, for example using the numpy library or another one?
Thanks a lot for your help. :)
Some options similar to pandas.DataFrame:
As a database, you likely know SQLite (in Python, see SQLAlchemy and sqlite3).
For raw tables (i.e., pure matrix-like data) there is NumPy (numpy.ndarray); it lacks some of the database-like functionality of pandas, but it is fast and you can easily implement what you need (see the sketch after this list). You can find many comparisons between pandas and NumPy.
Finally, depending on your needs, simple Python dictionaries may be enough, perhaps collections.OrderedDict.
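As a minimal sketch of the NumPy option (the column names and dtypes are made up for illustration), a structured array can stand in for a small DataFrame-like table:
import numpy as np

# a tiny table with named, typed columns
table = np.array(
    [("alice", 30, 1.65), ("bob", 25, 1.80)],
    dtype=[("name", "U10"), ("age", "i4"), ("height", "f4")],
)

print(table["age"].mean())          # column access + aggregation
print(table[table["age"] > 26])     # row filtering, roughly like df[df.age > 26]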

Parameters in bw2, what to use bw2data.parameters or bw2parameters?

I will need to import and work on a few databases containing parameters in bw2: ecoinvent(s) and another database exported from SimaPro. While in the past I used bw2parameters, I have now seen that the handling of parameters has also been included in bw2data, and I am getting a bit confused. What is the workflow now? Should I just rely on one of the two, work with both, or what? And with which versions of the two packages?
Thanks!
Parameters (named variables and formulas stored as strings) are introduced in Brightway2-data version 3.0, which is currently a release candidate. Although we have effectively 100% test coverage and some documentation, I would like to wait a bit before the final release to improve the documentation and make sure that there aren't any bugs that pop up somewhere. That being said, I would feel completely comfortable using the release candidate (along with Brightway2-IO 0.6.RC3, which provides nice ways to specify parameters).
bw2parameters is a library for evaluating a graph of variables and formulas. bw2data is a library for storing, loading, and exporting variables and formulas, and for tracking when their values are obsolete. They don't compete; they work together.
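As a rough sketch of the bw2parameters side (the exact API below is my reading of its README, so treat it as an assumption and check the current docs), evaluating a small graph of formulas looks roughly like this:
from bw2parameters import ParameterSet  # assumed import path

params = {
    "A": {"amount": 42},
    "B": {"formula": "2 * A + 10"},     # formulas can reference other parameters by name
}
print(ParameterSet(params).evaluate())  # expected: {"A": 42, "B": 94}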

API compatibility between scala and python?

I have read a dozen pages of docs, and it seems that:
I can skip learning the scala part
the API is completely implemented in python (I don't need to learn scala for anything)
the interactive mode works as completely and as quickly as the scala shell and troubleshooting is equally easy
python modules like numpy will still be imported (no crippled python environment)
Are there any areas where PySpark falls short that would make this impossible?
In recent Spark releases (1.0+), we've implemented all of the missing PySpark features listed below. A few new features are still missing, such as Python bindings for GraphX, but the other APIs have achieved near parity (including an experimental Python API for Spark Streaming).
My earlier answers are reproduced below:
Earlier answer as of Spark 0.9:
A lot has changed in the seven months since my original answer (reproduced at the bottom of this answer):
Spark 0.7.3 fixed the "forking JVMs with large heaps" issue.
Spark 0.8.1 added support for persist(), sample(), and sort().
The upcoming Spark 0.9 release adds partial support for custom Python -> Java serializers.
Spark 0.9 also adds Python bindings for MLLib (docs).
I've implemented tools to help keep the Java API up-to-date.
As of Spark 0.9, the main missing features in PySpark are:
zip() / zipPartitions.
Support for reading and writing non-text input formats, like Hadoop SequenceFile (there's an open pull request for this).
Support for running on YARN clusters.
Cygwin support (PySpark works fine under Windows PowerShell or cmd.exe, though).
Support for job cancellation.
Although we've made many performance improvements, there's still a performance gap between Spark's Scala and Python APIs. The Spark users mailing list has an open thread discussing its current performance.
If you discover any missing features in PySpark, please open a new ticket on our JIRA issue tracker.
Original answer as of Spark 0.7.2:
The Spark Python Programming Guide has a list of missing PySpark features. As of Spark 0.7.2, PySpark is currently missing support for sample(), sort(), and persistence at different StorageLevels. It's also missing a few convenience methods added to the Scala API.
The Java API was in sync with the Scala API when it was released, but a number of new RDD methods have been added since then and not all of them have been added to the Java wrapper classes. There's a discussion about how to keep the Java API up-to-date at https://groups.google.com/d/msg/spark-developers/TMGvtxYN9Mo/UeFpD17VeAIJ. In that thread, I suggested a technique for automatically finding missing features, so it's just a matter of someone taking the time to add them and submit a pull request.
Regarding performance, PySpark is going to be slower than Scala Spark. Part of the performance difference stems from a weird JVM issue when forking processes with large heaps, but there's an open pull request that should fix that. The other bottleneck comes from serialization: right now, PySpark doesn't require users to explicitly register serializers for their objects (we currently use binary cPickle plus some batching optimizations). In the past, I've looked into adding support for user-customizable serializers that would allow you to specify the types of your objects and thereby use specialized serializers that are faster; I hope to resume work on this at some point.
PySpark is implemented using a regular cPython interpreter, so libraries like numpy should work fine (this wouldn't be the case if PySpark was written in Jython).
It's pretty easy to get started with PySpark: simply downloading a pre-built Spark package and running the pyspark interpreter should be enough to test it out on your personal computer, and it will let you evaluate its interactive features. If you like IPython, you can use IPYTHON=1 ./pyspark in your shell to launch PySpark with an IPython shell.
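For a first smoke test, something along these lines should work (a minimal sketch using the SparkContext-based RDD API of these releases; the local master and app name are arbitrary):
from pyspark import SparkContext

sc = SparkContext("local", "pyspark-smoke-test")   # local mode, made-up app name
rdd = sc.parallelize(range(10))
print(rdd.map(lambda x: x * x).sum())              # 285
sc.stop()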
I'd like to add some points about why many people who have used both APIs recommend the Scala API. It's very difficult for me to do this without pointing out general weaknesses of Python versus Scala and my own distaste for dynamically typed, interpreted languages for writing production-quality code. So here are some reasons specific to the use case:
Performance will never be quite as good as Scala's; the gap is not orders of magnitude but a fraction, and it exists partly because Python is interpreted. This gap may widen in the future as Java 8 and newer JIT technology become part of the JVM and Scala.
Spark is written in Scala, so debugging Spark applications, learning how Spark works, and learning how to use Spark is much easier in Scala because you can just quite easily CTRL + B into the source code and read the lower levels of Spark to suss out what is going on. I find this particularly useful for optimizing jobs and debugging more complicated applications.
Now my final point may seem like just a Scala vs Python argument, but it's highly relevant to the specific use case, namely scale and parallel processing. Scala actually stands for Scalable Language, and many interpret this to mean it was specifically designed with scaling and easy multithreading in mind. It's not just about lambdas; it's the head-to-toe features of Scala that make it the perfect language for Big Data and parallel processing. I have some data science friends who are used to Python and don't want to learn a new language, but they stick to their hammer. Python is a scripting language; it was not designed for this specific use case. It's an awesome tool, but the wrong one for this job. The result is obvious in the code: their code is often 2-5x longer than my Scala code, as Python lacks a lot of features, and they find it harder to optimize their code because they are further away from the underlying framework.
Let me put it this way: if someone knows both Scala and Python, then they will nearly always choose to use the Scala API. The only people, in my experience, who use Python are those who simply do not want to learn Scala.

Resources