Storing compiled Python 3.7 regular expressions in a database - python-3.x

I'm using AWS Lambda for some text mining tasks in a serverless environment. Since it is serverless, there is no way to keep the execution environment running permanently, and a cold start takes something like 10 minutes to compile all the regular expressions.
Therefore I would love to store a bunch (more than 10k) of serialized, compiled regular expressions in a database so they can be reused quickly when needed.
Does anyone have any pointers for me?
Something along the lines of:
import psycopg2
import re
r=re.compile(r"\w+")
cursor.execute("update regex set compiled=%s where id=%s", (r, 1))
with 'compiled' being of type bytea and
cursor.execute("select compiled from regex where id=%s", (1,))
r=cursor.fetchone()[0]
r.search("somestring")

I believe you are talking about storing the object returned by re.compile(r"\w+").
You can store the string r"\w+" in a NoSQL database like DynamoDB, retrieve it later, and compile it with re.compile.
Like this:
cursor.execute("select compiled from regex where id=%s", (1,))
s=cursor.fetchone()[0]
r=re.compile(s)
r.search("somestring")
...
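Since the answer mentions DynamoDB but the snippet above uses a SQL cursor, here is a rough boto3 equivalent; the table name regex, the key id, and the attribute pattern are made up for illustration:
import re
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("regex")   # hypothetical table with a numeric "id" key

# store the pattern source, not the compiled object
table.put_item(Item={"id": 1, "pattern": r"\w+"})

# retrieve and compile on the consumer side
item = table.get_item(Key={"id": 1})["Item"]
r = re.compile(item["pattern"])
r.search("somestring")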
Another option is to use Python's pickle module to serialize your object. I don't think you can save the result in the database directly, but you can upload the resulting pickle file to S3 and retrieve it from there.
With Lambda warm starts you could then keep this object in memory, loading it from S3 via pickle only when needed, but the first executions will still have high latency.
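A minimal sketch of that idea with boto3; the bucket and key names are chosen purely for illustration (note that pickling a compiled pattern stores little more than its source and flags, so unpickling still triggers a recompile):
import pickle
import re
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-regex-bucket", "patterns.pkl"   # hypothetical names

patterns = {1: re.compile(r"\w+")}

# upload once
s3.put_object(Bucket=BUCKET, Key=KEY, Body=pickle.dumps(patterns))

# inside the Lambda handler: download and unpickle on a cold start
blob = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
patterns = pickle.loads(blob)
patterns[1].search("somestring")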

I think the solution to this is not to use a serverless architecture in situations where it is ineffective.
Python does not seem to offer an efficient way to serialize a compiled regex. All you get out is the information needed to recompile it.
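A quick sketch illustrating that point: pickling a compiled pattern only records its source and flags, so loading it runs the compilation again.
import pickle
import re

p = re.compile(r"\w+")
blob = pickle.dumps(p)            # serialized as a call to re._compile(pattern, flags)
restored = pickle.loads(blob)     # this recompiles; no compiled state is reused
print(restored.pattern)           # \w+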

If anyone is running into the same problem: the solution is to prevent cold starts of the Lambda.
Far from elegant, but currently the only option.
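In the same spirit, a small sketch of the usual pattern for making warm invocations cheap: compile lazily into a module-level cache so each container only pays the cost once. The handler and the pattern table below are illustrative, not taken from the question.
import re

# lives at module level, so it survives across warm invocations of the same container
_CACHE = {}
_SOURCES = {1: r"\w+"}   # hypothetical: id -> pattern source, e.g. loaded from a database

def get_pattern(pattern_id):
    if pattern_id not in _CACHE:
        _CACHE[pattern_id] = re.compile(_SOURCES[pattern_id])
    return _CACHE[pattern_id]

def handler(event, context):
    r = get_pattern(1)
    return bool(r.search(event.get("text", "")))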

Related

What is the easiest way to operationalize Python code?

I am new to writing Python code. So far I have written a few modules for data analysis projects. The data is queried from AWS Redshift tables and summarized in CSVs and Excel spreadsheets.
At this point I do not want to pass it on to other users in the org, as I do not want to expose the code.
Is there an easy way to operationalize the code without exposing it?
PS: I am in the process of learning front-end development (Flask, HTML, CSS) so users can input data and get results back.
Python programs are almost always shipped as bare source. There are ways of compiling Python code into binaries, but this is not a common thing to do and usually I would not recommend it, as it's not as easy as one might expect (which is too bad, really).
That said, you can check out cx_Freeze and Cython.
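For what it's worth, a minimal cx_Freeze setup.py sketch; the script name and metadata are placeholders. Running python setup.py build produces a build/ directory containing an executable and byte-compiled dependencies rather than plain source files.
from cx_Freeze import setup, Executable

setup(
    name="analysis",                     # hypothetical project name
    version="0.1",
    description="Redshift data analysis jobs",
    executables=[Executable("main.py")]  # your entry-point script
)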

Is there a straightforward way to use SQLAlchemy in Julia?

In connection with the question on an ORM for Julia, I am wondering how to go about using SQLAlchemy in Julia, given that SQLAlchemy uses a lot of object/type magic to load and persist data. Do you have any hints on how to use Julia structures in the context of SQLAlchemy?
(I am sorry, I am new to Julia, just looking around at this point, and I am currently unable to come up with some starter code as an MCVE.)
The package PyCall.jl lets you load and use arbitrary Python packages, including SQLAlchemy.
julia> using PyCall
julia> @pyimport sqlalchemy as sql
julia> sql.__version__
"1.1.9"
Please see its documentation for further details.
As of now some arguably inconvenient syntax mappings are necessary. Concretely, you must access Python object fields and methods as object[:field] instead of the object.field you would use in Python. Nevertheless, since my pull request was merged this week, this is going to change once PyCall 2.0 is out! (Of course you can check out the master branch through ] add PyCall#master and get this feature already now.)

How to incorporate custom functions into the tf.data pipelining process for maximum efficiency

tf.image, for example, already implements some elementary image processing methods which I'd assume are optimized. The question is: as I'm iterating through a large dataset of images, what is the recommended way of applying a more complex function to every image (in batches, of course; for example a patch-wise 2-D DCT) so that it fits as well as possible into the whole tf.data framework?
Thanks in advance.
P.S. Of course I could use the map method, but I'm asking beyond that. If I pass a function written in pure numpy to map, it wouldn't help as much.
The current best approach (short of writing custom ops in C++/CUDA) is probably to use https://www.tensorflow.org/api_docs/python/tf/contrib/eager/py_func. This allows you to write any TF eager code and use Python control flow statements. With this you should be able to do most of the things you can do with numpy. The added benefit is that you can use your GPU and the tensors you produce in tfe.py_func will be immediately usable in your regular TF code - no copies are needed.
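A hedged sketch of how that could plug into tf.data, assuming a TF 1.x build where tf.contrib.eager is available; patch_dct below is just a placeholder for your real per-image function, and the input array is made up:
import numpy as np
import tensorflow as tf

tfe = tf.contrib.eager

images = np.random.randint(0, 255, size=(100, 32, 32), dtype=np.uint8)  # stand-in data

def patch_dct(image):
    # placeholder for the real per-image work (e.g. a patch-wise 2-D DCT);
    # arbitrary eager TF ops and Python control flow are allowed here
    return tf.cast(image, tf.float32) / 255.0

def map_fn(image):
    # wrap the eager function so the tf.data pipeline can call it per element
    return tfe.py_func(func=patch_dct, inp=[image], Tout=tf.float32)

dataset = (tf.data.Dataset.from_tensor_slices(images)
           .map(map_fn, num_parallel_calls=4)
           .batch(32)
           .prefetch(1))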

Parallelizing python3 program with huge complex objects

Intro
I have a quite complex Python program (more than 5,000 lines) written in Python 3.6. This program parses a huge dataset of more than 5,000 files, processes them to create an internal representation of the dataset, and then computes statistics. Since I have to test the model, I need to save the dataset representation, and right now I do that by serializing it with dill (the representation contains objects that pickle does not support). The serialized dataset, uncompressed, is about 1 GB.
The problem
Now I would like to speed up the computation through parallelization. The perfect way would be a multithreading approach, but the GIL forbids that. The multiprocessing module (and multiprocess, which is dill-compatible, too) uses serialization to share complex objects between processes, so in the best scheme I managed to come up with, parallelization has no effect on the overall running time because of the huge size of the dataset.
The question
What is the best way to manage this situation?
I know about posh, but it seems to be x86-only; ray, but it uses serialization too; gilectomy (a version of Python without the GIL), but I'm not able to make it parallelize threads; and Jython, which has no GIL but is not compatible with Python 3.x.
I am open to any alternative, any language, however complex it may be, but I can't rewrite the code from scratch.
The best solution I found was to replace dill with a custom pickling setup based on the standard pickle module. See here: Python 3.6 pickling custom procedure
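As a rough illustration of such a custom pickling setup, copyreg lets you teach plain pickle how to rebuild an otherwise unsupported object. The Node class below is a made-up stand-in for one of the problematic objects:
import copyreg
import pickle

class Node:
    # stand-in for an object that plain pickle cannot handle by default
    def __init__(self, payload):
        self.payload = payload

def pickle_node(node):
    # tell pickle how to rebuild a Node: (constructor, args)
    return Node, (node.payload,)

copyreg.pickle(Node, pickle_node)

blob = pickle.dumps(Node("some payload"))
restored = pickle.loads(blob)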

Why is it not recommended to use server-side stored functions in MongoDB?

According to the MongoDB documentation, it isn't recommended to use server-side stored functions. What is the reason behind this warning?
I am sure I have stated this list a couple of times before, even though the Google search results are filled only with people telling you how to do it:
It is eval
eval is naturally easy to inject into; it is like SQL without PDO, and if you don't build a full-scale escaping library around it, it will mess you up (see the sketch after this list). By using these functions you are effectively replacing the safer native language of MongoDB with something that is just as insecure as any old SQL out there.
It takes a global lock, can take a write lock, and will not release it until the operation is completely done, unlike other operations, which will release the lock in certain cases.
eval only works on primaries, never on any other member of the replica set.
It basically runs, unchecked, a ton of JS in the bundled V8/SpiderMonkey environment that ships with MongoDB, with full ability to touch any part of your database and run admin commands. Does that sound safe?
It is NOT MongoDB, nor is it "MongoDB's SQL"; it runs within a built-in JS environment, not MongoDB's C++ code itself (unlike the aggregation framework).
Due to the previous point it is EXTREMELY slow in comparison to many other options; this goes for $where usage as well.
That should be enough to get you started on this front.
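To make the injection point concrete, here is a hedged PyMongo sketch; it assumes an old server (pre-4.2, where the eval command still exists) and uses the generic db.command interface, with a made-up users collection and malicious input:
from bson.code import Code
from pymongo import MongoClient

db = MongoClient().test

# DANGEROUS: splicing user input straight into the JS body is the injection risk above
user_input = "x'}); db.dropDatabase(); db.users.count({name: 'x"   # hypothetical malicious value
unsafe_js = "function() { return db.users.count({name: '" + user_input + "'}); }"
db.command("eval", Code(unsafe_js))

# Less bad: pass values through the eval command's args array instead of string-building,
# though every other drawback listed above (locking, primary-only, speed) still applies
safe_js = Code("function(name) { return db.users.count({name: name}); }")
db.command("eval", safe_js, args=[user_input])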
