Is there any alternative to the pandas.DataFrame function for Python? - python-3.x

I am developing an application for Android with Kivy and packaging it with Buildozer. The core of my application uses pandas, especially the DataFrame function. Packaging with Buildozer failed even though I had put pandas in the requirements, so I want to use another library that works with Buildozer. Does anyone know of a good alternative to the pandas.DataFrame function, for example with the numpy library or another one?
Thanks a lot for your help. :)

A few options come close to Pandas.DataFrame:
As a database, you likely know SQLite (in Python, see SQLAlchemy and sqlite3).
For raw tables (i.e., pure matrix-like data) there is NumPy (numpy.ndarray); it lacks some of the database functionality of Pandas, but it is fast and you can easily implement what you need on top of it. You can find many comparisons between Pandas and NumPy.
Finally, depending on your needs, simple Python dictionaries may do, perhaps an OrderedDict. A minimal sketch combining NumPy and OrderedDict follows below.
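For illustration, here is a minimal sketch of the dictionary/NumPy route; the column names and values are made up, and this only mimics a couple of DataFrame-style operations:

import numpy as np
from collections import OrderedDict

# A "table" kept as an OrderedDict of NumPy columns.
table = OrderedDict([
    ("name", np.array(["ana", "bob", "eva"])),
    ("score", np.array([3.5, 2.0, 4.8])),
])

# Filtering, similar to df[df["score"] > 3]:
mask = table["score"] > 3
filtered = OrderedDict((col, values[mask]) for col, values in table.items())

# Sorting by a column, similar to df.sort_values("score"):
order = np.argsort(table["score"])
sorted_table = OrderedDict((col, values[order]) for col, values in table.items())

print(filtered["name"])       # ['ana' 'eva']
print(sorted_table["score"])  # [2.  3.5 4.8]

For anything more relational (joins, persistent storage), the SQLite route mentioned above is the better fit.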

Related

numpy pickle data exchange with C++ Eigen or std::vector

I am writing numpy data to an SQLite database with pickling.
Is there any way to read this pickle from C++ into an Eigen matrix or a std::vector?
Best
You can either use the Boost libraries or the specifically designed cross-language library PicklingTools.
Edit: you can find an example of the latter in this post.
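For context, here is a minimal sketch of the write side described in the question, i.e. pickling a NumPy array into SQLite; the table and column names are hypothetical, and the stored BLOB is what a C++ reader (via the libraries mentioned above) would have to decode:

import pickle
import sqlite3
import numpy as np

conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS arrays (name TEXT PRIMARY KEY, blob BLOB)")

arr = np.arange(12, dtype=np.float64).reshape(3, 4)
payload = pickle.dumps(arr, protocol=2)  # pin the protocol so the stream is predictable for other readers
conn.execute("INSERT OR REPLACE INTO arrays VALUES (?, ?)", ("example", sqlite3.Binary(payload)))
conn.commit()

# Reading it back in Python, just to show what the BLOB contains.
blob = conn.execute("SELECT blob FROM arrays WHERE name = ?", ("example",)).fetchone()[0]
restored = pickle.loads(blob)
print(restored.shape)  # (3, 4)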

Is there a straightforward way to use SQLAlchemy in Julia?

In connection with the question on ORM for Julia, I am wondering how to go about using SQLAlchemy in Julia, given that SQLAlchemy uses a lot of object/type magic to load and persist data. Do you have any hints on how to use Julia structures in the context of SQLAlchemy?
(I am sorry, I am new to Julia, just looking around at this point, and I am currently unable to come up with some code as a starting point/MCVE.)
The package PyCall.jl lets you load and use arbitrary python packages, including SQLAlchemy.
julia> using PyCall
julia> @pyimport sqlalchemy as sql
julia> sql.__version__
"1.1.9"
Please see its documentation for further details.
As of now, some arguably inconvenient syntax mappings are necessary. Concretely, you must access Python object fields and methods by object[:field] instead of the object.field you would use in Python. Nevertheless, since my pull request was merged this week, this is going to change once PyCall 2.0 is out! (Of course you can check out the master branch through ] add PyCall#master and get this feature already now.)

How to incorporate custom functions into the tf.data pipelining process for maximum efficiency

tf.image, for example, already has some elementary image processing methods implemented, which I assume are optimized. The question is: as I iterate through a large dataset of images, what is the recommended way of implementing a more complex function on every image, in batches of course (for example a patch 2-D DCT), so that it works as well as possible with the whole tf.data framework?
Thanks in advance.
P.S. Of course I could use the map method, but I'm asking beyond that: if I pass a function written in pure numpy to map, it wouldn't help as much.
The current best approach (short of writing custom ops in C++/CUDA) is probably to use https://www.tensorflow.org/api_docs/python/tf/contrib/eager/py_func. This allows you to write any TF eager code and use Python control flow statements. With this you should be able to do most of the things you can do with numpy. The added benefit is that you can use your GPU and the tensors you produce in tfe.py_func will be immediately usable in your regular TF code - no copies are needed.
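As a rough sketch, assuming TF 1.x where tf.contrib.eager.py_func is available (in TF 2.x, tf.py_function plays a similar role); the per-image transform here is only a trivial stand-in for something like a patch 2-D DCT:

import numpy as np
import tensorflow as tf

tfe = tf.contrib.eager

# Dummy in-memory data, just to make the sketch self-contained.
images = np.random.randint(0, 256, size=(100, 64, 64, 1)).astype(np.uint8)
labels = np.random.randint(0, 10, size=(100,)).astype(np.int64)

def per_image_transform(image):
    # Any eager TF code (including Python control flow) can go here;
    # this cast/scale is only a placeholder for the real per-image computation.
    image = tf.cast(image, tf.float32)
    return image / 255.0

def map_fn(image, label):
    # Wrap the eager-style function so the tf.data pipeline can call it.
    processed = tfe.py_func(func=per_image_transform, inp=[image], Tout=tf.float32)
    processed.set_shape(image.shape)
    return processed, label

dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .map(map_fn, num_parallel_calls=4)
           .batch(32)
           .prefetch(1))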

Do I get a speedup running Python libraries on PySpark?

I've tried to read through and understand exactly where the speedup comes from in Spark when I run Python libraries like pandas or scikit-learn, but I don't see anything particularly informative. If I can get the same speedup without using PySpark DataFrames, can I just deploy code using pandas and expect it to perform roughly the same?
I suppose my question is:
If I have working pandas code should I translate it to PySpark for efficiency or not?
If you are asking whether you get any speedup by running arbitrary Python code on the driver node, the answer is no. The driver is a plain Python interpreter; it doesn't affect your code in any "magic" way.
If I have working pandas code should I translate it to PySpark for efficiency or not?
If you want the benefits of distributed computing then you have to rewrite your code using distributed primitives. However, it is not a free lunch:
Your problem might not distribute well.
Even if it does, the amount of data might not justify the distribution.
In other words: if your code works just fine with Pandas or scikit-learn, there is little chance you'll gain anything from rewriting it for Spark.

How do I manipulate/modify Excel files using Python 3, along with searching/sorting algorithms, for huge databases?

I am new to Python and need to work on a project with the whole database in Excel. The database has over 100k entries and has to be periodically modified and searched/sorted often.
Are there any package(s) for this?
As per your title, here is a list of Python modules that can perform various operations on Excel worksheets:
Pandas (possibly your best option, since it has powerful data management).
pyExcelerator (apparently not maintained anymore)
xlwt (a fork of pyExcelerator)
openpyxl
xlrd
As to search/sort, Python has built-in features and there are plenty of other packages (see also the pandas sketch below), for example:
https://pypi.python.org/pypi/algorithms
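For example, a minimal pandas sketch of the read / search / sort / write-back cycle; the file and column names are hypothetical, so adjust them to your workbook:

import pandas as pd

df = pd.read_excel("database.xlsx")  # needs an Excel engine such as openpyxl installed

# Search: rows whose "name" column contains a keyword.
matches = df[df["name"].str.contains("smith", case=False, na=False)]

# Sort: by one or more columns.
df = df.sort_values(by=["last_modified", "name"], ascending=[False, True])

# Modify the matching rows and write back.
df.loc[matches.index, "status"] = "reviewed"
df.to_excel("database.xlsx", index=False)

With around 100k rows this typically stays comfortably in memory for pandas.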
