Pass multiple Pandas DataFrames to an executable via the subprocess module in Python

I have the following python script example.py, which I converted to a Windows Executable.
import sys
import pickle
import pandas as pd

def example(df1, df2):
    print('Started the executable with example')
    df1 = pickle.loads(df1)
    df2 = pickle.loads(df2)
    print(f'df1 has {df1.shape[0]} rows')
    print(f'df2 has {df2.shape[0]} rows')
    return pickle.dumps(pd.concat([df1, df2]))

if __name__ == "__main__":
    example(sys.argv[1], sys.argv[2])
Then I use PyInstaller to create an executable named example.exe using pyinstaller example.py -F
Next, I create two random Pandas DataFrames in Python; let's call them df1 and df2.
Now, I would like to use the subprocess module in the main Python script main.py to call this executable and get the results. This is the part I need help with. Following is the code I wrote, but it obviously isn't working.
import subprocess
import pickle

df_dump1 = pickle.dumps(df1)
df_dump2 = pickle.dumps(df2)

command = ['./example.exe', df_dump1, df_dump2]
result = subprocess.run(command,
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                        shell=True)
print(result.returncode, result.stdout, result.stderr)
The error message I get is this:
TypeError: a bytes-like object is required, not 'str'
Clearly, I am not able to send multiple Pandas dataframes (or even one) to the executable. Any ideas about how to achieve this?
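One approach that sidesteps argv entirely (a sketch, not a confirmed solution: the stdin protocol and the length-prefix framing are my own assumptions): command-line arguments must be strings, so instead of putting pickled bytes into argv, send both pickles over the executable's stdin with a 4-byte length prefix so they can be split apart again, and read the pickled result back from stdout.

# example.py -- sketch of the receiving side: reads two length-prefixed
# pickles from stdin, writes the pickled concatenation to stdout
import pickle
import struct
import sys
import pandas as pd

def read_pickled(stream):
    (length,) = struct.unpack('<I', stream.read(4))  # 4-byte length prefix
    return pickle.loads(stream.read(length))

if __name__ == "__main__":
    stream = sys.stdin.buffer
    df1 = read_pickled(stream)
    df2 = read_pickled(stream)
    sys.stdout.buffer.write(pickle.dumps(pd.concat([df1, df2])))

# main.py -- sketch of the sending side
import pickle
import struct
import subprocess
import pandas as pd

df1 = pd.DataFrame({'a': range(3)})   # stand-ins for the real DataFrames
df2 = pd.DataFrame({'a': range(5)})

payload = b''
for df in (df1, df2):
    blob = pickle.dumps(df)
    payload += struct.pack('<I', len(blob)) + blob   # prefix each pickle with its size

result = subprocess.run(['./example.exe'], input=payload,
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
combined = pickle.loads(result.stdout)
print(result.returncode, combined.shape)

Note that shell=True is dropped: with a list of arguments plus the shell, the arguments get mangled, and bytes cannot be passed through argv anyway.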

Related

returning results from python script to variable in Jupyter notebook

I have a python script that returns a pandas dataframe and I want to run the script in a Jupyter notebook and then save the results to a variable.
The data are in a file called data.csv and a shortened version of the dataframe.py file whose results I want to access in my Jupyter notebook is:
# dataframe.py
import pandas as pd
import sys

def return_dataframe(file):
    df = pd.read_csv(file)
    return df

if __name__ == '__main__':
    return_dataframe(sys.argv[1])
I tried running:
data = !python dataframe.py data.csv
in my Jupyter notebook but data does not contain the dataframe that dataframe.py is supposed to return.
This is how I did it:
# dataframe.py
import pandas as pd
import sys

def return_dataframe(f):  # don't shadow the built-in `file`
    df = pd.read_csv(f)
    return df

if __name__ == '__main__':
    return_dataframe(sys.argv[1]).to_csv(sys.stdout, index=False)
Then in the notebook you need to convert an 'IPython.utils.text.SList' into a DataFrame as shown in the comments to this question: Convert SList to Dataframe:
data = !python3 dataframe.py data.csv
df = pd.DataFrame(data=data)[0].str.split(',',expand=True)
If the DataFrame is already going to be put into CSV format then you could simply do this in the notebook:
df = pd.read_csv('data.csv')
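Alternatively (a sketch, assuming the dataframe.py above that writes CSV to stdout), you can skip the SList conversion entirely by calling the script with subprocess and parsing its output straight back into a DataFrame:

import io
import subprocess
import pandas as pd

# Run the script, capture its CSV output, and parse it back; the header
# row and the column splitting are handled by read_csv.
out = subprocess.run(['python3', 'dataframe.py', 'data.csv'],
                     capture_output=True, text=True, check=True).stdout
df = pd.read_csv(io.StringIO(out))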

Where to do package imports when importing multiple python scripts?

This might have been answered before, but I could not find anything that addresses my issue.
So, I have 2 files.
|
|-- test.py
|-- test1.py
test1.py is as below
def fnc():
    return np.ndarray([1,2,3,4])
I'm trying to call test1 from test and call the function like this:
from test1 import *
x = fnc()
Now naturally I'm getting NameError: name 'np' is not defined.
I tried to write the import both in test and test1 as
import numpy as np
But still, I'm getting the error. This might be silly, but what exactly am I missing?
Any help is appreciated. Thanks in advance.
Each Python module has its own namespace, so if some function in test1.py depends on numpy, you have to import numpy in test1.py:
# test1.py
import numpy as np

def fnc():
    return np.ndarray([1,2,3,4])
If test.py doesn't directly use numpy, you don't have to import it again, ie:
# test.py
# NB: do NOT use 'from xxx import *' in production code; be explicit
# about what you import
from test1 import fnc

if __name__ == "__main__":
    result = fnc()
    print(result)
Now if test.py also wants to use numpy, it has to import it too - as I said, each module has its own namespace:
# test.py
# NB: do NOT use 'from xxx import *' in production code; be explicit
# about what you import
import numpy as np
from test1 import fnc

def other():
    return np.ndarray([3, 44, 5])

if __name__ == "__main__":
    result1 = fnc()
    print(result1)
    result2 = other()
    print(result2)
Note that if you are testing your code in a Python shell, just modifying the source and re-importing it will not work: modules are only loaded once per process, and subsequent imports fetch the already-loaded module from the sys.modules cache, so you have to exit the shell and open a new one.
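(If restarting the shell is inconvenient, importlib.reload forces a fresh execution of an already-imported module; a minimal sketch:)

import importlib
import test1

# After editing test1.py on disk, re-execute it instead of getting
# the cached module back from sys.modules.
importlib.reload(test1)
print(test1.fnc())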
Mostly you need to have an __init__.py in the directory where you have these files.
Just try creating an __init__.py file, like below, in the directory where your .py files are present and see if it helps:
touch __init__.py

Under both Cython and Numba, a Pandasql sqldf select statement throws sqlite3.OperationalError: no such table

I am really new to Python programming. I have a pandasql query on a dataframe which runs fine when I run my code with the standard Python 3 implementation. However, after cythonizing it, I always get the following exception:
sqlite3.OperationalError: no such table: dataf
Following is the snippet from the processor.pyx file
import pandas as pd
from pandasql import sqldf
def process_date(json):
    # json has the format [{"x": "1", "y": "2", "z": "3"}]
    dataf = pd.read_json(json, orient='records')
    sql = """select x, y, z from dataf;"""
    result = sqldf(sql)
Could cythonizing the code make it behave differently? This exact same code runs fine with the standard Python3 implementation.
Following is the setup.py I have written to transpile the code to c.
# several files with ext .pyx, that I will call by their name
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [
    Extension("c_processor", ["processor.pyx"])]

setup(
    name = 'RTP_Cython',
    cmdclass = {'build_ext': build_ext},
    ext_modules = ext_modules,
)
I also tried to use Numba and got the same error. Code below:
import pandas as pd
from pandasql import sqldf
from numba import jit
from numpy import arange
@jit
def process_data():
    print("In process data")
    json = "[{\"id\": 2, \"name\": \"zain\"}]"
    df = pd.read_json(json, orient='records')
    sql = "select id, name from df;"
    df = sqldf(sql)
    print("the df is %s" % df)

process_data()
If I comment out the @jit annotation, the code works fine.
Should I be using another extension of the pandas libraries that interoperates with C, since both Numba and Cython give me the same error?
I hope there is an easy solution to this.
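One workaround worth trying (a sketch, not a confirmed fix from this thread): sqldf locates tables such as dataf by inspecting the caller's local namespace, and compiled Cython/Numba functions may not expose a normal Python frame for it to inspect. sqldf accepts an explicit environment dict as its second argument, which sidesteps that lookup:

import pandas as pd
from pandasql import sqldf

def process_data(json_str):
    df = pd.read_json(json_str, orient='records')
    # Pass the table mapping explicitly instead of letting sqldf
    # inspect the (possibly missing) caller frame.
    return sqldf("select id, name from df;", {"df": df})

print(process_data('[{"id": 2, "name": "zain"}]'))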

Python Unittest for big arrays

I am trying to put together a unit test to check whether my function that reads in big data files produces the correct result, in the shape of a numpy array. However, these files and arrays are huge and cannot be typed in. I believe I need to save input and output files and test using them. This is what my test module looks like:
import numpy as np
from myFunctions import fun1
import unittest

class TestMyFunctions(unittest.TestCase):
    def setUp(self):
        self.inputFile1 = "input1.txt"
        self.inputFile2 = "input2.txt"
        self.outputFile = "output.txt"

    def test_fun1(self):
        m1 = np.genfromtxt(self.inputFile1)
        m2 = np.genfromtxt(self.inputFile2)
        R = np.genfromtxt(self.outputFile)
        self.assertEqual(fun1(m1,m2),R)

if __name__ == '__main__':
    unittest.main(exit=False)
I'm not sure if there is a better/neater way of testing huge results.
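As a side note (my own observation, not raised in the thread): assertEqual on a multi-element array raises "The truth value of an array with more than one element is ambiguous"; numpy ships assertion helpers that compare element-wise and print useful mismatch reports. A sketch of the test rewritten with them:

import unittest
import numpy as np
from myFunctions import fun1

class TestMyFunctions(unittest.TestCase):
    def setUp(self):
        self.inputFile1 = "input1.txt"
        self.inputFile2 = "input2.txt"
        self.outputFile = "output.txt"

    def test_fun1(self):
        m1 = np.genfromtxt(self.inputFile1)
        m2 = np.genfromtxt(self.inputFile2)
        R = np.genfromtxt(self.outputFile)
        # Element-wise comparison with a small floating-point tolerance,
        # instead of unittest's assertEqual on whole arrays.
        np.testing.assert_allclose(fun1(m1, m2), R)

if __name__ == '__main__':
    unittest.main(exit=False)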
Edit:
I am also getting an attribute error now:
AttributeError: TestMyFunctions object has no attribute '_testMethodName'
Update - AttributeError solved: overriding __init__() on the test case is not allowed; I replaced it with def setUp()!

(Python3.6 using cx_Freeze) Exe does not run pandas, numpy application

I wrote a .py script called Expiration_Report.py using the following libraries: pandas and numpy. The code runs perfectly fine when executed in Spyder (Python 3.6).
(Using Anaconda for everything)
I then created another .py file called 'setup.py' with the following code in order to convert Expiration_Report.py to Expiration_Report.exe:
import sys
from cx_Freeze import setup, Executable

# Dependencies are automatically detected, but it might need fine tuning.
build_exe_options = {"packages": ["os"],
                     "excludes": ["tkinter"]}

# GUI applications require a different base on Windows (the default is for a
# console application).
base = None
if sys.platform == "win32":
    base = "console"

setup(name = "my prog",
      version = "1.0",
      description = "My application!",
      options = {"build_exe": build_exe_options},
      executables = [Executable("Expiration_Report.py", base = base)])
Then in the command prompt I write:
python setup.py build
It builds without any errors, and the build folder is created with the .exe file inside. However, when I run the .exe file from the build folder, nothing happens.
Here is the code from the Expiration_Report.py script:
import pandas as pd
import numpy as np
df = pd.read_excel('C:/Users/Salman/Desktop/WIP Board - 007.xlsx', index_col=None, na_values=['NA'])
df.columns = df.iloc[12]
df.columns
df.shape
df = df.dropna(axis=1, how = 'all')
df
df.columns
df1 = df.copy()
df1 = df1.iloc[13:]
df1
df1 = df1.dropna(axis=1, how = 'all')
df1.shape
from datetime import datetime
print(str(datetime.now()))
df2 = df1.copy()
df2["Now_Time"] = pd.Series([datetime.now()] * (13+len(df1)))
df2["Now_Time"]
df2
df2.fillna(value='NaN')
df2 = df2.dropna(how='any')
df2.shape
df3 = df2.copy()
df3 = df3[df3.Size>0]
df3['Lot Expiration Date'] = pd.to_datetime(df3['Lot Expiration Date'])
df3['Days_Countdown'] = df3[['Lot Expiration Date']].sub(df3['Now_Time'], axis = 0 )
df3.dtypes
df3['Hours_Countdown'] = df3['Days_Countdown'] / np.timedelta64(1, 'h')
df3 = df3.sort_values('Hours_Countdown')
df_expiration = df3[df3.Hours_Countdown<12]
df_expiration['Hours_Countdown'].astype(int)
df_expiration
df_expiration.to_excel('C:/Users/Salman/Desktop/WIP Board - 000.xlsx', sheet_name = 'Sheet1')
The method for creating an exe file with cx_Freeze is correct, because I converted a simple HelloWorld.py script to an exe and it worked fine. It is just not importing the pandas library and exits the exe.
Maybe you need to add pandas and numpy to the packages list. cx_Freeze can be a bit dodgy when it comes to finding all the necessary packages.
build_exe_options = {"packages": ["os", "numpy", "pandas"],
                     "excludes": ["tkinter"]}
It seems that this (including the packages in the setup.py file) doesn't work for cx_Freeze 5 and 6 (as I understand it, these are the latest versions).
I had the same problem whatever advice I followed here, including adding the packages; it is numpy that seems to cause the trouble.
You can test this by putting import numpy in a very simple test script and watching it crash when you freeze it.
The solution worked for me under Python 3.4, but I doubt that it works under Python 3.6:
I uninstalled cx_Freeze and reinstalled cx_Freeze 4.3.4 via pip install cx_Freeze==4.3.4, and then it worked.
pd.read_csv works but pd.read_excel does not, once the application is cx_freezed.
Pandas' read_excel depends on a package called xlrd. If you don't have it, you need to install it, and pd.read_excel works afterwards.
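Putting the two answers together (a sketch; bundling xlrd is my assumption, not something confirmed above), declaring xlrd next to pandas and numpy should make the frozen build carry the Excel backend along:

# setup.py excerpt -- "xlrd" added on the assumption that the freezer
# misses pd.read_excel's backend unless told about it explicitly
build_exe_options = {"packages": ["os", "numpy", "pandas", "xlrd"],
                     "excludes": ["tkinter"]}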