Convert multiple DataFrames to numpy arrays - python-3.x

I am attempting to convert 5 dataframes to numpy arrays in a loop.
df = [df1, df2, df3, df4, df5]
for index, x in enumerate(df):
    x = x.to_numpy()
print(type(df3)) still gives me pandas DataFrame as the output.

This does not save the result back: rebinding x inside the loop leaves the original DataFrames (and the list) untouched. Assign back into the list by index instead:
for index, x in enumerate(df):
    df[index] = x.to_numpy()
Then df[0] is a numpy array.
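For reference, a minimal self-contained sketch of the same idea; the two toy DataFrames here are made up for illustration:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})   # hypothetical stand-ins for your frames
df2 = pd.DataFrame({'b': [3, 4]})

dfs = [df1, df2]
for i, frame in enumerate(dfs):
    dfs[i] = frame.to_numpy()   # write the array back into the list

print(type(dfs[0]))   # <class 'numpy.ndarray'>
Note that the name df1 itself still points to a DataFrame; only the list entries now hold arrays.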

Related

Is there a way to order columns (sliced with iloc) in pandas to match the main dataframe?

I'm trying to separate columns by slicing them because I need to assign a dtype to each one. So I grouped them by dtype, assigned the respective dtype, and now I want to join or concat them back so the result has the same column order as the main dataframe. I should add that it is not possible to do this by column name, because the names may change.
Example:
import pandas as pd
file = pd.read_csv(f, encoding='utf8') #It has 11 columns
intg = file.iloc[:,[0,2,4,6,8,9,11]].astype("Int64")
obj = file.iloc[:,[1,3,5,7,10]].astype(str)
After doing this I need to join them back in the same order as the main file, that is, from 0 to 11.
To join these two chunks you can use the join method, then reindex based on your original dataframe's columns. It should look something like this:
import pandas as pd
file = pd.read_csv(f, encoding='utf8') #It has 11 columns
intg = file.iloc[:,[0,2,4,6,8,9,11]].astype("Int64")
obj = file.iloc[:,[1,3,5,7,10]].astype(str)
out = intg.join(obj).reindex(file.columns, axis="columns")
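Equivalently, pd.concat along the columns axis should give the same result; a small sketch using the slices from the question:
out = pd.concat([intg, obj], axis=1).reindex(file.columns, axis="columns")
Both variants rely on intg and obj sharing the original row index, so the pieces line up row by row before the column order is restored.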

How to iteratively add rows to an initially empty pandas DataFrame?

I have to iteratively add rows to a pandas DataFrame and find this quite hard to achieve. Also performance-wise I'm not sure if this is the best approach.
So from time to time, I get data from a server and this new dataset from the server will be a new row in my pandas DataFrame.
import pandas as pd
import datetime
df = pd.DataFrame([], columns=['Timestamp', 'Value'])
# as this df will grow over time, is this a costly copy (df = df.append), or does pandas do some optimization there, or is there a better way to achieve this?
# ignore_index, as I want the index to automatically increment
df = df.append({'Timestamp': datetime.datetime.now()}, ignore_index=True)
print(df)
After one day the DataFrame will be deleted, but during that time a new row with data will probably be added about 100k times.
The goal is still to achieve this in a very efficient way, runtime-wise (memory doesn't matter too much as enough RAM is present).
I tried this to compare the speed of 'append' with 'loc':
import timeit
code = """
import pandas as pd
df = pd.DataFrame({'A': range(0, 6), 'B' : range(0,6)})
df= df.append({'A' : 3, 'B' : 4}, ignore_index = True)
"""
code2 = """
import pandas as pd
df = pd.DataFrame({'A': range(0, 6), 'B' : range(0,6)})
df.loc[df.index.max()+1, :] = [3, 4]
"""
elapsed_time1 = timeit.timeit(code, number = 1000)/1000
elapsed_time2 = timeit.timeit(code2, number = 1000)/1000
print('With "append" :',elapsed_time1)
print('With "loc" :' , elapsed_time2)
On my machine, I obtained these results :
With "append" : 0.001502693824000744
With "loc" : 0.0010836279180002747
Using "loc" seems to be faster.

log transformation of whole dataframe using numpy

I have a dataframe in python which is made using the following code:
import pandas as pd
df = pd.read_csv('myfile.txt', sep="\t")
df1 = df.iloc[:, 3:]
Now df1 has 24 columns. I would like to transform the values to their log2 values and make a new dataframe with the 24 log-transformed columns. To do so I used numpy.log, as in the following line:
df2 = (numpy.log(df1))
This code does not return what I would like to get. Do you know how to fix it?
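Presumably the mismatch is the log base: numpy.log is the natural logarithm, while the goal stated above is log2. numpy ufuncs applied to a DataFrame return a DataFrame of the same shape, so a base-2 version could look like this sketch:
import numpy as np
import pandas as pd

df = pd.read_csv('myfile.txt', sep="\t")
df1 = df.iloc[:, 3:]
df2 = np.log2(df1)   # elementwise base-2 log; df2 is still a 24-column DataFrame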

Any way to bypass the need to map new variables to tuple values returned from a python function?

Here is my working code example:
def load_databases():
    df1 = pd.read_csv(file_directory+'data1.csv', encoding='utf-8', low_memory=False)
    df2 = pd.read_csv(file_directory+'data2.csv', encoding='utf-8', low_memory=False)
    df3 = pd.read_csv(file_directory+'data3.csv', encoding='utf-8', low_memory=False)
    return df1, df2, df3
df1_new, df2_new, df3_new = load_databases()
To use this function, I need to remember the order and nature of the returned values (df1, df2, df3), and I need to make sure the order of the new variables (df1_new, df2_new, df3_new) matches the function's tuple output. Is there a better way to map the tuple values? Or, better yet, a way to bypass the need for this line:
df1, df2, df3 = load_databases()
So that when I run load_databases(), df1, df2, and df3 will be created and accessible as global variables?
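One common alternative, sketched below under the question's assumptions (the file_directory variable and the data1/data2/data3 file names come from the question), is to return a dict keyed by name instead of a bare tuple, so callers don't have to remember positional order; injecting globals from inside a function is generally discouraged:
import pandas as pd

file_directory = './'   # as in the question; adjust to your data location

def load_databases():
    names = ['data1', 'data2', 'data3']
    return {name: pd.read_csv(file_directory + name + '.csv',
                              encoding='utf-8', low_memory=False)
            for name in names}

dbs = load_databases()
dbs['data1'].head()   # each frame is accessed by name rather than by position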

Can't seem to use pandas to_csv and read_csv to properly read a numpy array

The problem seems to stem from reading the csv back in with read_csv: there is a type issue when I try to perform operations on the numpy array. The following is a minimal working example.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
np.array(df['numpy']).mean()
Out[151]: array([ 0.83151197, 0.00444986])
Which is what I would expect. However, if I write the result to a file and then read the data back into a pandas DataFrame the types are broken.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
df.to_csv('C:/temp/test5.csv')
df5 = pd.read_csv('C:/temp/test5.csv', dtype={'numpy': object})
np.array(df5['numpy']).mean()
TypeError: unsupported operand type(s) for /: 'str' and 'long'
The following is the output of the df5 object:
df5
Out[186]:
Unnamed: 0 numpy
0 0 [0.83151197 0.00444986]
The following is the file contents:
,numpy
0,[ 0.83151197 0.00444986]
The only way I have figured out how to get this to work is to read the data and manually convert the type, which seems silly and slow.
[float(num) for num in df5['numpy'][0][1:-1].split()]
Is there any way to avoid the above?
pd.DataFrame({'col_name': data}) expects a 1D array-like object as data:
In [63]: pd.DataFrame({'numpy': [0.83151197,0.00444986]})
Out[63]:
numpy
0 0.831512
1 0.004450
In [64]: pd.DataFrame({'numpy': np.array([0.83151197,0.00444986])})
Out[64]:
numpy
0 0.831512
1 0.004450
You've wrapped the numpy array in [], so you passed a list of numpy arrays:
In [65]: pd.DataFrame({'numpy': [np.array([0.83151197,0.00444986])]})
Out[65]:
numpy
0 [0.83151197, 0.00444986]
Replace df = pd.DataFrame({'numpy': [x]}) with df = pd.DataFrame({'numpy': x})
Demo:
In [56]: x = np.array([0.83151197,0.00444986])
...: df = pd.DataFrame({'numpy': x})   # x is passed directly, not wrapped in [x]
...: df.to_csv('d:/temp/test5.csv', index=False)
...:
In [57]: df5 = pd.read_csv('d:/temp/test5.csv')
In [58]: df5
Out[58]:
numpy
0 0.831512
1 0.004450
In [59]: df5.dtypes
Out[59]:
numpy float64
dtype: object
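If keeping one array per cell were actually required, a converters argument at read time is another option; a rough sketch reusing the path from the question, where parse_array is a hypothetical helper that mirrors the manual conversion shown earlier:
import numpy as np
import pandas as pd

def parse_array(cell):
    # turn the saved text "[ 0.83151197 0.00444986]" back into an ndarray
    return np.array([float(num) for num in cell.strip('[]').split()])

df5 = pd.read_csv('C:/temp/test5.csv', converters={'numpy': parse_array})
df5['numpy'][0].mean()   # the cell is an ndarray again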
