Convert multiple DataFrames to numpy arrays - python-3.x

I am attempting to convert 5 dataframes to numpy arrays in a loop.
df = [df1, df2, df3, df4, df5]
for index, x in enumerate(df):
    x = x.to_numpy()
print(type(df3)) still gives me pandas DataFrame as the output.

This does not save the result back: rebinding x inside the loop leaves the original DataFrames (and the list) untouched. Assign back into the list by index instead:
for index, x in enumerate(df):
    df[index] = x.to_numpy()
Then df[0] is a numpy array.
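For reference, a minimal self-contained sketch of the same idea; the two toy DataFrames here are made up for illustration:
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})   # hypothetical stand-ins for your frames
df2 = pd.DataFrame({'b': [3, 4]})

dfs = [df1, df2]
for i, frame in enumerate(dfs):
    dfs[i] = frame.to_numpy()   # write the array back into the list

print(type(dfs[0]))   # <class 'numpy.ndarray'>
Note that the name df1 itself still points to a DataFrame; only the list entries now hold arrays.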

Related

Is there a way to order columns (sliced with iloc) in pandas to match the main dataframe?

I'm trying to separate columns by slicing them because I need to assign a dtype to each one. So I grouped them by dtype, assigned the respective dtype, and now I want to join or concat them back so the result has the same column order as the main dataframe. I should add that it is not possible to do this by column name, because the names may change.
Example:
import pandas as pd
file = pd.read_csv(f, encoding='utf8') #It has 11 columns
intg = file.iloc[:,[0,2,4,6,8,9,11]].astype("Int64")
obj = file.iloc[:,[1,3,5,7,10]].astype(str)
After doing this I need to join them back in the same order as the main file, that is, from 0 to 11.
To join these two chunks you can use the join method, then reindex based on your original dataframe's columns. It should look something like this:
import pandas as pd
file = pd.read_csv(f, encoding='utf8') #It has 11 columns
intg = file.iloc[:,[0,2,4,6,8,9,11]].astype("Int64")
obj = file.iloc[:,[1,3,5,7,10]].astype(str)
out = intg.join(obj).reindex(file.columns, axis="columns")
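Equivalently, pd.concat along the columns axis should give the same result; a small sketch using the slices from the question:
out = pd.concat([intg, obj], axis=1).reindex(file.columns, axis="columns")
Both variants rely on intg and obj sharing the original row index, so the pieces line up row by row before the column order is restored.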

How to iteratively add rows to an initially empty pandas DataFrame?

I have to iteratively add rows to a pandas DataFrame and find this quite hard to achieve. Also performance-wise I'm not sure if this is the best approach.
So from time to time, I get data from a server and this new dataset from the server will be a new row in my pandas DataFrame.
import pandas as pd
import datetime
df = pd.DataFrame([], columns=['Timestamp', 'Value'])
# as this df will grow over time, is this a costly copy (df = df.append), or does pandas do some optimization there, or is there a better way to achieve this?
# ignore_index, as I want the index to automatically increment
df = df.append({'Timestamp': datetime.datetime.now()}, ignore_index=True)
print(df)
After one day the DataFrame will be deleted, but during that time a new row with data will probably be added about 100k times.
The goal is still to achieve this in a very efficient way, runtime-wise (memory doesn't matter too much as enough RAM is present).
I tried this to compare the speed of 'append' with 'loc':
import timeit
code = """
import pandas as pd
df = pd.DataFrame({'A': range(0, 6), 'B' : range(0,6)})
df= df.append({'A' : 3, 'B' : 4}, ignore_index = True)
"""
code2 = """
import pandas as pd
df = pd.DataFrame({'A': range(0, 6), 'B' : range(0,6)})
df.loc[df.index.max()+1, :] = [3, 4]
"""
elapsed_time1 = timeit.timeit(code, number = 1000)/1000
elapsed_time2 = timeit.timeit(code2, number = 1000)/1000
print('With "append" :',elapsed_time1)
print('With "loc" :' , elapsed_time2)
On my machine, I obtained these results :
With "append" : 0.001502693824000744
With "loc" : 0.0010836279180002747
Using "loc" seems to be faster.

log transformation of whole dataframe using numpy

I have a dataframe in python which is made using the following code:
import pandas as pd
df = pd.read_csv('myfile.txt', sep="\t")
df1 = df.iloc[:, 3:]
Now df1 has 24 columns. I would like to transform the values to their log2 values and make a new dataframe with the 24 log-transformed columns. To do so I used numpy.log, as in the following line:
df2 = (numpy.log(df1))
This code does not return what I would like to get. Do you know how to fix it?
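Presumably the mismatch is the log base: numpy.log is the natural logarithm, while the goal stated above is log2. numpy ufuncs applied to a DataFrame return a DataFrame of the same shape, so a base-2 version could look like this sketch:
import numpy as np
import pandas as pd

df = pd.read_csv('myfile.txt', sep="\t")
df1 = df.iloc[:, 3:]
df2 = np.log2(df1)   # elementwise base-2 log; df2 is still a 24-column DataFrame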

Any way to bypass the need to map new variables to tuple values returned from a python function?

Here is my working code example:
def load_databases():
    df1 = pd.read_csv(file_directory+'data1.csv', encoding='utf-8', low_memory=False)
    df2 = pd.read_csv(file_directory+'data2.csv', encoding='utf-8', low_memory=False)
    df3 = pd.read_csv(file_directory+'data3.csv', encoding='utf-8', low_memory=False)
    return df1, df2, df3
df1_new, df2_new, df3_new = load_databases()
To use this function, I need to remember the order and nature of the returned values (df1, df2, df3), and I need to make sure the order of the new variables (df1_new, df2_new, df3_new) matches the function's tuple output. Is there a better way to map the tuple values? Or, better yet, a way to bypass the need for this line:
df1, df2, df3 = load_databases()
So that when I run load_databases(), df1, df2, and df3 will be created and accessible as global variables?
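One common alternative, sketched below under the question's assumptions (the file_directory variable and the data1/data2/data3 file names come from the question), is to return a dict keyed by name instead of a bare tuple, so callers don't have to remember positional order; injecting globals from inside a function is generally discouraged:
import pandas as pd

file_directory = './'   # as in the question; adjust to your data location

def load_databases():
    names = ['data1', 'data2', 'data3']
    return {name: pd.read_csv(file_directory + name + '.csv',
                              encoding='utf-8', low_memory=False)
            for name in names}

dbs = load_databases()
dbs['data1'].head()   # each frame is accessed by name rather than by position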

Can't seem to use pandas to_csv and read_csv to properly read a numpy array

The problem seems to stem from reading the csv back in with read_csv: there is a type issue when I try to perform operations on the numpy array. The following is a minimal working example.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
np.array(df['numpy']).mean()
Out[151]: array([ 0.83151197, 0.00444986])
Which is what I would expect. However, if I write the result to a file and then read the data back into a pandas DataFrame the types are broken.
x = np.array([0.83151197,0.00444986])
df = pd.DataFrame({'numpy': [x]})
df.to_csv('C:/temp/test5.csv')
df5 = pd.read_csv('C:/temp/test5.csv', dtype={'numpy': object})
np.array(df5['numpy']).mean()
TypeError: unsupported operand type(s) for /: 'str' and 'long'
The following is the output of the df5 object:
df5
Out[186]:
Unnamed: 0 numpy
0 0 [0.83151197 0.00444986]
The following is the file contents:
,numpy
0,[ 0.83151197 0.00444986]
The only way I have figured out how to get this to work is to read the data and manually convert the type, which seems silly and slow.
[float(num) for num in df5['numpy'][0][1:-1].split()]
Is there any way to avoid the above?
pd.DataFrame({'col_name': data}) expects a 1D array-like object as data:
In [63]: pd.DataFrame({'numpy': [0.83151197,0.00444986]})
Out[63]:
numpy
0 0.831512
1 0.004450
In [64]: pd.DataFrame({'numpy': np.array([0.83151197,0.00444986])})
Out[64]:
numpy
0 0.831512
1 0.004450
You've wrapped the numpy array in [], so you passed a list of numpy arrays:
In [65]: pd.DataFrame({'numpy': [np.array([0.83151197,0.00444986])]})
Out[65]:
numpy
0 [0.83151197, 0.00444986]
Replace df = pd.DataFrame({'numpy': [x]}) with df = pd.DataFrame({'numpy': x})
Demo:
In [56]: x = np.array([0.83151197,0.00444986])
...: df = pd.DataFrame({'numpy': x})   # x is passed directly, not wrapped in [x]
...: df.to_csv('d:/temp/test5.csv', index=False)
...:
In [57]: df5 = pd.read_csv('d:/temp/test5.csv')
In [58]: df5
Out[58]:
numpy
0 0.831512
1 0.004450
In [59]: df5.dtypes
Out[59]:
numpy float64
dtype: object
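If keeping one array per cell were actually required, a converters argument at read time is another option; a rough sketch reusing the path from the question, where parse_array is a hypothetical helper that mirrors the manual conversion shown earlier:
import numpy as np
import pandas as pd

def parse_array(cell):
    # turn the saved text "[ 0.83151197 0.00444986]" back into an ndarray
    return np.array([float(num) for num in cell.strip('[]').split()])

df5 = pd.read_csv('C:/temp/test5.csv', converters={'numpy': parse_array})
df5['numpy'][0].mean()   # the cell is an ndarray again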
