Combining a list of Dataframes - python-3.x

I have a folder with several .csv-files. Each contains data on Time, High, Low, Open, Volumefrom, Volumeto, Close of a cryptocurrency.
I managed to load the .csvs into a list of dataframes and drop the columns Open, High, Low, Volumefrom, Volumeto , which I don't need, leaving me with Time and Close for each dataframe.
Now I want to combine the list of dataframes into one dataframe whose index starts at the timestamp of the youngest coin, which would be iota in this example.
This is the code I wrote so far:
import pandas as pd
import os
# Path to my folder
PATH_COINS = r"C:\Users\...\Coins"
# creating a path for each of the .csv-files and saving it into a list
namelist = [name for name in os.listdir(PATH_COINS)]
path_lists = [os.path.join(PATH_COINS, path) for path in namelist]
# creating the dataframes and saving them into a list
dfs = [pd.read_csv(k, index_col=0) for k in path_lists]
# dropping unwanted columns
for num, i in enumerate(dfs):
    i.drop(columns=["Open", "High", "Low", "Volumefrom", "Volumeto"], inplace=True)
# combining the list of dataframes into one dataframe
pd.concat(dfs, join="inner", axis=1)
However, I am getting an error message and can't figure out how to achieve my goal:
Traceback (most recent call last):
  File "C:/Users/Jonas/PycharmProjects/Pandas/main.py", line 16, in <module>
    pd.concat(dfs, join="inner", axis=1)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\reshape\concat.py", line 226, in concat
    return op.get_result()
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\reshape\concat.py", line 423, in get_result
    copy=self.copy)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 5425, in concatenate_block_managers
    return BlockManager(blocks, axes)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 3282, in __init__
    self._verify_integrity()
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 3493, in _verify_integrity
    construction_error(tot_items, block.shape[1:], self.axes)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 4843, in construction_error
    passed, implied))
ValueError: Shape of passed values is (5, 8514), indices imply (5, 8490)

join="inner" should work.
Check for duplicate index values (e.g. df.index.is_unique), since concat doesn't know how to map multiple duplicate indexes across multiple DFs.
Removing the duplicate index values (e.g. df.drop_duplicates(inplace=True), or one of the other deduplication methods) should resolve it.
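A minimal sketch of that check and fix, assuming the dataframes from the question are already loaded into dfs; note also that pd.concat returns a new dataframe, so its result needs to be assigned:
import pandas as pd

# report which dataframes contain duplicated Time index values
for num, df in enumerate(dfs):
    if not df.index.is_unique:
        print(f"dataframe {num}: {df.index.duplicated().sum()} duplicate index entries")

# keep only the first row per timestamp, then inner-join; the inner join keeps
# only timestamps present in every dataframe, so the combined frame starts at
# the first timestamp of the youngest coin
dfs = [df[~df.index.duplicated(keep='first')] for df in dfs]
combined = pd.concat(dfs, join="inner", axis=1)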

Related

Stacking up dataframes in a 3-dimensional numpy array

I have several pandas dataframes that I would like to stack into a three-dimensional numpy array. I could do the job manually using the following code:
arr = np.array([df1.values, df2.values], dtype="object")
However, since I have many dataframes, I can neither write this line for all the dataframes nor automate it.
I tried to use the append function (np.append(df1.values, df2['1002'].values)), but it flattens the dataframes and ignores their structure. What I want is a three-dimensional numpy array where the first dimension is the number of dataframes I have, the second is the number of rows in each dataframe, and the third is the number of columns. In the first example that I mentioned earlier, I get a three-dimensional numpy array: when I run arr.shape the result is (2,), and when I run arr[0].shape and arr[1].shape, I get (26, 7) and (24, 7) respectively, which are the shapes of their corresponding dataframes.
I even ran np.append(df1.values, df2['1002'].values, axis=0), but I received the error ValueError: all the input array dimensions for the concatenation axis must match exactly. Is there any way to fix this problem and stack all my dataframes into a 3-dimensional numpy array?
Looks like you start with 2 frames with 7 columns, but different numbers of rows. The equivalent of:
In [1]: arr1 = np.ones((26,7)); arr2 = np.zeros((24,7))
...:
In [2]: arr = np.array([arr1, arr2], object)
In [3]: arr.shape
Out[3]: (2,)
In [4]: arr[0].shape
Out[4]: (26, 7)
You probably tried this without the object and got a 'ragged array' warning. In any case, this is not a 3d array. It is 1d (2,), with two arrays. It's roughly the same as the list
[arr1, arr2]
The np.append docs should make it clear that it flattens the arguments, when you don't specify an axis.
In [6]: np.append(arr1,arr2).shape
Out[6]: (350,)
You could specify an axis, and get a 2d array, where the 50 is the sum of 26 and 24.
In [7]: np.append(arr1,arr2,axis=0).shape
Out[7]: (50, 7)
This is the same as:
In [8]: np.concatenate((arr1,arr2), axis=0).shape
Out[8]: (50, 7)
np.append is a poorly named cover for np.concatenate. It is not a clone of list append. Learn to use concatenate and its stack derivatives.
With different dataframe shapes, you cannot make a 3d array. Arrays cannot be 'ragged'.
As for working with more than 2 dataframes, if you can make a list of all the frames, you can use the initial syntax.
alist = []
for a in frame_list:
    alist.append(a.values)
arr = np.array(alist, object)
But making such an array doesn't do much for you.
If the frames are all the same size, then you can make a 3d array
In [10]: np.array([arr1[:10,:],arr2[:10,:]]).shape
Out[10]: (2, 10, 7)
In [11]: np.stack([arr1[:10,:],arr2[:10,:]]).shape
Out[11]: (2, 10, 7)
But if they differ, stack will complain about that:
In [12]: np.stack([arr1, arr2])
Traceback (most recent call last):
File "<ipython-input-12-23d05d0422dc>", line 1, in <module>
np.stack([arr1, arr2])
File "<__array_function__ internals>", line 180, in stack
File "/usr/local/lib/python3.8/dist-packages/numpy/core/shape_base.py", line 426, in stack
raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
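If a true 3d array is still wanted despite the unequal row counts, one workaround (not part of the answer above, just a common pattern) is to pad the shorter arrays with NaN rows before stacking; this assumes the arrays have a float dtype:
import numpy as np

arrs = [arr1, arr2]                       # 2d float arrays with the same column count
max_rows = max(a.shape[0] for a in arrs)

# pad each array with NaN rows at the bottom so every shape becomes (max_rows, 7)
padded = [np.pad(a, ((0, max_rows - a.shape[0]), (0, 0)), constant_values=np.nan)
          for a in arrs]

arr3d = np.stack(padded)                  # shape (2, 26, 7) for the example above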

OneHotEncoder failing after combining dataframes

I have a model that runs successfully.
When I tried to predict with it, it failed because, after one-hot encoding, the test set had more columns than the training set.
After some reading I found that I need to concat the two df's first, one-hot encode, then split them apart again.
Added a 'temp' column to the train data set with value 'train'.
Added a 'temp' column to the test data set with value 'test'.
This is so that I can split the df apart later using boolean indexing like this:
X = temp_df[temp_df['temp'] == 'train']
X2 = temp_df[temp_df['temp'] == 'test']
Vertically concat the two df's.
Verify the shape of the new combined df.
Change all columns to type 'category' except 'temp', which is object:
basin category
region category
lga category
extraction_type_class category
management category
quality_group category
quantity category
source category
waterpoint_type category
cluster category
temp object
Now I am simply trying to OneHotEncode like I did before. I choose only categorical columns:
cat_ix = temp_df.select_dtypes(include=['category']).columns
And I try to apply with:
ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)], remainder='passthrough')
temp_df = ct.fit_transform(temp_df)
It fails on the temp_df = ct.fit_transform(temp_df) line.
These identical steps worked perfectly before I added the temp column and concat'd the two df's.
The exact error:
Traceback (most recent call last):
File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\compose\_column_transformer.py", line 778, in _hstack
converted_Xs = [
File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\compose\_column_transformer.py", line 779, in <listcomp>
check_array(X, accept_sparse=True, force_all_finite=False)
File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\utils\validation.py", line 738, in check_array
array = np.asarray(array, order=order, dtype=dtype)
File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\generic.py", line 1993, in __array__
return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: 'train'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\compose\_column_transformer.py", line 783, in _hstack
raise ValueError(
ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric.
Why is it complaining about 'train'? That is in the 'temp' column which is being excluded.
Note that the traceback doesn't reference OneHotEncoder, it's all the ColumnTransformer. You're trying to pass through the temp column, which gets tacked onto the one-hot-encoded sparse matrix in the method _hstack, and the second error message is the more relevant one. It cannot stack a string-type array onto a numeric sparse array (which leads to the first error message).
If the sparse matrix isn't too large, you can just force it to be dense by using sparse_threshold=0 in the ColumnTransformer or sparse=False in the OneHotEncoder. If it is too large for memory (or you'd prefer the sparse matrices), you could use a 0/1 indicator for the train/test split instead of the strings "train", "test".
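A short sketch of both options, assuming temp_df and cat_ix are as in the question (the variable names for the results are illustrative):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Option 1: force a dense result, so the passthrough string column can be stacked
ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)],
                       remainder='passthrough', sparse_threshold=0)
dense_result = ct.fit_transform(temp_df)

# Option 2: keep the output sparse by making the indicator numeric instead of a string
temp_df['temp'] = (temp_df['temp'] == 'train').astype(int)   # 1 = train, 0 = test
ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)], remainder='passthrough')
sparse_result = ct.fit_transform(temp_df)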

How to send many pickled files into a dataframe?

I have many files which have been created using "pickle".
I want to send them to a dataframe, calculate the average (from the 2nd row until the end) of each one, multiply it by 1000 and round it to 2 decimals.
So far I have achieved this using 1 pickle file.
import pandas as pd
df = pd.read_pickle(r'C:\Users\file_inference_time')
df = pd.DataFrame(df)
df.rename(columns={0:'MobileNet'},inplace=True)
df_mean=(df.iloc[2::,:].mean()* 1000).round(decimals=2)
df_mean2=pd.DataFrame(df_mean)
df_mean2
Result I get from 1 file.
These are the files ("pickle") that I need to read
EDIT
This is the error that I get when running the 2nd option
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-b72e45d8bcfc> in <module>
16
17
---> 18 df_mean_all = pd.concat(df_mean_list).reset_index(drop=True)
19
20 print(df_mean_all)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
253 verify_integrity=verify_integrity,
254 copy=copy,
--> 255 sort=sort,
256 )
257
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)
302
303 if len(objs) == 0:
--> 304 raise ValueError("No objects to concatenate")
305
306 if keys is None:
ValueError: No objects to concatenate
This is a plot with the results.
Get a dict of dataframes
Save the calculated mean result for each file, into a dict
from pathlib import Path
dir_path = Path(r'C:\Users\path_to_files')
files = dir_path.glob('**/file_inference_time*')  # get all pkl files in main dir and subdirectories
df_mean_dict = dict()
for i, file in enumerate(files):
    df = pd.DataFrame(pd.read_pickle(file))
    df.rename(columns={0: 'MobileNet'}, inplace=True)
    df_mean_dict[i] = pd.DataFrame((df.iloc[2::, :].mean() * 1000).round(decimals=2))
    # if all the file names are unique, the dict key can be the file name (w/o the file extension)
    # df_mean_dict[file.stem] = pd.DataFrame((df.iloc[2::, :].mean() * 1000).round(decimals=2))
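If the dict route is taken, the per-file results can still be combined later; pd.concat accepts a dict of dataframes and uses the keys as the outer level of a MultiIndex (a small sketch, assuming df_mean_dict was built as above):
import pandas as pd

# the dict keys (enumeration index, or file stem) become the first index level
df_mean_all = pd.concat(df_mean_dict)
print(df_mean_all)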
Get a single dataframe - This is what I would do
The result df_mean_all will be a single, 2-column dataframe.
column 0 will be MobileNet
column 1 will be file
dir_path = Path(r'C:\Users\path_to_files')
files = dir_path.glob('**/file_inference_time*')  # get all pkl files in main dir and subdirectories
# to check if the files are found
# if an empty list prints, no files are found
files = list(files)
print(files[:5])
df_mean_list = list()
for file in files:
    df = pd.DataFrame(pd.read_pickle(file))
    df_mean = pd.DataFrame((df.iloc[2::, :].mean() * 1000).round(decimals=2)).reset_index(drop=True).rename(columns={0: 'MobileNet'})
    df_mean['file'] = file  # or file.stem for just the file name
    df_mean_list.append(df_mean)
# df_mean_list is a list of dataframes, pd.concat combines them all into one dataframe
df_mean_all = pd.concat(df_mean_list).reset_index(drop=True)
print(df_mean_all)
MobileNet file
0 3.24 C:\Users\file_inference_time\file1.pkl
1 2.34 C:\Users\file_inference_time\file2.pkl
2 4.23 C:\Users\file_inference_time\file3.pkl
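The question also mentions a plot of the results; a minimal sketch from df_mean_all, assuming matplotlib is available and that the values are milliseconds (given the multiplication by 1000):
import matplotlib.pyplot as plt

# one bar per file, showing the rounded mean inference time
ax = df_mean_all.plot.bar(x='file', y='MobileNet', legend=False)
ax.set_ylabel('mean inference time (ms)')
plt.tight_layout()
plt.show()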

Strange problem when saving to excel pandas

I have some problems writing to Excel. I have 15 columns in my dataframe. I wish to write only 7 of them to Excel and, in the process, use different names for the headers.
Here is my code
cols = ['SN', 'Date_x','Material_x', 'Batch_x', 'Qty_x', 'Booked_x', 'State_x']
headers = ['SN', 'Date', 'Material', 'Batch', 'Qty', 'Booked', 'State']
df.style.apply(highlight_changes_ivt2, axis=None).to_excel(writer, columns =cols, header=headers, sheet_name="temp", index = False)
But I have the following errors
File "/home/week/anaconda3/envs/SC/lib/python3.7/site-packages/pandas/io/formats/style.py", line 235, in to_excel
engine=engine,
File "/home/week/anaconda3/envs/SC/lib/python3.7/site-packages/pandas/io/formats/excel.py", line 735, in write
freeze_panes=freeze_panes,
File "/home/week/anaconda3/envs/SC/lib/python3.7/site-packages/pandas/io/excel/_xlsxwriter.py", line 214, in write_cells
for cell in cells:
File "/home/week/anaconda3/envs/SC/lib/python3.7/site-packages/pandas/io/formats/excel.py", line 684, in get_formatted_cells
for cell in itertools.chain(self._format_header(), self._format_body()):
File "/home/week/anaconda3/envs/SC/lib/python3.7/site-packages/pandas/io/formats/excel.py", line 513, in _format_header_regular
f"Writing {len(self.columns)} cols but got {len(self.header)} "
ValueError: Writing 15 cols but got 7 aliases
I tried debugging by setting pdb.set_trace():
df.style.apply(highlight_changes_ivt2, axis=None).to_excel(writer, columns =cols, header=headers, sheet_name="temp", index = False)
(Pdb) df.columns
Index(['SN', 'Status_x', 'Material_x', 'Batch_x', 'Date_x', 'Quantity_x',
'Booked_x', 'DiffQty_x', 'Status_y', 'Material_y', 'Batch_y',
'Date_y', 'Quantity_y', 'Booked_y', 'DiffQty_y'],
dtype='object')
(Pdb)
This code runs fine on my laptop at home though... just wondering what's wrong. The only difference is the Python version: 3.7 here and 3.8 back at home.
Thanks
Let me elaborate on my idea from the comment with an example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, -1))
# this is the reference dataframe
np.random.seed(1)
ref_df = pd.DataFrame(np.random.randint(1, 10, (4, 4)))
# this is the function
def highlight(col, ref_df=None):
    return ['background-color: yellow' if c > r else ''
            for c, r in zip(col, ref_df[col.name])]
# this works: select the columns first, then style and write
df[[0, 1, 3]].style.apply(highlight, ref_df=ref_df).to_excel('style.xlsx', header=list('abc'))
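Applied to the question's own dataframe, the same idea is to subset before styling; the traceback shows that all 15 columns of the styled frame reach the Excel formatter, so the 7 header aliases can never match. A sketch using the cols and headers lists from the question (assuming highlight_changes_ivt2 only depends on the selected columns and that the names in cols match df.columns):
cols = ['SN', 'Date_x', 'Material_x', 'Batch_x', 'Qty_x', 'Booked_x', 'State_x']
headers = ['SN', 'Date', 'Material', 'Batch', 'Qty', 'Booked', 'State']

# subset first, so the styled frame and the header aliases both have 7 columns
df[cols].style.apply(highlight_changes_ivt2, axis=None).to_excel(
    writer, header=headers, sheet_name="temp", index=False)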

applying a lambda function to pandas dataframe

First time posting on stackoverflow, so bear with me if I'm making some faux pas please :)
I'm trying to calculate the distance between two points, using geopy, but I can't quite get the actual application of the calculation to work.
Here's the head of the dataframe I'm working with (there are some missing values later in the dataframe, not sure if this is the issue or how to handle it in general):
start lat start long end_lat end_long
0 38.902760 -77.038630 38.880300 -76.986200
2 38.895914 -77.026064 38.915400 -77.044600
3 38.888251 -77.049426 38.895914 -77.026064
4 38.892300 -77.043600 38.888251 -77.049426
I've set up a function:
def dist_calc(st_lat, st_long, fin_lat, fin_long):
    from geopy.distance import vincenty
    start = (st_lat, st_long)
    end = (fin_lat, fin_long)
    return vincenty(start, end).miles
This one works fine when given manual input.
However, when I try to apply() the function, I run into trouble with the below code:
distances = df.apply(lambda row: dist_calc(row[-4], row[-3], row[-2], row[-1]), axis=1)
I'm fairly new to python, any help will be much appreciated!
Edit: error message:
distances = df.apply(lambda row: dist_calc2(row[-4], row[-3], row[-2], row[-1]), axis=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4358, in _apply_standard
results[i] = func(v)
File "<stdin>", line 1, in <lambda>
File "<stdin>", line 5, in dist_calc2
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 322, in __init__
super(vincenty, self).__init__(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 115, in __init__
kilometers += self.measure(a, b)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 414, in measure
u_sq = cos_sq_alpha * (major ** 2 - minor ** 2) / minor ** 2
UnboundLocalError: ("local variable 'cos_sq_alpha' referenced before assignment", 'occurred at index 10')
The default settings for pandas functions typically used to import text data like this (pd.read_table() etc) will interpret the spaces in the first 2 column names as separators, so you'll end up with 6 columns instead of 4, and your data will be misaligned:
In [23]: df = pd.read_clipboard()
In [24]: df
Out[24]:
start lat start.1 long end_lat end_long
0 0 38.902760 -77.038630 38.880300 -76.986200 NaN
1 2 38.895914 -77.026064 38.915400 -77.044600 NaN
2 3 38.888251 -77.049426 38.895914 -77.026064 NaN
3 4 38.892300 -77.043600 38.888251 -77.049426 NaN
In [25]: df.columns
Out[25]: Index(['start', 'lat', 'start.1', 'long', 'end_lat', 'end_long'], dtype='object')
Notice column names are wrong, the last column is full of NaNs, etc. If I apply your function to the dataframe in this form, I get the same error as you did.
It's usually better to try to fix this before it gets imported as a dataframe. I can think of 2 methods:
clean the data before importing, for example copy it into an editor and replace the offending spaces with underscores. This is the easiest.
use a regex to fix it during import. This may be necessary if the dataset is very large, or it is pulled from a website and has to be refreshed regularly.
Here's an example of case (2):
In [35]: df = pd.read_clipboard(sep=r'\s{2,}|\s(?=-)', engine='python')
In [36]: df = df.rename_axis({'start lat': 'start_lat', 'start long': 'start_long'}, axis=1)
In [37]: df
Out[37]:
start_lat start_long end_lat end_long
0 38.902760 -77.038630 38.880300 -76.986200
2 38.895914 -77.026064 38.915400 -77.044600
3 38.888251 -77.049426 38.895914 -77.026064
4 38.892300 -77.043600 38.888251 -77.049426
The regex specifies that separators must contain either 2+ whitespace characters, or 1 whitespace followed by a hyphen (minus sign). Then I rename the columns to what I assume are the expected values.
From this point your function / apply works fine, but I've changed it a little:
PEP8 recommends putting imports at the top of each file, rather than in a function
Extracting the columns by name is more robust, and would have given a much more understandable error than the weird error thrown by geopy.
For example:
In [51]: def dist_calc(row):
...: start = row[['start_lat','start_long']]
...: end = row[['end_lat', 'end_long']]
...: return vincenty(start, end).miles
...:
In [52]: df.apply(lambda row: dist_calc(row), axis=1)
Out[52]:
0 3.223232
2 1.674780
3 1.365851
4 0.420305
dtype: float64
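One hedged side note: vincenty was removed in geopy 2.0, so on a current install the import above fails; geopy.distance.geodesic is the replacement, and combined with the name-based column access the function looks roughly like this:
from geopy.distance import geodesic

def dist_calc(row):
    # build (lat, long) tuples from the renamed columns
    start = (row['start_lat'], row['start_long'])
    end = (row['end_lat'], row['end_long'])
    return geodesic(start, end).miles

# same apply pattern as above
df['distance_miles'] = df.apply(dist_calc, axis=1)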
