How to read many pickled files into a dataframe? - python-3.x

I have many files that were created with "pickle".
I want to load each of them into a dataframe, calculate the average (from the 2nd row to the end), multiply it by 1000 and round it to 2 decimals.
So far I have only achieved this with one pickle file.
import pandas as pd
df = pd.read_pickle(r'C:\Users\file_inference_time')
df = pd.DataFrame(df)
df.rename(columns={0:'MobileNet'},inplace=True)
df_mean=(df.iloc[2::,:].mean()* 1000).round(decimals=2)
df_mean2=pd.DataFrame(df_mean)
df_mean2
This is the result I get from one file.
These are the pickle files that I need to read:
EDIT
This is the error that I get when running the 2nd option
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-b72e45d8bcfc> in <module>
16
17
---> 18 df_mean_all = pd.concat(df_mean_list).reset_index(drop=True)
19
20 print(df_mean_all)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
253 verify_integrity=verify_integrity,
254 copy=copy,
--> 255 sort=sort,
256 )
257
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)
302
303 if len(objs) == 0:
--> 304 raise ValueError("No objects to concatenate")
305
306 if keys is None:
ValueError: No objects to concatenate
This is a plot with the results:

Get a dict of dataframes
Save the calculated mean result for each file into a dict.
from pathlib import Path
import pandas as pd

dir_path = Path(r'C:\Users\path_to_files')
files = dir_path.glob('**/file_inference_time*')  # get all pkl files in the main dir and subdirectories

df_mean_dict = dict()
for i, file in enumerate(files):
    df = pd.DataFrame(pd.read_pickle(file))
    df.rename(columns={0: 'MobileNet'}, inplace=True)
    df_mean_dict[i] = pd.DataFrame((df.iloc[2:, :].mean() * 1000).round(decimals=2))
    # if all the file names are unique, the dict key can be the file name (w/o the file extension)
    # df_mean_dict[file.stem] = pd.DataFrame((df.iloc[2:, :].mean() * 1000).round(decimals=2))
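If these per-file results later need to live in a single table, the dict can be fed straight to pd.concat (a minimal sketch, assuming df_mean_dict was filled as above; the level names are just illustrative):
# the dict keys (file index or file name) become the outer level of a MultiIndex
combined = pd.concat(df_mean_dict)
combined.index.names = ['file', 'model']
print(combined)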
Get a single dataframe - This is what I would do
The result df_mean_all will be a single, 2-column dataframe.
column 0 will be MobileNet
column 1 will be file
dir_path = Path(r'C:\Users\path_to_files')
files = dir_path.glob('**/file_inference_time*')  # get all pkl files in the main dir and subdirectories

# to check if the files are found
# if an empty list prints, no files are found
files = list(files)
print(files[:5])

df_mean_list = list()
for file in files:
    df = pd.DataFrame(pd.read_pickle(file))
    df_mean = pd.DataFrame((df.iloc[2:, :].mean() * 1000).round(decimals=2)).reset_index(drop=True).rename(columns={0: 'MobileNet'})
    df_mean['file'] = file  # or file.stem for just the file name
    df_mean_list.append(df_mean)

# df_mean_list is a list of dataframes, pd.concat combines them all into one dataframe
df_mean_all = pd.concat(df_mean_list).reset_index(drop=True)
print(df_mean_all)
   MobileNet                                     file
0       3.24  C:\Users\file_inference_time\file1.pkl
1       2.34  C:\Users\file_inference_time\file2.pkl
2       4.23  C:\Users\file_inference_time\file3.pkl
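To get a plot of these results (like the one mentioned in the question), a bar chart of the per-file means is one option. A minimal sketch, assuming df_mean_all from the code above and that the ×1000 factor converts seconds to milliseconds; the figure size is arbitrary:
import matplotlib.pyplot as plt
ax = df_mean_all.plot.bar(x='file', y='MobileNet', legend=False, figsize=(10, 4))
ax.set_ylabel('mean inference time (ms)')
plt.tight_layout()
plt.show()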

Related

How to extract 4dimensional data from a list of pandas dataframes?

I have a list of 500 dataframes (in the form of .csv files); 500 = 20 (time) x 25 (energy) bins. In other words, each dataframe is a measurement of flux at a single time and energy and is represented as a 150x150 mesh grid corresponding to the x and y spatial coordinates. I would like to transform these data into 4-d coordinates of the form Flux(x, y, t, E), such that I have a new set of dataframes with columns E and rows t for any given (x, y) position.
I am not sure how to approach the problem, and I would appreciate some sort of roadmap for doing this.
Note:
The time and energy of each dataframe are encoded in the name of the corresponding .csv file, in the form time-5e+35-energy0.00023-position.csv, where t = -5×10^35 and E = 0.00023.
What I know:
500 dataframes of 20 t x 25 E must be converted to 22,500 dataframes of 150x150 coordinates. However, this is very time-consuming and I am not sure whether there is another package in python3 that can do the job more easily.
Code that combines your files into one big Pandas dataframe of 11,250,000 rows (= 25 × 20 × 150 × 150):
import pandas as pd
from glob import glob
import re
from datetime import datetime

pattern_file_name = re.compile(r'time-(.*)-energy(.*)-position.csv')
start_time = datetime.now()
result_df = None

for file_name in glob('time-*.csv'):
    # extract time and energy values from file name
    if not pattern_file_name.match(file_name):
        raise ValueError(f'file name {file_name} failed pattern match.')
    time_s, energy_s = pattern_file_name.findall(file_name)[0]
    time, energy = float(time_s), float(energy_s)
    print(f'Processing | {time_s} | {energy_s} |...')

    df = pd.read_csv(file_name, header=None)
    # assuming the CSV (i) has no headers (ii) is an array of 150x150...
    # ...floats with no missing or problematic values (iii) each row...
    # ...represents a fixed y-coordinate; adjust to your needs
    df.index.name = 'y'
    df = df.stack()
    df.index.rename('x', level=-1, inplace=True)
    df = df.swaplevel().sort_index().reset_index().rename(columns={0: 'flux'})
    # df is now (x, y, flux)
    # x and y will each vary from 0 to 149
    df.insert(0, 't', time)
    df.insert(0, 'E', energy)
    result_df = df if result_df is None else pd.concat([result_df, df])

result_df = result_df.set_index(['E', 't', 'x', 'y']).sort_index()
# result_df is now (E, t, x, y) -> flux
result_df.to_csv('output.csv', index=True)

final_time = datetime.now()
delta_time = final_time - start_time
print(f'Completed in {delta_time}')
The main steps are as follows:
Loop over file names
Extract t and E values from file name
Read square matrix of flux values from file
Transform 150 × 150 square matrix to Pandas dataframe of length 22,500
Add columns to keep track of E and t
Append local result to a global, ever-increasing result vector
Finally, leave the loop and save results to disk as CSV
The resulting CSV file will have 5 columns. The first four represent (E, t, x, y) and the last column is the value of the flux field at those coordinates.
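To get back the t × E table for one spatial position afterwards, the MultiIndex can be sliced and unstacked; a minimal sketch (the coordinates 75 and 80 are arbitrary examples):
# flux at a fixed (x, y), with t as rows and E as columns
flux_xy = result_df.xs((75, 80), level=('x', 'y'))['flux'].unstack('E')
print(flux_xy)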

Performing a Principal Component Analysis to reconstruct time series creates more values than expected

I want to do a Principal Component Analysis following this notebook to reconstruct the DJIA (using alpha_vantage) from its components (obtained with Quandl). Yet it seems that I create more values than expected (more than in the original dataframe) when reconstructing the values by multiplying the principal components by their weights:
kernel_pca = KernelPCA(n_components=5).fit(df_z_components)
pca_5 = kernel_pca.transform(-daily_df_components)
weights = fn_weighted_average(kernel_pca.lambdas_)
reconstructed_values = np.dot(pca_5, weights)
Indeed, daily_df_components is created from the components of the DJIA via the Quandl API, which seems to have more data than the library I use to get the DJIA index, alpha_vantage.
Here is the full code
"""
Obtaining the components data from quandl
"""
import quandl
QUANDL_API_KEY = 'MYKEY'
quandl.ApiConfig.api_key = QUANDL_API_KEY
SYMBOLS = [
'AAPL', 'MMM', 'BA', 'AXP', 'CAT',
'CVX', 'CSCO', 'KO', 'DD', 'XOM',
'GS', 'HD', 'IBM', 'INTC', 'JNJ',
'JPM', 'MCD', 'MRK', 'MSFT', 'NKE',
'PFE', 'PG', 'UNH', 'UTX', 'TRV',
'VZ', 'V', 'WMT', 'WBA', 'DIS'
]
wiki_symbols = ['WIKI/%s'%symbol for symbol in SYMBOLS]
df_components = quandl.get(
wiki_symbols,
start_date='2017-01-01',
end_date='2017-12-31',
column_index=11)
df_components.columns = SYMBOLS
filled_df_components = df_components.fillna(method='ffill')
daily_df_components = filled_df_components.resample('24h').ffill()
daily_df_components = daily_df_components.fillna(method='bfill')
"""
Download the all-time DJIA dataset
"""
from alpha_vantage.timeseries import TimeSeries
# Update your Alpha Vantage API key here...
ALPHA_VANTAGE_API_KEY = 'MYKEY'
ts = TimeSeries(key=ALPHA_VANTAGE_API_KEY, output_format='pandas')
df, meta_data = ts.get_intraday(symbol='DIA',interval='1min', outputsize='full')
# Finding eigenvectors and eigen values
fn_weighted_average = lambda x: x/x.sum()
weighted_values = fn_weighted_average(fitted_pca.lambdas_)[:5]
from sklearn.decomposition import KernelPCA
fn_z_score = lambda x: (x - x.mean())/x.std()
df_z_components = daily_df_components.apply(fn_z_score)
fitted_pca = KernelPCA().fit(df_z_components)
# Reconstructing the Dow Average with PCA
import numpy as np
kernel_pca = KernelPCA(n_components=5).fit(df_z_components)
pca_5 = kernel_pca.transform(-daily_df_components)
weights = fn_weighted_average(kernel_pca.lambdas_)
reconstructed_values = np.dot(pca_5, weights)
# Combine PCA and Index to compare
df_combined = djia_2020_weird.copy()
df_combined['pca_5'] = reconstructed_values
But it returns:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-100-2808dc14f789> in <module>()
9 # Combine PCA and Index to compare
10 df_combined = djia_2020_weird.copy()
---> 11 df_combined['pca_5'] = reconstructed_values
12 df_combined = df_combined.apply(fn_z_score)
13 df_combined.plot(figsize=(12,8));
3 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/construction.py in sanitize_index(data, index)
746 if len(data) != len(index):
747 raise ValueError(
--> 748 "Length of values "
749 f"({len(data)}) "
750 "does not match length of index "
ValueError: Length of values (361) does not match length of index (14)
Indeed, reconstructed_values is 361 values long while df_combined is only 14 values long...
Here is this last dataframe:
DJI
date
2021-01-21 NaN
2021-01-22 311.37
2021-01-23 310.03
2021-01-24 310.03
2021-01-25 310.03
2021-01-26 309.01
2021-01-27 309.49
2021-01-28 302.17
2021-01-29 305.25
2021-01-30 299.20
2021-01-31 299.20
2021-02-01 299.20
2021-02-02 302.13
2021-02-03 307.86
Maybe the reason is that the notebook author was able to get the data for the whole year he was interested in, whereas when I run the code it seems that I only get two months?
Ahoy there, I'm the author of the notebook. It seems Quandl no longer provides historical prices of DJIA after the time of writing, and copyright wasn't granted to redistribute the data. For research, you may consider other free stock tickers to proxy DJIA.
The example usages have been updated in the repo to demonstrate KernelPCA, as explained here.
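For reference, the length mismatch itself comes from the fact that reconstructed_values has one entry per row of daily_df_components (the 2017 component data), while df_combined only holds the short window returned by Alpha Vantage. A minimal sketch of keeping the reconstruction attached to its own dates, so any comparison happens through an index join rather than positional assignment (variable names follow the question):
# wrap the reconstruction in a Series indexed by the dates it was computed from
pca_series = pd.Series(reconstructed_values, index=daily_df_components.index, name='pca_5')
# an index join keeps only overlapping dates instead of raising a length error
df_combined = djia_2020_weird.join(pca_series, how='inner')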

KeyError: "None of [Index(['23/01/2020' ......,\n dtype='object', length=9050)] are in the [columns]"

I am learning pandas and matplotlib on my own, using a public dataset available via
this api link
I'm using Colab and below is my code:
import datetime
import io
import json
import pandas as pd
import requests
import matplotlib.pyplot as plt
confirm_resp = requests.get('https://api.data.gov.hk/v2/filterq=%7B%22resource%22%3A%22http%3A%2F%2Fwww.chp.gov.hk%2Ffiles%2Fmisc%2Fenhanced_sur_covid_19_eng.csv%22%2C%22section%22%3A1%2C%22format%22%3A%22json%22%7D').content
confirm_df = pd.read_json(io.StringIO(confirm_resp.decode('utf-8')))
confirm_df.columns = confirm_df.columns.str.replace(" ", "_")
pd.to_datetime(confirm_df['Report_date'])
confirm_df.columns = ['Case_no', 'Report_date', 'Onset_date', 'Gender', 'Age',
'Name_of_hospital_admitted', 'Status', 'Resident', 'Case_classification', 'Confirmed_probable']
confirm_df = confirm_df.drop('Name_of_hospital_admitted', axis = 1)
confirm_df.head()
and this is what the dataframe looks like:
Case_no  Report_date  Onset_date  Gender  Age  Status      Resident         Case_classification  Confirmed_probable
1        23/01/2020   21/01/2020  M       39   Discharged  Non-HK resident  Imported case        Confirmed
2        23/01/2020   18/01/2020  M       56   Discharged  HK resident      Imported case        Confirmed
3        24/01/2020   20/01/2020  F       62   Discharged  Non-HK resident  Imported case        Confirmed
4        24/01/2020   23/01/2020  F       62   Discharged  Non-HK resident  Imported case        Confirmed
5        24/01/2020   23/01/2020  M       63   Discharged  Non-HK resident  Imported case        Confirmed
When I try to make a simple plot with the below code:
x = confirm_df['Report_date']
y = confirm_df['Case_classification']
confirm_df.plot(x, y)
It gives me the below error:
KeyError Traceback (most recent call last)
<ipython-input-17-e4139a9b5ef1> in <module>()
4 y = confirm_df['Case_classification']
5
----> 6 confirm_df.plot(x, y)
3 frames
/usr/local/lib/python3.6/dist-packages/pandas/plotting/_core.py in __call__(self, *args, **kwargs)
912 if is_integer(x) and not data.columns.holds_integer():
913 x = data_cols[x]
--> 914 elif not isinstance(data[x], ABCSeries):
915 raise ValueError("x must be a label or position")
916 data = data.set_index(x)
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in __getitem__(self, key)
2910 if is_iterator(key):
2911 key = list(key)
-> 2912 indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
2913
2914 # take() does not accept boolean indexers
/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1252 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1253
-> 1254 self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
1255 return keyarr, indexer
1256
/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1296 if missing == len(indexer):
1297 axis_name = self.obj._get_axis_name(axis)
-> 1298 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1299
1300 # We (temporarily) allow for some missing keys with .loc, except in
KeyError: "*None of [Index(['23/01/2020', '23/01/2020', '24/01/2020', '24/01/2020', '24/01/2020',\n '26/01/2020', '26/01/2020', '26/01/2020', '29/01/2020', '29/01/2020',\n ...\n '05/01/2021', '05/01/2021', '05/01/2021', '05/01/2021', '05/01/2021',\n '05/01/2021', '05/01/2021', '05/01/2021', '05/01/2021', '05/01/2021'],\n dtype='object', length=9050)] are in the [column*s]"
I have tried to make the plot with and without converting Report_date to a datetime object, and I tried substituting the x value with every other column in the dataframe, but they all give me the same error.
I would appreciate it if anyone could help me understand how to handle these issues here and going forward. I've spent hours trying to resolve it but cannot find the answer.
I did not encounter this issue before when I downloaded notebooks and datasets from Kaggle to follow along.
Thank you and happy new year.
First, you need to assign the converted date back to the column:
confirm_df['Report_date'] = pd.to_datetime(confirm_df['Report_date'])
Second, when the plot method is called on a dataframe object, you only need to provide the column names as arguments (1).
confirm_df.plot(x='Report_date', y='Case_classification')
But the above code still throws an error, because 'Case_classification' is not numeric data.
You are trying to plot datetime vs. categorical data, so a normal plot won't work, but something like this could work (2):
# I used only first 15 examples here, full dataset is kinda messy
confirm_df.iloc[:15, :].groupby(['Report_date', 'Case_classification']).size().unstack().plot.bar()
(1)pandas.DataFrame.plot
(2)How to plot categorical variable against a date column in Python
Several problems. First, the links were incorrect; I have edited them (probably just a copy/paste error). Second, you have to assign the converted datetime series back to the dataframe; use print(confirm_df.dtypes) to see the difference. Then, the dataset is not ordered by date, but matplotlib expects an ordered x-axis. Well, actually, the problem was that the parser misinterpreted the datetime objects, so I have added dayfirst=True to ensure that the dates are read correctly. Finally, what do you want to plot here? Just the cases by date? The number of cases per group by date? Your original code implies just the former, but that is not really informative, is it?
import io
import pandas as pd
import requests
import matplotlib.pyplot as plt
print("starting download")
confirm_resp = requests.get('https://api.data.gov.hk/v2/filter?q=%7B%22resource%22%3A%22http%3A%2F%2Fwww.chp.gov.hk%2Ffiles%2Fmisc%2Fenhanced_sur_covid_19_eng.csv%22%2C%22section%22%3A1%2C%22format%22%3A%22json%22%7D').content
print("finished download")
confirm_df = pd.read_json(io.StringIO(confirm_resp.decode('utf-8')))
confirm_df.columns = confirm_df.columns.str.replace(" ", "_")
confirm_df['Report_date'] = pd.to_datetime(confirm_df['Report_date'], dayfirst=True)
confirm_df.columns = ['Case_no', 'Report_date', 'Onset_date', 'Gender', 'Age',
'Name_of_hospital_admitted', 'Status', 'Resident', 'Case_classification', 'Confirmed_probable']
confirm_df = confirm_df.drop('Name_of_hospital_admitted', axis = 1)
print(confirm_df.dtypes)
fig, ax = plt.subplots(figsize=(20, 5))
ax.plot(confirm_df['Report_date'], confirm_df['Case_classification'])
plt.tight_layout()
plt.show()
Sample output:
Some grouping and data aggregation might be more informative, but you have to decide what you want to display first before writing the code.
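For example, counting cases per month and classification gives a much more readable picture than plotting the raw categorical column (a minimal sketch building on confirm_df from the code above; the monthly grouping is just one choice):
counts = (confirm_df
          .groupby([confirm_df['Report_date'].dt.to_period('M'), 'Case_classification'])
          .size()
          .unstack(fill_value=0))
counts.plot.bar(stacked=True, figsize=(20, 5))
plt.tight_layout()
plt.show()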

For loop in DataFrame

I have multiple files with a lot of data and 19 columns. I am trying to run multiple for-loops and set the first column, second column, etc. equal to values from the files.
import numpy as np
import glob
import pandas as pd
#
lat=np.zeros(90)
long=np.zeros(180)
indat=np.zeros(19)
#
file_in = glob.glob('filenames*.dat')
for a in range(140):
    for i in range(90):
        for j in range(180):
            df = pd.DataFrame()
            for f in file_in:
                cols = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]  # there are nineteen columns
                indat = df.append(pd.read_csv(f, delimiter='\\s+', header=None, usecols=cols, skiprows=4), ignore_index=True)
                lat[i] = indat[0]  # error here
                long[j] = indat[1]
                # updates some code here
                if i >= 70:
                    dens[a,j,i-70] = indat[2]
It gave me this error:
ValueError: setting an array element with a sequence.
Updates:
indat has 19 columns; there are many files, but the format is the same for all of them.
Sample indat
#columns
#0 1 2 3 ..... 19
-90 0 2e-12 #just some number
-90 2 3e-12 #just some number
-90 4 4e-12 #just some number
...
-90 360 1e-12 #just some number
-88 0 1e-11 #just some number
-88 2 2e-11 #just some number
-88 4 3e-11 #just some number
...
-88 360 4e-11 #just some number
...
90 0 2.5e-12 #just some number
90 2 3.5e-11 #just some number
90 4 4.5e-12 #just some number
...
90 360 1.5e-12 #just some number
EDIT: I cleaned the code up based on everyone's suggestions
import numpy as np
import glob
import pandas as pd

file_in = glob.glob('filenames*.dat')
df = pd.DataFrame()
for f in file_in:
    cols = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]
    indat = pd.read_csv(f, delimiter='\\s+', header=None, usecols=cols, skiprows=4)

for a in range(140):
    for i in range(90):
        for j in range(180):
            lat[i] = indat[0]  # error here
            long[j] = indat[1]
            if i >= 70:
                dens[a,j,i-70] = indat[2]
You tried to assign a column (a pandas Series), indat[0], to an element of a NumPy vector, lat[i].
Also, what is the point of indat=np.zeros(19) when you override it with a dataframe later?
What is the content of indat[0]?
This line of code
indat = df.append(pd.read_csv(f, delimiter='\\s+', header=None, usecols=cols, skiprows=4), ignore_index=True)
is basically the same as
indat = pd.read_csv(f, delimiter='\\s+', header=None, usecols=cols, skiprows=4)
because df never changes, i.e. it is always an empty dataframe.
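If the intent was to stack all the files into one frame, the usual pattern is to collect the pieces in a list and call pd.concat once at the end (a minimal sketch, reusing cols and the read_csv call from the question):
frames = [pd.read_csv(f, delimiter='\\s+', header=None, usecols=cols, skiprows=4)
          for f in file_in]
indat = pd.concat(frames, ignore_index=True)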
Since the content of indat is unknown, it's difficult to fix your code.
If you just want to make it run without an error, I suggest writing
lat[i] = indat[0].values[0]   # take the first value of the vector
long[j] = indat[1].values[0]  # take the first value of the vector
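Alternatively, if the goal is just to fill lat and long with the coordinate values stored in the file, the loops can be avoided entirely; a sketch, assuming column 0 holds latitude and column 1 holds longitude as in your sample:
lat = indat[0].unique()   # e.g. -90, -88, ..., 90
long = indat[1].unique()  # e.g. 0, 2, ..., 360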
It's good to take some tutorial on Numpy and Pandas since it can be very confusing without some basic understanding.

Combining a list of Dataframes

I have a folder with several .csv-files. Each contains data on Time, High, Low, Open, Volumefrom, Volumeto, Close of a cryptocurrency.
I managed to load the .csvs into a list of dataframes and drop the columns Open, High, Low, Volumefrom, Volumeto, which I don't need, leaving me with Time and Close for each dataframe.
Now I want to combine the list of dataframes into one dataframe, where the index starts at the timestamp of the youngest coin, which would be iota in this example.
This is the code I wrote so far:
import pandas as pd
import os
# Path to my folder
PATH_COINS = r"C:\Users\...\Coins"
# creating a path for each of the .csv-files and saving it into a list
namelist = [name for name in os.listdir(PATH_COINS)]
path_lists = [os.path.join(PATH_COINS, path) for path in namelist]
# creating the dataframes and saving them into a list
dfs = [pd.read_csv(k, index_col=0) for k in path_lists]
# dropping unwanted columns
for num, i in enumerate(dfs):
    i.drop(columns=["Open", "High", "Low", "Volumefrom", "Volumeto"], inplace=True)
# combining the list of dataframes into one dataframe
pd.concat(dfs, join="inner", axis=1)
However, I am getting an error message and can't figure out how to achieve my goal:
Traceback (most recent call last):
  File "C:/Users/Jonas/PycharmProjects/Pandas/main.py", line 16, in <module>
    pd.concat(dfs, join="inner", axis=1)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\reshape\concat.py", line 226, in concat
    return op.get_result()
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\reshape\concat.py", line 423, in get_result
    copy=self.copy)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 5425, in concatenate_block_managers
    return BlockManager(blocks, axes)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 3282, in __init__
    self._verify_integrity()
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 3493, in _verify_integrity
    construction_error(tot_items, block.shape[1:], self.axes)
  File "C:\Users\Jonas\PycharmProjects\Pandas\venv\lib\site-packages\pandas\core\internals.py", line 4843, in construction_error
    passed, implied))
ValueError: Shape of passed values is (5, 8514), indices imply (5, 8490)
The join should work.
Check for duplicate index values, since concat doesn't know how to map duplicate indices across multiple DFs (e.g. check df.index.is_unique).
Removing the duplicate index values (e.g. df.drop_duplicates(inplace=True)) or one of the methods here should resolve it.
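A sketch of that check and cleanup on the list of dataframes before calling concat (keep='first' is just one possible policy):
for df in dfs:
    print(df.index.is_unique)   # False identifies the offending dataframe(s)

# keep only the first row for each duplicated timestamp, then combine
dfs = [df[~df.index.duplicated(keep='first')] for df in dfs]
combined = pd.concat(dfs, join="inner", axis=1)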
