I have a list of 500 dataframes (in the form of .csv files); 500 = 20 (time) × 25 (energy) bins. In other words, each dataframe is a measurement of flux at a single time and energy and is represented as a 150x150 mesh grid corresponding to x and y spatial coordinates. However, I would like to transform these data into 4-d coordinates of the form Flux(x, y, t, E), such that I have a new set of dataframes with columns E and rows t for any given (x, y) position.
I am not sure how to approach the problem. I would appreciate your help in giving me some sort of roadmap for doing this procedure.
Note:
The time and energy of each dataframe are encoded in the name of the corresponding .csv file, in the form time-5e+35-energy0.00023-position.csv, where t = -5×10^35 and E = 0.00023.
What I know:
The 500 dataframes (20 t × 25 E) must be converted into 22,500 dataframes, one per (x, y) position on the 150x150 grid. However, this is very time-consuming, and I am not sure whether there is another package in Python 3 that can do the job more easily.
Code that combines your files into one big Pandas dataframe with 11,250,000 rows (25 × 20 × 150 × 150):
import pandas as pd
from glob import glob
import re
from datetime import datetime

pattern_file_name = re.compile(r'time-(.*)-energy(.*)-position\.csv')
start_time = datetime.now()
result_df = None
for file_name in glob('time-*.csv'):
    # extract time and energy values from file name
    if not pattern_file_name.match(file_name):
        raise ValueError(f'file name {file_name} failed pattern match.')
    time_s, energy_s = pattern_file_name.findall(file_name)[0]
    time, energy = float(time_s), float(energy_s)
    print(f'Processing | {time_s} | {energy_s} |...')
    df = pd.read_csv(file_name, header=None)
    # assuming the CSV (i) has no headers (ii) is an array of 150x150...
    # ...floats with no missing or problematic values (iii) each row...
    # ...represents a fixed y-coordinate; adjust to your needs
    df.index.name = 'y'
    df = df.stack()
    df.index.rename('x', level=-1, inplace=True)
    df = df.swaplevel().sort_index().reset_index().rename(columns={0: 'flux'})
    # df is now (x, y, flux)
    # x and y will each vary from 0 to 149
    df.insert(0, 't', time)
    df.insert(0, 'E', energy)
    result_df = df if result_df is None else pd.concat([result_df, df])
result_df = result_df.set_index(['E', 't', 'x', 'y']).sort_index()
# result_df is now (E, t, x, y) -> flux
result_df.to_csv('output.csv', index=True)
final_time = datetime.now()
delta_time = final_time - start_time
print(f'Completed in {delta_time}')
The main steps are as follows:
Loop over the file names
Extract the t and E values from each file name
Read the square matrix of flux values from the file
Transform the 150 × 150 square matrix into a Pandas dataframe of length 22,500
Add columns to keep track of E and t
Append the local result to a global, ever-growing result dataframe
Finally, leave the loop and save the results to disk as CSV
The resulting CSV file will have 5 columns. The first four represent (E, t, x, y) and the last column holds the value of the flux field at those coordinates.
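A possible follow-up on performance, sketched under the same assumptions about the files as the code above: calling pd.concat once per file re-copies the accumulated result every iteration, so collecting the per-file frames in a list and concatenating a single time at the end usually scales much better.

import pandas as pd
from glob import glob
import re

pattern_file_name = re.compile(r'time-(.*)-energy(.*)-position\.csv')
frames = []
for file_name in glob('time-*.csv'):
    time_s, energy_s = pattern_file_name.findall(file_name)[0]
    df = pd.read_csv(file_name, header=None)  # 150x150 grid, no headers
    df.index.name = 'y'
    df = df.stack()
    df.index.rename('x', level=-1, inplace=True)
    df = df.swaplevel().sort_index().reset_index().rename(columns={0: 'flux'})
    df.insert(0, 't', float(time_s))
    df.insert(0, 'E', float(energy_s))
    frames.append(df)  # defer concatenation until after the loop
# a single concat instead of 500 incremental ones
result_df = pd.concat(frames).set_index(['E', 't', 'x', 'y']).sort_index()

From there, the per-position table you described (rows t, columns E) is a cross-section plus an unstack, e.g. for the hypothetical grid point (75, 75):

result_df.xs((75, 75), level=('x', 'y'))['flux'].unstack('E')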
I am selecting spatial and temporal data from this kind of NetCDF, opened with ds = xr.open_mfdataset(file_list):
<xarray.Dataset>
Dimensions: (lat: 576, lon: 1152, time: 1464)
Coordinates:
* lon (lon) float32 0.0 0.3125 0.625 0.9375 ... 359.0625 359.375 359.6875
* lat (lat) float32 89.761 89.4514 89.1399 ... -89.1399 -89.4514 -89.761
* time (time) datetime64[ns] 1980-04-01T01:00:00 ... 1980-06-01
Data variables:
uasmean (lat, lon, time) float32 dask.array<shape=(576, 1152, 1464), chunksize=(576, 1152, 720)>
vasmean (lat, lon, time) float32 dask.array<shape=(576, 1152, 1464), chunksize=(576, 1152, 720)>
Attributes:
Creator: NCAR - CISL RDA (dattore)
history: Mon Aug 11 12:24:36 2014: ncatted -a history,global,d,, -O Wind...
I managed to get the correct subset in time and lon/lat using:
ds = ds.where((ds.time >= np.datetime64(date_ini)) & (ds.time <= np.datetime64(date_end)), drop=True)
ds = ds.where((ds.lon >= lonlat[0]) & (ds.lon <= lonlat[1]) & (ds.lat >= lonlat[2]) & (ds.lat <= lonlat[3]), drop=True)
Finally, to extract this information in my target format, I loop over the time dimension, convert each time slice to a dataframe, and export it to CSV afterwards:
# for t in ds['time']:
t = ds['time'][0]
# Select time and convert to dataframe
df = ds.sel(time=t).to_dataframe()
My problem is that the conversion to a dataframe is slow, and I know the original NetCDF files are written to optimize the extraction of time series rather than of maps, as I am trying to do. I know it is possible to reorder the coordinates and write a new NetCDF to speed this up, but the database is too big, so that is not an option. Do you know of any other way to speed up this extraction?
Thank you all in advance!!!
P.S.: I attached the complete script of the block of code I am using to check the performance...
import os
import random
import shutil
from datetime import datetime, timedelta
from glob import glob
import pandas as pd
import xarray as xr
import numpy as np
import scipy.io
import matplotlib.pyplot as plt
import time
start_time = time.time()
lonlat = [-5, 10, 50, 64]
date_ini = datetime(1980, 4, 28)
date_end = datetime(1980, 5, 3)
print('[Processing 2D winds]')
# create date list to loop over folders
dates = pd.date_range(start=date_ini - timedelta(days=1), end=date_end + timedelta(days=1), freq='D')
# Create date list of files to open
file_list = []
for date in dates:
    file_list.append('Wind_CFS_Global_' + date.strftime('%Y.%m') + '.nc')
# Delete repeated elements
file_list = list(dict.fromkeys(file_list))
print(file_list)
# load data
ds = xr.open_mfdataset(file_list)
# Select temporal subset
ds = ds.get(['uasmean','vasmean'])
ds = ds.where((ds.time >= np.datetime64(date_ini)) & (ds.time <= np.datetime64(date_end)), drop=True)
# convert longitudes from 0°..360° to -180°..180°
ds['lon'] = (ds.lon + 180) % 360 - 180
ds = ds.sortby(['lon', 'lat'])  # sort both coordinates (the second positional argument of sortby is the ascending flag)
ds = ds.where((ds.lon >= lonlat[0]) & (ds.lon <= lonlat[1]) & (ds.lat >= lonlat[2]) & (ds.lat <= lonlat[3]), drop=True)
print(ds)
currents_list = []
# for t in ds['time']:
t = ds['time'][0]
# Select time and depth array
df = ds.sel(time=t).to_dataframe()
# reset index because longitude latitude are as multi-index and I want them as columns
df = df.reset_index()
# sort data-rows for TESEO: longitude, latitude (ascending)
df = df.sort_values(['lon', 'lat'])
# generate full file path
outfile = 'winds_' + df['time'][0].strftime('%Y%m%dT%H%M') + '.txt'
# export to ASCII: space-separated, no header or index column, NaN replaced by 0, three decimal places
df.to_csv(path_or_buf=outfile,
          sep=' ',
          columns=['lon', 'lat', 'uasmean', 'vasmean'],
          header=False,
          index=False,
          na_rep=0,
          float_format='%.3f')
elapsed_time = (time.time() - start_time)
print('Elapsed time: {} sec.'.format(elapsed_time))
I found a big improvement in performance by doing this:
convert the whole xarray dataset to a dataframe first
loop over time directly in the dataframe
That makes a really big difference! I was looping over time and converting each short per-time dataset to a dataframe, which is really inefficient!
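A minimal sketch of that reordering, assuming the same subset ds and output format as the script above: do the expensive to_dataframe() call once, then loop over the time groups entirely in pandas.

# convert the whole subset once; this is the only expensive step
df_all = ds.to_dataframe().reset_index()
# then iterate over the time slices in pandas, not in xarray
for t, df in df_all.groupby('time'):
    df = df.sort_values(['lon', 'lat'])
    outfile = 'winds_' + pd.Timestamp(t).strftime('%Y%m%dT%H%M') + '.txt'
    df.to_csv(path_or_buf=outfile,
              sep=' ',
              columns=['lon', 'lat', 'uasmean', 'vasmean'],
              header=False,
              index=False,
              na_rep=0,
              float_format='%.3f')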
Best regards!
Given the following data:
DC,Mode,Mod,Ven,TY1,TY2,TY3,TY4,TY5,TY6,TY7,TY8
Intra,S,Dir,C1,False,False,False,False,False,True,True,False
Intra,S,Co,C1,False,False,False,False,False,False,False,False
Intra,M,Dir,C1,False,False,False,False,False,False,True,False
Inter,S,Co,C1,False,False,False,False,False,False,False,False
Intra,S,Dir,C2,False,True,True,True,True,True,True,False
Intra,S,Co,C2,False,False,False,False,False,False,False,False
Intra,M,Dir,C2,False,False,False,False,False,False,False,False
Inter,S,Co,C2,False,False,False,False,False,False,False,False
Intra,S,Dir,C3,False,False,False,False,True,True,False,False
Intra,S,Co,C3,False,False,False,False,False,False,False,False
Intra,M,Dir,C3,False,False,False,False,False,False,False,False
Inter,S,Co,C3,False,False,False,False,False,False,False,False
Intra,S,Dir,C4,False,False,False,False,False,True,False,True
Intra,S,Co,C4,True,True,True,True,False,True,False,True
Intra,M,Dir,C4,False,False,False,False,False,True,False,True
Inter,S,Co,C4,True,True,True,False,False,True,False,True
Intra,S,Dir,C5,True,True,False,False,False,False,False,False
Intra,S,Co,C5,False,False,False,False,False,False,False,False
Intra,M,Dir,C5,True,True,False,False,False,False,False,False
Inter,S,Co,C5,False,False,False,False,False,False,False,False
Imports:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
To reproduce my DataFrame, copy the data then use:
df = pd.read_clipboard(sep=',')
I'd like to create a plot conveying the same information as my example, but not necessarily with the same shape (I'm open to suggestions). I'd also like to hover over a color and have the appropriate Ven displayed (e.g. C1, not 1).
Edit 2018-10-17:
The two solutions provided so far are helpful, and each accomplishes a different aspect of what I'm looking for. However, the key issue I'd like to resolve, which wasn't explicitly stated before this edit, is the following:
I would like to perform the plotting without converting Ven to an int; this numeric transformation isn't practical with the real data. So the actual scope of the question is to plot all-categorical data with two categorical axes.
The issue I'm experiencing is that the data is categorical and the y-axis is multi-indexed.
I've done the following to transform the DataFrame:
# replace False with nan
df = df.replace(False, np.nan)
# replace True with a number representing Ven (e.g. C1 = 1)
def rep_ven(row):
    return row.iloc[4:].replace(True, int(row.Ven[1]))
df.iloc[:, 4:] = df.apply(rep_ven, axis=1)
# drop the Ven column
df = df.drop(columns=['Ven'])
# set multi-index
df_m = df.set_index(['DC', 'Mode', 'Mod'])
Plotting the transformed DataFrame produces:
plt.figure(figsize=(20,10))
heatmap = plt.imshow(df_m)
plt.xticks(range(len(df_m.columns.values)), df_m.columns.values)
plt.yticks(range(len(df_m.index)), df_m.index)
plt.show()
This plot isn't very streamlined; there are four axis values for each Ven. This is a subset of the data, so the graph would be very long with all the data.
Here's my solution. Instead of plotting, I just apply a style to the DataFrame; see https://pandas.pydata.org/pandas-docs/stable/style.html
# Transform Ven values from "C1", "C2" to 1, 2, ..
df['Ven'] = df['Ven'].str[1]
# Given a specific combination of dc, mode, mod, ven,
# do we have any True cells?
g = df.groupby(['DC', 'Mode', 'Mod', 'Ven']).any()
# Let's drop any rows with only False values
g = g[g.any(axis=1)]
# Convert True, False to 1, 0
g = g.astype(int)
# Get the values of the ven index as an int array
# Note: we don't want to drop the ven index!!
# Otherwise styling won't work
ven = g.index.get_level_values('Ven').values.astype(int)
# Multiply 1 and 0 with Ven value
g = g.mul(ven, axis=0)
# Sort the index
g.sort_index(ascending=False, inplace=True)
# Now display the dataframe with styling
# first we get a color map
import matplotlib
cmap = matplotlib.cm.get_cmap('tab10')
def apply_color_map(val):
    # hide the 0 values
    if val == 0:
        return 'color: white; background-color: white'
    else:
        # for non-zero: get color from cmap, convert to hex code for css
        return 'color: white; background-color: ' + matplotlib.colors.rgb2hex(cmap(val))

g.style.applymap(apply_color_map)
The available matplotlib colormaps can be seen here: Colormap reference, with some additional explanation here: Choosing a colormap
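If you need the styled table outside a notebook, one hedged option (assuming a 2018-era pandas, where Styler.render() returns the HTML string; newer versions use Styler.to_html()) is to write it to a file:

styled = g.style.applymap(apply_color_map)
html = styled.render()  # Styler.to_html() in newer pandas
with open('styled_table.html', 'w') as f:
    f.write(html)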
Explanation: remove rows where TY1-TY8 are all NaN to create your plot. Refer to this answer as a starting point for creating interactive annotations to display Ven.
The below code should work:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_clipboard(sep=',')
# replace False with nan
df = df.replace(False, np.nan)
# replace True with a number representing Ven (e.g. C1 = 1)
def rep_ven(row):
    return row.iloc[4:].replace(True, int(row.Ven[1]))
df.iloc[:, 4:] = df.apply(rep_ven, axis=1)
# drop the Ven column
df = df.drop(columns=['Ven'])
idx = df[['TY1','TY2', 'TY3', 'TY4','TY5','TY6','TY7','TY8']].dropna(thresh=1).index.values
df = df.loc[idx,:].sort_values(by=['DC', 'Mode','Mod'], ascending=False)
# set multi-index
df_m = df.set_index(['DC', 'Mode', 'Mod'])
plt.figure(figsize=(20,10))
heatmap = plt.imshow(df_m)
plt.xticks(range(len(df_m.columns.values)), df_m.columns.values)
plt.yticks(range(len(df_m.index)), df_m.index)
plt.show()
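For the hover requirement, here is a hedged sketch using the third-party mplcursors package (an assumption: it is a separate install, not part of matplotlib). Attach the cursor before calling plt.show(); it maps the hovered cell back to the Ven label instead of the numeric value:

import mplcursors

cursor = mplcursors.cursor(heatmap, hover=True)

@cursor.connect("add")
def on_add(sel):
    # sel.target holds the (x, y) data coordinates under the cursor
    j, i = int(round(sel.target[0])), int(round(sel.target[1]))
    val = df_m.iloc[i, j]
    # translate the stored number back into the Ven label, e.g. 1 -> 'C1'
    sel.annotation.set_text('' if np.isnan(val) else 'C{}'.format(int(val)))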
I have two dfs for which I want to create a single bar plot; each bar needs its own color depending on which df it came from.
# Ages < 20
df1.tags = ['locari', 'ママコーデ', 'ponte_fashion', 'kurashiru', 'fashion']
df1.tag_count = [2162, 1647, 1443, 1173, 1032]
# Ages 20 - 24
df2.tags= ['instagood', 'ootd', 'fashion', 'followme', 'love']
df2.tag_count = [6523, 4576, 3986, 3847, 3599]
How do I create such a plot?
P.S. The original df is way bigger. Some words may overlap, but I want them to have different colors as well.
Your data frame tag_counts are just simple lists, so you can use standard mpl bar plots to plot both of them in the same axis. This answer assumes that both dataframes have the same length.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Create dataframes
df1=pd.DataFrame()
df2=pd.DataFrame()
# Ages < 20
df1.tags = ['locari', 'blub', 'ponte_fashion', 'kurashiru', 'fashion']
df1.tag_count = [2162, 1647, 1443, 1173, 1032]
# Ages 20 - 24
df2.tags= ['instagood', 'ootd', 'fashion', 'followme', 'love']
df2.tag_count = [6523, 4576, 3986, 3847, 3599]
# Create figure
fig=plt.figure()
ax=fig.add_subplot(111)
# x-coordinates
ind1 = np.arange(len(df1.tag_count))
ind2 = np.arange(len(df2.tag_count))
width = 0.35
# Bar plot for df1
ax.bar(ind1,df1.tag_count,width,color='r')
# Bar plot for df2
ax.bar(ind2+width,df2.tag_count,width,color='b')
# Create new xticks
ticks=list(ind1+0.5*width)+list(ind2+1.5*width)
ticks.sort()
ax.set_xticks(ticks)
# Sort labels in an alternating way
labels = [None]*(len(df1.tags)+len(df2.tags))
labels[::2] = df1.tags
labels[1::2] = df2.tags
ax.set_xticklabels(labels)
plt.show()
This will produce a grouped bar plot with the df1 bars in red and the df2 bars in blue, with the tick labels alternating between the two tag lists.
Note that to merge both tag lists into a single list of labels, I assumed that both lists have the same length.
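One caveat worth noting: df1.tags = [...] attaches a plain attribute to the DataFrame object rather than creating a column (pandas even emits a UserWarning about this). A hedged sketch of the same plot with real columns, which then also works with ordinary DataFrame operations; the tick math assumes a modern matplotlib where bars are center-aligned:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# build real columns instead of instance attributes
df1 = pd.DataFrame({'tags': ['locari', 'ママコーデ', 'ponte_fashion', 'kurashiru', 'fashion'],
                    'tag_count': [2162, 1647, 1443, 1173, 1032]})
df2 = pd.DataFrame({'tags': ['instagood', 'ootd', 'fashion', 'followme', 'love'],
                    'tag_count': [6523, 4576, 3986, 3847, 3599]})

fig, ax = plt.subplots()
ind = np.arange(len(df1))
width = 0.35
# side-by-side bars, one color per source dataframe
ax.bar(ind, df1['tag_count'], width, color='r', label='Ages < 20')
ax.bar(ind + width, df2['tag_count'], width, color='b', label='Ages 20 - 24')
# alternate the tick labels from the two tag columns
ticks = sorted(list(ind) + list(ind + width))
labels = [None] * (len(df1) + len(df2))
labels[::2] = df1['tags']
labels[1::2] = df2['tags']
ax.set_xticks(ticks)
ax.set_xticklabels(labels)
ax.legend()
plt.show()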