I have a dataframe much like the following:
data = {'A':[21,22,23,24,25,26,27,28,29,30,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10],
        'B':[8,8,8,8,8,8,8,8,8,8,5,5,5,5,5,5,5,5,5,5,3,3,3,3,3,3,3,3,3,3],
        'C':[10,15,23,17,18,26,24,30,35,42,44,42,38,36,34,30,27,25,27,24,1,0,2,3,5,26,30,40,42,50]}
data_df = pd.DataFrame(data)
data_df
I would like to have subplots, where the number of subplots equals the number of unique values in column 'B'. The x-axis should show the values in column 'A' and the y-axis the values in column 'C'.
The code that I tried:
fig = px.line(data_df,
              x='A',
              y='C',
              color='B',
              facet_col='B')
fig.show()
gives output like this:
However, I would like to have the graphs in a single column, with each graph autoscaled to the relevant area and resolution on its axes.
One possibility: can I somehow make use of the groupby command to do this?
Since other data may produce a different number of unique values in column 'B' (for example, 5 unique values), I would like this piece of code to work dynamically. Kindly help me.
PS: plotly express module is used to plot the graph.
In order to stack all subplots in one column, and make sure that each x-axis is independent, just add the following to your px.line() call:
facet_col_wrap=1
And then follow up with:
fig.update_xaxes(matches=None)
Plot 1: Default setup with px.line(facet_col = 'B')
If you'd like to display all x-axis labels just include this:
fig.update_xaxes(showticklabels=True)
Plot 2: Show x-axes for all subplots
Complete code:
import plotly.express as px
import pandas as pd
data = {'A':[21,22,23,24,25,26,27,28,29,30,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10],
        'B':[8,8,8,8,8,8,8,8,8,8,5,5,5,5,5,5,5,5,5,5,3,3,3,3,3,3,3,3,3,3],
        'C':[10,15,23,17,18,26,24,30,35,42,44,42,38,36,34,30,27,25,27,24,1,0,2,3,5,26,30,40,42,50]}
data_df = pd.DataFrame(data)
data_df
fig = px.line(data_df,
              x='A',
              y='C',
              color='B',
              facet_col='B',
              facet_col_wrap=1)
fig.update_xaxes(matches=None, showticklabels=True)
fig.show()
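One follow-up worth noting (my addition, using the same update_*axes API as above): the facet y-axes are also linked by default, so if each panel should autoscale vertically as well, unlink them the same way:
fig.update_yaxes(matches=None)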
You can instead use the argument facet_row='B', which will automatically stack the subplots in rows. Then, to automatically rescale, you'll want to set all of the x data to the same array of values, which can be done by looping through fig.data and modifying fig.data[i]['x'] for each i.
import pandas as pd
import plotly.express as px
data = {'A':[21,22,23,24,25,26,27,28,29,30,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10],
        'B':[8,8,8,8,8,8,8,8,8,8,5,5,5,5,5,5,5,5,5,5,3,3,3,3,3,3,3,3,3,3],
        'C':[10,15,23,17,18,26,24,30,35,42,44,42,38,36,34,30,27,25,27,24,1,0,2,3,5,26,30,40,42,50]}
data_df = pd.DataFrame(data)
fig = px.line(data_df,
              x='A',
              y='C',
              color='B',
              facet_row='B')
for fig_data in fig.data:
    fig_data['x'] = list(range(len(fig_data['y'])))
fig.show()
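A groupby route, which the asker hinted at, also works. Here is a minimal sketch (my addition, not part of the original answers) using plotly.graph_objects with make_subplots; each subplot gets its own axes by default, so every panel autoscales, and the number of rows follows the number of unique 'B' values dynamically:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# one (value, sub-frame) pair per unique value of 'B'
groups = list(data_df.groupby('B'))
fig = make_subplots(rows=len(groups), cols=1,
                    subplot_titles=[f"B={name}" for name, _ in groups])
for i, (name, grp) in enumerate(groups, start=1):
    fig.add_trace(go.Scatter(x=grp['A'], y=grp['C'], mode='lines', name=str(name)),
                  row=i, col=1)
fig.show()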
I have a dataframe of XY coordinates which I'm plotting as markers in a scatter plot. I'd like to add_trace lines between specific XY pairs, not between every pair. For example, I'd like a line between index 0 and index 3, and another between index 1 and index 2. This means that just using a line plot won't work, as I don't want to show all the connections. Is it possible to do this with a version of iloc, or do I need to make my DataFrame 'wide-format' with each XY pair as separate column pairs?
I've read through this but I'm not sure it helps in my case.
Adding specific lines to a Plotly Scatter3d() plot
import pandas as pd
import plotly.graph_objects as go
# sample data
d = {'MeanE': {0: 22.448461538460553, 1: 34.78435897435799, 2: 25.94307692307667, 3: 51.688974358974164},
     'MeanN': {0: 110.71128205129256, 1: 107.71666666666428, 2: 140.6384615384711, 3: 134.58615384616363}}
# pandas dataframe
df=pd.DataFrame(d)
# set up plotly figure
fig = go.Figure()
fig.add_trace(go.Scatter(x=df['MeanE'],y=df['MeanN'],mode='markers'))
fig.show()
UPDATE:
Adding the accepted answer below to what I had already, I now get the following finished plot.
The approach taken here is to update the data frame, marking the rows that form the pairs of coordinates you defined,
and then to add the line traces to the figure with a list comprehension.
import pandas as pd
import plotly.graph_objects as go
# sample data
d = {'MeanE': {0: 22.448461538460553, 1: 34.78435897435799, 2: 25.94307692307667, 3: 51.688974358974164},
     'MeanN': {0: 110.71128205129256, 1: 107.71666666666428, 2: 140.6384615384711, 3: 134.58615384616363}}
# pandas dataframe
df=pd.DataFrame(d)
# set up plotly figure
fig = go.Figure()
fig.add_trace(go.Scatter(x=df['MeanE'],y=df['MeanN'],mode='markers'))
# mark the pairs of rows that will be joined by lines
df.loc[[0, 3], "group"] = 1
df.loc[[1, 2], "group"] = 2
# add the lines to the figure
fig.add_traces(
    [
        go.Scatter(
            x=df.loc[df["group"].eq(g), "MeanE"],
            y=df.loc[df["group"].eq(g), "MeanN"],
            mode="lines",
        )
        for g in df["group"].unique()
    ]
)
fig.show()
An alternative solution for the extended requirement raised in the comments:
# pairs of indices that will be joined by lines
lines = [[0, 3], [1, 2], [0, 2], [1, 3]]
# add the lines to the figure
fig.add_traces(
    [
        go.Scatter(
            x=df.loc[pair, "MeanE"],
            y=df.loc[pair, "MeanN"],
            mode="lines",
        )
        for pair in lines
    ]
)
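Optionally (an assumption about the desired look, not part of the original answer), the helper line traces can be hidden from the legend with a trace selector:
# hide the connecting lines from the legend, keeping only the marker trace
fig.update_traces(selector=dict(mode="lines"), showlegend=False)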
I have multiple categorical columns (nearly 50). I'm using custom frequency encoding on the training data and saving the result as a nested dictionary. For the test data I'm using the map function to encode, with unseen labels replaced by 0. But I need a more efficient way.
I have already tried pandas' replace method, but it doesn't take care of unseen labels and leaves them as they are. Further, I am concerned about time: I want, say, 80 columns and 1 row to be encoded within 60 ms. I just need the most efficient way to do it. I have taken my example from here.
import pandas
from sklearn import preprocessing
df = pandas.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'],
                       'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
                       'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                                    'New_York']})
My dict looks something like this :
enc = {'pets': {'cat': 0, 'dog': 1, 'monkey': 2},
       'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
       'location': {'New_York': 0, 'San_Diego': 1}}
for col in enc:
    if col in input_df.columns:
        input_df[col] = input_df[col].map(enc[col]).fillna(0)
Further, I want multiple columns to be encoded at once; I don't want a loop over every column. I guess we can't do that with map. Hence replace would be a good choice, but, as said, it doesn't take care of unseen labels.
EDIT:
This is the code I am using for now. Please note there is only 1 row in the test data frame (I'm not sure whether I should handle it as a numpy array to reduce time), but I need to get this under 60 ms. Also, I only have a dictionary for mapping (I can't use one-hot encoding because of the use case). Current time: 331.74 ms. Any idea how to do it more efficiently? I'm not sure multiprocessing would help. As said, the replace method has given me several issues: 1. it does not handle unseen labels and leaves them as they are (an issue for strings); 2. it has problems with overlapping keys and values.
from string import ascii_lowercase
import itertools
import pandas as pd
import numpy as np
import time
def iter_all_strings():
    for size in itertools.count(1):
        for s in itertools.product(ascii_lowercase, repeat=size):
            yield "".join(s)
l = []
for s in iter_all_strings():
    l.append(s)
    if s == 'gr':
        break
columns = l
df = pd.DataFrame(columns=columns)
for col in df.columns:
    df[col] = np.random.randint(1, 4000, 3000)
transform_dict = {}
for col in df.columns:
    cats = pd.Categorical(df[col]).categories
    d = {}
    for i, cat in enumerate(cats):
        d[cat] = i
    transform_dict[col] = d
print(f"The length of the dictionary is {len(transform_dict)}")
# Creating another test data frame
df2 = pd.DataFrame(columns=columns)
for col in df2.columns:
    df2[col] = np.random.randint(1, 4000, 1)
print(f"The shape of teh 2nd data frame is {df2.shape}")
t1 = time.time()
for col in df2.columns:
    df2[col] = df2[col].map(transform_dict[col]).fillna(0)
print(f"Time taken is {time.time() - t1}")
# print(df)
Firstly, when you want to encode categorical variables that are not ordinal (meaning there is no inherent ordering between the values of the variable/column, e.g. cat, dog), you should use one-hot encoding.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'],
                   'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
                   'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
                                'New_York']})
enc = [['cat', 'dog', 'monkey'],
       ['Brick', 'Champ', 'Ron', 'Veronica'],
       ['New_York', 'San_Diego']]
ohe = OneHotEncoder(categories=enc, handle_unknown='ignore', sparse=False)
Here, I have modified your enc in a way that can be fed into the OneHotEncoder.
Now comes the question: how are we going to handle the unseen labels?
With handle_unknown='ignore', unseen values get zeros in all the dummy variables, which in a way helps the model understand that it is an unknown value.
# fit first so that ohe.categories_ is available when building the column names
ohe.fit(df)
colnames = ['{}_{}'.format(col, val)
            for col, unique_values in zip(df.columns, ohe.categories_)
            for val in unique_values]
pd.DataFrame(ohe.transform(df), columns=colnames)
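To illustrate the unseen-label behaviour, here is a small sketch (the test row and the value 'hamster' are invented for this example): a category that never appeared in enc is encoded as all-zero dummies instead of raising an error.
# hypothetical test row containing a pet that is not in the categories
test = pd.DataFrame({'pets': ['hamster'], 'owner': ['Ron'], 'location': ['New_York']})
encoded = pd.DataFrame(ohe.transform(test), columns=colnames)
print(encoded.filter(like='pets_'))  # every pets_* dummy is 0.0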
Update:
If you are fine with ordinal encoding, the following change could help.
df2.apply(lambda row: [transform_dict[col].get(val, 0)
                       for col, val in row.items()],
          axis=1,
          result_type='expand')
#1000 loops, best of 3: 1.17 ms per loop
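Since the test frame holds a single row, most of the remaining time is per-column pandas overhead. A sketch of a plain-dict alternative (my suggestion, not from the original answer) that replaces the per-column Series.map calls with one dict comprehension:
t1 = time.time()
row = df2.iloc[0].to_dict()  # pull the single row out of pandas once
encoded = pd.DataFrame([{c: transform_dict[c].get(v, 0) for c, v in row.items()}])
print(f"Time taken is {time.time() - t1}")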
I have a very large dataset with polygons, and points with buffers around them. I would like to create a new column in the points data containing the number of polygons that each point's buffer intersects.
Here's a simplified example:
import pandas as pd
import geopandas as gp
from shapely.geometry import Polygon
from shapely.geometry import Point
import matplotlib.pyplot as plt
## Create polygons and points ##
df = gp.GeoDataFrame([['a', Polygon([(1, 0), (1, 1), (2, 2), (1, 2)])],
                      ['b', Polygon([(1, 0.25), (2, 1.25), (3, 0.25)])]],
                     columns=['name', 'geometry'])
df = gp.GeoDataFrame(df, geometry='geometry')
points = gp.GeoDataFrame([['box', Point(1.5, 1.115), 4],
                          ['triangle', Point(2.5, 1.25), 8]],
                         columns=['name', 'geometry', 'value'],
                         geometry='geometry')
##Set a buffer around the points##
buf = points.buffer(0.5)
points['buffer'] = buf
points = points.drop(['geometry'], axis = 1)
points = points.rename(columns = {'buffer': 'geometry'})
The data looks like this:
What I'd like to do is create another column in the points dataframe that includes the number of polygons that point intersects.
I've tried utilising a for loop as such:
points['intersect'] = []
for geo1 in points['geometry']:
    for geo2 in df['geometry']:
        if geo1.intersects(geo2):
            points['intersect'].append('1')
I would then sum these to get the total number of intersections.
However, I get the error 'Length of values does not match length of index'. I know this is because it is attempting to assign three rows of data to a frame with only two rows.
How can I aggregate the counts so that the first point is assigned a value of 2 and the second a value of 1?
If you have a large dataset, I would go for a solution using an rtree spatial index, something like this:
import pandas as pd
import geopandas as gp
from shapely.geometry import Polygon
from shapely.geometry import Point
import matplotlib.pyplot as plt
## Create polygons and points ##
df = gp.GeoDataFrame([['a', Polygon([(1, 0), (1, 1), (2, 2), (1, 2)])],
                      ['b', Polygon([(1, 0.25), (2, 1.25), (3, 0.25)])]],
                     columns=['name', 'geometry'])
df = gp.GeoDataFrame(df, geometry='geometry')
points = gp.GeoDataFrame([['box', Point(1.5, 1.115), 4],
                          ['triangle', Point(2.5, 1.25), 8]],
                         columns=['name', 'geometry', 'value'],
                         geometry='geometry')
# generate spatial index
sindex = df.sindex
# define empty list for results
results_list = []
# iterate over the points
for index, row in points.iterrows():
    buffer = row['geometry'].buffer(0.5)  # buffer the point
    # find approximate matches with the r-tree, then precise matches from those
    possible_matches_index = list(sindex.intersection(buffer.bounds))
    possible_matches = df.iloc[possible_matches_index]
    precise_matches = possible_matches[possible_matches.intersects(buffer)]
    results_list.append(len(precise_matches))
# add list of results as a new column
points['polygons'] = pd.Series(results_list)
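For completeness, a sketch of the same count done with a spatial join instead of an explicit loop (my suggestion, not part of the original answer; older geopandas versions spell the keyword op='intersects' rather than predicate='intersects'):
# buffer the points, join each buffer to the polygons it intersects, count matches
buffered = points.copy()
buffered['geometry'] = buffered.geometry.buffer(0.5)
joined = gp.sjoin(buffered, df, how='inner', predicate='intersects')
counts = joined.groupby(joined.index).size()
points['polygons'] = counts.reindex(points.index, fill_value=0)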
Given the following data:
DC,Mode,Mod,Ven,TY1,TY2,TY3,TY4,TY5,TY6,TY7,TY8
Intra,S,Dir,C1,False,False,False,False,False,True,True,False
Intra,S,Co,C1,False,False,False,False,False,False,False,False
Intra,M,Dir,C1,False,False,False,False,False,False,True,False
Inter,S,Co,C1,False,False,False,False,False,False,False,False
Intra,S,Dir,C2,False,True,True,True,True,True,True,False
Intra,S,Co,C2,False,False,False,False,False,False,False,False
Intra,M,Dir,C2,False,False,False,False,False,False,False,False
Inter,S,Co,C2,False,False,False,False,False,False,False,False
Intra,S,Dir,C3,False,False,False,False,True,True,False,False
Intra,S,Co,C3,False,False,False,False,False,False,False,False
Intra,M,Dir,C3,False,False,False,False,False,False,False,False
Inter,S,Co,C3,False,False,False,False,False,False,False,False
Intra,S,Dir,C4,False,False,False,False,False,True,False,True
Intra,S,Co,C4,True,True,True,True,False,True,False,True
Intra,M,Dir,C4,False,False,False,False,False,True,False,True
Inter,S,Co,C4,True,True,True,False,False,True,False,True
Intra,S,Dir,C5,True,True,False,False,False,False,False,False
Intra,S,Co,C5,False,False,False,False,False,False,False,False
Intra,M,Dir,C5,True,True,False,False,False,False,False,False
Inter,S,Co,C5,False,False,False,False,False,False,False,False
Imports:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
To reproduce my DataFrame, copy the data then use:
df = pd.read_clipboard(sep=',')
I'd like to create a plot conveying the same information as my example, but not necessarily with the same shape (I'm open to suggestions). I'd also like to hover over a color and have the appropriate Ven displayed (e.g. C1, not 1):
Edit 2018-10-17:
The two solutions provided so far are helpful, and each accomplishes a different aspect of what I'm looking for. However, the key issue I'd like to resolve, which wasn't explicitly stated prior to this edit, is the following:
I would like to perform the plotting without converting Ven to an int; this numeric transformation isn't practical with the real data. So the actual scope of the question is to plot all categorical data with two categorical axes.
The issue I'm experiencing is the data is categorical and the y-axis is multi-indexed.
I've done the following to transform the DataFrame:
# replace False with nan
df = df.replace(False, np.nan)
# replace True with a number representing Ven (e.g. C1 = 1)
def rep_ven(row):
    return row.iloc[4:].replace(True, int(row.Ven[1]))
df.iloc[:, 4:] = df.apply(rep_ven, axis=1)
# drop the Ven column
df = df.drop(columns=['Ven'])
# set multi-index
df_m = df.set_index(['DC', 'Mode', 'Mod'])
Plotting the transformed DataFrame produces:
plt.figure(figsize=(20,10))
heatmap = plt.imshow(df_m)
plt.xticks(range(len(df_m.columns.values)), df_m.columns.values)
plt.yticks(range(len(df_m.index)), df_m.index)
plt.show()
This plot isn't very streamlined: there are four axis values for each Ven, and this is only a subset of the data, so the graph would be very long with all of it.
Here's my solution. Instead of plotting, I just apply a style to the DataFrame; see https://pandas.pydata.org/pandas-docs/stable/style.html
# Transform Ven values from "C1", "C2" to 1, 2, ..
df['Ven'] = df['Ven'].str[1]
# Given a specific combination of dc, mode, mod, ven,
# do we have any True cells?
g = df.groupby(['DC', 'Mode', 'Mod', 'Ven']).any()
# Let's drop any rows with only False values
g = g[g.any(axis=1)]
# Convert True, False to 1, 0
g = g.astype(int)
# Get the values of the ven index as an int array
# Note: we don't want to drop the ven index!!
# Otherwise styling won't work
ven = g.index.get_level_values('Ven').values.astype(int)
# Multiply 1 and 0 with Ven value
g = g.mul(ven, axis=0)
# Sort the index
g.sort_index(ascending=False, inplace=True)
# Now display the dataframe with styling
# first we get a color map
import matplotlib
cmap = matplotlib.cm.get_cmap('tab10')
def apply_color_map(val):
    # hide the 0 values
    if val == 0:
        return 'color: white; background-color: white'
    else:
        # for non-zero: get color from cmap, convert to hex code for css
        s = "color: white; background-color: " + matplotlib.colors.rgb2hex(cmap(val))
        return s
g
g.style.applymap(apply_color_map)
The available matplotlib colormaps can be seen here: Colormap reference, with some additional explanation here: Choosing a colormap
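If the styled frame needs to live outside a notebook, the Styler can also be written out as HTML. A usage sketch (the method name depends on the pandas version: Styler.render() in older releases, Styler.to_html() in newer ones):
# write the styled table to a standalone HTML file
html = g.style.applymap(apply_color_map).to_html()
with open('styled.html', 'w') as f:
    f.write(html)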
Explanation: Remove rows where TY1-TY8 are all NaN to create your plot. Refer to this answer as a starting point for creating interactive annotations to display Ven.
The below code should work:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_clipboard(sep=',')
# replace False with nan
df = df.replace(False, np.nan)
# replace True with a number representing Ven (e.g. C1 = 1)
def rep_ven(row):
    return row.iloc[4:].replace(True, int(row.Ven[1]))
df.iloc[:, 4:] = df.apply(rep_ven, axis=1)
# drop the Ven column
df = df.drop(columns=['Ven'])
idx = df[['TY1','TY2', 'TY3', 'TY4','TY5','TY6','TY7','TY8']].dropna(thresh=1).index.values
df = df.loc[idx,:].sort_values(by=['DC', 'Mode','Mod'], ascending=False)
# set multi-index
df_m = df.set_index(['DC', 'Mode', 'Mod'])
plt.figure(figsize=(20,10))
heatmap = plt.imshow(df_m)
plt.xticks(range(len(df_m.columns.values)), df_m.columns.values)
plt.yticks(range(len(df_m.index)), df_m.index)
plt.show()
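To also get the requested hover behaviour (showing 'C1' rather than 1), one option I'd suggest, reusing df_m from above, is a plotly heatmap instead of plt.imshow (a sketch under that assumption; the original answers stay in matplotlib/pandas):
import plotly.graph_objects as go
z = df_m.values.astype(float)
# hover text: 'C<n>' for filled cells, empty for NaN
text = [['' if np.isnan(v) else f'C{int(v)}' for v in row] for row in z]
ylabels = [' / '.join(map(str, ix)) for ix in df_m.index]
fig = go.Figure(go.Heatmap(z=z, x=list(df_m.columns),
                           y=list(range(len(df_m))),
                           text=text, hoverinfo='text'))
# index tuples can repeat once Ven is dropped, so label ticks by row position
fig.update_yaxes(tickmode='array', tickvals=list(range(len(df_m))), ticktext=ylabels)
fig.show()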