I am new to Python and seaborn. I am using a dataset from Kaggle and now want to visualize the data. This is my code:
def add_if_zero(x):
    if (x['stays_in_weekend_nights'] + x['stays_in_week_nights']) == 0:
        return 1
    else:
        return (x['stays_in_weekend_nights'] + x['stays_in_week_nights'])

_data['total_days_of_stay'] = _data.apply(lambda x: add_if_zero(x), axis=1)
x_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
sns.relplot(x='arrival_date_month',
            y='total_days_of_stay', hue='hotel',
            kind='line', data=_data, height=10, aspect=.9, row_order=x_order)
I get the plot, but I want the x-axis ordered from January to December; instead the months appear in an arbitrary order.
I tried passing the order to the row_order parameter, both as an array and as strings, but nothing seems to work.
As per the documentation:
row_order, col_order : lists of strings, optional. Order to organize the
rows and/or columns of the grid in, otherwise the orders are inferred
from the data objects.
UPDATE 1:
You may want to use lineplot(), which returns a matplotlib Axes object that is easy to configure. The basic idea is to plot against the numerical month order first and then set the month names as tick labels. Consider the following example, which uses a sample dataset from seaborn.
import calendar
import seaborn as sns
import matplotlib.pyplot as plt
# sample dataset
df = sns.load_dataset('flights')
# map month abbreviation to numerical order
# (recent seaborn versions store the flights months as 'Jan', 'Feb', ...;
# slicing to three letters also handles full month names)
d = {name: num for num, name in enumerate(calendar.month_abbr[1:], start=1)}
df['month_num'] = df.month.astype(str).str[:3].map(d)
# plot
fig, ax = plt.subplots(figsize=(10, 5))
sns.lineplot(x='month_num', y='passengers', data=df, hue='year', ax=ax)
# set xticks position and labels
ax.set_xticks(range(1, len(d)+1))
ax.set_xticklabels(list(d.keys()), rotation=30)
Use sns.catplot, pass the desired order as a list of strings via the order parameter, and set kind='point', as follows:
sns.catplot(... ,
            kind='point',
            order=['Jan', 'Feb' , ...., 'Dec'])
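Note that row_order only controls the order of the facet rows in the grid, not the order of the x-axis values. Another option, sketched here as an assumption rather than as something the answers above propose, is to store the month column as an ordered pandas categorical so that sorting and grouping follow calendar order. The bookings frame and its values are made up for illustration; only the column names come from the question. Whether a given seaborn version honors the category order on the x-axis is worth checking for your installation.

```python
import calendar

import pandas as pd

# Hypothetical stand-in for the hotel bookings data (values are made up)
bookings = pd.DataFrame({
    "arrival_date_month": ["March", "January", "December", "January", "July"],
    "total_days_of_stay": [3, 2, 5, 4, 1],
})

# 'January' ... 'December' (assumes the default English/C locale)
month_order = list(calendar.month_name[1:])

# Store the months as an ordered categorical with calendar order
bookings["arrival_date_month"] = pd.Categorical(
    bookings["arrival_date_month"], categories=month_order, ordered=True
)

# Grouping now follows calendar order instead of alphabetical order
monthly_mean = (
    bookings.groupby("arrival_date_month", observed=True)["total_days_of_stay"]
    .mean()
)
print(monthly_mean)
```

The same categorical column can then be passed as x to the plotting call in place of the raw strings.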
In this data set I need to plot pH, which holds continuous data, on the x-axis, group it by the quality value, and plot the histogram. Many of the resources I referred to only show solutions that use randomly generated data. I tried this piece of code:
plt.hist(, density=True, bins=1)
plt.ylabel('quality')
plt.xlabel('pH');
Here I removed the randomly generated data, but I received an error:
  File "<ipython-input-16-9afc718b5558>", line 1
    plt.hist(, density=True, bins=1)
             ^
SyntaxError: invalid syntax
What is the proper way to plot my data? I want to feed the histogram not randomly generated data, but data found in the data set.
Your Error
The immediate problem in your code is that no data is passed to the plt.hist() command.
plt.hist(, density=True, bins=1)
should be something like:
plt.hist(data_table['pH'], density=True, bins=1)
Seaborn histplot
But this doesn't get the plot broken down by quality. The answer by Mr.T looks correct, but I'd also suggest seaborn which works with "melted" data like you have. The histplot command should give you what you want:
import seaborn as sns
sns.histplot(data=df, x="pH", hue="quality", palette="Dark2", element='step')
Assuming the table you posted is in a pandas.DataFrame named df with columns "pH" and "quality", you get something like:
The palette (Dark2) can be any matplotlib colormap.
Subplots
If the overlaid histograms are too hard to see, an option is to do facets or small multiples. To do this with pandas and matplotlib:
# group dataframe by quality values
data_by_qual = df.groupby('quality')
# create a sub plot for each quality group
fig, axes = plt.subplots(nrows=len(data_by_qual),
                         figsize=[6,12],
                         sharex=True)
fig.subplots_adjust(hspace=.5)
# loop over axes and quality groups together
for ax, (quality, qual_data) in zip(axes, data_by_qual):
    ax.hist(qual_data['pH'], bins=10)
    ax.set_title(f"quality = {quality}")
    ax.set_xlabel('pH')
Altair Facets
The plotting library altair can do this for you:
import altair as alt
alt.Chart(df).mark_bar().encode(
    alt.X("pH:Q", bin=True),
    y='count()',
).facet(row='quality')
There are several ways to represent multiple histograms. All of them require the data to be transformed from long to wide format, meaning that each category is in its own column:
import matplotlib.pyplot as plt
import pandas as pd
#test data generation
import numpy as np
np.random.seed(123)
n=300
df = pd.DataFrame({"A": np.random.randint(1, 100, n), "pH": 3*np.random.rand(n), "quality": np.random.choice([3, 4, 5, 6], n)})
df.pH += df.quality
#instead of this block you have to read here your stored data, e.g.,
#df = pd.read_csv("my_data_file.csv")
#check that it read the correct data
#print(df.dtypes)
#print(df.head(10))
#bringing the columns in the required wide format
plot_df = df.pivot(columns="quality")["pH"]
bin_nr=5
#creating three subplots for different ways to present the same histograms
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(6, 12))
ax1.hist(plot_df, bins=bin_nr, density=True, histtype="bar", label=plot_df.columns)
ax1.legend()
ax1.set_title("Basically bar graphs")
plot_df.plot.hist(stacked=True, bins=bin_nr, density=True, ax=ax2)
ax2.set_title("Stacked histograms")
plot_df.plot.hist(alpha=0.5, bins=bin_nr, density=True, ax=ax3)
ax3.set_title("Overlay histograms")
plt.show()
Sample output:
It is not clear, though, what you intended to do with just one bin and why your y-axis was labeled "quality" when this axis represents the frequency in a histogram.
I'm trying to plot the scatter plot in which each point is colored w.r.t the variable Points. Moreover, I want to add the regression line.
import pandas as pd
import urllib3
import seaborn as sns
decathlon = pd.read_csv("https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/decathlon.txt", sep='\t')
g = sns.lmplot(
    data = decathlon,
    x="100m", y="Long.jump",
    hue = 'Points', palette = 'viridis'
)
It seems to me that there are 2 regression lines, one for each group of the data. This is not what I want. I would like to have a regression line for the entire data. Moreover, how can I hide the legend on the right hand side?
Could you please elaborate on how to do so?
You should not use lmplot unless you need to use a FacetGrid to split your dataset in several subplots.
Since the example that you show does not use any of the functionalities provided by FacetGrid, you should instead create your plot using a combination of scatterplot() and regplot():
import seaborn as sns
tips = sns.load_dataset('tips')
ax = sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")
sns.regplot(data=tips, x="total_bill", y="tip", scatter=False, ax=ax)
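To also hide the legend, as asked in the question, one option is to pass legend=False to scatterplot(). The following is a minimal sketch of the same scatterplot() plus regplot() combination with a numeric hue; the athletes frame is synthetic data invented for illustration (only the column names mirror the question), and the Agg backend line is there just so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the decathlon table (values are made up)
rng = np.random.default_rng(0)
n = 40
sprint = rng.normal(11.0, 0.3, n)
athletes = pd.DataFrame({
    "100m": sprint,
    "Long.jump": 7.3 - 0.8 * (sprint - 11.0) + rng.normal(0, 0.2, n),
    "Points": rng.integers(7000, 9000, n),
})

# legend=False suppresses the hue legend on the scatter layer
ax = sns.scatterplot(data=athletes, x="100m", y="Long.jump",
                     hue="Points", palette="viridis", legend=False)
# scatter=False draws only the fit, computed over all rows (no split by hue)
sns.regplot(data=athletes, x="100m", y="Long.jump", scatter=False, ax=ax)
```

Because regplot() has no hue parameter, it always fits a single line through the whole dataset.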
I have compiled a dataframe that contains USGS streamflow data at several different streamgages. Now I want to create a Gantt chart similar to this. Currently, my data has site names as columns and a date index as rows.
Here is a sample of my data.
The problem with the Gantt chart example I linked is that my data has gaps between the start and end dates that would normally define the horizontal time-lines. Many of the examples I found only account for the start and end date, but not missing values that may be in between. How do I account for the gaps where there is no data (blanks or nan in those slots for values) for some of the sites?
First, I have a plot that shows where the missing data is.
import missingno as msno
msno.bar(dfp)
Now, I want time on the x-axis and a horizontal line on the y-axis that tracks when the sites contain data at those times. I know how to do this the brute force way, which would mean manually picking out the start and end dates where there is valid data (which I made up below).
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as dt
df=[('RIO GRANDE AT EMBUDO, NM','2015-7-22','2015-12-7'),
    ('RIO GRANDE AT EMBUDO, NM','2016-1-22','2016-8-5'),
    ('RIO GRANDE DEL RANCHO NEAR TALPA, NM','2014-12-10','2015-12-14'),
    ('RIO GRANDE DEL RANCHO NEAR TALPA, NM','2017-1-10','2017-11-25'),
    ('RIO GRANDE AT OTOWI BRIDGE, NM','2015-8-17','2017-8-21'),
    ('RIO GRANDE BLW TAOS JUNCTION BRIDGE NEAR TAOS, NM','2015-9-1','2016-6-1'),
    ('RIO GRANDE NEAR CERRO, NM','2016-1-2','2016-3-15'),
    ]
df=pd.DataFrame(data=df)
df.columns = ['A', 'Beg', 'End']
df['Beg'] = pd.to_datetime(df['Beg'])
df['End'] = pd.to_datetime(df['End'])
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
ax = ax.xaxis_date()
ax = plt.hlines(df['A'], dt.date2num(df['Beg']), dt.date2num(df['End']))
How do I make a figure (like the one shown above) with the dataframe I provided as an example? Ideally I want to avoid the brute force method.
Please note: values of zero are considered valid data points.
Thank you in advance for your feedback!
Find date ranges of non-null data
2020-02-12 Edit to clarify logic in loop
df = pd.read_excel('Downloads/output.xlsx', index_col='date')
Make sure the dates are in order:
df.sort_index(inplace=True)
Loop through the columns and find the edges of the good data ranges. Get the corresponding index values and the name of the gauge, and collect them all in a list:
# Looping feels like defeat. However, I'm not clever enough to avoid it
good_ranges = []
for i in df:
    col = df[i]
    gauge_name = col.name
    # Start of a good data block: a number preceded by a NaN
    start_mark = (col.notnull() & col.shift().isnull())
    start = col[start_mark].index
    # End of a good data block: a number followed by a NaN
    end_mark = (col.notnull() & col.shift(-1).isnull())
    end = col[end_mark].index
    for s, e in zip(start, end):
        good_ranges.append((gauge_name, s, e))
good_ranges = pd.DataFrame(good_ranges, columns=['gauge', 'start', 'end'])
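The edge-detection idea can be checked on a small toy frame. This is a sketch under assumptions: the flow frame, the gauge names, and the helper valid_runs are hypothetical and not part of the answer above. It marks a start wherever a valid value has no valid predecessor and an end wherever it has no valid successor; zeros count as valid data, as the question requires.

```python
import numpy as np
import pandas as pd

# Toy daily record for two hypothetical gauges; NaN marks missing data
idx = pd.date_range("2020-01-01", periods=8, freq="D")
flow = pd.DataFrame(
    {
        "gauge_a": [1.0, 2.0, np.nan, np.nan, 0.0, 3.0, np.nan, 4.0],
        "gauge_b": [np.nan, 5.0, 6.0, 7.0, np.nan, np.nan, np.nan, np.nan],
    },
    index=idx,
)


def valid_runs(col):
    """Return (name, start, end) for every unbroken run of non-null values."""
    valid = col.notna().to_numpy()
    # Shift the mask by one position; the record edges count as "no data"
    prev_valid = np.concatenate(([False], valid[:-1]))
    next_valid = np.concatenate((valid[1:], [False]))
    starts = col.index[valid & ~prev_valid]
    ends = col.index[valid & ~next_valid]
    return [(col.name, s, e) for s, e in zip(starts, ends)]


records = []
for name in flow:
    records.extend(valid_runs(flow[name]))
good_ranges = pd.DataFrame(records, columns=["gauge", "start", "end"])
print(good_ranges)
```

A single isolated value, such as the last entry of gauge_a, yields a run whose start and end coincide.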
Plotting
Nothing new here. Copied pretty much straight from your question:
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
ax.xaxis_date()  # returns None, so do not reassign ax
ax.hlines(good_ranges['gauge'],
          dt.date2num(good_ranges['start']),
          dt.date2num(good_ranges['end']))
fig.tight_layout()
Here's an approach that you could use. It's a bit hacky, so perhaps someone else will produce a better solution, but it should produce your desired output. First use Series.where to replace the non-NaN values of each column with an integer, which later determines the position of that site's line on the y-axis. I do this site by site so that all data that belongs together ends up at the same height. If you want to increase the spacing between the lines of the Gantt chart you can add a number to i; I've provided an example in the comments in the code block below.
The y-labels and their positions are produced in the data munging steps, so this method will work regardless of the number of columns and will position the labels correctly when you change the spacing described above.
This approach gives you matplotlib Axes and Figure objects, so you can adjust the aesthetics of the chart to suit your purposes (e.g. change the thickness of the lines, the colours, etc.).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_excel('output.xlsx')
dates = pd.to_datetime(df.date)
df.index = dates
df = df.drop('date', axis=1)
new_rows = [df[s].where(df[s].isna(), i) for i, s in enumerate(df, 1)]
# To increase spacing between lines add a number to i, eg. below:
# [df[s].where(df[s].isna(), i+3) for i, s in enumerate(df, 1)]
new_df = pd.DataFrame(new_rows)
### Plotting ###
fig, ax = plt.subplots() # Create axes object to pass to pandas df.plot()
ax = new_df.transpose().plot(figsize=(40,10), ax=ax, legend=False, fontsize=20)
list_of_sites = new_df.transpose().columns.to_list() # For y tick labels
# Row-wise max gives each site's line height even if its first date is NaN
y_tick_location = new_df.max(axis=1).values # For y tick positions
ax.set_yticks(y_tick_location) # Place ticks in correct positions
ax.set_yticklabels(list_of_sites) # Update labels to site names
I can create a simple bar chart in matplotlib from a 'simple' dictionary:
import matplotlib.pyplot as plt
D = {u'Label1':26, u'Label2': 17, u'Label3':30}
plt.bar(range(len(D)), D.values(), align='center')
plt.xticks(range(len(D)), D.keys())
plt.show()
But I do not know how to draw a line plot from a dictionary whose keys are numeric strings and whose values are text labels, such as this one:
T_OLD = {'10': 'need1', '11': 'need2', '12': 'need1', '13': 'need2', '14': 'need1'}
Like the picture below
You may use numpy to convert the dictionary to an array with two columns, which can be plotted.
import matplotlib.pyplot as plt
import numpy as np
T_OLD = {'10' : 'need1', '11':'need2', '12':'need1', '13':'need2','14':'need1'}
x = list(zip(*T_OLD.items()))
# sort array, since dictionary is unsorted
x = np.array(x)[:,np.argsort(x[0])].T
# let second column be 1 if "need2", else 0
x[:,1] = (x[:,1] == "need2").astype(int)
# convert the string array to numbers so the axes are numeric, not categorical
x = x.astype(float)
# plot the two columns of the array
plt.plot(x[:,0], x[:,1])
# set the labels accordingly
plt.gca().set_yticks([0,1])
plt.gca().set_yticklabels(['need1', 'need2'])
plt.show()
The following version is independent of the actual content of the dictionary; the only assumption is that the keys can be converted to floats.
import matplotlib.pyplot as plt
import numpy as np
T_OLD = {'10': 'run', '11': 'tea', '12': 'mathematics', '13': 'run', '14' :'chemistry'}
x = np.array(list(zip(*T_OLD.items())))
u, ind = np.unique(x[1,:], return_inverse=True)
x[1,:] = ind
x = x.astype(float)[:,np.argsort(x[0])].T
# plot the two columns of the array
plt.plot(x[:,0], x[:,1])
# set the labels accordingly
plt.gca().set_yticks(range(len(u)))
plt.gca().set_yticklabels(u)
plt.show()
Use numeric values for your y-axis ticks, and then map them to desired strings with plt.yticks():
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# example data
times = pd.date_range(start='2017-10-17 00:00', end='2017-10-17 5:00', freq='H')
data = np.random.choice([0,1], size=len(times))
data_labels = ['need1','need2']
fig, ax = plt.subplots()
ax.plot(times, data, marker='o', linestyle="None")
plt.yticks([0, 1], data_labels)  # one tick position per label
plt.xlabel("time")
Note: It's generally not a good idea to use a line graph to represent categorical changes in time (e.g. from need1 to need2). Doing that gives the visual impression of a continuum between time points, which may not be accurate. Here, I changed the plotting style to points instead of lines. If for some reason you need the lines, just remove linestyle="None" from the call to ax.plot().
UPDATE
(per comments)
To make this work with a y-axis category set of arbitrary length, use ax.set_yticks() and ax.set_yticklabels() to map to y-axis values.
For example, given an array labels of possible y-axis categories, let N be the size of a subset of labels (here we'll set it to 4, but it could be any size).
Then draw a random sample of y values and plot it against time, labeling the y-axis ticks based on the full set of labels. Note that we still call set_yticks() first with numerical positions, and then replace them with our category labels using set_yticklabels().
labels = np.array(['A','B','C','D','E','F','G'])
N = 4
# example data
times = pd.date_range(start='2017-10-17 00:00', end='2017-10-17 5:00', freq='H')
data = np.random.choice(np.arange(len(labels)), size=len(times))
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(times, data, marker='o', linestyle="None")
ax.set_yticks(np.arange(len(labels)))
ax.set_yticklabels(labels)
plt.xlabel("time")
This gives the exact desired plot:
import matplotlib.pyplot as plt
from collections import OrderedDict
T_OLD = {'10' : 'need1', '11':'need2', '12':'need1', '13':'need2','14':'need1'}
T_SRT = OrderedDict(sorted(T_OLD.items(), key=lambda t: t[0]))
plt.plot(map(int, T_SRT.keys()), map(lambda x: int(x[-1]), T_SRT.values()),'r')
plt.ylim([0.9,2.1])
ax = plt.gca()
ax.set_yticks([1,2])
ax.set_yticklabels(['need1', 'need2'])
plt.title('T_OLD')
plt.xlabel('time')
plt.ylabel('need')
plt.show()
For Python 3.X the plotting line needs to explicitly convert the map() output to lists:
plt.plot(list(map(int, T_SRT.keys())), list(map(lambda x: int(x[-1]), T_SRT.values())),'r')
as in Python 3.X map() returns an iterator as opposed to a list in Python 2.7.
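A quick way to see the difference, shown here only as an illustrative sketch with made-up keys: in Python 3 the object returned by map() is consumed as it is iterated, so it has to be materialised with list() if a sequence is needed.

```python
keys = ["10", "11", "12"]

lazy = map(int, keys)      # Python 3: a one-shot iterator, not a list
first_pass = list(lazy)    # consumes the iterator
second_pass = list(lazy)   # nothing left to yield

print(first_pass)   # [10, 11, 12]
print(second_pass)  # []
```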
The plot uses the dictionary keys converted to ints and the last characters of need1 or need2, also converted to ints. This relies on the particular structure of your data; if the values were need1 and need3 it would need a couple more operations.
After plotting and changing the axes limits, the program simply modifies the tick labels at y positions 1 and 2. It then also adds the title and the x and y axis labels.
The important part is that the dictionary/input data has to be sorted. One way to do it is to use OrderedDict. Here T_SRT is an OrderedDict object sorted by the keys in T_OLD.
The output is:
This is a more general case for more values/labels in T_OLD. It assumes that the label is always 'needX', where X is any number. It can readily be extended to any string preceding the number, though that would require more processing:
import matplotlib.pyplot as plt
from collections import OrderedDict
import re
T_OLD = {'10' : 'need1', '11':'need8', '12':'need11', '13':'need1','14':'need3'}
T_SRT = OrderedDict(sorted(T_OLD.items(), key=lambda t: t[0]))
x_val = list(map(int, T_SRT.keys()))
y_val = list(map(lambda x: int(re.findall(r'\d+', x)[-1]), T_SRT.values()))
plt.plot(x_val, y_val,'r')
plt.ylim([0.9*min(y_val),1.1*max(y_val)])
ax = plt.gca()
y_axis = list(set(y_val))
ax.set_yticks(y_axis)
ax.set_yticklabels(['need' + str(i) for i in y_axis])
plt.title('T_OLD')
plt.xlabel('time')
plt.ylabel('need')
plt.show()
This solution finds the number at the end of the label using re.findall to accommodate for the possibility of multi-digit numbers. Previous solution just took the last component of the string because numbers were single digit. It still assumes that the number for plotting position is the last number in the string, hence the [-1]. Again for Python 3.X map output is explicitly converted to list, step not necessary in Python 2.7.
The labels are now generated by first selecting unique y-values using set and then renaming their labels through concatenation of the strings 'need' with its corresponding integer.
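The extraction and labelling steps described above can be checked in isolation. This is a small sketch; the helper name trailing_number and the sample labels are assumptions made for illustration. It also sorts the unique values, since a set has no guaranteed order.

```python
import re


def trailing_number(label):
    """Return the last run of digits in `label` as an int."""
    return int(re.findall(r"\d+", label)[-1])


sample_labels = ["need1", "need8", "need11", "need1", "need3"]
positions = [trailing_number(s) for s in sample_labels]

# Unique tick positions in ascending order, with matching text labels
ticks = sorted(set(positions))
tick_labels = ["need" + str(i) for i in ticks]

print(positions)    # [1, 8, 11, 1, 3]
print(tick_labels)  # ['need1', 'need3', 'need8', 'need11']
```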
The limits of y-axis are set as 0.9 of the minimum value and 1.1 of the maximum value. Rest of the formatting is as before.
The result for this test case is:
I am plotting huge data sets (arrays of length over 5E5) whose x-values are UTC timestamps. I want to display them in HH:MM:SS format instead of, e.g., 1.47332886e+09 seconds, so I wrote a small function to convert the timestamps. Since the data sets I am plotting are huge, I cannot convert all the timestamps to datetime objects; it would take too long. So I figured I could read the x-tick values and convert only those to the desired format. The problem is that once I do this, the x-tick labels are fixed, and they stay the same when I zoom. So basically I need to run my function every time I zoom, and I would rather automate this. I tried to use event handling for that, but couldn't find a way to call my function in the event handler.
How can I call my function in event handler correctly? Or is there a better way to achieve my goal?
(I am using Python 3.3)
Here is my code:
def timeStamp2dateTime(timeArray):
    import datetime
    import numpy as np
    # first loop definition:
    timeArrayExport = []
    timeArrayExport = datetime.datetime.fromtimestamp(timeArray[0])
    for i in range(1,len(timeArray)):
        # Convert timestamps to datetime
        timeArrayExport = np.append( timeArrayExport, datetime.datetime.fromtimestamp(timeArray[i]) )
    return(timeArrayExport)

def set_xticklabels_timestamp2time(ax):
    '''
    this function reads xtick values (assuming that they are timestamps) and
    converts the xtick values to datetime format. from datetime format is
    xticklabel list generated in format HH:MM:SS and also added to the plot
    which has the handle "ax" (function input).
    '''
    import matplotlib.pyplot as plt
    # manipulating the x-ticks -----
    plt.pause(0.1) # update the plot
    xticks = ax.get_xticks()
    xticks_dt = timeStamp2dateTime(xticks)
    xlabels = []
    for item in xticks_dt:
        xlabels.append(str(item.hour).zfill(2) +':'+ str(item.minute).zfill(2) +':'+ str(item.second).zfill(2))
    ax.set_xticklabels(xlabels)
    plt.gcf().autofmt_xdate() # rotates the x-axis values so that it is more clear to read
    plt.pause(0.001) # update the plot
    return(ax)

def onrelease(event):
    ax = set_xticklabels_timestamp2time(ax)
import numpy as np
import matplotlib.pyplot as plt
# example data
x = np.arange(1.47332886e+09,1.47333886e+09) # UTC timestamps
y = np.sin(np.arange(len(x))/1000) + np.cos(np.arange(len(x))/100)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x,y,'.-')
ax.grid(True)
ax = set_xticklabels_timestamp2time(ax)
# try to automate the xtick label conversion
try:
    cid = fig.canvas.mpl_connect('button_release_event', set_xticklabels_timestamp2time(ax))
except:
    cid = fig.canvas.mpl_connect('button_release_event', onrelease)
# => both ways fail
Thank you for reading this!
What you are looking for is a tick label formatter. See the documentation or this example. It lets matplotlib figure out where the ticks should be and, every time they change, applies the correct label. The linked example uses matplotlib.ticker.FuncFormatter, which takes a user-defined function:
def millions(x, pos):
    'The two args are the value (x) and tick position (pos)'
    return '$%1.1fM' % (x*1e-6)
This function simply scales the input value x to millions and formats it into a string such as $1.5M, which is returned as the new tick label. This should work in your case too, if you write a function that takes your timestamp in seconds, converts it to your preferred format, and returns a string. Once the function is defined, you set it with the following commands (the linked example formats the y-axis; for your x-axis timestamps use ax.xaxis instead):
from matplotlib.ticker import FuncFormatter
formatter = FuncFormatter(millions)
ax.yaxis.set_major_formatter(formatter)
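Applied to the timestamp case from the question, this might look like the following sketch. The function name hms is hypothetical, and rendering the labels in UTC is an assumption (the question's own code used local time through datetime.fromtimestamp without a timezone). The Agg backend line is there only so the sketch runs without a display.

```python
from datetime import datetime, timezone

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import FuncFormatter


def hms(x, pos=None):
    """Format a POSIX timestamp in seconds as HH:MM:SS (UTC assumed)."""
    return datetime.fromtimestamp(x, tz=timezone.utc).strftime("%H:%M:%S")


# Example data in the spirit of the question: one sample per second
x = np.arange(1.47332886e9, 1.47332886e9 + 10000)
y = np.sin(np.arange(len(x)) / 1000) + np.cos(np.arange(len(x)) / 100)

fig, ax = plt.subplots()
ax.plot(x, y, ".-")
ax.grid(True)

# Only the tick values are converted, and again whenever the ticks change
ax.xaxis.set_major_formatter(FuncFormatter(hms))
fig.autofmt_xdate()  # rotate the labels for readability
```

Since the formatter is called only for the ticks currently shown, the cost does not depend on the length of the data, and no event handler is needed when zooming.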