Curve fitting for large datasets in Python - python-3.x

I have a very large dataset (around 100k points) and I want to fit a curve to it.
I tried the filters suggested in answers to another question, but that led to overfitting.
I am currently using numpy and matplotlib.
This is the type of scatter plot I am trying to fit.
Edit 1:
Please ignore the data points to the side of the central main set of data points (so that only a single curve fits the data).
Here is the dataset. Download the file as a text file to separate the columns, and consider columns 3 and 9 (1-based indexing): the y-axis is column 3, while the x-axis is the difference between column 3 and column 9.
Edit 2: Ignore the negative values.
Edit 3: As there appears to be a lot of noise, consider column 33, which gives a probability, and keep only stars with a probability above 90%.

Here are comparison scatterplots using the data in your link, along with the Python code I used to read, parse, and plot the data. Note that my plot also has an inverted y-axis for direct comparison. This shows me that the data in the posted link, parsed per your directions, cannot be fit as described in your question. My hope is that you can find some error in my work, and that a model can in fact be made.
import matplotlib.pyplot as plt
dataFileName = 'temp.dat'
xlist = []
ylist = []
with open(dataFileName) as f:
    for line in f:
        if line[0] == '#':  # skip comment lines
            continue
        spl = line.split()
        col3 = float(spl[2])  # column 3 (1-based)
        col9 = float(spl[8])  # column 9 (1-based)
        if col3 < 0.0 or col9 < 0.0:  # Edit 2: ignore negative values
            continue
        x = abs(col3 - col9)
        y = col3
        xlist.append(x)
        ylist.append(y)
fig = plt.figure()
axes = fig.add_subplot(111)
axes.invert_yaxis()  # inverted y-axis, for direct comparison with the question's plot
axes.scatter(xlist, ylist, color='black', marker='o', lw=0, s=1)
plt.show()
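If a single trend does turn out to exist once the data is filtered, one common way to fit it without chasing the noise is to take the median of y in bins along x and fit a low-order polynomial through those medians. The sketch below is only an illustration on synthetic data: the function name binned_median_fit, the bin count, the polynomial degree, and the synthetic curve are all assumptions, not something taken from the question's dataset.

```python
import numpy as np

def binned_median_fit(x, y, n_bins=30, degree=3, min_count=20):
    """Fit a polynomial through per-bin medians of y (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Bin index of every point; the right-most edge is folded into the last bin.
    which = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    centers, medians = [], []
    for b in range(n_bins):
        in_bin = which == b
        if in_bin.sum() >= min_count:  # skip sparsely populated bins
            centers.append(0.5 * (edges[b] + edges[b + 1]))
            medians.append(np.median(y[in_bin]))
    return np.poly1d(np.polyfit(centers, medians, degree))

# Synthetic stand-in for the real data: a known quadratic plus noise and outliers.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, 100_000)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0.0, 0.3, x.size)
y[:5000] += rng.uniform(-5.0, 5.0, 5000)  # 5% gross outliers

model = binned_median_fit(x, y, n_bins=30, degree=2)
print(model(1.0))  # should land close to the true value 2.5
```

Because the median ignores a modest fraction of outliers in each bin, the fit follows the dense central band rather than the scattered points on either side. Whether that band is well described by one curve in the real dataset is exactly what the answer above questions.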

Related

python-plotly multiple lines in same graph with same Y axis

I have a csv file that looks like this:
time,price,m1,m2,m3,m4,m5,m6,m7,m8,buy/sell
10.30.01,102,105,100.5,103.5,110,100.9,103.02,111,105.0204,
10.30.02,103,104.5,101,104,110.2,101.4,104.03,110.5,104.5204,
10.30.03,104,104,101.5,104.5,110.4,101.9,105.04,110,104.0204,
10.30.04,105,103.5,102,105,110.6,102.4,106.05,109.5,103.5204,
10.30.05,106,103,102.5,105.5,110.8,102.9,107.06,109,103.0204,
10.30.06,107,102.5,103,106,111,103.4,108.07,108.5,102.5204,
10.30.07,108,102,103.5,106.5,111.2,103.9,109.08,108,102.0204,
10.30.08,109,101.5,104,107,111.4,104.4,110.09,107.5,101.5204,BUY
10.30.09,110,101,104.5,107.5,111.6,104.9,111.1,107,101.0204,
10.30.10,111,100.5,105,108,111.8,105.4,112.11,106.5,100.5204,
10.30.11,112,101,105.5,108.5,112,105.9,113.12,106,101.0204,
10.30.12,113,101.5,106,109,112.2,106.4,114.13,105.5,101.5204,SELL
10.30.13,114,102,106.5,109.5,112.4,106.9,115.14,105,102.0204,
10.30.14,115,102.5,107,110,112.6,107.4,116.15,104.5,102.5204,
10.30.15,116,103,107.5,110.5,112.8,107.9,117.16,104,103.0204,BUY
10.30.16,117,103.5,108,111,113,108.4,118.17,103.5,103.5204,
I want to plot time on the x-axis and price, m1, m2, m3, m4, m5, m6, m7 and m8 on the y-axis as line graphs; they all share the same range, so they can use the same y-axis. The buy/sell column should appear in the same graph as a scatter plot. How can I do this with plotly?
Sorry if this is a simple question; I tried a lot but couldn't crack it. Thank you in advance.
A great resource for Scatter plot related questions is Plotly's documentation on scatter plots.
Plotting all of the columns price,m1,m2,m3,m4,m5,m6,m7,m8 can be done by looping through a list, and adding each of these columns as a trace.
Then I would recommend that you draw vertical lines in the Scatter plot for each time with BUY or SELL, by iterating through the non-null entries in the buy/sell column and using a shape to create a vertical line. You can also add an arrow and text pointing to each line using an annotation.
import pandas as pd
import plotly.graph_objects as go
df = pd.read_csv("buysell.csv")
fig = go.Figure()
cols = ['price','m1','m2','m3','m4','m5','m6','m7','m8']
# one line trace per column, all sharing the same y-axis
for col in cols:
    fig.add_trace(go.Scatter(
        x=df['time'],
        y=df[col],
        name=col
    ))
# highest plotted value, used to position the annotations
df_max = df[cols].max().max()
# iterate over any rows with 'BUY' or 'SELL'
for index, row in df.dropna(subset=['buy/sell']).iterrows():
    # dotted vertical line spanning the full height of the plot
    fig.add_shape(
        type='line',
        x0=row['time'],
        y0=0,
        x1=row['time'],
        y1=1,
        yref='paper',
        line=dict(
            color="red",
            width=1,
            dash="dot",
        )
    )
    # arrow and text label pointing at the line
    fig.add_annotation(
        x=row['time'],
        y=df_max,
        text=row['buy/sell'],
        showarrow=True,
        arrowhead=4,
    )
fig.show()
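The question also asks for the buy/sell column as a scatter plot. As an alternative to vertical lines, the rows carrying a signal can be drawn as markers at their price. The sketch below uses a four-row excerpt of the CSV (only the time, price, m1 and buy/sell columns); the helper name add_signal_markers and the marker colours and symbols are illustrative choices, not part of the original answer.

```python
import io
import pandas as pd

# A few rows from the question's file, cut down to four columns.
csv_text = """time,price,m1,buy/sell
10.30.07,108,102,
10.30.08,109,101.5,BUY
10.30.09,110,101,
10.30.12,113,101.5,SELL
"""
df = pd.read_csv(io.StringIO(csv_text))

# Rows where the buy/sell column is filled in.
signals = df.dropna(subset=["buy/sell"])
print(signals[["time", "price", "buy/sell"]])

def add_signal_markers(fig, signals):
    """Add one marker trace per signal type to an existing plotly figure (hypothetical helper)."""
    import plotly.graph_objects as go  # imported here so the data prep above runs without plotly
    styles = {"BUY": ("triangle-up", "green"), "SELL": ("triangle-down", "red")}
    for label, (symbol, color) in styles.items():
        rows = signals[signals["buy/sell"] == label]
        fig.add_trace(go.Scatter(
            x=rows["time"],
            y=rows["price"],
            mode="markers",
            name=label,
            marker=dict(symbol=symbol, color=color, size=12),
        ))
    return fig
```

Calling add_signal_markers(fig, signals) on the figure that already holds the line traces, before fig.show(), would put the markers on the same axes as the lines.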

Control marker properties in seaborn pairwise boxplot

I'm trying to plot boxplots for two different datasets on the same plot. The x-axis shows the hours of the day, while the y-axis goes from 0 to 1 (let's call it Efficiency). I would like to have different markers for the means of each dataset's boxes. I use 'meanprops' in seaborn, but that changes the marker style for both datasets at the same time. I've added 2000 lines of data in the Excel file that can be downloaded here. The values might not coincide with the ones in the picture, but they should be enough.
Basically, I want the red squares to be blue on the orange boxplot and red on the blue boxplot. Here is what I have managed to do so far:
I tried changing meanprops by using a dictionary with the labels as keys, but it seems to enter an endless loop (PyCharm says Evaluating...).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# make sure you have your path sorted out
group1 = pd.read_excel('group1.xls')
fig, ax = plt.subplots(figsize=(20, 10))  # plt.subplots returns (figure, axes), in that order
# does not work
#ax = sns.boxplot(data=group1, x='hour', y='M1_eff', hue='labels', showfliers=False, showmeans=True,
#                 meanprops={"marker":{'7':"s",'8':'s'},"markerfacecolor":{'7':"white",'8':'white'},
#                            "markeredgecolor":{'7':"blue",'8':'red'}})
# works but produces similar markers
ax = sns.boxplot(data=group1, x='hour', y='M1_eff', hue='labels', showfliers=False, showmeans=True,
                 meanprops={"marker":"s","markerfacecolor":"white", "markeredgecolor":"blue"})
plt.legend(title='Groups', loc=2, bbox_to_anchor=(1, 1), borderaxespad=0.5)
# Add transparency to colors
for patch in ax.artists:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .4))
ax.set_xlabel("Hours", fontsize=14)
ax.set_ylabel("M1 Efficiency", fontsize=14)
ax.tick_params(labelsize=10)
plt.show()
I also tried the FacetGrid but to no avail (Stops at 'Evaluating...'):
g = sns.FacetGrid(group1, col="M1_eff", hue="labels",hue_kws=dict(marker=["^", "v"]))
g = (g.map(plt.boxplot, "hour", "M1_eff")
     .add_legend())
g.show()
Any help is appreciated!
I don't think you can do this using sns.boxplot() directly. I think you'll have to draw the means "by hand":
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
N=100
df = pd.DataFrame({'hour':np.random.randint(0,3,size=(N,)),
                   'M1_eff': np.random.random(size=(N,)),
                   'labels':np.random.choice([7,8],size=(N,))})
x_col = 'hour'
y_col = 'M1_eff'
hue_col = 'labels'
width = 0.8
hue_order=[7,8]
marker_colors = ['red','blue']
# get the offsets used by boxplot when hue-nesting is used
# https://github.com/mwaskom/seaborn/blob/c73055b2a9d9830c6fbbace07127c370389d04dd/seaborn/categorical.py#L367
n_levels = len(hue_order)
each_width = width / n_levels
offsets = np.linspace(0, width - each_width, n_levels)
offsets -= offsets.mean()
fig, ax = plt.subplots()
ax = sns.boxplot(data=df, x=x_col, y=y_col, hue=hue_col, hue_order=hue_order, showfliers=False, showmeans=False)
# mean of y for every (hue, x) combination
means = df.groupby([hue_col,x_col])[y_col].mean()
# one marker colour per hue level, shifted to the centre of that level's box
for (gr,temp),o,c in zip(means.groupby(level=0),offsets,marker_colors):
    ax.plot(np.arange(temp.values.size)+o, temp.values, 's', c=c)
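To make the offset arithmetic in that answer easier to follow, here it is isolated as a small function and evaluated for two and three hue levels. The function name hue_offsets is made up for this sketch; the formula is the one quoted in the answer, and whether current seaborn releases still position boxes this way is an assumption.

```python
import numpy as np

def hue_offsets(n_levels, width=0.8):
    """Horizontal shift of each hue level's box relative to its category position."""
    each_width = width / n_levels
    offsets = np.linspace(0, width - each_width, n_levels)
    return offsets - offsets.mean()  # centre the group of boxes on the category tick

two = hue_offsets(2)    # two hue levels: one box left of the tick, one right
three = hue_offsets(3)  # three hue levels: the middle box sits on the tick
print(two)
print(three)
```

With the default width of 0.8 and two hue levels, the boxes sit 0.2 to the left and 0.2 to the right of each category tick, which is where the answer plots the mean markers.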

Gantt Chart for USGS Hydrology Data with Python?

I have compiled a dataframe that contains USGS streamflow data at several different streamgages. Now I want to create a Gantt chart similar to this. Currently, my data has site names as columns and a date index as rows.
Here is a sample of my data.
The problem with the Gantt chart example I linked is that my data has gaps between the start and end dates that would normally define the horizontal time-lines. Many of the examples I found only account for the start and end date, but not missing values that may be in between. How do I account for the gaps where there is no data (blanks or nan in those slots for values) for some of the sites?
First, I have a plot that shows where the missing data is.
import missingno as msno
msno.bar(dfp)
Now, I want time on the x-axis and a horizontal line on the y-axis that tracks when the sites contain data at those times. I know how to do this the brute force way, which would mean manually picking out the start and end dates where there is valid data (which I made up below).
from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as dt
df=[('RIO GRANDE AT EMBUDO, NM','2015-7-22','2015-12-7'),
    ('RIO GRANDE AT EMBUDO, NM','2016-1-22','2016-8-5'),
    ('RIO GRANDE DEL RANCHO NEAR TALPA, NM','2014-12-10','2015-12-14'),
    ('RIO GRANDE DEL RANCHO NEAR TALPA, NM','2017-1-10','2017-11-25'),
    ('RIO GRANDE AT OTOWI BRIDGE, NM','2015-8-17','2017-8-21'),
    ('RIO GRANDE BLW TAOS JUNCTION BRIDGE NEAR TAOS, NM','2015-9-1','2016-6-1'),
    ('RIO GRANDE NEAR CERRO, NM','2016-1-2','2016-3-15'),
    ]
df=pd.DataFrame(data=df)
df.columns = ['A', 'Beg', 'End']
df['Beg'] = pd.to_datetime(df['Beg'])
df['End'] = pd.to_datetime(df['End'])
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
ax.xaxis_date()  # returns None, so its result is not assigned back to ax
ax.hlines(df['A'], dt.date2num(df['Beg']), dt.date2num(df['End']))
How do I make a figure (like the one shown above) with the dataframe I provided as an example? Ideally I want to avoid the brute force method.
Please note: values of zero are considered valid data points.
Thank you in advance for your feedback!
Find date ranges of non-null data
2020-02-12 Edit to clarify logic in loop
df = pd.read_excel('Downloads/output.xlsx', index_col='date')
Make sure the dates are in order:
df.sort_index(inplace=True)
Loop through the data and find the edges of the good data ranges. Get the corresponding index values and the name of the gauge and collect them all in a list:
# Looping feels like defeat. However, I'm not clever enough to avoid it
good_ranges = []
for i in df:
    col = df[i]
    gauge_name = col.name
    # Start of good data block defined by a number preceded by a NaN
    start_mark = (col.notnull() & col.shift().isnull())
    start = col[start_mark].index
    # End of good data block defined by a number followed by a NaN
    end_mark = (col.notnull() & col.shift(-1).isnull())
    end = col[end_mark].index
    for s, e in zip(start, end):
        good_ranges.append((gauge_name, s, e))
good_ranges = pd.DataFrame(good_ranges, columns=['gauge', 'start', 'end'])
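To see what the loop produces, here is the same edge-detection logic wrapped in a function and run on a tiny made-up frame. The function name find_good_ranges, the site names, and the values are invented for this sketch and are not the USGS data.

```python
import numpy as np
import pandas as pd

def find_good_ranges(df):
    """Return one (gauge, start, end) row per unbroken run of non-null values."""
    rows = []
    for name in df:
        col = df[name]
        # A run starts where a value follows a NaN (or the beginning of the column)
        start_mark = col.notnull() & col.shift().isnull()
        # ... and ends where a value precedes a NaN (or the end of the column)
        end_mark = col.notnull() & col.shift(-1).isnull()
        for s, e in zip(col[start_mark].index, col[end_mark].index):
            rows.append((name, s, e))
    return pd.DataFrame(rows, columns=["gauge", "start", "end"])

dates = pd.date_range("2020-01-01", periods=8, freq="D")
toy = pd.DataFrame(
    {
        "site_a": [1.0, 0.0, np.nan, np.nan, 3.0, 4.0, np.nan, 2.0],  # 0.0 counts as valid data
        "site_b": [np.nan, 5.0, 5.0, 5.0, 5.0, np.nan, np.nan, np.nan],
    },
    index=dates,
)
ranges = find_good_ranges(toy)
print(ranges)
```

site_a yields three ranges (Jan 1 to 2, Jan 5 to 6, and the single day Jan 8) and site_b yields one (Jan 2 to 5). A single-day run has identical start and end, so it would be drawn as a zero-length line unless it is widened before plotting.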
Plotting
Nothing new here. Copied pretty much straight from your question:
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111)
ax.xaxis_date()  # returns None, so its result is not assigned back to ax
ax.hlines(good_ranges['gauge'],
          dt.date2num(good_ranges['start']),
          dt.date2num(good_ranges['end']))
fig.tight_layout()
Here's an approach that you could use. It's a bit hacky, so perhaps someone else will produce a better solution, but it should produce your desired output. First use Series.where to replace the non-NaN values with an integer, which will later determine the position of each line on the y-axis. I do this column by column (one site at a time) so that all data belonging to the same site ends up at the same height. If you want to increase the spacing between the lines of the Gantt chart you can add a number to i; I've provided an example in the comments in the code block below.
The y-labels and their positions are produced in the data munging steps, so this method will work regardless of the number of columns and will position the labels correctly when you change the spacing described above.
This approach gives you matplotlib Axes and Figure objects, so you can adjust the aesthetics of the chart to suit your purposes (e.g. change the thickness of the lines, the colours, etc.). Link to docs.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_excel('output.xlsx')
dates = pd.to_datetime(df.date)
df.index = dates
df = df.drop('date', axis=1)
new_rows = [df[s].where(df[s].isna(), i) for i, s in enumerate(df, 1)]
# To increase spacing between lines add a number to i, eg. below:
# [df[s].where(df[s].isna(), i+3) for i, s in enumerate(df, 1)]
new_df = pd.DataFrame(new_rows)
### Plotting ###
fig, ax = plt.subplots() # Create axes object to pass to pandas df.plot()
ax = new_df.transpose().plot(figsize=(40,10), ax=ax, legend=False, fontsize=20)
list_of_sites = new_df.transpose().columns.to_list() # For y tick labels
y_tick_location = new_df.max(axis=1).values # For y tick positions: each site's line height, skipping NaN gaps
ax.set_yticks(y_tick_location) # Place ticks in correct positions
ax.set_yticklabels(list_of_sites) # Update labels to site names
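The where step is the heart of this approach, so here it is on a three-day toy frame. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"site_a": [1.5, np.nan, 0.0], "site_b": [np.nan, 2.0, 7.0]},
    index=pd.date_range("2020-01-01", periods=3, freq="D"),
)

# where() keeps entries for which the condition is True (the NaNs) and
# replaces everything else (the real measurements) with the site's number i.
levels = pd.DataFrame(
    {name: toy[name].where(toy[name].isna(), i) for i, name in enumerate(toy, 1)}
)
print(levels)
```

Every measurement for site_a becomes 1 and every measurement for site_b becomes 2, while the gaps stay NaN. Plotting each column as a line then gives a flat segment at height 1 or 2 that is interrupted wherever the data is missing.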

MatPlotLib Plot last few items differently

I'm exploring MatPlotLib and would like to know if it is possible to show last few items in a dataset differently.
Example: If my dataset contains 100 numbers, I want to display last 5 items in different color.
So far I could do it with one last record using annotate, but want to show last few items dotted with 'red' color as against the blue line.
I could finally achieve this by changing few things in my code.
Below is what I have done.
Let me know in case there is a better way. :)
import pandas as pd
import matplotlib.pyplot as plt
series_df = pd.read_csv('my_data.csv')
series_df = series_df.fillna(0)
series_df = series_df.sort_values(['Date'], ascending=True)
# Create a new DataFrame, series_df2, holding the last 5 items
series_df2 = series_df.tail(5)
plt.plot(series_df["Date"],series_df["Values"],color="red", marker='+')
plt.plot(series_df2["Date"],series_df2["Values"],color="blue", marker='+')
You should add some minimal code example or a figure with the desired output to make your question clear. It seems you want to highlight some of the last few points with a marker. You can achieve this by calling plot() twice:
import numpy as np
import matplotlib.pyplot as plt
N = 50
x = np.arange(N)
y = np.random.rand(N)
plt.figure()
plt.plot(x, y)
plt.plot(x[-5:], y[-5:], ls='', c='tab:red', marker='.', ms=10)
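If the last few items should appear as a dotted red line instead of markers drawn on top of the full line, the series can be split at one index and each part plotted once. This is a minimal sketch on random data; the names n_tail and split are illustrative, and the Agg backend is selected only so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

n_tail = 5
rng = np.random.default_rng(0)
x = np.arange(100)
y = rng.random(100)

split = len(x) - n_tail  # index of the first "tail" point

fig, ax = plt.subplots()
# Solid blue line up to and including the first tail point, so the two parts join.
ax.plot(x[:split + 1], y[:split + 1], color="tab:blue")
# Dotted red line with markers for the last n_tail points.
ax.plot(x[split:], y[split:], color="tab:red", linestyle=":", marker=".")
fig.savefig("last_items.png")
```

The one-point overlap at index split avoids a visible gap between the solid and the dotted part.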

Too long processing time for a 3-d plot

I am comparatively new to Python, so I cannot tell whether there is something wrong with my code or whether the process simply takes a very long time to complete.
I wrote code to plot a large dataset (3d array) in a 3d plot, but my PC takes forever to complete it (or never completes it). I have waited nearly an hour for it to finish.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
a = pd.DataFrame(np.array([Ensemble_test,df['RF'],y])).transpose()
a # is a dataset with dimensions 335516 rows × 3 columns
### All 3 columns are numbers
Output:
            0           1          2
0  172.981614  130.624674 -42.356940
1  189.851754  139.632304 -50.219450
## I tried plotting using the following
from mpl_toolkits.mplot3d import Axes3D
df=a.unstack().reset_index()
df.columns=["X","Y","Z"]
df['X']=pd.Categorical(df['X'])
df['X']=df['X'].cat.codes
# Make the plot
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') no longer works in recent Matplotlib
im = ax.plot_trisurf(df['Y'], df['X'], df['Z'], cmap='Spectral', linewidth=0.001, vmax = 30,
                     vmin = -30, antialiased=True)
ax.view_init(40,20)
#fig.colorbar(im, ax=ax, fraction = 0.023)
ax.set_ylabel('RD')
ax.set_zlabel('Difference')
ax.set_xlabel('Ensemble')
I wanted to have a 3-d plot but the process takes too long. I don't know what the problem is.
Any other alternatives/suggestions for 3-d plotting are also welcome.
[My PC is 8th gen 'i7' with '16 GB' RAM]
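One thing that may help, offered here only as a suggestion and not as a confirmed fix: after a.unstack() the frame holds roughly a million rows (335516 × 3), and triangulating that many points for plot_trisurf is expensive. Drawing a random subsample is a cheap way to check whether the plot is feasible at all. The helper name subsample_points and the limit of 5000 points are arbitrary choices for this sketch, and the random array stands in for the real data.

```python
import numpy as np

def subsample_points(points, max_points=5000, seed=0):
    """Return at most max_points rows of points, chosen at random without replacement."""
    points = np.asarray(points)
    if len(points) <= max_points:
        return points
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=max_points, replace=False)
    return points[np.sort(idx)]  # keep the original row order

# Stand-in for the real 335516 x 3 array.
rng = np.random.default_rng(1)
full = rng.normal(size=(335516, 3))
small = subsample_points(full, max_points=5000)
print(small.shape)  # (5000, 3)
```

The three columns of small can then be passed to ax.plot_trisurf or ax.scatter in place of the full columns. If the subsample renders quickly, max_points can be raised until the plotting time becomes unacceptable.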
