How do I set the domain of an axis to a value that isn't a multiple of five in Altair?

I'm trying to set the x-axis domain to the range 0 to 36, as some data I'm processing was collected in 6-week increments. Following the documentation, I used scale=alt.Scale(domain=[0,36]). However, the chart is still drawn up to 40.
df = pd.DataFrame({'x':[0,6,12,18,24,30,36],'y':[0,3,1,4,2,5,3]})
alt.Chart(df).mark_line(point=True).encode(
    x=alt.X('x:Q',
            axis=alt.Axis(values=[0,6,12,18,24,30,36]),
            scale=alt.Scale(domain=[0,36])),
    y=alt.Y('y:Q'),
)
Output of code above
Changing the above code to cut off between 30 and 35 i.e., scale=alt.Scale(domain=[0,31]) generates this behavior, where the chart axis gets truncated at 30 (but shows the data after 30, appropriately since the data hasn't been clipped).
But why can't I cut off the graph at values that aren't multiples of 5?
I'm using Altair v4.0.1

The Vega-Lite renderer defaults to choosing "nice" values for the scale. If you want to disable this behavior, you can pass nice=False:
import pandas as pd
import altair as alt
df = pd.DataFrame({'x':[0,6,12,18,24,30,36],'y':[0,3,1,4,2,5,3]})
alt.Chart(df).mark_line(point=True).encode(
    x=alt.X('x:Q',
            axis=alt.Axis(values=[0,6,12,18,24,30,36]),
            scale=alt.Scale(domain=[0,36], nice=False)),
    y=alt.Y('y:Q'),
)
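To see why the axis ends up at 40, here is a rough sketch of the kind of rounding a "nice" scale performs. It is a simplified model in the spirit of D3's nice rounding (which Vega builds on), not the renderer's actual code. The function names and the tick count of 10 are assumptions made for illustration; the real tick count can depend on the chart size.

```python
import math

def nice_step(start, stop, count=10):
    """Pick a step of 1, 2, 5 or 10 times a power of ten near (stop - start) / count."""
    raw = (stop - start) / count
    power = math.floor(math.log10(raw))
    fraction = raw / 10 ** power
    if fraction >= math.sqrt(50):
        factor = 10
    elif fraction >= math.sqrt(10):
        factor = 5
    elif fraction >= math.sqrt(2):
        factor = 2
    else:
        factor = 1
    return factor * 10 ** power

def nice_domain(start, stop, count=10):
    """Extend the domain outward to the nearest multiples of the chosen step."""
    step = nice_step(start, stop, count)
    return [math.floor(start / step) * step, math.ceil(stop / step) * step]

print(nice_domain(0, 36))  # the step comes out as 5, so 36 is rounded up to 40
```

Under this model a domain of [0, 36] gets a step of 5 and is widened to [0, 40], which matches the behavior in the question. Passing nice=False skips the rounding step entirely.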

Related

Plotting graphs with Altair from a Pandas Dataframe

I am trying to read table values from a spreadsheet and plot different charts using Altair.
The spreadsheet can be found here
import pandas as pd
xls_file = pd.ExcelFile('PET_PRI_SPT_S1_D.xls')
xls_file
crude_df = xls_file.parse('Data 1')
crude_df
I am setting the second row values as column headers of the data frame.
crude_df.columns = crude_df.iloc[1]
crude_df.columns
Index(['Date', 'Cushing, OK WTI Spot Price FOB (Dollars per Barrel)',
'Europe Brent Spot Price FOB (Dollars per Barrel)'],
dtype='object', name=1)
The following is a modified version of Altair code taken from the documentation examples:
crude_df_header = crude_df.head(100)
import altair as alt
alt.Chart(crude_df_header).mark_circle().encode(
    # Mapping the WTI column to y-axis
    y='Cushing, OK WTI Spot Price FOB (Dollars per Barrel)'
)
This does not work.
Error is shown as
TypeError: Object of type datetime is not JSON serializable
How can I make 2D plots with this data?
Also, how can I plot more than 5000 values in Altair? That also results in errors.
Your error is due to the way you parsed the file. You set the column names but did not remove the first two rows, including the row that now serves as the column names. The presence of these string values resulted in the error.
The proper way to achieve what you are looking for is as follows:
import pandas as pd
import altair as alt
crude_df = pd.read_excel(open('PET_PRI_SPT_S1_D.xls', 'rb'),
                         sheet_name='Data 1', index_col=None, header=2)
alt.Chart(crude_df.head(100)).mark_circle().encode(
    x='Date',
    y='Cushing, OK WTI Spot Price FOB (Dollars per Barrel)'
)
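If the sheet has already been parsed the way the question does it, the same cleanup can be sketched by hand: promote the header row, drop the non-data rows, and convert the columns to proper dtypes. The small frame below is a made-up stand-in for the raw sheet (the values and the short column name WTI are invented for illustration, not taken from the spreadsheet).

```python
import pandas as pd

# Made-up stand-in for the raw sheet: two non-data rows, then the data.
raw = pd.DataFrame([
    ["Back to Contents", "Data 1: hypothetical title"],
    ["Date", "WTI"],
    [pd.Timestamp("2020-01-02"), 61.0],
    [pd.Timestamp("2020-01-03"), 63.0],
])

clean = raw.iloc[2:].reset_index(drop=True)  # drop the two non-data rows
clean.columns = raw.iloc[1].tolist()         # use the second row as headers

# The columns are still object dtype at this point; convert them explicitly.
clean["Date"] = pd.to_datetime(clean["Date"])
clean["WTI"] = pd.to_numeric(clean["WTI"])

print(clean.dtypes)
```

With real datetime and numeric dtypes in place, the frame no longer carries loose Python datetime objects in an object column, which is what the serialization error complains about.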
For the max rows issue, you can use the following
alt.data_transformers.disable_max_rows()
But be mindful of the official warning
If you choose this route, please be careful: if you are making multiple plots with the dataset in a particular notebook, the notebook will grow very large and performance may suffer.

How to format the last line segment based on a datetime column?

I'm trying to make the final segment of a line plot dashed to indicate incomplete data. From what I can tell I should be able to do this using a condition on strokeDash. However I can't figure out how to get the condition predicate to work using a datetime field.
(
    alt.Chart(rates)
    .mark_line(point=True)
    .encode(
        x=alt.X("start_date:T", scale=alt.Scale(nice="week")),
        y="install_rate",
        strokeDash=alt.condition(
            f"datum.start_date > toDate({start_dates[-2].isoformat()})",
            alt.value([5, 5]),  # dashed line: 5 pixels dash + 5 pixels space
            alt.value([0]),  # solid line
        ),
    )
)
This gives me an error:
Error: Illegal callee type: MemberExpression
You can fix the error you are encountering by making sure that pandas reads in the dates as a temporal data type:
import pandas as pd
import altair as alt
rates = pd.DataFrame({
    'start_date': pd.to_datetime(['2022-05-06', '2022-05-13', '2022-05-19', '2022-05-25']),
    'install_rate': [0.05, 0.06, 0.08, 0.09],
})
alt.Chart(rates).mark_line(point=True).encode(
    x=alt.X("start_date:T"),
    y="install_rate",
    color=alt.condition(
        f"datum.start_date > toDate('2022-05-19')",
        alt.value('blue'),
        alt.value('red')
    )
)
However, as you can see, the line cannot be modified via a condition. I think this is because it is treated as a single continuous mark, whereas the points are separate marks that can be changed individually.
You could group the line by creating a new separate field and grouping by it, which creates two separate lines.
rates['above_threshold'] = rates['start_date'] > '2022-05-13'
alt.Chart(rates).mark_line(point=True).encode(
    x=alt.X("start_date:T"),
    y="install_rate",
    color='above_threshold')
However, that causes issues with the gap as you can see above. I think for your case the easiest might be to layer two charts with filter transforms:
base = alt.Chart(rates).encode(
    x=alt.X("start_date:T"),
    y="install_rate",
)
base.mark_line(strokeDash=[5, 5]).transform_filter(
    f"datum.start_date > toDate('2022-05-19')"
) + base.mark_line().transform_filter(
    f"datum.start_date < toDate('2022-05-20')"
)
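An alternative to filtering on date strings inside the chart is to split the rows in pandas first, so that the two segments share their boundary point and the lines meet. This is only a sketch of that idea, not part of the original answer; the names solid and dashed are made up here.

```python
import pandas as pd

rates = pd.DataFrame({
    "start_date": pd.to_datetime(["2022-05-06", "2022-05-13", "2022-05-19", "2022-05-25"]),
    "install_rate": [0.05, 0.06, 0.08, 0.09],
}).sort_values("start_date")

solid = rates.iloc[:-1]   # every point except the last one
dashed = rates.iloc[-2:]  # the last two points, so the segments share a point

print(solid)
print(dashed)
```

Each frame could then be passed to its own alt.Chart(...).mark_line(...) and the two charts layered with +, in the same way as the filtered charts above.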

Python visualization - histograms

The following two questions are about a histogram I am trying to build.
1) I want the bins to be as follows:
[0-10,10-20,...,580-590, 590-600]. I tried the following code:
bins_range=[]
for i in range(0,610,10):
    bins_range.append(i)
plt.hist(df['something'], bins=bins_range, rwidth=0.95)
I expected to see bins as above with their corresponding amount of samples for each bin, but instead I got only 10 bins (as the default parameter).
2) How can I change the y-axis as follows: say my max bin contains 40 samples, so instead of 40 on the y-axis I want it to be 100%, and the others scaled correspondingly. I.e., 30 will be 75%, 20 will be 50% and so on.
Your code seems to be working OK. You can even pass the range command directly to the bins parameter of hist.
To get the y-axis as percentages, I think you need two passes: first calculate the histogram to find out how many samples the highest bin contains, then do the plotting using 1/highest as weights. NumPy's np.histogram does all the calculations without plotting.
Use PercentFormatter() to display the axis in percentages. It takes a parameter that says which value corresponds to 100%. Use PercentFormatter(max(hist)) to get the highest value as 100%. If you just want the total as 100%, pass PercentFormatter(len(x)), without the need to calculate the histogram twice. As the y-axis is still internally in counts, the ticks don't show up at the desired positions. You can use plt.yticks(np.linspace(0, max(hist), 11)) to have ticks for every 10%.
To get nicer separation between the bars, you can set an explicit edge color. This works best without rwidth=0.95.
Example code:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
x = np.random.rayleigh(200, 50000)
hist, bins = np.histogram(x, bins=range(0, 610, 10))
plt.hist(x, bins=bins, ec='white', fc='darkorange')
plt.gca().yaxis.set_major_formatter(PercentFormatter(max(hist)))
plt.yticks(np.linspace(0, max(hist), 11))
plt.show()
PS: To use matplotlib's standard yticks and also have the y-axis internally in percentages, you can use the weights parameter of hist. This can be handy when you want to interactively resize or zoom the plot, or need horizontal lines at specific percentages.
plt.hist(x, bins=bins, ec='white', fc='dodgerblue', weights=np.ones_like(x)/max(hist))
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
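The effect of the weights can be checked without plotting. The sketch below repeats the two-pass idea with np.histogram only: the first pass finds the tallest bin, the second pass applies 1/highest as weights. The seeded generator and the variable names counts and scaled are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.rayleigh(200, 50000)
bins = np.arange(0, 610, 10)  # edges 0, 10, ..., 600 give 60 bins of width 10

# First pass: plain counts, to find the tallest bin.
counts, edges = np.histogram(x, bins=bins)

# Second pass: each sample contributes 1/highest, so the tallest bin sums to 1.
weights = np.ones_like(x) / counts.max()
scaled, _ = np.histogram(x, bins=bins, weights=weights)

print(len(counts), scaled.max())
```

The same weights array is what gets passed to plt.hist in the PS above, which is why PercentFormatter(1) then labels the tallest bar as 100%.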

How to change scatter plot marker color in plotting loop using pandas?

I'm trying to write a simple program that reads in a CSV with various datasets (all of the same length) and automatically plots them all (as a Pandas Dataframe scatter plot) on the same figure. My current code does this well, but all the marker colors are the same (blue). I'd like to figure out how to make a colormap so that in the future, if I have much larger data sets (let's say, 100+ different X-Y pairings), it will automatically color each series as it plots. Eventually, I would like for this to be a quick and easy method to run from the command line. I did not have luck reading the documentation or stack exchange, hopefully this is not a duplicate!
I've tried the recommendations from these posts:
1) Setting different color for each series in scatter plot on matplotlib
2)https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html
3) https://matplotlib.org/users/colormaps.html
However, the first one essentially grouped the data points according to their position on the x-axis and made those groups of data the same color (not what I want, each series of data is roughly a linearly increasing function). The second and third links seemed to have worked, but I don't like the colormap choices (e.g. "viridis", many colors are too similar and it's hard to distinguish data points).
This is a simplified version of my code so far (took out other lines that automatically named axes, etc. to make it easier to read). I've also removed any attempts I've made to specify a colormap, for more of a blank canvas feel:
''' Importing multiple scatter data and plotting '''
import pandas as pd
import matplotlib.pyplot as plt
### Data file path (please enter Dataframe however you like)
path = r'/Users/.../test_data.csv'
### Read in data CSV
data = pd.read_csv(path)
### List of headers
header_list = list(data)
### Set data type to float so modified data frame can be plotted
data = data.astype(float)
### X-axis limits
xmin = 1e-4
xmax = 3e-3
## Create subplots to be plotted together after loop
fig, ax = plt.subplots()
### Since there are multiple X-axes (every other column), this loop only plots every other x-y column pair
for i in range(len(header_list)):
    if i % 2 == 0:
        dfplot = data.plot.scatter(x = "{}".format(header_list[i]), y = "{}".format(header_list[i + 1]), ax=ax)
        dfplot.set_xlim(xmin,xmax) # Setting limits on X axis
plt.show()
The dataset can be found in the google drive link below. Thanks for your help!
https://drive.google.com/drive/folders/1DSEs8D7lIDUW4NIPBl2qW2EZiZxslGyM?usp=sharing

Optimal way to display data with different ranges

I have an application that pulls data from an FPGA and displays it for the engineers. It works well until you start displaying signals with extremely different ranges,
say, one signal fluctuating around +4000 and another around zero (both with small peak-to-peak amplitude).
At the moment the only real workaround is to "export to csv" and then view the data in Excel, but I would like to improve the application so that this isn't needed.
Option 1 is a more dynamic pointer that will give you readings of ALL visible plots for the present x
Option 2. Multiple Y axis. This is where it gets a bit ... tight with respect to UI area.
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import host_subplot
import mpl_toolkits.axisartist as AA
import numpy as np
t = np.arange(0,1,0.00001)
data = [5000*np.sin(t*2*np.pi*10),
10*np.sin(t*2*np.pi*20),
20*np.sin(t*2*np.pi*30),
np.sin(t*2*np.pi*40)+5000,
np.sin(t*2*np.pi*50)-5000,
np.sin(t*2*np.pi*60),
np.sin(t*2*np.pi*70),
]
fig = plt.figure()
host = host_subplot(111, axes_class=AA.Axes)
axis_list = [None]*7
for i in range(len(axis_list)):
    axis_list[i] = host.twinx()
    new_axis = axis_list[i].get_grid_helper().new_fixed_axis
    axis_list[i].axis['right'] = new_axis(loc='right',
                                          axes=axis_list[i],
                                          offset=(60*i,0))
    axis_list[i].axis['right'].toggle(all=True)
    axis_list[i].plot(t,data[i])
plt.show()
for i in data:
    plt.plot(t,i)
plt.show()
This code snippet doesn't resize the figure to ensure all 7 y-axes are visible, but even ignoring that, you can see the result is quite large...
Any advice on multiple y-axes, or a better solution for displaying no more than 7 datasets?
