multi series chart with non-repeating colors and strokes - altair

I have a Pandas dataframe of three columns: date, price and symbol.
The symbol column has many rows of 20 different values (categories). I'm trying to create a line chart where each of those categories will be a different line of a different color. The x axis is date and y axis is price.
As there are so many different values on the symbol column, when I try to chart them I get lines with repeated colors. I thought that I could use strokeDash to fix the issue, but on each color repetition I get the same type of dash, so I'm unable to differentiate them.
Is there a way to, for example, cycle the dash types in a different order, so that the next time a color repeats it is paired with a different dash?
alt.Chart(source).mark_line().encode(
x='date',
y='price',
color='symbol',
strokeDash='symbol')
Thanks!

If you're trying to distinguish 20 different values, it may make more sense to use a color scale that has 20 distinct colors. Looking at the list of color schemes in the Vega documentation, category20 seems like a good option. Here is a quick example of using it:
import pandas as pd
import numpy as np
import altair as alt
rng = np.random.RandomState(1701)
source = pd.DataFrame({
'date': np.tile(np.arange(10), 20),
'price': rng.randn(20, 10).cumsum(1).ravel(),
'symbol': np.repeat(list('ABCDEFGHIJKLMNOPQRST'), 10)
})
alt.Chart(source).mark_line().encode(
x='date',
y='price',
color=alt.Color('symbol', scale=alt.Scale(scheme='category20')),
)
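To address the original question of pairing repeated colors with different dashes, one option is to derive two helper columns from the symbol, one driving color and one driving strokeDash, so that every symbol gets a unique combination. This is only a sketch, assuming 10 colors per dash style. The names add_style_groups, build_chart, color_group and dash_group are made up for illustration, and altair is imported inside build_chart so the helper can be used on its own:

```python
import numpy as np
import pandas as pd


def add_style_groups(df, n_colors=10):
    """Add hypothetical helper columns so each symbol gets a unique (color, dash) pair."""
    symbols = sorted(df['symbol'].unique())
    rank = df['symbol'].map({s: i for i, s in enumerate(symbols)})
    out = df.copy()
    out['color_group'] = (rank % n_colors).astype(str)   # cycles through the colors
    out['dash_group'] = (rank // n_colors).astype(str)   # changes once the colors run out
    return out


def build_chart(df):
    import altair as alt  # imported here so the helper above works without altair
    return alt.Chart(df).mark_line().encode(
        x='date',
        y='price',
        color='color_group:N',
        strokeDash='dash_group:N',
        detail='symbol:N',  # keeps one line per symbol
    )


rng = np.random.RandomState(1701)
source = pd.DataFrame({
    'date': np.tile(np.arange(10), 20),
    'price': rng.randn(20, 10).cumsum(1).ravel(),
    'symbol': np.repeat(list('ABCDEFGHIJKLMNOPQRST'), 10),
})
styled = add_style_groups(source)
# chart = build_chart(styled)  # uncomment where altair is installed
```

The legend then shows the helper groups rather than the symbols, so this trades legend readability for distinguishable lines.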

Related

Align the values on the bar in matplotlib barplot

I am trying to obtain a bar chart displaying value_counts() for each year.
I use the following code:
import matplotlib.pyplot as plt
df.Year.value_counts(sort=False).sort_index().plot(kind='barh')
for index, value in enumerate(df['Year'].value_counts()):
    plt.text(value, index,
             str(value))
plt.show()
The chart that I obtain is as follows:
My bar chart
While it is correct, the issue is that the values are not neatly aligned with the bars, and their placement looks haphazard. In short, they are not aesthetically pleasing.
Can someone please tell me how to fix this part (perhaps by adding some height parameters in the code) so that all the values look neatly aligned on the bars?
The bars are plotted from value_counts(sort=False).sort_index(), so they are ordered by year. The loop that calls plt.text(), however, iterates over a separate value_counts() call, which is sorted by count in descending order. Each label is drawn at the coordinates of its own value, so whenever the two orders differ the labels end up next to the wrong bars.
If you share the data, I can update the answer. For now I am working backwards from the values shown in the plot. Note that I am NOT using value_counts(), but plotting the values directly in the sample below. See if this helps...
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Year': [2019,2020,2021,2022], 'Value':[275,237,235,170]})
df.set_index(['Year'], inplace=True)
df.sort_index(inplace=True)
print(df)
df.Value.plot(kind='barh')
# bars and labels come from the same sorted column, so positions match
for index, value in enumerate(df['Value']):
    plt.text(value, index, str(value))
plt.show()
Output
>> df
Value
Year
2019 275
2020 237
2021 235
2022 170
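One way to rule out any mismatch between bars and labels when value_counts() is involved is to compute the counts once and use that same Series for both the bars and the text. A minimal sketch with made-up data (the Year values here are hypothetical). Passing va='center' centers each label vertically on its bar, and the Agg backend line is only there so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one row per record, with a Year column
df = pd.DataFrame({'Year': [2019, 2019, 2019, 2020, 2021, 2021]})

# Compute the counts once, ordered by year
counts = df['Year'].value_counts().sort_index()

ax = counts.plot(kind='barh')
# Iterate over the same Series that produced the bars
for position, value in enumerate(counts):
    ax.text(value, position, f' {value}', va='center')
```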

Identify and extract OHLC pattern on candlestick chart using plotly or pandas?

I'm using the Ameritrade API and pandas/plotly to chart a simple stock price on the minute scale. I'd like to use some of the properties of the produced chart to identify and extract a specific candlestick pattern.
Here I build my dataframe and plot it as a candlestick:
import pandas as pd
import plotly.graph_objects as go

# df is the parsed API response (assumed to contain a 'candles' list);
# flatten it once instead of once per column
candles = pd.json_normalize(df, 'candles')
frame = pd.DataFrame({'open': candles['open'],
                      'high': candles['high'],
                      'low': candles['low'],
                      'close': candles['close'],
                      'datetime': pd.DatetimeIndex(pd.to_datetime(candles['datetime'], unit='ms')).tz_localize('UTC').tz_convert('US/Eastern')})
fig = go.Figure(data=[go.Candlestick(x=frame['datetime'],
                                     open=frame['open'],
                                     high=frame['high'],
                                     low=frame['low'],
                                     close=frame['close'])])
fig.update_layout(xaxis_rangeslider_visible=False)
fig.show()
The plot:
The pattern I'm searching for is simply the very first set in each day's trading of four consecutive red candles.
A red candle can be defined as:
close < open & close < prev.close
So in this case, I don't have access to prev.close for the very first minute of trading because I don't have pre-market/extended hours data.
I'm wondering if it's even possible to access the plotly figure data. If so, I could just extract the first set of four consecutive red candles and their data. If not, I would define my pattern and extract it using pandas, but I haven't gotten that far yet.
Would this be easier to do using plotly or pandas, and what would a simple implementation look like?
Not sure about Candlestick, but in pandas you could try something like this. Note: I assume the data already has one row per business day and is sorted. The first step is to create a column named red that is True where the condition for a red candle described in your question holds:
df['red'] = df['close'].lt(df['open'])&df['close'].lt(df['close'].shift())
Then you want to see whether it happens 4 days in a row. Assuming the data is sorted in ascending order (as it usually is), the idea is to reverse the dataframe with [::-1], apply rolling with a window of 4, sum the red column created above, and check where the sum equals 4.
df['next_4days_red'] = df[::-1].rolling(4)['red'].sum().eq(4)
Then, if you want the days that begin a run of 4 consecutive red trading days, select them with loc:
df.loc[df['next_4days_red'], 'datetime'].tolist()
Here is a small example with dummy variables:
df = pd.DataFrame({'close': [10,12,11,10,9,8,7,10,9,10],
'datetime':pd.bdate_range('2020-04-01', periods=10 )})\
.assign(open=lambda x: x['close']+0.5)
df['red'] = df['close'].lt(df['open'])&df['close'].lt(df['close'].shift())
df['next_4days_red'] = df[::-1].rolling(4)['red'].sum().eq(4)
print (df.loc[df['next_4days_red'], 'datetime'].tolist())
[Timestamp('2020-04-03 00:00:00'), Timestamp('2020-04-06 00:00:00')]
Note: it catches two successive dates because there are 5 consecutive decreasing days. I'm not sure whether you wanted both dates in this case.
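The same idea can be adapted to minute data, where the question asks for only the first run of four red candles in each trading day. The sketch below is one possible adaptation, not the approach from the answer above: it groups by calendar date, takes the previous close within the same day (so the first candle of a day is never counted as red, since its previous close is unknown), and uses a forward rolling sum instead of reversing the frame. The function name first_red_run_per_day and the sample data are hypothetical:

```python
import pandas as pd


def first_red_run_per_day(frame, n=4):
    """Return the rows of the first n consecutive red candles of each day."""
    frame = frame.reset_index(drop=True)
    day = frame['datetime'].dt.date
    # Previous close within the same day; NaN for the first candle of a day
    prev_close = frame['close'].groupby(day).shift()
    red = frame['close'].lt(frame['open']) & frame['close'].lt(prev_close)
    # Number of red candles among the last n candles of the same day
    run_len = red.astype(float).groupby(day).transform(lambda s: s.rolling(n).sum())
    ends = run_len.eq(n)
    # Position of the candle that completes the first run of each day
    first_ends = ends[ends].groupby(day[ends]).head(1).index
    rows = [i for end in first_ends for i in range(end - n + 1, end + 1)]
    return frame.loc[rows]


# Hypothetical minute data covering two trading days
times = pd.date_range('2020-04-01 09:30', periods=8, freq='min').append(
    pd.date_range('2020-04-02 09:30', periods=8, freq='min'))
close = [10, 9, 8, 7, 6, 7, 6, 5, 10, 11, 10, 9, 8, 7, 6, 9]
frame = pd.DataFrame({'datetime': times, 'close': close})
frame['open'] = frame['close'] + 0.5

runs = first_red_run_per_day(frame)
print(runs)
```

Because the rolling window is applied per day, a run can never straddle two trading days.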

How can I make the index values display on the x-axis ticks?

I am plotting data from a pandas dataframe that has weekday names as the index. The plots look good; however, the x-axis does not show the weekdays (Monday through Sunday order). How can I get the days to display?
As shown in my code, I have attempted some workarounds found in other answers around the site, but all I have accomplished is getting the first (Monday) tick to appear with its label.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(111)
ax1.xaxis.set_ticks([1])
ax1.xaxis.set_label_text(" ")
ax1.set_title(" item1 ", fontweight='bold')
avgs_df.loc[order,['item1']].plot(ax=ax1, legend=False)
Plot here:
I do not fully understand the line with .xaxis.set_ticks([1]), but so far it is the only thing that has put a day label on the plot. Changing the value in the square brackets seems to print the same single label on top of the first.
I am expecting to get all seven days on the plots' x-axis as ticks and labels.
The dataframe contains seven fields: Time (hour of the day), and six item fields with numbers 0:1. Looking at the head yields:
avgs_df.head()
Image of output here (I don't have enough reputation):
.xaxis.set_ticks([1]) sets a single tick at position 1. That's why only Monday appears.
To have 7 ticks, you should pass something like .xaxis.set_ticks(range(7)) (positions usually start from 0, so range(7) produces the tick positions 0 to 6).
I'm not really sure what the dataframe index is here (your picture shows all Fridays), but if the index holds the weekdays as I suspect, it should work by simply removing the .xaxis.set_ticks([1]) line. matplotlib will place all of them automatically if there is enough space.
EDIT after comments
So you have 168 rows, one for each hour of the days of the week. If using ax1.xaxis.set_ticks(np.arange(0, 168, 24.0)) allows you to add just the ticks you want, you can set the text of the ticks by using:
ax1.set_xticklabels(["Monday", "Tuesday", ...]) #make a list of all days
Just adding this line after .xaxis.set_ticks() should be ok. Be sure to provide a list with the same length as the ticks.
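If your dataframe index holds the day names, you can instead use:
ax1.set_xticklabels(avgs_df.index[np.arange(0, 168, 24)])
to take the label text directly from the index. Note the integer step: an index can only be positioned with integers, so the float array from np.arange(0, 168, 24.0) would raise an error here.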
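Putting the pieces together, a sketch of the tick setup might look like the following. It assumes a 168-row frame (24 hourly rows per weekday) whose index holds the day names. The random item1 values are made up, the data is plotted against integer positions so the tick positions are unambiguous, and the Agg backend line is only there so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical stand-in for avgs_df: 24 hourly rows per weekday
avgs_df = pd.DataFrame(
    {"item1": np.random.RandomState(0).rand(168)},
    index=np.repeat(days, 24),
)

fig, ax1 = plt.subplots(figsize=(12, 6))
# Plot against row positions 0..167 rather than the repeated day names
ax1.plot(np.arange(len(avgs_df)), avgs_df["item1"].to_numpy())

# One tick at the first hour of each day, labelled from the index
tick_positions = np.arange(0, len(avgs_df), 24)
labels = list(avgs_df.index[tick_positions])
ax1.set_xticks(tick_positions)
ax1.set_xticklabels(labels)
ax1.set_title(" item1 ", fontweight="bold")
```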

Seaborn pairplot: vars and conditional elements

In the dataset below some graphs will plot entries for (0,0).
import pandas as pd
import seaborn as sns
test_grid = pd.DataFrame({'a':[0,10,20,30,40,50,60],'b':[0,60,50,30,20,40,80],'c':[10,40,70,30,50,80,0],'d':[50,60,80,100,50,80,0]})
sns.pairplot(test_grid)
How can I tell pairplot to ignore coordinates where x=0 and y=0 for any row in a given pairwise plot?
I'd like to do this in a conditional fashion with a combination of any two numbers- i.e. not by rebuilding the data frame with NaN to use dropna or anything like that.
I'm going to answer my own question in case someone finds it useful:
I haven't found a way to filter coordinates at the plotting level,
so I have instead resorted to filtering at the pandas level, generating a new DataFrame by applying multiple filters
test_grid_filtered = test_grid[(test_grid['a'] > 0) & (test_grid['b'] > 0)]
and then calling pairplot on the filtered DataFrame.
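For filtering per pair rather than per row, one possibility is to build the grid with seaborn's PairGrid and pass a custom plotting function to map_offdiag that drops the points sitting exactly at (0, 0) for that pair only. This is a sketch of that idea, not something from the answer above. The helper names drop_origin, scatter_skip_origin and plot_pairs are made up, and seaborn is imported inside plot_pairs so the filtering helper can be used on its own:

```python
import matplotlib.pyplot as plt
import pandas as pd


def drop_origin(x, y):
    """Drop the points of this pair where both coordinates are 0."""
    keep = ~((x == 0) & (y == 0))
    return x[keep], y[keep]


def scatter_skip_origin(x, y, **kwargs):
    # PairGrid makes the target axes current before calling this function
    x_kept, y_kept = drop_origin(x, y)
    plt.scatter(x_kept, y_kept, **kwargs)


def plot_pairs(data):
    import seaborn as sns  # only needed for the plotting step
    grid = sns.PairGrid(data)
    grid.map_offdiag(scatter_skip_origin)
    grid.map_diag(plt.hist)
    return grid


test_grid = pd.DataFrame({'a': [0, 10, 20, 30, 40, 50, 60],
                          'b': [0, 60, 50, 30, 20, 40, 80],
                          'c': [10, 40, 70, 30, 50, 80, 0],
                          'd': [50, 60, 80, 100, 50, 80, 0]})
# grid = plot_pairs(test_grid)  # uncomment where seaborn is installed
```

With this data, only the a/b panels lose the first row and only the c/d panels lose the last one. All other panels keep every point.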

Matplotlib Scatter Plot Color One Point

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'X':[1,2,3],
'Y':[4,5,6],
'Site':['foo','bar','baz']
})
df
Site X Y
0 foo 1 4
1 bar 2 5
2 baz 3 6
I want to iterate through rows in the data frame to produce 3 scatter plots (in this case, though a general solution for n rows is needed):
One in which the dot for "foo" is red and the rest are blue,
another in which the dot for "bar" is red and the rest are blue,
and a third in which the dot for "baz" is red and the rest are blue.
Here's a sample for "foo" done manually:
import matplotlib.pyplot as plt
%matplotlib inline
color=['r','b','b']
x=df['X']
y=df['Y']
plt.scatter(x, y, c=color, alpha=1,s=234)
plt.show()
Thanks in advance!
You have two options:
Create two "views" from the data, one with the to-be-red elements, and other with the remaining elements. You would apply conditional slicing for that, for example. Then, you plot one set in red, and the other in blue. That would be two scatter commands in the same figure. Repeat for each set.
Quite conveniently, the jet colormap is blue and red at its extremes (it was the default before matplotlib 2.0; with newer versions, pass cmap='jet' explicitly). You can then call scatter just once for all the data, with the c argument set to a boolean array obtained by comparing the original data against the value of interest. The desired items are then mapped to 1 and the others to 0, and colored accordingly.
Note: when I talk about conditional slicing, it's like:
interesting_items = array[array == interesting_value]
or some equivalent in Pandas.
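For the general case of n rows, a simple variant is to build an explicit color list per site and call scatter once per figure. This is a sketch along those lines rather than either option exactly as described above. The helper name highlight_colors is made up, and the Agg backend line is only there so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3],
                   'Y': [4, 5, 6],
                   'Site': ['foo', 'bar', 'baz']})


def highlight_colors(sites, target):
    """Red for the target site, blue for every other row."""
    return ['r' if site == target else 'b' for site in sites]


figures = {}
for site in df['Site']:
    fig, ax = plt.subplots()
    ax.scatter(df['X'], df['Y'], c=highlight_colors(df['Site'], site),
               alpha=1, s=234)
    ax.set_title(site)
    figures[site] = fig  # keep a handle, e.g. for fig.savefig(...)
```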

Resources