Drop duplicates from structured NumPy array (Python 3.x)
Take the following Array:
import numpy as np
arr_dupes = np.array(
[
('2017-09-13T11:05:00.000000', 1.32685, 1.32704, 1.32682, 1.32686, 1.32684, 1.32702, 1.32679, 1.32683, 246),
('2017-09-13T11:05:00.000000', 1.32685, 1.32704, 1.32682, 1.32686, 1.32684, 1.32702, 1.32679, 1.32683, 246),
('2017-09-13T11:05:00.000000', 1.32685, 1.32704, 1.32682, 1.32686, 1.32684, 1.32702, 1.32679, 1.32683, 222),
('2017-09-13T11:04:00.000000', 1.32683, 1.32686, 1.32682, 1.32685, 1.32682, 1.32684, 1.3268 , 1.32684, 97),
('2017-09-13T11:03:00.000000', 1.32664, 1.32684, 1.32663, 1.32683, 1.32664, 1.32683, 1.32661, 1.32682, 268),
('2017-09-13T11:02:00.000000', 1.3268 , 1.32692, 1.3266 , 1.32664, 1.32678, 1.32689, 1.32658, 1.32664, 299),
('2017-09-13T11:02:00.000000', 1.3268 , 1.32692, 1.3266 , 1.32664, 1.32678, 1.32689, 1.32658, 1.32664, 299),
('2017-09-13T11:01:00.000000', 1.32648, 1.32682, 1.32648, 1.3268 , 1.32647, 1.32682, 1.32647, 1.32678, 322),
('2017-09-13T11:00:00.000000', 1.32647, 1.32649, 1.32628, 1.32648, 1.32644, 1.32651, 1.32626, 1.32647, 285)],
dtype=[('date', '<M8[us]'), ('askopen', '<f8'), ('askhigh', '<f8'), ('asklow', '<f8'), ('askclose', '<f8'),
('bidopen', '<f8'), ('bidhigh', '<f8'), ('bidlow', '<f8'), ('bidclose', '<f8'), ('volume', '<i8')]
)
What is the fastest solution to remove duplicates, using the dates as an index and keeping the last value?
The pandas DataFrame equivalent is:
In [5]: df = pd.DataFrame(arr_dupes, index=arr_dupes['date'])
In [6]: df
Out[6]:
date askopen askhigh asklow askclose bidopen bidhigh bidlow bidclose volume
2017-09-13 11:05:00 2017-09-13 11:05:00 1.32685 1.32704 1.32682 1.32686 1.32684 1.32702 1.32679 1.32683 246
2017-09-13 11:05:00 2017-09-13 11:05:00 1.32685 1.32704 1.32682 1.32686 1.32684 1.32702 1.32679 1.32683 246
2017-09-13 11:05:00 2017-09-13 11:05:00 1.32685 1.32704 1.32682 1.32686 1.32684 1.32702 1.32679 1.32683 222
2017-09-13 11:04:00 2017-09-13 11:04:00 1.32683 1.32686 1.32682 1.32685 1.32682 1.32684 1.32680 1.32684 97
2017-09-13 11:03:00 2017-09-13 11:03:00 1.32664 1.32684 1.32663 1.32683 1.32664 1.32683 1.32661 1.32682 268
2017-09-13 11:02:00 2017-09-13 11:02:00 1.32680 1.32692 1.32660 1.32664 1.32678 1.32689 1.32658 1.32664 299
2017-09-13 11:02:00 2017-09-13 11:02:00 1.32680 1.32692 1.32660 1.32664 1.32678 1.32689 1.32658 1.32664 299
2017-09-13 11:01:00 2017-09-13 11:01:00 1.32648 1.32682 1.32648 1.32680 1.32647 1.32682 1.32647 1.32678 322
2017-09-13 11:00:00 2017-09-13 11:00:00 1.32647 1.32649 1.32628 1.32648 1.32644 1.32651 1.32626 1.32647 285
In [7]: df.reset_index().drop_duplicates(subset='date', keep='last').set_index('date')
Out[7]:
index askopen askhigh asklow askclose bidopen bidhigh bidlow bidclose volume
date
2017-09-13 11:05:00 2017-09-13 11:05:00 1.32685 1.32704 1.32682 1.32686 1.32684 1.32702 1.32679 1.32683 222
2017-09-13 11:04:00 2017-09-13 11:04:00 1.32683 1.32686 1.32682 1.32685 1.32682 1.32684 1.32680 1.32684 97
2017-09-13 11:03:00 2017-09-13 11:03:00 1.32664 1.32684 1.32663 1.32683 1.32664 1.32683 1.32661 1.32682 268
2017-09-13 11:02:00 2017-09-13 11:02:00 1.32680 1.32692 1.32660 1.32664 1.32678 1.32689 1.32658 1.32664 299
2017-09-13 11:01:00 2017-09-13 11:01:00 1.32648 1.32682 1.32648 1.32680 1.32647 1.32682 1.32647 1.32678 322
2017-09-13 11:00:00 2017-09-13 11:00:00 1.32647 1.32649 1.32628 1.32648 1.32644 1.32651 1.32626 1.32647 285
numpy.unique seems to compare the entire row (tuple), so rows that share a date but differ in another field are still returned as distinct.
The final output should look like this:
array([
('2017-09-13T11:05:00.000000', 1.32685, 1.32704, 1.32682, 1.32686, 1.32684, 1.32702, 1.32679, 1.32683, 222),
('2017-09-13T11:04:00.000000', 1.32683, 1.32686, 1.32682, 1.32685, 1.32682, 1.32684, 1.3268 , 1.32684, 97),
('2017-09-13T11:03:00.000000', 1.32664, 1.32684, 1.32663, 1.32683, 1.32664, 1.32683, 1.32661, 1.32682, 268),
('2017-09-13T11:02:00.000000', 1.3268 , 1.32692, 1.3266 , 1.32664, 1.32678, 1.32689, 1.32658, 1.32664, 299),
('2017-09-13T11:01:00.000000', 1.32648, 1.32682, 1.32648, 1.3268 , 1.32647, 1.32682, 1.32647, 1.32678, 322),
('2017-09-13T11:00:00.000000', 1.32647, 1.32649, 1.32628, 1.32648, 1.32644, 1.32651, 1.32626, 1.32647, 285)],
dtype=[('date', '<M8[us]'), ('askopen', '<f8'), ('askhigh', '<f8'), ('asklow', '<f8'), ('askclose', '<f8'),
('bidopen', '<f8'), ('bidhigh', '<f8'), ('bidlow', '<f8'), ('bidclose', '<f8'), ('volume', '<i8')]
)
Thank-you
It seems that the solution to your problem doesn't have to mimic pandas' drop_duplicates() exactly, but I'll provide one version that mimics it and one that doesn't.
If you need the exact same behavior as pandas' drop_duplicates(), then the following code is the way to go:
# initialization of arr_dupes here

# actual algorithm
helper1, helper2 = np.unique(arr_dupes['date'][::-1], return_index=True)  # unique dates and the index of their first occurrence in the reversed array
result = arr_dupes[::-1][helper2][::-1]  # take those rows from the reversed array, then restore the original order
Once arr_dupes is initialized, you pass only the 'date' column to numpy.unique(). Since you are interested in the last of the non-unique elements, you also reverse the array you pass to unique() with [::-1]; that way unique() throws out every non-unique element except the last one.
unique() then returns the unique elements (helper1) as its first return value and the indices of those elements in the reversed array (helper2) as its second.
Lastly, a new array is created by picking the rows listed in helper2 from the reversed array and reversing the result back into the original order.
This solution is about 9.898 times faster than the pandas version.
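A rough way to check the relative speed on your own data (a sketch only; the wrapper names dedupe_numpy and dedupe_pandas are mine, and the exact ratio will depend on array size and hardware):

import numpy as np
import pandas as pd
from timeit import timeit

def dedupe_numpy(a):
    # unique dates of the reversed array -> index of the last occurrence in the original
    _, idx = np.unique(a['date'][::-1], return_index=True)
    return a[::-1][idx][::-1]

def dedupe_pandas(a):
    # mirrors the pandas snippet from the question
    df = pd.DataFrame(a, index=a['date'])
    return df.reset_index().drop_duplicates(subset='date', keep='last').set_index('date')

# assumes arr_dupes is defined as in the question
print(timeit(lambda: dedupe_numpy(arr_dupes), number=10000))
print(timeit(lambda: dedupe_pandas(arr_dupes), number=10000))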
Now let me explain what I meant at the beginning of this answer. It seems to me that your array is sorted by the 'date' column. If that is true, we can assume that duplicates are grouped together, and then we only need to keep rows whose next row's 'date' column differs from the current row's 'date' column. For example, take the following array rows:
...
('2017-09-13T11:05:00.000000', 1.32685, 1.32704, 1.32682, 1.32686, 1.32684, 1.32702, 1.32679, 1.32683, 246),
('2017-09-13T11:05:00.000000', 1.32685, 1.32704, 1.32682, 1.32686, 1.32684, 1.32702, 1.32679, 1.32683, 246),
('2017-09-13T11:05:00.000000', 1.32685, 1.32704, 1.32682, 1.32686, 1.32684, 1.32702, 1.32679, 1.32683, 222),
('2017-09-13T11:04:00.000000', 1.32683, 1.32686, 1.32682, 1.32685, 1.32682, 1.32684, 1.3268 , 1.32684, 97),
...
The third row's 'date' column is different from the fourth's, so we need to keep it; no further checks are needed. The first row's 'date' column is the same as the second row's, so we don't need that row, and the same goes for the second row.
So in code it looks like this:
# initialization of arr_dupes here

# actual algorithm: keep a row only if the next row's date differs (plus the very last row)
result = arr_dupes[np.concatenate((arr_dupes['date'][:-1] != arr_dupes['date'][1:], np.array([True])))]
First, every element of the 'date' column is compared with the next element, which creates an array of Trues and Falses. If an index in this boolean array has True assigned to it, then the arr_dupes row at that index needs to stay; otherwise it needs to go.
Next, concatenate() simply appends one final True to this boolean array, since the last element always needs to stay in the resulting array.
This solution is about 17 times faster than the pandas version.
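If you are not certain the array is already sorted so that equal dates sit next to each other, a minimal sketch of sorting first; kind='stable' keeps the original order within equal dates, so the row kept for each date is still the last occurrence from the original array (the result then comes out in ascending date order):

import numpy as np

# assumes arr_dupes is defined as in the question
order = np.argsort(arr_dupes['date'], kind='stable')  # stable sort: equal dates keep their original order
a = arr_dupes[order]
mask = np.concatenate((a['date'][:-1] != a['date'][1:], np.array([True])))  # True on the last row of each date group
result = a[mask]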
Related
df.groupby(df.index.to_period('M')).last() generates a new date index exceeding the range of the initial date index
Given sample data as follows:

import pandas as pd
import numpy as np

np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 4)), index=dates,
                  columns=['A_values', 'B_values', 'C_values', 'target'])
df

Out:
            A_values  B_values  C_values    target
2013-02-26  6.059783  7.333694  1.389472  3.126731
2013-02-27  9.972433  1.281624  1.789931  7.529254
2013-02-28  6.621605  7.843101  0.968944  0.585713
2013-03-01  9.623960  6.165574  0.866300  5.612724
2013-03-02  6.165247  9.638430  5.743043  3.711608
...              ...       ...       ...       ...
2013-05-22  0.589729  6.479978  3.531450  6.872059
2013-05-23  6.279065  3.837670  8.853146  8.209883
2013-05-24  5.533017  5.241127  1.388056  5.355926
2013-05-25  1.596038  4.665995  2.406251  1.971875
2013-05-26  3.269001  1.787529  6.659690  7.545569

With the code below, we can see that the last row's index exceeds the range of the initial date index (whose maximum date is 2013-05-26):

df.groupby(pd.Grouper(freq='M')).last()

Out[177]:
            A_values  B_values  C_values    target
2013-02-28  6.621605  7.843101  0.968944  0.585713
2013-03-31  5.906967  8.545341  6.326550  8.684117
2013-04-30  5.358775  1.473809  5.231534  0.604810
2013-05-31  3.269001  1.787529  6.659690  7.545569

and:

df.groupby(df.index.to_period('M')).apply(lambda x: x.index.max())

Out[178]:
2013-02   2013-02-28
2013-03   2013-03-31
2013-04   2013-04-30
2013-05   2013-05-26
Freq: M, dtype: datetime64[ns]

But I hope to get the expected result below. How could I do that? Thanks.

            A_values  B_values  C_values    target
2013-02-28  6.621605  7.843101  0.968944  0.585713
2013-03-31  5.906967  8.545341  6.326550  8.684117
2013-04-30  5.358775  1.473809  5.231534  0.604810
2013-05-26  3.269001  1.787529  6.659690  7.545569   # date should be `2013-05-26` based on the original data
The idea is to create a helper column from the DatetimeIndex and, after last(), convert that column back into the index:

df = (df.assign(new=df.index)
        .groupby(pd.Grouper(freq='M'))
        .last()
        .set_index('new')
        .rename_axis(None))
print(df)

            A_values  B_values  C_values    target
2013-02-28  6.621605  7.843101  0.968944  0.585713
2013-03-31  5.906967  8.545341  6.326550  8.684117
2013-04-30  5.358775  1.473809  5.231534  0.604810
2013-05-26  3.269001  1.787529  6.659690  7.545569
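A hedged alternative that avoids the helper column: GroupBy.tail(1) returns each group's last row with its original index intact, which here is the latest date actually present in each month (this assumes df is sorted by its DatetimeIndex, as in the sample):

out = df.groupby(pd.Grouper(freq='M')).tail(1)   # last row per calendar month, original dates kept as the index
print(out)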
CRS error while clipping rioxarray to shapefile
I'm trying to clip a rioxarray dataset to a shapefile, but get the following error:

> data_clipped = data.rio.clip(shape.geometry.apply(mapping))
MissingCRS: CRS not found. Please set the CRS with 'set_crs()' or 'write_crs()'. Data variable: precip

This error seems straightforward, but I can't figure out which CRS needs to be set. Both the dataset and the shapefile have CRS values that rio can find:

> print(data.rio.crs)
EPSG:4326
> print(shape.crs)
epsg:4326

The data array within the dataset, called 'precip', does not have a CRS, but it also doesn't seem to respond to the set_crs() command:

> print(data.precip.rio.crs)
None
> data.precip.rio.set_crs(data.rio.crs)
> print(data.precip.rio.crs)
None

What am I missing here? For reference, the rioxarray set_crs() documentation shows set_crs() working on data arrays, unlike my experience with data.precip.

My data, in case I have something unusual:

> print(data)
<xarray.Dataset>
Dimensions:      (x: 541, y: 411)
Coordinates:
  * y            (y) float64 75.0 74.9 74.8 74.7 74.6 ... 34.3 34.2 34.1 34.0
  * x            (x) float64 -12.0 -11.9 -11.8 -11.7 ... 41.7 41.8 41.9 42.0
    time         object 2020-01-01 00:00:00
    spatial_ref  int64 0
Data variables:
    precip       (y, x) float64 nan nan nan ... 1.388e-17 1.388e-17 1.388e-17
Attributes:
    Conventions:  CF-1.6
    history:      2021-01-05 01:36:52 GMT by grib_to_netcdf-2.16.0: /opt/ecmw...

> print(shape)
        ID        name   orgn_name                                           geometry
0  Albania   Shqipëria              MULTIPOLYGON (((19.50115 40.96230, 19.50563 40...
1  Andorra     Andorra              POLYGON ((1.43992 42.60649, 1.45041 42.60596, ...
2  Austria  Österreich              POLYGON ((16.00000 48.77775, 16.00000 48.78252...
This issue is resolved if set_crs() is used in the same command as the clip operation:

data_clipped = data.precip.rio.set_crs('WGS84').rio.clip(shape.geometry.apply(mapping))
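For context, the standalone call appears to do nothing because set_crs() returns a new object rather than modifying data.precip in place, which is why chaining it as above works while the separate call is lost. A sketch of the write_crs() route suggested by the error message, assuming the EPSG:4326 CRS shown earlier and passing shape.crs so rioxarray knows the geometries' CRS:

from shapely.geometry import mapping  # as in the question's clip call

# sketch: write the CRS onto the data array once, then clip
precip = data.precip.rio.write_crs("EPSG:4326")   # returns the array with the CRS attached
data_clipped = precip.rio.clip(shape.geometry.apply(mapping), shape.crs)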
train Word2vec model using Gensim
This is my code. It reads reviews from an Excel file (the rev column) and makes a list of lists. xp looks like this:

["['intrepid', 'bumbling', 'duo', 'deliver', 'good', 'one'],['better', 'offering', 'considerable', 'cv', 'freshly', 'qualified', 'private', 'investigator', 'thrust', 'murder', 'investigation', 'invisible'],[ 'man', 'alone', 'tell', 'fun', 'flow', 'decent', 'clip', 'need', 'say', 'sequence', 'comedy', 'gold', 'like', 'scene', 'restaurant', 'excellent', 'costello', 'pretending', 'work', 'ball', 'gym', 'final', 'reel']"]

but when I use the list for the model, it gives me the error "TypeError: 'float' object is not iterable". I don't know where my problem is. Thanks.

xp = []
import gensim
import logging
import pandas as pd

file = r'FileNamelast.xlsx'
df = pd.read_excel(file, sheet_name='FileNamex')
pages = [i for i in range(0, 1000)]
for page in pages:
    text = df.loc[page, ["rev"]]
    xp.append(text[0])

model = gensim.models.Word2Vec(xp, size=150, window=10, min_count=2, workers=10)
model.train(xp, total_examples=len(xp), epochs=10)

This is what I got:

TypeError: 'float' object is not iterable
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-32-aa34c0e432bf> in <module>()
     14
     15
---> 16 model = gensim.models.Word2Vec (xp, size=150, window=10, min_count=2, workers=10)
     17 model.train(xp,total_examples=len(xp),epochs=10)

C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\word2vec.py in __init__(self, sentences, corpus_file, size, alpha, window, min_count, max_vocab_size, sample, seed, workers, min_alpha, sg, hs, negative, ns_exponent, cbow_mean, hashfxn, iter, null_word, trim_rule, sorted_vocab, batch_words, compute_loss, callbacks, max_final_vocab)
    765             callbacks=callbacks, batch_words=batch_words, trim_rule=trim_rule, sg=sg, alpha=alpha, window=window,
    766             seed=seed, hs=hs, negative=negative, cbow_mean=cbow_mean, min_alpha=min_alpha, compute_loss=compute_loss,
--> 767             fast_version=FAST_VERSION)
    768
    769     def _do_train_epoch(self, corpus_file, thread_id, offset, cython_vocab, thread_private_mem, cur_epoch,

C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py in __init__(self, sentences, corpus_file, workers, vector_size, epochs, callbacks, batch_words, trim_rule, sg, alpha, window, seed, hs, negative, ns_exponent, cbow_mean, min_alpha, compute_loss, fast_version, **kwargs)
    757                 raise TypeError("You can't pass a generator as the sentences argument. Try an iterator.")
    758
--> 759             self.build_vocab(sentences=sentences, corpus_file=corpus_file, trim_rule=trim_rule)
    760             self.train(
    761                 sentences=sentences, corpus_file=corpus_file, total_examples=self.corpus_count,

C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py in build_vocab(self, sentences, corpus_file, update, progress_per, keep_raw_vocab, trim_rule, **kwargs)
    934         """
    935         total_words, corpus_count = self.vocabulary.scan_vocab(
--> 936             sentences=sentences, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)
    937         self.corpus_count = corpus_count
    938         self.corpus_total_words = total_words

C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\word2vec.py in scan_vocab(self, sentences, corpus_file, progress_per, workers, trim_rule)
   1569             sentences = LineSentence(corpus_file)
   1570
-> 1571         total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
   1572
   1573         logger.info(

C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\word2vec.py in _scan_vocab(self, sentences, progress_per, trim_rule)
   1552                     sentence_no, total_words, len(vocab)
   1553                 )
-> 1554             for word in sentence:
   1555                 vocab[word] += 1
   1556                 total_words += len(sentence)

TypeError: 'float' object is not iterable
The sentences corpus argument to Word2Vec should be an iterable sequence of lists-of-word-tokens. Your reported value for xp is actually a list with one long string in it:

[
  "['intrepid', 'bumbling', 'duo', 'deliver', 'good', 'one'],['better', 'offering', 'considerable', 'cv', 'freshly', 'qualified', 'private', 'investigator', 'thrust', 'murder', 'investigation', 'invisible'],[ 'man', 'alone', 'tell', 'fun', 'flow', 'decent', 'clip', 'need', 'say', 'sequence', 'comedy', 'gold', 'like', 'scene', 'restaurant', 'excellent', 'costello', 'pretending', 'work', 'ball', 'gym', 'final', 'reel']"
]

I don't see how this would give the error you've reported, but it's definitely wrong, so it should be fixed. You should perhaps print xp just before you instantiate Word2Vec to be sure you know what it contains.

A true list, with each item being a list-of-string-tokens, would work. So if xp were the following, that would be correct:

[
  ['intrepid', 'bumbling', 'duo', 'deliver', 'good', 'one'],
  ['better', 'offering', 'considerable', 'cv', 'freshly', 'qualified', 'private', 'investigator', 'thrust', 'murder', 'investigation', 'invisible'],
  ['man', 'alone', 'tell', 'fun', 'flow', 'decent', 'clip', 'need', 'say', 'sequence', 'comedy', 'gold', 'like', 'scene', 'restaurant', 'excellent', 'costello', 'pretending', 'work', 'ball', 'gym', 'final', 'reel']
]

Note, however:

Word2Vec doesn't do well with toy-sized datasets. So while this tiny setup may be helpful to check for basic syntax/format issues, don't expect realistic results until you're training with many hundreds-of-thousands of words.

You don't need to call train() if you already supplied your corpus at instantiation, as you have; the model will do all the steps automatically. (If, on the other hand, you don't supply your corpus at instantiation, you'd then have to call both build_vocab() and train().)

If you enable logging at the INFO level, all the steps happening behind the scenes will be clearer.
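As a concrete illustration of the points above (fix the corpus format, and skip the separate train() call): if each cell of the rev column holds a string that merely looks like a Python list of tokens, one hedged way to build a proper list-of-lists is ast.literal_eval, skipping NaN cells, which pandas returns as floats and which would raise exactly the "'float' object is not iterable" error. The file, sheet, and column names are taken from the question; the parsing approach itself is an assumption about how the Excel data is stored.

import ast
import gensim
import pandas as pd

df = pd.read_excel(r'FileNamelast.xlsx', sheet_name='FileNamex')

xp = []
for cell in df['rev']:
    if isinstance(cell, str):              # NaN cells come back as float; skip them
        xp.append(ast.literal_eval(cell))  # "['a', 'b', 'c']" -> ['a', 'b', 'c']

# corpus supplied at instantiation, so no separate train() call is needed
# (gensim 3.x uses `size`; in gensim 4.x this parameter is called `vector_size`)
model = gensim.models.Word2Vec(xp, size=150, window=10, min_count=2, workers=10)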
How to divide the dataframe into bins of specific length with unequal number of points?
I have a dataframe and I want to divide it into bins of equal width (the number of data points in each bin may not be the same). I have tried the following approach:

df = pc13.sort_values(by=['A'], ascending=True)
df_temp = np.array_split(df, 20)

But this approach divides the dataframe into bins with an equal number of data points. Instead, I want to divide the dataframe into bins of a particular width, where the number of data points per bin may differ. The minimum value in the dataframe column A is -0.04843731030699292 and the maximum value is 0.05417013917000033. I tried uploading the entire dataframe but it is a very big file.
You can do something like:

# create a random df
df = pd.DataFrame(np.random.randn(10, 10), columns=list('ABCDEFGHIJ'))
# sort values
df = df.sort_values(by=['A'], ascending=True)
# use your code but on a transposed dataframe
new = np.array_split(df.T, 5)  # split columns into 5 bins
# list comprehension to transpose the dataframes back
dfs = [new[i].T for i in range(len(new))]

Update:

# random df
df = pd.DataFrame(np.random.randn(1000, 5), columns=list('ABCDE'))
# sort on A
df.sort_values('A', inplace=True)
# create bins
df['bin'] = pd.cut(df['A'], 20, include_lowest=True)
# group on bin
group = df.groupby('bin')
# list comprehension to split groups into a list of dataframes
dfs = [group.get_group(x) for x in group.groups]

[            A         B         C         D         E               bin
218 -2.716093  0.833726 -0.771400  0.691251  0.162448  (-2.723, -2.413]
207 -2.581388 -2.318333 -0.001467  0.035277  1.219666  (-2.723, -2.413]
380 -2.499710  1.946709 -0.519070  1.653383  0.309689  (-2.723, -2.413]
866 -2.492050  0.246500 -0.596392  0.872888  2.371652  (-2.723, -2.413]
876 -2.469238 -0.156470 -0.841065 -1.248793 -0.489665  (-2.723, -2.413]
314 -2.456308  0.630691 -0.072146  1.139697  0.663674  (-2.723, -2.413]
310 -2.455353  0.075842  0.589515 -0.427233  1.207979  (-2.723, -2.413]
660 -2.427255  0.890125 -0.042716 -1.038401  0.651324  (-2.723, -2.413],
              A         B         C         D         E              bin
571 -2.355430  0.383794 -1.266575 -1.214833 -0.862611  (-2.413, -2.11]
977 -2.354416 -1.964189  0.440376  0.028032 -0.181360  (-2.413, -2.11]
83  -2.276908  0.288462  0.370555 -0.546359 -2.033892  (-2.413, -2.11]
196 -2.213729 -1.087783 -0.592884  1.233886  1.051164  (-2.413, -2.11]
227 -2.146631  0.365183 -0.095293 -0.882414  0.385117  (-2.413, -2.11]
39  -2.136800 -1.150065  0.180182 -0.424071  0.040370  (-2.413, -2.11],
              A         B         C         D         E              bin
104 -2.108961 -0.396602 -1.014224 -1.277124  0.001030  (-2.11, -1.806]
360 -2.098928  1.093483  1.438421 -0.980215  0.010359  (-2.11, -1.806]
530 -2.088592  1.043201 -0.522468  0.482176 -0.680166  (-2.11, -1.806]
158 -2.062759  2.070387  2.124621 -2.751532  0.674055  (-2.11, -1.806]
971 -2.053039  0.347577 -0.498513  1.917305 -1.746493  (-2.11, -1.806]
658 -2.002482 -1.222292 -0.398816  0.279228 -1.485782  (-2.11, -1.806]
90  -1.985261  3.499251 -2.089028  1.238524 -1.781089  (-2.11, -1.806]
466 -1.973640 -1.609920 -1.029454  0.809143 -0.228893  (-2.11, -1.806]
40  -1.966016 -1.479240 -1.564966 -0.310133  1.338023  (-2.11, -1.806]
279 -1.943666  0.762493  0.060038  0.449159  0.244411  (-2.11, -1.806]
204 -1.940045  0.844901 -0.343691 -1.144836  1.385915  (-2.11, -1.806]
780 -1.918548  0.212452  0.225789  0.216110  1.710532  (-2.11, -1.806]
289 -1.897438  0.847664  0.689778 -0.454152 -0.747836  (-2.11, -1.806]
159 -1.848425  0.477726  0.391384 -0.477804  0.168160  (-2.11, -1.806],
. . .
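Since the goal is bins of a particular width rather than a particular count, a sketch passing explicit edges to pd.cut (the 0.005 width is only an illustrative value; pick whatever width you need between the min and max of column A):

import numpy as np
import pandas as pd

width = 0.005                                                   # hypothetical bin width
edges = np.arange(df['A'].min(), df['A'].max() + width, width)  # fixed-width bin edges
df['bin'] = pd.cut(df['A'], bins=edges, include_lowest=True)

# one dataframe per bin; bins with no points come out empty
dfs = [g for _, g in df.groupby('bin')]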
Areas between lines not filled correctly with "fill_between" in Matplotlib
A set of time-series data is plotted with Matplotlib as follows:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def WillR(ohlc, window):
    high, low, close = ohlc['High'], ohlc['Low'], ohlc['Close']
    R = (np.nanmax(high, axis=0) - close) / (np.nanmax(high, axis=0) - np.nanmin(low, axis=0)) * -100
    return R

ohlc = df[::-1]
R = WillR(ohlc, 14)
xAxis = ohlc.index

fig = plt.figure(figsize=(18,9), dpi=100)
ax = fig.add_axes([0.0, 0.6, 1.0, 0.6])
plt.ylabel('Williams '+'%R', fontsize = 16)
ax.xaxis.grid(color='grey', linestyle=':')
ax.yaxis.grid(color='grey', linestyle=':')
ax.plot(R, color='k', alpha=0.8, linewidth=1.2)
ax.axhline(-20, color='r', linewidth=0.8)
ax.axhline(-50, color='k', linestyle='dashed', linewidth=1.2)
ax.axhline(-80, color='g', linewidth=0.8)
ax.fill_between(xAxis, R, -20, where=(R >= -20), facecolor='r', edgecolor='r', alpha=0.5)
ax.fill_between(xAxis, R, -80, where=(R <= -80), facecolor='g', edgecolor='g', alpha=0.5)

In the "Williams %R" graph, the area above the "-20" line is to be filled in red and the area below "-80" in green. I could not get the bounded areas filled completely - see Figure 1. I have also tried adding the argument "interpolate=True" and the result is still unsatisfactory - see Figure 2. How do I make it work correctly? Thanks.

Sample data for df as follows:

                  Date   Open    High     Low  Close
Date
2017-06-27  2017-06-27  10.75  10.850  10.665  10.70
2017-06-26  2017-06-26  10.57  10.740  10.570  10.69
2017-06-23  2017-06-23  10.59  10.710  10.500  10.63
2017-06-22  2017-06-22  10.56  10.720  10.530  10.65
2017-06-21  2017-06-21  10.75  10.750  10.500  10.53
2017-06-20  2017-06-20  10.88  10.925  10.780  10.81
2017-06-19  2017-06-19  10.77  10.920  10.760  10.82
2017-06-16  2017-06-16  10.74  10.910  10.740  10.78
2017-06-15  2017-06-15  10.86  10.930  10.690  10.76
2017-06-14  2017-06-14  10.96  11.040  10.910  11.02
2017-06-13  2017-06-13  10.57  10.930  10.570  10.93
2017-06-09  2017-06-09  10.59  10.650  10.500  10.57
2017-06-08  2017-06-08  10.28  10.590  10.280  10.57
2017-06-07  2017-06-07  10.35  10.420  10.290  10.35
2017-06-06  2017-06-06  10.66  10.700  10.280  10.38
2017-06-05  2017-06-05  10.80  10.850  10.560  10.70
2017-06-02  2017-06-02  11.32  11.390  11.190  11.19
2017-06-01  2017-06-01  11.26  11.330  11.160  11.20
2017-05-31  2017-05-31  11.22  11.400  11.190  11.27
2017-05-30  2017-05-30  11.10  11.260  11.070  11.23
2017-05-29  2017-05-29  11.36  11.370  11.045  11.08
2017-05-26  2017-05-26  11.55  11.590  11.320  11.36
2017-05-25  2017-05-25  11.62  11.670  11.455  11.66
2017-05-24  2017-05-24  11.67  11.755  11.590  11.62
2017-05-23  2017-05-23  11.77  11.835  11.590  11.64
2017-05-22  2017-05-22  12.06  12.110  11.800  11.83
2017-05-19  2017-05-19  12.15  12.235  11.990  12.03
2017-05-18  2017-05-18  12.00  12.130  11.910  12.13
2017-05-17  2017-05-17  12.28  12.300  12.120  12.14
2017-05-16  2017-05-16  12.23  12.385  12.220  12.30
2017-05-15  2017-05-15  12.10  12.255  12.030  12.22
2017-05-12  2017-05-12  12.30  12.300  12.050  12.12
2017-05-11  2017-05-11  12.30  12.445  12.270  12.34
2017-05-10  2017-05-10  11.99  12.415  11.950  12.25
2017-05-09  2017-05-09  12.07  12.100  11.730  11.79
2017-05-08  2017-05-08  12.14  12.230  12.060  12.10
2017-05-05  2017-05-05  12.24  12.250  12.020  12.04
2017-05-04  2017-05-04  12.23  12.270  12.050  12.14
2017-05-03  2017-05-03  12.43  12.460  12.200  12.23
2017-05-02  2017-05-02  12.44  12.470  12.310  12.46
2017-05-01  2017-05-01  12.25  12.440  12.230  12.44
2017-04-28  2017-04-28  12.47  12.480  12.230  12.32
2017-04-27  2017-04-27  12.39  12.515  12.380  12.50
2017-04-26  2017-04-26  12.03  12.510  12.030  12.37
2017-04-24  2017-04-24  12.00  12.055  11.920  12.03
2017-04-21  2017-04-21  11.88  11.990  11.840  11.88
2017-04-20  2017-04-20  11.78  11.840  11.710  11.80
2017-04-19  2017-04-19  11.70  11.745  11.610  11.74
2017-04-18  2017-04-18  11.95  11.950  11.605  11.74
2017-04-13  2017-04-13  11.95  12.010  11.920  11.95
2017-04-12  2017-04-12  12.05  12.050  11.945  12.01
2017-04-11  2017-04-11  11.95  12.140  11.945  12.08
2017-04-10  2017-04-10  11.79  11.930  11.780  11.91
2017-04-07  2017-04-07  11.83  11.830  11.690  11.78
2017-04-06  2017-04-06  11.85  11.870  11.740  11.78
2017-04-05  2017-04-05  12.06  12.140  11.800  11.96
2017-04-04  2017-04-04  12.03  12.160  11.930  12.07
2017-04-03  2017-04-03  12.14  12.180  11.860  11.98
2017-03-30  2017-03-30  12.01  12.230  11.970  12.14
2017-03-29  2017-03-29  11.99  12.050  11.900  12.03
The data needs to be sorted. Then the option interpolate=True will provide the required plot.

u = u"""Date        Date1        Open   High    Low     Close
2017-06-27  2017-06-27  10.75  10.850  10.665  10.70
2017-06-26  2017-06-26  10.57  10.740  10.570  10.69
2017-06-23  2017-06-23  10.59  10.710  10.500  10.63
2017-06-22  2017-06-22  10.56  10.720  10.530  10.65
2017-06-21  2017-06-21  10.75  10.750  10.500  10.53
2017-06-20  2017-06-20  10.88  10.925  10.780  10.81
2017-06-19  2017-06-19  10.77  10.920  10.760  10.82
2017-06-16  2017-06-16  10.74  10.910  10.740  10.78
2017-06-15  2017-06-15  10.86  10.930  10.690  10.76
2017-06-14  2017-06-14  10.96  11.040  10.910  11.02
2017-06-13  2017-06-13  10.57  10.930  10.570  10.93
2017-06-09  2017-06-09  10.59  10.650  10.500  10.57
2017-06-08  2017-06-08  10.28  10.590  10.280  10.57
2017-06-07  2017-06-07  10.35  10.420  10.290  10.35
2017-06-06  2017-06-06  10.66  10.700  10.280  10.38
2017-06-05  2017-06-05  10.80  10.850  10.560  10.70
2017-06-02  2017-06-02  11.32  11.390  11.190  11.19
2017-06-01  2017-06-01  11.26  11.330  11.160  11.20
2017-05-31  2017-05-31  11.22  11.400  11.190  11.27
2017-05-30  2017-05-30  11.10  11.260  11.070  11.23
2017-05-29  2017-05-29  11.36  11.370  11.045  11.08
2017-05-26  2017-05-26  11.55  11.590  11.320  11.36
2017-05-25  2017-05-25  11.62  11.670  11.455  11.66
2017-05-24  2017-05-24  11.67  11.755  11.590  11.62
2017-05-23  2017-05-23  11.77  11.835  11.590  11.64
2017-05-22  2017-05-22  12.06  12.110  11.800  11.83
2017-05-19  2017-05-19  12.15  12.235  11.990  12.03
2017-05-18  2017-05-18  12.00  12.130  11.910  12.13
2017-05-17  2017-05-17  12.28  12.300  12.120  12.14
2017-05-16  2017-05-16  12.23  12.385  12.220  12.30
2017-05-15  2017-05-15  12.10  12.255  12.030  12.22
2017-05-12  2017-05-12  12.30  12.300  12.050  12.12
2017-05-11  2017-05-11  12.30  12.445  12.270  12.34
2017-05-10  2017-05-10  11.99  12.415  11.950  12.25
2017-05-09  2017-05-09  12.07  12.100  11.730  11.79
2017-05-08  2017-05-08  12.14  12.230  12.060  12.10
2017-05-05  2017-05-05  12.24  12.250  12.020  12.04
2017-05-04  2017-05-04  12.23  12.270  12.050  12.14
2017-05-03  2017-05-03  12.43  12.460  12.200  12.23
2017-05-02  2017-05-02  12.44  12.470  12.310  12.46
2017-05-01  2017-05-01  12.25  12.440  12.230  12.44
2017-04-28  2017-04-28  12.47  12.480  12.230  12.32
2017-04-27  2017-04-27  12.39  12.515  12.380  12.50
2017-04-26  2017-04-26  12.03  12.510  12.030  12.37
2017-04-24  2017-04-24  12.00  12.055  11.920  12.03
2017-04-21  2017-04-21  11.88  11.990  11.840  11.88
2017-04-20  2017-04-20  11.78  11.840  11.710  11.80
2017-04-19  2017-04-19  11.70  11.745  11.610  11.74
2017-04-18  2017-04-18  11.95  11.950  11.605  11.74
2017-04-13  2017-04-13  11.95  12.010  11.920  11.95
2017-04-12  2017-04-12  12.05  12.050  11.945  12.01
2017-04-11  2017-04-11  11.95  12.140  11.945  12.08
2017-04-10  2017-04-10  11.79  11.930  11.780  11.91
2017-04-07  2017-04-07  11.83  11.830  11.690  11.78
2017-04-06  2017-04-06  11.85  11.870  11.740  11.78
2017-04-05  2017-04-05  12.06  12.140  11.800  11.96
2017-04-04  2017-04-04  12.03  12.160  11.930  12.07
2017-04-03  2017-04-03  12.14  12.180  11.860  11.98
2017-03-30  2017-03-30  12.01  12.230  11.970  12.14
2017-03-29  2017-03-29  11.99  12.050  11.900  12.03"""

import io
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
#%matplotlib inline

def WillR(ohlc, window):
    high, low, close = ohlc['High'], ohlc['Low'], ohlc['Close']
    R = (np.nanmax(high, axis=0) - close) / (np.nanmax(high, axis=0) - np.nanmin(low, axis=0)) * -100
    return R

df = pd.read_csv(io.StringIO(u), delim_whitespace=True, index_col=0, parse_dates=True)

ohlc = df[::-1]
R = WillR(ohlc, 14)
xAxis = ohlc.index

fig = plt.figure(figsize=(18,9), dpi=100)
ax = plt.subplot()
plt.ylabel('Williams '+'%R', fontsize = 16)
ax.xaxis.grid(color='grey', linestyle=':')
ax.yaxis.grid(color='grey', linestyle=':')
ax.plot(R, color='k', alpha=0.8, linewidth=1.2)
ax.axhline(-20, color='r', linewidth=0.8)
ax.axhline(-50, color='k', linestyle='dashed', linewidth=1.2)
ax.axhline(-80, color='g', linewidth=0.8)
ax.fill_between(xAxis, R, -20, where=(R >= -20), facecolor='r', edgecolor='r', alpha=0.5, interpolate=True)
ax.fill_between(xAxis, R, -80, where=(R <= -80), facecolor='g', edgecolor='g', alpha=0.5, interpolate=True)
plt.show()
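If you are applying this to your own df rather than the pasted string, sorting by the index is the equivalent step (a one-line sketch; reversing with df[::-1] as above also works here because the sample rows are already in strictly descending date order):

ohlc = df.sort_index()   # ascending dates before computing R and filling between the curves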