Slightly different question than you may be used to seeing regarding the pandas SettingWithCopyWarning. To better my understanding, I've been actively trying to trigger this warning and have actually had trouble doing so... but I stumbled onto two interesting blocks of code, which I believe should both generate the warning, yet only one does:
Creating the DataFrame:
import numpy as np
import pandas as pd
np.random.seed(1000)
df = pd.DataFrame(np.random.randn(5,5))
The following code generates a SettingWithCopyWarning as expected:
df2=df.loc[:,:3]
df2.loc[0,0] = 99
This block has one small difference and does not generate a SettingWithCopyWarning:
df2=df.loc[:4,:3]
df2.loc[0,0] = 99
Why does changing the first index item in .loc on the first line from ":" to ":4" stop the SettingWithCopyWarning from being generated, even though both slices pull all rows labeled 0-4?
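One way to probe what is happening (not from the original post, just a sketch) is to inspect the private _is_copy attribute that pandas sets on objects it suspects may be copies; it is an internal detail, so treat the expected output as an assumption to verify on your own pandas version:
import numpy as np
import pandas as pd

np.random.seed(1000)
df = pd.DataFrame(np.random.randn(5, 5))

a = df.loc[:, :3]    # warns on later assignment
b = df.loc[:4, :3]   # does not warn

# _is_copy is a private pandas attribute: a weakref to the parent frame when
# pandas has flagged the object as a potential copy, otherwise None.
print(a._is_copy)    # expected: a weakref, so assigning into `a` warns
print(b._is_copy)    # expected: None, so assigning into `b` stays silent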
I'm using Styler in pandas to display a DataFrame containing a timestamp in a Jupyter notebook.
The displayed value, 1623838447949609984, turned out to be different from the input, 1623838447949609899.
pandas version: 1.4.2.
Can someone please explain the reason for the following output?
Thanks.
import pandas as pd
pd.DataFrame([[1623838447949609899]]).style
Pandas Styler, within its render script, contains the line return f"{x:,.0f}" when x is an integer.
In python if you execute
>>> "{:.0f}".format(1623838447949609899)
'1623838447949609984'
you obtain the result you cite. The reason is floating-point conversion: the f format specifier converts the integer to a 64-bit float, which cannot represent 1623838447949609899 exactly and rounds it to the nearest representable value, 1623838447949609984. This is a property of float formatting, not of Styler itself.
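A quick way to see the precision loss, plus one possible workaround (using Styler.format with an explicit integer format string, which bypasses the float conversion), is sketched below; the default formatting path is a pandas internal, so verify the workaround on your version:
import pandas as pd

x = 1623838447949609899
print(float(x))    # the nearest representable float64, ~1.6238e+18
print(f"{x:.0f}")  # '1623838447949609984' -- formatted via float, digits lost
print(f"{x:d}")    # '1623838447949609899' -- integer formatting is exact

# possible workaround: tell Styler to format the cell as an integer
df = pd.DataFrame([[x]])
html = df.style.format("{:d}").to_html()
print(str(x) in html)   # expect True: the exact digits survive rendering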
My data is in a large multi-indexed pandas DataFrame. I re-index to flatten the DataFrame and then feed it through ColumnDataSource, but I need to group my data row-wise in order to plot it correctly (think a bunch of torque curves corresponding to a bunch of gears for a car). If I just plot the dictionary output of ColumnDataSource, it's a mess.
I've tried converting the ColumnDataSource output back to DataFrame, but then I lose the update functionality, the callback won't touch the DataFrame, and the plots won't change. Anyone have any ideas?
The short answer to the question in the title is "Yes". The ColumnDataSource is the special, central data structure of Bokeh. It provides the data for all the glyphs in a plot, or the content of data tables, and automatically keeps that data synchronized on the Python and JavaScript sides, so that you don't have to, e.g., write a bunch of low-level websocket code yourself. To update things like glyphs in a plot, you update the CDS that drives them.
It's possible there are improvements that could be made in your approach to updating the CDS, but it is impossible to speculate without seeing actual code for what you have tried.
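As a rough sketch of the usual pattern (the column names gear/rpm/torque and the sample data are illustrative, not from the original post): build one ColumnDataSource per group, drive one glyph from each, and update those sources inside the callback rather than going back to a DataFrame:
import pandas as pd
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource

# hypothetical flattened frame: one torque curve per gear
df = pd.DataFrame({
    "gear":   [1, 1, 1, 2, 2, 2],
    "rpm":    [1000, 2000, 3000, 1000, 2000, 3000],
    "torque": [120, 150, 140, 110, 135, 130],
})

p = figure(title="Torque curves")
sources = {}
for gear, grp in df.groupby("gear"):          # one line glyph per gear
    src = ColumnDataSource(grp)
    sources[gear] = src
    p.line(x="rpm", y="torque", source=src, legend_label=f"gear {gear}")

# later, inside a callback, replace the data behind a glyph in place:
# sources[1].data = new_grp_for_gear_1.to_dict(orient="list")
show(p)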
I just updated pandas from 0.17.1 to 0.21.0 to take advantage of some new functionality, and ran into a compatibility issue with matplotlib (which I also updated, to the latest 2.1.0). In particular, the Timestamp object seems to have changed significantly.
I happen to have another machine still running the older versions of pandas (0.17.1) / matplotlib (1.5.1), which I used to compare the differences:
Both versions show my DataFrame index to have dtype='datetime64[ns]':
DatetimeIndex(['2017-03-13', '2017-03-14', ... '2017-11-17'], dtype='datetime64[ns]', name='dates', length=170, freq=None)
But when calling type(df.index[0]), 0.17.1 gives pandas.tslib.Timestamp and 0.21.0 gives pandas._libs.tslib.Timestamp.
When plotting with df.index as x-axis:
plt.plot(df.index, df['data'])
matplotlib by default formats the x-axis labels as dates for pandas 0.17.1, but fails to recognize them for pandas 0.21.0 and simply shows raw numbers around 1.5e18 (epoch time in nanoseconds).
I also have a customized cursor that reports the clicked location on the graph by applying matplotlib.dates.DateFormatter to the x-value, which fails for 0.21.0 with:
OverflowError: signed integer is greater than maximum
In the debugger I can see that the reported x-value is around 736500 (i.e. day count since year 0) for 0.17.1, but around 1.5e18 (i.e. nanosecond epoch time) for 0.21.0.
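For reference, the two numbers come from two different conventions; a small check like the one below makes the mismatch visible (written against matplotlib 2.x, where date2num counts days from year 1, so the exact values are version-dependent):
import pandas as pd
import matplotlib.dates as mdates

ts = pd.Timestamp('2017-03-13')
print(ts.value)             # nanoseconds since the Unix epoch, roughly 1.49e18
print(mdates.date2num(ts))  # matplotlib's float day count, roughly 736,400 on
                            # matplotlib 2.x (the reference epoch changed later)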
I am surprised at this break of compatibility between matplotlib and pandas, as they are obviously used together by most people. Am I missing something in the way I call the plot function above for the newer versions?
Update: As I mentioned above, I prefer directly calling plot with a given axes object, but just for the heck of it I tried calling the plot method of the DataFrame itself, df.plot(). As soon as this is done, all subsequent plots correctly recognize the Timestamp within the same python session. It's as if an environment variable is set, because I can reload another DataFrame or create another axes with subplots, and nowhere does the 1.5e18 show up. This really smells like a bug, because the latest pandas doc says:
The plot method on Series and DataFrame is just a simple wrapper around plt.plot()
But clearly it does something to the python session such that subsequent plots deal with the Timestamp index properly.
In fact, simply running the example at the above pandas link:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
Depending on whether ts.plot() has been called beforehand, the following either formats the x-axis as dates correctly or does not:
plt.plot(ts.index,ts)
plt.show()
Once the member plot method has been called, subsequent calls to plt.plot on new Series or DataFrames auto-format correctly, without needing to call the member plot method again.
There is an issue with pandas datetimes and matplotlib stemming from the recent release of pandas 0.21, which no longer registers its converters at import time. Once you use those converters once (within pandas), they are registered and automatically used by matplotlib as well.
A workaround would be to register them manually,
import pandas.plotting._converter as pandacnv
pandacnv.register()
In any case, the issue is well known on both the pandas and matplotlib side, so there will be some kind of fix in the next releases. Pandas is considering re-adding the registration in an upcoming release, so this issue may only be there temporarily. An option is also to revert to pandas 0.20.x, where this does not occur.
Update: this is no longer an issue with current versions of matplotlib (2.2.2)/pandas(0.23.1), and likely many that have been released since roughly December 2017, when this was fixed.
Update 2: As of pandas 0.24 or higher the recommended way to register the converters is
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
or if pandas is already imported as pd,
pd.plotting.register_matplotlib_converters()
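Put together, a minimal sketch of the recommended pattern (pandas >= 0.24) looks like this; the data here is synthetic, just to show the date axis formatting working through plain plt.plot:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters

register_matplotlib_converters()   # make matplotlib understand pandas Timestamps

# synthetic stand-in for the original DataFrame
idx = pd.date_range('2017-03-13', periods=170, name='dates')
df = pd.DataFrame({'data': np.random.randn(len(idx))}, index=idx)

fig, ax = plt.subplots()
ax.plot(df.index, df['data'])      # x-axis now formats as dates, not 1.5e18
plt.show()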
After opening an issue on the pandas GitHub, I learned that this is indeed a known issue between pandas and matplotlib regarding auto-registration of the unit converters. In fact it is listed on the what's new page, which I had failed to see before, along with the proper way to register the converters:
from pandas.tseries import converter
converter.register()
This registration is also done the first time the plot method is called on a Series or DataFrame, which explains what I observed above.
It appears this was done with the intention that matplotlib should eventually implement some basic support for pandas datetimes itself, but a deprecation warning of some sort would have been useful for such a break. Until matplotlib actually implements that support (or some sort of lazy registration mechanism), I am simply putting those two lines right after the pandas import. So I'm not sure why pandas would want to disable the automatic registration on import before things are ready on the matplotlib side.
It looks like this issue has been fixed in newer versions of matplotlib.
Try running "pip install --upgrade matplotlib".
I ran into the same issue, "AttributeError: 'numpy.datetime64' object has no attribute 'toordinal'". It was fixed when I upgraded the matplotlib package.
I am using rolling mean on my data to smoothen it. My data can be found here.
An illustration of my original data:
Currently, I am using
import pandas as pd
import numpy as np
data = pd.read_excel('data.xlsx')               # load the raw data
data = np.array(data, dtype=np.float)           # convert to a float ndarray
window_length = 9
# centered rolling mean of column 2; min_periods=1 keeps values at the edges
res = pd.rolling_mean(data[:, 2], window_length, min_periods=1, center=True)
This is what I get after applying the rolling mean with a window_length of 9:
And when I increase the window_length to 20, I get a smoother curve, but at the boundaries the data seems to be erroneous.
The problem is, as seen in the figures above, the rolling mean introduces some sort of errors at the boundaries of my data which do not exist in the original data.
Is there any way to correct this?
My guess is that, at the boundary, since part of the window falls outside my data, the mean gets skewed.
Is there a way to correct this error using the pandas rolling mean, or is there a better pythonic way of doing this? Thanks.
P.S. I am aware the pandas rolling-mean function I am using is deprecated in the newer version.
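For reference, a sketch of the same smoothing with the non-deprecated rolling API (assuming the values of interest are in the third column, as in the snippet above):
import numpy as np
import pandas as pd

data = pd.read_excel('data.xlsx')                   # same input file as above
col = pd.Series(np.asarray(data)[:, 2], dtype=float)

# centered rolling mean; min_periods=1 means edge windows average fewer points,
# which is exactly what introduces the boundary bias discussed in the question
res = col.rolling(window=9, center=True, min_periods=1).mean()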
You can try a native 2D convolution method, such as scipy.ndimage.filters.convolve, with the weights chosen so that the kernel computes an average (mean).
The weights would be:
n = 3                               # size of kernel over which to calculate the mean
weights = np.ones((n, n)) / n**2    # uniform kernel that sums to 1
If the white areas of your data are represented by NaNs, this will shrink the footprint of the result by n, since any kernel placement that includes a NaN returns NaN. If this is really an issue, have a look at astropy.convolution, which handles NaNs better.
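A minimal sketch of applying such a mean kernel (the 2-D array here is synthetic; the mode argument controls how the boundary is padded, which is the part that matters for the edge artefacts in the question):
import numpy as np
from scipy.ndimage import convolve

n = 3
weights = np.ones((n, n)) / n**2        # uniform mean kernel

data2d = np.random.rand(50, 50)         # stand-in for the real 2-D data

# mode='nearest' extends the edge values instead of padding with zeros,
# which avoids artificially dragging the boundary toward zero
smoothed = convolve(data2d, weights, mode='nearest')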
I have an input source that gives me integers in [0..256].
I want to be able to locate spikes in this data, i.e. detect when a new input arrives.
I've tried using a rolling average in conjunction with the percent error, but this doesn't really work.
Basically, I want my program to find where a graph of the data would spike up, but I want it to ignore smooth transitions.
Thoughts?
A simple thought, following my comment above. First,
>>> import numpy as np
Suppose we have the following time series
>>> sample = np.random.randint(0, 257, size=100)  # upper bound exclusive; random_integers is deprecated
To know whether or not a spike can be considered a rare event, we have to know the likelihood associated with each event. Since you are dealing with "rates of change", let us compute those:
>>> sample_vars = np.abs(-1 + 1.*sample[1:]/sample[:-1]) # the "1.*" to get floats... (python<3)
We can then define the variation that has at most a 5 percent (sample) chance of occurring:
>>> spike_defining_threshold = np.percentile(sample_vars, 95)
Finally, if sample_vars[-1] > spike_defining_threshold, you can flag the latest observation as a spike.
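Pulling the pieces together into one sketch (the function name and the synthetic input are mine, not from the thread; zeros in the stream would need separate handling, since the rate of change divides by the previous value):
import numpy as np

def is_spike(sample, pct=95):
    """Flag the newest reading as a spike if its relative change is larger
    than all but (100 - pct)% of the changes seen so far."""
    sample = np.asarray(sample, dtype=float)
    rates = np.abs(sample[1:] / sample[:-1] - 1)   # relative change between readings
    threshold = np.percentile(rates, pct)          # e.g. the 95th percentile
    return rates[-1] > threshold

readings = np.random.randint(1, 257, size=100)     # synthetic stand-in, avoids zeros
print(is_spike(readings))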
Would be great if others have thoughts to share as well...