Is this a bug in the sparse multi-index display in pandas?
(Running IPython 6.2.1, Python 3.6.4)
In [1]:
import pandas as pd
from io import StringIO

# Note: each line below ends with an explicit \n inside a triple-quoted
# string, so the parsed text contains blank lines between records;
# read_table skips those by default (skip_blank_lines=True).
data = """
code\tname\ttyp\tntf\n
A5411\tWD\tAF\t\n
A5411\tWD\tAF\t210194618\n
B5498\tSH\tNC\t\n
B5498\tSH\tNC\t210213014\n
"""
df = pd.read_table(StringIO(data))
In [2]: df.set_index(['name','code'])
Out[2]:
             typ          ntf
name code
WD   A5411    AF          NaN
     A5411    AF  210194618.0
SH   B5498    NC          NaN
     B5498    NC  210213014.0
I was expecting Out[2] to sparsify the repeated code values, the way Out[3] does:
In [3]: df.set_index(['name', 'code', 'typ'])
Out[3]:
                            ntf
name code  typ
WD   A5411 AF               NaN
           AF       210194618.0
SH   B5498 NC               NaN
           NC       210213014.0
Any idea on this?
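Not a full answer, but one way to narrow it down: sparsified MultiIndex display is controlled by a pandas option, so a minimal check (a sketch, assuming default display settings) is to confirm the option is on and to rebuild the frame without the read_table round trip:

import pandas as pd

# Sparse MultiIndex display is governed by this option (default True).
print(pd.get_option('display.multi_sparse'))

# Rebuild the same frame directly, bypassing read_table, to see whether
# the 'code' level sparsifies as expected when printed.
df2 = pd.DataFrame(
    {'typ': ['AF', 'AF', 'NC', 'NC'],
     'ntf': [None, 210194618, None, 210213014]},
    index=pd.MultiIndex.from_tuples(
        [('WD', 'A5411'), ('WD', 'A5411'), ('SH', 'B5498'), ('SH', 'B5498')],
        names=['name', 'code']),
)
print(df2)

If df2 prints with the code level blanked out on repeated rows, the issue is specific to the frame produced by read_table rather than to the display machinery.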
I am trying to plot the residuals from statsmodels' AutoRegResults, but results.resid contains only NaN. However, when I call plot_diagnostics() it plots the standardized residuals with no issues. How can I get the actual residuals?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg
import warnings

df = pd.read_csv('Bank_of_England_Database.csv',
                 sep=',',
                 parse_dates=["Date"],
                 dayfirst=True,
                 index_col="Date")
df.rename({list(df.columns)[-1]: 'Spot Exchange Rate'},
          axis='columns',
          inplace=True)
# Roll over the column, not the whole frame, so a Series is assigned.
df['RW 11'] = df['Spot Exchange Rate'].rolling(window=11, min_periods=11, center=True).mean()
xbar = df['Spot Exchange Rate'].mean()
df['demean'] = df['Spot Exchange Rate'] - xbar

fig = plt.figure()
fig.suptitle("AR(p) Residuals")
lags = [1]  # , 2, 3]
for lag in lags:
    warnings.filterwarnings("ignore")  # Stops a FutureWarning and ValueWarning about dates
    model = AutoReg(df['demean'], lags=lag)
    results = model.fit()
    resid = results.resid  # Returns NaN
    print(resid.head())
    plt.plot(df.index[lag:], resid, label=f"lag={lag}")
    results.plot_diagnostics()
plt.show()
Date
2015-05-01 NaN
2015-05-05 NaN
2015-05-06 NaN
2015-05-07 NaN
2015-05-08 NaN
dtype: float64
No handles with labels found to put in legend.
[Image: my residual plot, which is empty because the residuals are all NaN]
[Image: the plot_diagnostics figure, which renders correctly]
EDIT
Updated code showing a version with the same issue:
import pandas as pd
import numpy as np
from statsmodels.tsa.api import AutoReg
import statsmodels as sm
import matplotlib.pyplot as plt

print(f"statsmodel version: {sm.__version__}")
df = pd.read_csv('Bank_of_England_Database.csv',
                 sep=',',
                 parse_dates=["Date"],
                 dayfirst=True,
                 index_col="Date")
df.rename({list(df.columns)[-1]: 'Spot Exchange Rate'},
          axis='columns',
          inplace=True)
df['demean'] = df['Spot Exchange Rate'] - df['Spot Exchange Rate'].mean()
res = AutoReg(df['demean'], lags=2).fit()
res.plot_diagnostics()  # was results.plot_diagnostics(), a NameError
print(f"All NaN: {np.isnan(res.resid).all()}")
plt.show()
Results:
statsmodel version: 0.12.2
/usr/local/lib/python3.9/site-packages/statsmodels/tsa/base/tsa_model.py:581: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
warnings.warn('A date index has been provided, but it has no'
/usr/local/lib/python3.9/site-packages/statsmodels/tsa/ar_model.py:248: FutureWarning: The parameter names will change after 0.12 is released. Set old_names to False to use the new names now. Set old_names to True to use the old names.
warnings.warn(
All NaN: True
It doesn't seem to be possible to reproduce this issue using statsmodels 0.12.2.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.api import AutoReg

y = ArmaProcess.from_coeffs([1.8, -0.9]).generate_sample(250)
res = AutoReg(y, lags=2, old_names=False).fit()
print(f"All finite (no nan/inf): {np.all(np.isfinite(res.resid))}")

# Try a Series
ys = pd.Series(y, index=pd.date_range("1900-1-1", periods=250, freq="M"))
res = AutoReg(ys, lags=2, old_names=False).fit()
print(f"All finite using Series (no nan/inf): {np.all(np.isfinite(res.resid))}")

# Try a Series with no freq
idx = pd.date_range("1900-1-1", periods=750, freq="M")
uneven_idx = sorted(np.random.default_rng().choice(idx, size=250, replace=False))
ys = pd.Series(y, index=uneven_idx)
res = AutoReg(ys, lags=2, old_names=False).fit()  # was AutoReg(y, ...), which skipped the no-freq case
print(f"All finite using Series w/o freq (no nan/inf): {np.all(np.isfinite(res.resid))}")
This produces
All finite (no nan/inf): True
All finite using Series (no nan/inf): True
All finite using Series w/o freq (no nan/inf): True
Upgrading to the latest release might be needed.
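Since the model itself reproduces fine, another thing worth checking is the input series: the question doesn't show the CSV contents, so this is speculative, but if 'demean' contains NaN (for example, because of missing or non-numeric entries in the rate column), the residuals will be all NaN too. A sketch of that diagnosis, assuming the same file and column layout as above:

import pandas as pd

df = pd.read_csv('Bank_of_England_Database.csv',
                 sep=',',
                 parse_dates=["Date"],
                 dayfirst=True,
                 index_col="Date")

# If the rate column has missing or non-numeric entries, the demeaned
# series will contain NaN, and AutoReg (with the default missing='none')
# carries those NaN straight through into params and resid.
print(df.dtypes)
print(df.isna().sum())

# Coerce to numeric and drop unusable rows before fitting.
col = df.columns[-1]
df[col] = pd.to_numeric(df[col], errors='coerce')
clean = (df[col] - df[col].mean()).dropna()
print(f"Usable observations: {len(clean)}")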
I have a DataFrame, shown below, that is stored in xlsx format:
Date in_days t_factor
2020-02-01 1 5
2020-02-06 6 14
2020-02-09 9 14
2020-02-03 3 11
2020-02-11 11 14
I would like to create a Flask app that reads this data from its location and converts it to a list of dictionaries.
This is my first time using Flask.
I tried the code below in a Jupyter notebook to return the list of dictionaries:
import datetime as dt
import pandas as pd

df = pd.read_excel('data.xlsx')

def get_function_model_data(caliberation_df):
    df = caliberation_df.copy()
    df.fillna(0, inplace=True)
    df['Date'] = df['Date'].apply(lambda x: dt.datetime.strftime(x, '%Y-%m-%d'))
    df.rename(columns={'in_days': 'InDays'}, inplace=True)
    df = df[['Date', 'InDays']]
    return list(df.T.to_dict().values())
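For the sample table above, calling get_function_model_data(df) should return something like:

[{'Date': '2020-02-01', 'InDays': 1},
 {'Date': '2020-02-06', 'InDays': 6},
 {'Date': '2020-02-09', 'InDays': 9},
 {'Date': '2020-02-03', 'InDays': 3},
 {'Date': '2020-02-11', 'InDays': 11}]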
I would like the Flask app to read the Excel file from its location and return the list of dictionaries for the selected columns.
I tried the code below:
import pandas as pd
import numpy as np
import datetime as dt
from flask import Flask

app = Flask(__name__)

def get_function_model_data():
    df = pd.read_excel('data.xlsx')
    df.fillna(0, inplace=True)
    df['Date'] = df['Date'].apply(lambda x: dt.datetime.strftime(x, '%Y-%m-%d'))
    df.rename(columns={'in_days': 'InDays'}, inplace=True)
    df = df[['Date', 'InDays']]
    return list(df.T.to_dict().values())
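One way to finish this off is to expose the function through a route and return JSON. A minimal sketch, assuming a hypothetical /data endpoint name (Flask's jsonify handles the serialization):

import datetime as dt
import pandas as pd
from flask import Flask, jsonify

app = Flask(__name__)

def get_function_model_data():
    df = pd.read_excel('data.xlsx')
    df.fillna(0, inplace=True)
    df['Date'] = df['Date'].apply(lambda x: dt.datetime.strftime(x, '%Y-%m-%d'))
    df.rename(columns={'in_days': 'InDays'}, inplace=True)
    return list(df[['Date', 'InDays']].T.to_dict().values())

# Hypothetical endpoint name; change it to whatever fits your app.
@app.route('/data')
def model_data():
    # jsonify turns the list of dicts into a JSON response with the right headers.
    return jsonify(get_function_model_data())

if __name__ == '__main__':
    app.run(debug=True)

Visiting http://127.0.0.1:5000/data should then return the same list of dictionaries as JSON.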
I am trying to extract the domain name from the URLs in one column into another column. It works on a single string, but when I apply it to the DataFrame column it doesn't work. How do I apply this to a DataFrame?
Tried:
from urllib.parse import urlparse
import pandas as pd
import numpy as np  # needed for np.NaN below

id1 = [1, 2, 3]
ls = ['https://google.com/tensoflow', 'https://math.com/some/website', np.NaN]
df = pd.DataFrame({'id': id1, 'url': ls})
df
# urlparse(df['url'])      # ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# df['url'].map(urlparse)  # AttributeError: 'float' object has no attribute 'decode'
Working on a string:
string = 'https://google.com/tensoflow'
parsed_uri = urlparse(string)
result = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
result
What I am looking for in the new column:
col3
https://google.com/
https://math.com/
nan
You can try something like this.
Here I have used pandas.Series.apply() to solve it.
» Initialization and imports
>>> from urllib.parse import urlparse
>>> import pandas as pd
>>> id1 = [1,2,3]
>>> import numpy as np
>>> ls = ['https://google.com/tensoflow','https://math.com/some/website',np.NaN]
>>> ls
['https://google.com/tensoflow', 'https://math.com/some/website', nan]
>>>
» Inspect the newly created DataFrame.
>>> df = pd.DataFrame({'id':id1,'url':ls})
>>> df
id url
0 1 https://google.com/tensoflow
1 2 https://math.com/some/website
2 3 NaN
>>>
>>> df["url"]
0 https://google.com/tensoflow
1 https://math.com/some/website
2 NaN
Name: url, dtype: object
>>>
» Applying a function using pandas.Series.apply(func) on the url column.
>>> df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else np.nan)
0 https://google.com/
1 https://math.com/
2 NaN
Name: url, dtype: object
>>>
>>> df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else str(np.nan))
0 https://google.com/
1 https://math.com/
2 nan
Name: url, dtype: object
>>>
>>>
» Store the above result in a variable (not mandatory, just to simplify).
>>> s = df["url"].apply(lambda url: "{uri.scheme}://{uri.netloc}/".format(uri=urlparse(url)) if not pd.isna(url) else str(np.nan))
>>> s
0 https://google.com/
1 https://math.com/
2 nan
Name: url, dtype: object
>>>
» Finally
>>> df2 = pd.DataFrame({"col3": s})
>>> df2
col3
0 https://google.com/
1 https://math.com/
2 nan
>>>
» To make sure what s and df2 are, check their types (again, not mandatory).
>>> type(s)
<class 'pandas.core.series.Series'>
>>>
>>>
>>> type(df2)
<class 'pandas.core.frame.DataFrame'>
>>>
Reference links:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.isnull.html
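If you'd rather keep everything in the original frame, the same idea can be written as a named helper and assigned straight to a new column. A sketch; get_domain is a hypothetical helper name, not a pandas API:

from urllib.parse import urlparse
import numpy as np
import pandas as pd

def get_domain(url):
    # Return scheme://netloc/ for a URL, or NaN for missing values.
    if pd.isna(url):
        return np.nan
    uri = urlparse(url)
    return f"{uri.scheme}://{uri.netloc}/"

df = pd.DataFrame({'id': [1, 2, 3],
                   'url': ['https://google.com/tensoflow',
                           'https://math.com/some/website',
                           np.nan]})
df['col3'] = df['url'].apply(get_domain)
print(df[['id', 'col3']])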
I have two files named "Posterior_C.txt" and "Posterior_l.txt", each containing 5000 float entries, that I would like to import and concatenate into a DataFrame (for plotting in seaborn). Each entry belonging to Posterior_C should be given the label C, and each entry belonging to Posterior_l should be labeled l.
How can I import the data and concatenate the two, while creating a unique identifier for each? E.g.:
0.012 Posterior_C
0.0021 Posterior_C
0.2 Posterior_l
0.52 Posterior_l
This is what I've got so far:
import pandas as pd
import numpy as np

C = np.loadtxt("Posterior_C.txt")
l = np.loadtxt("Posterior_l.txt")
df = {C, l}          # this creates a set, not a table, and numpy arrays are not hashable
df = pd.DataFrame(df)

xc = np.array(["C"])
c = np.repeat(xc, 5000, axis=0)
xl = np.array(["l"])
l = np.repeat(xl, 5000, axis=0)

But I am a bit stuck now.
In R I would do:
C <- read.table("Posterior_C.txt", header=FALSE)
l <- read.table("Posterior_l.txt", header=FALSE)
df <- rbind(C, l)
df <- as.data.frame(df)
dfID <- c(rep("C", NROW(C)), rep("l", NROW(l)))
df$ID <- dfID
or something similar.
Something like this:
import pandas as pd

# Read each file into a one-column frame, tag each with its label,
# then stack the two.
c = pd.read_table("Posterior_C.txt", header=None)
l = pd.read_table("Posterior_l.txt", header=None)
c['ID'] = 'C'
l['ID'] = 'l'
df = pd.concat([c, l], ignore_index=True)
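Since the stated goal is plotting in seaborn, it may also help to name the value column up front and use the ID column as the hue. A sketch, assuming seaborn >= 0.11 for the data=/x=/hue= interface:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# names= labels the single data column so it is easier to refer to later.
c = pd.read_table("Posterior_C.txt", header=None, names=['value'])
l = pd.read_table("Posterior_l.txt", header=None, names=['value'])
c['ID'] = 'C'
l['ID'] = 'l'
df = pd.concat([c, l], ignore_index=True)

# One density curve per posterior, colored by the ID label.
sns.kdeplot(data=df, x='value', hue='ID')
plt.show()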
I want to read a pickle file in Python 3.5. I am using the following code.
The following is my output; I want to load it as a pandas DataFrame.
When I try to convert it into a DataFrame using df = pd.DataFrame(df), I get the error below:
ValueError: arrays must all be same length
Link to the data: https://drive.google.com/file/d/1lSFBPLbUCluWfPjzolUZKmD98yelTSXt/view?usp=sharing
I think you need a dict comprehension with concat:
from pandas.io.json import json_normalize
import pandas as pd
import pickle

fh = open("imdbnames40.pkl", 'rb')
d = pickle.load(fh)

df = pd.concat({k: json_normalize(v, 'scores', ['best']) for k, v in d.items()})
print(df.head())
                           ethnicity  score              best
'Aina Rapoza 0                 Asian   0.89             Asian
             1        GreaterAfrican   0.05             Asian
             2       GreaterEuropean   0.06             Asian
             3    IndianSubContinent   0.11  GreaterEastAsian
             4      GreaterEastAsian   0.89  GreaterEastAsian
Then, if you need a column from the first level of the MultiIndex:
df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
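As a side note, in pandas 1.0+ json_normalize is exposed at the top level, so the import can be simplified. The same pipeline, sketched with a context manager for the file handle:

import pickle
import pandas as pd

with open("imdbnames40.pkl", "rb") as fh:
    d = pickle.load(fh)

# pd.json_normalize replaces pandas.io.json.json_normalize in pandas >= 1.0.
df = pd.concat({k: pd.json_normalize(v, 'scores', ['best']) for k, v in d.items()})
df = df.reset_index(level=1, drop=True).rename_axis('names').reset_index()
print(df.head())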