I had just started using Quandl and pandas when I came across this code:
import quandl
import pandas as pd
api_key=open('quandlapi.txt','r').read()
df = quandl.get("FMAC/HPI_TX", authtoken=api_key)
fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')
main_df = pd.DataFrame()
for abbv in fiddy_states[0][0][1:]:
    query = "FMAC/HPI_" + str(abbv)
    df = quandl.get(query, authtoken=api_key)
    if main_df.empty:
        main_df = df
    else:
        main_df = main_df.join(df)
But when I run it, I get the following error:
ValueError: columns overlap but no suffix specified: Index(['Value'], dtype='object')
Can anyone tell me what I am doing wrong here?
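The error message says that both frames being joined have a column named Value, and DataFrame.join will not combine frames that share a column name unless a suffix is given. Below is a minimal sketch of one possible fix: rename each frame's Value column to the state abbreviation before joining. The two small frames are made-up stand-ins for what quandl.get() returns (an assumption, so the example runs without an API key).

```python
import pandas as pd

# Hypothetical stand-ins for the quandl.get() results: each frame has one "Value" column.
idx = pd.date_range("2000-01-01", periods=3, freq="D")
frames = {
    "TX": pd.DataFrame({"Value": [1.0, 2.0, 3.0]}, index=idx),
    "CA": pd.DataFrame({"Value": [4.0, 5.0, 6.0]}, index=idx),
}

main_df = pd.DataFrame()
for abbv, df in frames.items():
    # Give each frame a unique column name so the join has nothing to clash on.
    df = df.rename(columns={"Value": abbv})
    if main_df.empty:
        main_df = df
    else:
        main_df = main_df.join(df)

print(main_df)
```

In the original loop, the same rename would go right after the quandl.get(query, authtoken=api_key) call.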
I am trying to extract data from Google Trends by using the pytrends library to analyze it in MS PowerBI by using the following script:
import pandas as pd
from pytrends.request import TrendReq
pytrends = TrendReq()
data = pd.DataFrame()
kw_list = ["Bitcoin", "Ethereum"]
pytrends.build_payload(kw_list, timeframe='today 3-m')
data = pytrends.interest_over_time()
print(data)
When I use this simple script in Power BI, the date column disappears. How can I include the date column?
import pandas as pd
from pytrends.request import TrendReq
pytrends = TrendReq()
data = pd.DataFrame()
kw_list = ["Bitcoin", "Ethereum"]
pytrends.build_payload(kw_list, timeframe='today 3-m')
data = pytrends.interest_over_time()
data.reset_index(inplace=True)
print(data)
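The date is the DataFrame's index rather than a regular column. You just need to add the second-to-last line, data.reset_index(inplace=True), which moves the index into a regular column.
Hope this works.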
Thanks!
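To illustrate what reset_index does here, this is a small sketch with a made-up frame standing in for the result of pytrends.interest_over_time() (the values and the index name "date" are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical stand-in for the pytrends result: keywords as columns, dates as the index.
data = pd.DataFrame(
    {"Bitcoin": [50, 60], "Ethereum": [20, 25]},
    index=pd.DatetimeIndex(["2021-01-03", "2021-01-10"], name="date"),
)

# reset_index() turns the index into an ordinary column named after the index.
flat = data.reset_index()
print(flat.columns.tolist())
```

After the call, the dates are ordinary data alongside the keyword columns instead of row labels.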
I want the output of this code as int64, but it comes out as float. How can I change it? Please suggest.
import pandas as pd
import numpy as np
df = pd.read_csv('https://query.data.world/s/HqjNNadqEnwSq1qnoV_JqyRJkc7o6O')
df = df[df.isnull().sum(axis=1) < 5]
print(round(100*(df.isnull().sum()/len(df.index))),2)
Something like this should do the trick...
import pandas as pd
import numpy as np
df = pd.read_csv('https://query.data.world/s/HqjNNadqEnwSq1qnoV_JqyRJkc7o6O')
df = df[df.isnull().sum(axis=1) < 5]
x = round(100*(df.isnull().sum()/len(df.index)))
y = x.astype(np.int64)
print(y)
The key part is x.astype(np.int64), which converts the Series to the int64 dtype.
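Here is the same idea as a self-contained sketch on a tiny made-up frame (the data is an assumption, used so the example runs without downloading the CSV).

```python
import numpy as np
import pandas as pd

# Small made-up frame with some missing values.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
})

# Percentage of missing values per column; this is a float Series.
pct_missing = (100 * df.isnull().sum() / len(df.index)).round()

# Convert the rounded floats to 64-bit integers.
pct_int = pct_missing.astype(np.int64)
print(pct_int)
```

Rounding before the conversion matters, because astype(np.int64) on its own truncates toward zero rather than rounding.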
I have a dataset on which I was practicing feature engineering by converting categorical (object) columns to numbers, using the following lines of code:
import pandas as pd
import numpy as np
from sklearn import preprocessing
df = pd.read_csv(r'train.csv',index_col='Id')
print(df.shape)
df.head()
colsNum = df.select_dtypes(np.number).columns
colsObj = df.columns.difference(colsNum)
df[colsNum] = df[colsNum].fillna(df[colsNum].mean()//1)
df[colsObj] = df[colsObj].fillna(df[colsObj].mode().iloc[0])
label_encoder = preprocessing.LabelEncoder()
for col in colsObj:
    df[col] = label_encoder.fit_transform(df[col])
df.head()
for col in colsObj:
    df[col] = label_encoder.inverse_transform(df[col])
df.head()
But here inverse_transform() does not return the original dataset. Please help me!
You need one encoder per column - you cannot encode all columns with the same encoder:
import pandas as pd
import numpy as np
from sklearn import preprocessing
df = pd.read_csv(r'train.csv', index_col='Id')
print(df.shape)
colsNum = df.select_dtypes(np.number).columns
colsObj = df.columns.difference(colsNum)
df[colsNum] = df[colsNum].fillna(df[colsNum].mean()//1)
df[colsObj] = df[colsObj].fillna(df[colsObj].mode().iloc[0])
print(df.head())
encoder = {}
for col in colsObj:
    encoder[col] = preprocessing.LabelEncoder()
    df[col] = encoder[col].fit_transform(df[col])
print(df.head())
for col in colsObj:
    df[col] = encoder[col].inverse_transform(df[col])
print(df.head())
You can also check out this answer for further details.
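To see the round trip on something small, here is a sketch using a made-up two-column frame (the column names and values are assumptions). Each call to fit_transform refits the encoder, so a single shared encoder only remembers the classes of the last column it saw; keeping one encoder per column avoids that.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical data for illustration.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "L"]})

# One encoder per column, stored by column name.
encoders = {}
for col in df.columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

encoded = df.copy()  # integer codes, assigned in sorted order of the labels

# Decode each column with the encoder that was fitted on it.
for col in df.columns:
    df[col] = encoders[col].inverse_transform(df[col])

print(encoded)
print(df)
```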
I get a KeyError for "Displacement" when I try to plot Force against Displacement with pandas for this group of DataFrames. Please help.
The link to the Excel sheet being used:
https://www.dropbox.com/s/f8lnp973ojv3ish/neurospheress.xlsx?dl=0
I tried stripping spaces from the column titles, but it doesn't work.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_excel('neurospheress.xlsx', sep='\s*,\s*', sheet_name = 'LS')
df1 = data.iloc[:80,:2]
df2 = data.iloc[:80,2:4]
df3 = data.iloc[:80,4:]
dfs = [df1,df2,df3]
for i, df in enumerate(dfs):
    plt.plot(df['Displacement'], df['Force'], linestyle='--', alpha=0.8, label='df{}'.format(i))
plt.legend(loc='best')
plt.show()
The solution below works. It adds two things to your code:
a) Skip the first row of the Excel sheet
b) Rename the columns of df2 and df3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# sep is not a read_excel parameter, so it is omitted here
data = pd.read_excel('neurospheress.xlsx', sheet_name='LS', skiprows=1)
df1 = data.iloc[:80, :2]
# Rename on new objects rather than in place on slices of data
df2 = data.iloc[:80, 2:4].rename(columns={'Force.1': 'Force', 'Displacement.1': 'Displacement'})
df3 = data.iloc[:80, 4:].rename(columns={'Force.2': 'Force', 'Displacement.2': 'Displacement'})
dfs = [df1, df2, df3]
print(data.columns)
print(df1.columns)
print(df2.columns)
for i, df in enumerate(dfs):
    plt.plot(df['Displacement'], df['Force'], linestyle='--', alpha=0.8, label='df{}'.format(i))
plt.legend(loc='best')
plt.show()
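The renaming is needed because pandas de-duplicates repeated header names by appending .1, .2 and so on. The sketch below shows this with read_csv on an in-memory string, used here only as a stand-in so the example runs without the Excel file (the data is made up). It also uses a small hypothetical helper, strip_suffix, as an alternative to spelling out each rename.

```python
import io
import pandas as pd

# Made-up data with repeated header names, like the sheet in the question.
csv_text = (
    "Displacement,Force,Displacement,Force\n"
    "0.0,1.0,0.0,2.0\n"
    "0.1,1.5,0.1,2.5\n"
)
data = pd.read_csv(io.StringIO(csv_text))
print(data.columns.tolist())  # the repeated names get numeric suffixes

def strip_suffix(name):
    """Hypothetical helper: drop the '.1', '.2', ... suffix pandas added."""
    return name.split(".")[0]

df1 = data.iloc[:, :2]
df2 = data.iloc[:, 2:4].rename(columns=strip_suffix)
print(df2.columns.tolist())
```

Note that strip_suffix would also shorten any genuine column name containing a dot, so the explicit dictionary in the answer is the safer choice when that can happen.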
I'm following the bokeh tutorial and in the basic plotting section, I can't manage to show a plot. I only get the axis. What am I missing?
Here is the code:
df = pd.DataFrame.from_dict(AAPL)
weekapple = df.loc["2000-03-01":"2000-04-01"]
p = figure(x_axis_type="datetime", title="AAPL", plot_height=350, plot_width=800)
p.xgrid.grid_line_color=None
p.ygrid.grid_line_alpha=0.5
p.xaxis.axis_label = 'Time'
p.yaxis.axis_label = 'Value'
p.line(weekapple.date, weekapple.close)
show(p)
I get this: [screenshot: "My result"]
I'm trying to complete the exercise here (10th code cell, the exercise with AAPL data). I was able to follow all the previous code up to that point correctly.
Thanks in advance!
In case this is still relevant, this is how you should do your selection:
df = pd.DataFrame.from_dict(AAPL)
# Convert date column in df from strings to the proper datetime format
date_format="%Y-%m-%d"
df["date"] = pd.to_datetime(df['date'], format=date_format)
# Use the same conversion for selected dates
weekapple = df[(df.date >= dt.strptime("2000-03-01", date_format)) &
               (df.date <= dt.strptime("2000-04-01", date_format))]
p = figure(x_axis_type="datetime", title="AAPL", plot_height=350, plot_width=800)
p.xgrid.grid_line_color=None
p.ygrid.grid_line_alpha=0.5
p.xaxis.axis_label = 'Time'
p.yaxis.axis_label = 'Value'
p.line(weekapple.date, weekapple.close)
show(p)
To make this work, before this code, I have (in my Jupyter notebook):
import numpy as np
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
import bokeh
import pandas as pd
from datetime import datetime as dt
bokeh.sampledata.download()
from bokeh.sampledata.stocks import AAPL
output_notebook()
As described at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html, .loc selects by index labels (or boolean arrays). In your DataFrame, date is not the index; it is a regular column.
I hope this helps.
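For reference, here is a small sketch of two ways to do the date selection, using a made-up frame in place of the AAPL sample data (the values are assumptions, chosen only so the example runs without bokeh): a boolean mask on the column, or moving date into the index so that .loc can slice by label.

```python
import pandas as pd

# Made-up stand-in for the AAPL data: dates are strings in a regular column.
df = pd.DataFrame({
    "date": ["2000-02-28", "2000-03-01", "2000-03-15", "2000-04-03"],
    "close": [110.0, 130.3, 116.2, 133.3],
})
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

# Option 1: boolean mask on the date column (both ends inclusive).
start, end = pd.Timestamp("2000-03-01"), pd.Timestamp("2000-04-01")
by_mask = df[df["date"].between(start, end)]

# Option 2: make date the index; .loc can then slice by date label.
by_loc = df.set_index("date").loc["2000-03-01":"2000-04-01"]

print(by_mask)
print(by_loc)
```

With the second option the dates live in the index, so the plotting call would use weekapple.index instead of weekapple.date.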
Your DataFrame sub-view is empty:
In [3]: import pandas as pd
   ...: from bokeh.sampledata.stocks import AAPL
   ...: df = pd.DataFrame.from_dict(AAPL)
   ...: weekapple = df.loc["2000-03-01":"2000-04-01"]
In [4]: weekapple
Out[4]:
Empty DataFrame
Columns: [date, open, high, low, close, volume, adj_close]
Index: []