Concatenated data from pandas_datareader - python-3.x

I am trying to create a dataframe whose columns come from 2 different dataframes.
import pandas as pd
import numpy as np
from statsmodels import api as sm
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2016,12,2)
end = datetime.datetime.today()
df = web.get_data_yahoo(['F'], start, end)
df1 = web.get_data_yahoo(['^GSPC'], start, end)
df3 = pd.concat([df['Adj Close'], df1['Adj Close']])
With this I wanted to get df3 with 2 columns containing the [Adj Close] data. What I got instead is:
                    F        ^GSPC
Date
2016-12-01  10.297861          NaN
2016-12-02  10.140451          NaN
2016-12-05  10.306145          NaN
2016-12-06  10.405562          NaN
2016-12-07  10.819797          NaN
...               ...          ...
2019-11-22        NaN  3110.290039
2019-11-25        NaN  3133.639893
2019-11-26        NaN  3140.520020
2019-11-27        NaN  3153.629883
2019-11-29        NaN  3140.979980

1508 rows × 2 columns
What do I need to do to get rid of the NaN values, and why are they there?

Add the parameter axis=1 to concatenate by columns in concat; by default concat stacks along axis=0 (rows), so the two series are appended one under the other and each date is NaN in the column it did not come from:
df3 = pd.concat([df['Adj Close'], df1['Adj Close']], axis=1)
But I think your solution could be simplified by passing both tickers in one list to get_data_yahoo:
df3 = web.get_data_yahoo(['F', '^GSPC'], start, end)
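Note that requesting several tickers in one call typically returns a frame with a MultiIndex on the columns (attribute on one level, ticker on the other), so both adjusted-close series can still be pulled out with a single lookup. A minimal sketch, assuming the Yahoo endpoint responds and returns that layout:
import datetime
import pandas_datareader.data as web

start = datetime.datetime(2016, 12, 2)
end = datetime.datetime.today()

# One request for both tickers; the columns come back as a
# MultiIndex such as ('Adj Close', 'F'), ('Adj Close', '^GSPC'), ...
df3 = web.get_data_yahoo(['F', '^GSPC'], start, end)

# Selecting the attribute level leaves one column per ticker,
# aligned on the shared Date index, so no NaN padding appears.
adj_close = df3['Adj Close']
print(adj_close.head())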

Related

Python Pandas SQL NaN value issues

How do I fix the NaN? I actually get data/values from the first query.
The float values always print as NaN from pandas, but the plain SQL query shows the float values properly.
My code is below:
import pyodbc as conn
import pandas as pd
import matplotlib.pyplot as plot

connection = conn.connect("Driver={SQL Server};"
                          "Server=GOPALPC\SQLSERVER;"
                          "Database=SCADADB;Trusted_Connection=yes")
mycursor = connection.cursor()
SQL_Query_01 = mycursor.execute(
    "SELECT TOP 5 [Channel1_PLC_001_Flow],[Channel1_PLC_001_Level],[Channel1_PLC_001_Water] FROM [SCADADB].[dbo].[DataLogDB]")
myresult = mycursor.fetchall()
for x in myresult:
    print(x)
SQL_Query_02 = pd.read_sql_query(
    "SELECT TOP 5 [Channel1_PLC_001_Flow],[Channel1_PLC_001_Level],[Channel1_PLC_001_Water] FROM [SCADADB].[dbo].[DataLogDB]", connection)
df = pd.DataFrame(SQL_Query_02, columns=['FLOW', 'PRESSURE', 'TEMPERATURE'])
print(df)
Result of SQL_Query_01:
(171.5, 171.5, 171.5)
(170.25, 170.25, 170.25)
(169.5, 169.5, 169.5)
(168.75, 168.75, 168.75)
(168.0, 168.0, 168.0)

Result of SQL_Query_02:
   FLOW  PRESSURE  TEMPERATURE
0   NaN       NaN          NaN
1   NaN       NaN          NaN
2   NaN       NaN          NaN
3   NaN       NaN          NaN
4   NaN       NaN          NaN

Process finished with exit code 0
import mysql.connector
import pandas as pd

dbcon = mysql.connector.connect(host='localhost', database='nifty', user='root', password='Your_password')
cursor = dbcon.cursor()
sql_query = '''SELECT low,high FROM adaniports LIMIT 3'''
result = cursor.execute(sql_query)
myresult = cursor.fetchall()
for i in myresult:
    print(i)
sql_query2 = pd.read_sql_query(sql_query, dbcon)
df = pd.DataFrame(sql_query2, columns=['low', 'high'])
print(df)
The above code works fine for me.
Output:

result of first query
(770.0, 1050.0)
(874.0, 990.0)
(841.0, 914.75)

result of second query
     low     high
0  770.0  1050.00
1  874.0   990.00
2  841.0   914.75
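The likely culprit in the question is the columns argument: when pd.DataFrame is given an existing DataFrame, columns selects columns by name, and names that do not exist in the source ('FLOW', 'PRESSURE', 'TEMPERATURE' versus the Channel1_PLC_001_* names the query returns) come back as all-NaN columns. In the answer above the names happen to match ('low', 'high'), which is why it works. A minimal sketch that renames after the query instead, reusing the connection object from the question:
import pandas as pd

# read_sql_query already returns a DataFrame, with the
# original column names from the SELECT clause
df = pd.read_sql_query(
    "SELECT TOP 5 [Channel1_PLC_001_Flow],[Channel1_PLC_001_Level],[Channel1_PLC_001_Water] "
    "FROM [SCADADB].[dbo].[DataLogDB]", connection)

# Rename instead of reselecting, so no columns are dropped or NaN-filled
df.columns = ['FLOW', 'PRESSURE', 'TEMPERATURE']
print(df)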

Assign array values to NaN Dataframe Pandas

I am trying to fill a dataframe, which originally has NaN values, with the same number of values taken from an array. All the values in the dictionary leagueList (NFL, NBA, etc.) are individual dataframes.
Sorry, I can't place them here as the post would become too long.
The idea behind the loop below is to compute the series of paired t-tests (p-values) between all leagues, comparing them on the column 'win_loss_ratio'.
The resulting array has the same number of values as the empty dataframe and should replace its NaN values, but I am stuck on this part. How could this be accomplished?
leagueList={'NFL':NFL,'NBA':NBA,'NHL':NHL,'MLB':MLB}
df = pd.DataFrame(columns = leagueList, index = leagueList)
print(df)
     NFL  NBA  NHL  MLB
NFL  NaN  NaN  NaN  NaN
NBA  NaN  NaN  NaN  NaN
NHL  NaN  NaN  NaN  NaN
MLB  NaN  NaN  NaN  NaN
#Double loop for making all possible league combinations
for a in leagueList.values():
    for b in leagueList.values():
        df_comb = pd.merge(a, b, left_index=True, right_index=True, how='inner')
        teststat, p_value = stats.ttest_rel(df_comb[['win_loss_ratio_x']],
                                            df_comb[['win_loss_ratio_y']])
        print(p_value)
[nan]
[0.94179205]
[0.03088317]
[0.80206949]
[0.94179205]
[nan]
[0.02229705]
[0.95053998]
[0.03088317]
[0.02229705]
[nan]
[0.00070784]
[0.80206949]
[0.95053998]
[0.00070784]
[nan]
Put the p-values into a list and either use .fillna, or just construct the dataframe straight away:
import pandas as pd
from scipy import stats

# some sample data
NFL = pd.DataFrame([.5, .6, .7], columns=['win_loss_ratio'])
NBA = pd.DataFrame([.7, .5, .3], columns=['win_loss_ratio'])
NHL = pd.DataFrame([.4, .3, .2], columns=['win_loss_ratio'])
MLB = pd.DataFrame([.9, .8, .9], columns=['win_loss_ratio'])
leagueList = {'NFL': NFL, 'NBA': NBA, 'NHL': NHL, 'MLB': MLB}

# Double loop for making all possible league combinations
rows = []
for a in leagueList.values():
    for b in leagueList.values():
        df_comb = pd.merge(a, b, left_index=True, right_index=True, how='inner')
        teststat, p_value = stats.ttest_rel(df_comb[['win_loss_ratio_x']],
                                            df_comb[['win_loss_ratio_y']])
        rows.append(p_value[0])

# Chunk the flat list of p-values into an n x n grid
n = len(leagueList)
data = [rows[i * n:(i + 1) * n] for i in range((len(rows) + n - 1) // n)]
df = pd.DataFrame(data, columns=leagueList, index=leagueList)
Output:
print(df.to_string())

          NFL       NBA      NHL       MLB
NFL       NaN  0.622036  0.12169  0.057191
NBA  0.622036       NaN  0.07418  0.092735
NHL  0.121690  0.074180      NaN  0.013560
MLB  0.057191  0.092735  0.01356       NaN
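An alternative is to fill the original NaN dataframe in place: loop over the dictionary keys as well as the values, and write each p-value into its cell with .loc. A minimal sketch reusing the sample data above:
df = pd.DataFrame(columns=leagueList, index=leagueList, dtype=float)
for name_a, a in leagueList.items():
    for name_b, b in leagueList.items():
        df_comb = pd.merge(a, b, left_index=True, right_index=True, how='inner')
        teststat, p_value = stats.ttest_rel(df_comb[['win_loss_ratio_x']],
                                            df_comb[['win_loss_ratio_y']])
        df.loc[name_a, name_b] = p_value[0]  # write the result into its cell
print(df)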

Calculation of percentile and mean

I want to find the 3rd percentile of the following data and then take the mean of the data.
Given below is the data structure.
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
96927 NaN
96928 NaN
96929 NaN
96930 NaN
96931 NaN
The data of interest lies exactly between rows 13240 and 61156.
Given below is my code:
import pandas as pd
import numpy as np

load_var = pd.read_excel(r'path\file name.xlsx')
a = pd.DataFrame(load_var['column whose percentile is to be found'])
print(a)
b = np.nanpercentile(a, 3)
print(b)
Please suggest the changes in the code.
Thank you.
Use Series.quantile together with mean in Series.agg:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'col': [7, 8, 9, 4, 2, 3, np.nan],
})

f = lambda x: x.quantile(0.03)
f.__name__ = 'q'

s = df['col'].agg(['mean', f])
print(s)

mean    5.50
q       2.15
Name: col, dtype: float64
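Applied to the question's data, where the real values sit between rows 13240 and 61156, the series can be restricted first; a minimal sketch, keeping the placeholder path and column name from the question:
import pandas as pd

load_var = pd.read_excel(r'path\file name.xlsx')
col = load_var['column whose percentile is to be found']

# Restrict to the populated rows, then compute both statistics at once;
# quantile and mean skip NaN by default in any case.
f = lambda x: x.quantile(0.03)
f.__name__ = 'q'
print(col.iloc[13240:61157].agg(['mean', f]))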

How to stop sort_values sorting by column names alphabetically?

I am working with a pandas dataframe in which some of the columns have no entries. I want to move all the empty columns to the end, and I manage to do it (see code below), but I also notice that after sorting, the remaining columns end up sorted alphabetically by column name in descending order. Can I prevent this from happening?
Input dataframe:
,colA,colB,colC,colD,colF
rowA,X,nan,nan,X,nan
rowB,nan,X,nan,nan,X
rowC,X,nan,nan,X,X
rowD,X,nan,nan,nan,nan
rowE,nan,X,nan,nan,X
Code:
import pandas as pd

df = pd.read_csv(r'q1.csv', dtype='str', index_col=0, na_values='nan')
ind = df.notnull().astype('int').any().sort_values(ascending=False).index
out = df.loc[:, ind]
out.to_csv(r'out.csv', na_rep='nan')
Output dataframe:
,colF,colD,colB,colA,colC
rowA,nan,X,nan,X,nan
rowB,X,nan,X,nan,nan
rowC,X,X,nan,X,nan
rowD,nan,nan,nan,X,nan
rowE,X,nan,X,nan,nan
Essentially, I want to keep order as it is for all other columns.
Thanks.
If I understand correctly, you may try this. kind='mergesort' performs a stable sort, so columns that compare equal (here: non-empty versus empty) keep their original relative order:
m = df.isna().all().sort_values(kind='mergesort')
df_new = df[m.index]

Out[243]:
     colA colB colD colF colC
rowA    X  NaN    X  NaN  NaN
rowB  NaN    X  NaN    X  NaN
rowC    X  NaN    X    X  NaN
rowD    X  NaN  NaN  NaN  NaN
rowE  NaN    X  NaN    X  NaN
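The same idea drops straight into the question's own code: the default sort kind is quicksort, which is not stable, and that is likely why the tied (non-empty) columns came out reshuffled. A minimal sketch of the fix under that assumption:
import pandas as pd

df = pd.read_csv(r'q1.csv', dtype='str', index_col=0, na_values='nan')

# ascending=False puts the non-empty (True) columns first; mergesort is
# stable, so ties keep their original left-to-right order.
ind = df.notnull().any().sort_values(ascending=False, kind='mergesort').index
df.loc[:, ind].to_csv(r'out.csv', na_rep='nan')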

Convert a numerical relative index (=months) to datetime

Given a pandas DataFrame whose numerical index represents a relative number of months:
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['A', 'B'], index=np.arange(1, 100))
df
A B
1 NaN NaN
2 NaN NaN
3 NaN NaN
...
How can the index be converted to a DatetimeIndex by specifying a start date (e.g., 2018-11-01)?
magic_function(df, start='2018-11-01', delta='month')
A B
2018-11-01 NaN NaN
2018-12-01 NaN NaN
2019-01-01 NaN NaN
...
I would favor a general solution that also works with arbitrary deltas, e.g. daily or yearly series.
Using date_range:
idx = pd.date_range(start='2018-11-01', periods=len(df), freq='MS')
df.index = idx
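date_range also covers the arbitrary deltas the question asks about, through its freq argument. A hedged sketch of a general helper (magic_function itself is only the question's hypothetical name; the helper below is an illustration):
import pandas as pd

def to_datetime_index(df, start, freq='MS'):
    # freq uses pandas offset aliases: 'MS' = month start,
    # 'D' = daily, 'YS' = year start, and so on.
    out = df.copy()
    out.index = pd.date_range(start=start, periods=len(out), freq=freq)
    return out

monthly = to_datetime_index(df, start='2018-11-01', freq='MS')
daily = to_datetime_index(df, start='2018-11-01', freq='D')
yearly = to_datetime_index(df, start='2018-11-01', freq='YS')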
I'm not sure with pandas, but with plain datetime can't you just do this? (The month arithmetic needs a zero-based offset so that December and year rollovers work out.)
import datetime

start = datetime.date(2018, 1, 1)
months = 15
total = start.month - 1 + months
adjusted = start.replace(year=start.year + total // 12, month=total % 12 + 1)
