Output columns do not match with the data - python-3.x

I am trying to build a dataframe of historical data for the daily number of advancing and declining stocks, with their respective volumes, for the Nifty 50 index.
Being new to python I am having trouble handling pandas dataframe and conditions.
Below is the code that I wrote, but the output's columns are wrong:
import datetime
from datetime import date, timedelta
import nsepy as ns
from nsepy.derivatives import get_expiry_date
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#setting default dates
end_date = date.today()
start_date = end_date - timedelta(365)

#Deriving the names of 50 stocks in Nifty 50 Index
nifty_50 = pd.read_html('https://en.wikipedia.org/wiki/NIFTY_50')
nifty50_symbols = nifty_50[1][1]

results = []
for x in nifty50_symbols:
    data = ns.get_history(symbol=x, start=start_date, end=end_date)
    results.append(data)
df = pd.concat(results)

output = []
for x in df.index:
    Dates = df[df.index == x]
    adv = 0
    dec = 0
    net = 0
    advol = 0
    devol = 0
    netvol = 0
    for s in Dates['Symbol']:
        y = Dates[Dates['Symbol'] == s]
        #print(y.loc[x,'Close'])
        cclose = y.loc[x, 'Close']
        #print(cclose)
        copen = y.loc[x, 'Open']
        #print(copen)
        cvol = y.loc[x, 'Volume']
        if cclose > copen:
            adv = adv + 1
            advol = advol + cvol
        elif copen > cclose:
            dec = dec + 1
            devol = devol + cvol
        else:
            net = net + 1
            netvol = netvol + cvol
    data = [x, adv, dec, advol, devol]
    output.append(data)

final = pd.DataFrame(output, columns = {'Date','Advance','Decline','Adv_Volume','Dec_Volume'})
print(final)
Output:
Dec_Volume Adv_Volume Date Decline Advance
0 2017-02-06 27 23 88546029 70663663
1 2017-02-07 15 35 53775268 127004815
2 2017-02-08 27 23 76150502 96895043
3 2017-02-09 20 30 48815099 121956144
4 2017-02-10 19 31 47713187 156262469
5 2017-02-13 23 27 78460358 86575050
6 2017-02-14 15 35 65543372 100474945
7 2017-02-15 13 37 35055563 160091302
8 2017-02-16 35 15 114283658 73082870
9 2017-02-17 22 28 91383781 193246678
10 2017-02-20 34 16 100148171 54036281
11 2017-02-21 29 21 87434834 75182662
12 2017-02-22 13 37 77086733 148499613
13 2017-02-23 20 29 104469151 192787014
14 2017-02-27 13 37 41823692 140518994
15 2017-02-28 21 29 76949655 142799485
As you can see from the output, the column names do not match the data under them. Why is this happening and how do I fix it?
If I print the output list after the loops finish, the data looks exactly the way I want it to (as far as a novice like me can see). The problem happens when I convert the output list into a DataFrame.

I think the solution is simply to pass your column names as a Python list (using []), which has a well-defined element order, rather than as a set ({}) of unordered elements:
final = pd.DataFrame(output, columns = ['Date','Advance','Decline','Adv_Volume','Dec_Volume'])
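To see why the set form scrambles things: a Python set has no defined iteration order, so pandas receives the names in an arbitrary order (recent pandas versions reject a set for columns outright, but the underlying issue is the same). A minimal sketch:
# A set literal does not preserve the order you wrote it in.
cols_as_set = {'Date', 'Advance', 'Decline', 'Adv_Volume', 'Dec_Volume'}
cols_as_list = ['Date', 'Advance', 'Decline', 'Adv_Volume', 'Dec_Volume']
print(list(cols_as_set))   # arbitrary order, e.g. starting with 'Dec_Volume'
print(cols_as_list)        # exactly the order you wrote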

Related

Box Whisker plot of date frequency

Good morning all!
I have a pandas df and I'm trying to create a monthly box-and-whisker plot of 30 years of data.
DataFrame
datetime year month day hour lon lat
0 3/18/1986 10:17 1986 3 18 10 -124.835 46.540
1 6/7/1986 13:38 1986 6 7 13 -121.669 46.376
2 7/17/1986 20:56 1986 7 17 20 -122.436 48.044
3 7/26/1986 2:46 1986 7 26 2 -123.071 48.731
4 8/2/1986 19:54 1986 8 2 19 -123.654 48.480
Trying to see the mean number of occurrences in a given month, the median, and the max/min occurrences (and the dates of the max and min).
I've been playing around with pandas.DataFrame.groupby() but don't fully understand it.
I have grouped the date by month and day occurrences. I like this format:
Code:
df = pd.read_csv(masterCSVPath)
months = df['month']
test = df.groupby(['month','day'])['day'].count()
Output:
month day
1 1 50
2 103
3 97
4 29
5 60
...
12 27 24
28 7
29 17
30 18
31 9
So how can I turn the df above into a box/whisker plot, with months on the x-axis and occurrences on the y-axis?
Try this (without doing groupby):
import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(x='month', y='day', data=df)
In case you want the months in Jan, Feb format, then try this:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# read_csv leaves 'datetime' as strings, so convert before using the .dt accessor
df['datetime'] = pd.to_datetime(df['datetime'])
df['month_new'] = df['datetime'].dt.strftime('%b')
sns.boxplot(x='month_new', y='day', data=df)
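If what you want on the y-axis is really the number of occurrences per calendar month rather than the day-of-month spread, one way (a sketch, assuming the year and month columns shown in your sample) is to count events per (year, month) first and then box those counts:
import matplotlib.pyplot as plt
import seaborn as sns

# one count per (year, month); each month's box then summarizes 30 yearly counts
counts = df.groupby(['year', 'month']).size().reset_index(name='occurrences')
sns.boxplot(x='month', y='occurrences', data=counts)
plt.show()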

I'm not able to add column for all rows in pandas dataframe

I'm pretty new to Python/pandas, so it's probably a pretty simple question... but I can't handle it:
I have two dataframes loaded from Oracle SQL. One has 380 rows / 2 columns and the second has one row / one column. I would like to add the column from the second dataset to the first as a new column, repeated for every row. But I only get it for the first row and the others are NaN.
import cx_Oracle
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.externals import joblib

dsn_tns = cx_Oracle.makedsn('127.0.1.1', '1521', 'orcl')
conn = cx_Oracle.connect(user='MyName', password='MyPass', dsn=dsn_tns)

d_score = pd.read_sql_query(
    '''
    SELECT
        ID
        ,RESULT
        ,RATIO_A
        ,RATIO_B
    from ORCL_DATA
    ''', conn)  # returns 380 rows

d_score['ID'] = d_score['ID'].astype(int)
d_score['RESULT'] = d_score['RESULT'].astype(int)
d_score['RATIO_A'] = d_score['RATIO_A'].astype(float)
d_score['RATIO_B'] = d_score['RATIO_B'].astype(float)

d_score_features = d_score.iloc[:, 2:4]
#d_train_target = d_score.iloc[:, 1:2]  # target is RESULT

DM_train = xgb.DMatrix(data=d_score_features)
loaded_model = joblib.load("bst.dat")
pred = loaded_model.predict(DM_train)

i = pd.DataFrame({'ID': d_score['ID'], 'Probability': pred})
print(i)

s = pd.read_sql_query('''select max(id_process) as MAX_ID_PROCESS from PROCESS''', conn)  # returns only 1 row
m = pd.DataFrame(data=s, dtype=np.int64, columns=['MAX_ID_PROCESS'])
print(m)

i['new'] = m  # trying to add MAX_ID_PROCESS to all rows
print(i)
i =
ID Probability
0 20101 0.663083
1 20105 0.486774
2 20106 0.441300
3 20278 0.703176
4 20221 0.539185
....
379 20480 0.671976
m =
MAX_ID_PROCESS
0 274
i =
ID_MATCH Probability new
0 20101 0.663083 274.0
1 20105 0.486774 NaN
2 20106 0.441300 NaN
3 20278 0.703176 NaN
4 20221 0.539185 NaN
I need value 'new' for all rows...
Since your second dataframe has only one row, a plain column assignment aligns on the index (only index 0 matches, which is why every other row gets NaN). Assign the scalar value instead:
df1['new'] = df2.MAX_ID_PROCESS[0]
# Or using .loc
df1['new'] = df2.MAX_ID_PROCESS.loc[0]
In your case, it should be:
i['new'] = m.MAX_ID_PROCESS[0]
You should now see:
ID Probability new
0 20101 0.663083 274.0
1 20105 0.486774 274.0
2 20106 0.441300 274.0
3 20278 0.703176 274.0
4 20221 0.539185 274.0
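For context, a minimal sketch (with made-up values) of why the NaNs appeared: column assignment from a Series aligns on the index, and m only has index 0, so only the first row of i can receive a value.
import pandas as pd

i = pd.DataFrame({'ID': [20101, 20105, 20106]})
m = pd.DataFrame({'MAX_ID_PROCESS': [274]})

i['aligned'] = m['MAX_ID_PROCESS']       # aligns on index: only row 0 is filled
i['broadcast'] = m['MAX_ID_PROCESS'][0]  # a scalar broadcasts to every row
print(i)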
We know we can append one column of dataframe1 to dataframe2 as a new column using: dataframe2["new_column_name"] = dataframe1["column_to_copy"].
We can extend this approach to solve your problem.
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1["ColA"] = [1, 12, 32, 24,12]
df1["ColB"] = [23, 11, 6, 45,25]
df1["ColC"] = [10, 25, 3, 23,15]
print(df1)
Output:
ColA ColB ColC
0 1 23 10
1 12 11 25
2 32 6 3
3 24 45 23
4 12 25 15
Now we create a new dataframe and add a row to it.
df3 = pd.DataFrame()
df3["ColTest"] = [1]
Now we store the value of the first row of the second dataframe as we want to add it to all the rows in dataframe1 as a new column:
val = df3.iloc[0]
print(val)
Output:
ColTest 1
Name: 0, dtype: int64
Now, we will store this value for as many rows as we have in dataframe1.
rows = len(df1)
for row in range(rows):
    df3.loc[row] = val
print(df3)
Output:
ColTest
0 1
1 1
2 1
3 1
4 1
Now we will append this column to the first dataframe and solve your problem.
df["ColTest"] = df3["ColTest"]
print(df)
Output:
ColA ColB ColC ColTest
0 1 23 10 1
1 12 11 25 1
2 32 6 3 1
3 24 45 23 1
4 12 25 15 1

Split dates into time ranges in pandas

14 [2018-03-14, 2018-03-13, 2017-03-06, 2017-02-13]
15 [2017-07-26, 2017-06-09, 2017-02-24]
16 [2018-09-06, 2018-07-06, 2018-07-04, 2017-10-20]
17 [2018-10-03, 2018-09-13, 2018-09-12, 2018-08-3]
18 [2017-02-08]
This is my data; every ID has its own dates, which range between 2017-02-05 and 2018-06-30. I need to split the dates into 5 time ranges of 4 months each, so that for the first 4 months every ID has dates only in that time range (from 2017-02-05 to 2017-06-05), like this:
14 [2017-03-06, 2017-02-13]
15 [2017-02-24]
16 [null] # or delete empty rows, it doesn't matter
17 [null]
18 [2017-02-08]
then for 2017-06-05 to 2017-10-05, and so on for every 4-month range. Also, I can't use nested for loops because the data is too big. This is what I tried so far:
months_4 = individual_dates.copy()
for _ in months_4['Date']:
    _ = np.where(pd.to_datetime(_) <= pd.to_datetime('2017-9-02'), _, np.datetime64('NaT'))
and
months_8 = individual_dates.copy()
range_8 = pd.date_range(start='2017-9-02', end='2017-11-02')
for _ in months_8['Date']:
    _ = _[np.isin(_, range_8)]
These achieved absolutely no result; the data stays the same no matter what.
Update: I did what you said:
individual_dates['Date'] = individual_dates['Date'].str.strip('[]').str.split(', ')
df = pd.DataFrame({
    'Date' : list(chain.from_iterable(individual_dates['Date'].tolist())),
    'ID' : individual_dates['ClientId'].repeat(individual_dates['Date'].str.len())
})
df
df
and here is the result
Date ID
0 '2018-06-30T00:00:00.000000000' '2018-06-29T00... 14
1 '2017-03-28T00:00:00.000000000' '2017-03-27T00... 15
2 '2018-03-14T00:00:00.000000000' '2018-03-13T00... 16
3 '2017-12-14T00:00:00.000000000' '2017-03-28T00... 17
4 '2017-05-30T00:00:00.000000000' '2017-05-22T00... 18
5 '2017-03-28T00:00:00.000000000' '2017-03-27T00... 19
6 '2017-03-27T00:00:00.000000000' '2017-03-26T00... 20
7 '2017-12-15T00:00:00.000000000' '2017-11-20T00... 21
8 '2017-07-05T00:00:00.000000000' '2017-07-04T00... 22
9 '2017-12-12T00:00:00.000000000' '2017-04-06T00... 23
10 '2017-05-21T00:00:00.000000000' '2017-05-07T00... 24
For better performance I suggest converting the list column into one row per date (flattening it) and then filtering with isin and boolean indexing. (As an aside, your loops change nothing because rebinding the loop variable never writes back into the DataFrame.)
from itertools import chain
import pandas as pd

df = pd.DataFrame({
    'Date' : list(chain.from_iterable(individual_dates['Date'].tolist())),
    'ID' : individual_dates['ID'].repeat(individual_dates['Date'].str.len())
})
range_8 = pd.date_range(start='2017-02-05', end='2017-06-05')
df['Date'] = pd.to_datetime(df['Date'])
df = df[df['Date'].isin(range_8)]
print(df)
Date ID
0 2017-03-06 14
0 2017-02-13 14
1 2017-02-24 15
4 2017-02-08 18
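If you need all five 4-month buckets rather than one filtered range, pd.cut works directly on datetimes and labels every row in a single pass. A sketch on the flattened df from above; the bin edges are my assumption from the ranges described in the question:
edges = pd.to_datetime(['2017-02-05', '2017-06-05', '2017-10-05',
                        '2018-02-05', '2018-06-05', '2018-10-05'])
df['bucket'] = pd.cut(df['Date'], bins=edges, labels=False)
first_range = df[df['bucket'] == 0]  # mind the edges: bins are left-open by default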

Scraping an html table with beautiful soup into pandas

I'm trying to scrape an html table using beautiful soup and import it into pandas -- http://www.baseball-reference.com/teams/NYM/2017.shtml -- the "Team Batting" table.
Finding the table is no problem:
table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')
Finding the rows of data isn't a problem either:
for i in table.findAll('tr')[2]: #increase to 3 to get next row in table...
    print(i.get_text())
And I can even find the header names:
table_head = table.find('thead')
for i in table_head.findAll('th'):
    print(i.get_text())
Now I'm having trouble putting everything together into a data frame. Here's what I have so far:
from collections import OrderedDict

header = []
for th in table_head.findAll('th'):
    key = th.get_text()
    header.append(key)

row = []
for tr in table.findAll('tr')[2]:
    value = tr.get_text()
    row.append(value)

od = OrderedDict(zip(header, row))
df = pd.DataFrame(od, index=[0])
This only works for one row at a time. My question is how can I do this for every row in the table at the same time?
I have tested that the code below will work for your purposes. Basically, you need to create a list, loop over the players, and use that list to populate a DataFrame. It is advisable not to create the DataFrame row by row, as that will probably be significantly slower.
import collections as co
import pandas as pd
from bs4 import BeautifulSoup

with open('team_batting.html','r') as fin:
    soup = BeautifulSoup(fin.read(),'lxml')

table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')
table_head = table.find('thead')

header = []
for th in table_head.findAll('th'):
    key = th.get_text()
    header.append(key)

# loop over table to find number of rows with '' in first column
endrows = 0
for tr in table.findAll('tr'):
    if tr.findAll('th')[0].get_text() == '':
        endrows += 1

rows = len(table.findAll('tr'))
rows -= endrows + 1  # there is a pernicious final row that begins with 'Rk'

list_of_dicts = []
for row in range(rows):
    the_row = []
    try:
        table_row = table.findAll('tr')[row]
        for tr in table_row:
            value = tr.get_text()
            the_row.append(value)
        od = co.OrderedDict(zip(header, the_row))
        list_of_dicts.append(od)
    except AttributeError:
        continue

df = pd.DataFrame(list_of_dicts)
This solution uses only pandas, but it cheats a little by knowing in advance that the team batting table is the tenth table. With that knowledge, the following uses pandas' read_html function and grabs the tenth DataFrame from the list of returned DataFrame objects. What remains after that is just some data cleaning:
import pandas as pd

url = 'http://www.baseball-reference.com/teams/NYM/2017.shtml'
# Take the 10th dataframe
team_batting = pd.read_html(url)[9]
# Drop columns whose names contain "Unnamed"
team_batting.drop([x for x in team_batting.columns if 'Unnamed' in x], axis=1, inplace=True)
# Remove the rows that are just a copy of the headers/columns
team_batting = team_batting.loc[team_batting.apply(lambda x: x != team_batting.columns, axis=1).all(axis=1), :]
# Take out the Totals rows
team_batting = team_batting.loc[~team_batting.Rk.isnull(), :]
# Get a glimpse of the data
print(team_batting.head(5))
# Rk Pos Name Age G PA AB R H 2B ... OBP SLG OPS OPS+ TB GDP HBP SH SF IBB
# 0 1 C Travis d'Arnaud 28 12 42 37 6 10 2 ... .357 .541 .898 144 20 1 1 0 0 1
# 1 2 1B Lucas Duda* 31 13 50 42 4 10 2 ... .360 .571 .931 153 24 1 0 0 0 2
# 2 3 2B Neil Walker# 31 14 62 54 5 12 3 ... .306 .278 .584 64 15 2 0 0 1 0
# 3 4 SS Asdrubal Cabrera# 31 15 67 63 10 17 2 ... .313 .397 .710 96 25 0 0 0 0 0
# 4 5 3B Jose Reyes# 34 15 59 53 3 5 2 ... .186 .132 .319 -9 7 0 0 0 0 0
I hope this helps.

How do I copy to a range, rather than a list, of columns?

I am looking to append several columns to a dataframe.
Let's say I start with this:
import pandas as pd
dfX = pd.DataFrame({'A': [1,2,3,4],'B': [5,6,7,8],'C': [9,10,11,12]})
dfY = pd.DataFrame({'D': [13,14,15,16],'E': [17,18,19,20],'F': [21,22,23,24]})
I am able to append the dfY columns to dfX by defining the new columns in list form:
dfX[[3,4]] = dfY.iloc[:,1:3].copy()
...but I would rather do so this way:
dfX.iloc[:,3:4] = dfY.iloc[:,1:3].copy()
The former works! The latter executes, returns no errors, but does not alter dfX.
Are you looking for
dfX = pd.concat([dfX, dfY], axis = 1)
It returns
A B C D E F
0 1 5 9 13 17 21
1 2 6 10 14 18 22
2 3 7 11 15 19 23
3 4 8 12 16 20 24
And you can append several dataframes in this like pd.concat([dfX, dfY, dfZ], axis = 1)
If you need to append say only column D and E from dfY to dfX, go for
pd.concat([dfX, dfY[['D', 'E']]], axis = 1)
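As for why the .iloc version executes but changes nothing: .iloc cannot create new columns on assignment, and an out-of-bounds slice like 3:4 on a three-column frame simply selects an empty block, so there is nothing to assign into. A small sketch:
import pandas as pd

dfX = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
print(dfX.iloc[:, 3:4])  # Empty DataFrame: out-of-bounds slices select nothing, no error
dfX.iloc[:, 3:4] = 99    # assigns into the empty selection, so it is a silent no-op
print(dfX)               # unchanged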
