I'm not able to add column for all rows in pandas dataframe - python-3.x

I'm pretty new in python / pandas, so its probably pretty simple question...but I can't handle it:
I have two dataframe loaded from Oracle SQL. One with 300 rows / 2 column and second with one row/one column. I would like to add column from second dataset to the first for each row as new column. But I can only get it for the first row and the others are NaN.
`import cx_Oracle
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.externals import joblib
dsn_tns = cx_Oracle.makedsn('127.0.1.1', '1521', 'orcl')
conn = cx_Oracle.connect(user='MyName', password='MyPass', dsn=dsn_tns)
d_score = pd.read_sql_query(
'''
SELECT
ID
,RESULT
,RATIO_A
,RATIO_B
from ORCL_DATA
''', conn) #return 380 rows
d_score['ID'] = d_score['ID'].astype(int)
d_score['RESULT'] = d_score['RESULT'].astype(int)
d_score['RATIO_A'] = d_score['RATIO_A'].astype(float)
d_score['RATIO_B'] = d_score['RATIO_B'].astype(float)
d_score_features = d_score.iloc [:,2:4]
#d_train_target = d_score.iloc[:,1:2] #target is RESULT
DM_train = xgb.DMatrix(data= d_score_features)
loaded_model = joblib.load("bst.dat")
pred = loaded_model.predict(DM_train)
i = pd.DataFrame({'ID':d_score['ID'],'Probability':pred})
print(i)
s = pd.read_sql_query('''select max(id_process) as MAX_ID_PROCESS from PROCESS''',conn) #return only 1 row
m =pd.DataFrame(data=s, dtype=np.int64,columns = ['MAX_ID_PROCESS'] )
print(m)
i['new'] = m ##Trying to add MAX_ID_PROCESS to all rows
print(i)
i =
ID Probability
0 20101 0.663083
1 20105 0.486774
2 20106 0.441300
3 20278 0.703176
4 20221 0.539185
....
379 20480 0.671976
m =
MAX_ID_PROCESS
0 274
i =
ID_MATCH Probability new
0 20101 0.663083 274.0
1 20105 0.486774 NaN
2 20106 0.441300 NaN
3 20278 0.703176 NaN
4 20221 0.539185 NaN
I need value 'new' for all rows...

Since your second dataframe is only having one value, you can assign it like this:
df1['new'] = df2.MAX_ID_PROCESS[0]
# Or using .loc
df1['new'] = df2.MAX_ID_PROCESS.loc[0]
In your case, it should be:
i['new'] = m.MAX_ID_PROCESS[0]
You should now see:
ID Probability new
0 20101 0.663083 274.0
1 20105 0.486774 274.0
2 20106 0.441300 274.0
3 20278 0.703176 274.0
4 20221 0.539185 274.0

As we know that we can append one column of dataframe1 to dataframe2 as new column using the code: dataframe2["new_column_name"] = dataframe1["column_to_copy"].
We can extend this approach to solve your problem.
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1["ColA"] = [1, 12, 32, 24,12]
df1["ColB"] = [23, 11, 6, 45,25]
df1["ColC"] = [10, 25, 3, 23,15]
print(df1)
Output:
ColA ColB ColC
0 1 23 10
1 12 11 25
2 32 6 3
3 24 45 23
4 12 25 15
Now we create a new dataframe and add a row to it.
df3 = pd.DataFrame()
df3["ColTest"] = [1]
Now we store the value of the first row of the second dataframe as we want to add it to all the rows in dataframe1 as a new column:
val = df3.iloc[0]
print(val)
Output:
ColTest 1
Name: 0, dtype: int64
Now, we will store this value for as many rows as we have in dataframe1.
rows = len(df1)
for row in range(rows):
df3.loc[row]=val
print(df3)
Output:
ColTest
0 1
1 1
2 1
3 1
4 1
Now we will append this column to the first dataframe and solve your problem.
df["ColTest"] = df3["ColTest"]
print(df)
Output:
ColA ColB ColC ColTest
0 1 23 10 1
1 12 11 25 1
2 32 6 3 1
3 24 45 23 1
4 12 25 15 1

Related

Getting Dummy Back to Categorical

I have a df called X like this:
Index Class Family
1 Mid 12
2 Low 6
3 High 5
4 Low 2
Created this to dummy variables using below code:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
ohe = OneHotEncoder()
X_object = X.select_dtypes('object')
ohe.fit(X_object)
codes = ohe.transform(X_object).toarray()
feature_names = ohe.get_feature_names(['V1', 'V2'])
X = pd.concat([df.select_dtypes(exclude='object'),
pd.DataFrame(codes,columns=feature_names).astype(int)], axis=1)
Resultant df is like:
V1_Mid V1_Low V1_High V2_12 V2_6 V2_5 V2_2
1 0 0 1 0 0 0
..and so on
Question: How to do I convert my resultant df back to original df ?
I have seen this but it gives me NameError: name 'Series' is not defined.
First we can regroup each original column from your resultant df into the original column names as the first level of a column multi-index:
>>> df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('_', 1).map(tuple))
>>> df = df.rename(columns={'V1': 'Class', 'V2': 'Family'}, level=0)
>>> df
Class Family
Mid Low High 12 6 5 2
0 1 0 0 1 0 0 0
Now we see the second-level of columns are the values. Thus, within each top-level we want to get the column name that has a 1, knowing all the other entries are 0. This can be done with idxmax():
>>> orig_df = pd.concat({col: df[col].idxmax(axis='columns') for col in df.columns.levels[0]}, axis='columns')
>>> orig_df
Class Family
0 Mid 12
An even more simple way is to just stick to pandas.
df = pd.DataFrame({"Index":[1,2,3,4],"Class":["Mid","Low","High","Low"],"Family":[12,6,5,2]})
# Combine features in new column
df["combined"] = list(zip(df["Class"], df["Family"]))
print(df)
Out:
Index Class Family combined
0 1 Mid 12 (Mid, 12)
1 2 Low 6 (Low, 6)
2 3 High 5 (High, 5)
3 4 Low 2 (Low, 2)
You can get the one hot encoding using pandas directly.
one_hot = pd.get_dummies(df["combined"])
print(one_hot)
Out:
(High, 5) (Low, 2) (Low, 6) (Mid, 12)
0 0 0 0 1
1 0 0 1 0
2 1 0 0 0
3 0 1 0 0
Then to get back you just can check the name of the column and select the row in the original dataframe with same tuple.
print(df[df["combined"]==one_hot.columns[0]])
Out:
Index Class Family combined
2 3 High 5 (High, 5)

pandas expand dataframe column with tuples, into multiple columns and rows

I have a data frame where one column contains elements that are a list containing several tuples. I want to turn each tuple in to a column for each element and create a new row for each tuple. So this code shows what I mean and the solution I came up with:
import numpy as np
import pandas as pd
a = pd.DataFrame(data=[['a','b',[(1,2,3),(6,7,8)]],
['c','d',[(10,20,30)]]], columns=['one','two','three'])
df2 = pd.DataFrame(columns=['one', 'two', 'A', 'B','C'])
print(a)
for index,item in a.iterrows():
for xtup in item.three:
temp = pd.Series(item)
temp['A'] = xtup[0]
temp['B'] = xtup[1]
temp['C'] = xtup[2]
temp = temp.drop('three')
df2 = df2.append(temp)
print(df2)
The output is:
one two three
0 a b [(1, 2, 3), (6, 7, 8)]
1 c d [(10, 20, 30)]
one two A B C
0 a b 1 2 3
0 a b 6 7 8
1 c d 10 20 30
Unfortunately, my solution takes 2 hours to run on 55,000 rows! Is there a more efficient way to do this?
We do explode column then explode row
a=a.explode('three')
a=pd.concat([a,pd.DataFrame(a.pop('three').tolist(),index=a.index)],axis=1)
one two 0 1 2
0 a b 1 2 3
0 a b 6 7 8
1 c d 10 20 30

Reorder columns in groups by number embedded in column name?

I have a very large dataframe with 1,000 columns. The first few columns occur only once, denoting a customer. The next few columns are representative of multiple encounters with the customer, with an underscore and the number encounter. Every additional encounter adds a new column, so there is NOT a fixed number of columns -- it'll grow with time.
Sample dataframe header structure excerpt:
id dob gender pro_1 pro_10 pro_11 pro_2 ... pro_9 pre_1 pre_10 ...
I'm trying to re-order the columns based on the number after the column name, so all _1 should be together, all _2 should be together, etc, like so:
id dob gender pro_1 pre_1 que_1 fre_1 gen_1 pro2 pre_2 que_2 fre_2 ...
(Note that the re-order should order the numbers correctly; the current order treats them like strings, which orders 1, 10, 11, etc. rather than 1, 2, 3)
Is this possible to do in pandas, or should I be looking at something else? Any help would be greatly appreciated! Thank you!
EDIT:
Alternatively, is it also possible to re-arrange column names based on the string part AND number part of the column names? So the output would then look similar to the original, except the numbers would be considered so that the order is more intuitive:
id dob gender pro_1 pro_2 pro_3 ... pre_1 pre_2 pre_3 ...
EDIT 2.0:
Just wanted to thank everyone for helping! While only one of the responses worked, I really appreciate the effort and learned a lot about other approaches / ways to think about this.
Here is one way you can try:
# column names copied from your example
example_cols = 'id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10'.split()
# sample DF
df = pd.DataFrame([range(len(example_cols))], columns=example_cols)
df
# id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10
#0 0 1 2 3 4 5 6 7 8 9
# number of columns excluded from sorting
N = 3
# get a list of columns from the dataframe
cols = df.columns.tolist()
# split, create an tuple of (column_name, prefix, number) and sorted based on the 2nd and 3rd item of the tuple, then retrieved the first item.
# adjust "key = lambda x: x[2]" to group cols by numbers only
cols_new = cols[:N] + [ a[0] for a in sorted([ (c, p, int(n)) for c in cols[N:] for p,n in [c.split('_')]], key = lambda x: (x[1], x[2])) ]
# get the new dataframe based on the cols_new
df_new = df[cols_new]
# id dob gender pre_1 pre_10 pro_1 pro_2 pro_9 pro_10 pro_11
#0 0 1 2 8 9 3 6 7 4 5
Luckily there is a one liner in python that can fix this:
df = df.reindex(sorted(df.columns), axis=1)
For Example lets say you had this dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': [2, 4, 8, 0],
'ID': [2, 0, 0, 0],
'Prod3': [10, 2, 1, 8],
'Prod1': [2, 4, 8, 0],
'Prod_1': [2, 4, 8, 0],
'Pre7': [2, 0, 0, 0],
'Pre2': [10, 2, 1, 8],
'Pre_2': [10, 2, 1, 8],
'Pre_9': [10, 2, 1, 8]}
)
print(df)
Output:
Name ID Prod3 Prod1 Prod_1 Pre7 Pre2 Pre_2 Pre_9
0 2 2 10 2 2 2 10 10 10
1 4 0 2 4 4 0 2 2 2
2 8 0 1 8 8 0 1 1 1
3 0 0 8 0 0 0 8 8 8
Then used
df = df.reindex(sorted(df.columns), axis=1)
Then the dataframe will then look like:
ID Name Pre2 Pre7 Pre_2 Pre_9 Prod1 Prod3 Prod_1
0 2 2 10 2 10 10 2 10 2
1 0 4 2 0 2 2 4 2 4
2 0 8 1 0 1 1 8 1 8
3 0 0 8 0 8 8 0 8 0
As you can see, the columns without underscore will come first, followed by an ordering based on the number after the underscore. However this also sorts of the column names, so the column names that come first in the alphabet will be first.
You need to split you column on '_' then convert to int:
c = ['A_1','A_10','A_2','A_3','B_1','B_10','B_2','B_3']
df = pd.DataFrame(np.random.randint(0,100,(2,8)), columns = c)
df.reindex(sorted(df.columns, key = lambda x: int(x.split('_')[1])), axis=1)
Output:
A_1 B_1 A_2 B_2 A_3 B_3 A_10 B_10
0 68 11 59 69 37 68 76 17
1 19 37 52 54 23 93 85 3
Next case, you need human sorting:
import re
def atoi(text):
return int(text) if text.isdigit() else text
def natural_keys(text):
'''
alist.sort(key=natural_keys) sorts in human order
http://nedbatchelder.com/blog/200712/human_sorting.html
(See Toothy's implementation in the comments)
'''
return [ atoi(c) for c in re.split(r'(\d+)', text) ]
df.reindex(sorted(df.columns, key = lambda x:natural_keys(x)), axis=1)
Output:
A_1 A_2 A_3 A_10 B_1 B_2 B_3 B_10
0 68 59 37 76 11 69 68 17
1 19 52 23 85 37 54 93 3
Try this.
To re-order the columns based on the number after the column name
cols_fixed = df.columns[:3] # change index no based on your df
cols_variable = df.columns[3:] # change index no based on your df
cols_variable = sorted(cols_variable, key=lambda x : int(x.split('_')[1])) # split based on the number after '_'
cols_new = cols_fixed + cols_variable
new_df = pd.DataFrame(df[cols_new])
To re-arrange column names based on the string part AND number part of the column names
cols_fixed = df.columns[:3] # change index no based on your df
cols_variable = df.columns[3:] # change index no based on your df
cols_variable = sorted(cols_variable)
cols_new = cols_fixed + cols_variable
new_df = pd.DataFrame(df[cols_new])

Scraping an html table with beautiful soup into pandas

I'm trying to scrape an html table using beautiful soup and import it into pandas -- http://www.baseball-reference.com/teams/NYM/2017.shtml -- the "Team Batting" table.
Finding the table is no problem:
table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')
Finding the rows of data isn't a problem either:
for i in table.findAll('tr')[2]: #increase to 3 to get next row in table...
print(i.get_text())
And I can even find the header names:
table_head = table.find('thead')
for i in table_head.findAll('th'):
print(i.get_text())
Now I'm having trouble putting everything together into a data frame. Here's what I have so far:
header = []
for th in table_head.findAll('th'):
key = th.get_text()
header.append(key)
row= []
for tr in table.findAll('tr')[2]:
value = tr.get_text()
row.append(value)
od = OrderedDict(zip(head, row))
df = pd.DataFrame(d1, index=[0])
This only works for one row at a time. My question is how can I do this for every row in the table at the same time?
I have tested that the below will work for your purposes. Basically you need to create a list, loop over the players, use that list to populate a DataFrame. It is advisable to not create the DataFrame row by row as that will probably be significantly slower.
import collections as co
import pandas as pd
from bs4 import BeautifulSoup
with open('team_batting.html','r') as fin:
soup = BeautifulSoup(fin.read(),'lxml')
table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')
table_head = table.find('thead')
header = []
for th in table_head.findAll('th'):
key = th.get_text()
header.append(key)
# loop over table to find number of rows with '' in first column
endrows = 0
for tr in table.findAll('tr'):
if tr.findAll('th')[0].get_text() in (''):
endrows += 1
rows = len(table.findAll('tr'))
rows -= endrows + 1 # there is a pernicious final row that begins with 'Rk'
list_of_dicts = []
for row in range(rows):
the_row = []
try:
table_row = table.findAll('tr')[row]
for tr in table_row:
value = tr.get_text()
the_row.append(value)
od = co.OrderedDict(zip(header,the_row))
list_of_dicts.append(od)
except AttributeError:
continue
df = pd.DataFrame(list_of_dicts)
This solution uses only pandas, but it cheats a little by knowing in advance that the team batting table is the tenth table. With that knowledge, the following uses pandas's read_html function and grabbing the tenth DataFrame from the list of returned DataFrame objects. The remaining after that is just some data cleaning:
import pandas as pd
url = 'http://www.baseball-reference.com/teams/NYM/2017.shtml'
# Take 10th dataframe
team_batting = pd.read_html(url)[9]
# Take columns whose names don't contain "Unnamed"
team_batting.drop([x for x in team_batting.columns if 'Unnamed' in x], axis=1, inplace=True)
# Remove the rows that are just a copy of the headers/columns
team_batting = team_batting.ix[team_batting.apply(lambda x: x != team_batting.columns,axis=1).all(axis=1),:]
# Take out the Totals rows
team_batting = team_batting.ix[~team_batting.Rk.isnull(),:]
# Get a glimpse of the data
print(team_batting.head(5))
# Rk Pos Name Age G PA AB R H 2B ... OBP SLG OPS OPS+ TB GDP HBP SH SF IBB
# 0 1 C Travis d'Arnaud 28 12 42 37 6 10 2 ... .357 .541 .898 144 20 1 1 0 0 1
# 1 2 1B Lucas Duda* 31 13 50 42 4 10 2 ... .360 .571 .931 153 24 1 0 0 0 2
# 2 3 2B Neil Walker# 31 14 62 54 5 12 3 ... .306 .278 .584 64 15 2 0 0 1 0
# 3 4 SS Asdrubal Cabrera# 31 15 67 63 10 17 2 ... .313 .397 .710 96 25 0 0 0 0 0
# 4 5 3B Jose Reyes# 34 15 59 53 3 5 2 ... .186 .132 .319 -9 7 0 0 0 0 0
I hope this helps.

How do I copy to a range, rather than a list, of columns?

I am looking to append several columns to a dataframe.
Let's say I start with this:
import pandas as pd
dfX = pd.DataFrame({'A': [1,2,3,4],'B': [5,6,7,8],'C': [9,10,11,12]})
dfY = pd.DataFrame({'D': [13,14,15,16],'E': [17,18,19,20],'F': [21,22,23,24]})
I am able to append the dfY columns to dfX by defining the new columns in list form:
dfX[[3,4]] = dfY.iloc[:,1:3].copy()
...but I would rather do so this way:
dfX.iloc[:,3:4] = dfY.iloc[:,1:3].copy()
The former works! The latter executes, returns no errors, but does not alter dfX.
Are you looking for
dfX = pd.concat([dfX, dfY], axis = 1)
It returns
A B C D E F
0 1 5 9 13 17 21
1 2 6 10 14 18 22
2 3 7 11 15 19 23
3 4 8 12 16 20 24
And you can append several dataframes in this like pd.concat([dfX, dfY, dfZ], axis = 1)
If you need to append say only column D and E from dfY to dfX, go for
pd.concat([dfX, dfY[['D', 'E']]], axis = 1)

Resources