Scraping an HTML table with Beautiful Soup into pandas - python-3.x

I'm trying to scrape an HTML table using Beautiful Soup and import it into pandas. The page is http://www.baseball-reference.com/teams/NYM/2017.shtml and the table I want is "Team Batting".
Finding the table is no problem:
table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')
Finding the rows of data isn't a problem either:
for i in table.findAll('tr')[2]: #increase to 3 to get next row in table...
    print(i.get_text())
And I can even find the header names:
table_head = table.find('thead')
for i in table_head.findAll('th'):
    print(i.get_text())
Now I'm having trouble putting everything together into a data frame. Here's what I have so far:
header = []
for th in table_head.findAll('th'):
    key = th.get_text()
    header.append(key)
row = []
for tr in table.findAll('tr')[2]:
    value = tr.get_text()
    row.append(value)
od = OrderedDict(zip(header, row))
df = pd.DataFrame(od, index=[0])
This only works for one row at a time. My question is how can I do this for every row in the table at the same time?

I have tested that the code below works for your purposes. Basically, you need to create a list, loop over the players, and use that list to populate a DataFrame. It is advisable not to create the DataFrame row by row, as that will probably be significantly slower.
import collections as co
import pandas as pd
from bs4 import BeautifulSoup
with open('team_batting.html','r') as fin:
    soup = BeautifulSoup(fin.read(),'lxml')
table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')
table_head = table.find('thead')
header = []
for th in table_head.findAll('th'):
    key = th.get_text()
    header.append(key)
# loop over table to find number of rows with '' in first column
endrows = 0
for tr in table.findAll('tr'):
    if tr.findAll('th')[0].get_text() in (''):
        endrows += 1
rows = len(table.findAll('tr'))
rows -= endrows + 1 # there is a pernicious final row that begins with 'Rk'
list_of_dicts = []
for row in range(rows):
    the_row = []
    try:
        table_row = table.findAll('tr')[row]
        for tr in table_row:
            value = tr.get_text()
            the_row.append(value)
        od = co.OrderedDict(zip(header,the_row))
        list_of_dicts.append(od)
    except AttributeError:
        continue
df = pd.DataFrame(list_of_dicts)
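The same idea can be sketched more compactly by collecting one dictionary of cell texts per row and building the DataFrame once at the end. This is only a minimal sketch, not the tested code above: the inline HTML string, its column names, and the `rows_to_frame` helper are hypothetical stand-ins for the saved page, and it assumes the standard-library `html.parser` backend instead of `lxml`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the saved page; the real table has many more columns.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Rk</th><th>Name</th><th>G</th></tr></thead>
  <tbody>
    <tr><th>1</th><td>Player One</td><td>12</td></tr>
    <tr><th>2</th><td>Player Two</td><td>13</td></tr>
    <tr><th>Rk</th><th>Name</th><th>G</th></tr>
  </tbody>
</table>
"""


def rows_to_frame(html):
    """Build one DataFrame from every body row of the first table in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    header = [th.get_text() for th in table.find("thead").find_all("th")]
    records = []
    for tr in table.find("tbody").find_all("tr"):
        cells = [cell.get_text() for cell in tr.find_all(["th", "td"])]
        # Skip rows that merely repeat the header inside the table body.
        if cells == header:
            continue
        records.append(dict(zip(header, cells)))
    # Create the DataFrame once, from the full list of row dictionaries.
    return pd.DataFrame(records, columns=header)


df = rows_to_frame(SAMPLE_HTML)
print(df)
```

Comparing each row with the header is one way to skip the repeated 'Rk' row without counting rows in advance; every value is still a string at this point and would need converting before any arithmetic.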

This solution uses only pandas, but it cheats a little by knowing in advance that the team batting table is the tenth table on the page. With that knowledge, the following uses pandas's read_html function and grabs the tenth DataFrame from the list of returned DataFrame objects. The rest is just some data cleaning:
import pandas as pd
url = 'http://www.baseball-reference.com/teams/NYM/2017.shtml'
# Take 10th dataframe
team_batting = pd.read_html(url)[9]
# Take columns whose names don't contain "Unnamed"
team_batting.drop([x for x in team_batting.columns if 'Unnamed' in x], axis=1, inplace=True)
# Remove the rows that are just a copy of the headers/columns
# (.loc is used here because .ix was removed in pandas 1.0)
team_batting = team_batting.loc[team_batting.apply(lambda x: x != team_batting.columns,axis=1).all(axis=1),:]
# Take out the Totals rows
team_batting = team_batting.loc[~team_batting.Rk.isnull(),:]
# Get a glimpse of the data
print(team_batting.head(5))
# Rk Pos Name Age G PA AB R H 2B ... OBP SLG OPS OPS+ TB GDP HBP SH SF IBB
# 0 1 C Travis d'Arnaud 28 12 42 37 6 10 2 ... .357 .541 .898 144 20 1 1 0 0 1
# 1 2 1B Lucas Duda* 31 13 50 42 4 10 2 ... .360 .571 .931 153 24 1 0 0 0 2
# 2 3 2B Neil Walker# 31 14 62 54 5 12 3 ... .306 .278 .584 64 15 2 0 0 1 0
# 3 4 SS Asdrubal Cabrera# 31 15 67 63 10 17 2 ... .313 .397 .710 96 25 0 0 0 0 0
# 4 5 3B Jose Reyes# 34 15 59 53 3 5 2 ... .186 .132 .319 -9 7 0 0 0 0 0
I hope this helps.

Related

Pandas cannot rename columns or insert column

I am reading a CSV file in Python that has no header row or labels. I would like to add names to my columns, and I followed the documentation, but the file does not change. I also tried to insert a column, but the CSV does not change either.
I am stuck as to why there is no change. See the code below:
data = pd.read_csv('data.csv', header=None, names=['a', 'b', 'label'])
bias=[1 for i in range(79)]
data.insert(0, 'bias', bias)
Sources Used:
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html and Pandas read_csv usecols and names not working properly
I can't reproduce. I create a data frame from mock data with the same length.
>>> data = pd.read_csv(io.StringIO('\n'.join([','.join(['1', '2', 'a'])] * 79)), header=None, names=['a', 'b', 'label'])
a b label
0 1 2 a
1 1 2 a
.. .. .. ...
77 1 2 a
78 1 2 a
[79 rows x 3 columns]
df.insert operates in place, so I don't need to reassign or anything like that.
>>> data.insert(0, 'bias', bias)
>>> data
bias a b label
0 1 1 2 a
1 1 1 2 a
.. ... .. .. ...
77 1 1 2 a
78 1 1 2 a
[79 rows x 4 columns]
As shown here, the data frame is updated as expected. Could you provide more information?
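One possible explanation, offered as an assumption because the question does not show how the result was checked: `read_csv` and `insert` only change the DataFrame in memory, so the file on disk stays the same until the frame is written back with `to_csv`. A minimal sketch, using a temporary file and made-up values in place of `data.csv`:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for data.csv: three header-less rows.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w") as fout:
    fout.write("1,2,a\n3,4,b\n5,6,c\n")

data = pd.read_csv(path, header=None, names=["a", "b", "label"])
data.insert(0, "bias", 1)  # a scalar is broadcast to every row

# At this point only the in-memory frame has changed; the file has not.
with open(path) as fin:
    unchanged_first_line = fin.readline().strip()

# Writing the frame back is what changes the file on disk.
data.to_csv(path, index=False)
with open(path) as fin:
    new_first_line = fin.readline().strip()

print(unchanged_first_line)
print(new_first_line)
```

Passing a scalar to `insert` also avoids hard-coding the row count, as `range(79)` does in the question.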

Fill dataframe with duplicate data until a certain condition is met

I have a data frame df like,
id name age duration
1 ABC 20 12
2 sd 50 150
3 df 54 40
I want to duplicate rows of this data in the same df until the sum of duration is greater than or equal to 300,
so the df could look like this:
id name age duration
1 ABC 20 12
2 sd 50 150
3 df 54 40
2 sd 50 150
So far I have tried the code below, but it sometimes runs in an infinite loop.
Please help.
def fillPlaylist(df,duration):
    print("inside fill playlist fn.")
    if(len(df)==0):
        print("df len is 0, cannot fill.")
        return df;
    receivedDf= df
    print("receivedDf",receivedDf,flush=True)
    print("Received df len = ",len(receivedDf),flush=True)
    print("duration to fill ",duration,flush=True)
    while df['duration'].sum() < duration:
        # random 5% sample of data.
        print("filling")
        ramdomSampleDuplicates = receivedDf.sample(frac=0.05).reset_index(drop=True)
        df = pd.concat([ramdomSampleDuplicates,df])
        print("df['duration'].sum() ",df['duration'].sum())
    print("after filling df len = ",len(df))
    return df;
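Try using n instead of frac. With a small frame, frac=0.05 can round down to zero sampled rows, so the total duration never grows and the loop never ends.
The n parameter randomly samples exactly n rows from your dataframe.
sample_df = df.sample(n=1).reset_index(drop=True)
To keep using frac, you can rewrite your code this way.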
def fillPlaylist(df,duration):
    while df.duration.sum() < duration:
        sample_df = df.sample(frac=0.5).reset_index(drop=True)
        df = pd.concat([df,sample_df])
    return df
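The n-based suggestion can be put together as a complete function. This is a minimal sketch under stated assumptions: the name `fill_playlist` and the `seed` parameter are hypothetical, and it always samples from the original rows so that duplicates of duplicates do not skew the mix.

```python
import pandas as pd


def fill_playlist(df, duration, seed=None):
    """Append randomly chosen copies of the original rows until the
    'duration' column sums to at least `duration`."""
    # Guard against an empty frame or one with no positive total,
    # which could never reach the target.
    if df.empty or df["duration"].sum() <= 0:
        return df
    original = df
    step = 0
    while df["duration"].sum() < duration:
        # n=1 always returns exactly one row, unlike a small frac.
        state = None if seed is None else seed + step
        extra = original.sample(n=1, random_state=state)
        df = pd.concat([df, extra], ignore_index=True)
        step += 1
    return df


playlist = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["ABC", "sd", "df"],
    "age": [20, 50, 54],
    "duration": [12, 150, 40],
})
filled = fill_playlist(playlist, 300, seed=0)
print(filled["duration"].sum())
```

Which rows get appended depends on the random state, so only the final condition (a total of at least 300) is guaranteed, not a particular row order.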

I'm not able to add column for all rows in pandas dataframe

I'm pretty new to Python and pandas, so this is probably a pretty simple question, but I can't figure it out:
I have two dataframes loaded from Oracle SQL: one with 300 rows and 2 columns, and a second with one row and one column. I would like to add the column from the second dataframe to the first as a new column, with its value on every row. But I only get the value for the first row, and the others are NaN.
import cx_Oracle
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.externals import joblib
dsn_tns = cx_Oracle.makedsn('127.0.1.1', '1521', 'orcl')
conn = cx_Oracle.connect(user='MyName', password='MyPass', dsn=dsn_tns)
d_score = pd.read_sql_query(
'''
SELECT
ID
,RESULT
,RATIO_A
,RATIO_B
from ORCL_DATA
''', conn) #return 380 rows
d_score['ID'] = d_score['ID'].astype(int)
d_score['RESULT'] = d_score['RESULT'].astype(int)
d_score['RATIO_A'] = d_score['RATIO_A'].astype(float)
d_score['RATIO_B'] = d_score['RATIO_B'].astype(float)
d_score_features = d_score.iloc [:,2:4]
#d_train_target = d_score.iloc[:,1:2] #target is RESULT
DM_train = xgb.DMatrix(data= d_score_features)
loaded_model = joblib.load("bst.dat")
pred = loaded_model.predict(DM_train)
i = pd.DataFrame({'ID':d_score['ID'],'Probability':pred})
print(i)
s = pd.read_sql_query('''select max(id_process) as MAX_ID_PROCESS from PROCESS''',conn) #return only 1 row
m =pd.DataFrame(data=s, dtype=np.int64,columns = ['MAX_ID_PROCESS'] )
print(m)
i['new'] = m ##Trying to add MAX_ID_PROCESS to all rows
print(i)
i =
ID Probability
0 20101 0.663083
1 20105 0.486774
2 20106 0.441300
3 20278 0.703176
4 20221 0.539185
....
379 20480 0.671976
m =
MAX_ID_PROCESS
0 274
i =
ID_MATCH Probability new
0 20101 0.663083 274.0
1 20105 0.486774 NaN
2 20106 0.441300 NaN
3 20278 0.703176 NaN
4 20221 0.539185 NaN
I need value 'new' for all rows...
Since your second dataframe has only one value, you can assign it like this:
df1['new'] = df2.MAX_ID_PROCESS[0]
# Or using .loc
df1['new'] = df2.MAX_ID_PROCESS.loc[0]
In your case, it should be:
i['new'] = m.MAX_ID_PROCESS[0]
You should now see:
ID Probability new
0 20101 0.663083 274
1 20105 0.486774 274
2 20106 0.441300 274
3 20278 0.703176 274
4 20221 0.539185 274
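Why the original assignment produced NaN: assigning a one-row Series or DataFrame aligns on the index, so only row label 0 finds a match, while a scalar is broadcast to every row. A minimal sketch with made-up values; `scores` and `max_id` are hypothetical names standing in for `i` and `m`:

```python
import pandas as pd

scores = pd.DataFrame({"ID": [20101, 20105, 20106],
                       "Probability": [0.66, 0.49, 0.44]})
max_id = pd.DataFrame({"MAX_ID_PROCESS": [274]})

# Assigning a Series aligns on the index: only row label 0 exists in max_id,
# so the other rows are filled with NaN.
scores["aligned"] = max_id["MAX_ID_PROCESS"]

# Extracting the scalar first broadcasts it to every row.
scores["new"] = max_id["MAX_ID_PROCESS"].iloc[0]

print(scores)
```

Using `.iloc[0]` selects by position, so it keeps working even if the one-row frame does not happen to have the label 0 in its index.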
We can copy one column of dataframe1 into dataframe2 as a new column using dataframe2["new_column_name"] = dataframe1["column_to_copy"].
We can extend this approach to solve your problem.
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1["ColA"] = [1, 12, 32, 24,12]
df1["ColB"] = [23, 11, 6, 45,25]
df1["ColC"] = [10, 25, 3, 23,15]
print(df1)
Output:
ColA ColB ColC
0 1 23 10
1 12 11 25
2 32 6 3
3 24 45 23
4 12 25 15
Now we create a new dataframe and add a row to it.
df3 = pd.DataFrame()
df3["ColTest"] = [1]
Now we store the value of the first row of the second dataframe as we want to add it to all the rows in dataframe1 as a new column:
val = df3.iloc[0]
print(val)
Output:
ColTest 1
Name: 0, dtype: int64
Now, we will store this value for as many rows as we have in dataframe1.
rows = len(df1)
for row in range(rows):
    df3.loc[row]=val
print(df3)
Output:
ColTest
0 1
1 1
2 1
3 1
4 1
Now we will append this column to the first dataframe and solve your problem.
df1["ColTest"] = df3["ColTest"]
print(df1)
Output:
ColA ColB ColC ColTest
0 1 23 10 1
1 12 11 25 1
2 32 6 3 1
3 24 45 23 1
4 12 25 15 1

Remove index from dataframe using Python

I am trying to create a Pandas Dataframe from a string using the following code -
import pandas as pd
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
data = input_string
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
print(df)
I am getting the following result -
0 1 2
0 A B C
1 0 34 88
2 2 45 200
3 3 47 65
4 4 32 140
5 None None
But I need something like the following -
A B C
0 34 88
2 45 200
3 47 65
4 32 140
I added "index = False" while creating the dataframe like -
df = pd.DataFrame([x.split(';') for x in data.split('\n')],index = False)
But, it gives me an error -
TypeError: Index(...) must be called with a collection of some kind, False
was passed
How is this achievable?
Use read_csv with StringIO and the index_col parameter to set the first column as the index:
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
from io import StringIO  # pd.compat.StringIO is no longer available in newer pandas releases
df = pd.read_csv(StringIO(input_string),sep=';', index_col=0)
print (df)
B C
A
0 34 88
2 45 200
3 47 65
4 32 140
Your solution can be fixed by calling split with its default parameter (which splits on arbitrary whitespace), passing every list except the first to DataFrame, and using the first list as the columns parameter. If the first column should become the index, add DataFrame.set_index:
L = [x.split(';') for x in input_string.split()]
df = pd.DataFrame(L[1:], columns=L[0]).set_index('A')
print (df)
B C
A
0 34 88
2 45 200
3 47 65
4 32 140
For a general solution, use the first value of the first list in set_index:
L = [x.split(';') for x in input_string.split()]
df = pd.DataFrame(L[1:], columns=L[0]).set_index(L[0][0])
EDIT:
You can make A the name of the columns axis instead of the index name:
df = df.rename_axis(df.index.name, axis=1).rename_axis(None)
print (df)
A B C
0 34 88
2 45 200
3 47 65
4 32 140
import pandas as pd
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
data = input_string
df = pd.DataFrame([x.split(';') for x in data.split()])
df.columns = df.iloc[0]
df = df.iloc[1:].rename_axis(None, axis=1)
df.set_index('A',inplace = True)
df
output
B C
A
0 34 88
2 45 200
3 47 65
4 32 140
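If the goal is only to display the frame without the index column, as in the desired output in the question, one option is to keep the default index and hide it when printing. A minimal sketch using `io.StringIO` with `read_csv`, then `DataFrame.to_string(index=False)`:

```python
from io import StringIO

import pandas as pd

input_string = """A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""

# Parse the text directly; the first line becomes the column names.
df = pd.read_csv(StringIO(input_string), sep=";")

# index=False hides the row labels in the printed text only;
# the DataFrame itself still has its default integer index.
text = df.to_string(index=False)
print(text)
```

Unlike the split-based versions, `read_csv` also infers numeric dtypes, so A, B and C are integers here rather than strings.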

delete specific rows from csv using pandas

I have a csv file in the format shown below:
I have written the following code that reads the file and randomly deletes the rows that have steering value as 0. I want to keep just 10% of the rows that have steering value as 0.
df = pd.read_csv(filename, header=None, names = ["center", "left", "right", "steering", "throttle", 'break', 'speed'])
df = df.drop(df.query('steering==0').sample(frac=0.90).index)
However, I get the following error:
df = df.drop(df.query('steering==0').sample(frac=0.90).index)
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
File "mtrand.pyx", line 1104, in mtrand.RandomState.choice
(numpy/random/mtrand/mtrand.c:17062)
ValueError: a must be greater than 0
Can you guys help me?
sample DataFrame built with @andrew_reece's code
In [9]: df
Out[9]:
center left right steering throttle brake
0 center_54.jpg left_75.jpg right_39.jpg 1 0 0
1 center_20.jpg left_81.jpg right_49.jpg 3 1 1
2 center_34.jpg left_96.jpg right_11.jpg 0 4 2
3 center_98.jpg left_87.jpg right_34.jpg 0 0 0
4 center_67.jpg left_12.jpg right_28.jpg 1 1 0
5 center_11.jpg left_25.jpg right_94.jpg 2 1 0
6 center_66.jpg left_27.jpg right_52.jpg 1 3 3
7 center_18.jpg left_50.jpg right_17.jpg 0 0 4
8 center_60.jpg left_25.jpg right_28.jpg 2 4 1
9 center_98.jpg left_97.jpg right_55.jpg 3 3 0
.. ... ... ... ... ... ...
90 center_31.jpg left_90.jpg right_43.jpg 0 1 0
91 center_29.jpg left_7.jpg right_30.jpg 3 0 0
92 center_37.jpg left_10.jpg right_15.jpg 1 0 0
93 center_18.jpg left_1.jpg right_83.jpg 3 1 1
94 center_96.jpg left_20.jpg right_56.jpg 3 0 0
95 center_37.jpg left_40.jpg right_38.jpg 0 3 1
96 center_73.jpg left_86.jpg right_71.jpg 0 1 0
97 center_85.jpg left_31.jpg right_0.jpg 3 0 4
98 center_34.jpg left_52.jpg right_40.jpg 0 0 2
99 center_91.jpg left_46.jpg right_17.jpg 0 0 0
[100 rows x 6 columns]
In [10]: df.steering.value_counts()
Out[10]:
0 43 # NOTE: 43 zeros
1 18
2 15
4 12
3 12
Name: steering, dtype: int64
In [11]: df.shape
Out[11]: (100, 6)
your solution (unchanged):
In [12]: df = df.drop(df.query('steering==0').sample(frac=0.90).index)
In [13]: df.steering.value_counts()
Out[13]:
1 18
2 15
4 12
3 12
0 4 # NOTE: 4 zeros (~10% from 43)
Name: steering, dtype: int64
In [14]: df.shape
Out[14]: (61, 6)
NOTE: make sure that steering column has numeric dtype! If it's a string (object) then you would need to change your code as follows:
df = df.drop(df.query('steering=="0"').sample(frac=0.90).index)
# NOTE:                          ^ ^
after that you can save the modified (reduced) DataFrame to CSV:
df.to_csv('/path/to/filename.csv', index=False)
Here's a one-line approach, using concat() and sample():
import numpy as np
import pandas as pd
# first, some sample data
# generate filename fields
positions = ['center','left','right']
N = 100
fnames = ['{}_{}.jpg'.format(loc, np.random.randint(100)) for loc in np.repeat(positions, N)]
df = pd.DataFrame(np.array(fnames).reshape(3,100).T, columns=positions)
# generate numeric fields
values = [0,1,2,3,4]
probas = [.5,.2,.1,.1,.1]
df['steering'] = np.random.choice(values, p=probas, size=N)
df['throttle'] = np.random.choice(values, p=probas, size=N)
df['brake'] = np.random.choice(values, p=probas, size=N)
print(df.shape)
(100, 6)
The first few rows of sample output:
df.head()
center left right steering throttle brake
0 center_72.jpg left_26.jpg right_59.jpg 3 3 0
1 center_75.jpg left_68.jpg right_26.jpg 0 0 2
2 center_29.jpg left_8.jpg right_88.jpg 0 1 0
3 center_22.jpg left_26.jpg right_23.jpg 1 0 0
4 center_88.jpg left_0.jpg right_56.jpg 4 1 0
5 center_93.jpg left_18.jpg right_15.jpg 0 0 0
Now drop all but 10% of rows with steering==0:
newdf = pd.concat([df.loc[df.steering!=0],
df.loc[df.steering==0].sample(frac=0.1)])
With the probability weightings I used in this example, you'll see somewhere between 50-60 remaining entries in newdf, with about 5 steering==0 cases remaining.
Using a mask on steering combined with a random number should work:
df = df[(df.steering != 0) | (np.random.rand(len(df)) < 0.1)]
This does generate some extra random values, but it's nice and compact.
Edit: That said, I tried your example code and it worked as well. My guess is the error is coming from the fact that your df.query() statement is returning an empty dataframe, which probably means that the "steering" column does not contain any zeros, or alternatively that the column is read as strings rather than numbers. Try converting the column to integer before running the above snippet.
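Both guesses can be checked before sampling. A minimal sketch, assuming the column is named `steering` as in the question; the helper name `keep_fraction_of_zeros`, the `seed` parameter, and the sample values are hypothetical:

```python
import pandas as pd


def keep_fraction_of_zeros(df, frac=0.10, seed=None):
    """Drop all but roughly `frac` of the rows whose steering value is 0."""
    out = df.copy()
    # Coerce strings such as "0" to numbers; unparseable values become NaN.
    out["steering"] = pd.to_numeric(out["steering"], errors="coerce")
    zeros = out[out["steering"] == 0]
    # Nothing to drop when there are no zero rows, so skip the sampling.
    if zeros.empty:
        return out
    to_drop = zeros.sample(frac=1 - frac, random_state=seed).index
    return out.drop(to_drop)


raw = pd.DataFrame({"steering": ["0"] * 20 + ["1", "2", "3"]})
reduced = keep_fraction_of_zeros(raw, frac=0.10, seed=0)
print(reduced["steering"].value_counts())
```

The sketch assumes the index labels are unique, since `drop` removes every row that shares a sampled label.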
