Fill dataframe with duplicate data until a certain condition is met - python-3.x

I have a data frame df like this:
id name age duration
1 ABC 20 12
2 sd 50 150
3 df 54 40
I want to duplicate this data in the same df until the duration sum is greater than or equal to 300, so the df can look like this:
id name age duration
1 ABC 20 12
2 sd 50 150
3 df 54 40
2 sd 50 150
So far I have tried the code below, but it sometimes runs in an infinite loop. Please help.
def fillPlaylist(df,duration):
    print("inside fill playlist fn.")
    if(len(df)==0):
        print("df len is 0, cannot fill.")
        return df;
    receivedDf= df
    print("receivedDf",receivedDf,flush=True)
    print("Received df len = ",len(receivedDf),flush=True)
    print("duration to fill ",duration,flush=True)
    while df['duration'].sum() < duration:
        # random 5% sample of data.
        print("filling")
        ramdomSampleDuplicates = receivedDf.sample(frac=0.05).reset_index(drop=True)
        df = pd.concat([ramdomSampleDuplicates,df])
        print("df['duration'].sum() ",df['duration'].sum())
        print("after filling df len = ",len(df))
    return df;

Try using n instead of frac. n samples an exact number of rows from your dataframe, while frac samples a fraction of them: with only 3 rows, frac=0.05 rounds down to zero rows per sample, so nothing is ever added and the loop never terminates. That is why it sometimes runs forever.
sample_df = df.sample(n=1).reset_index(drop=True)
To use frac, you can rewrite your code this way:
def fillPlaylist(df,duration):
    while df.duration.sum() < duration:
        sample_df = df.sample(frac=0.5).reset_index(drop=True)
        df = pd.concat([df,sample_df])
    return df
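Putting the two ideas together, here is a minimal sketch of a fixed function (fill_playlist and target_duration are illustrative names, not from the original post):
import pandas as pd

def fill_playlist(df, target_duration):
    # sample a fixed number of rows (n=1) so the sample is never empty,
    # and keep appending until the total duration reaches the target
    if df.empty:
        return df
    source = df.copy()  # always sample from the original rows
    while df['duration'].sum() < target_duration:
        df = pd.concat([df, source.sample(n=1)], ignore_index=True)
    return df

df = pd.DataFrame({'id': [1, 2, 3],
                   'name': ['ABC', 'sd', 'df'],
                   'age': [20, 50, 54],
                   'duration': [12, 150, 40]})
print(fill_playlist(df, 300))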

Related

Finding intervals in pandas dataframe based on values in another dataframe

I have two data frames. One dataframe (A) looks like:
Name. gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
The other dataframe (B) looks like
ID_sim. position string
1 89 aa
4 568 bb
5 938437 cc
I want to accomplish two tasks here:
I want to get a list of the indices of rows in dataframe B whose position column falls within any interval (given by the start_coordinate and end_coordinate columns) in dataframe A.
The result for this task will be:
lst = [0, 1]  # row 0 of B falls in the interval of A's first row, and row 1 of B falls in the interval of A's third row
Using the indices from task 1, I want to keep those rows of dataframe B to create a new dataframe. Thus, the new dataframe will look like:
position string
89 aa
568 bb
I used .between() to accomplish this task. The code is as follows:
lst=dfB[dfB['position'].between(dfA.loc[0,'start_coordinate'],dfA.loc[len(dfA)-1,'end_coordinate'])].index.tolist()
result=dfB[dfB.index.isin(lst)]
result.shape
However, when I run this piece of code I get the following error:
KeyError: 0
What could possibly be raising this error? And how can I solve this?
We can try numpy broadcasting here:
s, e = dfA[['start_coordinate', 'end_coordinate']].to_numpy().T
p = dfB['position'].to_numpy()[:, None]
dfB[((p >= s) & (p <= e)).any(axis=1)]
ID_sim. position string
0 1 89 aa
1 4 568 bb
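If you also need the explicit list of indices from task 1, the same boolean mask produces it (a small extension, not part of the original answer):
mask = ((p >= s) & (p <= e)).any(axis=1)
lst = dfB.index[mask].tolist()  # [0, 1]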
You could use pandas IntervalIndex to locate the positions, and afterwards use a boolean mask to pull the relevant rows from B:
Create IntervalIndex:
intervals = pd.IntervalIndex.from_tuples([*zip(A['start_coordinate'],
                                               A['end_coordinate'])],
                                         closed='both')
Get indexers for B.position, create a boolean array with the values and filter B:
# get_indexer returns -1 if an index is not found.
B.loc[intervals.get_indexer(B.position) >= 0]
Out[140]:
ID_sim. position string
0 1 89 aa
1 4 568 bb
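For reference, here is a self-contained version of this approach, rebuilding the question's frames; note that get_indexer assumes the intervals do not overlap:
import pandas as pd

dfA = pd.DataFrame({'Name.': ['Peter', 'Hugo', 'Jennie'],
                    'gender': ['M', 'M', 'F'],
                    'start_coordinate': [30, 4500, 300],
                    'end_coordinate': [150, 6000, 700],
                    'ID': [1, 2, 3]})
dfB = pd.DataFrame({'ID_sim.': ['1', '4', '5'],
                    'position': [89, 568, 938437],
                    'string': ['aa', 'bb', 'cc']})

intervals = pd.IntervalIndex.from_tuples(
    list(zip(dfA['start_coordinate'], dfA['end_coordinate'])), closed='both')
print(dfB.loc[intervals.get_indexer(dfB['position']) >= 0])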
This should work. It is less elegant but easier to comprehend: cross-join the two frames so every position is compared against every interval, then filter.
import pandas as pd
data = [['Name.','gender', 'start_coordinate','end_coordinate','ID'],
        ['Peter','M',30,150,1],
        ['Hugo','M',4500,6000,2],
        ['Jennie','F',300,700,3]]
data2 = [['ID_sim.','position','string'],
         ['1',89,'aa'],
         ['4',568,'bb'],
         ['5',938437,'cc']]
df1 = pd.DataFrame(data[1:], columns=data[0])
df2 = pd.DataFrame(data2[1:], columns=data2[0])
# cross join pairs every row of df1 with every row of df2 (pandas >= 1.2);
# merging on the row index would only compare row i of A with row i of B
merged = pd.merge(df1, df2, how='cross')
print(merged[(merged['position'] >= merged['start_coordinate']) & (merged['position'] <= merged['end_coordinate'])])

Apply a function to every row of a dataframe and store the data in a list/DataFrame in Python

I have the following simplified version of the code:
import pandas as pd
def myFunction(portf, Val):
    mydata = {portf: [Val, Val * 2, Val * 3, Val * 4]}
    df = pd.DataFrame(mydata, columns=[portf])
    return df
data = {'Portfolio': ['Book1', 'Book2', 'Book1', 'Book2'],
        'Value': [10, 5, 6, 11]}
df_input = pd.DataFrame(data, columns=['Portfolio', 'Value'])
df_output = myFunction(df_input['Portfolio'][0], df_input['Value'][0])
df_output1 = myFunction(df_input['Portfolio'][1], df_input['Value'][1])
df_output2 = myFunction(df_input['Portfolio'][2], df_input['Value'][2])
df_output3 = myFunction(df_input['Portfolio'][3], df_input['Value'][3])
What I would like is to concatenate all the df_output frames into a single list, or even better into a dataframe, in an efficient way, as the df_input dataframe will have 100+ columns.
I tried to apply the following:
df_input.apply(lambda row : myFunction(row['Portfolio'], row['Value']), axis = 1)
However, all the results end up in a single column.
Any idea how to achieve that?
Thanks
You can use pd.concat to store all results in a single dataframe:
pd.concat([myFunction(row['Portfolio'], row['Value'])
           for _, row in df_input.iterrows()], axis=1)
First you build a list of pd.DataFrames with a list comprehension (you could also use a normal loop). Then you concat all DataFrames along axis=1.
Output:
Book1 Book2 Book1 Book2
0 10 5 6 11
1 20 10 12 22
2 30 15 18 33
3 40 20 24 44
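Alternatively, the asker's .apply attempt can be salvaged: when the applied function returns a DataFrame, apply yields a Series of DataFrames, which can then be concatenated (a small sketch, not part of the original answer):
outputs = df_input.apply(lambda row: myFunction(row['Portfolio'], row['Value']), axis=1)
df_output = pd.concat(outputs.tolist(), axis=1)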
You mentioned df_input has many more columns in the original dataframe. To account for this you need another loop (minimal example):
data = {'Portfolio': ['Book1', 'Book2', 'Book1', 'Book2'],
        'Value': [10, 5, 6, 11]}
df_input = pd.DataFrame(data, columns=['Portfolio', 'Value'])
df_input['Value2'] = df_input['Value'] * 100
pd.concat([myFunction(row['Portfolio'], row[col])
           for col in df_input.columns if col != 'Portfolio'
           for (_, row) in df_input.iterrows()], axis=1)
Output:
Book1 Book2 Book1 Book2 Book1 Book2 Book1 Book2
0 10 5 6 11 1000 500 600 1100
1 20 10 12 22 2000 1000 1200 2200
2 30 15 18 33 3000 1500 1800 3300
3 40 20 24 44 4000 2000 2400 4400
You might want to rename the columns or aggregate the resulting dataframe in some other way. But for this I had to guess (and I try not to guess in the face of ambiguity).

Get total of Pandas column and row

I have a Pandas data frame, as shown below,
a b c
A 100 60 60
B 90 44 44
A 70 50 50
Now, I would like to get the totals by column and row, skipping c, as shown below:
a b sum
A 170 110 280
B 90 44 134
I do not know how to do this; please help. Thank you!
My example dataframe is:
df = pd.DataFrame(dict(a=[100, 90, 70], b=[60, 44, 50], c=[60, 44, 50]), index=["A", "B", "A"])
(
    df.groupby(level=0)[['a', 'b']].sum()
      .assign(sum=lambda x: x.sum(axis=1))
)
Use:
# remove unnecessary column
df = df.drop(columns='c')
# get sum of rows
df['sum'] = df.sum(axis=1)
# get sum per index label (df.sum(level=0) in older pandas)
df = df.groupby(level=0).sum()
print(df)
a b sum
A 170 110 280
B 90 44 134
df["sum"] = df[["a","b"]].sum(axis=1) #Column-wise sum of "a" and "b"
df[["a", "b", "sum"]] #show all columns but not "c"
The pandas way is:
# create sum column
df['sum'] = df['a'] + df['b']
# remove column c
df = df[['a', 'b', 'sum']]
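Note that the last two snippets only compute the row-wise sum; to reproduce the desired output, which also aggregates the duplicated index A, the result still needs to be grouped by index (a small addition, assuming the example df above):
df['sum'] = df['a'] + df['b']
print(df[['a', 'b', 'sum']].groupby(level=0).sum())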

I'm not able to add a column for all rows in a pandas dataframe

I'm pretty new to Python / pandas, so it's probably a pretty simple question, but I can't handle it:
I have two dataframes loaded from Oracle SQL. One has 300 rows / 2 columns and the second has one row / one column. I would like to add the column from the second dataset to every row of the first as a new column, but I can only get it for the first row and the others are NaN.
import cx_Oracle
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.externals import joblib

dsn_tns = cx_Oracle.makedsn('127.0.1.1', '1521', 'orcl')
conn = cx_Oracle.connect(user='MyName', password='MyPass', dsn=dsn_tns)
d_score = pd.read_sql_query(
    '''
    SELECT
         ID
        ,RESULT
        ,RATIO_A
        ,RATIO_B
    from ORCL_DATA
    ''', conn)  # returns 380 rows
d_score['ID'] = d_score['ID'].astype(int)
d_score['RESULT'] = d_score['RESULT'].astype(int)
d_score['RATIO_A'] = d_score['RATIO_A'].astype(float)
d_score['RATIO_B'] = d_score['RATIO_B'].astype(float)
d_score_features = d_score.iloc[:, 2:4]
#d_train_target = d_score.iloc[:, 1:2]  # target is RESULT
DM_train = xgb.DMatrix(data=d_score_features)
loaded_model = joblib.load("bst.dat")
pred = loaded_model.predict(DM_train)
i = pd.DataFrame({'ID': d_score['ID'], 'Probability': pred})
print(i)
s = pd.read_sql_query('''select max(id_process) as MAX_ID_PROCESS from PROCESS''', conn)  # returns only 1 row
m = pd.DataFrame(data=s, dtype=np.int64, columns=['MAX_ID_PROCESS'])
print(m)
i['new'] = m  # trying to add MAX_ID_PROCESS to all rows
print(i)
i =
ID Probability
0 20101 0.663083
1 20105 0.486774
2 20106 0.441300
3 20278 0.703176
4 20221 0.539185
....
379 20480 0.671976
m =
MAX_ID_PROCESS
0 274
i =
ID_MATCH Probability new
0 20101 0.663083 274.0
1 20105 0.486774 NaN
2 20106 0.441300 NaN
3 20278 0.703176 NaN
4 20221 0.539185 NaN
I need value 'new' for all rows...
Since your second dataframe only has one value, you can assign it like this:
df1['new'] = df2.MAX_ID_PROCESS[0]
# Or using .loc
df1['new'] = df2.MAX_ID_PROCESS.loc[0]
In your case, it should be:
i['new'] = m.MAX_ID_PROCESS[0]
You should now see:
ID Probability new
0 20101 0.663083 274.0
1 20105 0.486774 274.0
2 20106 0.441300 274.0
3 20278 0.703176 274.0
4 20221 0.539185 274.0
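The reason the original assignment produced NaNs is index alignment: i['new'] = m aligns m on the row index, and m only has index 0, so every other row gets NaN. Extracting the scalar first sidesteps the alignment. A tiny demonstration with made-up frames:
import pandas as pd

i = pd.DataFrame({'ID': [20101, 20105, 20106]})
m = pd.DataFrame({'MAX_ID_PROCESS': [274]})

i['aligned'] = m['MAX_ID_PROCESS']         # aligns on index -> NaN beyond row 0
i['scalar'] = m['MAX_ID_PROCESS'].iloc[0]  # scalar broadcasts to every row
print(i)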
As we know, we can append one column of dataframe1 to dataframe2 as a new column using: dataframe2["new_column_name"] = dataframe1["column_to_copy"].
We can extend this approach to solve your problem.
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1["ColA"] = [1, 12, 32, 24,12]
df1["ColB"] = [23, 11, 6, 45,25]
df1["ColC"] = [10, 25, 3, 23,15]
print(df1)
Output:
ColA ColB ColC
0 1 23 10
1 12 11 25
2 32 6 3
3 24 45 23
4 12 25 15
Now we create a new dataframe and add a row to it.
df3 = pd.DataFrame()
df3["ColTest"] = [1]
Now we store the value of the first row of the second dataframe as we want to add it to all the rows in dataframe1 as a new column:
val = df3.iloc[0]
print(val)
Output:
ColTest 1
Name: 0, dtype: int64
Now, we will store this value for as many rows as we have in dataframe1.
rows = len(df1)
for row in range(rows):
    df3.loc[row] = val
print(df3)
Output:
ColTest
0 1
1 1
2 1
3 1
4 1
Now we will append this column to the first dataframe, which solves the problem:
df1["ColTest"] = df3["ColTest"]
print(df1)
Output:
ColA ColB ColC ColTest
0 1 23 10 1
1 12 11 25 1
2 32 6 3 1
3 24 45 23 1
4 12 25 15 1

Scraping an html table with beautiful soup into pandas

I'm trying to scrape an html table using beautiful soup and import it into pandas -- http://www.baseball-reference.com/teams/NYM/2017.shtml -- the "Team Batting" table.
Finding the table is no problem:
table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')
Finding the rows of data isn't a problem either:
for i in table.findAll('tr')[2]:  # increase to 3 to get the next row in the table...
    print(i.get_text())
And I can even find the header names:
table_head = table.find('thead')
for i in table_head.findAll('th'):
    print(i.get_text())
Now I'm having trouble putting everything together into a data frame. Here's what I have so far:
header = []
for th in table_head.findAll('th'):
    key = th.get_text()
    header.append(key)
row = []
for tr in table.findAll('tr')[2]:
    value = tr.get_text()
    row.append(value)
od = OrderedDict(zip(header, row))
df = pd.DataFrame(od, index=[0])
This only works for one row at a time. My question is how can I do this for every row in the table at the same time?
I have tested that the code below will work for your purposes. Basically you need to create a list, loop over the players, and use that list to populate a DataFrame. It is advisable not to create the DataFrame row by row, as that will probably be significantly slower.
import collections as co
import pandas as pd
from bs4 import BeautifulSoup

with open('team_batting.html','r') as fin:
    soup = BeautifulSoup(fin.read(),'lxml')

table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')
table_head = table.find('thead')

header = []
for th in table_head.findAll('th'):
    key = th.get_text()
    header.append(key)

# loop over the table to count rows with '' in the first column
endrows = 0
for tr in table.findAll('tr'):
    if tr.findAll('th')[0].get_text() == '':
        endrows += 1

rows = len(table.findAll('tr'))
rows -= endrows + 1  # there is a pernicious final row that begins with 'Rk'

list_of_dicts = []
for row in range(rows):
    the_row = []
    try:
        table_row = table.findAll('tr')[row]
        for tr in table_row:
            value = tr.get_text()
            the_row.append(value)
        od = co.OrderedDict(zip(header, the_row))
        list_of_dicts.append(od)
    except AttributeError:
        continue

df = pd.DataFrame(list_of_dicts)
This solution uses only pandas, but it cheats a little by knowing in advance that the team batting table is the tenth table on the page. With that knowledge, it uses pandas's read_html function and grabs the tenth DataFrame from the list of returned DataFrame objects. Everything after that is just data cleaning:
import pandas as pd
url = 'http://www.baseball-reference.com/teams/NYM/2017.shtml'
# Take 10th dataframe
team_batting = pd.read_html(url)[9]
# Take columns whose names don't contain "Unnamed"
team_batting.drop([x for x in team_batting.columns if 'Unnamed' in x], axis=1, inplace=True)
# Remove the rows that are just a copy of the headers/columns
team_batting = team_batting.loc[team_batting.apply(lambda x: x != team_batting.columns, axis=1).all(axis=1), :]
# Take out the Totals rows
team_batting = team_batting.loc[~team_batting.Rk.isnull(), :]
# Get a glimpse of the data
print(team_batting.head(5))
# Rk Pos Name Age G PA AB R H 2B ... OBP SLG OPS OPS+ TB GDP HBP SH SF IBB
# 0 1 C Travis d'Arnaud 28 12 42 37 6 10 2 ... .357 .541 .898 144 20 1 1 0 0 1
# 1 2 1B Lucas Duda* 31 13 50 42 4 10 2 ... .360 .571 .931 153 24 1 0 0 0 2
# 2 3 2B Neil Walker# 31 14 62 54 5 12 3 ... .306 .278 .584 64 15 2 0 0 1 0
# 3 4 SS Asdrubal Cabrera# 31 15 67 63 10 17 2 ... .313 .397 .710 96 25 0 0 0 0 0
# 4 5 3B Jose Reyes# 34 15 59 53 3 5 2 ... .186 .132 .319 -9 7 0 0 0 0 0
I hope this helps.
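As a side note, if you would rather not hard-code the table's position in the list, read_html accepts a match argument that keeps only tables whose text matches a pattern (assuming here, as a guess, that 'OBP' appears only in the batting table):
import pandas as pd

url = 'http://www.baseball-reference.com/teams/NYM/2017.shtml'
# keep only tables containing the text 'OBP' (assumed unique to the batting table)
tables = pd.read_html(url, match='OBP')
team_batting = tables[0]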
