I would like to slice a dataframe according to a condition: I want to keep the area names where the length of the code is 5 or 3.
The dataframe AreaCode is shown below:
codes area
0 113 Leeds
2 115 Nottingham
3 116 Leicester
... ... ...
596 1985 Warminster
597 1986 Bungay
598 1987 Ebbsfleet
This is the code I wrote, but it didn't work.
# print([AreaCode['codes']>4])
for i in AreaCode['codes']:
    if len(i)>4:
        print(AreaCode['area'][i])
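One likely reason the loop fails: if the codes are stored as integers, len(i) raises a TypeError, and AreaCode['area'][i] looks up the code value as an index label rather than as a row position. A minimal sketch of a mask-based alternative follows. It assumes the codes may be integers (so they are converted to strings first), and the sample frame is illustrative only: the last row is a made-up 5-digit code that is not in the original data.

```python
import pandas as pd

# Small illustrative frame; the last row is a hypothetical 5-digit code.
AreaCode = pd.DataFrame({
    "codes": [113, 115, 1985, 12345],
    "area": ["Leeds", "Nottingham", "Warminster", "Hypothetical Town"],
})

# Convert to string so a length is defined, then measure each code.
code_len = AreaCode["codes"].astype(str).str.len()

# Keep the area names whose code has exactly 3 or 5 characters.
selected = AreaCode.loc[code_len.isin([3, 5]), "area"]
print(selected.tolist())
```

The same mask can be used to keep whole rows with AreaCode[code_len.isin([3, 5])].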
STUDENT TIME SCORE WANT
JOHN 1 68 146
JOHN 2 78 146
JOHN 3 77 146
JOHN 4 91 146
JOHN 5 96 146
JAMES 1 66 119
JAMES 2 53 119
JAMES 3 80 119
JAMES 4 96 119
JAMES 5 50 119
JAMES 6 94 119
I have data with the columns 'STUDENT', 'TIME' and 'SCORE' and wish to create 'WANT'. The rule is: WANT = the sum of the SCORE values at TIME 1 and TIME 2 for each STUDENT. I would like to use VLOOKUP to find each student's SCORE values at TIMEs 1 and 2 and take their sum.
You can try SUMIFS() this way:
=SUM(SUMIFS($C$2:$C$12,$B$2:$B$12,{1,2},$A$2:$A$12,A2))
Older versions of Excel may need this entered as an array formula, confirmed with CTRL+SHIFT+ENTER.
Assuming your dataset is sorted by student name (with each student's rows grouped together) and then by time, you could use:
Classical way, in F2:
=IF(AND(B2=1,B3=2,A2=A3),C2+C3,IF(AND(B2=2,B1=1,A2=A1),C2+C1,OFFSET($F$1,MATCH(A2,A$2:A2,0),0)))
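Greedy way (Office 365 needed), in E2. The trailing -3 removes the TIME values 1 and 2, which get summed along with the scores; use ; instead of , as the argument separator if your locale requires it:
=SUM(FILTER($A$2:$C$12,($B$2:$B$12<=2)*($A$2:$A$12=A2)))-3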
I'm new to python and pandas, and trying to "learn by doing."
I'm currently working with two football/soccer (depending on where you're from!) dataframes:
player_table has several columns, among others 'player_name' and 'player_id'
player_id player_name
0 223 Lionel Messi
1 157 Cristiano Ronaldo
2 962 Neymar
match_table also has several columns, among others 'home_player_1', '..._2', '..._3' and so on, as well as the corresponding 'away_player_1', '...2' , '..._3' and so on. The content of these columns is a player_id, such that you can tell which 22 (2x11) players participated in a given match through their respective unique IDs.
I'll just post a 2 vs. 2 example here, because that works just as well:
match_id home_player_1 home_player_2 away_player_1 away_player_2
0 321 223 852 729 853
1 322 223 858 157 159
2 323 680 742 223 412
What I would like to do now is add a new column to player_table, player_table['appearances'], giving the number of appearances: the number of times each player_id occurs in the block of match_table bounded horizontally by the columns home_player_1 to away_player_2 and vertically by the first and last match.
Desired result:
player_id player_name appearances
0 223 Lionel Messi 3
1 157 Cristiano Ronaldo 1
2 962 Neymar 0
Coming from other programming languages, I think my standard solution would be a nested for loop, but I understand that is frowned upon in Python...
I have tried several solutions, but none really work. This one at least gives the number of appearances as "home_player_1":
player_table['appearances'] = player_table['player_id'].map(match_table['home_player_1'].value_counts())
Is there a way to expand the map function to include several columns in a dataframe? Or do I have to stack the 22 columns on top of one another in a new dataframe, and then map? Or is map not the appropriate function?
Would really appreciate your support, thanks!
Philipp
Edit: added specific input and desired output as requested
What you could do is use .melt() on the match_table player columns (it turns your wide table into a tall/long table with a single column). Then call .value_counts() on that one column. Finally, join the result to player_table on the 'player_id' column.
import pandas as pd
player_table = pd.DataFrame({'player_id':[223,157,962],
'player_name':['Lionel Messi','Cristiano Ronaldo','Neymar']})
match_table = pd.DataFrame({
'match_id':[321,322,323],
'home_player_1':[223,223,680],
'home_player_2':[852,858,742],
'away_player_1':[729,157,223],
'away_player_2':[853,159,412]})
player_cols = [x for x in match_table.columns if 'player_' in x]
# Melt the player columns into one long column and count each id.
# rename_axis/reset_index(name=...) yield the same column names across pandas versions.
df1 = (match_table[player_cols].melt(value_name='appearances')['appearances']
       .value_counts(sort=True).rename_axis('player_id').reset_index(name='appearances'))
# Right merge keeps players who never appear; fillna(0) sets their count to 0.
appearances_df = df1.merge(player_table, how='right', on='player_id')[['player_id','player_name','appearances']].fillna(0)
Output:
print(appearances_df)
player_id player_name appearances
0 223 Lionel Messi 3.0
1 157 Cristiano Ronaldo 1.0
2 962 Neymar 0.0
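As for whether map can be extended to several columns: one way, sketched here with the sample data from the question, is to flatten all player columns into a single Series, count once, and map the counts onto player_id. This is only an illustration of the idea. The intermediate name counts is an assumption, and casting to int is a choice made here to match the integer appearances in the desired result.

```python
import pandas as pd

player_table = pd.DataFrame({
    "player_id": [223, 157, 962],
    "player_name": ["Lionel Messi", "Cristiano Ronaldo", "Neymar"],
})
match_table = pd.DataFrame({
    "match_id": [321, 322, 323],
    "home_player_1": [223, 223, 680],
    "home_player_2": [852, 858, 742],
    "away_player_1": [729, 157, 223],
    "away_player_2": [853, 159, 412],
})

player_cols = [c for c in match_table.columns if "player_" in c]

# Flatten every player column into one 1-D Series and count each id.
counts = pd.Series(match_table[player_cols].to_numpy().ravel()).value_counts()

# Map the counts onto player_id; ids that never occur become 0.
player_table["appearances"] = (
    player_table["player_id"].map(counts).fillna(0).astype(int)
)
print(player_table)
```

Because the counting happens once over all columns, the same code should work unchanged with the full 22 player columns.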
I am new to Python and would like to find the difference between two columns of a dataframe.
What I want is the difference between two columns together with a corresponding third column. For example, I have a dataframe Soccer which lists all the teams playing soccer, with the goals for and against each club. I want to find the goal difference along with the team name, i.e. Goals Diff = GoalsFor - GoalsAgainst.
Pos Team Seasons Points GamesPlayed GamesWon GamesDrawn \
0 1 Real Madrid 86 5656 2600 1647 552
1 2 Barcelona 86 5435 2500 1581 573
2 3 Atletico Madrid 80 5111 2614 1241 598
GamesLost GoalsFor GoalsAgainst
0 563 5947 3140
1 608 5900 3114
2 775 4534 3309
I tried creating a function and then iterating through each row of a dataframe as below:
def Goal_diff_count_Formal(gFor, gAgainst, team):
    goalsDifference = gFor - gAgainst
    return [team, goalsDifference]

# Collect one [team, goal difference] pair per row
total = []
for index, row in football.iterrows():
    goalsFor = row['GoalsFor']
    goalsAgainst = row['GoalsAgainst']
    teamName = row['Team']
    total.append(Goal_diff_count_Formal(int(goalsFor), int(goalsAgainst), teamName))
However, I would like to know if there is a quicker way to get this, something like:
dataframe['goalsFor'] - dataframe['goalsAgainst'] #along with the team name in the dataframe
Solution if the values in the Team column are unique: set Team as the index, compute the difference, and select a team by its index label:
df = df.set_index('Team')
s = df['GoalsFor'] - df['GoalsAgainst']
print (s)
Team
Real Madrid 2807
Barcelona 2786
Atletico Madrid 1225
dtype: int64
print (s['Atletico Madrid'])
1225
Solution if the Team column may contain duplicated values:
I believe you need to group by Team and aggregate with sum first, and then compute the difference:
#change sample data for Team in row 3
print (df)
Pos Team Seasons Points GamesPlayed GamesWon GamesDrawn \
0 1 Real Madrid 86 5656 2600 1647 552
1 2 Barcelona 86 5435 2500 1581 573
2 3 Real Madrid 80 5111 2614 1241 598
GamesLost GoalsFor GoalsAgainst
0 563 5947 3140
1 608 5900 3114
2 775 4534 3309
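# Select several columns with a list (double brackets); tuple-style selection fails on recent pandas
df = df.groupby('Team')[['GoalsFor','GoalsAgainst']].sum()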
df['diff'] = df['GoalsFor'] - df['GoalsAgainst']
print (df)
GoalsFor GoalsAgainst diff
Team
Barcelona 5900 3114 2786
Real Madrid 10481 6449 4032
EDIT:
s = df['GoalsFor'] - df['GoalsAgainst']
print (s)
Team
Barcelona 2786
Real Madrid 4032
dtype: int64
print (s['Barcelona'])
2786
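If you would rather keep Team as an ordinary column instead of moving it to the index, a minimal sketch along the same lines is shown below. It uses only the three relevant columns from the sample data, and the column name GoalDiff is a name chosen here for illustration.

```python
import pandas as pd

football = pd.DataFrame({
    "Team": ["Real Madrid", "Barcelona", "Atletico Madrid"],
    "GoalsFor": [5947, 5900, 4534],
    "GoalsAgainst": [3140, 3114, 3309],
})

# Vectorised subtraction over whole columns; no row loop is needed.
result = football.assign(
    GoalDiff=football["GoalsFor"] - football["GoalsAgainst"]
)[["Team", "GoalDiff"]]
print(result)
```

assign returns a new frame, so football itself is left unchanged.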
I made a CSV file using pandas and am trying to use it as input for the next step. When I open the file using pandas, it looks like this:
Unnamed: 0 Class_Name Probe_Name small_example1.csv small_example2.csv small_example3.csv
0 0 Endogenous CCNO 196 32 18
1 1 Endogenous MYC 962 974 1114
2 2 Endogenous CD79A 390 115 178
3 3 Endogenous FSTL3 67 101 529
4 4 Endogenous VCAN 943 735 9226
I want to make a plot, and to do so I have to change the data structure:
1- I want to remove the Unnamed column.
2- Then I want to make a data frame for a heatmap, using the columns "Probe_Name", "small_example1.csv", "small_example2.csv" and "small_example3.csv".
3- I also want to transpose the data frame.
Here is the expected output:
Probe_Name CCNO MYC CD79A FSTL3 VCAN
small_example1.csv 196 962 390 67 943
small_example2.csv 32 974 115 101 735
small_example3.csv 18 1114 178 529 9226
I tried to do that using the following code:
df = pd.read_csv('myfile.csv')
result = df.transpose()
but it does not return what I want. Do you know how to fix it?
df.drop(['Unnamed: 0','Class_Name'],axis=1).set_index('Probe_Name').T
Result:
Probe_Name CCNO MYC CD79A FSTL3 VCAN
small_example1.csv 196 962 390 67 943
small_example2.csv 32 974 115 101 735
small_example3.csv 18 1114 178 529 9226
Here's a suggestion:
Changes 1 & 2 can be tackled in one go:
df = df.loc[:, ["Probe_Name", "small_example1.csv", "small_example2.csv", "small_example3.csv"]] # This only retains the specified columns
In order for change 3 (transposing) to work as desired, the column Probe_Name needs to be set as your index:
df = df.set_index("Probe_Name", drop=True)
df = df.transpose()
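A related note on the "Unnamed: 0" column: it typically appears when a frame is saved with to_csv including its index and is then read back without saying so. The sketch below assumes that is how the file was produced. It uses an in-memory string as a stand-in for myfile.csv (only the first three sample rows) and reads the first column as the index, so there is nothing to drop afterwards. Writing the file with index=False in the first place would be another option.

```python
import io
import pandas as pd

# Stand-in for the CSV file; the empty first header is the saved index.
csv_text = """,Class_Name,Probe_Name,small_example1.csv,small_example2.csv,small_example3.csv
0,Endogenous,CCNO,196,32,18
1,Endogenous,MYC,962,974,1114
2,Endogenous,CD79A,390,115,178
"""

# index_col=0 reads the first column as the index, so no "Unnamed: 0" appears.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Drop the unused column, index by probe name, then transpose.
heatmap_df = df.drop(columns="Class_Name").set_index("Probe_Name").T
print(heatmap_df)
```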
I am dealing with a dataset that shows duplicate stock per part and location. Orders from multiple customers are coming in and the stock was just added via a vlookup. I need help writing some sort of looping function in python that cumulatively decreases the stock quantity by the order quantity.
Currently data looks like this:
SKU Plant Order Stock
0 5455 989 2 90
1 5455 989 15 90
2 5455 990 10 80
3 5455 990 20 80
I want to accomplish this:
SKU Plant Order Stock
0 5455 989 2 88
1 5455 989 15 73
2 5455 990 10 70
3 5455 990 20 50
Try:
df.Stock -= df.groupby(['SKU','Plant'])['Order'].cumsum()
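To see why this works, here is a self-contained sketch on the sample data that keeps the running total as a separate step. The intermediate name ordered_so_far is chosen here for illustration. It assumes the rows are already in the order in which the orders should consume stock.

```python
import pandas as pd

df = pd.DataFrame({
    "SKU": [5455, 5455, 5455, 5455],
    "Plant": [989, 989, 990, 990],
    "Order": [2, 15, 10, 20],
    "Stock": [90, 90, 80, 80],
})

# Running total of ordered quantity within each (SKU, Plant) group.
ordered_so_far = df.groupby(["SKU", "Plant"])["Order"].cumsum()

# Remaining stock after each order is the starting stock minus that total.
df["Stock"] = df["Stock"] - ordered_so_far
print(df)
```

The cumulative sum restarts for every SKU and Plant combination, which is what makes the stock decrease independently per location.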