Selecting rows in a data.frame in which a certain column has values starting with one of a set of prefixes

I have a data.frame of the type:
> head(engschools)
RECTYPE LEA ESTAB URN SCHNAME TOWN PCODE
1 1 919 2028 138231 Alban City School n.a. E1 3RR
2 1 919 4003 138582 Samuel Ryder Academy St Albans AL1 5AR
3 1 919 2004 138201 Hatfield Community Free School Hatfield AL10 8ES
4 2 919 7012 117671 St Luke's School n.a BR3 7ET
5 1 919 2018 138561 Harpenden Free School Redbourn AL3 7QA
6 2 919 7023 117680 Lakeside School Welwyn Garden City AL8 6YN
And a set of prefixes like this one:
>head(prefixes)
E
AL
I would like to select the rows of the data.frame engschools whose value in column PCODE starts with one of the prefixes in prefixes. The correct result would thus contain rows 1:3 and 5:6 but not row 4.

You can try something like this:
engschools[grep(paste0("^", prefixes, collapse="|"), engschools$PCODE), ]
# RECTYPE LEA ESTAB URN SCHNAME TOWN PCODE
# 1 1 919 2028 138231 Alban City School n.a. E1 3RR
# 2 1 919 4003 138582 Samuel Ryder Academy St Albans AL1 5AR
# 3 1 919 2004 138201 Hatfield Community Free School Hatfield AL10 8ES
# 5 1 919 2018 138561 Harpenden Free School Redbourn AL3 7QA
# 6 2 919 7023 117680 Lakeside School Welwyn Garden City AL8 6YN
Here, we have used:
paste0 to create our search pattern (in this case, "^E|^AL").
grep to identify the row indexes whose PCODE matches that pattern.
Basic [-style indexing to extract the relevant rows.
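To make the pattern-building step concrete, here is a minimal sketch of the same idea using Python's standard re module. It is only an illustration, not part of the R answer: the postcodes list is copied from the sample data, and re.escape is an extra precaution (not in the R code) in case a prefix ever contains regex metacharacters.

```python
import re

prefixes = ["E", "AL"]
postcodes = ["E1 3RR", "AL1 5AR", "AL10 8ES", "BR3 7ET", "AL3 7QA", "AL8 6YN"]

# Same idea as paste0("^", prefixes, collapse="|"):
# anchor every prefix at the start of the string and join with "|".
pattern = "|".join("^" + re.escape(p) for p in prefixes)

# 1-based positions of the matching values, like the indexes grep() returns.
matching_rows = [
    i for i, code in enumerate(postcodes, start=1) if re.search(pattern, code)
]

print(pattern)
print(matching_rows)
```

The anchor "^" is what turns "contains" into "starts with"; without it, a prefix such as E would also match any postcode with an E later in the string.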

Related

Extract the mapping dictionary between two columns in pandas

I have a dataframe as shown below.
df:
id player country_code country
1 messi arg argentina
2 neymar bra brazil
3 tevez arg argentina
4 aguero arg argentina
5 rivaldo bra brazil
6 owen eng england
7 lampard eng england
8 gerrard eng england
9 ronaldo bra brazil
10 marria arg argentina
From the above df, I would like to extract the mapping dictionary that relates the country_code column to the country column.
Expected Output:
d = {'arg':'argentina', 'bra':'brazil', 'eng':'england'}
Dictionary keys are unique, so you can set country_code as the index (even though it contains duplicates), select the country column, and convert the resulting Series to a dictionary:
d = df.set_index('country_code')['country'].to_dict()
If a country_code can be paired with more than one country, the last country seen for that country_code is the one kept in the dictionary.
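A small self-contained sketch of this behaviour, using a shortened copy of the sample data. The second frame (conflict) is a made-up example, added only to show which value survives when one code maps to two different countries.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["messi", "neymar", "tevez", "owen"],
    "country_code": ["arg", "bra", "arg", "eng"],
    "country": ["argentina", "brazil", "argentina", "england"],
})

# Duplicate index labels collapse into a single dictionary key.
d = df.set_index("country_code")["country"].to_dict()
print(d)

# Hypothetical conflicting data: the same code paired with two spellings.
conflict = pd.DataFrame({
    "country_code": ["arg", "arg"],
    "country": ["argentina", "ARG"],
})
last_wins = conflict.set_index("country_code")["country"].to_dict()
print(last_wins)
```

If the first value should win instead, dropping duplicates on country_code before building the dictionary is one option.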

In Excel how can a formula verify whether the column location or column element has taken the correct data from its header name?

The input data is in Sheet1 and the output is calculated in Sheet2.
The data in Sheet1 is entered by the user, so the 'Units1' and 'Units2' columns may not always be in columns C and D. Suppose a new user enters data with 'Avocado' and 'Banana' in columns C and D: the 'Output' calculation in Sheet2 will then be incorrect, because we always want to use Units1 and Units2 for the calculation.
How can I fix this so that, every time data is entered, the formula checks that the correct columns are used for the calculation?
Is there a way to use INDEX, the LOOKUP family of functions, or any other function for this?
Maybe by creating a new sheet with a table of indexes that refer to (or point to) the column names of the Data sheet?
Location      Dates     Units1  Units2  Avocado  Banana
New York      05-01-18  10      12      1        2
Los Angeles   02-02-18  20      23      1        2
Chicago       08-03-18  30      34      1        2
Houston       05-04-18  40      45      1        2
Phoenix       02-05-18  50      56      1        2
Philadelphia  08-06-18  60      67      1        2
San Antonio   05-07-18  70      78      1        2
San Diego     02-08-18  80      89      1        2
Dallas        08-09-18  90      99      1        2
San Jose      05-10-18  100     112     1        2
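Use INDEX/MATCH. MATCH finds the position of the "Units1" or "Units2" header in row 1, and INDEX returns the value at that position in row 2, so the formula no longer depends on which column the data sits in: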
=INDEX(2:2,1,MATCH("Units2",$1:$1,0))/INDEX(2:2,1,MATCH("Units1",$1:$1,0))
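The lookup-by-header idea can be sketched outside Excel as well. The following is a rough Python analogue, not a translation of the formula: value_by_header is a hypothetical helper, and the header and row lists are sample data with the columns deliberately moved around.

```python
# Header row and one data row, with Units1/Units2 no longer in the 3rd/4th position.
header = ["Location", "Dates", "Avocado", "Banana", "Units1", "Units2"]
row = ["New York", "05-01-18", 1, 2, 10, 12]


def value_by_header(header, row, name):
    """Return the value in `row` that sits under the column called `name`."""
    # header.index(name) plays the role of MATCH(name, $1:$1, 0);
    # indexing into the row plays the role of INDEX(2:2, 1, ...).
    return row[header.index(name)]


ratio = value_by_header(header, row, "Units2") / value_by_header(header, row, "Units1")
print(ratio)
```

Like MATCH with an exact-match argument, header.index raises an error when the header is missing, which is usually preferable to silently reading the wrong column.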

Question about excel columns csv file how to combine columns

I have a quick question. I have a table like this, with each player's name and the percentage of matches won:
Rank Country Name Matches Won %
1 ESP ESP Rafael Nadal 89.06%
2 SRB SRB Novak Djokovic 83.82%
3 SUI SUI Roger Federer 83.61%
4 RUS RUS Daniil Medvedev 73.75%
5 AUT AUT Dominic Thiem 72.73%
6 GRE GRE Stefanos Tsitsipas 67.95%
7 JPN JPN Kei Nishikori 67.44%
and I have another table like this, with the ace percentage:
Rank Country Name Ace %
1 USA USA John Isner 26.97%
2 CRO CRO Ivo Karlovic 25.47%
3 USA USA Reilly Opelka 24.81%
4 CAN CAN Milos Raonic 24.63%
5 USA USA Sam Querrey 20.75%
6 AUS AUS Nick Kyrgios 20.73%
7 RSA RSA Kevin Anderson 17.82%
8 KAZ KAZ Alexander Bublik 17.06%
9 FRA FRA Jo Wilfried Tsonga 14.29%
---------------------------------------
85 ESP ESP RAFAEL NADAL 6.85%
My question is: can I align the two tables, keeping the order based on matches won? For example, I want to have:
Rank Country Name Matches% Aces %
1 ESP RAFAEL NADAL 89.06% 6.85%
and so on for all the players.
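I agree with the comment above that it would be easiest to import both data sets and then use XLOOKUP() to add the Aces % column to the first one. Each table has its own ranking (Rafael Nadal is rank 1 in the first table and rank 85 in the second), so look up by player name rather than by rank. If you import the first data set to Sheet1 and the second to Sheet2, and both have the name in column C and the percentage in column D, the XLOOKUP() in Sheet1 column E would look something like:
=XLOOKUP(C2, Sheet2!C:C, Sheet2!D:D)
XLOOKUP's default matching is not case-sensitive, so "RAFAEL NADAL" still matches "Rafael Nadal".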
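For comparison, here is a minimal sketch of the same join in plain Python, using a few rows from the two tables. The key point it illustrates is matching on the player name, since each table has its own ranking; lower-casing the name is an assumption made here so that "RAFAEL NADAL" and "Rafael Nadal" are treated as the same player.

```python
# (rank, country, name, matches won %)
matches_won = [
    (1, "ESP", "Rafael Nadal", "89.06%"),
    (2, "SRB", "Novak Djokovic", "83.82%"),
]
# (rank, country, name, ace %); ranks here are unrelated to the ranks above.
aces = [
    (1, "USA", "John Isner", "26.97%"),
    (85, "ESP", "RAFAEL NADAL", "6.85%"),
]

# Index the second table by lower-cased name for a case-insensitive lookup.
ace_by_name = {name.lower(): pct for _, _, name, pct in aces}

# Keep the order of the first table and append the ace % (None if not found).
combined = [
    (rank, country, name, won, ace_by_name.get(name.lower()))
    for rank, country, name, won in matches_won
]

for line in combined:
    print(line)
```

Players who appear only in the first table end up with None in the last column, which is roughly what a not-found result looks like in a spreadsheet lookup.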

difference between two column of a dataframe

I am new to Python and would like to find the difference between two columns of a dataframe.
What I want is the difference between two columns together with a corresponding third column. For example, I have a dataframe Soccer that lists all the teams playing soccer, with the goals for and against each club. I want to get the goal difference along with the team name, i.e. Goals Diff = GoalsFor - GoalsAgainst.
Pos Team Seasons Points GamesPlayed GamesWon GamesDrawn \
0 1 Real Madrid 86 5656 2600 1647 552
1 2 Barcelona 86 5435 2500 1581 573
2 3 Atletico Madrid 80 5111 2614 1241 598
GamesLost GoalsFor GoalsAgainst
0 563 5947 3140
1 608 5900 3114
2 775 4534 3309
I tried creating a function and then iterating through each row of a dataframe as below:
for index, row in football.iterrows():
    ##pdb.set_trace()
    goalsFor=row['GoalsFor']
    goalsAgainst=row['GoalsAgainst']
    teamName=row['Team']
    if not total:
        totals=np.array(Goal_diff_count_Formal(int(goalsFor), int(goalsAgainst), teamName))
    else:
        total= total.append(Goal_diff_count_Formal(int(goalsFor), int(goalsAgainst), teamName))
return total
def Goal_diff_count_Formal(gFor, gAgainst, team):
    goalsDifference=gFor-gAgainst
    return [team, goalsDifference]
However, I would like to know if there is a quicker way to get this, something like
dataframe['goalsFor'] - dataframe['goalsAgainst'] #along with the team name in the dataframe
Solution if the values in the Team column are unique: set Team as the index, compute the difference, and select a team by its index label:
df = df.set_index('Team')
s = df['GoalsFor'] - df['GoalsAgainst']
print (s)
Team
Real Madrid 2807
Barcelona 2786
Atletico Madrid 1225
dtype: int64
print (s['Atletico Madrid'])
1225
Solution if the Team column may contain duplicate values:
I believe you need to group by Team and aggregate with sum first, and then compute the difference:
#change sample data for Team in row 3
print (df)
Pos Team Seasons Points GamesPlayed GamesWon GamesDrawn \
0 1 Real Madrid 86 5656 2600 1647 552
1 2 Barcelona 86 5435 2500 1581 573
2 3 Real Madrid 80 5111 2614 1241 598
GamesLost GoalsFor GoalsAgainst
0 563 5947 3140
1 608 5900 3114
2 775 4534 3309
df = df.groupby('Team')[['GoalsFor','GoalsAgainst']].sum()
df['diff'] = df['GoalsFor'] - df['GoalsAgainst']
print (df)
GoalsFor GoalsAgainst diff
Team
Barcelona 5900 3114 2786
Real Madrid 10481 6449 4032
EDIT:
s = df['GoalsFor'] - df['GoalsAgainst']
print (s)
Team
Barcelona 2786
Real Madrid 4032
dtype: int64
print (s['Barcelona'])
2786
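If the goal is a table with the team name next to its goal difference, without moving Team into the index, one possible sketch is the following. It uses only the three relevant columns from the sample data; the column name GoalDiff is an arbitrary choice made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Real Madrid", "Barcelona", "Atletico Madrid"],
    "GoalsFor": [5947, 5900, 4534],
    "GoalsAgainst": [3140, 3114, 3309],
})

# Column arithmetic is vectorised, so no row-by-row loop is needed.
result = df.assign(GoalDiff=df["GoalsFor"] - df["GoalsAgainst"])[["Team", "GoalDiff"]]
print(result)
```

Because assign returns a new frame, the original df is left unchanged.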

Excel Lookup for multiple values

I have two sheets in Excel. First one is city/branch name and the second is the branch's sales.
sheet 1:
City Branch
NY GoldenStar
NY Aquta
NY Orgi
Oregon Orgi
L.A Orgi
Oregon GoldenStar
....
Sheet 2 is detailed sales for each city
Branch
City GoldenStar Aquta Orgi
NY 45 456 90
L.A 155 345 34
Oregon 9 23 17
How can I use a lookup function to assign each branch's sales to sheet 1? I want a result like this:
sheet 1:
City Branch Sale
NY GoldenStar 45
NY Aquta 456
NY Orgi 90
Oregon Orgi 17
L.A Orgi 34
Oregon GoldenStar 9
Use a combination of VLOOKUP and MATCH for this task.
Use VLOOKUP to search the data range (without headers) with the City name as the key. Get the column number to return by using MATCH to find the Branch name in the header range.
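The two-step lookup (row by City, column by Branch) can be sketched in plain Python as follows. This is an analogue of the VLOOKUP/MATCH combination rather than Excel syntax; the data is copied from the question, with the branch header spelled GoldenStar so that it matches sheet 1.

```python
# Header row of sheet 2 (branch names) and one list of sales per city.
branches = ["GoldenStar", "Aquta", "Orgi"]
sales = {
    "NY": [45, 456, 90],
    "L.A": [155, 345, 34],
    "Oregon": [9, 23, 17],
}

# City/Branch pairs from sheet 1.
sheet1 = [
    ("NY", "GoldenStar"), ("NY", "Aquta"), ("NY", "Orgi"),
    ("Oregon", "Orgi"), ("L.A", "Orgi"), ("Oregon", "GoldenStar"),
]

# sales[city] picks the row (the VLOOKUP part);
# branches.index(branch) picks the column (the MATCH part).
result = [
    (city, branch, sales[city][branches.index(branch)])
    for city, branch in sheet1
]

for line in result:
    print(line)
```

A misspelled header makes branches.index raise an error, much as MATCH would return #N/A, so the header names in both sheets need to agree exactly.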
