Pandas: How to average rows with two columns having similar id's? [duplicate] - python-3.x

I have a dataframe like the following:
State Name    County Name  Value
Idaho         Ada          20
Idaho         Ada          50
Pennsylvania  Adams        70
Colorado      Adams        25
Pennsylvania  Adams        21
Illinois      Adams        45
Illinois      Madison      45
Illinois      Madison      75
I want to average the rows that share the same State Name and County Name, so that the dataframe becomes something like this:
State Name    County Name  Mean
Idaho         Ada          12.5
Pennsylvania  Adams        55.47
Colorado      Adams        47.2
Illinois      Adams        19.5
Illinois      Madison      75.14
Any kind of help is appreciated.

Try:
df.groupby(['State Name','County Name']).mean()
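A minimal sketch of that answer, run on the sample rows from the question. Selecting the Value column, passing as_index=False and renaming the result to Mean are choices made here for illustration, not part of the original answer. The Mean figures in the question's expected table do not follow from its sample rows, so the sketch simply computes the means of the rows as given.

```python
import pandas as pd

df = pd.DataFrame({
    "State Name": ["Idaho", "Idaho", "Pennsylvania", "Colorado",
                   "Pennsylvania", "Illinois", "Illinois", "Illinois"],
    "County Name": ["Ada", "Ada", "Adams", "Adams",
                    "Adams", "Adams", "Madison", "Madison"],
    "Value": [20, 50, 70, 25, 21, 45, 45, 75],
})

# Group on both key columns, average Value, and keep the keys as ordinary columns.
means = (
    df.groupby(["State Name", "County Name"], as_index=False)["Value"]
      .mean()
      .rename(columns={"Value": "Mean"})
)
print(means)
```

Without as_index=False the two key columns would become a MultiIndex on the result instead of staying regular columns.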

Related

Question about excel columns csv file how to combine columns

A quick question. I have a table like this, with each player's name and the percentage of matches won:
Rank Country Name Matches Won %
1 ESP Rafael Nadal 89.06%
2 SRB Novak Djokovic 83.82%
3 SUI Roger Federer 83.61%
4 RUS Daniil Medvedev 73.75%
5 AUT Dominic Thiem 72.73%
6 GRE Stefanos Tsitsipas 67.95%
7 JPN Kei Nishikori 67.44%
and I have another table like this, with the ace percentage:
Rank Country Name Ace %
1 USA John Isner 26.97%
2 CRO Ivo Karlovic 25.47%
3 USA Reilly Opelka 24.81%
4 CAN Milos Raonic 24.63%
5 USA Sam Querrey 20.75%
6 AUS Nick Kyrgios 20.73%
7 RSA Kevin Anderson 17.82%
8 KAZ Alexander Bublik 17.06%
9 FRA Jo Wilfried Tsonga 14.29%
---------------------------------------
85 ESP RAFAEL NADAL 6.85%
My question is: can I align my two tables? I want to keep my data ordered by matches won and add the ace percentage next to it, so that I have, for example:
Rank Country Name Matches Won % Ace %
1 ESP RAFAEL NADAL 89.06% 6.85%
and so on for all the players.
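I agree with the comment above that it would be easiest to import both and then use XLOOKUP() to add the Ace % column to the first set of data. Note that a player's rank differs between the two tables (Rafael Nadal is ranked 1 in the first and 85 in the second), so the lookup should match on the name rather than the rank. XLOOKUP() matching is not case-sensitive, so RAFAEL NADAL still matches Rafael Nadal. If you import the first data set to Sheet1 and the second to Sheet2, and both have the name in column C and the percentage in column D, your XLOOKUP() in Sheet1 column E would look something like:
=XLOOKUP(C2, Sheet2!C:C, Sheet2!D:D)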
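Since the rest of this page works in pandas, here is a hedged sketch of the same alignment done there instead of in Excel. It assumes both tables are already loaded as DataFrames (only a few rows from the question are reproduced), and it joins on a case-folded copy of the name, because the rank is different in each table. The helper column key is an illustrative name.

```python
import pandas as pd

wins = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Country": ["ESP", "SRB", "SUI"],
    "Name": ["Rafael Nadal", "Novak Djokovic", "Roger Federer"],
    "Matches Won %": ["89.06%", "83.82%", "83.61%"],
})
aces = pd.DataFrame({
    "Rank": [1, 2, 85],
    "Country": ["USA", "CRO", "ESP"],
    "Name": ["John Isner", "Ivo Karlovic", "RAFAEL NADAL"],
    "Ace %": ["26.97%", "25.47%", "6.85%"],
})

# Normalise the join key so "RAFAEL NADAL" and "Rafael Nadal" compare equal.
wins["key"] = wins["Name"].str.casefold()
aces["key"] = aces["Name"].str.casefold()

# A left merge keeps every row of the matches-won table, in its original order.
combined = (
    wins.merge(aces[["key", "Ace %"]], on="key", how="left")
        .drop(columns="key")
)
print(combined)
```

Players that are missing from the ace table in this cut-down sample simply get a missing value in Ace %.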

Groupby month and year pandas [duplicate]

I am using this dataframe:
Fruit Date Name Number
Apples 10/6/2016 Bob 7
Apples 10/6/2016 Bob 8
Apples 10/6/2016 Mike 9
Apples 10/7/2016 Steve 10
Apples 10/7/2016 Bob 1
Oranges 10/7/2016 Bob 2
Oranges 10/6/2016 Tom 15
Oranges 10/6/2016 Mike 57
Oranges 10/6/2016 Bob 65
Oranges 10/7/2016 Tony 1
Grapes 10/7/2016 Bob 1
Grapes 10/7/2016 Tom 87
Grapes 10/7/2016 Bob 22
Grapes 10/7/2016 Bob 12
Grapes 10/7/2016 Tony 15
I would like to aggregate this by Name and then by Fruit to get a total number of Fruit per Name. For example:
Bob,Apples,16
I tried grouping by Name and Fruit but how do I get the total number of Fruit?
Use GroupBy.sum:
# numeric_only=True leaves the text Date column out of the result
df.groupby(['Fruit','Name']).sum(numeric_only=True)
Out[31]:
                Number
Fruit   Name
Apples  Bob         16
        Mike         9
        Steve       10
Grapes  Bob         35
        Tom         87
        Tony        15
Oranges Bob         67
        Mike        57
        Tom         15
        Tony         1
To specify the column to sum, use this: df.groupby(['Name', 'Fruit'])['Number'].sum()
Also you can use agg function,
df.groupby(['Name', 'Fruit'])['Number'].agg('sum')
If you want to keep the original columns Fruit and Name, use reset_index(). Otherwise Fruit and Name will become part of the index.
df.groupby(['Fruit','Name'])['Number'].sum().reset_index()
Fruit Name Number
Apples Bob 16
Apples Mike 9
Apples Steve 10
Grapes Bob 35
Grapes Tom 87
Grapes Tony 15
Oranges Bob 67
Oranges Mike 57
Oranges Tom 15
Oranges Tony 1
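As a hedged end-to-end sketch of the answers so far, the snippet below rebuilds the question's table and runs the group-by both ways. Parsing the table from a string with read_csv is a convenience assumed here, not something the answers do.

```python
import io
import pandas as pd

raw = """Fruit Date Name Number
Apples 10/6/2016 Bob 7
Apples 10/6/2016 Bob 8
Apples 10/6/2016 Mike 9
Apples 10/7/2016 Steve 10
Apples 10/7/2016 Bob 1
Oranges 10/7/2016 Bob 2
Oranges 10/6/2016 Tom 15
Oranges 10/6/2016 Mike 57
Oranges 10/6/2016 Bob 65
Oranges 10/7/2016 Tony 1
Grapes 10/7/2016 Bob 1
Grapes 10/7/2016 Tom 87
Grapes 10/7/2016 Bob 22
Grapes 10/7/2016 Bob 12
Grapes 10/7/2016 Tony 15
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Series indexed by the (Fruit, Name) pairs.
totals = df.groupby(["Fruit", "Name"])["Number"].sum()

# Same numbers, with Fruit and Name turned back into ordinary columns.
flat = totals.reset_index()

print(totals.loc[("Apples", "Bob")])
print(flat)
```

The first form is convenient for lookups by (Fruit, Name); the second is usually easier to export or merge.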
As seen in the other answers:
df.groupby(['Fruit','Name'])['Number'].sum()
                Number
Fruit   Name
Apples  Bob         16
        Mike         9
        Steve       10
Grapes  Bob         35
        Tom         87
        Tony        15
Oranges Bob         67
        Mike        57
        Tom         15
        Tony         1
Both the other answers accomplish what you want.
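You can use pivot to arrange the data in a nice table. Recent pandas versions require keyword arguments for pivot, and selecting Number before summing keeps the Date column out of the result:
df.groupby(['Fruit','Name'], as_index=False)['Number'].sum().pivot(index='Fruit', columns='Name', values='Number').fillna(0)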
Name Bob Mike Steve Tom Tony
Fruit
Apples 16.0 9.0 10.0 0.0 0.0
Grapes 35.0 0.0 0.0 87.0 15.0
Oranges 67.0 57.0 0.0 15.0 1.0
df.groupby(['Fruit','Name'])['Number'].sum()
You can select different columns to sum numbers.
A variation using the .agg() function, which lets you (1) keep the result as a DataFrame, (2) apply averages, counts, sums, etc., and (3) group by multiple columns while staying legible.
df.groupby(['att1', 'att2']).agg({'att1': "count", 'att3': "sum",'att4': 'mean'})
using your values...
df.groupby(['Name', 'Fruit']).agg({'Number': "sum"})
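You can set the group-by columns as the index and then sum by index level. The level argument of sum() was removed in pandas 2.0, so group on the index levels instead; sort=False keeps the groups in order of first appearance, as in the output below.
df.set_index(['Fruit','Name'])[['Number']].groupby(level=[0,1], sort=False).sum()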
Out[175]:
                Number
Fruit   Name
Apples  Bob         16
        Mike         9
        Steve       10
Oranges Bob         67
        Tom         15
        Mike        57
        Tony         1
Grapes  Bob         35
        Tom         87
        Tony        15
You could also use transform() on the Number column after the group-by. It computes the total for each group with sum and returns a Series with the same index as the original dataframe.
df['Number'] = df.groupby(['Fruit', 'Name'])['Number'].transform('sum')
df = df.drop_duplicates(subset=['Fruit', 'Name']).drop(columns='Date')
After the transform, drop_duplicates removes the repeated rows on the Fruit and Name columns, and drop(columns='Date') removes the Date column (the older positional form drop('Date', 1) is no longer accepted by recent pandas versions).
# print(df)
Fruit Name Number
0 Apples Bob 16
2 Apples Mike 9
3 Apples Steve 10
5 Oranges Bob 67
6 Oranges Tom 15
7 Oranges Mike 57
9 Oranges Tony 1
10 Grapes Bob 35
11 Grapes Tom 87
14 Grapes Tony 15
# You could achieve the same result with functions discussed by others:
# print(df.groupby(['Fruit', 'Name'], as_index=False)['Number'].sum())
# print(df.groupby(['Fruit', 'Name'], as_index=False)['Number'].agg('sum'))
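The transform-then-deduplicate approach can be sketched as a self-contained script. Working on a copy named out (an illustrative name) is a choice made here so the original frame stays intact; the answer above overwrites df in place.

```python
import io
import pandas as pd

raw = """Fruit Date Name Number
Apples 10/6/2016 Bob 7
Apples 10/6/2016 Bob 8
Apples 10/6/2016 Mike 9
Apples 10/7/2016 Steve 10
Apples 10/7/2016 Bob 1
Oranges 10/7/2016 Bob 2
Oranges 10/6/2016 Tom 15
Oranges 10/6/2016 Mike 57
Oranges 10/6/2016 Bob 65
Oranges 10/7/2016 Tony 1
Grapes 10/7/2016 Bob 1
Grapes 10/7/2016 Tom 87
Grapes 10/7/2016 Bob 22
Grapes 10/7/2016 Bob 12
Grapes 10/7/2016 Tony 15
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

out = df.copy()
# transform returns one value per original row: the total of that row's group.
out["Number"] = out.groupby(["Fruit", "Name"])["Number"].transform("sum")
# Keep the first row of each (Fruit, Name) pair and drop the Date column.
out = out.drop_duplicates(subset=["Fruit", "Name"]).drop(columns="Date")
print(out)
```

Unlike a plain groupby(...).sum(), this keeps the original row labels and the order of first appearance.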
The official pandas user guide has a chapter, Group by: split-apply-combine, that covers what you can do after a group-by.
If you want the aggregated column to have a custom name such as Total Number or Total (all the other solutions here produce a dataframe whose aggregate column is named Number), use named aggregation:
df.groupby(['Fruit', 'Name'], as_index=False).agg(**{'Total Number': ('Number', 'sum')})
or (if the custom name doesn't need to have a white space in it):
df.groupby(['Fruit', 'Name'], as_index=False).agg(Total=('Number', 'sum'))
this is equivalent to SQL query:
SELECT Fruit, Name, sum(Number) AS Total
FROM df
GROUP BY Fruit, Name
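To check that named aggregation only changes the column label and not the numbers, here is a small sketch on a cut-down version of the question's data (the six rows below are taken from the question; using a subset is an assumption made to keep the example short).

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apples", "Apples", "Apples", "Oranges", "Oranges", "Grapes"],
    "Name": ["Bob", "Bob", "Mike", "Bob", "Bob", "Tom"],
    "Number": [7, 8, 9, 2, 65, 87],
})

# Keyword form: the keyword becomes the output column name.
named = df.groupby(["Fruit", "Name"], as_index=False).agg(Total=("Number", "sum"))

# Dict-unpacking form: needed when the output name is not a valid identifier.
spaced = df.groupby(["Fruit", "Name"], as_index=False).agg(
    **{"Total Number": ("Number", "sum")}
)

print(named)
print(spaced)
```

Both results hold the same values; only the name of the aggregated column differs.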
Speaking of SQL, there's the pandasql module, which lets you query pandas DataFrames in the local environment using SQL syntax. It's not part of pandas, so it has to be installed separately.
#! pip install pandasql
from pandasql import sqldf
sqldf("""
SELECT Fruit, Name, sum(Number) AS Total
FROM df
GROUP BY Fruit, Name
""")
You can use dfsql. For your problem, it would look something like:
df.sql('SELECT fruit, sum(number) GROUP BY fruit')
https://github.com/mindsdb/dfsql
here is an article about it:
https://medium.com/riselab/why-every-data-scientist-using-pandas-needs-modin-bringing-sql-to-dataframes-3b216b29a7c0
You can use reset_index() to reset the index after the sum
df.groupby(['Fruit','Name'])['Number'].sum().reset_index()
or
df.groupby(['Fruit','Name'], as_index=False)['Number'].sum()

How to create spark datasets from a file without using File reader

I have a data file that has 4 data sections: header data, summary data, detail data and footer data. Each section has a fixed number of columns, and sections are separated by two rows that contain only a single "#". Different sections have different numbers of columns. Is there a way to avoid creating new files and use Spark's TSV (tab-separated format) reader, or any other module, to read the file into 4 datasets directly? If I read the file directly, I lose the extra columns in the later sections: Spark reads only as many columns as the first row of the file has.
#deptno dname location
10 Accounting New York
20 Research Dallas
30 Sales Chicago
40 Operations Boston
#
#
#grade losal hisal
1 700.00 1200.00
2 1201.00 1400.00
4 2001.00 3000.00
5 3001.00 99999.00
3 1401.00 2000.00
#
#
#ENAME DNAME JOB EMPNO HIREDATE LOC
ADAMS RESEARCH CLERK 7876 23-MAY-87 DALLAS
ALLEN SALES SALESMAN 7499 20-FEB-81 CHICAGO
BLAKE SALES MANAGER 7698 01-MAY-81 CHICAGO
CLARK ACCOUNTING MANAGER 7782 09-JUN-81 NEW YORK
FORD RESEARCH ANALYST 7902 03-DEC-81 DALLAS
JAMES SALES CLERK 7900 03-DEC-81 CHICAGO
JONES RESEARCH MANAGER 7566 02-APR-81 DALLAS
#
#
#Name Age Address
Paul 23 1115 W Franklin
Bessy the Cow 5 Big Farm Way
Zeke 45 W Main St
Output:
Dataset d1 :
#deptno dname location
10 Accounting New York
20 Research Dallas
30 Sales Chicago
40 Operations Boston
Dataset d2 :
#grade losal hisal
1 700.00 1200.00
2 1201.00 1400.00
4 2001.00 3000.00
5 3001.00 99999.00
3 1401.00 2000.00
Dataset d3 :
#ENAME DNAME JOB EMPNO HIREDATE LOC
ADAMS RESEARCH CLERK 7876 23-MAY-87 DALLAS
ALLEN SALES SALESMAN 7499 20-FEB-81 CHICAGO
BLAKE SALES MANAGER 7698 01-MAY-81 CHICAGO
CLARK ACCOUNTING MANAGER 7782 09-JUN-81 NEW YORK
FORD RESEARCH ANALYST 7902 03-DEC-81 DALLAS
JAMES SALES CLERK 7900 03-DEC-81 CHICAGO
JONES RESEARCH MANAGER 7566 02-APR-81 DALLAS
Dataset d4 :
#Name Age Address
Paul 23 1115 W Franklin
Bessy the Cow 5 Big Farm Way
Zeke 45 W Main St
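No answer was recorded for this question. As a hedged starting point, the splitting step alone can be sketched in plain Python: read the lines once, treat any row that is exactly "#" as a separator, and collect the rows in between. The function name split_sections is hypothetical, the sample is a shortened version of the file above, and the sketch assumes the fields really are tab-separated. Turning each section into a Spark dataset (for example with createDataFrame) is not shown.

```python
def split_sections(lines):
    """Group tab-separated rows into sections, using rows that are exactly '#' as separators."""
    sections, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "#":
            # A separator row closes the current section (if it has any rows).
            if current:
                sections.append(current)
                current = []
            continue
        current.append(line.split("\t"))
    if current:
        sections.append(current)
    return sections


sample = "\n".join([
    "#deptno\tdname\tlocation",
    "10\tAccounting\tNew York",
    "#",
    "#",
    "#grade\tlosal\thisal",
    "1\t700.00\t1200.00",
    "#",
    "#",
    "#Name\tAge\tAddress",
    "Bessy the Cow\t5\tBig Farm Way",
])

sections = split_sections(sample.splitlines())
for header, *rows in sections:
    print(header, rows)
```

Header rows such as "#deptno ..." are kept because only a row consisting of a lone "#" counts as a separator.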

Combine 2 different sheets with same data in Excel

I have the same data from two different sources. Both are incomplete, but combined they may be less incomplete.
I have 2 files:
File #1 has: ID, Zipcode, YoB, Gender
File #2 has: Email, ID, Zipcode, YoB, Gender
The IDs in both files are the same, but #1 has some IDs that #2 doesn't, and the other way around.
The Email is connected to the ID. IDs are linked to the zipcode, YoB and gender. In both files some of that info is missing. E.g. File #1 and #2 both have ID 1234, but #1 has only the zipcode and YoB and no gender, while #2 has the zipcode and gender but no YoB.
I want to have all the information in one file:
Email, ID, YoB, Zipcode, Gender
I tried to sort both sets of IDs alphabetically, put them next to each other and search for duplicates, but because #1 has some IDs that #2 doesn't, I'm not able to combine them.
What's the best way to fix this?
By the way, it's about 12000 IDs from #1 and 9500 from #2.
If you want a list of all the unique IDs then you could create a new sheet, copy both lots of IDs into the same column and then use Advanced Filter to copy Unique records only to another column.
Then use that column to do vlookups from the two files in the columns you require.
(I'm presuming this is a one-time job and you don't mind a bit of manual-ness)...
If on your first Sheet ("Sheet1") you have:
ID F_Name S_Name Age Favourite Cheese
1 Bob Smith 25 Brie
2 Fred Jones 29 Cheddar
3 Jeff Brown 18 Edam
4 Alice Smith 39 Mozzarella
5 Mark Jones 65 Cheddar
7 Sarah Smith 29 Mozzarella
8 Nick Jones 40 Brie
10 Betty Thompson 34 Edam
and on your second Sheet ("Sheet2") you have:
ID F_Name S_Name Age
1 Bob Smith 25
3 Jeff Brown 18
4 Alice Smith 39
5 Mark Jones 65
6 Frank Brown 44
7 Sarah Smith 29
9 Tom Brown 28
10 Betty Thompson 34
Then if you're combining them on a 3rd Sheet you need to do something like:
=IFERROR(VLOOKUP($A2,Sheet1!$A$1:$E$9,COLUMN(),FALSE),VLOOKUP($A2,Sheet2!$A$1:$E$9,COLUMN(),FALSE))
If you're trying to get to:
ID F_Name S_Name Age Favourite Cheese
1 Bob Smith 25 Brie
2 Fred Jones 29 Cheddar
3 Jeff Brown 18 Edam
4 Alice Smith 39 Mozzarella
5 Mark Jones 65 Cheddar
6 Frank Brown 44 0
7 Sarah Smith 29 Mozzarella
8 Nick Jones 40 Brie
9 Tom Brown 28 0
10 Betty Thompson 34 Edam
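If the two files can be loaded into pandas, the same fill-the-gaps merge can be sketched with combine_first, which takes values from the first frame and falls back to the second where the first is missing. This swaps the Excel formula for a pandas technique; all rows below are made-up illustrations that follow the column layout described in the question.

```python
import pandas as pd

# File #1: no Email column; ID 1234 has no Gender.
file1 = pd.DataFrame({
    "ID": [1234, 1111],
    "Zipcode": ["1011", "2022"],
    "YoB": [1980, 1975],
    "Gender": [None, "F"],
})
# File #2: ID 1234 has no YoB; ID 2222 exists only here.
file2 = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com"],
    "ID": [1234, 2222],
    "Zipcode": ["1011", None],
    "YoB": [None, 1990],
    "Gender": ["M", "F"],
})

# Align both frames on ID; prefer file1's values and fill its gaps from file2.
combined = (
    file1.set_index("ID")
         .combine_first(file2.set_index("ID"))
         .reset_index()
)
# Put the columns in the order the question asks for.
combined = combined[["Email", "ID", "YoB", "Zipcode", "Gender"]]
print(combined)
```

IDs that appear in only one file are kept, with missing values wherever neither file had the information.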

If a cell value equals this, another cell equals that

I have a spreadsheet with a column for cities, of which there are only 4 different values. What formula would fill a new column with the corresponding state and apply it to the entire list? Example:
Atlanta equals GA,
Phoenix equals AZ,
Chicago equals IL,
Nashville equals TN
Thanks!!
You can use the VLookup function for that:
Make a table with your city name in one column and the state in the next column. Then put the following formula next to the city that you want populated:
=VLOOKUP(A1,A$20:B$23,2,FALSE)
In this example, the city you want to identify is in A1, and this formula goes in B1. You can copy it down to B2, B3, etc because the table is hard-coded as A$20:B$23, rather than A20:B23 (where each successive copy down the column would look for a table one row down as well). This example put the lookup table in the A-B columns, but you could put it anywhere you like.
The FALSE at the end means look for an exact match, not the closest one. So if a "Dallas" shows up in your list, the function will return #N/A rather than guessing between the state for Chicago and the state for Nashville (either side of Dallas, alphabetically).
Hope that helps!
EDIT:
You added that you also need zipcode info, and that's easy enough to add.
Your table that defines everything would put the zipcode in the 3rd column, so down at A20:B23 (in my example above) you'd end up with A20:C23, where the table would look like
Atlanta GA 12345
Chicago IL 23456
Nashville TN 34567
Phoenix AZ 45678
The cell next to your city in the table you want to populate would be in B1 as shown above giving the state, and then in C1 you'd have the following formula:
=VLOOKUP(A1,A$20:C$23,3,FALSE)
The changes are that here the table is defined out to column C, and instead of "2" returning the second column (i.e. the state abbreviation shown in B), it returns the zipcode shown in column C, the third column.
Again, hope that helps.
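For readers doing the same lookup in pandas rather than Excel, a rough equivalent of the exact-match VLOOKUP is Series.map against a small lookup table. The zip codes are the placeholder values from the table above, and the frame and column names are assumptions made for this sketch.

```python
import pandas as pd

# The lookup table, indexed by city (the role of A20:C23 in the formula).
lookup = pd.DataFrame({
    "City": ["Atlanta", "Chicago", "Nashville", "Phoenix"],
    "State": ["GA", "IL", "TN", "AZ"],
    "Zip": ["12345", "23456", "34567", "45678"],
}).set_index("City")

# The list to populate; Dallas is deliberately absent from the lookup table.
cities = pd.DataFrame({"City": ["Phoenix", "Atlanta", "Dallas", "Chicago"]})

# map() does an exact match on the index; unknown cities become missing values.
cities["State"] = cities["City"].map(lookup["State"])
cities["Zip"] = cities["City"].map(lookup["Zip"])
print(cities)
```

As with FALSE in VLOOKUP, a city that is not in the table is left empty instead of being matched to a neighbour.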
Since you mention "only 4 different values" maybe:
=CHOOSE(MATCH(LEFT(A1),{"A","P","C","N"},0),"GA","AZ","IL","TN")
You can use a VLOOKUP Table that contains the city and state abbreviation.
Here is a table that has the Capital, State, State Abbreviation.
Montgomery Alabama AL
Juneau Alaska AK
Phoenix Arizona AZ
Little Rock Arkansas AR
Sacramento California CA
Denver Colorado CO
Hartford Connecticut CT
Dover Delaware DE
Tallahassee Florida FL
Atlanta Georgia GA
Honolulu Hawaii HI
Boise Idaho ID
Springfield Illinois IL
Indianapolis Indiana IN
Des Moines Iowa IA
Topeka Kansas KS
Frankfort Kentucky KY
Baton Rouge Louisiana LA
Augusta Maine ME
Annapolis Maryland MD
Boston Massachusetts MA
Lansing Michigan MI
Saint Paul Minnesota MN
Jackson Mississippi MS
Jefferson City Missouri MO
Helena Montana MT
Lincoln Nebraska NE
Carson City Nevada NV
Concord New Hampshire NH
Trenton New Jersey NJ
Santa Fe New Mexico NM
Albany New York NY
Raleigh North Carolina NC
Bismarck North Dakota ND
Columbus Ohio OH
Oklahoma City Oklahoma OK
Salem Oregon OR
Harrisburg Pennsylvania PA
Providence Rhode Island RI
Columbia South Carolina SC
Pierre South Dakota SD
Nashville Tennessee TN
Austin Texas TX
Salt Lake City Utah UT
Montpelier Vermont VT
Richmond Virginia VA
Olympia Washington WA
Charleston West Virginia WV
Madison Wisconsin WI
Cheyenne Wyoming WY
Then you would use =VLOOKUP(A1,A1:C50,3, FALSE) to look for A1 (Montgomery) in the table and it would output AL for example.
