Automatically fill data from another sheet - excel

Main Question
I would like to auto-fill Sheet 1 with values from Sheet 2 in Excel 2013. My data are in two sheets in the same workbook.
Example
=========== Sheet 1 ===========
location year val1 val2
USA.VT 1999
USA.VT 2000
USA.VT 2001
USA.VT 2002
USA.NH 1999
USA.NH 2000
USA.NH 2001
USA.NH 2002
USA.ME 1999
USA.ME 2000
USA.ME 2001
USA.ME 2002
=========== Sheet 2 ===========
location year val1 val2
USA.VT 1999 6 3
USA.VT 2000 3 2
USA.VT 2001 4 1
USA.VT 2002 9 5
USA.NH 1999 3 6
USA.NH 2002 12 56
USA.ME 1999 3 16
USA.ME 2002 4 5
I would like to use some function or formula to automatically populate Sheet 1 based on the values in Sheet 2 according to: location, year, and the column (val1 or val2). Non-matches would be zero-filled.
This would result in the following:
=========== Sheet 1 ===========
location year val1 val2
USA.VT 1999 6 3
USA.VT 2000 3 2
USA.VT 2001 4 1
USA.VT 2002 9 5
USA.NH 1999 3 6
USA.NH 2000 0 0
USA.NH 2001 0 0
USA.NH 2002 12 56
USA.ME 1999 3 16
USA.ME 2000 0 0
USA.ME 2001 0 0
USA.ME 2002 4 5
I have tried VLOOKUP, INDEX, and MATCH, but I'm having no luck.
Any help would be greatly appreciated!

Put this Array formula in C2:
=IFERROR(INDEX(Sheet2!C$2:C$9,MATCH($A2&$B2,Sheet2!$A$2:$A$9&Sheet2!$B$2:$B$9,0)),0)
Because it is an array formula, confirm it with Ctrl-Shift-Enter instead of Enter when leaving edit mode.
Then copy it one column to the right and down.
The picture is not exact because I left it on one sheet.
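For readers who want to check the expected result outside Excel, the same composite-key lookup (match on location and year, zero-fill non-matches) can be sketched in pandas. This is only an illustration under the assumption that the two sheets have been loaded into dataframes; the names sheet1, sheet2 and filled are hypothetical and not part of the original answer.

```python
import pandas as pd

# Hypothetical in-memory copies of the two sheets from the question.
sheet1 = pd.DataFrame({
    "location": ["USA.VT"] * 4 + ["USA.NH"] * 4 + ["USA.ME"] * 4,
    "year": [1999, 2000, 2001, 2002] * 3,
})
sheet2 = pd.DataFrame({
    "location": ["USA.VT"] * 4 + ["USA.NH"] * 2 + ["USA.ME"] * 2,
    "year": [1999, 2000, 2001, 2002, 1999, 2002, 1999, 2002],
    "val1": [6, 3, 4, 9, 3, 12, 3, 4],
    "val2": [3, 2, 1, 5, 6, 56, 16, 5],
})

# A left merge keeps every Sheet 1 row; rows without a match get NaN,
# which is then replaced by 0 (the role IFERROR(..., 0) plays in the formula).
filled = (
    sheet1.merge(sheet2, on=["location", "year"], how="left")
    .fillna(0)
    .astype({"val1": int, "val2": int})
)
print(filled)
```

The merge on two columns corresponds to the concatenated $A2&$B2 key in the array formula.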

Related

How to change single index value in level 1 in MultiIndex dataframe?

I have this MultiIndex dataframe, df, obtained after parsing some text columns for dates with regex.
df.columns
Index(['all', 'month', 'day', 'year'], dtype='object')
all month day year
match
456 0 2000 1 1 2000
461 0 16 1 1 16
1 1991 1 1 1991
463 0 25 1 1 25
1 2014 1 1 2014
465 0 19 1 1 19
1 1976 1 1 1976
477 0 14 1 1 14
1 1994 1 1 1994
489 0 35 1 1 35
1 1985 1 1 1985
I need to keep only the rows that hold years (2000, 1991, 2014, 1976, 1994, 1985). Most of these have 1 at level 1 of the index, except for the first one, (456, 0).
I would like to change that single index value to 1 so that I could handle them all this way:
df=df.drop(index=0, level=1)
My result should be this.
all month day year
match
456 0 2000 1 1 2000
461 1 1991 1 1 1991
463 1 2014 1 1 2014
465 1 1976 1 1 1976
477 1 1994 1 1 1994
489 1 1985 1 1 1985
I have tried
df.rename(index={(456,0):(456,1)}, level=1, inplace=True)
which did not seem to do anything.
I could do df1=df.drop((456,1)) and df2=df.drop(index=0, level=1) and then concat them and remove the duplicates, but that does not seem very efficient.
I can't drop the MultiIndex because I will need to append this subset to a bigger dataframe later on.
Thank you.
The first idea is to chain two masks with | (bitwise OR):
df = df[(df.index.get_level_values(1) == 1) | (df.index.get_level_values(0) == 456)]
print (df)
all month day year
456 0 2000 1 1 2000
461 1 1991 1 1 1991
463 1 2014 1 1 2014
465 1 1976 1 1 1976
477 1 1994 1 1 1994
489 1 1985 1 1 1985
Another idea, if you always need the first row, is to set the first element of the mask array to True by position:
mask = df.index.get_level_values(1) == 1
mask[0] = True
df = df[mask]
print (df)
all month day year
456 0 2000 1 1 2000
461 1 1991 1 1 1991
463 1 2014 1 1 2014
465 1 1976 1 1 1976
477 1 1994 1 1 1994
489 1 1985 1 1 1985
Another out-of-the-box solution is to filter out duplicated values with Index.duplicated. It works here because the first value, 456, is unique, and for all other values the second row is needed:
df1 = df[~df.index.get_level_values(0).duplicated(keep='last')]
print (df1)
all month day year
456 0 2000 1 1 2000
461 1 1991 1 1 1991
463 1 2014 1 1 2014
465 1 1976 1 1 1976
477 1 1994 1 1 1994
489 1 1985 1 1 1985
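To try the ideas above without the original data, here is a minimal self-contained sketch. It rebuilds the dataframe from the printed output (assuming the first index level is unnamed and the second is named match, as the printout suggests) and checks that the mask approach and the Index.duplicated approach select the same rows. The variable names are illustrative only.

```python
import pandas as pd

# Rebuild the example frame from the question's printed output.
index = pd.MultiIndex.from_tuples(
    [(456, 0), (461, 0), (461, 1), (463, 0), (463, 1), (465, 0),
     (465, 1), (477, 0), (477, 1), (489, 0), (489, 1)],
    names=[None, "match"],  # assumption: level 0 is unnamed
)
values = [2000, 16, 1991, 25, 2014, 19, 1976, 14, 1994, 35, 1985]
df = pd.DataFrame({"all": values, "month": 1, "day": 1, "year": values}, index=index)

level0 = df.index.get_level_values(0)
level1 = df.index.get_level_values("match")

# Keep rows where match == 1, plus the special case 456.
by_mask = df[(level1 == 1) | (level0 == 456)]

# Keep the last row of every level-0 group.
by_last = df[~level0.duplicated(keep="last")]

print(by_mask)
print(by_mask.equals(by_last))
```

Neither variant changes the index, so the MultiIndex stays intact for appending to a bigger dataframe later.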
Another way is to query the level. Note that this keeps only rows whose match level equals 1, so the (456, 0) row is dropped:
df.query('match == [1]')
match all month day year
461 1 1991 1 1 1991
463 1 2014 1 1 2014
465 1 1976 1 1 1976
477 1 1994 1 1 1994
489 1 1985 1 1 1985

Display minimum value excluding zero along with adjacent column value from each year + Python 3+, dataframe

I have a dataframe with three columns: Year, Product, and Price. I want to calculate the minimum Price, excluding zero, for each year, and also show the Product that corresponds to that minimum.
Data:
Year Product Price
2000 Grapes 0
2000 Apple 220
2000 pear 185
2000 Watermelon 172
2001 Orange 0
2001 Muskmelon 90
2001 Pear 165
2001 Watermelon 99
Desired output in a new dataframe:
Year Minimum Price Product
2000 172 Watermelon
2001 90 Muskmelon
First filter out the rows where Price is 0 by boolean indexing:
df1 = df[df['Price'] != 0]
Then use DataFrameGroupBy.idxmin to get the index labels of the minimal Price per group and select those rows with loc:
df2 = df1.loc[df1.groupby('Year')['Price'].idxmin()]
An alternative is to use sort_values with drop_duplicates:
df2 = df1.sort_values(['Year', 'Price']).drop_duplicates('Year')
print (df2)
Year Product Price
3 2000 Watermelon 172
5 2001 Muskmelon 90
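The steps above can be put together into one runnable sketch. It rebuilds the question's data, applies the filter and idxmin approach, and then renames and reorders the columns to match the desired output. The names nonzero, cheapest and result, and the final renaming step, are assumptions added for illustration.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "Year": [2000] * 4 + [2001] * 4,
    "Product": ["Grapes", "Apple", "pear", "Watermelon",
                "Orange", "Muskmelon", "Pear", "Watermelon"],
    "Price": [0, 220, 185, 172, 0, 90, 165, 99],
})

# Drop zero prices, then pick the row with the smallest Price in each year.
nonzero = df[df["Price"] != 0]
cheapest = nonzero.loc[nonzero.groupby("Year")["Price"].idxmin()]

# Shape the result like the desired output (column name taken from the question).
result = (
    cheapest.rename(columns={"Price": "Minimum Price"})
    [["Year", "Minimum Price", "Product"]]
    .reset_index(drop=True)
)
print(result)
```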
If multiple minimal values are possible and you need all of them per group:
print (df)
Year Product Price
0 2000 Grapes 0
1 2000 Apple 220
2 2000 pear 172
3 2000 Watermelon 172
4 2001 Orange 0
5 2001 Muskmelon 90
6 2001 Pear 165
7 2001 Watermelon 99
df1 = df[df['Price'] != 0]
df = df1[df1['Price'].eq(df1.groupby('Year')['Price'].transform('min'))]
print (df)
Year Product Price
2 2000 pear 172
3 2000 Watermelon 172
5 2001 Muskmelon 90
EDIT: if a year has only zero prices and should still appear in the result:
print (df)
Year Product Price
0 2000 Grapes 0
1 2000 Apple 220
2 2000 pear 185
3 2000 Watermelon 172
4 2001 Orange 0
5 2001 Muskmelon 90
6 2002 Pear 0
7 2002 Watermelon 0
import numpy as np
df['Price'] = df['Price'].replace(0, np.nan)
df2 = df.sort_values(['Year', 'Price']).drop_duplicates('Year')
df2['Product'] = df2['Product'].mask(df2['Price'].isnull(), 'No data')
print (df2)
Year Product Price
3 2000 Watermelon 172.0
5 2001 Muskmelon 90.0
6 2002 No data NaN

condition after groupby: data science

I have a big df; this is an example to illustrate my issue. I want to know which ids, by year_of_life, are in the top one percent in terms of jobs. I want to flag (I am thinking of a dummy variable) the one percent with the most jobs within each year_of_life.
For example:
id year rap jobs_c jobs year_of_life rap_new
1 2009 0 300 10 NaN 0
2 2012 0 2012 12 0 0
3 2013 0 2012 12 1 1
4 2014 0 2012 13 2 1
5 2015 1 2012 15 3 1
6 2016 0 2012 17 4 0
7 2017 0 2012 19 5 0
8 2009 0 2009 15 0 1
9 2010 0 2009 2 1 1
10 2011 0 2009 3 2 1
11 2012 1 2009 3 3 0
12 2013 0 2009 15 4 0
13 2014 0 2009 12 5 0
14 2015 0 2009 13 6 0
15 2016 0 2009 13 7 0
16 2011 0 2009 3 2 1
17 2012 1 2009 3 3 0
18 2013 0 2009 18 4 0
19 2014 0 2009 12 5 0
20 2015 0 2009 13 6 0
.....
100 2009 0 2007 5 6 1
Then I want to sum the jobs of those top-one-percent ids by year_of_life.
I tried something like this:
df.groupby(['year_of_life']).filter(lambda x: x.jobs > x.jobs.quantile(.99))['jobs'].sum()
but I get the following error:
TypeError: filter function returned a Series, but expected a scalar bool
Is this what you need?
df.loc[df.groupby(['year_of_life']).jobs.apply(lambda x : x>x.quantile(.99)).fillna(True),'jobs'].sum()
Out[193]: 102
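A possible variant, sketched here on made-up data, computes the per-group 99th percentile with transform, which returns a value aligned to every row. That makes it easy to store the dummy as a column and then sum jobs per year_of_life. The sample values and the names threshold, top_1pct and top_jobs_by_year are hypothetical. This sketch also does not reproduce the fillna(True) step, so how rows with a missing year_of_life should be treated is left open.

```python
import pandas as pd

# Small made-up sample; the real dataframe has more columns and rows.
df = pd.DataFrame({
    "year_of_life": [0, 0, 0, 0, 1, 1, 1, 1],
    "jobs": [12, 15, 3, 5, 12, 2, 3, 40],
})

# 99th percentile of jobs within each year_of_life, broadcast back to every row.
threshold = df.groupby("year_of_life")["jobs"].transform(lambda s: s.quantile(0.99))

# Dummy: 1 if the row is above its group's 99th percentile, else 0.
df["top_1pct"] = (df["jobs"] > threshold).astype(int)

# Sum of jobs of the flagged rows, per year_of_life.
top_jobs_by_year = df.loc[df["top_1pct"] == 1].groupby("year_of_life")["jobs"].sum()
print(df)
print(top_jobs_by_year)
```

The original attempt failed because filter expects one boolean per group, whereas the comparison produces one boolean per row; transform keeps the row-wise shape.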

EXCEL - CountIF per category

I have this:
1 A B C
2 Country Value Valid
3 Sweden 10 0
4 Sweden 5 1
5 Sweden 1 1
6 Norway 5 1
7 Norway 5 1
8 Germany 12 1
9 Germany 2 1
10 Germany 3 1
11 Germany 1 0
I want to fill in B15 to D17 in table below with number of valid values (a 1 in column C) per country and value range:
A B C D
13 Value count
14 0 to 3 4 to 7 above 7
15 Sweden 1 1 0
16 Norway 0 2 0
17 Germany 3 0 1
I have tried IF combined with COUNTIF but I can't figure it out.
What would the formula be for cell B15?
The formula you are looking for is this:
=COUNTIFS($A$3:$A$11,$A15,$C$3:$C$11,1,$B$3:$B$11,"<4")
For the 4 to 7 column, change the last criterion to $B$3:$B$11,">3",$B$3:$B$11,"<8" so that it counts only the values in between; for the above 7 column, use $B$3:$B$11,">7". All ranges passed to COUNTIFS must have the same size.
Note: Germany will be 2 for 0 to 3 because the Valid value in the last row is 0.
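As a cross-check of the counts, the same logic (valid rows only, grouped by country and value range) can be sketched in pandas. This is an illustration outside Excel, not part of the answer; the dataframe and the names valid and counts are assumptions.

```python
import pandas as pd

# Data from the question (rows 3 to 11 of the sheet).
df = pd.DataFrame({
    "Country": ["Sweden"] * 3 + ["Norway"] * 2 + ["Germany"] * 4,
    "Value": [10, 5, 1, 5, 5, 12, 2, 3, 1],
    "Valid": [0, 1, 1, 1, 1, 1, 1, 1, 0],
})

# Only rows with Valid == 1 are counted, like the $C$3:$C$11,1 criterion.
valid = df[df["Valid"] == 1]

# One indicator column per value range, then count per country.
counts = (
    pd.DataFrame({
        "0 to 3": valid["Value"].le(3),
        "4 to 7": valid["Value"].between(4, 7),  # inclusive on both ends
        "above 7": valid["Value"].gt(7),
    })
    .astype(int)
    .groupby(valid["Country"], sort=False)
    .sum()
)
print(counts)
```

The Germany row comes out as 2, 0, 1, consistent with the note above.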

Excel formulas for range criteria date, arranged in columns

I want to write a formula for a large data table. The criteria I have to match are in both rows and columns.
I attach the file with the manually calculated results.
|PRODUCT|01-feb|02-feb|03-feb|04-feb|05-feb|06-feb|07-feb|08-feb|09-feb|10-feb|11-feb|12-feb|
|PRODUCT 1|4|3|1|5|2|9|1|3|5|8|0|5|
|PRODUCT 3|2|5|7|4|4|8|3|5|7|4|4|8|
|PRODUCT 1|1|0|5|3|1|1|8|0|5|3|1|1|
|PRODUCT 2|5|4|6|6|0|7|4|4|6|6|0|7|
|PRODUCT 5|8|7|8|7|1|9|2|7|8|7|1|9|
|PRODUCT 4|4|2|9|3|5|1|7|2|9|3|5|1|
|PRODUCT 1|9|8|1|4|4|6|5|8|1|4|4|6|
|PRODUCT 2|6|4|4|7|2|8|6|4|4|7|2|8|
|PRODUCT 5|2|6|1|8|3|9|3|6|1|8|3|9|
|PRODUCT 3|3|9|5|1|7|4|7|9|5|1|7|4|
|PRODUCT 4|7|6|5|5|8|2|1|6|5|5|8|2|
The compact table that I need to get:
|PRODUCT|04-feb|08-feb|12-feb|
|PRODUCT 1|44|48|43|
|PRODUCT 2|42|35|40|
|PRODUCT 3|36|47|40|
|PRODUCT 4|41|32|38|
|PRODUCT 5|47|40|46|
The formula that I think should work (SUMAR.SI.CONJUNTO is SUMIFS in Spanish-language Excel):
=SUMAR.SI.CONJUNTO(C5:N15,B5:B15,H20,C4:N4,"=<"&J19)
because I want the new 04-feb column to show the sum over the date range 01-feb to 04-feb from the first table.
Please help me.
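The following might help you. The formula in the upper-left cell of the summary table is
{=SUM((($B$1:$M$1<=B$14)*($B$1:$M$1>=A$14)*$B$2:$M$13)*($A15=$A$2:$A$13))}
and can be copied to the other cells. The 31. Jan in the summary table is used as a "helper cell", so that you don't have to alter the formula for the different cells. Note that both date comparisons are inclusive, so each boundary date is counted in two adjacent periods (04. Feb is included in both the 04. Feb and 08. Feb columns below); change >=A$14 to >A$14 if the periods should not overlap.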
Product 01. Feb 02. Feb 03. Feb 04. Feb 05. Feb 06. Feb 07. Feb 08. Feb 09. Feb 10. Feb 11. Feb 12. Feb
Product1 5 2 3 3 5 5 3 3 5 3 3 5
Product3 5 4 2 4 5 1 5 3 3 5 3 3
Product4 3 1 2 2 4 5 5 1 5 5 1 5
Product1 4 1 4 3 4 1 4 1 3 4 1 3
Product3 1 2 2 4 5 2 5 1 1 5 1 1
Product4 3 2 4 1 1 4 3 5 2 3 5 2
Product1 4 3 5 1 1 1 2 2 2 2 2 2
Product3 3 2 4 3 5 1 1 1 4 1 1 4
Product4 2 1 4 2 2 1 4 4 3 4 4 3
Product1 4 5 5 2 3 4 3 4 5 3 4 5
Product3 4 2 3 1 4 1 1 3 1 1 3 1
Product4 3 5 3 3 1 4 1 1 3 1 1 3
31. Jan 04. Feb 08. Feb 12. Feb
Product1 54 55 62
Product2 0 0 0
Product3 46 56 46
Product4 41 54 61
Product5 0 0 0
You can use SUMPRODUCT for this. B2:E12 is the range of data for Feb 1 through Feb 4, and O2 holds the criterion you are searching for; in my case O2 was equal to Product 1. When you want the sum for the Feb 8 column, just change B2:E12 to the range of data corresponding to Feb 5 to Feb 8.
=SUMPRODUCT(B2:E12*(A2:A12=O2))
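To verify the expected compact table independently of Excel, the period sums can be sketched in pandas using the data from the question. This assumes non-overlapping four-day periods (01 to 04, 05 to 08, 09 to 12 Feb); the column labels 1 to 12 stand for the days, and the names rows, per_product, periods and summary are hypothetical.

```python
import pandas as pd

# Rows from the question: product name followed by the values for 01-feb to 12-feb.
rows = [
    ("PRODUCT 1", 4, 3, 1, 5, 2, 9, 1, 3, 5, 8, 0, 5),
    ("PRODUCT 3", 2, 5, 7, 4, 4, 8, 3, 5, 7, 4, 4, 8),
    ("PRODUCT 1", 1, 0, 5, 3, 1, 1, 8, 0, 5, 3, 1, 1),
    ("PRODUCT 2", 5, 4, 6, 6, 0, 7, 4, 4, 6, 6, 0, 7),
    ("PRODUCT 5", 8, 7, 8, 7, 1, 9, 2, 7, 8, 7, 1, 9),
    ("PRODUCT 4", 4, 2, 9, 3, 5, 1, 7, 2, 9, 3, 5, 1),
    ("PRODUCT 1", 9, 8, 1, 4, 4, 6, 5, 8, 1, 4, 4, 6),
    ("PRODUCT 2", 6, 4, 4, 7, 2, 8, 6, 4, 4, 7, 2, 8),
    ("PRODUCT 5", 2, 6, 1, 8, 3, 9, 3, 6, 1, 8, 3, 9),
    ("PRODUCT 3", 3, 9, 5, 1, 7, 4, 7, 9, 5, 1, 7, 4),
    ("PRODUCT 4", 7, 6, 5, 5, 8, 2, 1, 6, 5, 5, 8, 2),
]
df = pd.DataFrame(rows, columns=["PRODUCT"] + list(range(1, 13)))

# Sum all rows of the same product, day by day.
per_product = df.groupby("PRODUCT").sum()

# Each summary column covers the four days ending on its header date.
periods = {"04-feb": range(1, 5), "08-feb": range(5, 9), "12-feb": range(9, 13)}
summary = pd.DataFrame(
    {label: per_product[list(days)].sum(axis=1) for label, days in periods.items()}
)
print(summary)
```

Grouping by product first plays the role of the (A2:A12=O2) factor in the SUMPRODUCT formula, and the column selection plays the role of the B2:E12 range.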

Resources