Cannot replace values in a pandas DataFrame column with map - python-3.x

I'm trying to replace a column in a pandas DataFrame using a dictionary and the map method. I found a method without map, but it's very ugly.
Here is my dictionary:
{'A+': '97–100%',
'A': '93–96%',
'A−': '90–92%',
'B+': '87–89%',
'B': '83–86%',
'B−': '80–82%',
'C+': '77–79%',
'C': '73–76%',
'C-': '70–72%',
'D+': '67–69%',
'D': '63–66%',
'D-': '60–62%',
'F': '0–59%'}
And here is my DataFrame:
Fname Lname MidExam FinalExam CW1 CW2 TotalPoints StudentAverage Grade
1 Velma Paul 49% 66% 59% 78% 252 63.00% D
2 Kibo Mcgee 75% 75% 68% 66% 284 71.00% C-
3 Louis Underwood 98% 44% 67% 42% 251 62.75% D-
4 Phyllis Clemons 65% 45% 65% 55% 230 57.50% F
5 Zenaida Mcdowell 65% 54% 65% 54% 238 59.50% F
I want to replace the letters D, F, ... with the percentage ranges from the dictionary.
After I use
df["Grade"] = df["Grade"].map(usa_grade_dict)
I get
Fname Lname MidExam FinalExam CW1 CW2 TotalPoints StudentAverage Grade
1 Velma Paul 49% 66% 59% 78% 252 63.00% NaN
2 Kibo Mcgee 75% 75% 68% 66% 284 71.00% NaN
3 Louis Underwood 98% 44% 67% 42% 251 62.75% NaN
4 Phyllis Clemons 65% 45% 65% 55% 230 57.50% NaN
5 Zenaida Mcdowell 65% 54% 65% 54% 238 59.50% NaN
I do not know why I am getting NaN instead of the mapped values.
Any help would be appreciated.

If you get missing values in the output, it means the keys of the dictionary do not match the values in the column.
If the problem is whitespace in the column, use Series.str.strip:
df["Grade"] = df["Grade"].str.strip().map(usa_grade_dict)
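A minimal sketch of how one might confirm the mismatch before fixing it. The padded grade strings below are hypothetical (the question does not show the raw values), and only a subset of the dictionary is used. Note also that the dictionary in the question appears to write 'A−' and 'B−' with a Unicode minus sign rather than the ASCII hyphen used in 'C-' and 'D-', which would cause the same kind of mismatch for those grades.

```python
import pandas as pd

# Subset of the dictionary from the question.
usa_grade_dict = {"D": "63–66%", "C-": "70–72%", "D-": "60–62%", "F": "0–59%"}

# Hypothetical column with stray whitespace, as the answer suspects.
df = pd.DataFrame({"Grade": ["D ", " C-", "D-\t", "F "]})

# repr() makes invisible characters visible, e.g. 'D ' instead of 'D'.
print([repr(value) for value in df["Grade"].unique()])

# Values that have no matching key in the dictionary.
unmatched_values = set(df["Grade"]) - set(usa_grade_dict)
print(unmatched_values)

# Mapping without cleaning gives NaN everywhere; stripping first fixes it.
unmatched = df["Grade"].map(usa_grade_dict)
cleaned = df["Grade"].str.strip().map(usa_grade_dict)
print(cleaned.tolist())
```

If the set of unmatched values is empty and the result is still NaN, the keys and values probably differ in dtype rather than in whitespace.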

Related

How to do SUMPRODUCT with percentage and blank cells

I am trying to do SUMPRODUCT in Google Sheets but in a more complicated situation.
I want SUMPRODUCT to work with percentages instead of decimal multipliers.
This is what I am doing now, and it works just fine:
A      B      C     D
       Price  Tax   Cashback
       100    1.09  0.95
       80     1     1
       50     1.09  0.95
Total  =SUMPRODUCT(B:B, C:C, D:D)
What I actually want to do is
A      B      C     D
       Price  Tax   Cashback
       100    9%    5%
       80
       50     9%    5%
Total  ???
Use
=SUMPRODUCT(B2:B, 1+C2:C, 1-D2:D)
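The arithmetic behind that formula can be sketched in Python. This is only an illustration of the calculation, not of how Sheets evaluates it; it assumes blank cells count as 0, so a row with no tax and no cashback contributes its plain price. The function name is hypothetical.

```python
# Rows from the question: (price, tax, cashback); None stands for a blank cell.
rows = [
    (100, 0.09, 0.05),
    (80, None, None),
    (50, 0.09, 0.05),
]


def sumproduct_with_percentages(rows):
    """Sum price * (1 + tax) * (1 - cashback), treating blanks as 0."""
    total = 0.0
    for price, tax, cashback in rows:
        total += price * (1 + (tax or 0)) * (1 - (cashback or 0))
    return total


total = sumproduct_with_percentages(rows)
print(round(total, 3))
```

With these rows the result equals the total of the original decimal version, 100 * 1.09 * 0.95 + 80 + 50 * 1.09 * 0.95.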

Find minimum value from four columns and compare it with range in different column?

I have an Excel sheet with the data below:
col1   col2   col3   col4   col5   output         range
-----------------------------------------------------------------------------
-1     -1     -1     -1     -1                    99.9% - 100%
-1     -1     -1     -1     -1                    98% - 99.8%
87.8   78.6   95.2   98.2   94.7                  95% - 98.9%
100    100    100    100    100                   90% - 94.9%
90.4   86     96.6   73.2   95.5                  80% - 89.9%
92.9   88.9   93.1   100    100                   0% - 79.9%
85.7   80     82.2   100    100
85.7   80     82.2   100    100
98.3   100    97.9   100    94.4
Now I need to come up with a formula that does the following:
1. Find the minimum of col2, col3, col4 and col5. If that minimum falls within one of the ranges listed in the range column, print that range in the output column.
2. If col1 has the value -1, write "Fail" in the output column instead and ignore point 1.
So for example output will be:
col1   col2   col3   col4   col5   output         range
-----------------------------------------------------------------------------
-1     -1     -1     -1     -1     Fail           99.9% - 100%
-1     -1     -1     -1     -1     Fail           98% - 99.8%
87.8   78.6   95.2   98.2   94.7   0% - 79.9%     95% - 98.9%
100    100    100    100    100    99.9% - 100%   90% - 94.9%
90.4   86     96.6   73.2   95.5   0% - 79.9%     80% - 89.9%
92.9   88.9   93.1   100    100    80% - 89.9%    0% - 79.9%
85.7   80     82.2   100    100    80% - 89.9%
85.7   80     82.2   100    100    80% - 89.9%
98.3   100    97.9   100    94.4   90% - 94.9%
Is this possible in Excel? It looks pretty complex, so I am confused about how to do this automatically with a formula.
Here's one way. Columns K through N are a reference.
Formula for H2:
=MIN(B2:E2)
Formula for I2:
=IF(A2=-1,"z",IF(H2>$L$2,"a",IF(H2>$L$3,"b",IF(H2>$L$4,"c",IF(H2>$L$5,"d",IF(H2>$L$6,"e","f"))))))
Formula for F2:
=VLOOKUP(I2,K:N,4,FALSE)
Drag 'em down and you're done.
Granted, you could accomplish this with fewer columns, but I've laid it out this way for illustration.
Order the ranges in ascending order, create a column of the lower bounds of the ranges and use LOOKUP function to find the appropriate range.
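The lower-bound lookup idea can be sketched outside Excel as well. The Python below is a rough illustration, assuming the lower bounds are taken from the range labels in the question; bisect_right plays the role of an approximate-match LOOKUP over a sorted column. Because the labels "95% - 98.9%" and "98% - 99.8%" overlap in the question, this sketch resolves values from 98 upward to the higher range.

```python
from bisect import bisect_right

# Lower bounds in ascending order, paired with the labels from the question.
LOWER_BOUNDS = [0.0, 80.0, 90.0, 95.0, 98.0, 99.9]
LABELS = [
    "0% - 79.9%",
    "80% - 89.9%",
    "90% - 94.9%",
    "95% - 98.9%",
    "98% - 99.8%",
    "99.9% - 100%",
]


def classify(row):
    """Return "Fail" if col1 is -1, else the range holding min(col2..col5)."""
    col1, *rest = row
    if col1 == -1:
        return "Fail"
    lowest = min(rest)
    # Index of the last lower bound that is <= lowest.
    idx = bisect_right(LOWER_BOUNDS, lowest) - 1
    if idx < 0:
        raise ValueError(f"value {lowest} is below every lower bound")
    return LABELS[idx]


print(classify([87.8, 78.6, 95.2, 98.2, 94.7]))
print(classify([-1, -1, -1, -1, -1]))
```

Keeping the bounds and labels in two parallel sorted lists mirrors the two-column reference table the answer describes.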

How do I One Hot encode mixed strings and number cell values in pandas?

I have a dataset in which I need to one-hot encode the composition of mixtures of different materials.
The columns of my dataset look like this:
id Composition
0 ZrB2 - 5% B4C
1 HfB2 - 15% SiC - 3% WC
2 HfB2 - 15% SiC
I need to put it in this format:
0)
ZrB2 95
HfB2 0
SiC 0
B4C 5
WC 0
1)
ZrB2 0
HfB2 82
SiC 15
B4C 0
WC 3
2)
ZrB2 0
HfB2 85
SiC 15
B4C 0
WC 0
WB 0
This is not one-hot encoding as such, but parsing a list of strings into their constituent parts:
Each component is delimited by " - ".
Each component is made up of two parts, a percentage and a column name. Build a regular expression that matches each of these parts.
Use a list/dict comprehension to put the result into a DataFrame.
Finally, calculate the percentage for the column where it was not defined.
import re

import numpy as np
import pandas as pd

data = ['ZrB2 - 5% B4C', 'HfB2 - 15% SiC - 3% WC', 'HfB2 - 15% SiC']
dfhc = pd.DataFrame({"Composition":data})
# build a list of dict, where dict is of form {'ZrB2': -1, 'B4C': '5'}
# where no %age, default to -1 to be calculated later
parse1 = [{tt[1]:tt[0].replace("% ","") if len(tt[0])>0 else -1
           for t in r
           # parse out token and percentage, exclude empty tuples (default behaviour of re.findall())
           for tt in [x for x in re.findall("([0-9]*[%]?[ ]?)([A-Z,a-z,0-9]*)",t) if x!=("","")]
          }
          # each column is delimited by " - "
          for r in [re.split(" - ",r) for r in dfhc["Composition"].values]
         ]
df = pd.DataFrame(parse1)
# dtype is important for sum() to work
df = df.astype({c:np.float64 for c in df.columns})
# where %age was not known and defaulted to -1 set it to 100 - sum of other cols
for c in df.columns:
    mask = df[df[c]==-1].index
    df.loc[mask, c] = 100 - df.loc[mask, [cc for cc in df.columns if cc!=c]].sum(axis=1)
print(f"{dfhc.to_string(index=False)}\n\n{df.to_string(index=False)}\n\n{parse1}")
output
Composition
ZrB2 - 5% B4C
HfB2 - 15% SiC - 3% WC
HfB2 - 15% SiC
ZrB2 B4C HfB2 SiC WC
95.0 5.0 NaN NaN NaN
NaN NaN 82.0 15.0 3.0
NaN NaN 85.0 15.0 NaN
[{'ZrB2': -1, 'B4C': '5'}, {'HfB2': -1, 'SiC': '15', 'WC': '3'}, {'HfB2': -1, 'SiC': '15'}]
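The output above leaves NaN where a material is absent, whereas the format requested in the question shows 0. A possible variation, sketched below with a hypothetical helper parse_composition and a simpler per-token regular expression than the one in the answer, fills those gaps with fillna(0). It assumes each row has at most one component without a percentage.

```python
import re

import pandas as pd


def parse_composition(text):
    """Split 'HfB2 - 15% SiC' into {'HfB2': 85.0, 'SiC': 15.0}."""
    parts = {}
    for token in text.split(" - "):
        match = re.fullmatch(r"(?:(\d+(?:\.\d+)?)%\s*)?(\w+)", token.strip())
        if match is None:
            raise ValueError(f"cannot parse component: {token!r}")
        pct, name = match.groups()
        parts[name] = float(pct) if pct is not None else None
    # The component without a percentage takes whatever remains of 100.
    known = sum(value for value in parts.values() if value is not None)
    return {k: (100.0 - known if v is None else v) for k, v in parts.items()}


data = ['ZrB2 - 5% B4C', 'HfB2 - 15% SiC - 3% WC', 'HfB2 - 15% SiC']
wide = pd.DataFrame([parse_composition(s) for s in data]).fillna(0)
print(wide.to_string(index=False))
```

Materials that never occur in the data (such as WB in the question's last example) would still need to be added as all-zero columns by hand.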

Convert/unpack pandas dataframe of tuples into a list to use as column headers without ( ,) syntax

I have trimmed strings within a column to isolate key words and create a dataframe (totalarea_cols) which I can then use to label headers of a second dataframe (totalarea_p).
However, it appears that keywords are created as tuples and when used to label columns in second dataframe, the tuples syntax is included (see sample below; totalarea_p.head())
Here is a sample of the code:
totalarea_code = df_meta_p2.loc[df_meta_p2['Label English'].str.contains('Total area under dry season '), 'Code'];
totalarea_cols = df_meta_p2['Label English'].str.extractall('Total area under dry season (.*)').reset_index(drop=True)
totalarea_p = df_data_p2.loc[: , totalarea_code];
totalarea_p.columns = totalarea_cols
Sample of metadata from which I would like to extract keyword from string:
In[33]: df_meta_p2['Label English']
Out[33]:
0 District code
1 Province code
2 Province name in English
3 District name in English
4 Province name in Lao
5 Total area under dry season groundnut (peanut)
6 Total number of households growing dry season ...
7 Total number of households growing dry season ...
8 Total number of households growing dry season ...
9 Total number of households growing dry season ...
10 Total number of households growing dry season ...
11 Total number of households growing dry season yam
12 Total number of households growing dry season ...
13 Total number of households growing dry season ...
14 Total number of households growing dry season ...
15 Total number of households growing dry season ...
16 Total number of households growing dry season ...
17 Total number of households growing dry season ...
18 Total number of households growing dry season ...
19 Total number of households growing dry season ...
Name: Label English, dtype: object
Sample of DataFrame output using str.extractall:
In [34]: totalarea_cols
Out[34]:
0
0 groundnut (peanut)
1 lowland rice/irrigation rice
2 upland rice
3 potato
4 sweet potato
5 cassava
6 yam
7 taro
8 other tuber, root and bulk crops
9 mungbeans
10 cowpea
11 sugar cane
12 soybean
13 sesame
14 cotton
15 tobacco
16 vegetable not specified
17 cabbage
Sample of column headers when substitute into second DataFrame, totalarea_p:
In [36]: totalarea_p.head()
Out[36]:
(groundnut (peanut),) (lowland rice/irrigation rice,) (upland rice,) \
0 0.0 0.00 0
1 0.0 0.00 0
2 0.0 0.00 0
3 0.0 0.30 0
4 0.0 1.01 0
(potato,) (sweet potato,) (cassava,) (yam,) (taro,) \
0 0.0 0.00 0.0 0.0 0
1 0.0 0.00 0.0 0.0 0
2 0.0 0.52 0.0 0.0 0
3 0.0 0.01 0.0 0.0 0
4 0.0 0.00 0.0 0.0 0
I have spent the better part of a day searching for an answer but, other than the post found here, am coming up blank. Any ideas??
str.extractall returns a DataFrame, so you need to select column 0 to get a Series. Change the code to:
totalarea_p.columns = totalarea_cols[0]
Or select the column by position with iloc:
totalarea_p.columns = totalarea_cols.iloc[:, 0]
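A small self-contained sketch of the same idea, using made-up labels and data in place of df_meta_p2 and df_data_p2 (those frames are not shown in full in the question). It illustrates that str.extractall yields a one-column DataFrame and that selecting column 0 gives plain strings for the headers.

```python
import pandas as pd

# Hypothetical stand-in for df_meta_p2['Label English'].
labels = pd.Series([
    "District code",
    "Total area under dry season groundnut (peanut)",
    "Total area under dry season upland rice",
])

extracted = (
    labels.str.extractall(r"Total area under dry season (.*)")
    .reset_index(drop=True)
)
# extractall returns a DataFrame whose single capture-group column is named 0.
print(type(extracted).__name__, list(extracted.columns))

# Hypothetical stand-in for totalarea_p.
totalarea = pd.DataFrame([[0.0, 0.0], [0.3, 1.01]])
totalarea.columns = extracted[0]  # a Series of strings, not a DataFrame
print(list(totalarea.columns))
```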

Percentage graph from absolute values

I have the following data:
Date A B C
2012/07 7 6 0
2012/08 9 4 0
2012/09 9 3 0
2012/10 14 2 1
2012/11 9 16 0
2012/12 0 14 0
2013/01 7 9 1
2013/02 8 13 1
2013/03 16 62 16
2013/04 7 12 4
2013/05 10 11 1
2013/06 6 37 4
I want to make a line graph from these data, but I want it to show percentages of line total (A + B + C) instead of the absolute values. How can I do this directly, without resorting to intermediate cells where I'd insert formulas to calculate the percentages or adding a line total column?
So the end result should look like this:
But I don't want to have to "manually" create cells like these:
A B C
2012/07 54% 46% 0%
2012/08 69% 31% 0%
2012/09 75% 25% 0%
2012/10 82% 12% 6%
2012/11 36% 64% 0%
2012/12 0% 100% 0%
2013/01 41% 53% 6%
2013/02 36% 59% 5%
2013/03 17% 66% 17%
2013/04 30% 52% 17%
2013/05 45% 50% 5%
2013/06 13% 79% 9%
Use Named Ranges.
First, define the name "Total" as =B2:B12+C2:C12+D2:D12.
Then define three names for the percentages: "PctA" as =B2:B12/Total, and "PctB" and "PctC" likewise from columns C and D.
Then define a name "Dates" as =A2:A12.
Insert a line chart and enter the three percentage names as the data series. Enter them with the sheet reference, as Sheet1!PctA and so on, because Excel won't accept the names without a sheet reference.
Do the same for Dates as the horizontal category range.
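For reference, the calculation those named ranges perform (each value divided by its row total) can be sketched in Python with pandas. This only illustrates the arithmetic on three rows from the question; it is not a substitute for the Excel chart setup.

```python
import pandas as pd

# Three rows taken from the question's data.
counts = pd.DataFrame(
    {"A": [7, 9, 0], "B": [6, 4, 14], "C": [0, 0, 0]},
    index=["2012/07", "2012/08", "2012/12"],
)

# Equivalent of the "Total" name: the row-wise sum A + B + C.
total = counts.sum(axis=1)

# Equivalent of PctA, PctB, PctC: each column divided by the row total.
pct = counts.div(total, axis=0)
rounded = (pct * 100).round().astype(int)
print(rounded)
```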
