In Python: How to convert 1/8th of space to 1/6th of space?

I have a dataframe at store-product level, as shown in the sample below:
Store  Product  Space  Min     Max   Total_table  Carton_Size
11     Apple    0.25   0.0625  0.75  2            6
11     Orange   0.5    0.125   0.5   2            null
11     Tomato   0.75   0.0625  0.75  2            6
11     Potato   0.375  0.0625  0.75  2            6
11     Melon    0.125  0.0625  0.5   2            null
Scenario: All products here have their space expressed in units of 1/8. If a product has a Carton_Size other than null, that product's space has to be converted to units of 1/(Carton_Size), respecting the Min (Space must not be less than Min) and Max (Space must not be greater than Max) values. Space can be taken from non-carton products, but in the end the sum of the 'Space' column must be less than or equal to the 'Total_table' value. Also, these 1/8 and 1/6 values are relative to 'Total_table'; the Total_table value is split into Space for each product.
Example: In the dataframe above, three products have a carton size, so we can take 1/8 of space from a non-carton product (selecting from the top) and split it into three parts of 1/24 (1/24 + 1/24 + 1/24 = 1/8), which can be added to the three carton products to make it 1/6. This gives the expected output shown below, considering the Min and Max values. If a product doesn't satisfy the Min or Max condition, leave that product unchanged (e.g., Tomato).
Roughly Expected Output:
Store  Product  Space  Min     Max   Total_table  Carton_Size
11     Apple    0.292  0.0625  0.75  2            6
11     Orange   0.375  0.125   0.5   2            null
11     Tomato   0.75   0.0625  0.75  2            6
11     Potato   0.417  0.0625  0.75  2            6
11     Melon    0.125  0.0625  0.5   2            null
Need solution in Python.
Thanks in Advance!
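One possible reading of the example above can be sketched in pandas. This is only a sketch under assumptions that the question does not confirm: a single 1/8 unit is taken from the first non-carton product (from the top) that stays at or above its Min, the unit is split equally across all carton products of the store, and a share that would push a product above its Max is simply left unused. The function name `reallocate` and the parameter `unit` are hypothetical. The sketch reproduces the rough expected output; it does not snap values to exact multiples of 1/6.

```python
import pandas as pd

df = pd.DataFrame({
    "Store": [11] * 5,
    "Product": ["Apple", "Orange", "Tomato", "Potato", "Melon"],
    "Space": [0.25, 0.5, 0.75, 0.375, 0.125],
    "Min": [0.0625, 0.125, 0.0625, 0.0625, 0.0625],
    "Max": [0.75, 0.5, 0.75, 0.75, 0.5],
    "Total_table": [2] * 5,
    "Carton_Size": [6, None, 6, 6, None],
})


def reallocate(group, unit=1 / 8):
    """Move one `unit` of space from a non-carton product to the carton products."""
    group = group.copy()
    has_carton = group["Carton_Size"].notna()
    if not has_carton.any():
        return group
    # Assumption: a donor is a non-carton product that stays >= Min after giving one unit.
    can_give = ~has_carton & (group["Space"] - unit >= group["Min"])
    if not can_give.any():
        return group
    donor = can_give.idxmax()  # label of the first eligible donor from the top
    share = unit / has_carton.sum()  # e.g. 1/8 split three ways = 1/24
    group.loc[donor, "Space"] -= unit
    # Assumption: a carton product only receives its share if it stays <= Max.
    fits = has_carton & (group["Space"] + share <= group["Max"])
    group.loc[fits, "Space"] += share
    return group


result = pd.concat([reallocate(g) for _, g in df.groupby("Store")])
print(result.round(3))
```

Because unused shares are dropped rather than redistributed, the sum of 'Space' can only stay the same or shrink, so it remains within 'Total_table' whenever it was within it before.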

Related

Convert different units of a column in pandas

I'm working on a Kaggle project. Below is my CSV file column:
total_sqft
1056
1112
34.46Sq. Meter
4125Perch
1015 - 1540
34.46
10Sq. Yards
10Acres
10Guntha
10Grounds
The column is of type object. First I want to convert all the values to float then update the string 1015 - 1540 with its average value and finally convert the units to square feet. I've tried different StackOverflow solutions but none of them seems to work. Any help would be appreciated.
Expected Output:
total_sqft
1056.00
1112.00
370.927
1123031.25
1277.5
34.46
90.00
435600
10890
24003.5
1 square meter = 10.764 * square foot
1 perch = 272.25 * square foot
1 square yards = 9 * square foot
1 acres = 43560 * square foot
1 guntha = 1089 * square foot
1 grounds = 2400.35 * square foot
First extract numeric values by Series.str.extractall, convert to floats and get averages:
df['avg'] = (df['total_sqft'].str.extractall(r'(\d+\.*\d*)')
.astype(float)
.groupby(level=0)
.mean())
print (df)
total_sqft avg
0 1056 1056.00
1 1112 1112.00
2 34.46Sq. Meter 34.46
3 4125Perch 4125.00
4 1015 - 1540 1277.50
5 34.46 34.46
More information is then needed to convert to square feet.
EDIT: Create a dictionary to match the units, extract them from the column and use map, then multiply the columns:
d = {'Sq. Meter': 10.764, 'Perch':272.25, 'Sq. Yards':9,
'Acres':43560,'Guntha':1089,'Grounds':2400.35}
df['avg'] = (df['total_sqft'].str.extractall(r'(\d+\.*\d*)')
.astype(float)
.groupby(level=0)
.mean())
df['unit'] = df['total_sqft'].str.extract(f'({"|".join(d)})', expand=False)
df['map'] = df['unit'].map(d).fillna(1)
df['total_sqft'] = df['avg'].mul(df['map'])
print (df)
total_sqft avg unit map
0 1.056000e+03 1056.00 NaN 1.000
1 1.112000e+03 1112.00 NaN 1.000
2 3.709274e+02 34.46 Sq. Meter 10.764
3 1.123031e+06 4125.00 Perch 272.250
4 1.277500e+03 1277.50 NaN 1.000
5 3.446000e+01 34.46 NaN 1.000
6 9.000000e+01 10.00 Sq. Yards 9.000
7 4.356000e+05 10.00 Acres 43560.000
8 1.089000e+04 10.00 Guntha 1089.000
9 2.400350e+04 10.00 Grounds 2400.350

how to update rows based on previous row of dataframe python

I have a time series data given below:
date product price amount
11/01/2019 A 10 20
11/02/2019 A 10 20
11/03/2019 A 25 15
11/04/2019 C 40 50
11/05/2019 C 50 60
I have high-dimensional data and have added only a simplified version with two columns {price, amount}. I am trying to transform it into relative changes along the time index, as illustrated below:
date product price amount
11/01/2019 A NaN NaN
11/02/2019 A 0 0
11/03/2019 A 15 -5
11/04/2019 C NaN NaN
11/05/2019 C 10 10
I am trying to get the relative changes of each product based on the time index. If a previous date does not exist for a product, I add "NaN".
Can you please tell me whether there is a function to do this?
Group by product and use .diff()
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
output :
date product price amount
0 2019-11-01 A NaN NaN
1 2019-11-02 A 0.0 0.0
2 2019-11-03 A 15.0 -5.0
3 2019-11-04 C NaN NaN
4 2019-11-05 C 10.0 10.0
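A self-contained version of the same idea, as a sketch: the frame is rebuilt from the sample in the question, the dates are assumed to be month/day/year, and the sort step is an addition that makes sure "previous row" really means "previous date" within each product. The names `changes` and `out` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["11/01/2019", "11/02/2019", "11/03/2019", "11/04/2019", "11/05/2019"],
        format="%m/%d/%Y",  # assumption: month/day/year
    ),
    "product": ["A", "A", "A", "C", "C"],
    "price": [10, 10, 25, 40, 50],
    "amount": [20, 20, 15, 50, 60],
})

# Sort so that diff() compares each row with the previous date of the same product.
df = df.sort_values(["product", "date"])

# diff() within each product; the first row of every product becomes NaN.
changes = df.groupby("product")[["price", "amount"]].diff()
out = df[["date", "product"]].join(changes)
print(out)
```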

Excel and selecting variables conditionally

I have a data set which contains information by country. For example, Australia_F is the observation for Australia and Australia_Weight is the weight of Australia. Each period represents a specific year.
Period Australia_F Canada_F Denmark_F Japan_F Australia_Weight Canada_Weight Denmark_Weight Japan_weight
1985 0.05 -0.02 0.02 0.03 0.10 0.30 0.45 0.15
1986 -0.04 -0.03 0.02 0.01 0.15 0.30 0.30 0.25
The user can input any value to the following cell. For example I have inserted 3
Weight_Modification = 3
The goal is to only include countries where the variable XXXXX_F is positive
and use those with the highest values such that the total weight of the countries selected is not greater than 1.
The problem is complicated by the fact that the Weight_Modification variable multiplies each individual country weight by whatever the value is. For example, the weight for Australia would be 0.10 * 3 = 0.3 in 1985.
Total weights can be less than 1.00 but can't be greater than 1.00.
So taking the above data as an example, for 1985 the results would be
Australia_weight  Canada_weight  Denmark_weight  Japan_weight  Total_weight
0.3               (blank)        (blank)         0.45          0.75
This is because in 1985 Australia has the highest value (Australia_F = 0.05), followed by Japan (Japan_F = 0.03).
Each country's weights are multiplied by 3.
Denmark is not selected even though Denmark_F is positive, because including Denmark the total weight exceeds 1.
In the actual file there are many more countries (12 in total) and many years.
Any help with how to put this together in excel is greatly appreciated.
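The question asks for Excel, so the following is not an Excel formula. It is only a small Python sketch of the selection rule as described, which may help when checking a spreadsheet implementation. It assumes that countries are visited from highest to lowest positive F value and that selection stops at the first country whose scaled weight would push the total above 1; whether later, smaller countries should still be tried is not stated in the question. The function name `select_weights` is hypothetical.

```python
def select_weights(f_values, weights, weight_modification):
    """Greedy pick of countries with positive F, highest F first, total weight <= 1."""
    chosen = {}
    total = 0.0
    for country in sorted(f_values, key=f_values.get, reverse=True):
        if f_values[country] <= 0:
            break  # only positive observations qualify
        scaled = weights[country] * weight_modification
        if total + scaled > 1.0 + 1e-9:
            break  # assumption: stop at the first country that does not fit
        chosen[country] = scaled
        total += scaled
    return chosen, total


# 1985 row from the question, with Weight_Modification = 3
f_1985 = {"Australia": 0.05, "Canada": -0.02, "Denmark": 0.02, "Japan": 0.03}
w_1985 = {"Australia": 0.10, "Canada": 0.30, "Denmark": 0.45, "Japan": 0.15}

chosen, total = select_weights(f_1985, w_1985, 3)
print(chosen, round(total, 2))
```

For 1985 this selects Australia and Japan and leaves out Denmark, matching the result described in the question.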

How can you identify the best companies for each variable and copy the cases?

I want to compare the means of subgroups. The cases of the subgroups with the lowest and the highest mean should be copied and appended to the end of the dataset:
Input
df.head(10)
Outcome
Company Satisfaction Image Forecast Contact
0 Blue 2 3 3 1
1 Blue 2 1 3 2
2 Yellow 4 3 3 3
3 Yellow 3 4 3 2
4 Yellow 4 2 1 5
5 Blue 1 5 1 2
6 Blue 4 2 4 3
7 Yellow 5 4 1 5
8 Red 3 1 2 2
9 Red 1 1 1 2
I have around 100 cases in my sample. Now I look at the means for each company.
Input
df.groupby(['Company']).mean()
Outcome
Satisfaction Image Forecast Contact
Company
Blue 2.666667 2.583333 2.916667 2.750000
Green 3.095238 3.095238 3.476190 3.142857
Orange 3.125000 2.916667 3.416667 2.625000
Red 3.066667 2.800000 2.866667 3.066667
Yellow 3.857143 3.142857 3.000000 2.714286
So for Satisfaction, Yellow got the best and Blue the worst value. I want to copy the cases of Yellow and Blue and add them to the dataset, but now with the new labels "Best" and "Worst". I don't want to rename them, and I want to iterate over the dataset and do this for the other columns, too (for example Image). Is there a solution for this? After I have added the cases, I want an output like this:
Input
df.groupby(['Company']).mean()
Expected Outcome
Satisfaction Image Forecast Contact
Company
Blue 2.666667 2.583333 2.916667 2.750000
Green 3.095238 3.095238 3.476190 3.142857
Orange 3.125000 2.916667 3.416667 2.625000
Red 3.066667 2.800000 2.866667 3.066667
Yellow 3.857143 3.142857 3.000000 2.714286
Best 3.857143 3.142857 3.000000 3.142857
Worst 2.666667 2.583333 2.866667 2.625000
But as I said, it is really important that the cases of the companies with the best and worst values for each column are added again and not just renamed, because I want to do further data processing with other software.
************************UPDATE****************************
I found out how to copy the correct cases:
Input
df2 = df.loc[df['Company'] == 'Yellow']
df2 = df2.replace('Yellow','Best')
df2 = df2[['Company','Satisfaction']]
new = [df,df2]
result = pd.concat(new)
result
Output
Company Contact Forecast Image Satisfaction
0 Blue 1.0 3.0 3.0 2
1 Blue 2.0 3.0 1.0 2
2 Yellow 3.0 3.0 3.0 4
3 Yellow 2.0 3.0 4.0 3
..........................................
87 Best NaN NaN NaN 3
90 Best NaN NaN NaN 4
99 Best NaN NaN NaN 1
111 Best NaN NaN NaN 2
Now I want to copy the cases of the company with the best values for the other variables, too. But at the moment I have to identify manually which company is best for each category. Isn't there a more convenient solution?
I have a solution. First I create a list of the variables for which I want a dummy "Best" and "Worst" company:
import pandas as pd
variables = ['Contact', 'Forecast', 'Satisfaction', 'Image']
Then I loop over these columns and add the cases again with the new label "Best" or "Worst":
original = df.copy()  # rank companies on the original cases only, not on rows appended below
for col in variables:
    # mean of this column per company, then the companies with the highest and lowest mean
    means = original.groupby('Company')[col].mean()
    best, worst = means.idxmax(), means.idxmin()
    # copy the cases of those companies, keep only this column, and relabel them
    df_best = original.loc[original['Company'] == best, ['Company', col]].assign(Company='Best')
    df_worst = original.loc[original['Company'] == worst, ['Company', col]].assign(Company='Worst')
    df = pd.concat([df, df_best, df_worst])
Thanks guys :)

Convert/unpack pandas dataframe of tuples into a list to use as column headers without ( ,) syntax

I have trimmed strings within a column to isolate key words and create a dataframe (totalarea_cols) which I can then use to label headers of a second dataframe (totalarea_p).
However, it appears that keywords are created as tuples and when used to label columns in second dataframe, the tuples syntax is included (see sample below; totalarea_p.head())
Here is a sample of the code:
totalarea_code = df_meta_p2.loc[df_meta_p2['Label English'].str.contains('Total area under dry season '), 'Code'];
totalarea_cols = df_meta_p2['Label English'].str.extractall('Total area under dry season (.*)').reset_index(drop=True)
totalarea_p = df_data_p2.loc[: , totalarea_code];
totalarea_p.columns = totalarea_cols
Sample of metadata from which I would like to extract keyword from string:
In[33]: df_meta_p2['Label English']
Out[33]:
0 District code
1 Province code
2 Province name in English
3 District name in English
4 Province name in Lao
5 Total area under dry season groundnut (peanut)
6 Total number of households growing dry season ...
7 Total number of households growing dry season ...
8 Total number of households growing dry season ...
9 Total number of households growing dry season ...
10 Total number of households growing dry season ...
11 Total number of households growing dry season yam
12 Total number of households growing dry season ...
13 Total number of households growing dry season ...
14 Total number of households growing dry season ...
15 Total number of households growing dry season ...
16 Total number of households growing dry season ...
17 Total number of households growing dry season ...
18 Total number of households growing dry season ...
19 Total number of households growing dry season ...
Name: Label English, dtype: object
Sample of DataFrame output using str.extractall:
In [34]: totalarea_cols
Out[34]:
0
0 groundnut (peanut)
1 lowland rice/irrigation rice
2 upland rice
3 potato
4 sweet potato
5 cassava
6 yam
7 taro
8 other tuber, root and bulk crops
9 mungbeans
10 cowpea
11 sugar cane
12 soybean
13 sesame
14 cotton
15 tobacco
16 vegetable not specified
17 cabbage
Sample of column headers when substitute into second DataFrame, totalarea_p:
In [36]: totalarea_p.head()
Out[36]:
(groundnut (peanut),) (lowland rice/irrigation rice,) (upland rice,) \
0 0.0 0.00 0
1 0.0 0.00 0
2 0.0 0.00 0
3 0.0 0.30 0
4 0.0 1.01 0
(potato,) (sweet potato,) (cassava,) (yam,) (taro,) \
0 0.0 0.00 0.0 0.0 0
1 0.0 0.00 0.0 0.0 0
2 0.0 0.52 0.0 0.0 0
3 0.0 0.01 0.0 0.0 0
4 0.0 0.00 0.0 0.0 0
I have spent the better part of a day searching for an answer but, other than the post found here, am coming up blank. Any ideas??
You need to select column 0 to get a Series, so change the code to:
totalarea_p.columns = totalarea_cols[0]
Or select by position with iloc:
totalarea_p.columns = totalarea_cols.iloc[:, 0]
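The point of the answer can be seen on a toy example. This is a minimal sketch with made-up labels and data (the names `labels`, `extracted` and `data` are illustrative, not from the question): `str.extractall` returns a DataFrame even when the pattern has a single capture group, so one column has to be selected before its values are used as headers. Converting to a plain list with `tolist()` is an extra step added here so the headers are ordinary strings.

```python
import pandas as pd

# Toy stand-in for df_meta_p2['Label English']
labels = pd.Series([
    "District code",
    "Total area under dry season groundnut (peanut)",
    "Total area under dry season upland rice",
])

# extractall returns a DataFrame whose single capture group is column 0
extracted = (labels.str.extractall(r"Total area under dry season (.*)")
                   .reset_index(drop=True))

# Toy stand-in for totalarea_p
data = pd.DataFrame([[0.0, 0], [0.3, 0]], columns=["c1", "c2"])

# Select column 0 (a Series) and use its values as the new headers
data.columns = extracted[0].tolist()
print(data.columns.tolist())
```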
