Convert/unpack pandas dataframe of tuples into a list to use as column headers without ( ,) syntax - python-3.x

I have trimmed strings within a column to isolate key words and create a dataframe (totalarea_cols), which I can then use to label the headers of a second dataframe (totalarea_p).
However, it appears that the keywords are created as tuples, and when they are used to label the columns of the second dataframe, the tuple syntax is included (see the sample below from totalarea_p.head()).
Here is a sample of the code:
totalarea_code = df_meta_p2.loc[df_meta_p2['Label English'].str.contains('Total area under dry season '), 'Code'];
totalarea_cols = df_meta_p2['Label English'].str.extractall('Total area under dry season (.*)').reset_index(drop=True)
totalarea_p = df_data_p2.loc[: , totalarea_code];
totalarea_p.columns = totalarea_cols
Sample of the metadata from which I would like to extract the keyword:
In[33]: df_meta_p2['Label English']
Out[33]:
0 District code
1 Province code
2 Province name in English
3 District name in English
4 Province name in Lao
5 Total area under dry season groundnut (peanut)
6 Total number of households growing dry season ...
7 Total number of households growing dry season ...
8 Total number of households growing dry season ...
9 Total number of households growing dry season ...
10 Total number of households growing dry season ...
11 Total number of households growing dry season yam
12 Total number of households growing dry season ...
13 Total number of households growing dry season ...
14 Total number of households growing dry season ...
15 Total number of households growing dry season ...
16 Total number of households growing dry season ...
17 Total number of households growing dry season ...
18 Total number of households growing dry season ...
19 Total number of households growing dry season ...
Name: Label English, dtype: object
Sample of DataFrame output using str.extractall:
In [34]: totalarea_cols
Out[34]:
0
0 groundnut (peanut)
1 lowland rice/irrigation rice
2 upland rice
3 potato
4 sweet potato
5 cassava
6 yam
7 taro
8 other tuber, root and bulk crops
9 mungbeans
10 cowpea
11 sugar cane
12 soybean
13 sesame
14 cotton
15 tobacco
16 vegetable not specified
17 cabbage
Sample of the column headers when substituted into the second DataFrame, totalarea_p:
In [36]: totalarea_p.head()
Out[36]:
(groundnut (peanut),) (lowland rice/irrigation rice,) (upland rice,) \
0 0.0 0.00 0
1 0.0 0.00 0
2 0.0 0.00 0
3 0.0 0.30 0
4 0.0 1.01 0
(potato,) (sweet potato,) (cassava,) (yam,) (taro,) \
0 0.0 0.00 0.0 0.0 0
1 0.0 0.00 0.0 0.0 0
2 0.0 0.52 0.0 0.0 0
3 0.0 0.01 0.0 0.0 0
4 0.0 0.00 0.0 0.0 0
I have spent the better part of a day searching for an answer but, other than the post found here, am coming up blank. Any ideas??

You need to select column 0 to get a Series, so change the code to:
totalarea_p.columns = totalarea_cols[0]
Or select by position with iloc:
totalarea_p.columns = totalarea_cols.iloc[:, 0]
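For context, here is a minimal sketch of the same pattern on made-up toy strings (not the original survey metadata), showing that selecting column 0 from the extractall result gives plain string labels:
import pandas as pd

# Toy stand-in for df_meta_p2['Label English'] (hypothetical values).
meta = pd.Series(['Total area under dry season potato',
                  'Total area under dry season cassava'])

# str.extractall returns a DataFrame even with a single capture group;
# assigning that DataFrame to .columns is what produces ('potato',)-style labels.
cols = meta.str.extractall('Total area under dry season (.*)').reset_index(drop=True)

data = pd.DataFrame([[0.0, 0.3], [1.2, 0.0]])
data.columns = cols[0]            # select column 0 -> plain string labels
print(data.columns.tolist())      # ['potato', 'cassava']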

Related

Convert different units of a column in pandas

I'm working on a Kaggle project. Below is my CSV file column:
total_sqft
1056
1112
34.46Sq. Meter
4125Perch
1015 - 1540
34.46
10Sq. Yards
10Acres
10Guntha
10Grounds
The column is of type object. First I want to convert all the values to float, then replace the string 1015 - 1540 with its average value, and finally convert the units to square feet. I've tried different Stack Overflow solutions, but none of them seems to work. Any help would be appreciated.
Expected Output:
total_sqft
1056.00
1112.00
370.307
1123031.25
1277.5
34.46
90.00
435600
10890
24003.5
1 square meter = 10.764 square feet
1 perch = 272.25 square feet
1 square yard = 9 square feet
1 acre = 43560 square feet
1 guntha = 1089 square feet
1 ground = 2400.35 square feet
First extract the numeric values with Series.str.extractall, convert them to floats, and take the mean per original row:
df['avg'] = (df['total_sqft'].str.extractall(r'(\d+\.*\d*)')
                             .astype(float)
                             .groupby(level=0)
                             .mean())
print(df)
total_sqft avg
0 1056 1056.00
1 1112 1112.00
2 34.46Sq. Meter 34.46
3 4125Perch 4125.00
4 1015 - 1540 1277.50
5 34.46 34.46
Then more information is needed to convert the values to square feet.
EDIT: Create a dictionary of the units, extract the unit from the column, map it to its conversion factor with Series.map, and finally multiply the columns:
d = {'Sq. Meter': 10.764, 'Perch': 272.25, 'Sq. Yards': 9,
     'Acres': 43560, 'Guntha': 1089, 'Grounds': 2400.35}

df['avg'] = (df['total_sqft'].str.extractall(r'(\d+\.*\d*)')
                             .astype(float)
                             .groupby(level=0)
                             .mean())
df['unit'] = df['total_sqft'].str.extract(f'({"|".join(d)})', expand=False)
df['map'] = df['unit'].map(d).fillna(1)
df['total_sqft'] = df['avg'].mul(df['map'])
print(df)
total_sqft avg unit map
0 1.056000e+03 1056.00 NaN 1.000
1 1.112000e+03 1112.00 NaN 1.000
2 3.709274e+02 34.46 Sq. Meter 10.764
3 1.123031e+06 4125.00 Perch 272.250
4 1.277500e+03 1277.50 NaN 1.000
5 3.446000e+01 34.46 NaN 1.000
6 9.000000e+01 10.00 Sq. Yards 9.000
7 4.356000e+05 10.00 Acres 43560.000
8 1.089000e+04 10.00 Guntha 1089.000
9 2.400350e+04 10.00 Grounds 2400.350
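Optionally, the intermediate columns can be dropped afterwards; a small cleanup sketch, assuming the 'avg', 'unit' and 'map' helpers from the snippet above are no longer needed:
# Optional cleanup (not part of the original answer): drop the helper columns
# once the converted values are in place.
df = df.drop(columns=['avg', 'unit', 'map'])
print(df)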

In Python: How to convert 1/8th of space to 1/6th of space?

I have a dataframe at store-product level, as shown in the sample below:
Store Product Space Min Max Total_table Carton_Size
11 Apple 0.25 0.0625 0.75 2 6
11 Orange 0.5 0.125 0.5 2 null
11 Tomato 0.75 0.0625 0.75 2 6
11 Potato 0.375 0.0625 0.75 2 6
11 Melon 0.125 0.0625 0.5 2 null
Scenario: All products here have space expressed in 1/8ths. But if a product has a Carton_Size other than null, then that product's space has to be converted into 1/(Carton_Size)ths, respecting the Min (Space shouldn't be less than Min) and Max (Space shouldn't be greater than Max) values. Space can be taken from the non-carton products, but in the end the sum of the 'Space' column should be equal to or less than the 'Total_table' value. Also, these 1/8th and 1/6th values are in relation to 'Total_table'; the Total_table value is split as Space across the products.
Example: In the dataframe above, three products have a carton size, so we can take 1/8th of space from a non-carton product (selecting from the top) and split it into 1/24ths (1/24 + 1/24 + 1/24 = 1/8), which can be added to the three carton products to make each of them 1/6th. This forms the expected output shown below, considering the Min and Max values. If a product doesn't satisfy the Min or Max condition, leave that product unchanged (e.g., Tomato).
Roughly Expected Output:
Store Product Space Min Max Total_table Carton_Size
11 Apple 0.292 0.0625 0.75 2 6
11 Orange 0.375 0.125 0.5 2 null
11 Tomato 0.75 0.0625 0.75 2 6
11 Potato 0.417 0.0625 0.75 2 6
11 Melon 0.125 0.0625 0.5 2 null
Need solution in Python.
Thanks in Advance!

how to update rows based on previous row of dataframe python

I have a time series data given below:
date product price amount
11/01/2019 A 10 20
11/02/2019 A 10 20
11/03/2019 A 25 15
11/04/2019 C 40 50
11/05/2019 C 50 60
I have high-dimensional data; I have added a simplified version with just the two columns {price, amount}. I am trying to transform it into relative changes based on the time index, as illustrated below:
date product price amount
11/01/2019 A NaN NaN
11/02/2019 A 0 0
11/03/2019 A 15 -5
11/04/2019 C NaN NaN
11/05/2019 C 10 10
I am trying to get the relative change for each product based on the time index. If a previous date does not exist for a given product, I add NaN.
Can you please tell me whether there is a function to do this?
Group by product and use .diff()
df[["price", "amount"]] = df.groupby("product")[["price", "amount"]].diff()
Output:
date product price amount
0 2019-11-01 A NaN NaN
1 2019-11-02 A 0.0 0.0
2 2019-11-03 A 15.0 -5.0
3 2019-11-04 C NaN NaN
4 2019-11-05 C 10.0 10.0
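A self-contained sketch of the same approach, rebuilding the sample frame from the question (parsing the dates with pd.to_datetime, as in the output above, is an extra assumption and is not required for diff itself):
import pandas as pd

# Rebuild the sample data from the question.
df = pd.DataFrame({
    'date': pd.to_datetime(['11/01/2019', '11/02/2019', '11/03/2019',
                            '11/04/2019', '11/05/2019']),
    'product': ['A', 'A', 'A', 'C', 'C'],
    'price': [10, 10, 25, 40, 50],
    'amount': [20, 20, 15, 50, 60],
})

# Per-product difference to the previous row; the first row of each product
# has no predecessor, so it becomes NaN.
df[['price', 'amount']] = df.groupby('product')[['price', 'amount']].diff()
print(df)
If the rows are not already ordered by date within each product, sort them first (df = df.sort_values(['product', 'date'])) so that diff() compares consecutive dates.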

Adding values to a new column in Pandas depending on values in an existing column

I have a pandas dataframe as follows:
Name Age City Country percentage
a Jack 34 Sydney Australia 0.23
b Riti 30 Delhi India 0.45
c Vikas 31 Mumbai India 0.55
d Neelu 32 Bangalore India 0.73
e John 16 New York US 0.91
f Mike 17 las vegas US 0.78
I am planning to add one more column called bucket whose definition depends on the percentage column as follows:
less than 0.25 = 1
between 0.25 and 0.5 = 2
between 0.5 and 0.75 = 3
greater than 0.75 = 4
I tried building conditions and choices and using np.select as follows:
conditions = [(df_obj['percentage'] < .25),
              (df_obj['percentage'] >=.25 & df_obj['percentage'] < .5),
              (df_obj['percentage'] >=.5 & df_obj['percentage'] < .75),
              (df_obj['percentage'] >= .75)]
choices = [1, 2, 3, 4]
df_obj['bucket'] = np.select(conditions, choices)
However, this gives me a random error as follows in the line where I create the conditions:
TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]
A quick fix to your code is that you need more parentheses: & binds more tightly than the comparison operators, so each comparison must be wrapped in its own pair, for example:
((df_obj['percentage'] >= .25) & (df_obj['percentage'] < .5))
However, I think it's cleaner with pd.cut:
df['bucket'] = pd.cut(df['percentage'], bins=[0, 0.25, 0.5, 0.75, 1],
                      include_lowest=True, right=False,
                      labels=[1, 2, 3, 4])
Or since your buckets are linear:
df['bucket'] = (df['percentage']//0.25).add(1).astype(int)
Output
Name Age City Country percentage bucket
a Jack 34 Sydney Australia 0.23 1
b Riti 30 Delhi India 0.45 2
c Vikas 31 Mumbai India 0.55 3
d Neelu 32 Bangalore India 0.73 3
e John 16 New York US 0.91 4
f Mike 17 las vegas US 0.78 4
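For reference, a sketch of the question's np.select approach with parentheses added around each comparison (it assumes df_obj from the question; only the grouping changes, the conditions and choices are the same):
import numpy as np

conditions = [(df_obj['percentage'] < .25),
              (df_obj['percentage'] >= .25) & (df_obj['percentage'] < .5),
              (df_obj['percentage'] >= .5) & (df_obj['percentage'] < .75),
              (df_obj['percentage'] >= .75)]
choices = [1, 2, 3, 4]
df_obj['bucket'] = np.select(conditions, choices)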
I think the easiest/most readable way to do this is to use the apply function:
def percentage_to_bucket(percentage):
    if percentage < .25:
        return 1
    elif percentage >= .25 and percentage < .5:
        return 2
    elif percentage >= .5 and percentage < .75:
        return 3
    else:
        return 4

df["bucket"] = df["percentage"].apply(percentage_to_bucket)
Pandas apply will take each value of a given column and apply the passed function to this value, returning a pandas series with the results, which you can then assign to your new column.

excel formula in one column based on variable entries in another column

My goal is to populate column G for each row that has a CPTCode (column B). If I were to copy and paste by hand, this would simply be: F2/F5, F3/F5, F4/F5, F6/F15, F7/F15, etc. Column A contains the names of staff. Each month staff code for different types of procedures (columns B and C) and a variable quantity of each (column E). The rows with Total in column B represent the end of a staff person's monthly list of codes. We are hoping to provide monthly updates to our service chiefs. I am hoping someone might be able to help me devise a formula I could copy down column G. We are looking at roughly 3000-4000 rows of codes per month for about 200 different clinical providers.
CPTCode CPTName Work RVU FY16 Total Qty FY16 Total RVU CPT's % of FY RVUs
96119 NEUROPSYCH TESTING BY TECH 0.6 76 41.8
99212 OFFICE/OUTPATIENT VISIT EST 0.5 2 1.0
T1016 CASE MANAGEMENT 0.5 1 0.5
Total 79 43.3
H0038 SELF-HELP/PEER SVC PER 15MIN 0.0 727 0.0
90853 GROUP PSYCHOTHERAPY 0.6 236 139.2
99212 OFFICE/OUTPATIENT VISIT EST 0.5 153 73.4
S9446 PT EDUCATION NOC GROUP 0.4 105 42.0
99211 OFFICE/OUTPATIENT VISIT EST 0.2 44 7.9
90785 PSYTX COMPLEX INTERACTIVE 0.3 10 3.3
99202 OFFICE/OUTPATIENT VISIT NEW 0.9 1 0.9
99213 OFFICE/OUTPATIENT VISIT EST 1.0 1 1.0
H0031 MH HEALTH ASSESS BY NON-MD 0.6 1 0.6
Total 1278 268.4
H0038 SELF-HELP/PEER SVC PER 15MIN 0.0 452 0.0
98967 HC PRO PHONE CALL 11-20 MIN 0.5 1 0.5
Total 453 0.5
Screenshot: http://i.stack.imgur.com/dw44F.jpg
