Vertically combine one column to another and fill other columns values in Pandas - python-3.x

If I have an abnormal column 2019/2/1 in the following dataframe:
date,type,ratio,2019/2/1
2019/1/1,food,0.4,0.3
2019/1/1,vegetables,0.2,0.6
2019/1/1,toy,0.1,0.5
How could I vertically append 2019/2/1 to ratio?
The expected result looks like this:
date type ratio
0 2019/1/1 food 0.4
1 2019/1/1 vegetables 0.2
2 2019/1/1 toy 0.1
3 2019/2/1 food 0.3
4 2019/2/1 vegetables 0.6
5 2019/2/1 toy 0.5

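The first idea is to rename the column ratio before melt (drop takes columns= as a keyword, since the positional axis argument was removed in pandas 2.0):
df1 = (df.rename(columns={'ratio':'2019/1/1'})
         .drop(columns='date')
         .melt('type', value_name='ratio', var_name='date'))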
print (df1)
type date ratio
0 food 2019/1/1 0.4
1 vegetables 2019/1/1 0.2
2 toy 2019/1/1 0.1
3 food 2019/2/1 0.3
4 vegetables 2019/2/1 0.6
5 toy 2019/2/1 0.5
Another is to melt first, then move the datetimes parsed from the column names into the date column:
df['date'] = pd.to_datetime(df['date'])
df2 = df.melt(['date','type'],value_name='ratio')
df2['date'] = pd.to_datetime(df2.pop('variable'), errors='coerce').fillna(df2['date'])
print (df2)
date type ratio
0 2019-01-01 food 0.4
1 2019-01-01 vegetables 0.2
2 2019-01-01 toy 0.1
3 2019-02-01 food 0.3
4 2019-02-01 vegetables 0.6
5 2019-02-01 toy 0.5
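For reference, a self-contained sketch of the rename-then-melt idea. It rebuilds the example frame from the question and assumes every row shares the same date, so the ratio column can simply be renamed to that date. The name long_df is illustrative only.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "date": ["2019/1/1"] * 3,
    "type": ["food", "vegetables", "toy"],
    "ratio": [0.4, 0.2, 0.1],
    "2019/2/1": [0.3, 0.6, 0.5],
})

# Assumption: all rows share one date, so 'ratio' can be renamed to it.
# After the rename every value column is named after its date, and melt
# turns those column names into the new 'date' column.
long_df = (
    df.rename(columns={"ratio": df["date"].iat[0]})
      .drop(columns="date")
      .melt(id_vars="type", var_name="date", value_name="ratio")
      .loc[:, ["date", "type", "ratio"]]
)
print(long_df)
```

If the rows carried different dates, the second approach (melting with both date and type as id columns) would be the safer choice.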

Related

How to find the first occurrence of sum of closest value to the given value

I have an array like the one below, with columns ['item','Space','rem_spc']:
array([['Pineapple', 0.5, 0.5],
       ['Mango', 0.75, 0.25],
       ['Apple', 0.375, 0.625],
       ['Melons', 0.25, 0.75],
       ['Grape', 0.125, 0.875]], dtype=object)
I need to convert this array to a dataframe with a new column ['nxt_item'], which should be generated for the first row only (here, Pineapple) under the following condition:
find the first set of items whose array['Space'] values sum to array['rem_spc'] for Pineapple, or come closest to it.
Expected Output:
item Space rem_spc nxt_item
Pineapple 0.5 0.5 {Apple, Grape} #0.5 = 0.375 + 0.125
Mango 0.75 0.25
Apple 0.375 0.625
Melons 0.25 0.75
Grape 0.125 0.875
Thanks!
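A possible solution (another would be to use binary linear programming):
from itertools import product
import numpy as np

n = len(df) - 1
# All 0/1 masks over the n rows after the first; 1 means the item is included
a = np.array(list(product([0, 1], repeat=n)))
# Total Space of every candidate subset
totals = np.sum(a * df['Space'][1:].values, axis=1)
# First mask whose total is closest to rem_spc of the first row
best = a[np.argmin(np.abs(df.iloc[0, 2] - totals))].astype(bool)
df['nxt_item'] = np.nan
df['nxt_item'] = df['nxt_item'].astype(object)  # so the column can hold a string
df.loc[0, 'nxt_item'] = '{' + ', '.join(df.loc[[False] + best.tolist(), 'item']) + '}'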
Output:
item Space rem_spc nxt_item
0 Pineapple 0.5 0.5 {Apple, Grape}
1 Mango 0.75 0.25 NaN
2 Apple 0.375 0.625 NaN
3 Melons 0.25 0.75 NaN
4 Grape 0.125 0.875 NaN
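As a rough alternative sketch, the same search can be written with itertools.combinations, which avoids building the full 0/1 matrix. The helper closest_subset is a hypothetical name, and "first" is interpreted here as "smallest subsets first, in row order", which is an assumption and differs from the enumeration order used above. The search is still exponential in the number of rows, so it only suits small inputs.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame(
    [["Pineapple", 0.5, 0.5], ["Mango", 0.75, 0.25], ["Apple", 0.375, 0.625],
     ["Melons", 0.25, 0.75], ["Grape", 0.125, 0.875]],
    columns=["item", "Space", "rem_spc"],
)


def closest_subset(target, candidates):
    """Return names of the first subset of (name, size) pairs closest to target.

    Hypothetical helper: subsets are tried from smallest to largest, and a
    later subset only replaces the current best if it is strictly closer.
    """
    best, best_gap = (), abs(target)
    for r in range(1, len(candidates) + 1):
        for combo in combinations(candidates, r):
            gap = abs(target - sum(size for _, size in combo))
            if gap < best_gap:
                best, best_gap = combo, gap
    return [name for name, _ in best]


target = df.loc[0, "rem_spc"]
others = list(zip(df["item"][1:], df["Space"][1:]))
picked = closest_subset(target, others)

# Only the first row gets a value; the remaining rows stay empty.
df["nxt_item"] = ["{" + ", ".join(picked) + "}"] + [None] * (len(df) - 1)
print(df)
```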

Dataframe column value based on aggregation of several columns

Say I have a pandas dataframe as below:
A B C
1 4 0.1
2 3 0.5
4 1 0.7
5 2 0.2
7 5 0.6
I want to loop through the rows in the dataframe, and for each row perform an aggregation on columns A and B as:
Agg = row[A] / (row[A] + row[B])
A B C Agg
1 4 0.1 0.2
2 3 0.5 0.4
4 1 0.7 0.8
5 2 0.2 0.7
7 5 0.6 0.6
For all values of Agg > 0.6, get their corresponding column C values into a list, i.e. 0.7 and 0.2 in this case.
Last step is to get the minimum of the list i.e. min(list) = 0.2 in this instance.
We could use vectorized operations: add for addition, rdiv for division (for A/(A+B)), gt for greater than comparison and loc for the filtering:
out = df.loc[df['A'].add(df['B']).rdiv(df['A']).gt(0.6), 'C'].min()
We could also derive the same result using query much more concisely:
out = df.query('A/(A+B)>0.6')['C'].min()
Output:
0.2
Instead of iterating, you can try creating an aggregate function and applying it across all rows.
def aggregate(row):
    return row["A"] / (row["A"] + row["B"])

df["Agg"] = round(df.apply(aggregate, axis=1), 1)
df[df["Agg"] > 0.6]["C"].min()
Output:
0.2
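A self-contained sketch of the steps described in the question (compute A/(A+B), keep the C values where the ratio exceeds 0.6, take their minimum), using plain column arithmetic. The names agg, selected and out are illustrative. Note that the ratio is compared unrounded here, which gives the same rows for this data.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 4, 5, 7],
    "B": [4, 3, 1, 2, 5],
    "C": [0.1, 0.5, 0.7, 0.2, 0.6],
})

# Row-wise ratio A / (A + B), computed for the whole column at once.
agg = df["A"] / (df["A"] + df["B"])

# C values of the rows whose ratio exceeds the threshold.
selected = df.loc[agg > 0.6, "C"].tolist()

# Minimum of the selected values.
out = min(selected)
print(selected, out)
```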

Convert different units of a column in pandas

I'm working on a Kaggle project. Below is my CSV file column:
total_sqft
1056
1112
34.46Sq. Meter
4125Perch
1015 - 1540
34.46
10Sq. Yards
10Acres
10Guntha
10Grounds
The column is of type object. First I want to convert all the values to float then update the string 1015 - 1540 with its average value and finally convert the units to square feet. I've tried different StackOverflow solutions but none of them seems to work. Any help would be appreciated.
Expected Output:
total_sqft
1056.00
1112.00
370.927
1123031.25
1277.5
34.46
90.00
435600
10890
24003.5
1 square meter = 10.764 square feet
1 perch = 272.25 square feet
1 square yard = 9 square feet
1 acre = 43560 square feet
1 guntha = 1089 square feet
1 ground = 2400.35 square feet
First extract numeric values by Series.str.extractall, convert to floats and get averages:
df['avg'] = (df['total_sqft'].str.extractall(r'(\d+\.*\d*)')
                             .astype(float)
                             .groupby(level=0)
                             .mean())
print (df)
total_sqft avg
0 1056 1056.00
1 1112 1112.00
2 34.46Sq. Meter 34.46
3 4125Perch 4125.00
4 1015 - 1540 1277.50
5 34.46 34.46
Converting to square feet then needs the conversion factors for each unit.
EDIT: Create a dictionary of units and their factors, extract the unit from each value, map it to its factor, and finally multiply the two columns:
d = {'Sq. Meter': 10.764, 'Perch': 272.25, 'Sq. Yards': 9,
     'Acres': 43560, 'Guntha': 1089, 'Grounds': 2400.35}
df['avg'] = (df['total_sqft'].str.extractall(r'(\d+\.*\d*)')
                             .astype(float)
                             .groupby(level=0)
                             .mean())
# values without a recognised unit are treated as square feet (factor 1)
df['unit'] = df['total_sqft'].str.extract(f'({"|".join(d)})', expand=False)
df['map'] = df['unit'].map(d).fillna(1)
df['total_sqft'] = df['avg'].mul(df['map'])
print (df)
total_sqft avg unit map
0 1.056000e+03 1056.00 NaN 1.000
1 1.112000e+03 1112.00 NaN 1.000
2 3.709274e+02 34.46 Sq. Meter 10.764
3 1.123031e+06 4125.00 Perch 272.250
4 1.277500e+03 1277.50 NaN 1.000
5 3.446000e+01 34.46 NaN 1.000
6 9.000000e+01 10.00 Sq. Yards 9.000
7 4.356000e+05 10.00 Acres 43560.000
8 1.089000e+04 10.00 Guntha 1089.000
9 2.400350e+04 10.00 Grounds 2400.350
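The same logic can also be sketched as a single plain-Python function applied with Series.map, which may be easier to read and test than the chained string methods. The names TO_SQFT, NUMBER, UNIT and to_sqft are hypothetical. The unit names are passed through re.escape so that the dot in "Sq. Meter" is matched literally, and values without a recognised unit are assumed to be square feet already.

```python
import re

import pandas as pd

# Conversion factors to square feet, taken from the question.
TO_SQFT = {"Sq. Meter": 10.764, "Perch": 272.25, "Sq. Yards": 9,
           "Acres": 43560, "Guntha": 1089, "Grounds": 2400.35}

NUMBER = re.compile(r"\d+(?:\.\d+)?")
UNIT = re.compile("|".join(re.escape(unit) for unit in TO_SQFT))


def to_sqft(text):
    """Average the numbers found in text and scale by the unit factor, if any."""
    numbers = [float(n) for n in NUMBER.findall(text)]
    if not numbers:
        return float("nan")
    average = sum(numbers) / len(numbers)
    unit = UNIT.search(text)
    # Assumption: no recognised unit means the value is already in square feet.
    return average * (TO_SQFT[unit.group()] if unit else 1)


s = pd.Series(["1056", "1112", "34.46Sq. Meter", "4125Perch", "1015 - 1540",
               "34.46", "10Sq. Yards", "10Acres", "10Guntha", "10Grounds"],
              name="total_sqft")
converted = s.map(to_sqft)
print(converted)
```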

Creating a new column into a dataframe based on conditions

For the dataframe df :
dummy_data1 = {'category': ['White', 'Black', 'Hispanic','White'],
'Pop':['75','85','90','100'],'White_ratio':[0.6,0.4,0.7,0.35],'Black_ratio':[0.3,0.2,0.1,0.45], 'Hispanic_ratio':[0.1,0.4,0.2,0.20] }
df = pd.DataFrame(dummy_data1, columns = ['category', 'Pop','White_ratio', 'Black_ratio', 'Hispanic_ratio'])
I want to add a new column to this data frame, 'pop_n', by first checking the category and then multiplying the value in 'Pop' by the corresponding ratio value in the columns. For the first row, the category is 'White', so it should multiply 75 by 0.60 and put 45 in the pop_n column.
I thought about writing something like:
df['pop_n']= (df['Pop']*df['White_ratio']).where(df['category']=='W')
This works, but only for one category.
I would appreciate any help with this.
Thanks.
Using DataFrame.filter and DataFrame.lookup (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this requires an older pandas):
First we use filter to get the columns with ratio in the name, then split each name and keep only the word before the underscore.
Finally we use lookup to match the category values to these columns.
# Pop holds strings in the example data, so convert it to integers first
df['Pop'] = df['Pop'].astype(int)
df2 = df.filter(like='ratio').rename(columns=lambda x: x.split('_')[0])
df['pop_n'] = df2.lookup(df.index, df['category']) * df['Pop']
category Pop White_ratio Black_ratio Hispanic_ratio pop_n
0 White 75 0.60 0.30 0.1 45.0
1 Black 85 0.40 0.20 0.4 17.0
2 Hispanic 90 0.70 0.10 0.2 18.0
3 White 100 0.35 0.45 0.2 35.0
Locate the columns that have underscores in their names:
to_rename = {x: x.split("_")[0] for x in df if "_" in x}
Find the matching factors:
stack = df.rename(columns=to_rename)\
          .set_index('category').stack()
# keep the entries whose column name equals the row's category
factors = stack[[idx[0] == idx[1] for idx in stack.index]]\
          .reset_index(drop=True)
Multiply the original data by the factors:
df['pop_n'] = df['Pop'].astype(int) * factors
# category Pop White_ratio Black_ratio Hispanic_ratio pop_n
#0 White 75 0.60 0.30 0.1 45
#1 Black 85 0.40 0.20 0.4 17
#2 Hispanic 90 0.70 0.10 0.2 18
#3 White 100 0.35 0.45 0.2 35
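On pandas versions where DataFrame.lookup is no longer available, one possible replacement is to pick one value per row with NumPy integer indexing. This is a sketch under the assumption that every category has a matching "<category>_ratio" column; the names ratios, col_pos and picked are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": ["White", "Black", "Hispanic", "White"],
    "Pop": ["75", "85", "90", "100"],
    "White_ratio": [0.6, 0.4, 0.7, 0.35],
    "Black_ratio": [0.3, 0.2, 0.1, 0.45],
    "Hispanic_ratio": [0.1, 0.4, 0.2, 0.20],
})

# Ratio columns renamed to the bare category names: White, Black, Hispanic.
ratios = df.filter(like="ratio").rename(columns=lambda c: c.split("_")[0])

# Column position of each row's category. A category without a matching
# column would give -1 here, so real data should be checked for that first.
col_pos = ratios.columns.get_indexer(df["category"])

# One ratio per row: element (row i, col_pos[i]) of the ratio matrix.
picked = ratios.to_numpy()[np.arange(len(df)), col_pos]

# Pop is stored as strings in the example, so convert before multiplying.
df["pop_n"] = df["Pop"].astype(int) * picked
print(df)
```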

Python Life Expectancy

I am trying to use pandas to calculate life expectancy with complex equations.
Multiplying or dividing one column by another is not difficult to do.
My data is
A b
1 0.99 1000
2 0.95 =0.99*1000=990
3 0.93 = 0.95*990
Field A is populated and field b has only the first value, 1000.
Field b (b2) = A1*b1
I tried the shift function but got a result for b2 only, with zeros for the rest. Any help would be appreciated. Thanks, Mazin
IIUC, if you're starting with:
>>> df
A b
0 0.99 1000.0
1 0.95 NaN
2 0.93 NaN
Then you can do:
df.loc[df.b.isnull(),'b'] = (df.A.cumprod()*1000).shift()
>>> df
A b
0 0.99 1000.0
1 0.95 990.0
2 0.93 940.5
Or more generally:
df['b'] = (df.A.cumprod()*df.b.iloc[0]).shift().fillna(df.b.iloc[0])
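To see why the cumulative product works, here is a small self-contained check that compares the vectorized form with an explicit loop over the recurrence b[i] = A[i-1] * b[i-1]. The names start and expected are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0.99, 0.95, 0.93], "b": [1000.0, np.nan, np.nan]})
start = df["b"].iloc[0]

# Vectorized: b[i] = start * A[0] * ... * A[i-1], so shift the running
# product down by one row and put the starting value back in the first row.
df["b"] = (df["A"].cumprod() * start).shift().fillna(start)

# Explicit loop over the same recurrence, for comparison.
expected = [start]
for a in df["A"].iloc[:-1]:
    expected.append(expected[-1] * a)

print(df)
print(expected)
```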
