Boolean indexing in pandas dataframes - python-3.x

I'm trying to apply boolean indexing to a pandas DataFrame.
nm - stores the names of players
ag - stores the player ages
sc - stores the scores
capt - stores boolean index values
import pandas as pd
nm=pd.Series(['p1','p2', 'p3', 'p4'])
ag=pd.Series([12,17,14, 19])
sc=pd.Series([120, 130, 150, 100])
capt=[True, False, True, True]
Cricket=pd.DataFrame({"Name":nm,"Age":ag ,"Score":sc}, index=capt)
print(Cricket)
Output:
Name Age Score
True NaN NaN NaN
False NaN NaN NaN
True NaN NaN NaN
True NaN NaN NaN
Whenever I run the code above, I get a DataFrame filled with NaN values. The only case in which this seems to work is when capt doesn't have repeating elements,
i.e. when capt=[False, True] (and reasonable values are given for nm, ag and sc) this code works as expected.
I'm running Python 3.8.5 and pandas 1.1.1. Is this deprecated functionality?
Desired output:
Name Age Score
True p1 12 120
False p2 17 130
True p3 14 150
True p4 19 100

Set the index of each Series to avoid a mismatch between the default RangeIndex of each Series and the new index values from capt:
capt=[True, False, True, True]
nm=pd.Series(['p1','p2', 'p3', 'p4'], index=capt)
ag=pd.Series([12,17,14, 19], index=capt)
sc=pd.Series([120, 130, 150, 100], index=capt)
Cricket=pd.DataFrame({"Name":nm,"Age":ag ,"Score":sc})
print(Cricket)
Name Age Score
True p1 12 120
False p2 17 130
True p3 14 150
True p4 19 100
Detail:
print(pd.Series(['p1','p2', 'p3', 'p4']))
0 p1
1 p2
2 p3
3 p4
dtype: object
print(pd.Series(['p1','p2', 'p3', 'p4'], index=capt))
True p1
False p2
True p3
True p4
dtype: object
Boolean indexing, on the other hand, is filtering:
capt=[True, False, True, True]
nm=pd.Series(['p1','p2', 'p3', 'p4'])
ag=pd.Series([12,17,14, 19])
sc=pd.Series([120, 130, 150, 100])
Cricket=pd.DataFrame({"Name":nm,"Age":ag ,"Score":sc})
print(Cricket)
Name Age Score
0 p1 12 120
1 p2 17 130
2 p3 14 150
3 p4 19 100
print(Cricket[capt])
Name Age Score
0 p1 12 120
2 p3 14 150
3 p4 19 100

Related

Slicing xarray dataset with coordinate dependent variable

I built an xarray dataset in python3 with coordinates (time, levels) to identify all cloud bases and cloud tops during one day of observations. The variable levels is the dimension for the cloud base/tops that can be identified at a given time. It stores cloud base/top heights values for each time.
Now I want to select all the cloud bases and tops that are located within a given range of heights that change in time. The height range is identified by the arrays bottom_mod and top_mod. These arrays have a time dimension and contain the edges of the range of heights to be selected.
The xarray dataset is cloudStandard_mod_reshaped:
Dimensions: (levels: 8, time: 9600)
Coordinates:
* levels (levels) int64 0 1 2 3 4 5 6 7
* time (time) datetime64[ns] 2013-04-14 ... 2013-04-14T23:59:51
Data variables:
cloudTop (time, levels) float64 nan nan nan nan nan ... nan nan nan nan
cloudThick (time, levels) float64 nan nan nan nan nan ... nan nan nan nan
cloudBase (time, levels) float64 nan nan nan nan nan ... nan nan nan nan
I tried to select the heights in the range identified by top and bottom array as follows:
PBLclouds = cloudStandard_mod_reshaped.sel(levels=slice(bottom_mod[:], top_mod[:]))
but this instruction accepts only scalar values for the slice command.
Do you know how to slice with values that are coordinate-dependent?
You can use the .where() method.
The line providing the solution is in step 2 below.
1. First, create some data like yours:
The dataset:
import numpy as np
import xarray as xr

nlevels, ntime = 8, 50
ds = xr.Dataset(
    coords=dict(levels=np.arange(nlevels), time=np.arange(ntime)),
    data_vars=dict(
        cloudTop=(("levels", "time"), np.random.randn(nlevels, ntime)),
        cloudThick=(("levels", "time"), np.random.randn(nlevels, ntime)),
        cloudBase=(("levels", "time"), np.random.randn(nlevels, ntime)),
    ),
)
output of print(ds):
<xarray.Dataset>
Dimensions: (levels: 8, time: 50)
Coordinates:
* levels (levels) int64 0 1 2 3 4 5 6 7
* time (time) int64 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
Data variables:
cloudTop (levels, time) float64 0.08375 0.04721 0.9379 ... 0.04877 2.339
cloudThick (levels, time) float64 -0.6441 -0.8338 -1.586 ... -1.026 -0.5652
cloudBase (levels, time) float64 -0.05004 -0.1729 0.7154 ... 0.06507 1.601
For the top and bottom levels, I'll make the bottom level random and just add an offset to construct the top level.
offset = 3
bot_mod = xr.DataArray(
    dims=("time"),
    coords=dict(time=np.arange(ntime)),
    data=np.random.randint(0, nlevels - offset, ntime),
    name="bot_mod",
)
top_mod = (bot_mod + offset).rename("top_mod")
output of print(bot_mod):
<xarray.DataArray 'bot_mod' (time: 50)>
array([0, 1, 2, 2, 3, 1, 2, 1, 0, 2, 1, 3, 2, 0, 2, 4, 3, 3, 2, 1, 2, 0,
2, 2, 0, 1, 1, 4, 1, 3, 0, 4, 0, 4, 4, 0, 4, 4, 1, 0, 3, 4, 4, 3,
3, 0, 1, 2, 4, 0])
2. Then, select the range of levels where clouds are:
Use the .where() method to select the dataset variables that are between the bottom level and the top level:
ds_clouds = ds.where((ds.levels > bot_mod) & (ds.levels < top_mod))
output of print(ds_clouds):
<xarray.Dataset>
Dimensions: (levels: 8, time: 50)
Coordinates:
* levels (levels) int64 0 1 2 3 4 5 6 7
* time (time) int64 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49
Data variables:
cloudTop (levels, time) float64 nan nan nan nan nan ... nan nan nan nan
cloudThick (levels, time) float64 nan nan nan nan nan ... nan nan nan nan
cloudBase (levels, time) float64 nan nan nan nan nan ... nan nan nan nan
It puts NaN where the condition is not satisfied; you can use the .dropna() method to get rid of those, as sketched below.
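As a rough sketch (the dim and how arguments here are assumptions about which labels you want to drop), removing the time steps where every level is NaN could look like:
# drop time steps where all levels are NaN (use dim="levels" to drop empty levels instead)
ds_clouds_only = ds_clouds.dropna(dim="time", how="all")
print(ds_clouds_only)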
3. Check for success:
Plot cloudBase variable of the dataset before and after processing:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(ncols=2)
ds.cloudBase.plot.imshow(ax=axes[0])
ds_clouds.cloudBase.plot.imshow(ax=axes[1])
plt.show()
I'm not yet allowed to embed images, so that's a link:
Original data vs. selected data

Binning with pd.cut beyond range (replacing NaN with "<min_val" or ">max_val")

df= pd.DataFrame({'days': [0,31,45,35,19,70,80 ]})
df['range'] = pd.cut(df.days, [0,30,60])
df
The code is reproduced above, where pd.cut is used to convert a numerical column to a categorical column. pd.cut assigns categories according to the list of bin edges passed, [0,30,60]. Here rows 0, 5 and 6 are categorized as NaN because their values fall outside [0,30,60]. What I want is for 0 to be categorized as <0, and for 70 and 80 to be categorized as >60. If possible, I'd also like dynamic text labels (A, B, C, D, E, ...) depending on the number of categories created.
For the first part, adding -np.inf and np.inf to the bins will ensure that everything gets a bin:
In [5]: df= pd.DataFrame({'days': [0,31,45,35,19,70,80]})
...: df['range'] = pd.cut(df.days, [-np.inf, 0, 30, 60, np.inf])
...: df
...:
Out[5]:
days range
0 0 (-inf, 0.0]
1 31 (30.0, 60.0]
2 45 (30.0, 60.0]
3 35 (30.0, 60.0]
4 19 (0.0, 30.0]
5 70 (60.0, inf]
6 80 (60.0, inf]
For the second, you can use .cat.codes to get the bin index and do some tweaking from there:
In [8]: df['range'].cat.codes.apply(lambda x: chr(x + ord('A')))
Out[8]:
0 A
1 C
2 C
3 C
4 B
5 D
6 D
dtype: object
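If you'd rather get the letter labels straight out of pd.cut, it also accepts a labels= argument; here is a sketch that generates one letter per interval from the number of bin edges (assuming at most 26 bins):
import string
import numpy as np
import pandas as pd

df = pd.DataFrame({'days': [0, 31, 45, 35, 19, 70, 80]})
bins = [-np.inf, 0, 30, 60, np.inf]

# one letter per interval, generated dynamically from the number of bin edges
labels = list(string.ascii_uppercase[:len(bins) - 1])
df['range'] = pd.cut(df.days, bins, labels=labels)
print(df)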

How do I conditionally aggregate values in projection part of pandas query?

I currently have a csv file with this content:
ID PRODUCT_ID NAME STOCK SELL_COUNT DELIVERED_BY
1 P1 PRODUCT_P1 12 15 UPS
2 P2 PRODUCT_P2 4 3 DHL
3 P3 PRODUCT_P3 120 22 DHL
4 P1 PRODUCT_P1 423 18 UPS
5 P2 PRODUCT_P2 0 5 GLS
6 P3 PRODUCT_P3 53 10 DHL
7 P4 PRODUCT_P4 22 0 UPS
8 P1 PRODUCT_P1 94 56 GLS
9 P1 PRODUCT_P1 9 24 GLS
When I execute this SQL query:
SELECT
PRODUCT_ID,
MIN(CASE WHEN DELIVERED_BY = 'UPS' THEN STOCK END) as STOCK,
SUM(CASE WHEN ID > 6 THEN SELL_COUNT END) as TOTAL_SELL_COUNT,
SUM(CASE WHEN SELL_COUNT * 100 > 1000 THEN SELL_COUNT END) as COND_SELL_COUNT
FROM products
GROUP BY PRODUCT_ID;
I get the desired result:
PRODUCT_ID STOCK TOTAL_SELL_COUNT COND_SELL_COUNT
P1 12 80 113
P2 null null null
P3 null null 22
P4 22 0 null
Now I'm trying to somehow get the same result on that dataset using pandas, and that's what I'm struggling with.
I imported the csv file into a DataFrame called df_products.
Then I tried this:
def custom_aggregate(grouped):
    data = {
        'STOCK': np.where(grouped['DELIVERED_BY'] == 'UPS', grouped['STOCK'].min(), np.nan)  # [grouped['STOCK'].min() if grouped['DELIVERED_BY'] == 'UPS' else None]
    }
    d_series = pd.Series(data)
    return d_series
result = df_products.groupby('PRODUCT_ID').apply(custom_aggregate)
print(result)
As you can see, I'm nowhere near the expected result, as I'm already having problems getting the conditional STOCK aggregation to work depending on the DELIVERED_BY values.
This outputs:
STOCK
PRODUCT_ID
P1 [9.0, 9.0, nan, nan]
P2 [nan, nan]
P3 [nan, nan]
P4 [22.0]
which is not even in the correct format, but I'd be happy if I could get the expected 12.0 instead of 9.0 for P1.
Thanks
I just wanted to add that I got near the result by creating additional columns:
df_products['COND_STOCK'] = df_products[df_products['DELIVERED_BY'] == 'UPS']['STOCK']
df_products['SELL_COUNT_ID_GT6'] = df_products[df_products['ID'] > 6]['SELL_COUNT']
df_products['SELL_COUNT_GT1000'] = df_products[(df_products['SELL_COUNT'] * 100) > 1000]['SELL_COUNT']
The function would then look like this:
def custom_aggregate(grouped):
    data = {
        'STOCK': grouped['COND_STOCK'].min(),
        'TOTAL_SELL_COUNT': grouped['SELL_COUNT_ID_GT6'].sum(),
        'COND_SELL_COUNT': grouped['SELL_COUNT_GT1000'].sum(),
    }
    d_series = pd.Series(data)
    return d_series
result = df_products.groupby('PRODUCT_ID').apply(custom_aggregate)
This is the 'almost' desired result:
STOCK TOTAL_SELL_COUNT COND_SELL_COUNT
PRODUCT_ID
P1 12.0 80.0 113.0
P2 NaN 0.0 0.0
P3 NaN 0.0 22.0
P4 22.0 0.0 0.0
Usually we can write this in pandas as below:
df.groupby('PRODUCT_ID').apply(lambda x: pd.Series({
    'STOCK': x.loc[x.DELIVERED_BY == 'UPS', 'STOCK'].min(),
    'TOTAL_SELL_COUNT': x.loc[x.ID > 6, 'SELL_COUNT'].sum(min_count=1),
    'COND_SELL_COUNT': x.loc[x.SELL_COUNT > 10, 'SELL_COUNT'].sum(min_count=1)}))
Out[105]:
STOCK TOTAL_SELL_COUNT COND_SELL_COUNT
PRODUCT_ID
P1 12.0 80.0 113.0
P2 NaN NaN NaN
P3 NaN NaN 22.0
P4 22.0 0.0 NaN
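A more vectorized variant along the same lines (a sketch; the helper column names are made up, and named aggregation needs pandas >= 0.25) uses Series.where to blank out the rows that don't meet each condition before grouping:
out = (df_products
       .assign(STOCK_UPS=df_products['STOCK'].where(df_products['DELIVERED_BY'] == 'UPS'),
               SELL_GT6=df_products['SELL_COUNT'].where(df_products['ID'] > 6),
               SELL_GT1000=df_products['SELL_COUNT'].where(df_products['SELL_COUNT'] * 100 > 1000))
       .groupby('PRODUCT_ID')
       .agg(STOCK=('STOCK_UPS', 'min'),
            TOTAL_SELL_COUNT=('SELL_GT6', lambda s: s.sum(min_count=1)),
            COND_SELL_COUNT=('SELL_GT1000', lambda s: s.sum(min_count=1))))
print(out)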

Python: Extract dimension data from dataframe string column and create columns with values for each of them

Hi,
I have a source file with 2 columns: ID and all_dimensions. all_dimensions is a string of different key-value pairs, which are not the same for each ID.
I want to make the keys column headers and parse the respective value, if present, into the right cell.
Example:
ID all_dimensions
12 Height:2 cm,Volume: 4cl,Weight:100g
34 Length: 10cm, Height: 5 cm
56 Depth: 80cm
78 Weight: 2 kg, Length: 7 cm
90 Diameter: 4 cm, Volume: 50 cl
Desired result:
ID Height Volume Weight Length Depth Diameter
12 2 cm 4cl 100g - - -
34 5 cm - - 10cm - -
56 - - - - 80cm -
78 - - 2 kg 7 cm - -
90 - 50 cl - - - 4 cm
I have over 100 dimensions, so ideally I would like to write a for loop or something similar so that I don't have to specify each column header (see code examples below).
I am using Python 3.7.3 and pandas 0.24.2.
What have I tried already:
1) I have tried to split the data into separate columns but wasn't sure how to proceed to get each value assigned to the right header:
df.set_index('ID',inplace=True)
newdf = df["all_dimensions"].str.split(",|:",expand = True)
2) Using the initial df, I used "str.extract" to create new columns (but then I would need to specify each header):
df['Volume']=df.all_dimensions.str.extract(r'Volume:([\w\s.]*)').fillna('')
3) To resolve the problem of 2) with each header, I created a list of all dimension attributes and thought to use the list in a for loop to extract the values:
columns_list=df.all_dimensions.str.extract(r'^([\D]*):',expand=True).drop_duplicates()
columns_list=columns_list[0].str.strip().values.tolist()
for dimension in columns_list:
    df.dimension=df.all_dimensions.str.extract(r'dimension([\w\s.]*)').fillna('')
Here, JupyterNB gives me a UserWarning: "Pandas doesn't allow columns to be created via a new attribute name" and the df looks the same as before.
Option 1: I prefer splitting several times:
new_series = (df.set_index('ID')
                .all_dimensions
                .str.split(',', expand=True)
                .stack()
                .reset_index(level=-1, drop=True)
              )

# split a second time for the individual measurements
new_df = (new_series.str
                    .split(':', expand=True)
                    .reset_index()
          )

# strip off leading/trailing spaces
new_df[0] = new_df[0].str.strip()
new_df[1] = new_df[1].str.strip()

# unstack to get the desired table:
new_df.set_index(['ID', 0])[1].unstack()
Option 2: Use split(',|:') as you tried:
# splitting
new_series = (df.set_index('ID')
                .all_dimensions
                .str.split(',|:', expand=True)
                .stack()
                .reset_index(level=-1, drop=True)
              )

# concat along axis=1 to get a dataframe with two columns
# new_df.columns = ('ID', 0, 1) where 0 is the measurement name
new_df = (pd.concat((new_series[::2].str.strip(),
                     new_series[1::2]), axis=1)
            .reset_index())

new_df.set_index(['ID', 0])[1].unstack()
Output:
Depth Diameter Height Length Volume Weight
ID
12 NaN NaN 2 cm NaN 4cl 100g
34 NaN NaN 5 cm 10cm NaN NaN
56 80cm NaN NaN NaN NaN NaN
78 NaN NaN NaN 7 cm NaN 2 kg
90 NaN 4 cm NaN NaN 50 cl NaN
This is a hard question: your string needs to be split, and each item after the split needs to be converted to a dict; then we can use the DataFrame constructor to rebuild those columns.
from collections import ChainMap

# split each row into a list of single-key dicts
d = [[{y.split(':')[0]: y.split(':')[1]} for y in x.split(',')] for x in df.all_dimensions]
# merge the dicts for each row and rebuild the columns
data = list(map(lambda x: dict(ChainMap(*x)), d))
s = pd.DataFrame(data)
df = pd.concat([df, s.groupby(s.columns.str.strip(), axis=1).first()], axis=1)
df
Out[26]:
ID all_dimensions Depth ... Length Volume Weight
0 12 Height:2 cm,Volume: 4cl,Weight:100g NaN ... NaN 4cl 100g
1 34 Length: 10cm, Height: 5 cm NaN ... 10cm NaN NaN
2 56 Depth: 80cm 80cm ... NaN NaN NaN
3 78 Weight: 2 kg, Length: 7 cm NaN ... 7 cm NaN 2 kg
4 90 Diameter: 4 cm, Volume: 50 cl NaN ... NaN 50 cl NaN
[5 rows x 8 columns]
Check the columns
df['Height']
Out[28]:
0 2 cm
1 5 cm
2 NaN
3 NaN
4 NaN
Name: Height, dtype: object
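Another rough sketch (not from either answer): str.extractall can pull the key:value pairs out with a single regex and a pivot. This assumes keys and values never themselves contain ':' or ',':
# one row per key:value pair, with the ID kept in the index
pairs = (df.set_index('ID')['all_dimensions']
           .str.extractall(r'(?P<key>[^:,]+):(?P<value>[^,]+)'))

# clean up stray whitespace around keys and values
pairs['key'] = pairs['key'].str.strip()
pairs['value'] = pairs['value'].str.strip()

# pivot back to one row per ID with one column per dimension
wide = pairs.reset_index().pivot(index='ID', columns='key', values='value')
print(wide)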

How to use the condition with other rows (previous moments in time series data), in pandas, python3

I have a pandas DataFrame df. It is time series data, with 1000 rows and 3 columns. What I want is given in the pseudo-code below.
for each row:
    if the value in column 'colA' at [this_row-1] is higher than
       the value in column 'colB' at [this_row-2] by more than 3%:
        set the value in 'colCheck' at [this_row] to True.
Finally, pick out all the rows in the df where 'colCheck' is True.
I will use the following example to further demonstrate my purpose.
df =
'colA', 'colB', 'colCheck'
Dates
2017-01-01, 20, 30, NAN
2017-01-02, 10, 40, NAN
2017-01-03, 50, 20, False
2017-01-04, 40, 10, True
First, when this_row = 2 (the 3rd row, where the date is 2017-01-03), the value in colA at [this_row-1] is 10, the value in colB at [this_row-2] is 30. So (10-30)/30 = -67% < 3%, so the value in colCheck at [this_row] is False.
Likewise, when this_row = 3, (50-40)/40 = 25% > 3%, so the value in colCheck at [this_row] is True.
Last but not least, the first two rows in colCheck should be NAN, since the calculation needs to access [this_row-2] in colB. But the first two rows do not have [this_row-2].
Besides, the criteria of 3% and [row-1] in colA, [row-2] in colB are just examples. In my real project, they are situational, e.g. 4% and [row-3].
I am looking for concise and elegant approach. I am using Python3.
Thanks.
You can rearrange the maths and use pd.Series.shift
df.colA.shift(1).div(df.colB.shift(2)).gt(1.03)
Dates
2017-01-01 False
2017-01-02 False
2017-01-03 False
2017-01-04 True
dtype: bool
Using pd.DataFrame.assign we can create a copy with the new column
df.assign(colCheck=df.colA.shift(1).div(df.colB.shift(2)).gt(1.03))
colA colB colCheck
Dates
2017-01-01 20 30 False
2017-01-02 10 40 False
2017-01-03 50 20 False
2017-01-04 40 10 True
If you insisted on leaving the first two as NaN, you could use iloc
df.assign(colCheck=df.colA.shift(1).div(df.colB.shift(2)).gt(1.03).iloc[2:])
colA colB colCheck
Dates
2017-01-01 20 30 NaN
2017-01-02 10 40 NaN
2017-01-03 50 20 False
2017-01-04 40 10 True
And for maximum clarity:
# This creates a boolean array of when your conditions are met
colCheck = (df.colA.shift(1) / df.colB.shift(2)) > 1.03
# This chops off the first two `False` values and creates a new
# column named `colCheck` and assigns to it the boolean values
# calculated just above.
df.assign(colCheck=colCheck.iloc[2:])
colA colB colCheck
Dates
2017-01-01 20 30 NaN
2017-01-02 10 40 NaN
2017-01-03 50 20 False
2017-01-04 40 10 True
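Since the question notes that the threshold and lags are situational (e.g. 4% and [row-3]), here is a small parametrised sketch built on the same shift idea; the function name and its defaults are just illustrative:
def col_check(df, pct=0.03, lag_a=1, lag_b=2):
    """True where colA lagged by lag_a exceeds colB lagged by lag_b by more than pct."""
    check = df.colA.shift(lag_a).div(df.colB.shift(lag_b)).gt(1 + pct)
    # the first max(lag_a, lag_b) rows have no history, so leave them out (they become NaN)
    return check.iloc[max(lag_a, lag_b):]

df = df.assign(colCheck=col_check(df))                 # 3%, colA[row-1] vs colB[row-2]
# df = df.assign(colCheck=col_check(df, 0.04, 3, 3))   # e.g. 4% and [row-3]

# finally, pick out the rows where colCheck is True
print(df[df['colCheck'] == True])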
