Assigning a value based on whether a cell is between external tuple values - python-3.x

I have a pandas series of integer values and a dictionary of keys and tuples (2 integers).
The tuples represent a low/high range for each key. I'd like to map the key to each cell of my series based on which tuple's range the series value falls into.
Example:
d = {'a': (1, 5), 'b': (6, 10), 'c': (11, 15)}  # keys and tuples are ordered and never repeated
s = pd.Series([5, 6, 5, 8, 15, 5, 2, 5])  # I can sort the series; values may repeat or be absent
For a short dictionary I believe I could do this manually with a for loop, but I can potentially have a big dictionary with many keys.

Let's try pd.Interval:
lookup = pd.Series(list(d.keys()),
                   index=[pd.Interval(x, y, closed='both') for x, y in d.values()])
lookup.loc[s]
Output:
[1, 5] a
[6, 10] b
[1, 5] a
[6, 10] b
[11, 15] c
[1, 5] a
[1, 5] a
[1, 5] a
dtype: object
reindex also works and is safer in case you have out-of-range data:
lookup.reindex(s)
Output:
5 a
6 b
5 a
8 b
15 c
5 a
2 a
5 a
dtype: object

Another idea using pd.IntervalIndex and Series.map:
m = pd.Series(list(d.keys()),
              index=pd.IntervalIndex.from_tuples(list(d.values()), closed='both'))
s = s.map(m)
Result:
0 a
1 b
2 a
3 b
4 c
5 a
6 a
7 a
dtype: object
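For reference, the Series.map approach end to end as one self-contained snippet:

```python
import pandas as pd

d = {'a': (1, 5), 'b': (6, 10), 'c': (11, 15)}
s = pd.Series([5, 6, 5, 8, 15, 5, 2, 5])

# build a lookup Series keyed by closed intervals, then map each value to its key
m = pd.Series(list(d.keys()),
              index=pd.IntervalIndex.from_tuples(list(d.values()), closed='both'))
result = s.map(m)
```

`closed='both'` makes each interval inclusive at both ends, matching the high/low tuples.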

Related

Edit multiple values with df.at()

Why does
>>> offset = 2
>>> data = {'Value': [7, 9, 21, 22, 23, 100]}
>>> df = pd.DataFrame(data=data)
>>> df.at[:offset, "Value"] = 99
>>> df
Value
0 99
1 99
2 99
3 22
4 23
5 100
change the values at indices [0, 1, 2]? I would expect only [0, 1] to be changed, to conform with regular slicing.
Like when I do
>>> arr = [0, 1, 2, 3, 4]
>>> arr[0:2]
[0, 1]
.at behaves like .loc, in that it selects rows/columns by label. Label slicing in pandas is inclusive. Note that .iloc, which performs slicing on the integer positions, behaves like you would expect. See this good answer for a motivation.
Also note that the pandas documentation suggests using .at only when getting or setting a single value. For slices, use .loc instead.
In the line df.at[:offset, "Value"] = 99, the slice :2 means all rows with labels 0 through 2 inclusive. If you want to change only the third row, change it to 2:2.
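A quick sketch contrasting the two conventions on the question's frame:

```python
import pandas as pd

df = pd.DataFrame({'Value': [7, 9, 21, 22, 23, 100]})

# label-based slicing (.loc, and .at) includes the stop label
inclusive = df.loc[:2, 'Value'].tolist()    # rows 0, 1 AND 2

# position-based slicing (.iloc) follows Python's half-open convention
exclusive = df.iloc[:2]['Value'].tolist()   # rows 0 and 1 only
```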

Pandas: How to aggregate by range inclusion?

I have a dataframe with a "range" column and some value columns:
In [1]: df = pd.DataFrame({
   ...:     "range": [[1,2], [[1,2], [6,11]], [4,5], [[1,3], [5,7], [9, 11]], [9,10], [[5,6], [9,11]]],
   ...:     "A": range(1, 7),
   ...:     "B": range(6, 0, -1)
   ...: })
Out[1]:
range A B
0 [1, 2] 1 6
1 [[1, 2], [6, 11]] 2 5
2 [4, 5] 3 4
3 [[1, 3], [5, 7], [9, 11]] 4 3
4 [9, 10] 5 2
5 [[5, 6], [9, 11]] 6 1
For every row I need to check whether its range is entirely included (with all of its parts) in the range of another row, and if so sum up the other columns (A and B), keeping the longer range. The rows are arbitrarily ordered.
The detailed steps for the example dataframe would be: row 0 is entirely included in rows 1 and 3; rows 1, 2 and 3 are not entirely included in any other row; and row 4 is included in rows 1, 3 and 5, but because row 5 is itself included in row 3, row 4 should only be merged once.
Hence my output dataframe would be:
Out[2]:
range A B
0 [[1, 2], [6, 11]] 8 13
1 [4, 5] 3 4
2 [[1, 3], [5, 7], [9, 11]] 16 12
I thought about sorting the rows first in order to put the longest ranges at the top so it would be easier and more efficient to merge the ranges, but unfortunately I have no idea how to perform this in pandas...
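No answer is given for this one; a possible sketch, under two assumptions of mine: ranges are inclusive integer intervals, and each non-maximal row (one whose coverage is strictly contained in another row's) is folded into every maximal row that contains it, which reproduces the expected output:

```python
import pandas as pd

df = pd.DataFrame({
    "range": [[1,2], [[1,2], [6,11]], [4,5], [[1,3], [5,7], [9,11]], [9,10], [[5,6], [9,11]]],
    "A": range(1, 7),
    "B": range(6, 0, -1),
})

def covered(r):
    # normalise to a list of [lo, hi] pairs, then expand to a set of ints
    parts = [r] if isinstance(r[0], int) else r
    return set().union(*(range(lo, hi + 1) for lo, hi in parts))

sets = df["range"].map(covered)
# a row is "maximal" if its coverage is not strictly contained in any other row's
maximal = [i for i in df.index
           if not any(sets[i] < sets[j] for j in df.index if j != i)]

out = df.loc[maximal].copy()
for i in df.index:
    if i in maximal:
        continue
    # fold every non-maximal row into each maximal row that contains it
    for j in maximal:
        if sets[i] <= sets[j]:
            out.loc[j, ["A", "B"]] += df.loc[i, ["A", "B"]]

out = out.reset_index(drop=True)
```

Expanding ranges to integer sets is fine for small bounds like these; for wide ranges, an interval-containment check would avoid materialising the sets.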

Filter dataframe by minimum number of values in groups

I have the following dataframe structure:
#----------------------------------------------------------#
# Generate dataframe mock example.
# Define categorical column.
grps = pd.DataFrame(['a', 'a', 'a', 'b', 'b', 'b'])
# Generate dataframe 1.
df1 = pd.DataFrame([[3, 4, 6, 8, 10, 4],
                    [5, 7, 2, 8, 9, 6],
                    [5, 3, 4, 8, 4, 6]]).transpose()
# Introduce nan into dataframe 1.
for col in df1.columns:
    df1.loc[df1.sample(frac=0.1).index, col] = np.nan
# Generate dataframe 2.
df2 = pd.DataFrame([[3, 4, 6, 8, 10, 4],
                    [5, 7, 2, 8, 9, 6],
                    [5, 3, 4, 8, 4, 6]]).transpose()
# Concatenate categorical column and dataframes.
df = pd.concat([grps, df1, df2], axis=1)
# Assign column headers.
df.columns = ['Groups', 1, 2, 3, 4, 5, 6]
# Set index as group column.
df = df.set_index('Groups')
# Generate stacked dataframe structure.
test_stack_df = df.stack(dropna=False).reset_index()
# Change column names.
test_stack_df = test_stack_df.rename(columns={'level_1': 'IDs',
                                              0: 'Values'})
#----------------------------------------------------------#
Original dataframe - 'df' before stacking:
Groups 1 2 3 4 5 6
a 3 5 5 3 5 5
a nan nan 3 4 7 3
a 6 2 nan 6 2 4
b 8 8 8 8 8 8
b 10 9 4 10 9 4
b 4 6 6 4 6 6
I would like to filter the columns such that there are minimally 3 valid values in each group - 'a' & 'b'. The final output should be only columns 4, 5, 6.
I am currently using the following method:
# Function to define boolean series.
def filter_vals(test_stack_df, orig_df):
    # Reset index.
    df_idx_reset = orig_df.reset_index()
    # Generate list with size of each 'Group'.
    grp_num = pd.value_counts(df_idx_reset['Groups']).to_list()
    # Data series for each 'Group'.
    expt_class_1 = test_stack_df.head(grp_num[0])
    expt_class_2 = test_stack_df.tail(grp_num[1])
    # Check if both 'Groups' contain at least 3 values per 'ID'.
    valid_IDs = len(expt_class_1['Values'].value_counts()) >= 3 & \
                len(expt_class_2['Values'].value_counts()) >= 3
    # Return True or False.
    return valid_IDs

# Apply function to dataframe to generate boolean series.
bool_series = test_stack_df.groupby('IDs').apply(filter_vals, df)
# Transpose original dataframe.
df_T = df.transpose()
# Filter by boolean series & transpose again.
df_filtered = df_T[bool_series].transpose()
I could achieve this with minimal fuss by applying the pandas.DataFrame.dropna() method with a threshold of 6. However, that won't account for different-sized groups or let me specify the minimum number of values, which the current code does.
For larger dataframes, i.e. 4000+ columns, the code is a little slow, taking ~20 secs to complete the filtering. I have tried alternative methods that access the original dataframe directly using groupby & transform but can't get anything to work.
Is there a simpler and faster method? Thanks for your time!
EDIT: 03/05/2020 (15:58) - just spotted something that wasn't clear in the function above. Still works but have clarified variable names. Sorry for the confusion!
This will do the trick for you:
df.notna().groupby(level='Groups').sum().ge(3).all()
Outputs:
1 False
2 False
3 False
4 True
5 True
6 True
dtype: bool
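A self-contained version of that one-liner, rebuilding the example frame with the NaN pattern shown above and applying the resulting mask to keep the qualifying columns (min_valid is a name I've added so the threshold stays configurable):

```python
import numpy as np
import pandas as pd

idx = pd.Index(['a', 'a', 'a', 'b', 'b', 'b'], name='Groups')
df = pd.DataFrame([[3, 5, 5, 3, 5, 5],
                   [np.nan, np.nan, 3, 4, 7, 3],
                   [6, 2, np.nan, 6, 2, 4],
                   [8, 8, 8, 8, 8, 8],
                   [10, 9, 4, 10, 9, 4],
                   [4, 6, 6, 4, 6, 6]],
                  index=idx, columns=[1, 2, 3, 4, 5, 6])

min_valid = 3
# count non-NaN values per group per column, require every group to meet the threshold
mask = df.notna().groupby(level='Groups').sum().ge(min_valid).all()
filtered = df.loc[:, mask]
```

This stays vectorised, so it avoids the per-ID groupby-apply that made the original approach slow.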

pandas how to derived values for a new column base on another column

I have a dataframe with a column in which each value is a list. Now I want to derive a new column that only considers lists whose size is greater than 1 and assigns a unique integer to the corresponding row as an id.
A sample dataframe looks like:
document_no_list cluster_id
[1,2,3] 1
[4,5,6,7] 2
[8] nan
[9,10] 3
The cluster_id column only considers the 1st, 2nd and 4th rows, each of which has a list of size greater than 1, and assigns a unique integer id to the corresponding cell in the column.
I am wondering how to do that in pandas.
We can use np.random.choice for unique random values, with .loc for assignment, i.e.
df = pd.DataFrame({'document_no_list': [[1,2,3], [4,5,6,7], [8], [9,10]]})
x = df['document_no_list'].apply(len) > 1
df.loc[x, 'Cluster'] = np.random.choice(range(len(df)), x.sum(), replace=False)
Output :
document_no_list Cluster
0 [1, 2, 3] 2.0
1 [4, 5, 6, 7] 1.0
2 [8] NaN
3 [9, 10] 3.0
If you want consecutive numbers then you can use
df.loc[x, 'Cluster'] = np.arange(x.sum()) + 1
document_no_list Cluster
0 [1, 2, 3] 1.0
1 [4, 5, 6, 7] 2.0
2 [8] NaN
3 [9, 10] 3.0
Hope it helps
Create a boolean column based on the condition and apply cumsum() on the rows with 1's:
df['cluster_id'] = df['document_no_list'].apply(lambda x: len(x) > 1).astype(int)
df.loc[df['cluster_id'] == 1, 'cluster_id'] = df.loc[df['cluster_id'] == 1, 'cluster_id'].cumsum()
document_no_list cluster_id
0 [1, 2, 3] 1
1 [4, 5, 6, 7] 2
2 [8] 0
3 [9, 10] 3
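Another option (my suggestion, not one of the answers above): a cumulative count over a boolean mask, which also keeps NaN for the non-qualifying rows as in the sample output:

```python
import pandas as pd

df = pd.DataFrame({'document_no_list': [[1, 2, 3], [4, 5, 6, 7], [8], [9, 10]]})

# number only the qualifying rows in order; the rest stay NaN
mask = df['document_no_list'].str.len() > 1
df['cluster_id'] = mask.cumsum().where(mask)
```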

Find all combinations by columns

I have an n-rows by m-columns matrix and want to print every combination that takes one value from each column. For example:
2 5 6 9
5 2 8 3
1 1 9 4
2 5 3 9
my program will print
2-5-6-9
2-5-6-3
2-5-6-4
2-5-6-9
2-5-8-9
2-5-8-3...
I can't write m nested for loops. How can I do that?
Use recursion. It is enough to specify, for each position, which values can appear there (the columns), and write a recursive function whose parameter is the list of numbers chosen for the positions filled so far. Each recursive call iterates over the possibilities for the next position.
Python implementation:
def C(choose_numbers, possibilities):
    if len(choose_numbers) >= len(possibilities):
        print('-'.join(map(str, choose_numbers)))  # format output
    else:
        for i in possibilities[len(choose_numbers)]:
            C(choose_numbers + [i], possibilities)

c = [[2, 5, 1, 2], [5, 2, 1, 5], [6, 8, 9, 3], [9, 3, 4, 9]]
C([], c)
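The same cross-product can also be produced without explicit recursion via itertools.product; a minimal sketch using the per-column lists from the answer's c:

```python
from itertools import product

# each inner list holds the possible values for one position (a column)
columns = [[2, 5, 1, 2], [5, 2, 1, 5], [6, 8, 9, 3], [9, 3, 4, 9]]

# product iterates the cross-product, last position varying fastest
combos = ['-'.join(map(str, combo)) for combo in product(*columns)]
```

This yields the same 4**4 strings (duplicates included, as in the expected output) in the same order as the recursive version.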
