Edit multiple values with df.at() - python-3.x

Why does
>>> offset = 2
>>> data = {'Value': [7, 9, 21, 22, 23, 100]}
>>> df = pd.DataFrame(data=data)
>>> df.at[:offset, "Value"] = 99
>>> df
   Value
0     99
1     99
2     99
3     22
4     23
5    100
change the values at indices [0, 1, 2]? I would expect only [0, 1] to change, to conform with regular Python slicing.
Like when I do
>>> arr = [0, 1, 2, 3, 4]
>>> arr[0:2]
[0, 1]

.at behaves like .loc in that it selects rows/columns by label, and label slicing in pandas is inclusive of both endpoints. Note that .iloc, which slices on integer positions, behaves the way you would expect. See this good answer for the motivation.
Also note that the pandas documentation suggests using .at only when getting or setting single values. For slices, use .loc instead.
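For illustration, here is a minimal sketch of the difference, building a fresh copy of the example DataFrame for each assignment so the two results can be compared:
import pandas as pd

offset = 2
df = pd.DataFrame({'Value': [7, 9, 21, 22, 23, 100]})
# Label-based slicing (.loc): both endpoints are included,
# so rows with labels 0, 1 and 2 become 99.
df.loc[:offset, "Value"] = 99

df2 = pd.DataFrame({'Value': [7, 9, 21, 22, 23, 100]})
# Position-based slicing (.iloc): the end position is excluded,
# so only the rows at positions 0 and 1 become 99.
df2.iloc[:offset, df2.columns.get_loc("Value")] = 99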

In df.at[:offset, "Value"] = 99, the slice :2 means all rows with labels from 0 to 2 inclusive, because label-based slicing includes the end point. If you want to change only the 3rd row, change it to 2:2.

Related

Assigning value based on if cell is inbetween external tuple values

I have a pandas series of integer values and a dictionary of keys and tuples (2 integers).
The tuples represent a high low value for each key. I'd like to map the key value to each cell of my series based on which tuple the series value falls into.
Example:
d = {'a': (1, 5), 'b': (6, 10), 'c': (11, 15)}  # keys and tuples are ordered and never repeated
s = pd.Series([5, 6, 5, 8, 15, 5, 2, 5])  # I can sort the series; values may repeat or be absent
For a shorter list I believe I could do this manually with a for loop, but I can potentially have a big dictionary with many keys.
Let's try pd.Interval:
lookup = pd.Series(list(d.keys()),
                   index=[pd.Interval(x, y, closed='both') for x, y in d.values()])
lookup.loc[s]
Output:
[1, 5] a
[6, 10] b
[1, 5] a
[6, 10] b
[11, 15] c
[1, 5] a
[1, 5] a
[1, 5] a
dtype: object
reindex also works and is safer in case you have out-of-range data:
lookup.reindex(s)
Output:
5 a
6 b
5 a
8 b
15 c
5 a
2 a
5 a
dtype: object
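As a side note on why reindex is the safer choice: with .loc, a value of s that falls outside every interval raises a KeyError, while reindex simply yields NaN. A small sketch of the expected behaviour, reusing the lookup Series above and a hypothetical out-of-range value 100:
s_extra = pd.Series([5, 100])   # 100 is not covered by any interval
lookup.reindex(s_extra)         # 5 -> 'a', 100 -> NaN
# lookup.loc[s_extra]           # would raise a KeyError because of 100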
Another idea using pd.IntervalIndex and Series.map:
m = pd.Series(list(d.keys()),
              index=pd.IntervalIndex.from_tuples(d.values(), closed='both'))
s = s.map(m)
Result:
0 a
1 b
2 a
3 b
4 c
5 a
6 a
7 a
dtype: object

Filter dataframe by minimum number of values in groups

I have the following dataframe structure:
#----------------------------------------------------------#
# Generate dataframe mock example.
import numpy as np
import pandas as pd
# define categorical column.
grps = pd.DataFrame(['a', 'a', 'a', 'b', 'b', 'b'])
# generate dataframe 1.
df1 = pd.DataFrame([[3, 4, 6, 8, 10, 4],
                    [5, 7, 2, 8, 9, 6],
                    [5, 3, 4, 8, 4, 6]]).transpose()
# introduce nan into dataframe 1.
for col in df1.columns:
    df1.loc[df1.sample(frac=0.1).index, col] = np.nan
# generate dataframe 2.
df2 = pd.DataFrame([[3, 4, 6, 8, 10, 4],
                    [5, 7, 2, 8, 9, 6],
                    [5, 3, 4, 8, 4, 6]]).transpose()
# concatenate categorical column and dataframes.
df = pd.concat([grps, df1, df2], axis = 1)
# Assign column headers.
df.columns = ['Groups', 1, 2, 3, 4, 5, 6]
# Set index as group column.
df = df.set_index('Groups')
# Generate stacked dataframe structure.
test_stack_df = df.stack(dropna = False).reset_index()
# Change column names.
test_stack_df = test_stack_df.rename(columns={'level_1': 'IDs',
                                              0: 'Values'})
#----------------------------------------------------------#
Original dataframe - 'df' before stacking:
Groups     1     2     3     4     5     6
a          3     5     5     3     5     5
a        nan   nan     3     4     7     3
a          6     2   nan     6     2     4
b          8     8     8     8     8     8
b         10     9     4    10     9     4
b          4     6     6     4     6     6
I would like to filter the columns such that each group - 'a' & 'b' - has at least 3 valid (non-NaN) values. The final output should contain only columns 4, 5 and 6.
I am currently using the following method:
# Function to define boolean series.
def filter_vals(test_stack_df, orig_df):
    # Reset index.
    df_idx_reset = orig_df.reset_index()
    # Generate list with the size of each 'Group'.
    grp_num = pd.value_counts(df_idx_reset['Groups']).to_list()
    # Data series for each 'Group'.
    expt_class_1 = test_stack_df.head(grp_num[0])
    expt_class_2 = test_stack_df.tail(grp_num[1])
    # Check if both 'Groups' contain at least 3 values per 'ID'.
    valid_IDs = (len(expt_class_1['Values'].value_counts()) >= 3) & \
                (len(expt_class_2['Values'].value_counts()) >= 3)
    # Return True or False.
    return valid_IDs
# Apply function to dataframe to generate boolean series.
bool_series = test_stack_df.groupby('IDs').apply(filter_vals, df)
# Transpose original dataframe.
df_T = df.transpose()
# Filter by boolean series & transpose again.
df_filtered = df_T[bool_series].transpose()
I could achieve this with minimal fuss by applying the pandas.DataFrame.dropna() method with a threshold of 6. However, that won't account for different group sizes or let me specify the minimum number of values, which the current code does.
For larger dataframes, i.e. 4000+ columns, the code is a little slow, taking ~20 s to complete the filtering. I have tried alternative methods that access the original dataframe directly using groupby & transform, but can't get anything to work.
Is there a simpler and faster method? Thanks for your time!
EDIT: 03/05/2020 (15:58) - just spotted something that wasn't clear in the function above. Still works but have clarified variable names. Sorry for the confusion!
This will do the trick for you:
df.notna().groupby(level='Groups').sum(axis=0).ge(3).all(axis=0)
Outputs:
1 False
2 False
3 False
4 True
5 True
6 True
dtype: bool
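To turn that boolean Series into the filtered frame, it can be used directly as a column mask. A short sketch, assuming df is the original frame with 'Groups' as its index, and min_count is the minimum number of valid values per group:
min_count = 3
keep = df.notna().groupby(level='Groups').sum().ge(min_count).all()
df_filtered = df.loc[:, keep]   # keeps only columns 4, 5 and 6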

How to create dict from pandas dataframe with headers as keys and column values as arrays (not lists)?

I have a pandas dataframe of size (3x10000). I need to create a dict such that the keys are column headers and column values are arrays.
I understand there are many options to create such a dict where values are saved as lists. But I could not find a way to have the values as arrays.
Dataframe example:
   A  B  C
0  1  4  5
1  6  3  2
2  8  0  9
Expected output:
{'A': array([1, 6, 8, ...]),
'B': array([4, 3, 0, ...]),
'C': array([5, 2, 9, ...])}
I guess the following does what you need:
>>> import numpy as np
>>> # assuming df is your dataframe
>>> result = {header: np.array(df[header]) for header in df.columns}
>>> result
{'A': array([1, 6, 8]), 'B': array([4, 3, 0]), 'C': array([5, 2, 9])}
pandas added DataFrame.to_numpy in 0.24, and it should be much more efficient, so you might want to check it:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html
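For instance, a variant of the dict comprehension above that goes through to_numpy (a sketch; df is the same example frame):
>>> result = {col: df[col].to_numpy() for col in df.columns}
>>> result
{'A': array([1, 6, 8]), 'B': array([4, 3, 0]), 'C': array([5, 2, 9])}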

pandas how to derive values for a new column based on another column

I have a dataframe with a column in which each value is a list. Now I want to derive a new column that only considers lists whose size is greater than 1, and assigns a unique integer to the corresponding row as an id.
A sample dataframe is like,
document_no_list    cluster_id
[1, 2, 3]           1
[4, 5, 6, 7]        2
[8]                 nan
[9, 10]             3
The cluster_id column only considers the 1st, 2nd and 4th rows, each of which holds a list of size greater than 1, and assigns a unique integer id to the corresponding cell in the column.
I am wondering how to do that in pandas.
We can use np.random.choice for unique random values together with .loc for assignment, i.e.
import numpy as np
import pandas as pd

df = pd.DataFrame({'document_no_list': [[1, 2, 3], [4, 5, 6, 7], [8], [9, 10]]})
x = df['document_no_list'].apply(len) > 1
df.loc[x, 'Cluster'] = np.random.choice(range(len(df)), x.sum(), replace=False)
Output :
document_no_list Cluster
0 [1, 2, 3] 2.0
1 [4, 5, 6, 7] 1.0
2 [8] NaN
3 [9, 10] 3.0
If you want continuous numbers then you can use
df.loc[x,'Cluster'] = np.arange(x.sum())+1
document_no_list Cluster
0 [1, 2, 3] 1.0
1 [4, 5, 6, 7] 2.0
2 [8] NaN
3 [9, 10] 3.0
Hope it helps
Create a boolean column based on condition and apply cumsum() on rows with 1's
df['cluster_id'] = df['document_no_list'].apply(lambda x: len(x)> 1).astype(int)
df.loc[df['cluster_id'] == 1, 'cluster_id'] = df.loc[df['cluster_id'] == 1, 'cluster_id'].cumsum()
document_no_list cluster_id
0 [1, 2, 3] 1
1 [4, 5, 6, 7] 2
2 [8] 0
3 [9, 10] 3
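If you would rather get NaN instead of 0 for the excluded row, as in the question's example, a small variation of the same idea (a sketch using a boolean mask and Series.where):
mask = df['document_no_list'].apply(lambda x: len(x) > 1)
df['cluster_id'] = mask.cumsum().where(mask)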

Why is order of data items reversed while creating a pandas series?

I am new to python and pandas so please bear with me. I tried searching the answer everywhere but couldn't find it. Here's my question:
This is my input code:
list = [1, 2, 3, 1, 2, 3]
s = pd.Series([1, 2, 3, 10, 20, 30], list)
The output is:
1 1
2 2
3 3
1 10
2 20
3 30
dtype: int64
Now, my question is: why does the "list" come before the first list specified when creating the series? I tried running the same code multiple times to check whether the series creation is orderless. Any help would be highly appreciated.
Python Version:
Python 3.6.0
Pandas Version:
'0.19.2'
I think you omitted the index parameter, which specifies the first column (the index) - so the Series construction is actually:
# don't use list as a variable name, because it shadows the built-in list in Python
L = [1, 2, 3, 1, 2, 3]
s = pd.Series(data=[1, 2, 3, 10, 20, 30], index=L)
print (s)
1 1
2 2
3 3
1 10
2 20
3 30
dtype: int64
You can also check the Series documentation.
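For comparison, if you leave the index out entirely, pandas falls back to the default RangeIndex, which is probably the output you were expecting:
s = pd.Series([1, 2, 3, 10, 20, 30])
print (s)
0     1
1     2
2     3
3    10
4    20
5    30
dtype: int64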
