How to sort and update a column in a pandas DataFrame, then concat a new DataFrame to the updated one - python-3.x

I want to sort a column of a pandas DataFrame - not just sort it, but get back a DataFrame with the sorted column. To that sorted and updated DataFrame I then want to concat a single-column DataFrame.
import pandas as pd
students = {'name': ['s1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 's9', 's10']}
marks = {'grade': [45, 78, 12, 14, 48, 43, 47, 98, 35, 80]}
df_1 = pd.DataFrame(students)
df_2 = pd.DataFrame(marks)
df = pd.concat([df_1, df_2], axis=1)
df_25to_75 = df
df_25to_75.sort_values(['grade'], inplace=True)
lst = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
a = pd.concat([df_25to_75, pd.DataFrame({'no.s': lst})], axis=1)

You can create your DataFrame more simply, and what you want is rank. Feel free to sort at the end if necessary, though it's not needed to get the ranking.
import pandas as pd
df = pd.DataFrame({**students, **marks})
df['no.s'] = df.grade.rank(method='dense').astype(int)
name grade no.s
0 s1 45 5
1 s2 78 8
2 s3 12 1
3 s4 14 2
4 s5 48 7
5 s6 43 4
6 s7 47 6
7 s8 98 10
8 s9 35 3
9 s10 80 9
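If you do want the rows ordered by grade at the end, as mentioned above, one extra line does it (ignore_index requires pandas 1.0+):

df = df.sort_values('grade', ignore_index=True)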
Your original issue is that, although you sort the DataFrame, the original index remains bound to the rows. So when you then assign a new Series of 1-10, it aligns on the original index, not on the row order.
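For illustration, a minimal fix of the original approach (using the question's own variable names): sort, reset the index, and only then concat, so positions and index labels agree:

df_25to_75 = df.sort_values('grade').reset_index(drop=True)
# index is now 0-9 in row order, so the new column lines up as intended
a = pd.concat([df_25to_75, pd.DataFrame({'no.s': lst})], axis=1)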

Related

Get sum of group subset using pandas groupby

I have a dataframe as shown. Using Python, I want to get the sum of 'Value' for each 'Id' group up to the first occurrence of 'Stage' 12.
import pandas as pd
df = pd.DataFrame({'Id': [1, 1, 1, 2, 2, 2, 2],
                   'Date': ['2020-04-23', '2020-04-25', '2020-04-28', '2020-04-20', '2020-05-01', '2020-05-05', '2020-05-12'],
                   'Stage': [11, 12, 15, 11, 14, 12, 12],
                   'Value': [5, 4, 6, 12, 2, 8, 3]})
Id Date Stage Value
1 2020-04-23 11 5
1 2020-04-25 12 4
1 2020-04-28 15 6
2 2020-04-20 11 12
2 2020-05-01 14 2
2 2020-05-05 12 8
2 2020-05-12 12 3
My desired output:
Id Value
1 9
2 22
Would be very thankful if someone could help.
Let us try using groupby with transform('idxmax') to filter the dataframe, then do another round of groupby:
idx = df['Stage'].eq(12).groupby(df['Id']).transform('idxmax')
output = df[df.index <= idx].groupby('Id')['Value'].sum().reset_index()
Detail
The transform with idxmax returns, for every row of a group, the index of that group's first match with 12; we then keep only the rows whose index is less than or equal to that value, i.e. the data up to and including the first Stage 12.
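To make that intermediate step concrete (my own illustration, using the sample frame above):

# Stage == 12 mask, grouped by Id; idxmax broadcasts the first matching index to every row
print(df['Stage'].eq(12).groupby(df['Id']).transform('idxmax').tolist())
# [1, 1, 1, 5, 5, 5, 5] -> row 1 is Id 1's first Stage-12 row, row 5 is Id 2's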

Filter dataframe by minimum number of values in groups

I have the following dataframe structure:
#----------------------------------------------------------#
# Generate dataframe mock example.
import numpy as np
import pandas as pd
# define categorical column.
grps = pd.DataFrame(['a', 'a', 'a', 'b', 'b', 'b'])
# generate dataframe 1.
df1 = pd.DataFrame([[3, 4, 6, 8, 10, 4],
                    [5, 7, 2, 8, 9, 6],
                    [5, 3, 4, 8, 4, 6]]).transpose()
# introduce nan into dataframe 1.
for col in df1.columns:
    df1.loc[df1.sample(frac=0.1).index, col] = np.nan
# generate dataframe 2.
df2 = pd.DataFrame([[3, 4, 6, 8, 10, 4],
                    [5, 7, 2, 8, 9, 6],
                    [5, 3, 4, 8, 4, 6]]).transpose()
# concatenate categorical column and dataframes.
df = pd.concat([grps, df1, df2], axis=1)
# Assign column headers.
df.columns = ['Groups', 1, 2, 3, 4, 5, 6]
# Set index as group column.
df = df.set_index('Groups')
# Generate stacked dataframe structure.
test_stack_df = df.stack(dropna=False).reset_index()
# Change column names.
test_stack_df = test_stack_df.rename(columns={'level_1': 'IDs',
                                              0: 'Values'})
#----------------------------------------------------------#
Original dataframe - 'df' before stacking:
Groups 1 2 3 4 5 6
a 3 5 5 3 5 5
a nan nan 3 4 7 3
a 6 2 nan 6 2 4
b 8 8 8 8 8 8
b 10 9 4 10 9 4
b 4 6 6 4 6 6
I would like to filter the columns such that there are at least 3 valid (non-NaN) values in each group - 'a' & 'b'. The final output should contain only columns 4, 5, 6.
I am currently using the following method:
# Function to define boolean series.
def filter_vals(test_stack_df, orig_df):
    # Reset index.
    df_idx_reset = orig_df.reset_index()
    # Generate list with size of each 'Group'.
    grp_num = pd.value_counts(df_idx_reset['Groups']).to_list()
    # Data series for each 'Group'.
    expt_class_1 = test_stack_df.head(grp_num[0])
    expt_class_2 = test_stack_df.tail(grp_num[1])
    # Check if both 'Groups' contain at least 3 values per 'ID'.
    # (parentheses needed: & binds tighter than >=)
    valid_IDs = (len(expt_class_1['Values'].value_counts()) >= 3) & \
                (len(expt_class_2['Values'].value_counts()) >= 3)
    # Return 'true' or 'false'
    return valid_IDs
# Apply function to dataframe to generate boolean series.
bool_series = test_stack_df.groupby('IDs').apply(filter_vals, df)
# Transpose original dataframe.
df_T = df.transpose()
# Filter by boolean series & transpose again.
df_filtered = df_T[bool_series].transpose()
I could achieve this with minimal fuss by applying the pandas.DataFrame.dropna() method with a threshold of 6. However, that won't account for different group sizes or let me specify the minimum number of values, which the current code does.
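For reference, that simpler approach would be a one-liner along these lines, keeping only columns with at least 6 non-NaN values overall, regardless of group:

df.dropna(axis=1, thresh=6)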
For larger dataframes, i.e. 4000+ columns, the code is a little slow, taking ~20 secs to complete the filtering. I have tried alternate methods that access the original dataframe directly using groupby & transform, but can't get anything to work.
Is there a simpler and faster method? Thanks for your time!
EDIT: 03/05/2020 (15:58) - just spotted something that wasn't clear in the function above. It still works, but I have clarified the variable names. Sorry for the confusion!
This will do the trick for you:
df.notna().groupby(level='Groups').sum().ge(3).all(axis=0)
Outputs:
1 False
2 False
3 False
4 True
5 True
6 True
dtype: bool
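To then filter the columns with that boolean Series, you can store it and index with loc (my addition, not part of the original answer):

mask = df.notna().groupby(level='Groups').sum().ge(3).all(axis=0)
df_filtered = df.loc[:, mask]  # keeps only columns 4, 5, 6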

How to use two for loops to copy a list variable into a location variable in a dataframe?

I have a dataframe with 2 columns called locStuff and data. Someone was kind enough to show me how to index a location range in the df so that it correctly changes the data at a single locStuff value instead of at the dataframe index; that works fine. Now I cannot see how to change the data values of that location range with a list of values.
import pandas as pd
INDEX = list(range(1, 11))
LOCATIONS = [3, 10, 6, 2, 9, 1, 7, 5, 8, 4]
DATA = [94, 43, 85, 10, 81, 57, 88, 11, 35, 86]
# Make dataframe
DF = pd.DataFrame(LOCATIONS, columns=['locStuff'], index=INDEX)
DF['data'] = pd.Series(DATA, index=INDEX)
# Location and new value inputs
LOC_TO_CHANGE = 8
NEW_LOC_VALUE = 999
DF.iloc[3:6, 1] = ('%03d' % NEW_LOC_VALUE)  # works for a single value
print(DF)
# Now with a list of values over a range of locations
NEW_LOC_VALUE = [999, 666, 333]
LOC_RANGE = list(range(3, 6))
Neither of these attempts works:

for i in NEW_LOC_VALUE:
    for j in LOC_RANGE:
        # i iterates over the values themselves (999, 666, 333),
        # so NEW_LOC_VALUE[i] raises IndexError
        DF.iloc[j, 1] = ('%03d' % NEW_LOC_VALUE[i])
print(DF)

i = 0
while i < len(NEW_LOC_VALUE):
    for j in LOC_RANGE:
        DF.iloc[j, 1] = ('%03d' % NEW_LOC_VALUE[i])
    i = +1  # bug: this assigns +1 rather than incrementing (i += 1)
print(DF)
I know how to do this using loops or list comprehensions when building an empty list, but I have no idea how to adapt what I have above for a DataFrame.
Expected behaviour would be:
locStuff data
1 3 999
2 10 43
3 6 85
4 2 10
5 9 81
6 1 57
7 7 88
8 5 333
9 8 35
10 4 666
Try setting locStuff as the index, assigning the values, and then reset_index:
DF.set_index('locStuff', inplace=True)
DF.loc[LOC_RANGE, 'data'] = NEW_LOC_VALUE
DF.reset_index(inplace=True)
Output:
locStuff data
0 3 999
1 10 43
2 6 85
3 2 10
4 9 81
5 1 57
6 7 88
7 5 333
8 8 35
9 4 666
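If you'd rather leave the index untouched, a map-based variant should give the same result (my own sketch, not from the original answer):

mapping = dict(zip(LOC_RANGE, NEW_LOC_VALUE))  # {3: 999, 4: 666, 5: 333}
DF['data'] = DF['locStuff'].map(mapping).fillna(DF['data']).astype(int)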

Pandas: How to build a column based on another column which is indexed by another one?

I have the dataframe presented below. I tried the solution shown after it, but I am not sure it is a good one.
import numpy as np
import pandas as pd

def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2, 3, 3],
                'Region': ['A', 'A', 'C', 'B', 'A', 'B'],
                'var-A': [2, 4, 6, 4, 6, 6],
                'var-B': [20, 30, 40, 50, 10, 20],
                'var-C': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['code', 'Region', 'var-A', 'var-B', 'var-C'])
    return df

if __name__ == "__main__":
    df = creatingDataFrame()
    df['var'] = (np.where(df['Region'] == 'A', 1.0, 0.0) * df['var-A']
                 + np.where(df['Region'] == 'B', 1.0, 0.0) * df['var-B']
                 + np.where(df['Region'] == 'C', 1.0, 0.0) * df['var-C'])
I want the variable 'var' to take the value of column 'var-A', 'var-B', or 'var-C' depending on the region given in the 'Region' column.
The result must be
df['var']
Out[50]:
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64
You can try with lookup - first strip the 'var-' prefix from the column names so they match the values in Region:
df.columns=df.columns.str.split('-').str[-1]
df
Out[255]:
code Region A B C
0 1 A 2 20 3
1 2 A 4 30 4
2 3 C 6 40 5
3 2 B 4 50 1
4 3 A 6 10 2
5 3 B 6 20 3
df.lookup(df.index,df.Region)
Out[256]: array([ 2, 4, 5, 50, 6, 20], dtype=int64)
#df['var']=df.lookup(df.index,df.Region)
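Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; on newer versions, roughly equivalent numpy indexing (per the deprecation note) would be:

import numpy as np
idx, cols = pd.factorize(df['Region'])
df['var'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]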

Better way to replace values in DataFrame from large dictionary

I have written some code that replaces values in a DataFrame with values from another frame using a dictionary, and it is working, but I am using this on some large files where the dictionary can get very long - a few thousand pairs. When I then use this code it runs very slowly, and it has also run out of memory on a few occasions.
I am somewhat convinced that my method is far from optimal and that there must be a faster way to do this. I have created a simple example that does what I want, but that is slow for large amounts of data. I hope someone has a simpler way to do this.
import pandas as pd
#Frame with data where I want to replace the 'id' with the name from df2
df1 = pd.DataFrame({'id' : [1, 2, 3, 4, 5, 3, 5, 9], 'values' : [12, 32, 42, 51, 23, 14, 111, 134]})
#Frame containing names linked to ids
df2 = pd.DataFrame({'id' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'name' : ['id1', 'id2', 'id3', 'id4', 'id5', 'id6', 'id7', 'id8', 'id9', 'id10']})
#My current "slow" way of doing this.
#Starts by creating a dictionary from df2
#Need to create dictionaries from the domain and banners tables to link ids
df2_dict = dict(zip(df2['id'], df2['name']))
#and then uses the dict to replace the ids with name in df1
df1.replace({'id' : df2_dict}, inplace=True)
I think you can use map with a Series converted by to_dict - you get NaN if the value doesn't exist in df2:
df1['id'] = df1.id.map(df2.set_index('id')['name'].to_dict())
print (df1)
id values
0 id1 12
1 id2 32
2 id3 42
3 id4 51
4 id5 23
5 id3 14
6 id5 111
7 id9 134
Or use replace - if the value doesn't exist in df2, the original value from df1 is kept:
df1['id'] = df1.id.replace(df2.set_index('id')['name'])
print (df1)
id values
0 id1 12
1 id2 32
2 id3 42
3 id4 51
4 id5 23
5 id3 14
6 id5 111
7 id9 134
Sample:
#Frame with data where I want to replace the 'id' with the name from df2
df1 = pd.DataFrame({'id' : [1, 2, 3, 4, 5, 3, 5, 9], 'values' : [12, 32, 42, 51, 23, 14, 111, 134]})
print (df1)
#Frame containing names linked to ids
df2 = pd.DataFrame({'id' : [1, 2, 3, 4, 6, 7, 8, 9, 10], 'name' : ['id1', 'id2', 'id3', 'id4', 'id6', 'id7', 'id8', 'id9', 'id10']})
print (df2)
df1['new_map'] = df1.id.map(df2.set_index('id')['name'].to_dict())
df1['new_replace'] = df1.id.replace(df2.set_index('id')['name'])
print (df1)
id values new_map new_replace
0 1 12 id1 id1
1 2 32 id2 id2
2 3 42 id3 id3
3 4 51 id4 id4
4 5 23 NaN 5
5 3 14 id3 id3
6 5 111 NaN 5
7 9 134 id9 id9
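For completeness, a left merge is another common way to attach the names, and it avoids building a dictionary at all (my suggestion, not part of the original answer):

out = df1.merge(df2, on='id', how='left')
# the 'name' column holds the mapped value, NaN where the id is missing from df2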
