Extract distinct value counts for all categorical columns in a dataframe - python-3.x

I have a situation where I need to print the number of distinct values for each categorical column in my data frame.
The dataframe looks like this:
Gender  Function  Segment
M       IT        LE
F       IT        LM
M       HR        LE
F       HR        LM
The output should give me the following:
Variable_Name  Distinct_Count
Gender         2
Function       2
Segment        2
How to achieve this?

Use nunique, then pass the resulting Series into a new dataframe and set the column names:
df_unique = df.nunique().to_frame().reset_index()
df_unique.columns = ['Variable','DistinctCount']
print(df_unique)
   Variable  DistinctCount
0    Gender              2
1  Function              2
2   Segment              2
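The same table can also be built in one chain; a minimal sketch of the same nunique idea, using rename_axis to name the index before resetting it:
df_unique = df.nunique().rename_axis('Variable').reset_index(name='DistinctCount')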

This is not elegant, but it produces the expected output:
new_data = {'Variable_Name': [], 'Distinct_Count': []}
for i in list(df):
    new_data['Variable_Name'].append(i)
    new_data['Distinct_Count'].append(df[i].nunique())
new_df = pd.DataFrame(new_data)
print(new_df)
Output:
  Variable_Name  Distinct_Count
0        Gender               2
1      Function               2
2       Segment               2

Related

Get records with distinct combination of id after self join in pandas

I have a pandas dataframe df with columns setid, id, and label. I would like to compare the label values pairwise within each setid. I've tried doing a self join, as I illustrate below, but that winds up giving me extra records for each permutation of id. I would like just one record for each distinct combination of id. I've sketched out some examples below with data to illustrate what I'm trying to accomplish. Can anyone suggest a slick way to do this?
df
setid  id  label
1      1   a
1      2   b
If I join it to itself on setid:
import pandas as pd
pd.merge(df,df, how='inner', on=['setid']).head()
setid  id_x  id_y  label_x  label_y
1      1     1     a        a
1      1     2     a        b
1      2     2     b        b
1      2     1     b        a
but I only want one version of each combination of id, for example the output below:
setid  id_x  id_y  label_x  label_y
1      1     1     a        a
1      1     2     a        b
1      2     2     b        b
You could use np.sort on the label columns and then drop duplicates:
import numpy as np

df1 = pd.merge(df, df, how='inner', on=['setid'])
# sort each label pair so (a, b) and (b, a) become the same combination
df1[['label_x', 'label_y']] = np.sort(df1.filter(like='label').values, axis=1)
df1 = df1.drop_duplicates(subset=['label_x', 'label_y'])
df1
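If the id values are orderable, another sketch that keeps one row per unordered pair of ids is to filter the self join on id_x <= id_y (this assumes the smaller id should come first, as in the expected output):
df1 = pd.merge(df, df, how='inner', on=['setid'])
df1 = df1[df1['id_x'] <= df1['id_y']]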

How to split a column's values into multiple columns in a dataframe

I have a dataframe like this
PK Name Mobile questions
1 Jack 12345 [{'question':"how are you","Response":"Fine"},{"question":"whats your age","Response":"i am 19"}]
2 kim 102345 [{'question':"how are you","Response":"Not Fine"},{"question":"whats your age","Response":"i am 29"}]
3 jame 420
I want the output df to be like
PK Name Mobile Question 1 Response 1 Question 2 Response 2
1 Jack 12345 How are you Fine Whats your age i am 19
2 Kim 102345 How are you Not Fine Whats your age i am 29
3 jame 420
You can use explode to first create a row per element in each list. Then create a dataframe from this exploded series, keeping the index. Assign a column with an incremental value per row within each index group, then set_index and unstack to create the right shape. Finally, rename the columns and join back to the original df.
# create a row per element in each list in each row
s = df['questions'].explode()
# create the dataframe and reshape
df_qr = pd.DataFrame(s.tolist(), index=s.index)\
          .assign(cc=lambda x: x.groupby(level=0).cumcount()+1)\
          .set_index('cc', append=True).unstack()
# flatten column names
df_qr.columns = [f'{col[0]} {col[1]}' for col in df_qr.columns]
# join back to df
df_f = df.drop('questions', axis=1).join(df_qr, how='left')
print(df_f)
PK Name Mobile question 1 question 2 Response 1 Response 2
0 1 Jack 12345 how are you whats your age Fine i am 19
1 2 kim 102345 how are you whats your age Not Fine i am 29
Edit: if some rows are empty strings instead of lists, then create s this way:
s = df.loc[df['questions'].apply(lambda x: isinstance(x, list)), 'questions'].explode()
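If the questions column instead arrives as strings that merely look like lists of dicts, a sketch (assuming the strings are valid Python literals) is to convert them with ast.literal_eval before exploding:
import ast

# convert string cells that look like lists into real lists
mask = df['questions'].apply(lambda x: isinstance(x, str) and x.startswith('['))
df.loc[mask, 'questions'] = df.loc[mask, 'questions'].apply(ast.literal_eval)
s = df.loc[df['questions'].apply(lambda x: isinstance(x, list)), 'questions'].explode()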

Normalize Column Values by Monthly Averages with added Group dimension

Initial Note
I already got this running, but it takes a very long time to execute. My DataFrame is around 500MB large. I am hoping to hear some feedback on how to execute this as quickly as possible.
Problem Statement
I want to normalize the DataFrame columns by the mean of the column's values during each month. An added complexity is that I have a column named group which denotes a different sensor in which the parameter (column) was measured. Therefore, the analysis needs to iterate around group and each month.
DF example
                     X  Y  Z  group
2019-02-01 09:30:07  1  2  1  'grp1'
2019-02-01 09:30:23  2  4  3  'grp2'
2019-02-01 09:30:38  3  6  5  'grp1'
...
Code (Functional, but slow)
This is the code that I used. Comments describe most lines. I recognize that the three for loops are causing the runtime issue, but I do not see a way around them. Does anyone know a faster approach?
# Get mean monthly values for each group
mean_per_month_unit = process_df.groupby('group').resample('M', how='mean')
# Store the monthly dates created in last line into a list called month_dates
month_dates = mean_per_month_unit.index.get_level_values(1)
# Place date on multiIndex columns. future note: use df[DATE, COL_NAME][UNIT] to access mean value
mean_per_month_unit = mean_per_month_unit.unstack().swaplevel(0,1,1).sort_index(axis=1)
divide_df = pd.DataFrame().reindex_like(df)
process_cols.remove('group')
for grp in group_list:
    print(grp)
    # Iterate through month
    for mnth in month_dates:
        # Make mask where month and group
        mask = (df.index.month == mnth.month) & (df['group'] == grp)
        for col in process_cols:
            # Set values of divide_df
            divide_df.iloc[mask.tolist(), divide_df.columns.get_loc(col)] = mean_per_month_unit[mnth, col][grp]
# Divide process_df with divide_df
final_df = process_df / divide_df.values
EDIT: Example data
Here is the data in CSV format.
EDIT2: Current code (according to current answer)
def normalize_df(df):
    df['month'] = df.index.month
    print(df['month'])
    df['year'] = df.index.year
    print(df['year'])

    def find_norm(x, df_col_list):  # x is a row in dataframe, col_list is the list of columns to normalize
        agg = df.groupby(by=['group', 'month', 'year'], as_index=True).mean()
        print("###################", x.name, x['month'])
        for column in df_col_list:  # iterate over col list, find mean from aggregations, and divide the value by the mean
            print(column)
            mean_col = agg.loc[(x['group'], x['month'], x['year']), column]
            print(mean_col)
            col_name = "norm" + str(column)
            x[col_name] = x[column] / mean_col  # norm
        return x

    normalize_cols = df.columns.tolist()
    normalize_cols.remove('group')
    #normalize_cols.remove('mode')
    df2 = df.apply(find_norm, df_col_list=normalize_cols, axis=1)
The code runs perfectly for one iteration and then it fails with the error:
KeyError: ('month', 'occurred at index 2019-02-01 11:30:17')
As I said, it runs correctly once, then it iterates over the same row again and fails. According to the df.apply() documentation, the first row is always processed twice. I'm just not sure why it fails the second time through.
Assuming that the requirement is to normalize each column by its monthly mean within each group, here is another approach:
Create new columns - month and year - from the index. df.index.month can be used for this, provided the index is of type DatetimeIndex:
type(df.index) # df is the original dataframe
#pandas.core.indexes.datetimes.DatetimeIndex
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.
Now, group over (grp, month, year) and aggregate to find mean of every column. (Added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.)
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
Use a function to calculate the normalized values and use apply() over the original dataframe
def find_norm(x, df_col_list):  # x is a row in dataframe, col_list is the list of columns to normalize
    for column in df_col_list:  # iterate over col list, find mean from aggregations, and divide the value by the mean
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col  # norm
    return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
#df2 will now have 3 additional columns - normA, normB, normC
df2:
                     A  B  C  grp  month  year     normA  normB  normC
2019-02-01 09:30:07  1  2  3    1      2  2019  0.666667    0.8    1.5
2019-03-02 09:30:07  2  3  4    1      3  2019  1.000000    1.0    1.0
2019-02-01 09:40:07  2  3  1    2      2  2019  1.000000    1.0    1.0
2019-02-01 09:38:07  2  3  1    1      2  2019  1.333333    1.2    0.5
Alternatively, for step 3, one can join the agg and df dataframes and find the norm.
Hope this helps!
Here is how the code would look:
# Step 1
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs
# Step 2
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
# Step 3
def find_norm(x, df_col_list):  # x is a row in dataframe, col_list is the list of columns to normalize
    for column in df_col_list:  # iterate over col list, find mean from aggregations, and divide the value by the mean
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col  # norm
    return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
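For the join-based alternative mentioned in step 3, a minimal sketch (assuming the same grp/month/year columns and numeric columns A, B, C); because it avoids the row-wise apply, it should be considerably faster on a large frame:
# Aggregate once, then merge the monthly means back onto the original rows
agg = df.groupby(by=['grp', 'month', 'year'], as_index=False).mean()
merged = df.reset_index().merge(agg, on=['grp', 'month', 'year'], suffixes=('', '_mean'))
for column in ['A', 'B', 'C']:
    merged['norm' + column] = merged[column] / merged[column + '_mean']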

How to put datetime data into a subset of a dataframe while keeping the data type?

I have a dataframe which has names as the index and a column of birth dates, e.g.
> df_birthdate
              date
Paul    2009-03-07
Peter   2000-06-23
Pauline 2001-03-03
Paula   2002-02-17
> type(df_birthdate.date[0])
pandas._libs.tslibs.timestamps.Timestamp
> df_huge = pd.DataFrame({'School': ['A','A','A','A','B','B','B','B']})
> df_huge['new_date'] = ''
> idx_t = df_huge.School == 'A'
And I have a huge dataframe called df_huge into which I want to put the dates. I know that the order won't change.
df_huge.loc[idx_t, "new_date"] = df_birthdate.values
The above code works for me in most cases. However, when the 'date' column is in datetime format, applying .values means the data I put into the df_huge dataframe is no longer in datetime format. Any suggestion on how to put 'date' from df_birthdate into a specific location of df_huge? Many thanks.
You can omit df_huge['new_date'] = '', which assigns empty strings to the column (and forces an object dtype), and assign the datetime values directly:
idx_t = df_huge.School == 'A'
df_huge.loc[idx_t, "new_date"] = df_birthdate.to_numpy()
print (df_huge)
  School   new_date
0      A 2009-03-07
1      A 2000-06-23
2      A 2001-03-03
3      A 2002-02-17
4      B        NaT
5      B        NaT
6      B        NaT
7      B        NaT
print (df_huge.dtypes)
School object
new_date datetime64[ns]
dtype: object
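If the new_date column has to exist before the assignment, a sketch that keeps the datetime dtype is to initialise it with NaT rather than empty strings (assuming df_birthdate['date'] is already datetime64):
df_huge['new_date'] = pd.NaT  # creates a datetime64[ns] column of missing values
idx_t = df_huge.School == 'A'
df_huge.loc[idx_t, 'new_date'] = df_birthdate['date'].to_numpy()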

I want to count the occurrences of duplicate values in a dataframe column and store the running count in a new column in Python

Example: Let's say I have a df
Id
A
B
C
A
A
B
It should look like:
Id  count
A   1
B   1
C   1
A   2
A   3
B   2
Note: I've tried the for loop and while loop approaches; they work for small datasets but take a lot of time for large datasets.
for i in df:
    for j in df:
        if i == j:
            count += 1
You can groupby with cumcount, like this:
df['counts'] = df.groupby('Id', sort=False).cumcount() + 1
df
  Id  counts
0  A       1
1  B       1
2  C       1
3  A       2
4  A       3
5  B       2
Alternatively, to get the total number of occurrences of each value (rather than a running count), pivot_table with a size aggregation works:
dups_values = df.pivot_table(index=['Id'], aggfunc='size')
print(dups_values)
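And if the goal were the total count per value attached to every row (rather than the running count above), a sketch using map:
df['total'] = df['Id'].map(df['Id'].value_counts())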
