Hi all!
I have a dataframe. One column contains strings like this: 'Product1, Product2, foo, bar'.
I've split them by ',' and now I have a column containing lists of product names.
How can I get a set of unique product names?
First flatten the list of lists, then apply set, and finally convert back to a list:
import pandas as pd

df = pd.DataFrame(data={'a':['Product1,Product1,foo,bar','Product1,foo,foo,bar']})
print (df)
                           a
0  Product1,Product1,foo,bar
1       Product1,foo,foo,bar
a=list(set([item for sublist in df['a'].str.split(',').values.tolist() for item in sublist]))
print (a)
['bar', 'foo', 'Product1']
If you want unique values per row:
df = df['a'].str.split(',').apply(lambda x: list(set(x)))
print (df)
0 [bar, foo, Product1]
1 [bar, foo, Product1]
Name: a, dtype: object
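As an aside, a minimal alternative sketch for the flattened case, assuming pandas 0.25+ (for Series.explode) and the original df from above:
# Split, put one value per row with explode, then take the unique values
a = df['a'].str.split(',').explode().unique().tolist()
print (a)
['Product1', 'foo', 'bar']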
I have a sample dataframe as given below.
import pandas as pd
import numpy as np

data = {'ID':['A', 'B', 'C', 'D'],
        'Age':[[20], [21], [19], [24]],
        'Sex':[['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance','Music'], ['Dance','Sports'], ['Hiking','Surfing'], np.nan]}
df = pd.DataFrame(data)
df
Each of the columns holds list values. I want to unwrap those lists and preserve the datatypes of the items inside them for all columns.
The final output should look something like what is shown below.
Any help is greatly appreciated. Thank you.
Option 1. You can use the .str column accessor to index the lists stored in the DataFrame values (or strings, or any other iterable):
# Replace columns containing length-1 lists with the only item in each list
df['Age'] = df['Age'].str[0]
df['Sex'] = df['Sex'].str[0]
# Pass the variable-length list into the join() string method
df['Interest'] = df['Interest'].apply(', '.join)
Option 2. explode Age and Sex, then apply ', '.join to Interest:
df = df.explode(['Age', 'Sex'])
df['Interest'] = df['Interest'].apply(', '.join)
On the rows without missing values (row D needs the NaN handling in Option 3 below), both options return:
df
ID Age Sex Interest
0 A 20 Male Dance, Music
1 B 21 Male Dance, Sports
2 C 19 Female Hiking, Surfing
EDIT
Option 3. If you have many columns which contain lists with possible missing values as np.nan, you can get the list-column names and then loop over them as follows:
# Get columns which contain at least one python list
list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]
list_cols
['Age', 'Sex', 'Interest']
# Process each column
for c in list_cols:
    # If all lists in column c contain a single item, take that item
    if (df[c].str.len() == 1).all():
        df[c] = df[c].str[0]
    else:
        # Join list items; leave non-list values such as NaN untouched
        df[c] = df[c].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
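With the sample data above (where row D holds NaN), this leaves the missing values untouched; checking the result, I would expect:
print (df)
  ID  Age     Sex         Interest
0  A   20    Male     Dance, Music
1  B   21    Male    Dance, Sports
2  C   19  Female  Hiking, Surfing
3  D   24     NaN              NaN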
I have a dataframe with some 60+ columns. Out of these, about half are categorical (non-amount columns), though some of them have categorical data stored as 1s and 0s, so the datatype will be int, or float if the column has NaN.
I need to create a new dataframe with the selected columns of the earlier dataframe as its index and their unique values as the column.
Test Data is as under:
data = pd.DataFrame({'A':['A','B','C','A','B','C','D'],
                     'B':[1,0,1,0,1,0,1],
                     'C':[10,20,30,40,50,60,70],
                     'D':['Y','N','Y','N','Y','N','P']
                     })
I did this to select the desired columns and get the unique values for each column:
from operator import itemgetter

cols = itemgetter(0,1,3)(data.columns)
uniq_stats = pd.DataFrame(columns=['Val'], index=cols)
for each in cols:
    uniq_stats.loc[each] = ';'.join(data[each].unique())
However, this fails for columns where the data is categorical but stored as 1s and 0s, and for columns that contain Null values.
Expected Outcome for Above Test Data:
Val
A A;B;C;D
B 1;0
D Y;N;P
What should I do to get those as well?
I'd also like Null values to be included in the list of unique values.
Use DataFrame.iloc to select columns by position, then pass a lambda function to DataFrame.agg:
df = data.iloc[:, [0,1,3]].agg(lambda x: ';'.join(x.astype(str).unique())).to_frame('Val')
print (df)
Val
A A;B;C;D
B 1;0
D Y;N;P
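Note that astype(str) converts NaN to the string 'nan', so Null values are included in the list of unique values, as requested.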
A similar idea is to convert only the unique values to strings, which should be faster:
df = data.iloc[:,[0,1,3]].agg(lambda x:';'.join(str(y) for y in x.unique())).to_frame('Val')
print (df)
Val
A A;B;C;D
B 1;0
D Y;N;P
Okay, I tried the map() function to do this and I think it works. It now includes both numeric categories and nan values in the list of unique values.
cols = itemgetter(0,1,3)(data.columns)
uniq_stats = pd.DataFrame(columns=['Val'], index=cols)
for each in cols:
    uniq_stats.loc[each] = ';'.join(map(str, data[each].unique()))
However, please share if there's a better and faster way to do this.
I think you can use .stack() with .groupby().unique():
selected_cols = ['A','B']
s = data[selected_cols].stack(dropna=False).groupby(level=[1]).unique()
s.to_frame('vals')
vals
A [A, B, C, D]
B [1, 0]
Another way, using melt:
pd.melt(data).groupby('variable')['value'].unique()
variable
A [A, B, C, D]
B [1, 0]
C [10, 20, 30, 40, 50, 60, 70]
D [Y, N, P]
Name: value, dtype: object
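Since only columns 0, 1 and 3 are wanted here, a small variation of the melt approach restricted to those positions, joined with ';' to match the expected output (a sketch, assuming the same data):
# Melt only the selected columns, then join each group's unique values
s = pd.melt(data.iloc[:, [0,1,3]]).groupby('variable')['value'].unique()
s.apply(lambda x: ';'.join(map(str, x))).to_frame('Val')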
I want to write a function that updates the column names of a df based on the name of the df.
I have a number of dfs with identical columns, which I eventually need to merge into one df. To identify where the data originally came from once merged, I first want to update the column names by appending an identifier to them in each separate df.
I have tried using a dictionary (dict) within the function to update the columns, but have been unable to get this to work.
I have attempted the following function:
def update_col(input):
    dict = {'df1': 'A'
            ,'df2': 'B'
            }
    input.rename(columns= {'Col1':'Col1-' + dict[input]
                           ,'Col2':'Col2-' + dict[input]
                           }, inplace= True)
My test dfs are:
df1:
Col1 Col2
foo bah
foo bah
df2:
Col1 Col2
foo bah
foo bah
Running the function as follows I wish to get:
update_col(df1)
df1:
Col1-A Col2-A
foo bah
foo bah
I think a better way would be:
mydict = {'df1': 'A'
,'df2': 'B'
}
d={'df'+str(e+1):i for e,i in enumerate([df1,df2])} #create a dict of dfs
final_d={k:v.add_suffix('-'+v1) for k,v in d.items() for k1,v1 in mydict.items() if k==k1}
print(final_d)
{'df1': Col1-A Col2-A
0 foo bah
1 foo bah, 'df2': Col1-B Col2-B
0 foo bah
1 foo bah}
you can then access the dfs as final_d['df1'] etc.
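Since the eventual goal is to merge the dfs, a minimal sketch of combining the renamed frames side by side (assuming the rows line up by index):
merged = pd.concat(list(final_d.values()), axis=1)
print(merged)
  Col1-A Col2-A Col1-B Col2-B
0    foo    bah    foo    bah
1    foo    bah    foo    bah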
Note: please don't use dict as a variable name, as it shadows the built-in Python type.
I have a dataframe with repeated values of the same id number, but I want to split the repeated rows into columns.
data = [[10450015,4.4],[16690019,4.1],[16690019,4.0],[16510069,3.7]]
df = pd.DataFrame(data, columns = ['id', 'k'])
print(df)
The resulting dataframe would have columns n_k (n = the occurrence count for a repeated id). Each repeated id gets an individual column, and where an id has no repeated value it gets a 0 in the new column.
data_merged = {'id':[10450015,16690019,16510069], '1_k':[4.4,4.1,3.7], '2_k':[0,4.0,0]}
print(data_merged)
Try assigning a column index reference using DataFrame.assign and groupby.cumcount, then reshape with DataFrame.pivot_table. Finally use a list comprehension to format the column names:
df_new = (df.assign(col=df.groupby('id').cumcount().add(1))
.pivot_table(index='id', columns='col', values='k', fill_value=0))
df_new.columns = [f"{x}_k" for x in df_new.columns]
print(df_new)
          1_k  2_k
id
10450015  4.4  0.0
16510069  3.7  0.0
16690019  4.1  4.0
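If you want id back as a regular column, matching the data_merged layout in the question, a short follow-up:
df_new = df_new.reset_index()
print(df_new)
         id  1_k  2_k
0  10450015  4.4  0.0
1  16510069  3.7  0.0
2  16690019  4.1  4.0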
I have a DataFrame as follows:
a b c d
0 0.140603 0.622511 0.936006 0.384274
1 0.246792 0.961605 0.866785 0.544677
2 0.710089 0.057486 0.531215 0.243285
I want to iterate the df with itertuples() and print the values and column names of each row. Currently I know the following method:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(3,4), columns=['a','b','c','d'])
for item in df.itertuples():
    print(item)
And the output is:
Pandas(Index=0, a=0.55464273035498401, b=0.50784779485386233, c=0.55866384351761911, d=0.35969591433338755)
Pandas(Index=1, a=0.60682158587529356, b=0.37571390304543184, c=0.13566419305411737, d=0.55807909125502775)
Pandas(Index=2, a=0.73260693374584385, b=0.59246381839030349, c=0.92102184020347211, d=0.029942550647279687)
Questions:
1) I thought the return value of each iteration would be a plain tuple (as suggested by the function name), yet type(item) reports Pandas. Why?
2) What is the best way to extract the values of 'a', 'b', 'c', 'd' by column name as I loop through each row?
It's a named tuple.
To access the values of the named tuple, either by label:
for item in df.itertuples():
    print(item.a, item.b)
or by position:
for item in df.itertuples():
    print(item[1], item[2])
When the DataFrame has more than 254 columns, the return type is a plain tuple and the only available access is by position. To still be able to access values by label, restrict df to just the columns you need:
for item in df.loc[:, ['a', 'b']].itertuples():
    print(item.a, item.b)
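For completeness, a small sketch: when the rows do come back as named tuples (254 columns or fewer), the standard namedtuple method _asdict() gives a per-row mapping of field names to values:
for item in df.itertuples():
    row = item._asdict()  # maps field names (Index, a, b, ...) to values
    print(row['a'], row['b'])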