Dataframe column to list of strings (with groupby) - python-3.x

I have a dataframe and I want to get one of its columns as a list of strings, so that from something like:
df = pd.DataFrame({'customer': ['a', 'a', 'a', 'b', 'b'],
                   'location': ['1', '2', '3', '4', '5']})
I can get a dataframe like:
a ['1','2','3']
b ['4','5']
where one column is the customer and another is a list of strings of their location.
I have tried df.astype(str).values.tolist(), but I can't work out how to combine it with groupby to get one list per customer.

Just use
df.groupby('customer').location.unique()
Out[58]:
customer
a [1, 2, 3]
b [4, 5]
Name: location, dtype: object
The values are still strings; the display just omits the quotes:
df.groupby('customer').location.unique()[0][0]
Out[61]: '1'
Also note that strings inside a list do not show quotes in a pandas object column:
pd.Series([['1','2']])
Out[64]:
0 [1, 2]
dtype: object
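If actual Python lists are preferred (groupby().unique() returns NumPy arrays), one alternative is to aggregate with list. A minimal sketch, assuming duplicates and original row order should be kept:

```python
import pandas as pd

df = pd.DataFrame({'customer': ['a', 'a', 'a', 'b', 'b'],
                   'location': ['1', '2', '3', '4', '5']})

# Collect each customer's locations into a plain Python list
out = df.groupby('customer')['location'].agg(list)
print(out['a'])  # ['1', '2', '3']
print(out['b'])  # ['4', '5']
```

Unlike unique(), agg(list) keeps duplicate locations and preserves row order.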

Related

Check if each Column in pandas DF is only values [ 0-9]

I am trying to find the columns in a df whose values contain only the digits 0-9. I want to exclude or flag the columns in this dataframe that contain alphanumeric values.
df_analysis[df_analysis['unique_values'].astype(str).str.contains(r'^[0-9]*$', na=True)]
import pandas as pd
df = pd.DataFrame({"string": ["asdf", "lj;k", "qwer"], "numbers": [6, 4, 5], "more_numbers": [1, 2, 3], "mixed": ["wef", 8, 9]})
print(df.select_dtypes(include=["int64"]).columns.to_list())
print(df.select_dtypes(include=["object"]).columns.to_list())
Create a dataframe with multiple columns, then use .select_dtypes to find the columns that are integers and return them as a list. You can add "float64" or any other numeric type to the include list.
Output:
['numbers', 'more_numbers']
['string', 'mixed']
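To flag digit-only columns directly by their content (closer to the regex attempt in the question), a sketch using str.fullmatch, which is available in pandas 1.1+:

```python
import pandas as pd

df = pd.DataFrame({"string": ["asdf", "lj;k", "qwer"],
                   "numbers": [6, 4, 5],
                   "more_numbers": [1, 2, 3],
                   "mixed": ["wef", 8, 9]})

# A column qualifies when every value, rendered as a string, is digits only
digit_only = df.apply(lambda col: col.astype(str).str.fullmatch(r'[0-9]+').all())
print(df.columns[digit_only].to_list())  # ['numbers', 'more_numbers']
```

This catches the "mixed" column as non-numeric even though it holds some integers, because its string values fail the digit-only test.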

Subsets of a data frame where certain columns satisfy a condition

I have a dataset catalog with 3 columns: product id, brand name and product class.
import pandas as pd
catalog = {'product_id': [1, 2, 3, 1, 2, 4, 3, 5, 6],
           'brand_name': ['FW', 'GW', 'FK', 'FW', 'GW', 'WU', 'FK', 'MU', 'AS'],
           'product_class': ['ACCESSORIES', 'DRINK', 'FOOD', 'ACCESSORIES', 'DRINK', 'FURNITURE', 'FOOD', 'ELECTRONICS', 'APPAREL']}
df = pd.DataFrame(data=catalog)
Assume I have a list of product ids, prod = [1,3,4]. Using Python, I want to list all the brand names corresponding to prod based on product_id. How can I do this using only the groupby() and get_group() functions? I can do it with pd.DataFrame() combined with zip(), but that is too inefficient, as I would need to extract each column individually.
Expected output (in dataframe)
Product_id Brand_name
1 'FW'
3 'FK'
4 'WU'
Can anyone give some help on this?
You can use pandas functions isin() and drop_duplicates() to achieve this:
prod = [1,3,4]
print(df[df.product_id.isin(prod)][["product_id", "brand_name"]].drop_duplicates())
Output:
product_id brand_name
0 1 FW
2 3 FK
5 4 WU
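Since the question asks specifically for groupby() and get_group(), here is a sketch using those functions, with drop_duplicates to collapse the repeated product rows:

```python
import pandas as pd

catalog = {'product_id': [1, 2, 3, 1, 2, 4, 3, 5, 6],
           'brand_name': ['FW', 'GW', 'FK', 'FW', 'GW', 'WU', 'FK', 'MU', 'AS'],
           'product_class': ['ACCESSORIES', 'DRINK', 'FOOD', 'ACCESSORIES', 'DRINK',
                             'FURNITURE', 'FOOD', 'ELECTRONICS', 'APPAREL']}
df = pd.DataFrame(data=catalog)

prod = [1, 3, 4]
groups = df.groupby('product_id')

# Fetch each requested group and stack the pieces back together
result = pd.concat([groups.get_group(p) for p in prod])
result = result[['product_id', 'brand_name']].drop_duplicates()
print(result)
```

The isin() answer above is simpler; this version only shows that the requested functions can produce the same rows.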

Creating a dataframe of Unique Values for each column from another DataFrame

I have a dataframe with some 60+ columns. Out of these, about half are categorical (non-amount columns). However, some of them store categorical data as 1s and 0s, so their dtype is int, or float if the column contains NaN.
I need to create a new dataframe with selected columns of the earlier dataframe as the index and their unique values as the column.
Test Data is as under:
data = pd.DataFrame({'A': ['A', 'B', 'C', 'A', 'B', 'C', 'D'],
                     'B': [1, 0, 1, 0, 1, 0, 1],
                     'C': [10, 20, 30, 40, 50, 60, 70],
                     'D': ['Y', 'N', 'Y', 'N', 'Y', 'N', 'P']})
I did this to get the selected columns from all columns and get unique values for each column.
from operator import itemgetter

cols = itemgetter(0, 1, 3)(data.columns)
uniq_stats = pd.DataFrame(columns=['Val'], index=cols)
for each in cols:
    uniq_stats.loc[each] = ';'.join(data[each].unique())
However, this fails for those columns where the data is categorical but stored in 1s and 0s, and for those columns where there are Null values.
Expected Outcome for Above Test Data:
Val
A A;B;C;D
B 1;0
D Y;N;P
What should I do to get those as well?
I'd like if Null value is also included in the list of unique values.
Use DataFrame.iloc to select the columns by position, then apply a lambda function in DataFrame.agg:
df = data.iloc[:, [0,1,3]].agg(lambda x: ';'.join(x.astype(str).unique())).to_frame('Val')
print (df)
Val
A A;B;C;D
B 1;0
D Y;N;P
A similar idea, but converting only the unique values to strings, so it should be faster:
df = data.iloc[:,[0,1,3]].agg(lambda x:';'.join(str(y) for y in x.unique())).to_frame('Val')
print (df)
Val
A A;B;C;D
B 1;0
D Y;N;P
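Regarding the request to include nulls: astype(str) renders NaN as the string 'nan', so missing values survive the join. A small sketch with a hypothetical NaN added (not the asker's actual data):

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'A': ['A', 'B', np.nan, 'A'],
                     'B': [1, 0, 1, 0]})

# astype(str) turns NaN into 'nan', so nulls appear in the joined string
out = data.agg(lambda x: ';'.join(x.astype(str).unique())).to_frame('Val')
print(out.loc['A', 'Val'])  # A;B;nan
print(out.loc['B', 'Val'])  # 1;0
```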
Okay. I tried the map() function and it works: it now includes both numeric categories and NaN values in the list of unique values.
cols = itemgetter(0, 1, 3)(data.columns)
uniq_stats = pd.DataFrame(columns=['Val'], index=cols)
for each in cols:
    uniq_stats.loc[each] = ';'.join(map(str, data[each].unique()))
However, please share if there's a better and faster way to do this.
I think you can use .stack() with .groupby().unique():
selected_cols = ['A','B']
s = data[selected_cols].stack(dropna=False).groupby(level=[1]).unique()
s.to_frame('vals')
vals
A [A, B, C, D]
B [1, 0]
Another way, using melt:
pd.melt(data).groupby('variable')['value'].unique()
variable
A [A, B, C, D]
B [1, 0]
C [10, 20, 30, 40, 50, 60, 70]
D [Y, N, P]
Name: value, dtype: object
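If only some of the columns are wanted, the frame can be sliced before melting — a sketch on the test data above:

```python
import pandas as pd

data = pd.DataFrame({'A': ['A', 'B', 'C', 'A', 'B', 'C', 'D'],
                     'B': [1, 0, 1, 0, 1, 0, 1],
                     'C': [10, 20, 30, 40, 50, 60, 70],
                     'D': ['Y', 'N', 'Y', 'N', 'Y', 'N', 'P']})

# Melt only the selected columns, then collect unique values per column
out = pd.melt(data[['A', 'B', 'D']]).groupby('variable')['value'].unique()
print(list(out['A']))  # ['A', 'B', 'C', 'D']
print(list(out['D']))  # ['Y', 'N', 'P']
```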

Adding a list of unique values to an existing DataFrame column

I want to add a list of unique values to a DataFrame column. Here is the code:
IDs = set(Remedy['Ticket ID'])
log['ID Incidencias'] = IDs
But I obtain the following error:
ValueError: Length of values does not match length of index
Any idea about how could I add a list of unique values to an existing DataFrame column?
Thanks
Not sure if this is what you really need, but to add a list or set of values to each row of an existing dataframe column you can use:
log['ID Incidencias'] = [IDs] * len(log)
Example:
df = pd.DataFrame({'col1': list('abc')})
IDs = set((1,2,3,4))
df['col2'] = [IDs] * len(df)
print(df)
# col1 col2
#0 a {1, 2, 3, 4}
#1 b {1, 2, 3, 4}
#2 c {1, 2, 3, 4}
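The original error occurs because a direct column assignment needs exactly one value per row; a minimal sketch reproducing it alongside the fix above (the names are illustrative, not the asker's real data):

```python
import pandas as pd

log = pd.DataFrame({'col1': list('abc')})   # 3 rows
IDs = {1, 2, 3, 4}                          # 4 unique ticket IDs

try:
    log['ID Incidencias'] = list(IDs)       # length mismatch: 4 values, 3 rows
except ValueError as err:
    print('assignment failed:', err)

# The fix: wrap the set so every row receives the same single object
log['ID Incidencias'] = [IDs] * len(log)
print(log['ID Incidencias'].iloc[0])  # {1, 2, 3, 4}
```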

Set of values in dataframe columns

Hi all!
I have a dataframe. One column contains strings like this: 'Product1, Product2, foo, bar'.
I've split them on ',' and now I have a column containing lists of product names.
How can I get a set of unique product names?
First flatten the list of lists, then apply set, and finally convert back to a list:
df = pd.DataFrame(data = {'a':['Product1,Product1,foo,bar','Product1,foo,foo,bar']})
print (df)
a
0 Product1,Product1,foo,bar
1 Product1,foo,foo,bar
a=list(set([item for sublist in df['a'].str.split(',').values.tolist() for item in sublist]))
print (a)
['bar', 'foo', 'Product1']
If you want unique values per row:
df = df['a'].str.split(',').apply(lambda x: list(set(x)))
print (df)
0 [bar, foo, Product1]
1 [bar, foo, Product1]
Name: a, dtype: object
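On pandas 0.25+, Series.explode offers a shorter route to the flat list of unique names — a sketch:

```python
import pandas as pd

df = pd.DataFrame(data={'a': ['Product1,Product1,foo,bar', 'Product1,foo,foo,bar']})

# Split each row into a list, explode to one value per row, then deduplicate
unique_products = df['a'].str.split(',').explode().unique().tolist()
print(unique_products)  # ['Product1', 'foo', 'bar']
```

Unlike the set-based version, unique() preserves first-occurrence order.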
