I have a dataset catalog with three columns: product_id, brand_name, and product_class.
import pandas as pd
catalog = {'product_id': [1, 2, 3, 1, 2, 4, 3, 5, 6],
           'brand_name': ['FW', 'GW', 'FK', 'FW', 'GW', 'WU', 'FK', 'MU', 'AS'],
           'product_class': ['ACCESSORIES', 'DRINK', 'FOOD', 'ACCESSORIES', 'DRINK', 'FURNITURE', 'FOOD', 'ELECTRONICS', 'APPAREL']}
df = pd.DataFrame(data=catalog)
Assume I have a list of product IDs, prod = [1, 3, 4]. Now, with Python, I want to list all the brand names corresponding to this list prod based on product_id. How can I do this using only the groupby() and get_group() functions? I can do it with pd.DataFrame() combined with zip(), but that is too inefficient, as I would need to obtain each column individually.
Expected output (in dataframe)
product_id  brand_name
1           'FW'
3           'FK'
4           'WU'
Can anyone give some help on this?
You can use pandas functions isin() and drop_duplicates() to achieve this:
prod = [1,3,4]
print(df[df.product_id.isin(prod)][["product_id", "brand_name"]].drop_duplicates())
Output:
product_id brand_name
0 1 FW
2 3 FK
5 4 WU
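If you specifically want the groupby()/get_group() pair named in the question, a minimal sketch using the df and prod defined above is:
grouped = df.groupby('product_id')
result = pd.concat([grouped.get_group(p) for p in prod])
print(result[['product_id', 'brand_name']].drop_duplicates())
This yields the same three rows, but note it is more verbose than isin(), and get_group() raises a KeyError if an id in prod does not occur in the frame.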
I am trying to scan the columns of a df for values that only contain the digits 0-9. I want to exclude or flag the columns in this dataframe that contain alphabetic/alphanumeric values. Here is what I have tried:
df_analysis[df_analysis['unique_values'].astype(str).str.contains(r'^[0-9]*$', na=True)]
Create a dataframe with multiple columns, then use .select_dtypes to find the columns that are integers and return them as a list. You can add "float64" or any other numeric type to the include list.
import pandas as pd
df = pd.DataFrame({"string": ["asdf", "lj;k", "qwer"], "numbers": [6, 4, 5], "more_numbers": [1, 2, 3], "mixed": ["wef", 8, 9]})
print(df.select_dtypes(include=["int64"]).columns.to_list())
print(df.select_dtypes(include=["object"]).columns.to_list())
Output:
['numbers', 'more_numbers']
['string', 'mixed']
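If you do want the regex route from your attempt, a minimal sketch reusing the df above flags the columns where any value is not purely digits 0-9 (this assumes pandas >= 1.1 for Series.str.fullmatch, and that every value can be cast to str):
flagged = [col for col in df.columns
           if not df[col].astype(str).str.fullmatch(r'[0-9]+').all()]
print(flagged)
# ['string', 'mixed']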
I want to add a list of unique values to a DataFrame column. Here is the code:
IDs = set(Remedy['Ticket ID'])
log['ID Incidencias'] = IDs
But I obtain the following error:
ValueError: Length of values does not match length of index
Any idea how I could add a list of unique values to an existing DataFrame column?
Thanks
Not sure if this is what you really need, but to add a list or set of values to each row of an existing dataframe column you can use:
log['ID Incidencias'] = [IDs] * len(log)
Example:
import pandas as pd
df = pd.DataFrame({'col1': list('abc')})
IDs = set((1,2,3,4))
df['col2'] = [IDs] * len(df)
print(df)
#   col1          col2
# 0    a  {1, 2, 3, 4}
# 1    b  {1, 2, 3, 4}
# 2    c  {1, 2, 3, 4}
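If the goal was instead one unique value per row (which is what the ValueError hints at: the set's length differs from the index length), a hedged sketch builds a new dataframe from the set; the name incidencias is illustrative, and sorted() is used only because sets are unordered:
incidencias = pd.DataFrame({'ID Incidencias': sorted(IDs)})
print(incidencias)
#    ID Incidencias
# 0               1
# 1               2
# 2               3
# 3               4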
I have a dataframe and I want to get one of its columns as a list of strings, so that from something like:
df = pd.DataFrame({'customer': ['a', 'a', 'a', 'b', 'b'],
                   'location': ['1', '2', '3', '4', '5']})
I can get a dataframe like:
a ['1','2','3']
b ['4','5']
where one column is the customer and another is a list of strings of their location.
I have tried df.astype(str).values.tolist(), but I can't see how to group by customer to get one list per customer.
Just use
df.groupby('customer').location.unique()
Out[58]:
customer
a [1, 2, 3]
b [4, 5]
Name: location, dtype: object
The values are still strings; pandas just does not display the quotes:
df.groupby('customer').location.unique()[0][0]
Out[61]: '1'
Also, note that strings inside a list do not show quotes in a pandas object:
pd.Series([['1','2']])
Out[64]:
0 [1, 2]
dtype: object
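If you need actual Python lists rather than the numpy arrays unique() returns, and want to keep duplicate locations, a common alternative is apply(list):
df.groupby('customer')['location'].apply(list)
With the sample data this gives the same grouping, but each cell is a real list.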
I am scraping data from a website which builds a Pandas dataframe with different column names dependent on the data available on the site. I have a vector of column names, say:
colnames = ['column1', 'column2', 'column3', 'column5']
which are the columns of a postgres database for which I wish to store the scraped data in.
The problem is that, the way I have had to set up the scraping to get all the data I require, I end up grabbing some columns which I have no use for and which aren't in my postgres database. These columns will not have the same names each time, as some pages have extra data, so I can't simply exclude the column names I don't want, since I don't know what all of them will be. There will also be columns in my postgres database for which the data will not be scraped every time.
Hence, when I try and upload the resulting dataframe to postgres, I get the error:
psycopg2.errors.UndefinedColumn: column "column4" of relation "my_db" does not exist
This leads to my question:
How do I subset the resulting pandas dataframe using the column names I have stored in the vector, given some of the columns may not exist in the dataframe? I have tried my_dt = my_dt[colnames], which returns the error:
KeyError: ['column1', 'column2', 'column3'] not in index
Reproducible example:
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  columns=['column1', 'column2', 'column3', 'column4'])
subset_columns = ['column1', 'column2', 'column3', 'column5']
test = df[subset_columns]
Any help will be appreciated.
You can simply do:
colnames = ['column1', 'column2', 'column3', 'column5']
df[df.columns.intersection(colnames)]
(On older pandas this was often written df[df.columns & colnames], but the & set operation on an Index is deprecated in recent versions; intersection() is the supported spelling.)
I managed to find a fix, though I still don't understand why the initial KeyError listed a whole vector of names rather than just the elements which weren't columns of my dataframe:
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  columns=['column1', 'column2', 'column3', 'column4'])
subset_columns = ['column1', 'column2', 'column3', 'column5']
column_match = set(subset_columns) & set(df.columns)
df = df[list(column_match)]  # newer pandas rejects a raw set as an indexer
Out[69]:
column2 column1 column3
0 2 1 3
1 6 5 7
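For reference, a pandas built-in that covers this case is DataFrame.filter: with items= it keeps only the listed labels that exist and preserves the order you pass, avoiding the unordered set. Applied to the original example df:
df.filter(items=subset_columns)
#    column1  column2  column3
# 0        1        2        3
# 1        5        6        7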
I have a dataframe like this
import pandas as pd
df = pd.DataFrame({'item': [1, 1, 2, 2],
                   'user': [1, 2, 2, 1],
                   'appraisal': [4, 2, 1, 3],
                   'feedback': ['good', 'bad', 'bad', 'well']})
names = ['item', 'user', 'appraisal', 'feedback' ]
df = df[names]
df
I want to get a dataframe as below
   item  appraisal  feedback
0     1          3  good bad
1     2          2  bad well
where 'item' is 'item' from df, 'appraisal' is the average of df.appraisal for each item, and 'feedback' is the combined cells of df.feedback for each item.
I can get two of the variables with:
df1 = df.groupby('item')['appraisal'].mean()
But how do I combine the text cells? I could make a pivot_table with item / user and 'feedback' as the value and then add the cells user1+user2.....,
but the real dataset has many unique values and I don't think that's the best approach.
Thanks for the help.
You can use the GroupBy.agg() method:
In [4]: df.groupby('item').agg({'appraisal':'mean','feedback':' '.join})
Out[4]:
appraisal feedback
item
1 3 good bad
2 2 bad well
or if you need a "flat" DF, use as_index=False as #John Galt has recommended:
In [5]: df.groupby('item', as_index=False).agg({'appraisal':'mean','feedback':' '.join})
Out[5]:
item appraisal feedback
0 1 3 good bad
1 2 2 bad well
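On pandas >= 0.25 the same result can also be written with named aggregation, which names the output columns explicitly and flattens the result without a dict:
df.groupby('item', as_index=False).agg(appraisal=('appraisal', 'mean'),
                                       feedback=('feedback', ' '.join))
This produces the same flat DF as the as_index=False variant above.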