How can I reset the index in a .groupby output? - python-3.x

I have the following dataframe. I want to group the name and brand columns by their respective unique values.
[image: dataframe]
I wrote the following python code to group them:
high_products = products.reset_index().groupby(['name', 'brand'])[['name', 'brand', 'count_name']]
and printed the output with the following code:
high_products.head().sort_values(by='count_name', ascending=False)
[image: grouped]
As you can see from the image above, it appears to be grouping them by the index as well. In the end, I just want the unique name and brand values and their respective count_name values.
How can I do this with .groupby?
Thank you.

Change the first line of your code to the following, which applies a count aggregation to each (name, brand) group and then turns the group keys back into columns:
high_products = products.groupby(['name', 'brand']).agg(['count']).reset_index()
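A minimal sketch of the same idea on made-up data, in case it helps to see the output shape. It assumes products has the columns name, brand and count_name; the sample values are hypothetical. It is not clear from the question whether the existing count_name values should be summed per pair or whether rows should be counted, so both variants are shown.

```python
import pandas as pd

# Hypothetical data standing in for the `products` frame in the question.
products = pd.DataFrame({
    "name": ["soap", "soap", "soap", "brush"],
    "brand": ["A", "A", "B", "A"],
    "count_name": [3, 2, 5, 1],
})

# One row per unique (name, brand) pair, summing the existing counts.
# as_index=False keeps name and brand as ordinary columns.
summed = (
    products.groupby(["name", "brand"], as_index=False)["count_name"]
    .sum()
    .sort_values(by="count_name", ascending=False)
)

# Alternatively, count how many rows fall into each pair.
sizes = products.groupby(["name", "brand"]).size().reset_index(name="n_rows")

print(summed)
print(sizes)
```

Either way the result has flat column names, whereas agg(['count']) produces two-level column labels.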

Related

get the particular values in a one column based on the value in other

I have two columns and I would like to get the values of column shipping_req_key based on one pick_order. The dataframe looks like this:
shipping_req_key pick_order
5029338170 480280603713
5029338145 480280712615
5029338145 480280804414
5029338145 480280807715
I would like to get the shipping_req_key corresponding to the particular pick_order. I would appreciate your feedback.
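If you want to list the shipping_req_key values for every pick_order, try this. It prints each pick_order together with its corresponding keys:
pick_list = df.pick_order.unique()
for order in pick_list:
    print(order, df[df['pick_order']==order]['shipping_req_key'].to_list())
For a particular pick_order (the comparison value must have the same type as the column, so drop the quotes if pick_order holds integers):
df[df['pick_order']=='480280603713']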
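A short sketch of the same lookup without the explicit loop, using the sample rows from the question. It assumes both columns are stored as integers, which the question does not state.

```python
import pandas as pd

# Sample rows copied from the question; integer dtype is an assumption.
df = pd.DataFrame({
    "shipping_req_key": [5029338170, 5029338145, 5029338145, 5029338145],
    "pick_order": [480280603713, 480280712615, 480280804414, 480280807715],
})

# One pick_order: boolean mask and column label in a single .loc call.
keys = df.loc[df["pick_order"] == 480280603713, "shipping_req_key"].tolist()

# Every pick_order at once: one list of keys per pick_order.
keys_by_order = df.groupby("pick_order")["shipping_req_key"].agg(list).to_dict()

print(keys)
print(keys_by_order)
```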

dynamically create columns in pyspark sql select statement

I have a pyspark dataframe called unique_attributes. The dataframe has columns productname, productbrand, producttype, weight, id. I am partitioning by some columns and trying to get the first value of the id column using a window function. I would like to be able to dynamically pass a list of columns to partition by. For example, if I wanted to add the weight column to the partition, I would like to pass a list instead of coding another col('weight') in the select. Does anyone have a suggestion for how to accomplish this? I have an example below.
current code:
w2 = Window().partitionBy(['productname',
                           'productbrand',
                           'producttype']).orderBy(unique_attributes.id.asc())
first_item_id_df = unique_attributes\
    .select(col('productname'),
            col('productbrand'),
            col('producttype'),
            first("id", True).over(w2).alias('matchid')).distinct()
desired dynamic code:
column_list = ['productname',
               'productbrand',
               'producttype',
               'weight']
w2 = Window().partitionBy(column_list).orderBy(unique_attributes.id.asc())
# somehow creates
first_item_id_df = unique_attributes\
    .select(col('productname'),
            col('productbrand'),
            col('producttype'),
            col('weight'),
            first("id", True).over(w2).alias('matchid')).distinct()

How to get the column names after a table is extracted from a PDF file using camelot? I'm new to this

Briefly, I am doing these steps:
tables = camelot.read_pdf(doc_file)
tables[0].df
I am using tables[0].df.columns to get column names from the extracted table.
But it does not give the column names.
Tables extracted by Camelot have no alphabetic column names.
For a three-column table, for example, tables[0].df.columns returns:
RangeIndex(start=0, stop=3, step=1)
Instead, you can try to read the first row and get a list from it: tables[0].df.iloc[0].tolist().
The output could be:
['column1', 'column2', 'column3']
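If the first extracted row really is the header, it can also be promoted to column names. The sketch below uses a hand-built pandas DataFrame as a stand-in for tables[0].df so that it runs without a PDF; the cell values are made up, and whether row 0 holds the header depends on the document.

```python
import pandas as pd

# Stand-in for tables[0].df: integer column labels, header text in row 0.
raw = pd.DataFrame([
    ["column1", "column2", "column3"],
    ["a", "1", "x"],
    ["b", "2", "y"],
])

header = raw.iloc[0].tolist()                # first row as a list of names
table = raw.iloc[1:].reset_index(drop=True)  # remaining rows, renumbered from 0
table.columns = header

print(table.columns.tolist())
```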

Merge multiple dataframes using multiindex in python

I have 3 series which are generated by code like that shown below. I have shown the code for one series.
I would like to merge 3 such series/dataframes using the columns (subject_id, hadm_id, icustay_id), but unfortunately these headings don't appear as column names. How do I convert them to columns and use them for merging with another series/dataframe of a similar datatype?
I am generating the series from another dataframe (df) based on the condition given below. I already tried converting this series to a dataframe, but it still doesn't show the ids as columns; instead it shows them as the index. I would like to see 'subject_id', 'hadm_id', 'icustay_id' as column names in the dataframe along with the other column 'val_bw_80_110', so that I can join with other dataframes using these 3 ids.
s1 = df.groupby(['subject_id','hadm_id','icustay_id'])['val_bw_80_110'].mean()
I expect an output where the ids (subject_id, hadm_id, icustay_id) are converted to column names and can be used for joining/merging with other dataframes.
You can add parameter as_index=False to DataFrame.groupby or use Series.reset_index:
df = df.groupby(['subject_id','hadm_id','icustay_id'], as_index=False)['val_bw_80_110'].mean()
Or:
df = df.groupby(['subject_id','hadm_id','icustay_id'])['val_bw_80_110'].mean().reset_index()
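Once the ids are ordinary columns, the three results can be merged on them. This is only a sketch: the data is invented, and the column names val_lt_80 and val_gt_110 are hypothetical stand-ins for the other two series, which the question does not show.

```python
from functools import reduce

import pandas as pd

keys = ["subject_id", "hadm_id", "icustay_id"]

# Hypothetical source frame; only val_bw_80_110 appears in the question.
df = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "hadm_id": [10, 10, 20],
    "icustay_id": [100, 100, 200],
    "val_bw_80_110": [1.0, 0.0, 1.0],
    "val_lt_80": [0.0, 1.0, 0.0],
    "val_gt_110": [0.0, 0.0, 0.0],
})

value_cols = ["val_bw_80_110", "val_lt_80", "val_gt_110"]

# One aggregated frame per value column, with the ids kept as columns.
frames = [df.groupby(keys, as_index=False)[c].mean() for c in value_cols]

# Merge the frames pairwise on the three id columns.
merged = reduce(lambda left, right: left.merge(right, on=keys, how="outer"), frames)

print(merged)
```

how="outer" keeps id combinations that are missing from one of the frames; use how="inner" if only complete rows are wanted.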

Iterating over rows of dataframe but keep each row as a dataframe

I want to iterate over the rows of a dataframe, but keep each row as a dataframe that has the exact same format of the parent dataframe, except with only one row. I know about calling DataFrame() and passing in the index and columns, but for some reason this doesn't always give me the same format of the parent dataframe. Calling to_frame() on the series (i.e. the row) does cast it back to a dataframe, but often transposed or in some way different from the parent dataframe format. Isn't there some easy way to do this and guarantee it will always be the same format for each row?
Here is what I came up with as my best solution so far:
def transact(self, orders):
    # Buy or Sell
    if len(orders) > 1:
        empty_order = orders.iloc[0:0]
        for index, order in orders.iterrows():
            empty_order.loc[index] = order
            #empty_order.append(order)
            self.sub_transact(empty_order)
    else:
        self.sub_transact(orders)
In essence, I empty the dataframe and then insert the series, from the For loop, back into it. This works correctly, but gives the following warning:
C:\Users\BNielson\Google Drive\My Files\machine-learning\Python-Machine-Learning\ML4T_Ex2_1.py:57: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
empty_order.loc[index] = order
C:\Users\BNielson\Anaconda3\envs\PythonMachineLearning\lib\site-packages\pandas\core\indexing.py:477: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item] = s
So it's this line giving the warning:
empty_order.loc[index] = order
This is particularly strange because I am already using .loc, and normally you get this warning when you don't use .loc.
There is a much, much easier way to do what I want:
order.to_frame().T
So...
if len(orders) > 1:
    for index, order in orders.iterrows():
        self.sub_transact(order.to_frame().T)
else:
    self.sub_transact(orders)
What this actually does is translate the series (which still contains the necessary column and index information) back to a dataframe. But for some moronic (but I'm sure Pythonic) reason it transposes it, so that the previous row is now a column and the previous columns are now multiple rows. So you just transpose it back.
Use groupby with a unique list. groupby does exactly what you are asking for: it iterates over the groups, and each group is a dataframe. So if you group by a value that is unique for every row, you get a single-row dataframe for each group as you iterate:
for n, group in df.groupby(np.arange(len(df))):
    pass
    # do stuff
If I can suggest an alternative way, then it would be like this:
for index, order in orders.iterrows():
    orders.loc[index:index]
orders.loc[index:index] is a one-row dataframe slice with the same structure, including index and column names.
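A small comparison of these approaches on a made-up orders frame (the column names and values are hypothetical). One difference worth knowing: a row taken from iterrows is a Series, so with mixed column types order.to_frame().T gives object-dtype columns, while slicing the parent frame keeps the original dtypes. The label slice assumes unique index labels; selecting by position with iloc avoids that assumption.

```python
import pandas as pd

# Hypothetical orders frame with mixed column types.
orders = pd.DataFrame(
    {"symbol": ["AAPL", "IBM"], "shares": [10, 5], "price": [1.5, 2.0]},
    index=["o1", "o2"],
)

# iterrows yields each row as a Series; with mixed types it is object dtype.
index, order = next(orders.iterrows())
via_transpose = order.to_frame().T   # one row, same labels, dtypes not preserved
via_loc = orders.loc[index:index]    # label slice; assumes unique index labels

# Positional selection with a list keeps the frame shape and the dtypes.
one_row_frames = [orders.iloc[[pos]] for pos in range(len(orders))]

print(via_transpose.dtypes)
print(via_loc.dtypes)
```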
