Table has several columns with the same type of information - pandas-groupby

My table has 4 columns: order_id, item_id_1, item_id_2 and item_id_3. The last three columns carry the same type of information (the ids of products). I want to transform this table into a 2-column table with "order_id" and "item_id", so that each column covers a unique type of information. That means that if 3 products were ordered under a particular order_id, I will get three rows (instead of one) in my new table.
This will allow me, for example, to perform a 'groupby' operation on the 'item_id' column to count how many times a particular product was ordered.
What is this table transformation process called?

This reshaping from wide to long format is usually called melting (or unpivoting), and pandas provides pd.melt for it. For example, if you have a dataframe like this -
import pandas as pd
import numpy as np

df = pd.DataFrame({'order_id':[1,2,3], 'item_id_1':['a','b','c'], 'item_id_2':['x','y',''], 'item_id_3':['','q','']})
df
   order_id item_id_1 item_id_2 item_id_3
0         1         a         x
1         2         b         y         q
2         3         c
(pd.melt(df, id_vars=['order_id'],
         value_vars=['item_id_1', 'item_id_2', 'item_id_3'],
         var_name='item_num', value_name='item_id')
   .replace('', np.nan)
   .dropna()
   .sort_values(['order_id'])
   .reset_index(drop=True)[['order_id', 'item_id']])
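With the sample frame above, this produces one row per ordered item, along the lines of (the row order within each order_id may vary):
   order_id item_id
0         1       a
1         1       x
2         2       b
3         2       y
4         2       q
5         3       c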

I'm not aware of a method that expands rows automatically the way you're describing, but you can easily reach your goal without one. Let's start from a similar data frame; I put NaN in the cells of items that were not ordered:
import pandas as pd
import numpy as np
data = {'order_id':[1,2,3],'item_id_1':[11,12,13],'item_id_2':[21,np.nan,23],'item_id_3':[31,np.nan,np.nan]}
df = pd.DataFrame(data)
cols = ['item_id_1','item_id_2','item_id_3']
print(df)
Out:
   order_id  item_id_1  item_id_2  item_id_3
0         1         11       21.0       31.0
1         2         12        NaN        NaN
2         3         13       23.0        NaN
Then you can define a new empty data frame and fill it by iterating through the rows of the initial one. For every item, a new row is added to the empty data frame with the same order_id and a different item_id.
new_df = pd.DataFrame(columns=['order_id', 'item_id'])  # , 'item_num']
for ind, row in df.iterrows():
    new_row = {}
    new_row['order_id'] = row['order_id']
    for col in cols:  # for num, col in enumerate(cols):
        item = row[col]
        if not pd.isna(item):
            new_row['item_id'] = item
            # new_row['item_num'] = num + 1
            # DataFrame.append was removed in pandas 2.0; concat a one-row frame instead
            new_df = pd.concat([new_df, pd.DataFrame([new_row])], ignore_index=True)
print(new_df)
Out:  # shape (6, 2), which is right because 6 items have been ordered
order_id item_id
0 1.0 11.0
1 1.0 21.0
2 1.0 31.0
3 2.0 12.0
4 3.0 13.0
5 3.0 23.0
If you want, you could also add a third column to keep track of the category of each item (i.e. whether it was item 1, 2 or 3) by uncommenting the lines in the code, which gives you this output:
order_id item_id item_num
0 1.0 11.0 1.0
1 1.0 21.0 2.0
2 1.0 31.0 3.0
3 2.0 12.0 1.0
4 3.0 13.0 1.0
5 3.0 23.0 2.0

Related

Explode pandas rows based on function applied to each row

I have a dataframe df as follows:
Col1 Price Day
A 16 5
B 12 3
D 5 8
I need to apply a function to each row of df:
import pandas as pd, numpy as np

def Fn(Price, Day):
    pr = np.arange(Price/2, Price + Price/2, Price/2)
    da = np.arange(Day/2, Day + Day/2, Day/2)
    return pd.DataFrame({'Price': pr, 'Day': da})
I need to achieve the following:
Col1 Price Day
A 8 2.5
A 16 5
B 6 1.5
B 12 3
D 2.5 4
D 5 8
In reality, the function Fn has something like:
pr = np.arange(Price/18, Price + Price/18, Price/18)
da = np.arange(Day/18, Day+ Day/18, Day/18)
I am not sure how to proceed with the above.
A possible solution, which:
- iterates over the rows of the dataframe by mapping over df.iterrows(),
- applies Fn in each iteration, collecting the resulting dataframe for each row,
- finally concatenates all of those dataframes into a single dataframe.
(pd.concat(map(
    lambda x: pd.concat(
        [pd.Series(x[1]['Col1'], name='Col1'),
         Fn(x[1]['Price'], x[1]['Day'])], axis=1, ignore_index=True),
    df.iterrows()))
 .ffill()
 .set_axis(df.columns, axis=1))
Output:
Col1 Price Day
0 A 8.0 2.5
1 A 16.0 5.0
0 B 6.0 1.5
1 B 12.0 3.0
0 D 2.5 4.0
1 D 5.0 8.0
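For reference, a more conventional spelling of the same idea (not from the original answer, just a sketch) builds one small dataframe per row and concatenates them once at the end:
out = pd.concat(
    [Fn(r.Price, r.Day).assign(Col1=r.Col1) for r in df.itertuples(index=False)],
    ignore_index=True
)[['Col1', 'Price', 'Day']]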

How to find the correspondence of unique values between 2 tables?

I am fairly new to Python and I am trying to create a new function for my project.
The function aims to detect which unique values of a column are present in one table but missing from the other.
First, the function keeps only the unique values of the two tables, then merges them into a new dataframe.
It's the rest that gets complicated, because I would like to return which value is missing and from which table.
If you have any other leads or thought patterns, I'm also interested.
Here is my code:
def correspondance_cle(df1, df2, col):
    df11 = pd.DataFrame(df1[col].unique())
    df11.columns = [col]
    df11['test1'] = 1
    df21 = pd.DataFrame(df2[col].unique())
    df21.columns = [col]
    df21['test2'] = 1
    df3 = pd.merge(df11, df21, on=col, how='outer')
    df3 = df3.loc[(df3['test1'].isna() == True) | (df3['test2'].isna() == True), :]
    df3.info()
    for row in df3[col]:
        if df3['test1'].isna() == True:
            print(row, "is not in df1")
        else:
            print(row, 'is not in df2')
Thanks to everyone who took the time to read the post.
First use an outer join after removing duplicates with Series.drop_duplicates, together with Series.reset_index to avoid losing the original indices:
df1 = pd.DataFrame({'a':[1,2,5,5]})
df2 = pd.DataFrame({'a':[2,20,5,8]})
col = 'a'
df = (df1[col].drop_duplicates().reset_index()
        .merge(df2[col].drop_duplicates().reset_index(),
               indicator=True,
               how='outer',
               on=col))
print (df)
   index_x   a  index_y      _merge
0      0.0   1      NaN   left_only
1      1.0   2      0.0        both
2      2.0   5      2.0        both
3      NaN  20      1.0  right_only
4      NaN   8      3.0  right_only
Then filter rows by helper column _merge:
print (df[df['_merge'].eq('left_only')])
   index_x  a  index_y     _merge
0      0.0  1      NaN  left_only
print (df[df['_merge'].eq('right_only')])
   index_x   a  index_y      _merge
3      NaN  20      1.0  right_only
4      NaN   8      3.0  right_only
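If you want to keep the shape of your original function, a rough sketch along the same lines (using the _merge indicator instead of your test1/test2 helper columns) could be:
def correspondance_cle(df1, df2, col):
    # outer-merge the de-duplicated keys of both tables and keep the merge indicator
    df = (df1[col].drop_duplicates().reset_index()
            .merge(df2[col].drop_duplicates().reset_index(),
                   indicator=True, how='outer', on=col))
    # report every key that is missing from one of the two tables
    for _, row in df[df['_merge'] != 'both'].iterrows():
        if row['_merge'] == 'right_only':
            print(row[col], 'is not in df1')
        else:
            print(row[col], 'is not in df2')
    return df

correspondance_cle(df1, df2, col)
# 1 is not in df2
# 20 is not in df1
# 8 is not in df1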

How to reformat time series to fill in missing entries with NaNs?

I have a problem that involves converting time series from one
representation to another. Each item in the time series has
attributes "time", "id", and "value" (think of it as a measurement
at "time" for sensor "id"). I'm storing all the items in a
Pandas dataframe with columns named by the attributes.
The set of "time"s is a small set of integers (say, 32),
but some of the "id"s are missing "time"s/"value"s. What I want to
construct is an output dataframe with the form:
id time0 time1 ... timeN
val0 val1 ... valN
where the missing "value"s are represented by NaNs.
For example, suppose the input looks like the following:
time id value
0 0 13
2 0 15
3 0 20
2 1 10
3 1 12
Then, assuming the set of possible times is 0, 2, and 3, the
desired output is:
id time0 time1 time2 time3
0 13 NaN 15 20
1 NaN NaN 10 12
I'm looking for a Pythonic way to do this since there are several
million rows in the input and around 1/4 million groups.
You can transform your table with a pivot. If you need to handle duplicate values for index/column pairs, you can use the more general pivot_table.
For your example, the simple pivot is sufficient:
>>> df = df.pivot(index="id", columns="time", values="value")
time     0     2     3
id
0     13.0  15.0  20.0
1      NaN  10.0  12.0
To get the exact result from your question, you could reindex the columns to fill in the empty values, and rename the column index like this:
# add missing time columns, fill with NaNs
df = df.reindex(range(df.columns.max() + 1), axis=1)
# name them "time#"
df.columns = "time" + df.columns.astype(str)
# remove the column index name "time"
df = df.rename_axis(None, axis=1)
Final df:
    time0  time1  time2  time3
id
0    13.0    NaN   15.0   20.0
1     NaN    NaN   10.0   12.0
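As mentioned above, if the same (id, time) pair can occur more than once, pivot will raise an error. A minimal sketch of the pivot_table variant, assuming you want to average duplicates:
df = df.pivot_table(index="id", columns="time", values="value", aggfunc="mean")
The reindex and rename steps afterwards stay the same.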

Pandas groupby and append the original values. Compute the mean per row

I have a dataframe of IDs and Values, where the IDs represent repeated trials and the Values are the results.
I want to group by ID so that the Values for the same ID end up in adjacent columns. Finally, I want to calculate the mean of each row.
>>>df
ID Value
0 1 1.1
1 2 1.2
2 3 2.4
3 1 1.7
4 2 4.3
5 3 2.2
>>>groups = df.groupby(by='ID')
#Now I cannot figure it what to do for my desired output.
I want the output like
ID Value_1 Value_2 Mean
0 1 1.1 1.7 1.4
1 2 1.2 4.3 2.75
2 3 2.4 2.2 2.3
Use DataFrame.assign to create a new column with a per-group counter from GroupBy.cumcount, reshape with DataFrame.pivot, change the column names with DataFrame.add_prefix, add a new column filled with the row means, and finally clean up with DataFrame.reset_index and DataFrame.rename_axis:
df = (df.assign(g = df.groupby('ID').cumcount().add(1))
        .pivot(index='ID', columns='g', values='Value')
        .add_prefix('Value_')
        .assign(Mean = lambda x: x.mean(axis=1))
        .reset_index()
        .rename_axis(None, axis=1))
print (df)
ID Value_1 Value_2 Mean
0 1 1.1 1.7 1.40
1 2 1.2 4.3 2.75
2 3 2.4 2.2 2.30
One possible solution, assuming that you have 2 rows for each ID:
Define a function to be applied to groups:
def fn(grp):
    vals = grp.Value.values
    return [vals[0], vals[-1], grp.Value.mean()]
Then apply it and "move" ID column from index to regular column:
df2 = df.groupby('ID').apply(fn).apply(pd.Series).reset_index()
And the last point is to set proper column names:
df2.columns=[ 'ID', 'Value_1', 'Value_2', 'Mean' ]

Creating sqlite table from csv files with different column names

I have a large number of .csv files that I would like to put into a sqlite database. Most of the files contain the same column names, but some files have extra columns.
The code that I've tried is (altered to be generic):
import os
import pandas as pd
import sqlite3
conn = sqlite3.connect('test.db')
cur = conn.cursor()
os.chdir(dir)
for file in os.listdir(dir):
    df = pd.read_csv(file)
    df.to_sql('X', conn, if_exists = 'append')
When it encounters a file with a column that is not in table X, I get the error:
OperationalError: table X has no column named ColumnZ
How can I alter my code to append the table with the new column and fill previous rows with NaN?
If all DataFrames can fit into RAM, you can do this:
import glob
files = glob.glob(r'/path/to/csv_files/*.csv')
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
df.to_sql('X', conn, if_exists = 'replace')
Demo:
In [22]: d1
Out[22]:
a b
0 0 1
1 2 3
In [23]: d2
Out[23]:
a b c
0 1 2 3
1 4 5 6
In [24]: d3
Out[24]:
x b
0 11 12
1 13 14
In [25]: pd.concat([d1,d2,d3], ignore_index=True)
Out[25]:
a b c x
0 0.0 1 NaN NaN
1 2.0 3 NaN NaN
2 1.0 2 3.0 NaN
3 4.0 5 6.0 NaN
4 NaN 12 NaN 11.0
5 NaN 14 NaN 13.0
Alternatively, you can keep track of the columns seen so far, check in a loop whether a new DataFrame has additional columns, and add those columns to the SQLite table with the SQLite ALTER TABLE statement (a rough sketch follows below):
ALTER TABLE tab_name ADD COLUMN ...
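A minimal sketch of that incremental approach (assuming the files list and the conn/cur objects from the snippets above, and no NOT NULL constraints on table X):
known_cols = set()
for file in files:
    df = pd.read_csv(file)
    if known_cols:
        # the table already exists: add any columns this file introduces
        for c in set(df.columns) - known_cols:
            cur.execute(f'ALTER TABLE X ADD COLUMN "{c}"')
        conn.commit()
    known_cols |= set(df.columns)
    # files that lack some of the table's columns simply leave NULLs there
    df.to_sql('X', conn, if_exists='append')
The first file creates the table via to_sql; after that, only genuinely new columns trigger an ALTER TABLE.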

Resources