Split Pandas dataframe into multiple dataframes based on empty column delimiter - python-3.x

I'm reading the following Excel sheet into a dataframe.
I want to split it into three dataframes by product. The tables will always be delimited by a single blank column, but each table can have a different number of columns.

Based on the article introduced in the comment, you can process it as follows.
import pandas as pd
# Read the Excel file without treating any row as a header
df = pd.read_excel('test.xlsx', index_col=None, header=None)
# List the columns that are entirely empty (the delimiters)
empcols = [col for col in df.columns if df[col].isnull().all()]
df.fillna('', inplace=True)
# Split the column positions into runs of consecutive non-empty columns
allcols = list(range(len(df.columns)))
start = 0
colslist = []
for sepcol in empcols:
    colslist.append(allcols[start:sepcol])
    start = sepcol + 1
colslist.append(allcols[start:])
# Extract each run of columns and store it in a dictionary keyed by table title
dfdic = {}
for cols in colslist:
    wkdf = df.iloc[:, cols]
    # Row 0 is joined into the table title, row 1 supplies the column names
    title = ''.join(map(str, wkdf.iloc[0].tolist()))
    wkcols = wkdf.iloc[1].tolist()
    # Drop the two header rows without mutating a slice of df in place
    wkdf = wkdf.drop(wkdf.index[[0, 1]])
    wkdf.columns = wkcols
    dfdic[title] = wkdf.reset_index(drop=True)
# Display each DataFrame stored in the dictionary
for k in dfdic:
    print(k)
    print(dfdic[k])
    print()
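The column-splitting step can also be packaged as a small function. This is only a minimal sketch, assuming the sheet was read with header=None so that blank delimiter columns are all NaN; the function name split_on_blank_columns and the sample data are made up for illustration. Unlike the loop above, it also tolerates several blank columns in a row.

```python
import numpy as np
import pandas as pd


def split_on_blank_columns(df):
    """Split df into blocks of adjacent columns separated by all-NaN columns."""
    is_blank = df.isna().all(axis=0).to_numpy()
    blocks, current = [], []
    for pos, blank in enumerate(is_blank):
        if blank:
            # A blank column closes the block collected so far (if any)
            if current:
                blocks.append(df.iloc[:, current])
            current = []
        else:
            current.append(pos)
    if current:
        blocks.append(df.iloc[:, current])
    return blocks


# Hypothetical sheet: two tables separated by one blank column (position 2)
raw = pd.DataFrame([
    ["A", "B", np.nan, "C"],
    [1, 2, np.nan, 3],
    [4, 5, np.nan, 6],
])
parts = split_on_blank_columns(raw)
print([part.shape for part in parts])
```

Each returned block still contains its title and header rows, so the per-table clean-up from the answer above would be applied afterwards.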

Related

How do I subset a pandas dataframe based on a list of column names

I have a client data df with 200+ columns, say A,B,C,D...X,Y,Z. There's a column in this df which has CAMPAIGN_ID in it. I have another data mapping_csv that has CAMPAIGN_ID and set of columns I need from df. I need to split df into one csv file for each campaign, that will have rows from that campaign and only those columns that are as per mapping_csv.
I am getting type error as below.
TypeError: unhashable type: 'list'
This is what I tried.
for campaign in df['CAMPAIGN_ID'].unique():
    df2 = df[df['CAMPAIGN_ID']==campaign]
    # remove blank columns
    df2.dropna(how='all', axis=1, inplace=True)
    for column in df2.columns:
        if df2[column].unique()[0]=="0000-00-00" and df2[column].unique().shape[0]==1:
            df2 = df2.drop(column, axis=1)
    for column in df2.columns:
        if df2[column].unique()[0]=='0' and df2[column].unique().shape[0]==1:
            df2 = df2.drop(column, axis=1)
    # select required columns
    df2 = df2[mapping_csv.loc[mapping_csv['CAMPAIGN_ID']==campaign, 'Variable_List'].str.replace(" ","").str.split(",")]
    file_shape = df2.shape[0]
    filename = "cart_"+str(dt.date.today().strftime('%Y%m%d'))+"_"+campaign+"_rowcnt_"+str(file_shape)
    df2.to_csv(filename+".csv",index=False)
Any help will be appreciated.
This is how data looks like -
This is how mapping looks like -
This addresses your core problem.
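import pandas as pd
df = pd.DataFrame(dict(id=['foo', 'foo', 'bar', 'bar'], a=[1, 2, 3, 4], b=[5, 6, 7, 8], c=[1, 2, 3, 4]))
# Map each id to the columns that should be kept for it
mapper = dict(foo=['a', 'b'], bar=['b', 'c'])
for each_id in df['id'].unique():
    # Exact match on id; str.contains would also match ids that merely contain each_id
    df_id = df.loc[df['id'] == each_id, mapper[each_id]]
    print(df_id)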
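The TypeError in the question most likely comes from indexing df2 with a Series whose single element is a list, rather than with the list itself. Below is a minimal sketch of building the mapper from mapping_csv, assuming one row per campaign with a comma-separated Variable_List column as the question describes; the sample values are made up.

```python
import pandas as pd

# Made-up stand-ins for the client data and the mapping file
df = pd.DataFrame({
    "CAMPAIGN_ID": ["c1", "c1", "c2"],
    "A": [1, 2, 3],
    "B": [4, 5, 6],
    "C": [7, 8, 9],
})
mapping_csv = pd.DataFrame({
    "CAMPAIGN_ID": ["c1", "c2"],
    "Variable_List": ["A, B", "B,C"],
})

# Build {campaign: [column names]} once, stripping spaces as the question does
mapper = (
    mapping_csv.set_index("CAMPAIGN_ID")["Variable_List"]
    .str.replace(" ", "", regex=False)
    .str.split(",")
    .to_dict()
)

subsets = {}
for campaign, columns in mapper.items():
    rows = df["CAMPAIGN_ID"] == campaign
    # columns is a plain list of labels, so it is a valid column selector
    subsets[campaign] = df.loc[rows, columns]

print(subsets["c1"])
```

Each entry of subsets could then be written out with to_csv as in the original loop.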

How to replace a value in a column by the values of multiple dataframe column

In my dataframe, I have multiple columns whose values I would like to merge into one column. For instance, I would like each NaN value in the MEDICATIONS: column to be replaced by a value from any other column, if one exists.
Input:
Expected Output:
df['MEDICATIONS'].combine_first(df["Rest of the columns besides MEDICATIONS:"])
Link of the dataset:
https://drive.google.com/file/d/1cyZ_OWrGNvJyc8ZPNFVe543UAI9snHDT/view?usp=sharing
Something like this?
import pandas as pd
df = pd.read_csv('data - data.csv')
del df['Unnamed: 0']
df['Combined_Meds'] = df.astype(str).values.sum(axis=1)
df['Combined_Meds'] = df['Combined_Meds'].str.replace('nan', '', regex=False)
cols = list(df.columns)
cols = [cols[-1]] + cols[:-1]
df = df[cols]
df.sample(10)
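If the goal is to fill only the missing MEDICATIONS: entries rather than concatenate every column, one alternative is to take the first non-null value from the other columns. This is a minimal sketch on made-up data; the column names other than MEDICATIONS: are hypothetical and not taken from the linked dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "MEDICATIONS:": ["aspirin", np.nan, np.nan],
    "CURRENT MEDS:": [np.nan, "ibuprofen", np.nan],
    "HOME MEDS:": [np.nan, "other", np.nan],
})

others = df.drop(columns="MEDICATIONS:")
# bfill(axis=1) pulls the first non-null value of each row into the first column
first_other = others.bfill(axis=1).iloc[:, 0]
# Only rows where MEDICATIONS: is NaN are filled; existing values are kept
df["MEDICATIONS:"] = df["MEDICATIONS:"].fillna(first_other)
print(df["MEDICATIONS:"].tolist())
```

Rows where every column is empty stay NaN, and no string replacement of 'nan' is needed.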

How to select multiple rows and take the mean value based on the row name

From this data frame I would like to select rows that have the same concentration and almost the same name. For example, the first three rows have the same concentration and the same name except for the suffix Dig_I, Dig_II, Dig_III. I would like to select these three rows, take the mean of each column, and then build a new data frame from the results.
Here is the whole data frame:
import pandas as pd
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
import pandas as pd
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
new_df = df.groupby('concentration').mean(numeric_only=True)
Note: numeric_only=True restricts the averages to numeric columns, so the img_name column is dropped. Recent pandas versions raise a TypeError here if a string column is present and numeric_only is not set.
The same thing as a single chained expression:
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js").groupby('concentration').mean(numeric_only=True)
If you would like to preserve the img_name, merge the averages back onto the original rows:
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
new = df.groupby('concentration').mean(numeric_only=True)
pd.merge(df, new, on='concentration', how='inner', suffixes=('', '_mean'))
Does that help?
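To also require "almost the same name", one option is to strip the trailing _I, _II, _III suffix and group on the remaining name together with the concentration. This is a minimal sketch on made-up data; the value column, the sample names, and the base_name helper column are hypothetical, and the regular expression assumes the suffix consists only of the letter I.

```python
import pandas as pd

df = pd.DataFrame({
    "img_name": ["s1_Dig_I", "s1_Dig_II", "s1_Dig_III", "s2_Dig_I", "s2_Dig_II"],
    "concentration": [10, 10, 10, 20, 20],
    "value": [1.0, 2.0, 3.0, 4.0, 6.0],
})

# Drop a trailing suffix such as _I, _II, _III to get the shared name
df["base_name"] = df["img_name"].str.replace(r"_I+$", "", regex=True)

# Average every numeric column within each (name, concentration) group
new_df = (
    df.groupby(["base_name", "concentration"], as_index=False)
    .mean(numeric_only=True)
)
print(new_df)
```

Using as_index=False keeps base_name and concentration as ordinary columns in the new data frame.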

how to remove duplicates from all sheets in dataframe in python

I have an Excel file with a number of sheets, and I want to delete duplicates from all of them. I used the code below:
df = df.drop_duplicates(subset='Month',keep='last')
After that I save this df:
df.to_excel(path,index=False)
But it only removes duplicates from the first sheet, and the saved file contains only one sheet.
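I would suggest treating each sheet of your document as a separate data frame, then removing the duplicates from each one in a loop according to your criteria. This is a quick draft of the concept I had in mind, for 2 sheets:
import pandas as pd
xls = pd.ExcelFile('myFile.xls')
# Read each sheet into its own DataFrame, keyed by sheet name
xls_dfs = {name: pd.read_excel(xls, name) for name in ['Sheet1', 'Sheet2']}
# Write every sheet through one ExcelWriter; calling df.to_excel(path) inside the
# loop would overwrite the file each time and leave only the last sheet.
# The output name is a placeholder; recent pandas versions cannot write legacy .xls.
with pd.ExcelWriter('myFile_dedup.xlsx') as writer:
    for name, df in xls_dfs.items():
        df = df.drop_duplicates(subset='Month', keep='last')
        df.to_excel(writer, sheet_name=name, index=False)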

Want to copy selected column from old dataframe to a new dataframe column

I have a dataframe named new_df and would like to create a new dataframe and copy the column "Close" into a column named "Col1". I would then open another dataframe, also named new_df, and copy its "Close" column into a column named "Col2" of the new dataframe already created.
It is important to note that the imported columns may vary in length, meaning the first column imported may have 30 records and the second may have 32 records.
df = pd.read_csv('RIO.L.csv',parse_dates=True)
df['Date_1'] = pd.to_datetime(df['Date'], format= '%d/%m/%Y')
df['Year'] = pd.DatetimeIndex(df['Date_1']).year
df['Month'] = pd.DatetimeIndex(df['Date_1']).month
df['Day'] = pd.DatetimeIndex(df['Date_1']).day
df.sort_values(by=['Month','Year','Day'], inplace=True)
m_Year_Select = 2019
m_Month_Select = 5
v_data_select = (df['Year'] <= m_Year_Select) & (df['Month'] == m_Month_Select)
new_df = df.loc[v_data_select]
print(new_df)
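I used pd.concat(). An outer join along axis=1 aligns the columns on the index and fills rows missing from either side with NaN:
result = pd.concat([result, df_2000['Close']], axis=1, sort=False, join='outer')
Problem solved.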
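A minimal sketch of that approach with the Col1 and Col2 names from the question. The two small frames and their index values are made-up stand-ins for the filtered data. Resetting the index first makes the columns line up by position rather than by their original row labels.

```python
import pandas as pd

# Hypothetical stand-ins for two filtered frames of different lengths
first = pd.DataFrame({"Close": [10.0, 11.0, 12.0]}, index=[5, 6, 7])
second = pd.DataFrame({"Close": [20.0, 21.0]}, index=[40, 41])

result = pd.concat(
    [
        # Renumber rows from 0 and rename each Series to its target column
        first["Close"].reset_index(drop=True).rename("Col1"),
        second["Close"].reset_index(drop=True).rename("Col2"),
    ],
    axis=1,
)
print(result)
```

The shorter column is padded with NaN at the end, so columns of 30 and 32 records would give a 32-row result.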
