Transform dataframe from wide to long while breaking apart column names - python-3.x

I have a dataframe that looks like this:
CustomerID CustomerStatus CustomerTier Order.Blue.Good Order.Green.Bad Order.Red.Good
----------------------------------------------------------------------------------------------------------------
101 ACTIVE PREMIUM NoticeABC: Good 5 NoticeYAF: Bad 1 NoticeAFV: Good 4
102 INACTIVE DIAMOND NoticeTAC: Bad 3
I'm trying to transform it to look like this:
CustomerID CustomerStatus CustomerTier Color Outcome NoticeCode NoticeDesc
----------------------------------------------------------------------------------------------------------------
101 ACTIVE PREMIUM Blue Good NoticeABC Good 5
101 ACTIVE PREMIUM Green Bad NoticeYAF Bad 1
101 ACTIVE PREMIUM Red Good NoticeAFV Good 4
102 INACTIVE DIAMOND Green Bad NoticeTAC Bad3
I believe this is just a wide-to-long data transformation, which I tried using this approach:
df = pd.wide_to_long(df, ['Order'], i=['CustomerID','CustomerStatus','CustomerTier'], j='Color', sep='.')
However, this is returning an empty dataframe. I'm sure I'm doing something wrong with the separator--perhaps because there are 2 of them in the column names?
I feel like splitting the column names into Color, Outcome, NoticeCode, and NoticeDesc would be relatively easy once I figure out how to do this conversion, but just struggling with this part!
Any helpful tips to point me in the right direction would be greatly appreciated! Thank you!

I believe this would need to be solved with two separate calls to pd.wide_to_long as so:
# To set "Outcome" column
df = pd.wide_to_long(df,
stubnames = ['Order.Blue', 'Order.Green', 'Order.Red'],
i = ['CustomerID','CustomerStatus','CustomerTier'],
j = 'Outcome',
sep = '.')
df = pd.wide_to_long(df
stubnames = 'Order',
i = ['CustomerID', 'CustomerStatus', 'CustomerTier'],
j = 'Color',
sep= '.')
Then, to split the Notice column, you could use pd.str.split as so:
df[['NoticeCode', 'NoticeDesc']] = df['Outcome'].str.split(': ', expand=True)
Let me know how this goes and we can workshop a bit!

Related

Multilevel index of rows of a dataframe using pandas [duplicate]

I've spent hours browsing everywhere now to try to create a multiindex from dataframe in pandas. This is the dataframe I have (posting excel sheet mockup. I do have this in pandas dataframe):
And this is what I want:
I have tried
newmulti = currentDataFrame.set_index(['user_id','account_num'])
But it returns a dataframe, not a multiindex. Also, I could not figure out how to make 'user_id' level 0 and 'account_num' level 1. I think this must be trivial but I've read so many posts, tutorials, etc. and still could not figure it out. Partly because I'm a very visual person and most posts are not. Please help!
You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.
df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
You should also be able to simply do this:
df.set_index(['user_id', 'account_num', 'dates'])
Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num and date values but different sales figures) by summing them, which is why I recommended using groupby.
If you need the multi-index, you can simply access viat new_df.index where new_df is the new dataframe created from either of the two operations above.
And user_id will be level 0 and account_num will be level 1.
For clarification of future users I would like to add the following:
As said by Alexander,
df.set_index(['user_id', 'account_num', 'dates'])
with a possible inplace=True does the job.
The type(df) gives
pandas.core.frame.DataFrame
whereas type(df.index) is indeed the expected
pandas.core.indexes.multi.MultiIndex
Use pd.MultiIndex.from_arrays
lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])
There are two ways to do it, albeit not exactly like you have shown, but it works.
Say you have the following df:
A B C D
0 nil one 1 NaN
1 bar one 5 5.0
2 foo two 3 8.0
3 bar three 2 1.0
4 foo two 4 2.0
5 bar two 6 NaN
1. Workaround 1:
df.set_index('A', append = True, drop = False).reorder_levels(order = [1,0]).sort_index()
This will return:
2. Workaround 2:
df.set_index(['A', 'B']).sort_index()
This will return:
The DataFrame returned by currentDataFrame.set_index(['user_id','account_num']) has it's index set to ['user_id','account_num']
newmulti.index will return the MultiIndex object.

Combine multiple rows based on Id and other column using pandas python [duplicate]

I've spent hours browsing everywhere now to try to create a multiindex from dataframe in pandas. This is the dataframe I have (posting excel sheet mockup. I do have this in pandas dataframe):
And this is what I want:
I have tried
newmulti = currentDataFrame.set_index(['user_id','account_num'])
But it returns a dataframe, not a multiindex. Also, I could not figure out how to make 'user_id' level 0 and 'account_num' level 1. I think this must be trivial but I've read so many posts, tutorials, etc. and still could not figure it out. Partly because I'm a very visual person and most posts are not. Please help!
You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.
df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
You should also be able to simply do this:
df.set_index(['user_id', 'account_num', 'dates'])
Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num and date values but different sales figures) by summing them, which is why I recommended using groupby.
If you need the multi-index, you can simply access viat new_df.index where new_df is the new dataframe created from either of the two operations above.
And user_id will be level 0 and account_num will be level 1.
For clarification of future users I would like to add the following:
As said by Alexander,
df.set_index(['user_id', 'account_num', 'dates'])
with a possible inplace=True does the job.
The type(df) gives
pandas.core.frame.DataFrame
whereas type(df.index) is indeed the expected
pandas.core.indexes.multi.MultiIndex
Use pd.MultiIndex.from_arrays
lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])
There are two ways to do it, albeit not exactly like you have shown, but it works.
Say you have the following df:
A B C D
0 nil one 1 NaN
1 bar one 5 5.0
2 foo two 3 8.0
3 bar three 2 1.0
4 foo two 4 2.0
5 bar two 6 NaN
1. Workaround 1:
df.set_index('A', append = True, drop = False).reorder_levels(order = [1,0]).sort_index()
This will return:
2. Workaround 2:
df.set_index(['A', 'B']).sort_index()
This will return:
The DataFrame returned by currentDataFrame.set_index(['user_id','account_num']) has it's index set to ['user_id','account_num']
newmulti.index will return the MultiIndex object.

Code optimisation - comparing two datetime columns by month and creating a new column too slow

I am trying to create a new column in Pandas dataframe. If the other two date columns in my dataframe share the same month, then this new column should have 1 as a value, otherwise 0. Also, I need to check that ids match my other list of ids that I have saved previously in another place and mark those only with 1. I have some code but it is useless since I am dealing with almost a billion of rows.
my_list_of_ids = df[df.bool_column == 1].id.values
def my_func(date1, date2):
for id_ in df.id:
if id_ in my_list_of_ids:
if date1.month == date2.month:
my_var = 1
else:
my_var = 0
else:
my_var = 0
return my_var
df["new_column"] = df.progress_apply(lambda x: my_func(x['date1'], x['date2']), axis=1)
Been waiting for 30 minutes and still 0%. Any help is appreciated.
UPDATE (adding an example):
id | date1 | date2 | bool_column | new_column |
id1 2019-02-13 2019-04-11 1 0
id1 2019-03-15 2019-04-11 0 0
id1 2019-04-23 2019-04-11 0 1
id2 2019-08-22 2019-08-11 1 1
id2 ....
id3 2019-09-01 2019-09-30 1 1
.
.
.
What I need to do is save the ids that are 1 in my bool_column, then I am looping through all of the ids in my dataframe and checking if they are in the previously created list (= 1). Then I want to compare month and the year of date1 and date2 columns and if they are the same, create a new_column with a value 1 where they mach, otherwise, 0.
The pandas way to do this is
mask = ((df['date1'].month == df['date2'].month) & (df['id'].isin(my_list_of_ids)))
df['new_column'] = mask.replace({False: 0, True: 1})
Since you have a large data-set, this will take time, but should be faster than using apply
The best way to deal with the month match is to use vectorization in pandas and do this:
new_column = (df.date1.dt.month == df.date2.dt.month).astype(int)
That is, avoid using apply() over the DataFrame (which will probably be iterative) and take advantage of the underlying numpy vectorization. The gateway to such functionality is almost always in families of Series functions and properties, like the dt family for dates, str family for strings, and so forth.
Luckily, you have pre-computed the id_list membership in your bool_column, so to add membership as a criterion, just do this:
new_column = ((df.date1.dt.month == df.date2.dt.month) & df.bool_column).astype(int)
Once again, the & of two Series takes advantage of vectorization. You stay inside boolean space till the end, then cast to int with astype(int). Reviewing your code, it occurs to me that the iterative checking of your id_list may be the real performance hit here, even more so than the DataFrame.apply(). Whatever you do, avoid at all costs iterating your id_list at each row, since you already have a vector denoting membership in your bool_column.
By the way I believe there's a tiny error in your example data, the new_column value for your third row should be 0, since your bool_column value there is 0.

Exporting a list as a new column in a pandas dataframe as part of a nested for loop

I am inputting multiple spreadsheets with multiple columns of data. For each spreadsheet, the maximum value of each column is found. Then, for each element in the column, the element is divided by the maximum value of that column. The output should be a value (between 0 and 1) for each element in the column in ascending order. This is appended to a list which should be added to the source spreadsheet as a column.
Currently, the nested loops are performing correctly apart from the final step, as far as I understand. Each column is added to the spreadsheet EXCEPT the values are for the final column of the source spreadsheet rather than values related to each individual column.
I have tried changing the indents to associate levels of the code with different parts (as I think this is the problem) and tried moving the appended column along in the dataframe, to no avail.
for i in distlist:
#listname = i[4:] + '_norm'
df2 = pd.read_excel(i,header=0,index_col=None, skip_blank_lines=True)
df3 = df2.dropna(axis=0, how='any')
cols = []
for column in df3:
cols.append(column)
for x in cols:
listname = x + ' norm'
maxval = df3[x].max()
print(maxval)
mylist = []
for j in df3[x]:
findNL = (j/maxval)
mylist.append(findNL)
df3[listname] = mylist
saveloc = 'E:/test/'
filename = i[:-18] + '_Normalised.xlsx'
df3.to_excel(saveloc+filename, index=False)
New columns are added to the output dataframe with bespoke headings relating to the field headers in the source spreadsheet and renamed according to (listname). The data in each one of these new columns is identical and relates to the final column in the spreadsheet. To me, it seems to be overwriting the values each time (as if looping through the entire spreadsheet, not outputting for each column), and adding it to the spreadsheet.
Any help would be much appreciated. I think it's something simple, but I haven't managed to work out what...
If I understand you correctly, you are overcomplicating things. You dont need a for loop for this. You can simplify your code:
# Make example dataframe, this is not provided
df = pd.DataFrame({'col1':[1, 2, 3, 4],
'col2':[5, 6, 7, 8]})
print(df)
col1 col2
0 1 5
1 2 6
2 3 7
3 4 8
Now we can use DataFrame.apply and use add_suffix to give the new columns _norm suffix and after that concat the columns to one final dataframe
df_conc = pd.concat([df, df.apply(lambda x: x/x.max()).add_suffix('_norm')],axis=1)
print(df_conc)
col1 col2 col1_norm col2_norm
0 1 5 0.25 0.625
1 2 6 0.50 0.750
2 3 7 0.75 0.875
3 4 8 1.00 1.000
Many thanks. I think I was just overcomplicating it. Incidentally, I think my code may do the same job, but because there is so little difference in the values, it wasn't notable.
Thanks for your help #Erfan

How to understand python data frames better

Being a beginner in Python, I often face this problem - Let's say I am working with a data frame and want to execute an operation on one of the column. It can be just removing the decimal point from the value or maybe I want to take out the month from the date column. But often the solutions I would find online - it is generally shown with a single value or a data point like this:
a = 11.0
int(a)
11
Now, the same solution can't be applied if I have a data frame or a column. Again If I want to add time with date
d = date.today()
d
datetime.date(2018, 3, 30)
datetime.combine(d, datetime.min.time())
datetime.datetime(2018, 3, 30, 0, 0)
In the same manner, this solution can not be used for a data frame. That will throw an error. Obviously I have a lacking in knowledge here, I am not being able to make it work in terms of data frames. Can you please point me towards the topic which might help me understand these problems in terms of data frames ? or maybe show an example how its done ?
You should have a look at pandas library to manipulate dataframes : https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
This is an exemple to apply a function for each value of a given column:
import pandas as pd
def myFunction(a_string):
return a_string.upper()
data = pd.read_csv('data.csv')
print(data)
data['City'] = data['City'].apply(myFunction)
print(data)
Data at beginning :
Name City Age
Robert Paris 32
Max Dallas 24
Raj Delhi 27
Data after:
Name City Age
Robert PARIS 32
Max DALLAS 24
Raj DELHI 27
Here myFunction uppercase the string but could be used the same way for other kind of operations.
Hope that helps.

Resources