I have used the following code with the unique() function in pandas to create a column which then contains a list of unique values:
import pandas as pd
from collections import OrderedDict
dct = OrderedDict([
('referencenum',['10','10','20','20','20','30','30','40']),
('Month',['Jan','Jan','Jan','Feb','Feb','Feb','Feb','Mar']),
('Category',['good','bad','bad','bad','bad','good','bad','bad'])
])
df = pd.DataFrame.from_dict(dct)
This gives the following sample dataset:
referencenum Month Category
0 10 Jan good
1 10 Jan bad
2 20 Jan bad
3 20 Feb bad
4 20 Feb bad
5 30 Feb good
6 30 Feb bad
7 40 Mar bad
Then I summarise as follows:
dfsummary = pd.DataFrame(df.groupby(['referencenum', 'Month'])['Category'].unique())
dfsummary = dfsummary.reset_index()
To give the summary dataframe with "Category" column containing a list
referencenum Month Category
0 10 Jan [good, bad]
1 20 Feb [bad]
2 20 Jan [bad]
3 30 Feb [good, bad]
4 40 Mar [bad]
My question is: how do I obtain another column containing the len(), i.e. the number of items, of the Category "list" column?
Also, how do I extract the first/second item of the list into another column?
Can I do these manipulations within pandas or do I somehow need to drop out to list manipulations and then come back to pandas?
Many thanks!
You should check out the accessors.
Basically, they're ways to handle the values contained in a Series that are specific to their type (datetime, string, etc.).
In this case, you would use dfsummary['Category'].str.len().
If you wanted the first element, you would use dfsummary['Category'].str[0].
To generalise: you can treat the elements of a Series as a collection of objects by referring to its .str property.
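A minimal, runnable sketch of both accessors on the summary frame built from the question's sample data (column names as above):

```python
import pandas as pd

df = pd.DataFrame({
    'referencenum': ['10', '10', '20', '20', '20', '30', '30', '40'],
    'Month': ['Jan', 'Jan', 'Jan', 'Feb', 'Feb', 'Feb', 'Feb', 'Mar'],
    'Category': ['good', 'bad', 'bad', 'bad', 'bad', 'good', 'bad', 'bad'],
})
dfsummary = (df.groupby(['referencenum', 'Month'])['Category']
               .unique()
               .reset_index())

# .str works element-wise, so it also handles the array stored in each cell
dfsummary['n_categories'] = dfsummary['Category'].str.len()
dfsummary['first_category'] = dfsummary['Category'].str[0]

print(dfsummary)
```

No round-trip through Python lists is needed; both new columns are computed entirely inside pandas.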
If you want to get the number of elements of each entry in the Category column, you can also use the built-in len() with apply():
dfsummary['Category_len'] = dfsummary['Category'].apply(len)
I have a Pandas dataframe and I want to get a list of the years for the unique events. I don't care about the DIRECTION column; I just want a list of DATE values. The dates themselves don't have to be unique, because there are sometimes multiple IDs for the same date, but I don't need all of the DIRECTIONs for the same date.
Pandas df
ID DIRECTION DATE
ABA Z 2019
ABA N 2019
ABA E 2019
ABB Z 2019
ABB N 2019
ABB E 2019
ABC Z 2020
ABC N 2020
ABC E 2020
Expected Output
[2019, 2019, 2020]
Actual Output
[2019, 2020]
Current Code
import numpy as np

ids = df['ID'].unique().tolist()
dates = df['DATE'].unique().tolist()
labels, counts = np.unique(dates, return_counts=True)

len(counts) == 2
# I want len(counts) == 3
You want the unique dates per ID, and then to concatenate them into one array:
np.concatenate(df.groupby('ID')['DATE'].unique().values)
Output:
array([2019, 2019, 2020])
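Put together with the sample data, a runnable sketch of that one-liner:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['ABA', 'ABA', 'ABA', 'ABB', 'ABB', 'ABB', 'ABC', 'ABC', 'ABC'],
    'DIRECTION': ['Z', 'N', 'E'] * 3,
    'DATE': [2019, 2019, 2019, 2019, 2019, 2019, 2020, 2020, 2020],
})

# one array of unique dates per ID, flattened into a single array
dates = np.concatenate(df.groupby('ID')['DATE'].unique().values)
print(dates)  # one entry per ID, duplicates across IDs preserved
```

`unique()` runs within each ID group, so repeated years from different IDs survive the flattening.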
I'd like to get the number of months between these dates (between the max and min date in each group) and keep the same order in the groupby.
One possible solution is to start from datesac, the result of your grouping (shown in your picture).
I also assume that the ORDER_INST column of your source DataFrame is of datetime type (not strings), and hence this is also the type of level 1 of the MultiIndex in datesac.
To compute the month span separately for each MRN (level 0 of the
MultiIndex), define a function, to be applied to each group:
def monthSpan(grp):
    dates = grp.index.get_level_values(1)
    return (dates.max().to_period('M') - dates.min().to_period('M')).n
Then add a MonthSpan column to your df, running:
datesac['MonthSpan'] = datesac.groupby(level=0).transform(monthSpan)
The result is:
List MonthSpan
MRN ORDER_INST
1000031 2010-04-12 0 11
2010-04-16 0 11
2010-04-17 0 11
2010-04-18 0 11
2011-03-01 0 11
9017307 2018-11-27 0 7
2019-02-04 0 7
2019-04-25 0 7
2019-05-14 0 7
2019-06-09 0 7
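A self-contained sketch of this approach; the frame below is made up to mirror the structure in the question, and a single column is selected before transform so the function receives a Series indexed by the group's MultiIndex rows:

```python
import pandas as pd

acdates = pd.DataFrame({
    'MRN': [1000031, 1000031, 1000031, 9017307, 9017307],
    'ORDER_INST': pd.to_datetime(['2010-04-12', '2010-04-16', '2011-03-01',
                                  '2018-11-27', '2019-06-09']),
    'List': 0,
})
datesac = acdates.set_index(['MRN', 'ORDER_INST'])

def monthSpan(grp):
    # level 1 of the MultiIndex holds the dates of the current group
    dates = grp.index.get_level_values(1)
    return (dates.max().to_period('M') - dates.min().to_period('M')).n

# the scalar returned per group is broadcast back over the group's rows
datesac['MonthSpan'] = datesac.groupby(level=0)['List'].transform(monthSpan)
```

Working in month periods rather than raw timedeltas gives a calendar month count (April 2010 to March 2011 is 11), independent of the day-of-month.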
Pandas does not allow item assignment on a groupby object (a new column cannot be added to a groupby object), so the operation has to be split: first calculate the month difference from the groupby object, then merge the dataframes together, and finally group by again.
Create the first groupby object:
datesac = acdates.groupby(['MRN'])
Calculate the difference in months between each group's max and min dates and join it to the original dataframe (or a new dataframe). This method requires numpy, so import it as necessary.
import numpy as np
acdates_new = pd.merge(
    left=acdates,
    right=((datesac['ORDER_INST'].max() - datesac['ORDER_INST'].min())
           / np.timedelta64(1, 'M')).astype('int').rename('DATE_DIFF'),
    left_on='MRN',
    right_index=True
)
Regroup
datesac = acdates_new.groupby(['MRN'])
I will try to keep this as simple as possible and cause no confusion.
I have over 35,000 rows and 15 columns of data but I am going to provide a very simple case of what I am trying to do.
Basically, I want to sort my dates in column A in ascending order, but then also have the data corresponding to a date sorted to "follow" it to its new row location. I want this to be done using VBA code.
Simple Case
Raw Data
Column A Column B
Jan 5 15
Jan 3 45
Jan 1 7
Jan 10 12
Jan 7 30
Expected Sorted Data
Column A Column B
Jan 1 7
Jan 3 45
Jan 5 15
Jan 7 30
Jan 10 12
I have looked all over and cannot find a good way to do this. Any and all help will be greatly appreciated. Thanks!
I have tried to solve my problem for many hours, but somehow I don't get the solution right.
I have a pandas dataframe which comprises a column of dates, which are in turn stored as lists for each row.
df['date']=pd.to_datetime(df["date"])
gives me the error: unhashable type: 'list'
My dataframe looks like this:
0 [February 28, 2013 Thursday]
1 [November 2, 2012 Friday]
2 [July 31, 2012 Tuesday]
3 [May 10, 2012 Thursday]
4 [June 23, 2004 Wednesday]
As such, each row is a list, but each list contains only one string. I want to convert this single string in each row to datetime format (like 02-28-2013) in the dataframe so that I can perform date operations.
How can I convert the column in a way that the pd.to_datetime command can be executed?
Thank you so much!!
I created my DataFrame like this:
import pandas as pd
df = pd.DataFrame({'date': [['01-01-2013'], ['2-2-2015'], ['July 31, 2012']]})
I used some random dates, but it's likewise a column of one-element lists:
>>> df.date
0       [01-01-2013]
1         [2-2-2015]
2    [July 31, 2012]
You have to access the element inside each list, so simply use a lambda function and assign the result back:
df['date'] = pd.to_datetime(df.date.apply(lambda x: x[0]))
>>> df.date
0   2013-01-01
1   2015-02-02
2   2012-07-31
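Applied to the strings from the question (assuming each cell is a one-element list, and passing an explicit format so the trailing weekday name is consumed):

```python
import pandas as pd

df = pd.DataFrame({'date': [['February 28, 2013 Thursday'],
                            ['November 2, 2012 Friday'],
                            ['July 31, 2012 Tuesday']]})

# unwrap the one-element list, then parse "Month day, year weekday"
df['date'] = pd.to_datetime(df['date'].apply(lambda x: x[0]),
                            format='%B %d, %Y %A')
print(df['date'])
```

Once the column is datetime64, date operations (.dt.month, subtraction, etc.) work directly.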
My Spark dataframe has an array column, and I have to generate new columns by extracting data from this single array column. Are there any methods available for this?
id Amount
10 [Tax:10,Total:30,excludingTax:20]
11 [Total:30]
12 [Tax:05,Total:35,excludingTax:30]
I have to generate this dataframe.
ID Tax Total
10 10 30
11 0 30
12 05 35
If you know for sure that Tax, Total and excludingTax are the only fields, always in the same order, you can map over the entire dataframe and extract them as Amount[0], Amount[1], and so on.
Then assign them to an instance of a case class and finally convert back to a dataframe.
The only thing you have to be careful about is not indexing past the end of the array when Amount holds fewer values; that is easily handled by checking the array length.
Alternatively, if you don't know the order, the best way is to read the data as a JSON RDD, loop through the JSON objects, parse them into new rows, and finally convert that to a dataframe.
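The order-independent parsing can be sketched outside Spark; the helper below is hypothetical plain Python (in Spark you would apply the same logic per row inside a map, or with the built-in split/getItem column functions):

```python
def parse_amount(amount):
    """Turn entries like 'Tax:10' into a dict, e.g. {'Tax': 10}."""
    return {key: int(value)
            for key, value in (item.split(':') for item in amount)}

rows = [(10, ['Tax:10', 'Total:30', 'excludingTax:20']),
        (11, ['Total:30']),
        (12, ['Tax:05', 'Total:35', 'excludingTax:30'])]

# a missing Tax field defaults to 0, matching the expected output above
result = [(row_id, parse_amount(amt).get('Tax', 0), parse_amount(amt)['Total'])
          for row_id, amt in rows]
```

Keying on field names rather than positions sidesteps both the ordering problem and the out-of-bounds indexing concern.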