Combine multiple rows into a single row based on a specific column using Python - python-3.x

I need to make the 'available' value used in the billable and non-billable utilization configurable; earlier it was a fixed default, and now the value is dynamic.
I have a Billable column with values 'Yes' and 'No'.
If the value is 'Yes', the row is summed and a new column 'Billable Utilization' is created:
Billing_utilization = sum / available * 100
If the value is 'No', the row is summed and a new column 'Non-Billable Utilization' is created:
Non-Billing_utilization = sum / available1 * 100
Data:
| Employee Name | Java | Python | .Net | React | Billable |
|---------------|------|--------|------|-------|----------|
| Priya         | 10   |        | 5    |       | Yes      |
| Priya         |      | 10     |      | 5     | No       |
| Krithi        |      | 10     | 20   |       | No       |
Output
Priya has both billable and non-billable rows, so her name appears twice. I need to merge these into a single row per Employee Name. So the expected output should be:
| Employee Name | Java | Python | .Net | React | Total | Billing     | Non-Billing |
|---------------|------|--------|------|-------|-------|-------------|-------------|
| Priya         | 10   | 10     | 5    | 5     | 30    | 8.928571429 | 8.928571429 |
| Krithi        |      | 10     | 20   |       | 30    |             | 17.85714286 |
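For reference, here is a minimal sketch of the computation the expected output implies. It is not from the question or the answer below; it assumes a single `available` figure of 168 hours, which reproduces the expected percentages (15 / 168 * 100 ≈ 8.93 and 30 / 168 * 100 ≈ 17.86):

import pandas as pd

# question's sample data; blank cells become NaN
df = pd.DataFrame({
    'Employee Name': ['Priya', 'Priya', 'Krithi'],
    'Java':   [10, None, None],
    'Python': [None, 10, 10],
    '.Net':   [5, None, 20],
    'React':  [None, 5, None],
    'Billable': ['Yes', 'No', 'No'],
})

available = 168  # assumed available hours; the actual value is not stated in the question
skills = ['Java', 'Python', '.Net', 'React']
row_sum = df[skills].sum(axis=1)  # row-wise sum of skill hours

util = df.assign(**{
    'Billing': row_sum.where(df['Billable'] == 'Yes') / available * 100,
    'Non-Billing': row_sum.where(df['Billable'] == 'No') / available * 100,
})

# collapse to one row per employee; min_count=1 keeps NaN where a category is absent
out = (util.drop(columns='Billable')
           .groupby('Employee Name', sort=False)
           .sum(min_count=1))
out.insert(len(skills), 'Total', out[skills].sum(axis=1))
print(out.round(2).reset_index())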
Here is my attempt so far:
df['Billable Status'] = np.where(df['Billable Status'] == 'Billable',
                                 'Billable Utilization', 'Non Billable Utilization')
df2 = (df.groupby(['Employee Name', 'Billable Status'])[list_column]
         .sum()
         .sum(axis=1)
         .unstack()
         .div(available2)
         .mul(100)
         .round(2))
df = df1.join(df2).reset_index()
# Round the Total column
df['Total'] = df['Total'].round(2)
# df = df.round(2)

Try:
# numeric skill columns (Java, Python, .Net, React)
cols = df.select_dtypes('number').columns.tolist()

df['Total'] = df.groupby('Employee Name')[cols].transform('sum').sum(axis=1)
df['Billing'] = df.mask(df['Billable'] == 'No')[cols].sum(axis=1) / df['Total']
df['Non-Billing'] = df.mask(df['Billable'] == 'Yes')[cols].sum(axis=1) / df['Total']

aggfuncs = dict(zip(cols, ['sum'] * len(cols)))
aggfuncs.update({'Total': 'first', 'Billing': 'sum', 'Non-Billing': 'sum'})

out = df.pivot_table(values=list(aggfuncs), index='Employee Name', aggfunc=aggfuncs,
                     sort=False, fill_value=0)[list(aggfuncs)].reset_index()
Output:
>>> out
Employee Name Java Python .Net React Total Billing Non-Billing
0 Priya 10 10 5 5 30 0.5 0.5
1 Krithi 0 10 20 0 30 0.0 1.0

Related

Pandas groupby compare count equal values in 2 columns in excel with subrows

I have an excel file like this:
link
.----.-------------.-------------------------.-----------------.
| | ID | Shareholder - Last name | DM Cognome |
:----+-------------+-------------------------+-----------------:
| 1. | 01287560153 | MASSIRONI | Bocapine Ardaya |
:----+-------------+-------------------------+-----------------:
| | | CAGNACCI | |
:----+-------------+-------------------------+-----------------:
| 2. | 05562881002 | | Directors |
:----+-------------+-------------------------+-----------------:
| 3. | 04113870655 | SABATO | Sabato |
:----+-------------+-------------------------+-----------------:
| | | VILLARI | |
:----+-------------+-------------------------+-----------------:
| 4. | 01419190846 | SALMERI | Salmeri |
:----+-------------+-------------------------+-----------------:
| | | MICALIZZI | Lipari |
:----+-------------+-------------------------+-----------------:
| | | LIPARI | |
'----'-------------'-------------------------'-----------------'
I open this file with pandas and forward-fill the ID column since there are subrows, then group by ID to count matching values between the Shareholder - Last name and DM\nCognome columns. However, I can't get it to work. In this case the result should be 0 for row 1, 0 for row 2, 1 for row 3, and 2 for row 4.
Note that row 4 consists of 3 subrows and row 3 consists of 2 subrows.
I have 2 questions:
What is the best way to read an unorganised excel file like the one above and do lots of comparisons, value replacements, etc.?
How can I achieve the result I mentioned earlier?
Here is what I did, but it doesn't work:
data['ID'] = data['ID'].fillna(method='ffill')
data.groupby('ID', sort=False, as_index=False)['Shareholder - Last name', 'DM\nCognome'].apply(lambda x: (x['Shareholder - Last name']==x['DM\nCognome']).count())
First, read in the table (keeping the ID as a string instead of a float):
df = pd.read_excel("Workbook1.xlsx", converters={'ID':str})
df = df.drop("Unnamed: 0", axis=1) #drop this column since it is not useful
Forward-fill the ID and, if a shareholder is missing, replace NaN with "missing":
df['ID'] = df['ID'].fillna(method='ffill')
df["Shareholder - Last name"] = df["Shareholder - Last name"].fillna("missing")
Convert the surnames to lowercase:
df["Shareholder - Last name"] = df["Shareholder - Last name"].str.lower()
Define a custom function that counts how many shareholder surnames occur in the other column:
def f(group):
    s = pd.Series(group["DM\nCognome"].str.lower())
    count = 0
    for surname in group["Shareholder - Last name"]:
        count += s.str.count(surname).sum()
    return count
And finally get the count for each ID:
df.groupby("ID",sort=False)[["Shareholder - Last name", "DM\nCognome"]].apply(lambda x: f(x))
Output:
ID
01287560153 0.0
05562881002 0.0
04113870655 1.0
01419190846 2.0
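As a side note (not part of the answer above), the per-group loop can be replaced by a vectorized membership test. This counts exact, case-insensitive matches rather than substring occurrences, which gives the same result for this data; a minimal sketch:

# for each ID, count how many DM\nCognome values appear among that ID's shareholder surnames
out = (df.assign(dm=df['DM\nCognome'].str.lower())
         .groupby('ID', sort=False)
         .apply(lambda g: g['dm'].isin(set(g['Shareholder - Last name'])).sum()))
print(out)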

Explode date interval over a group by and take last value in pyspark

I have a dataframe that contains products, a date and a value. The dates have gaps of different sizes between recorded values, which I want to fill so that there is a value for every hour from the first time a product was seen to the last; where there is no record, I want to carry forward the latest value.
So, I have a dataframe like:
| ProductId | Date | Value |
|-----------|-------------------------------|-------|
| 1 | 2020-03-12T00:00:00.000+0000 | 4 |
| 1 | 2020-03-12T01:00:00.000+0000 | 2 |
| 2 | 2020-03-12T01:00:00.000+0000 | 3 |
| 2 | 2020-03-12T03:00:00.000+0000 | 4 |
| 1 | 2020-03-12T05:00:00.000+0000 | 4 |
| 3 | 2020-03-12T05:00:00.000+0000 | 2 |
I want to create a new dataframe that looks like:
| ProductId | Date | Value |
|-----------|-------------------------------|-------|
| 1 | 2020-03-12T00:00:00.000+0000 | 4 |
| 1 | 2020-03-12T01:00:00.000+0000 | 2 |
| 1 | 2020-03-12T02:00:00.000+0000 | 2 |
| 1 | 2020-03-12T03:00:00.000+0000 | 2 |
| 1 | 2020-03-12T04:00:00.000+0000 | 2 |
| 1 | 2020-03-12T05:00:00.000+0000 | 4 |
| 2 | 2020-03-12T01:00:00.000+0000 | 3 |
| 2 | 2020-03-12T02:00:00.000+0000 | 3 |
| 2 | 2020-03-12T03:00:00.000+0000 | 4 |
| 3 | 2020-03-12T05:00:00.000+0000 | 2 |
My code so far:
def generate_date_series(start, stop):
    start = datetime.strptime(start, "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
    stop = datetime.strptime(stop, "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
    return [start + datetime.timedelta(hours=x) for x in range(0, (stop - start).hours + 1)]

spark.udf.register("generate_date_series", generate_date_series, ArrayType(TimestampType()))

df = df.withColumn("max", max(col("Date")).over(Window.partitionBy("ProductId"))) \
       .withColumn("min", min(col("Date")).over(Window.partitionBy("ProductId"))) \
       .withColumn("Dato", explode(generate_date_series(col("min"), col("max"))) \
       .over(Window.partitionBy("ProductId").orderBy(col("Dato").desc())))
window_over_ids = (Window.partitionBy("ProductId").rangeBetween(Window.unboundedPreceding, -1).orderBy("Date"))
df = df.withColumn("Value", last("Value", ignorenulls=True).over(window_over_ids))
Error:
TypeError: strptime() argument 1 must be str, not Column
So the first question is obviously: how do I create and call the UDF correctly so that I don't run into the above error?
The second question is: how do I complete the task so that I get my desired dataframe?
After some searching and experimenting I found a solution: I defined a UDF that returns a date range between two dates at 1-hour intervals, and then did a forward fill.
I fixed the issue with the following code:
import sys
from datetime import timedelta

from pyspark.sql.functions import col, explode, lag, last, lit, udf
from pyspark.sql.types import ArrayType, TimestampType
from pyspark.sql.window import Window

def missing_hours(t1, t2):
    return [t1 + timedelta(hours=x) for x in range(0, int((t2 - t1).total_seconds() / 3600))]

missing_hours_udf = udf(missing_hours, ArrayType(TimestampType()))

window = Window.partitionBy("ProductId").orderBy("Date")

# generate the missing hourly rows between consecutive recorded timestamps
df_missing = df.withColumn("prev_timestamp", lag(col("Date"), 1, None).over(window)) \
    .filter(col("prev_timestamp").isNotNull()) \
    .withColumn("Date", explode(missing_hours_udf(col("prev_timestamp"), col("Date")))) \
    .withColumn("Value", lit(None)) \
    .drop("prev_timestamp")

# df_original is the unmodified input dataframe
df = df_original.union(df_missing)

window = Window.partitionBy("ProductId").orderBy("Date") \
    .rowsBetween(-sys.maxsize, 0)

# define the forward-filled column
filled_values_column = last(df['Value'], ignorenulls=True).over(window)

# do the fill
df = df.withColumn('Value', filled_values_column)
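As a side note (not part of the answer above), on Spark 2.4+ the hourly range can also be generated without a Python UDF by using the built-in sequence function. A minimal sketch, assuming df holds the ProductId/Date/Value columns from the question:

from pyspark.sql import functions as F, Window

# one row per ProductId and hour between its first and last recorded Date
hours = (df.groupBy("ProductId")
           .agg(F.min("Date").alias("min_ts"), F.max("Date").alias("max_ts"))
           .select("ProductId",
                   F.explode(F.expr("sequence(min_ts, max_ts, interval 1 hour)")).alias("Date")))

# attach recorded values where they exist, then forward-fill the gaps
w = Window.partitionBy("ProductId").orderBy("Date")
result = (hours.join(df, ["ProductId", "Date"], "left")
               .withColumn("Value", F.last("Value", ignorenulls=True).over(w)))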

Python selecting a different number of rows for each group of a multilevel index

I have a data frame with a multilevel index. I would like to sort this data frame based on a specific column and extract the first n rows for each group of the first index, but n is different for each group.
For example:
| Index1 | Index2 | Sort_In_descending_order | How_manyRows_toChoose |
|--------|--------|--------------------------|-----------------------|
| 1      | 20     | 3                        | 2                     |
|        | 40     | 2                        | 2                     |
|        | 10     | 1                        | 2                     |
| 2      | 20     | 2                        | 1                     |
|        | 50     | 1                        | 1                     |
the result should look like this:
| Index1 | Index2 | Sort_In_descending_order | How_manyRows_toChoose |
|--------|--------|--------------------------|-----------------------|
| 1      | 20     | 3                        | 2                     |
|        | 40     | 2                        | 2                     |
| 2      | 20     | 2                        | 1                     |
I got this far:
df.groupby(level=[0, 1]).sum().sort_values(['Index1', 'Sort_In_descending_order'], ascending=False).groupby('Index1').head(2)
However, .head(2) picks 2 elements from each group regardless of the number in the "How_manyRows_toChoose" column.
Some piece of code would be great!
Thank you!
Use a lambda function in GroupBy.apply with head, and add the parameter group_keys=False to avoid duplicated index values:
# original code
df = (df.groupby(level=[0, 1])
        .sum()
        .sort_values(['Index1', 'Sort_In_descending_order'], ascending=False))

df = (df.groupby('Index1', group_keys=False)
        .apply(lambda x: x.head(x['How_manyRows_toChoose'].iat[0])))
print (df)
Sort_In_descending_order How_manyRows_toChoose
Index1 Index2
1 20 3 2
40 2 2
2 20 2 1
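As a side note (not part of the answer above), the same per-group head selection can also be done without apply by comparing a per-group position counter to the threshold column; a sketch assuming df is the sorted frame from the first step above:

# cumcount numbers rows 0, 1, 2, ... within each Index1 group in the current order
df_top = df[df.groupby('Index1').cumcount() < df['How_manyRows_toChoose']]
print(df_top)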

Sorting rows in pandas first by timestamp values and then by giving particular order to categorical values of a column

I have a pandas dataframe with a column "User" containing categorical values (a, b, c, d). I only care about the ordering of two categories in ascending order (a, d), so (a, b, c, d) and (a, c, b, d) are both fine for me.
How to create this ordering is the first part of the question.
Secondly, I have another column containing timestamps. I want to order my rows first by timestamp, and then, for rows with the same timestamp, sort by the above ordering of categorical values.
Let's say my data frame looks like this:
+-----------+------+
| Timestamp | User |
+-----------+------+
| 1 | b |
| 2 | d |
| 1 | a |
| 1 | c |
| 1 | d |
| 2 | a |
| 2 | b |
+-----------+------+
First, I want this kind of sorting to happen:
+-----------+------+
| Timestamp | User |
+-----------+------+
| 1 | b |
| 1 | a |
| 1 | c |
| 1 | d |
| 2 | d |
| 2 | a |
| 2 | b |
+-----------+------+
Followed by the categorical ordering of "User":
+-----------+------+
| Timestamp | User |
+-----------+------+
| 1 | a |
| 1 | b |
| 1 | c |
| 1 | d |
| 2 | a |
| 2 | b |
| 2 | d |
+-----------+------+
OR
+-----------+------+
| Timestamp | User |
+-----------+------+
| 1 | a |
| 1 | c |
| 1 | b |
| 1 | d |
| 2 | a |
| 2 | b |
| 2 | d |
+-----------+------+
As you can see, the order of "c" and "b" does not matter.
You can specify the order with an ordered Categorical via categories, and then call DataFrame.sort_values:
df['User'] = pd.Categorical(df['User'], ordered=True, categories=['a','b','c','d'])
df = df.sort_values(['Timestamp','User'])
print (df)
Timestamp User
2 1 a
0 1 b
3 1 c
4 1 d
5 2 a
6 2 b
1 2 d
If there are many values of User, it is possible to create the categories dynamically:
import numpy as np

vals = ['a', 'd']
cats = vals + np.setdiff1d(df['User'], vals).tolist()
print (cats)
['a', 'd', 'b', 'c']
df['User'] = pd.Categorical(df['User'], ordered=True, categories=cats)
df = df.sort_values(['Timestamp','User'])
print (df)
Timestamp User
2 1 a
4 1 d
0 1 b
3 1 c
5 2 a
1 2 d
6 2 b
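As a side note (not part of the answer above), if only the a-before-d constraint matters, a key function in sort_values (pandas 1.1+) also works without building a full Categorical; a minimal sketch:

rank = {'a': 0, 'd': 2}  # everything else maps to 1, so 'b' and 'c' stay unconstrained

df = df.sort_values(
    ['Timestamp', 'User'],
    key=lambda s: s.map(rank).fillna(1) if s.name == 'User' else s,
)
print(df)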

pandas - create new columns based on existing columns / conditional average

I am new to Pandas and I am trying to learn how to create columns based on conditions applied to already existing columns. I am working with cellular data, and this is what my source data looks like (the two columns on the right are empty to begin with):
| DEVICE_ID | MONTH  | TYPE  | DAY | COUNT | LAST_MONTH | SEASONAL_AVG |
|-----------|--------|-------|-----|-------|------------|--------------|
| 8129      | 201601 | VOICE | 1   | 8     |            |              |
| 8129      | 201502 | VOICE | 1   | 5     |            |              |
| 8129      | 201501 | VOICE | 1   | 2     |            |              |
| 8321      | 201403 | DATA  | 3   | 1     |            |              |
| 2908      | 201302 | TEXT  | 5   | 4     |            |              |
| 8129      | 201406 | VOICE | 2   | 3     |            |              |
| 8129      | 201306 | VOICE | 2   | 7     |            |              |
| 3096      | 201501 | DATA  | 5   | 6     |            |              |
| 8129      | 201301 | VOICE | 1   | 2     |            |              |
I created a dataframe with this data and named it df.
df = pd.DataFrame({'DEVICE_ID' : [8129, 8129,8129,8321,2908,8129,8129,3096,8129],
'MONTH' : [201601,201502,201501,201403,201302,201406,201306,201501,201301],
'TYPE' : ['VOICE','VOICE','VOICE','DATA','TEXT','VOICE','VOICE','DATA','VOICE'],
'DAY' : [1,1,1,3,5,2,2,5,1],
'COUNT' : [8,5,2,1,4,3,7,6,2]
})
I am trying to add two columns to df: 'LAST_MONTH' and 'SEASONAL_AVG'. The logic for these two columns:
LAST_MONTH: for the corresponding DEVICE_ID & TYPE & DAY combination, return the previous month's COUNT. Ex: for row 1 (DEVICE_ID: 8129, TYPE: VOICE, DAY: 1, MONTH: 201502), LAST_MONTH will be the COUNT from row 2 (DEVICE_ID: 8129, TYPE: VOICE, DAY: 1, MONTH: 201501). If there is no record for the previous month, LAST_MONTH will be zero.
SEASONAL_AVG: for the corresponding DEVICE_ID & TYPE & DAY combination, return the average of the corresponding month from all previous years (the data starts at 201301). Ex: SEASONAL_AVG for row 0 = average of the COUNTs of rows 2 and 8. There will always be at least one record for the corresponding month from the past. This need not hold for all TYPE and DAY combinations, but at least some of the possible combinations will be present for all DEVICE_IDs.
Your help is greatly appreciated! Thanks!
EDIT1:
def last_month(record):
    year = int(str(record['MONTH'])[:4])
    month = int(str(record['MONTH'])[-2:])
    if month in (2, 3, 4, 5, 6, 7, 8, 9, 10):
        x = str(0) + str(month - 1)
        y = int(str(year) + str(x))
        last_month = int(y)
    elif month == 1:
        last_month = int(str(year - 1) + str(12))
    else:
        last_month = int(str(year) + str(month - 1))
    day = record['DAY']
    cellular_type = record['TYPE']
    #return record['COUNT']
    return record['COUNT'][(record['MONTH'] == last_month) & (record['DAY'] == day) & (record['TYPE'] == cellular_type)]

df['last_month'] = df.apply(lambda record: last_month(record), axis=1)
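For what it's worth, here is a minimal sketch (not an answer from the thread) of one way to derive both columns from the logic described above, using the df constructed earlier. prev_month and seasonal_avg are hypothetical helpers, and the previous-month lookup is done with a merge:

def prev_month(yyyymm):
    # previous calendar month as an integer YYYYMM (hypothetical helper)
    year, month = divmod(yyyymm, 100)
    return (year - 1) * 100 + 12 if month == 1 else yyyymm - 1

# LAST_MONTH: previous month's COUNT for the same DEVICE_ID / TYPE / DAY, else 0
lookup = df[['DEVICE_ID', 'TYPE', 'DAY', 'MONTH', 'COUNT']].rename(
    columns={'MONTH': 'PREV_MONTH', 'COUNT': 'LAST_MONTH'})
df['PREV_MONTH'] = df['MONTH'].map(prev_month)
df = (df.merge(lookup, on=['DEVICE_ID', 'TYPE', 'DAY', 'PREV_MONTH'], how='left')
        .drop(columns='PREV_MONTH'))
df['LAST_MONTH'] = df['LAST_MONTH'].fillna(0)

# SEASONAL_AVG: mean COUNT of the same calendar month in all previous years
# for the same DEVICE_ID / TYPE / DAY; rows in the first year have no prior
# data and therefore yield NaN (hypothetical helper)
def seasonal_avg(row):
    mask = ((df['DEVICE_ID'] == row['DEVICE_ID'])
            & (df['TYPE'] == row['TYPE'])
            & (df['DAY'] == row['DAY'])
            & (df['MONTH'] % 100 == row['MONTH'] % 100)
            & (df['MONTH'] // 100 < row['MONTH'] // 100))
    return df.loc[mask, 'COUNT'].mean()

df['SEASONAL_AVG'] = df.apply(seasonal_avg, axis=1)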
