pandas - create new columns based on existing columns / conditional average - python-3.x

I am new to Pandas and I am trying to learn column creation based on conditions applied to already existing columns. I am working with cellular data, and this is what my source data looks like (the two rightmost columns are empty to begin with):
DEVICE_ID | MONTH | TYPE | DAY | COUNT | LAST_MONTH| SEASONAL_AVG
8129 | 201601 | VOICE | 1 | 8 | |
8129 | 201502 | VOICE | 1 | 5 | |
8129 | 201501 | VOICE | 1 | 2 | |
8321 | 201403 | DATA | 3 | 1 | |
2908 | 201302 | TEXT | 5 | 4 | |
8129 | 201406 | VOICE | 2 | 3 | |
8129 | 201306 | VOICE | 2 | 7 | |
3096 | 201501 | DATA | 5 | 6 | |
8129 | 201301 | VOICE | 1 | 2 | |
I created a dataframe with this data and named it df.
import pandas as pd

df = pd.DataFrame({
    'DEVICE_ID': [8129, 8129, 8129, 8321, 2908, 8129, 8129, 3096, 8129],
    'MONTH': [201601, 201502, 201501, 201403, 201302, 201406, 201306, 201501, 201301],
    'TYPE': ['VOICE', 'VOICE', 'VOICE', 'DATA', 'TEXT', 'VOICE', 'VOICE', 'DATA', 'VOICE'],
    'DAY': [1, 1, 1, 3, 5, 2, 2, 5, 1],
    'COUNT': [8, 5, 2, 1, 4, 3, 7, 6, 2],
})
I am trying to create two additional columns to df: 'LAST_MONTH' and 'SEASONAL_AVG'. Logic for these two columns:
LAST_MONTH: for the corresponding DEVICE_ID & TYPE & DAY combination, return the previous month's COUNT. Ex: for row 1 (DEVICE_ID: 8129, TYPE: VOICE, DAY: 1, MONTH: 201502), LAST_MONTH will be the COUNT from row 2 (DEVICE_ID: 8129, TYPE: VOICE, DAY: 1, MONTH: 201501). If there is no record for the previous month, LAST_MONTH will be zero.
SEASONAL_AVG: for the corresponding DEVICE_ID & TYPE & DAY combination, return the average of the corresponding month from all previous years (data starts from 201301). Ex: SEASONAL_AVG for row 0 = average of the COUNTs of rows 2 and 8. There will always be at least one record for the corresponding month from the past. It need not exist for all TYPE and DAY combinations, but at least some of the possible combinations will be present for every DEVICE_ID.
Your help is greatly appreciated! Thanks!
EDIT1:
def last_month(record):
    year = int(str(record['MONTH'])[:4])
    month = int(str(record['MONTH'])[-2:])
    if month in (2,3,4,5,6,7,8,9,10):
        x = str(0)+str(month-1)
        y = int(str(year)+str(x))
        last_month = int(y)
    elif month == 1:
        last_month = int(str(year-1)+str(12))
    else:
        last_month = int(str(year)+str(month-1))
    day = record['DAY']
    cellular_type = record['TYPE']
    #return record['COUNT']
    return record['COUNT'][(record['MONTH'] == last_month) & (record['DAY'] == day) & (record['TYPE'] == cellular_type)]

df['last_month'] = df.apply(lambda record: last_month(record), axis=1)
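One possible way to build both columns without a row-wise apply, offered as a minimal sketch rather than a definitive answer: compute the previous month as a YYYYMM integer, look it up with a left self-merge, and take an expanding mean over earlier years of the same calendar month. The helper column names PREV_MONTH and MONTH_OF_YEAR are made up for illustration, and the sketch assumes each DEVICE_ID/TYPE/DAY/MONTH combination appears at most once.

```python
import pandas as pd

df = pd.DataFrame({
    'DEVICE_ID': [8129, 8129, 8129, 8321, 2908, 8129, 8129, 3096, 8129],
    'MONTH': [201601, 201502, 201501, 201403, 201302, 201406, 201306, 201501, 201301],
    'TYPE': ['VOICE', 'VOICE', 'VOICE', 'DATA', 'TEXT', 'VOICE', 'VOICE', 'DATA', 'VOICE'],
    'DAY': [1, 1, 1, 3, 5, 2, 2, 5, 1],
    'COUNT': [8, 5, 2, 1, 4, 3, 7, 6, 2],
})
keys = ['DEVICE_ID', 'TYPE', 'DAY']

# Previous month as a YYYYMM integer; January rolls back to December of the prior year.
year, month = df['MONTH'] // 100, df['MONTH'] % 100
is_january = month == 1
df['PREV_MONTH'] = (year - is_january.astype(int)) * 100 + (month - 1).where(~is_january, 12)

# Left self-merge: fetch the COUNT recorded for the same keys in PREV_MONTH.
# Assumption: (keys, MONTH) is unique, otherwise the merge would duplicate rows.
lookup = df[keys + ['MONTH', 'COUNT']].rename(
    columns={'MONTH': 'PREV_MONTH', 'COUNT': 'LAST_MONTH'})
out = df.merge(lookup, on=keys + ['PREV_MONTH'], how='left')
out['LAST_MONTH'] = out['LAST_MONTH'].fillna(0).astype(int)

# Mean of COUNT over earlier years of the same calendar month
# (NaN when there is no earlier year for that combination).
out['MONTH_OF_YEAR'] = out['MONTH'] % 100
ordered = out.sort_values('MONTH')
out['SEASONAL_AVG'] = (
    ordered.groupby(keys + ['MONTH_OF_YEAR'])['COUNT']
    .transform(lambda s: s.shift().expanding().mean())
)
out = out.drop(columns=['PREV_MONTH', 'MONTH_OF_YEAR'])
print(out)
```

On the sample data, only row 1 has a record for its previous month (LAST_MONTH = 2), and row 0 gets SEASONAL_AVG = 2.0, the average of rows 2 and 8. Rows without any earlier year are left as NaN here; whether they should be 0 instead is a choice the question does not settle.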

Related

Combine multiple rows into Single row based on specific column using python

I need to change how the "available" value is used in the billable and non-billable utilization calculations: it used to be a fixed default, but now the value is dynamic.
I have a Billable column with the values 'Yes' and 'No'.
If the value is 'Yes', the row is summed and the result goes into a new column, 'Billable Utilization':
Billing_utilization = df[Billing_utilization] * sum/available * 100
If the value is 'No', the row is summed and the result goes into a new column, 'Non-Billable Utilization':
Non-Billing_utilization = df[Non-Billing_utilization] * sum/ available1 * 100
Data:
| Employee Name | Java | Python | .Net | React | Billable |
| Priya | 10 | | 5 | | Yes |
| Priya | | 10 | | 5 | No |
| Krithi | | 10 | 20 | | No |
Output
Priya is both billable and non-billable, so her name appears in two rows. I need to merge these into a single row per Employee Name, so the expected output is:
| Employee Name | Java | Python | .Net | React | Total | Billing | Non-Billing |
| Priya | 10 | 10 | 5 | 5 | 30 | 8.928571429 | 8.928571429 |
| Krithi | 10 | 20 | | | 30 | | 17.85714286 |
df['Billable Status'] = np.where(df['Billable Status'] == 'Billable', 'Billable Utilization', 'Non Billable Utilization')
df2 = (df.groupby(['Employee Name', 'Billable Status'])[list_column].sum().sum(axis=1).unstack().div(available2).mul(100)).round(2)
df = df1.join(df2).reset_index()
df.index = df.index
# Round the column value
df['Total'] = df['Total'].round(2)
# df = df.round(2)
Try:
cols = df.select_dtypes('number').columns.tolist()
df['Total'] = df.groupby('Employee Name')[cols].transform('sum').sum(axis=1)
df['Billing'] = df.mask(df['Billable'] == 'No')[cols].sum(axis=1) / df['Total']
df['Non-Billing'] = df.mask(df['Billable'] == 'Yes')[cols].sum(axis=1) / df['Total']
aggfuncs = dict(zip(cols, ['sum']*len(cols)))
aggfuncs.update({'Total': 'first', 'Billing': 'sum', 'Non-Billing': 'sum'})
# Pass the keys as a list: recent pandas versions reject a dict as an indexer.
out = df.pivot_table(list(aggfuncs), 'Employee Name', aggfunc=aggfuncs,
                     sort=False, fill_value=0)[list(aggfuncs)].reset_index()
Output:
>>> out
Employee Name Java Python .Net React Total Billing Non-Billing
0 Priya 10 10 5 5 30 0.5 0.5
1 Krithi 0 10 20 0 30 0.0 1.0
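The answer above divides by each employee's Total, whereas the question divides by an "available" figure. A minimal sketch of that variant follows, assuming a single constant available = 168; that number is not stated in the question and is only inferred from the expected output (15 / 168 * 100 ≈ 8.93). The variable names skills, per_status and utilization are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee Name': ['Priya', 'Priya', 'Krithi'],
    'Java': [10, None, None],
    'Python': [None, 10, 10],
    '.Net': [5, None, 20],
    'React': [None, 5, None],
    'Billable': ['Yes', 'No', 'No'],
})
skills = ['Java', 'Python', '.Net', 'React']
available = 168  # assumption: inferred from the expected output, not given in the question

# Hours per row, then summed per employee and billable status.
hours = df[skills].sum(axis=1)
per_status = hours.groupby([df['Employee Name'], df['Billable']]).sum().unstack()

# Utilization in percent; a status with no rows for an employee stays NaN.
utilization = (per_status.div(available).mul(100)
               .rename(columns={'Yes': 'Billing', 'No': 'Non-Billing'}))

# One row per employee with per-skill sums and the overall total.
totals = df.groupby('Employee Name', sort=False)[skills].sum()
totals['Total'] = totals.sum(axis=1)
out = totals.join(utilization).reset_index()
print(out)
```

If "available" differs per employee, a Series indexed by Employee Name could be passed to div(..., axis=0) instead of the scalar.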

Split column on condition in dataframe

The data frame I am working on has a column named "Phone", and I want to split it on "/" or "," so that I get the values in separate columns, as shown below. For example, the first row is 0674-2537100/101, and I want to split it on "/" into two columns with the values 0674-2537100 and 0674-2537101.
Input:
+-------------------------------+
| Phone |
+-------------------------------+
| 0674-2537100/101 |
| 0674-2725627 |
| 0671 – 2647509 |
| 2392229 |
| 2586198/2583361 |
| 0663-2542855/2405168 |
| 0674 – 2563832/0674-2590796 |
| 0671-6520579/3200479 |
+-------------------------------+
Output:
+-----------------------------------+
| Phone | Phone1 |
+-----------------------------------+
| 0674-2537100 | 0674-2537101 |
| 0674-2725627 | |
| 0671 – 2647509 | |
| 2392229 | |
| 2586198 | 2583361 |
| 0663-2542855 | 0663-2405168 |
| 0674 – 2563832 | 0674-2590796 |
| 0671-6520579 | 0671-3200479 |
+-----------------------------------+
My idea is to take the lengths of the strings on both sides of the separator (/), compute their difference, and copy the substring of the first column at positions [:difference-1] onto the front of the second column.
So far my progress is,
df['Phone'] = df['Phone'].str.replace(' ', '')
df['Phone'] = df['Phone'].str.replace('–', '-')
df[['Phone','Phone1']] = df['Phone'].str.split("/",expand=True)
df["Phone1"].fillna(value=np.nan, inplace=True)
m2 = (df["Phone1"].str.len() < 12) & (df["Phone"].str.len() > 7)
m3 = df["Phone"].str.len() - df["Phonenew"].str.len()
df.loc[m2, "Phone1"] = df["Phone"].str[:m3-1] + df["Phonenew"]
It gives an error, and the column has only NaN values after I run this. Please help me out here.
Assuming each value in the 'Phone' column contains at most one '/' (that is, at most two numbers), here's what you can do:
import pandas as pd

def split_phone_number(row):
    '''
    This function takes a row of a dataframe as input and returns the row with appropriate values.
    '''
    split_str = row['Phone'].split('/')
    # Considering that you're only going to have 2 or fewer values, update
    # the passed row's columns with appropriate values.
    if len(split_str) > 1:
        row['Phone'] = split_str[0]
        row['Phone1'] = split_str[1]
    else:
        row['Phone'] = split_str[0]
        row['Phone1'] = ''
    # Return the updated row.
    return row

# Making a dummy dataframe.
d = {'Phone': ['0674-2537100/101', '0674-257349', '0671-257349', '257349', '257349/100', '101/100', '5688343/438934']}
dataFrame = pd.DataFrame(data=d)
# Considering you're only going to have one extra column, add that column to the dataframe.
dataFrame = dataFrame.assign(Phone1=['' for i in range(dataFrame.shape[0])])
# Applying the split_phone_number function to the dataframe.
dataFrame = dataFrame.apply(split_phone_number, axis=1)
# Printing the dataframe.
print(dataFrame)
Input:
+---------------------+
| Phone |
+---------------------+
| 0 0674-2537100/101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349/100 |
| 5 101/100 |
| 6 5688343/438934 |
+---------------------+
Output:
+----------------------------+
| Phone Phone1 |
+----------------------------+
| 0 0674-2537100 101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349 100 |
| 5 101 100 |
| 6 5688343 438934 |
+----------------------------+
For further reading:
dataframe.apply()
Hope this helps. Cheers!
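The answer above splits the values but leaves a short second part such as 101 as it is, while the question expects 0674-2537101. A minimal sketch of the prefix-completion idea from the question follows, assuming a shorter second part always replaces the tail of the first number. The helper name complete_second is made up for illustration, and unlike the expected output in the question this sketch also normalises spaces and the en dash in the first column.

```python
import pandas as pd

df = pd.DataFrame({'Phone': [
    '0674-2537100/101', '0674-2725627', '0671 – 2647509', '2392229',
    '2586198/2583361', '0663-2542855/2405168',
    '0674 – 2563832/0674-2590796', '0671-6520579/3200479',
]})

# Normalise spacing and the en dash so that string lengths are comparable.
clean = (df['Phone'].str.replace(' ', '', regex=False)
                    .str.replace('–', '-', regex=False))

# Split on the first '/'; reindex guarantees a second column even if no row has one.
parts = clean.str.split('/', n=1, expand=True).reindex(columns=[0, 1])

def complete_second(first, second):
    """Prepend the leading characters of `first` that `second` is missing."""
    if pd.isna(second):
        return ''
    return first[:max(len(first) - len(second), 0)] + second

df['Phone'] = parts[0]
df['Phone1'] = [complete_second(f, s) for f, s in zip(parts[0], parts[1])]
print(df)
```

Under this rule a value like 257349/100 would become 257100, so the sketch only fits data where a short second part really is a suffix replacement.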

Pandas groupby compare count equal values in 2 columns in excel with subrows

I have an excel file like this:
.----.-------------.-------------------------.-----------------.
| | ID | Shareholder - Last name | DM Cognome |
:----+-------------+-------------------------+-----------------:
| 1. | 01287560153 | MASSIRONI | Bocapine Ardaya |
:----+-------------+-------------------------+-----------------:
| | | CAGNACCI | |
:----+-------------+-------------------------+-----------------:
| 2. | 05562881002 | | Directors |
:----+-------------+-------------------------+-----------------:
| 3. | 04113870655 | SABATO | Sabato |
:----+-------------+-------------------------+-----------------:
| | | VILLARI | |
:----+-------------+-------------------------+-----------------:
| 4. | 01419190846 | SALMERI | Salmeri |
:----+-------------+-------------------------+-----------------:
| | | MICALIZZI | Lipari |
:----+-------------+-------------------------+-----------------:
| | | LIPARI | |
'----'-------------'-------------------------'-----------------'
I open this file with pandas and forward-fill the ID column, since there are subrows. Then I group by ID to count how many values in the Shareholder - Last name column also appear in the DM\nCognome column, but I can't get it to work. In this case the result should be 0 for row 1, 0 for row 2, 1 for row 3 and 2 for row 4.
Note that row 4 consists of 3 subrows and row 3 consists of 2 subrows.
I have 2 questions:
1. What is the best way to read an unorganised Excel file like the one above and do lots of comparisons, value replacements, etc.?
2. How can I achieve the results that I mentioned earlier?
Here is what I did, but it doesn't work:
data['ID'] = data['ID'].fillna(method='ffill')
data.groupby('ID', sort=False, as_index=False)['Shareholder - Last name', 'DM\nCognome'].apply(lambda x: (x['Shareholder - Last name']==x['DM\nCognome']).count())
First, read as input the table (keeping the ID as string instead of float):
df = pd.read_excel("Workbook1.xlsx", converters={'ID':str})
df = df.drop("Unnamed: 0", axis=1) #drop this column since it is not useful
Fill the ID and, if a shareholder is missing, replace NaN with "missing":
df['ID'] = df['ID'].ffill()
df["Shareholder - Last name"] = df["Shareholder - Last name"].fillna("missing")
Convert to lowercase the surnames:
df["Shareholder - Last name"] = df["Shareholder - Last name"].str.lower()
Custom function to count how many shareholders occur in the other column:
def f(group):
    s = pd.Series(group["DM\nCognome"].str.lower())
    count = 0
    for surname in group["Shareholder - Last name"]:
        # str.count treats the surname as a regex pattern and counts substring matches
        count += s.str.count(surname).sum()
    return count
And finally get the count for each ID:
df.groupby("ID",sort=False)[["Shareholder - Last name", "DM\nCognome"]].apply(lambda x: f(x))
Output:
ID
01287560153 0.0
05562881002 0.0
04113870655 1.0
01419190846 2.0
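If exact, case-insensitive surname matches are enough (rather than the substring matches that str.count gives), the per-ID count can also be written as a set intersection. This is a minimal sketch on an in-memory copy of the table from the question; the helper name count_matches is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['01287560153', None, '05562881002', '04113870655', None,
           '01419190846', None, None],
    'Shareholder - Last name': ['MASSIRONI', 'CAGNACCI', None, 'SABATO', 'VILLARI',
                                'SALMERI', 'MICALIZZI', 'LIPARI'],
    'DM\nCognome': ['Bocapine Ardaya', None, 'Directors', 'Sabato', None,
                    'Salmeri', 'Lipari', None],
})
# Subrows carry no ID, so propagate the last seen ID downwards.
df['ID'] = df['ID'].ffill()

def count_matches(group):
    """Number of distinct surnames present in both columns of one ID group."""
    shareholders = set(group['Shareholder - Last name'].dropna().str.lower())
    managers = set(group['DM\nCognome'].dropna().str.lower())
    return len(shareholders & managers)

counts = (df.groupby('ID', sort=False)[['Shareholder - Last name', 'DM\nCognome']]
            .apply(count_matches))
print(counts)
```

Because sets are used, a surname repeated within one ID is counted once, which may or may not be what is wanted.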

Getting the latest value in a time range or null

I have a huge data set e.g.
| Date | ID | Value |
+------------+----+-------+
| 10-10-2020 | 1 | 1 |
| 10-11-2020 | 1 | 2 |
| 10-12-2020 | 1 | 3 |
| 10-13-2020 | 1 | 4 |
| 10-10-2020 | 2 | 5 |
| 10-11-2020 | 2 | 6 |
| 10-12-2020 | 2 | 7 |
| 10-09-2020 | 3 | 8 |
| 10-08-2020 | 4 | 9 |
As you can see, this example contains 4 IDs with different date ranges.
I have a special logic, which calculates some derived values with RangeBetween function. Let's assume it is a simple sum over the defined time range.
What I need to do is to generate such a result (explained below):
| ID | Value sum (last 2 days) | Value sum (last 4 days) | Value sum (prev 2 days) | Value sum (prev 4 days) | Result (2 days) | Result (4 days) |
+----+-------------------------+-------------------------+-------------------------+-------------------------+-----------------+-----------------+
| 1 | 7 (3+4) | 10 (1+2+3+4) | 5 (3+2) | 6 (3+2+1) | 7 | 10 |
| 2 | 7 | 18 (5+6+7) | 11 (5+6) | 11 (5+6) | 7 | 18 |
| 3 | null | null | null | 8 | null | 0 |
//exclude | 4 | null | null | null | null | null | null |
This example assumes that today is 10-13-2020.
For each Id I need to get a sum of the value in 2 ranges: 2 and 4 days
1. the table contains 2 calculations for the same ranges starting from now and the day before (columns last and prev X days)
2. if all values exist in a range - simply result the sum of the range (example with ID = 1)
3. if some of values are not specified in a range assume it is zero (example with ID = 2)
4. if no values exist in the defined range, but there is at least 1 value in the same range shifted back by one day, assume there was a sum yesterday but none today, and set it to zero (example #3)
5. if there are no values in the range or in the range shifted back by one day, do not include the ID in the result set (example #4)
Right now I have a code:
let last2Days =
    Window
        .PartitionBy("ID")
        .OrderBy(Functions.Col("Date").Cast("timestamp").Cast("long"))
        .RangeBetween(-1, 0)
let prev2Days =
    Window
        .PartitionBy("ID")
        .OrderBy(Functions.Col("Date").Cast("timestamp").Cast("long"))
        .RangeBetween(-2, -1)
df
    .WithColumn("last2daysSum", Functions.Sum("value").Over(last2Days))
    .WithColumn("prev2daysSum", Functions.Sum("value").Over(prev2Days))
    .WithColumn("result2Days", Functions.Col("last2daysSum"))
    .Where(Functions.Col("Date").EqualTo(Functions.Lit("10-13-2020")))
This works for example #1 (when result is taken from last2daysSum)
1. is there a simple way to get a proper result for #2 (the latest record within defined time range)?
2. combine the previous question and condition `if last = null && prev != null then 0 else if last = null && prev = null then null else last` - example #3?
3. how to exclude records as per example #4?
Is it possible to solve this without reshuffling?
For Question #1 If you only want to calculate for one specific date then a groupBy and agg is simpler and should execute faster. The trick is to use when inside aggregate functions like sum.
For Questions #2 and #3 you can coalesce to zero, after first filtering out the rows that are fully null. If you need to filter on a broader range than you want to display (so as to include rows that had values in earlier days but have none now), you can add an extra calculation for the longer period and drop it after filtering. See below for a code example.
import org.apache.spark.sql.functions._
// Outside spark-shell, toDF and $"..." also need: import spark.implicits._

val data = Seq(
  ("2020-10-10", 1, 1),
  ("2020-10-11", 1, 2),
  ("2020-10-12", 1, 3),
  ("2020-10-13", 1, 4),
  ("2020-10-10", 2, 5),
  ("2020-10-11", 2, 6),
  ("2020-10-12", 2, 7),
  ("2020-10-09", 3, 8),
  ("2020-10-08", 4, 9)
).toDF("Date", "ID", "Value").withColumn("Date", to_date($"Date"))

// Sum of Value over the days [now - (start - 1), now - end]; rows outside the range contribute null.
def sumLastNDays(now: java.sql.Timestamp, start: Int, end: Int = 0) =
  sum(when($"Date".between(date_sub(lit(now), start - 1), date_sub(lit(now), end)), $"Value"))

val now = java.sql.Timestamp.valueOf("2020-10-13 00:00:00")

data
  .groupBy($"ID")
  .agg(
    sumLastNDays(now, 2).as("last2DaysSum"),
    sumLastNDays(now, 4).as("last4DaysSum"),
    sumLastNDays(now, 4, 2).as("prev2DaysSum"),
    sumLastNDays(now, 5).as("last5DaysSum")
  )
  // Keep only IDs that have at least one value in the wider 5-day range.
  .filter($"last5DaysSum".isNotNull)
  .drop($"last5DaysSum")
  .withColumn("last4DaysSum", coalesce($"last4DaysSum", lit(0)))
  .withColumn("last2DaysSum", coalesce($"last2DaysSum", lit(0)))
  .withColumn("prev2DaysSum", coalesce($"prev2DaysSum", lit(0)))
  .orderBy($"ID")
  .show()
Result:
+---+------------+------------+------------+
| ID|last2DaysSum|last4DaysSum|prev2DaysSum|
+---+------------+------------+------------+
| 1| 7| 10| 3|
| 2| 7| 18| 11|
| 3| 0| 0| 0|
+---+------------+------------+------------+
Note: I'm not sure if you meant prev2Days to be the previous 2 day interval before the current 2 day interval or the yesterday's last 2 day interval, because in the expected results table ID 1 has Oct. 11-12 summed and ID 2 has Oct. 10-11 summed for prev2Days, but either way you can adjust the range params if you want something else. I assumed that prev2Days does not overlap with last2Days, just change it to sumLastNDays(now, 3, 1) if you want overlapping 2 day ranges.
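To check the range logic on a small sample without a Spark session, the same conditional aggregation can be mirrored in pandas. This is only a sketch of the idea in a different library, not the Spark API; the function name sum_last_n_days is a hypothetical analogue of sumLastNDays above and uses the same start/end convention.

```python
import pandas as pd

data = pd.DataFrame({
    'Date': pd.to_datetime(['2020-10-10', '2020-10-11', '2020-10-12', '2020-10-13',
                            '2020-10-10', '2020-10-11', '2020-10-12',
                            '2020-10-09', '2020-10-08']),
    'ID': [1, 1, 1, 1, 2, 2, 2, 3, 4],
    'Value': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})
now = pd.Timestamp('2020-10-13')

def sum_last_n_days(frame, now, start, end=0):
    """Per-ID sum of Value for dates in [now - (start - 1) days, now - end days]."""
    lo = now - pd.Timedelta(days=start - 1)
    hi = now - pd.Timedelta(days=end)
    in_range = frame['Date'].between(lo, hi)
    return frame.loc[in_range].groupby('ID')['Value'].sum()

# IDs missing from a range become NaN when the per-range Series are aligned.
result = pd.DataFrame({
    'last2DaysSum': sum_last_n_days(data, now, 2),
    'last4DaysSum': sum_last_n_days(data, now, 4),
    'prev2DaysSum': sum_last_n_days(data, now, 4, 2),
    'last5DaysSum': sum_last_n_days(data, now, 5),
})

# Keep IDs seen in the wider 5-day range, then treat missing sums as zero.
result = (result[result['last5DaysSum'].notna()]
          .drop(columns='last5DaysSum')
          .fillna(0)
          .astype(int))
print(result)
```

ID 4 never appears because its only record falls outside the 5-day range, and ID 3 is kept with zeros, matching the Spark result table above.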

Explode date interval over a group by and take last value in pyspark

I have a dataframe which contains some products, a date and a value. The dates have gaps of different lengths between recorded values, and I want to fill them so that I have a value for every hour from the first time the product was seen to the last. If there is no record for an hour, I want to use the latest value.
So, I have a dataframe like:
| ProductId | Date | Value |
|-----------|-------------------------------|-------|
| 1 | 2020-03-12T00:00:00.000+0000 | 4 |
| 1 | 2020-03-12T01:00:00.000+0000 | 2 |
| 2 | 2020-03-12T01:00:00.000+0000 | 3 |
| 2 | 2020-03-12T03:00:00.000+0000 | 4 |
| 1 | 2020-03-12T05:00:00.000+0000 | 4 |
| 3 | 2020-03-12T05:00:00.000+0000 | 2 |
I want to create a new dataframe that looks like:
| ProductId | Date | Value |
|-----------|-------------------------------|-------|
| 1 | 2020-03-12T00:00:00.000+0000 | 4 |
| 1 | 2020-03-12T01:00:00.000+0000 | 2 |
| 1 | 2020-03-12T02:00:00.000+0000 | 2 |
| 1 | 2020-03-12T03:00:00.000+0000 | 2 |
| 1 | 2020-03-12T04:00:00.000+0000 | 2 |
| 1 | 2020-03-12T05:00:00.000+0000 | 4 |
| 2 | 2020-03-12T01:00:00.000+0000 | 3 |
| 2 | 2020-03-12T02:00:00.000+0000 | 3 |
| 2 | 2020-03-12T03:00:00.000+0000 | 4 |
| 3 | 2020-03-12T05:00:00.000+0000 | 2 |
My code so far:
def generate_date_series(start, stop):
    start = datetime.strptime(start, "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
    stop = datetime.strptime(stop, "yyyy-MM-dd'T'HH:mm:ss.SSSZ")
    return [start + datetime.timedelta(hours=x) for x in range(0, (stop-start).hours + 1)]

spark.udf.register("generate_date_series", generate_date_series, ArrayType(TimestampType()))
df = df.withColumn("max", max(col("Date")).over(Window.partitionBy("ProductId"))) \
    .withColumn("min", min(col("Date")).over(Window.partitionBy("ProductId"))) \
    .withColumn("Dato", explode(generate_date_series(col("min"), col("max"))) \
    .over(Window.partitionBy("ProductId").orderBy(col("Dato").desc())))
window_over_ids = (Window.partitionBy("ProductId").rangeBetween(Window.unboundedPreceding, -1).orderBy("Date"))
df = df.withColumn("Value", last("Value", ignorenulls=True).over(window_over_ids))
Error:
TypeError: strptime() argument 1 must be str, not Column
So the first question is obviously how do I create and call the udf correctly so I don't run into the above error.
The second question is how do I complete the task, such that I get my desired dataframe?
After some searching and experimenting I found a solution. I defined a UDF that returns the timestamps between two dates at 1-hour intervals, and I then do a forward fill.
I fixed the issue with the following code:
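from datetime import timedelta
from pyspark.sql import Window
from pyspark.sql.functions import col, explode, lag, last, lit, udf
from pyspark.sql.types import ArrayType, TimestampType

def missing_hours(t1, t2):
    # Hourly timestamps strictly between t1 and t2; starting at 1 avoids duplicating the row at t1.
    return [t1 + timedelta(hours=x) for x in range(1, int((t2 - t1).total_seconds() / 3600))]

missing_hours_udf = udf(missing_hours, ArrayType(TimestampType()))

df_original = df  # assumption: df_original is the unmodified input DataFrame
window = Window.partitionBy("ProductId").orderBy("Date")
# One row per missing hour, with a null Value to be filled afterwards.
df_missing = df_original.withColumn("prev_timestamp", lag(col("Date"), 1, None).over(window)) \
    .filter(col("prev_timestamp").isNotNull()) \
    .withColumn("Date", explode(missing_hours_udf(col("prev_timestamp"), col("Date")))) \
    .withColumn("Value", lit(None)) \
    .drop("prev_timestamp")
df = df_original.union(df_missing)
window = Window.partitionBy("ProductId").orderBy("Date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
# define the forward-filled column
filled_values_column = last(df['Value'], ignorenulls=True).over(window)
# do the fill
df = df.withColumn('Value', filled_values_column)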
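For data small enough to fit in memory, the same hourly fill can be prototyped in pandas to check the expected result. This is a sketch in a different library, not a replacement for the PySpark code above; the helper name fill_hourly is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ProductId': [1, 1, 2, 2, 1, 3],
    'Date': pd.to_datetime([
        '2020-03-12T00:00:00.000+0000', '2020-03-12T01:00:00.000+0000',
        '2020-03-12T01:00:00.000+0000', '2020-03-12T03:00:00.000+0000',
        '2020-03-12T05:00:00.000+0000', '2020-03-12T05:00:00.000+0000',
    ], utc=True),
    'Value': [4, 2, 3, 4, 4, 2],
})

def fill_hourly(group):
    """Reindex one product to every hour between its first and last record, then forward-fill."""
    values = group.set_index('Date')['Value'].sort_index()
    hours = pd.date_range(values.index.min(), values.index.max(),
                          freq=pd.Timedelta(hours=1))
    return values.reindex(hours).ffill()

# One filled Series per product, stacked back into a flat frame.
pieces = {pid: fill_hourly(group) for pid, group in df.groupby('ProductId')}
filled = pd.concat(pieces, names=['ProductId', 'Date']).reset_index(name='Value')
filled['Value'] = filled['Value'].astype(int)
print(filled)
```

On the sample this yields the 10 rows of the desired dataframe: product 1 from 00:00 to 05:00, product 2 from 01:00 to 03:00, and the single row for product 3.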
