Splitting a column into multiple columns


I have a pandas DataFrame like the one below:
| A | Value |
+----------+--------+
|ABC001035 | 34 |
|USN001185 | 45 |
|UCT010.75 | 23 |
|ATC001070 | 21 |
+----------+--------+
I want to split column A (based on the last three characters of A) into columns X and Y, so that it looks like this:
| A | Value | X | Y |
+----------+--------+---------+-----+
|ABC001035 | 34 | ABC001 | 035 |
|USN001185 | 45 | USN001 | 185 |
|UCT010.75 | 23 | UCT01 | 0.75|
|ATC001070 | 21 | ATC001 | 070 |
+----------+--------+---------+-----+
How can I split column A?

You can index all strings in a series with the .str accessor:
>>> df['X'] = df['A'].str[:-3]
>>> df['Y'] = df['A'].str[-3:]
>>> df
A Value X Y
0 ABC001035 34.0 ABC001 035
1 USN001185 45.0 USN001 185
2 UCT010.75 23.0 UCT010 .75
3 ATC001070 21.0 ATC001 070
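As a self-contained sketch of the slicing above, the following rebuilds the sample frame from the question and also shows a regex-based variant with `.str.extract`. The names `parts` and the named groups `X` and `Y` are chosen here for illustration; the regex assumes every value in A has at least three characters.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": ["ABC001035", "USN001185", "UCT010.75", "ATC001070"],
    "Value": [34, 45, 23, 21],
})

# Positional slicing through the .str accessor: everything except the
# last three characters, then the last three characters.
df["X"] = df["A"].str[:-3]
df["Y"] = df["A"].str[-3:]

# Alternative sketch: one regex with two named groups. The greedy first
# group takes everything that is left once the final three characters
# have been reserved for the second group.
parts = df["A"].str.extract(r"^(?P<X>.*)(?P<Y>.{3})$")
print(df)
```

Both approaches split on character position, so the third row yields `UCT010` and `.75` rather than the `UCT01` and `0.75` shown in the question's expected output.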

Split your problem into smaller ones, easier to solve! :)
How to slice a string (take the last 3 characters):
'Hello world!'[-3:]
# Returns: ld!
How to apply a function to each value of a DataFrame column?
df.A.apply(lambda x: x[-3:])
# Returns pandas.Series: [035, 185, .75, 070]
How to save a Series to a new DataFrame column?
# Create Y column.
df['Y'] = df.A.apply(lambda x: x[-3:])

Related

Drop rows in Pandas where column value is not equal to specific suffix

Suppose I have a df with the following rows:
ID Name Age
ABC-123 XYZ 22
ABC-345 LMK 12
ABC-123-1 MNO 22
After applying a filter on column ID, only the first two rows should be returned for this dataset, like this:
ID Name Age
ABC-123 XYZ 22
ABC-345 LMK 12
Rows that don't match the pattern are excluded from the final result; every row whose ID matches a pattern like ABC-123 should be returned.
Note: The numeric suffix can be anything, so I think this should be done with a regex that checks the string pattern.

import pandas as pd
df = pd.DataFrame(dict(id=['ABC-123','ABC-345','ABC-123-1'], age=[22,12,22]))
| | id | age |
|---:|:----------|------:|
| 0 | ABC-123 | 22 |
| 1 | ABC-345 | 12 |
| 2 | ABC-123-1 | 22 |
df.query('id.str.len() <= 7')
| | id | age |
|---:|:--------|------:|
| 0 | ABC-123 | 22 |
| 1 | ABC-345 | 12 |
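The length check above works for this sample, but the question asks for a pattern match. A minimal sketch of a regex-based filter follows; the pattern `[A-Z]+-\d+` (uppercase letters, one hyphen, then digits) is an assumption about what "like ABC-123" means and may need adjusting for real data. `Series.str.fullmatch` requires pandas 1.1 or later.

```python
import pandas as pd

df = pd.DataFrame(dict(id=['ABC-123', 'ABC-345', 'ABC-123-1'], age=[22, 12, 22]))

# Assumed pattern: letters, a single hyphen, then digits and nothing else.
pattern = r"[A-Z]+-\d+"

# fullmatch only accepts values where the whole string matches the pattern,
# so 'ABC-123-1' is rejected because of its extra '-1' suffix.
filtered = df[df["id"].str.fullmatch(pattern)]
print(filtered)
```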

Combine multiple rows into Single row based on specific column using python

I need to change the 'available' value used in the billable and non-billable utilization calculations; it used to be a fixed default, but now the value is dynamic.
I have a Billable column with the values 'Yes' and 'No'.
If the value is 'Yes', the row is summed and a new column 'Billable Utilization' is created:
Billing_utilization = df[Billing_utilization] * sum/available * 100
If the value is 'No', the row is summed and a new column 'Non-Billable Utilization' is created:
Non-Billing_utilization = df[Non-Billing_utilization] * sum/ available1 * 100
Data:
| Employee Name | Java | Python | .Net | React | Billable |
| Priya | 10 | | 5 | | Yes |
| Priya | | 10 | | 5 | No |
| Krithi | | 10 | 20 | | No |
Output
Priya has both billable and non-billable work, so her name appears in two rows. I need to merge these into a single row per Employee Name. The expected output is:
| Employee Name | Java | Python | .Net | React | Total | Billing | Non-Billing |
| Priya | 10 | 10 | 5 | 5 | 30 | 8.928571429 | 8.928571429 |
| Krithi | 10 | 20 | | | 30 | | 17.85714286 |
df['Billable Status'] = np.where(df['Billable Status'] == 'Billable', 'Billable Utilization', 'Non Billable Utilization')
df2 = (df.groupby(['Employee Name', 'Billable Status'])[list_column].sum().sum(axis=1).unstack().div(available2).mul(100)).round(2)
df = df1.join(df2).reset_index()
df.index = df.index
# Round the column value
df['Total'] = df['Total'].round(2)
# df = df.round(2)

Try:
cols = df.select_dtypes('number').columns.tolist()
df['Total'] = df.groupby('Employee Name')[cols].transform('sum').sum(1)
df['Billing'] = df.mask(df['Billable'] == 'No')[cols].sum(1) / df['Total']
df['Non-Billing'] = df.mask(df['Billable'] == 'Yes')[cols].sum(1) / df['Total']
aggfuncs = dict(zip(cols, ['sum']*len(cols)))
aggfuncs.update({'Total': 'first', 'Billing': 'sum', 'Non-Billing': 'sum'})
out = df.pivot_table(aggfuncs, 'Employee Name', aggfunc=aggfuncs,
                     sort=False, fill_value=0)[aggfuncs].reset_index()
Output:
>>> out
Employee Name Java Python .Net React Total Billing Non-Billing
0 Priya 10 10 5 5 30 0.5 0.5
1 Krithi 0 10 20 0 30 0.0 1.0
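For comparison, here is a self-contained sketch of the same result built from plain groupby steps instead of pivot_table. It is an alternative formulation, not the answer's code. The sample frame is reconstructed from the question (reading the `10v` cell as 10), and, like the output above, it reports each share as a fraction of the employee's total rather than scaling by the question's 'available' value. The names `skills`, `hours`, `row_total` and `by_status` are illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question; blank cells become None.
df = pd.DataFrame({
    "Employee Name": ["Priya", "Priya", "Krithi"],
    "Java": [10, None, None],
    "Python": [None, 10, 10],
    ".Net": [5, None, 20],
    "React": [None, 5, None],
    "Billable": ["Yes", "No", "No"],
})
skills = ["Java", "Python", ".Net", "React"]

# One row per employee with the hours summed per skill.
hours = df.groupby("Employee Name", sort=False)[skills].sum()
total = hours.sum(axis=1)

# Hours per source row, then summed per (employee, billable flag).
row_total = df[skills].sum(axis=1)
by_status = (row_total.groupby([df["Employee Name"], df["Billable"]]).sum()
             .unstack(fill_value=0)
             .reindex(columns=["Yes", "No"], fill_value=0))

# Assignment aligns on the employee name, so row order does not matter.
out = hours.assign(Total=total,
                   Billing=by_status["Yes"] / total,
                   **{"Non-Billing": by_status["No"] / total})
print(out.reset_index())
```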

rolling average and aggregate more than one column in pandas

How do I also aggregate the 'reviewer' lists together with the average of 'quantity'?
For a data frame like the one below, I can successfully calculate the average quantity per group over a rolling window of 3 rows. How do I add an extra column that aggregates the values of the 'reviewer' column for each window as well? For example, for company 'A' and year 1993, the column would be [[p1,p2],[p3,p2],[p4]].
import pandas as pd

df = pd.DataFrame(data=[
    ['A', 1990, 2, ['p1', 'p2']],
    ['A', 1991, 3, ['p3', 'p2']],
    ['A', 1993, 5, ['p4']],
    ['A', 2000, 4, ['p1', 'p5', 'p7']],
    ['B', 2000, 1, ['p3']],
    ['B', 2001, 2, ['p6', 'p9']],
    ['B', 2002, 3, ['p10', 'p1']]], columns=['company', 'year', 'quantity', 'reviewer'])
df['rolling_average'] = (df.groupby(['company'])
                           .rolling(3).agg({'quantity': 'mean'}).reset_index(level=[0], drop=True))
The output currently looks like:
| index | company | year | quantity | reviewer | rolling_average |
| :---- | :------ | :--- | :------- | :------- | :-------------- |
| 0 | A | 1990 | 2 | [p1, p2] | NaN |
| 1 | A | 1991 | 3 | [p3, p2] | NaN |
| 2 | A | 1993 | 5 | [p4] | 3.33 |
| 3 | A | 2000 | 4 | [p1, p5, p7] | 4.00 |
| 4 | B | 2000 | 1 | [p3] | NaN |
| 5 | B | 2001 | 2 | [p6, p9] | NaN |
| 6 | B | 2002 | 3 | [p10, p1]| 2.00 |

Since rolling cannot handle non-numeric data, we need to define the rolling window ourselves:
import numpy as np
n = 3
df['new'] = df.groupby(['company'])['reviewer'].apply(lambda x :[x[y-n:y].tolist() if y>=n else np.nan for y in range(1,len(x)+1)]).explode().values
df
company year quantity reviewer new
0 A 1990 2 [p1, p2] NaN
1 A 1991 3 [p3, p2] NaN
2 A 1993 5 [p4] [[p1, p2], [p3, p2], [p4]]
3 A 2000 4 [p1, p5, p7] [[p3, p2], [p4], [p1, p5, p7]]
4 B 2000 1 [p3] NaN
5 B 2001 2 [p6, p9] NaN
6 B 2002 3 [p10, p1] [[p3], [p6, p9], [p10, p1]]
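The one-liner above packs the windowing logic into a lambda. A more explicit sketch of the same idea is shown below, using a hypothetical helper `rolling_windows` that is not part of pandas. It marks incomplete windows with None instead of NaN and otherwise aims to produce the same column.

```python
import pandas as pd

df = pd.DataFrame(data=[
    ['A', 1990, 2, ['p1', 'p2']],
    ['A', 1991, 3, ['p3', 'p2']],
    ['A', 1993, 5, ['p4']],
    ['A', 2000, 4, ['p1', 'p5', 'p7']],
    ['B', 2000, 1, ['p3']],
    ['B', 2001, 2, ['p6', 'p9']],
    ['B', 2002, 3, ['p10', 'p1']]], columns=['company', 'year', 'quantity', 'reviewer'])


def rolling_windows(values, n):
    """Return, for each position, the last n items, or None if fewer exist."""
    values = list(values)
    return [values[i - n + 1:i + 1] if i >= n - 1 else None
            for i in range(len(values))]


# Collect the windows per company, keyed by the original row label.
windows = {}
for _, group in df.groupby('company'):
    windows.update(zip(group.index, rolling_windows(group['reviewer'], n=3)))

# Write the windows back in the original row order.
df['new'] = [windows[label] for label in df.index]
print(df[['company', 'year', 'new']])
```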

Pandas groupby compare count equal values in 2 columns in excel with subrows

I have an excel file like this:
.----.-------------.-------------------------.-----------------.
| | ID | Shareholder - Last name | DM Cognome |
:----+-------------+-------------------------+-----------------:
| 1. | 01287560153 | MASSIRONI | Bocapine Ardaya |
:----+-------------+-------------------------+-----------------:
| | | CAGNACCI | |
:----+-------------+-------------------------+-----------------:
| 2. | 05562881002 | | Directors |
:----+-------------+-------------------------+-----------------:
| 3. | 04113870655 | SABATO | Sabato |
:----+-------------+-------------------------+-----------------:
| | | VILLARI | |
:----+-------------+-------------------------+-----------------:
| 4. | 01419190846 | SALMERI | Salmeri |
:----+-------------+-------------------------+-----------------:
| | | MICALIZZI | Lipari |
:----+-------------+-------------------------+-----------------:
| | | LIPARI | |
'----'-------------'-------------------------'-----------------'
I open this file with pandas and forward-fill the ID column, since there are subrows. Then I group by ID to count how many values match between the Shareholder - Last name and DM\nCognome columns, but I can't get it to work. In this case the result should be 0 for row 1, 0 for row 2, 1 for row 3, and 2 for row 4.
Note that row 4 consists of 3 subrows and row 3 consists of 2 subrows.
I have 2 questions:
What is the best way to read an unorganised Excel file like the one above and do lots of comparisons, value replacements, etc.?
How can I achieve the results I mentioned earlier?
Here is what I did, but it doesn't work:
data['ID'] = data['ID'].fillna(method='ffill')
data.groupby('ID', sort=False, as_index=False)['Shareholder - Last name', 'DM\nCognome'].apply(lambda x: (x['Shareholder - Last name']==x['DM\nCognome']).count())

First, read the table (keeping the ID as a string instead of a float):
df = pd.read_excel("Workbook1.xlsx", converters={'ID':str})
df = df.drop("Unnamed: 0", axis=1) #drop this column since it is not useful
Forward-fill the ID and, if a shareholder is missing, replace NaN with "missing":
df['ID'] = df['ID'].ffill()
df["Shareholder - Last name"] = df["Shareholder - Last name"].fillna("missing")
Convert the surnames to lowercase:
df["Shareholder - Last name"] = df["Shareholder - Last name"].str.lower()
Custom function to count how many shareholders occur in the other column:
def f(group):
    s = pd.Series(group["DM\nCognome"].str.lower())
    count = 0
    for surname in group["Shareholder - Last name"]:
        count += s.str.count(surname).sum()
    return count
And finally get the count for each ID:
df.groupby("ID", sort=False)[["Shareholder - Last name", "DM\nCognome"]].apply(f)
Output:
ID
01287560153 0.0
05562881002 0.0
04113870655 1.0
01419190846 2.0
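To try the approach without the Excel file, the following sketch rebuilds the table from the question in memory. It swaps the substring count for an exact, case-insensitive match with `isin`, which is a stricter test than the answer's `str.count`; the helper name `count_matches` and the in-memory frame are assumptions for illustration.

```python
import pandas as pd

# In-memory stand-in for the workbook; None marks the empty cells.
df = pd.DataFrame({
    "ID": ["01287560153", None, "05562881002", "04113870655", None,
           "01419190846", None, None],
    "Shareholder - Last name": ["MASSIRONI", "CAGNACCI", None, "SABATO",
                                "VILLARI", "SALMERI", "MICALIZZI", "LIPARI"],
    "DM\nCognome": ["Bocapine Ardaya", None, "Directors", "Sabato", None,
                    "Salmeri", "Lipari", None],
})

# Propagate each ID down to its subrows.
df["ID"] = df["ID"].ffill()


def count_matches(group):
    """Count shareholders whose surname also appears in the DM column."""
    shareholders = group["Shareholder - Last name"].dropna().str.lower()
    directors = group["DM\nCognome"].dropna().str.lower()
    return int(shareholders.isin(directors).sum())


# Iterate over the groups directly to keep the result a plain dict.
counts = {id_: count_matches(group) for id_, group in df.groupby("ID", sort=False)}
print(counts)
```

Exact matching avoids treating a surname as a regular expression, at the cost of missing partial matches inside longer names.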

Python selecting different number of rows for each group of a mutlilevel index

I have a data frame with a multilevel index. I would like to sort this data frame based on a specific column and extract the first n rows for each group of the first index, but n is different for each group.
For example:
| Index1| Index2| Sort_In_descending_order | How_manyRows_toChoose |
-----------------------------------------------------------------------
| 1 | 20 | 3 | 2 |
| | 40 | 2 | 2 |
| | 10 | 1 | 2 |
| 2 | 20 | 2 | 1 |
| | 50 | 1 | 1 |
the result should look like this:
| Index1| Index2| Sort_In_descending_order | How_manyRows_toChoose |
-----------------------------------------------------------------------
| 1 | 20 | 3 | 2 |
| | 40 | 2 | 2 |
| 2 | 20 | 2 | 1 |
I got this far:
df.groupby(level=[0,1]).sum().sort_values(['Index1','Sort_In_descending_order'],ascending=False).groupby('Index1').head(2)
However, .head(2) picks 2 elements from each group regardless of the number in the column "How_manyRows_toChoose".
A piece of code would be great!
Thank you!

Use a lambda function in GroupBy.apply with head, and pass group_keys=False to avoid duplicating index values:
# original code
df = (df.groupby(level=[0, 1])
        .sum()
        .sort_values(['Index1', 'Sort_In_descending_order'], ascending=False))
df = (df.groupby('Index1', group_keys=False)
        .apply(lambda x: x.head(x['How_manyRows_toChoose'].iat[0])))
print(df)
Sort_In_descending_order How_manyRows_toChoose
Index1 Index2
1 20 3 2
40 2 2
2 20 2 1
