How to create new columns in pandas dataframe using column values? - python-3.x

I'm working in Python with a pandas DataFrame similar to:
REQUESET_ID | DESCR | TEST | TEST_DESC | RESULT |
1 | 1 | T1 | TEST_1 | 2.0 |
1 | 2 | T2 | TEST_2 | 92.0 |
2 | 1 | T1 | TEST_1 | 8.0 |
3 | 3 | T3 | TEST_3 | 12.0 |
3 | 4 | T4 | TEST_4 | 45.0 |
What I want is a final dataframe like this:
REQUESET_ID | DESCR_1 | TEST_1 | TEST_DESC_1 | RESULT_1 | DESCR_2 | TEST_2 | TEST_DESC_2 | RESULT_2 |
1 | 1 | T1 | TEST_1 | 2.0 | 2 | T2 | TEST_2 | 92.0 |
2 | 1 | T1 | TEST_1 | 8.0 | NaN | NaN | NaN | Nan |
3 | 3 | T3 | TEST_3 | 12.0 | 4 | T4 | TEST_4 | 45.0 |
How I should implement that as a method working with DataFrames. I understand that if I try to do it with a merge instead of having 4x2 columns added beacuse the value_counts method of the REQUEST_ID will return 2, will add the 4 columns for each entry in the request column.

Assign a new column with cumcount, then do stack + unstack
s=df.assign(col=(df.groupby('REQUESET_ID').cumcount()+1).astype(str)).\
set_index(['REQUESET_ID','col']).unstack().sort_index(level=1,axis=1)
s.columns=s.columns.map('_'.join)
s
DESCR_1 RESULT_1 TEST_1 ... RESULT_2 TEST_2 TEST_DESC_2
REQUESET_ID ...
1 1.0 2.0 T1 ... 92.0 T2 TEST_2
2 1.0 8.0 T1 ... NaN NaN NaN
3 3.0 12.0 T3 ... 45.0 T4 TEST_4
[3 rows x 8 columns]

Related

Splitting ID column in pandas dataframe to multiple columns

I have a pandas dataframe like below :
| ID | Value |
+----------+--------+
|1C16 | 34 |
|1C1 | 45 |
|7P.75 | 23 |
|7T1 | 34 |
|1C10DG | 34 |
+----------+--------+
I want to split the ID column (its a string column) in a way that looks like below:
| ID | Value | Code | Core |size |
+----------+--------+-------+------+-----+
|1C16 | 34 | C | 1 | 16 |
|1C1 | 45 | C | 1 | 1 |
|7P.75 | 23 | P | 7 | .75 |
|7T1 | 34 | T | 7 | 1 |
|1C10DG | 34 | C | 1 | 10 |
+----------+--------+-------+------+-----+
So how can this be achieved ? Thanks
You can try .str.extract with regex (?P<Code>\d+)(?P<Core>[A-Z])(?P<size>[.0-9]+) to capture the patterns:
df.ID.str.extract(r'(?P<Code>\d+)(?P<Core>[A-Z])(?P<size>[.0-9]+)')
# Code Core size
#0 1 C 16
#1 1 C 1
#2 7 P .75
#3 7 T 1
#4 1 C 10
use .str.extract() with multiple capturing groups & join
df.join(
df['ID'].str.extract('(\d)(\w)(\d+|.\d+)').rename(
columns={0 : 'Core', 1 : 'Code', 2 : 'Size'}))
ID Value Core Code Size
1 1C16 34.0 1 C 16
2 1C1 45.0 1 C 1
3 7P.75 23.0 7 P .75
4 7T1 34.0 7 T 1
5 1C10DG 34.0 1 C 10

rolling average and aggregate more than one column in pandas

How do I also aggregate the 'reviewer' lists together with average of 'quantities'?
For a data frame like below I can successfully calculate the average of the quantities per group over every 3 years. How do I add an extra column that aggregates the values of column 'reviewer, for every period as well? for example for company 'A' for year 1993, the column would be [[p1,p2],[p3,p2],[p4]].
df= pd.DataFrame(data=[
['A', 1990, 2,['p1','p2']],
['A', 1991,3,['p3','p2']],
['A', 1993,5,['p4']],
['A',2000,4,['p1','p5','p7']],
['B',2000,1, ['p3']],
['B',2001,2,['p6','p9']],
['B',2002,3,['p10','p1']]], columns=['company', 'year','quantity', 'reviewer'])
df['rolling_average'] = (df.groupby(['company'])
.rolling(3).agg({'quantity':'mean'}).reset_index(level=[0], drop=True))
The output currently looks like:
| index | company | year | quantity | reviewer | rolling_average |
| :---- | :------ | :--- | :------- | :------- | :-------------- |
| 0 | A | 1990 | 2 | [p1, p2] | NaN |
| 1 | A | 1991 | 3 | [p3, p2] | NaN |
| 2 | A | 1993 | 5 | [p4] | 3.33 |
| 3 | A | 2000 | 4 | [p5, p7] | 4.00 |
| 4 | B | 2000 | 1 | [p3] | NaN |
| 5 | B | 2001 | 2 | [p6, p9] | NaN |
| 6 | B | 2002 | 3 | [p10, p1]| 2.00 |
Since the rolling can not take non-numeric , we need self-define the rolling here
n = 3
df['new'] = df.groupby(['company'])['reviewer'].apply(lambda x :[x[y-n:y].tolist() if y>=n else np.nan for y in range(1,len(x)+1)]).explode().values
df
company year quantity reviewer new
0 A 1990 2 [p1, p2] NaN
1 A 1991 3 [p3, p2] NaN
2 A 1993 5 [p4] [[p1, p2], [p3, p2], [p4]]
3 A 2000 4 [p1, p5, p7] [[p3, p2], [p4], [p1, p5, p7]]
4 B 2000 1 [p3] NaN
5 B 2001 2 [p6, p9] NaN
6 B 2002 3 [p10, p1] [[p3], [p6, p9], [p10, p1]]

Calculating the difference in value between columns

I have a dataframe with YYYYMM columns that contain monthly totals on the row level:
| yearM | feature | 201902 | 201903 | 201904 | 201905 |... ... ... 202009
|-------|----------|--------|--------|--------|--------|
| 0 | feature1 | Nan | Nan | 9.0 | 32.0 |
| 1 | feature2 | 1.0 | 1.0 | 1.0 | 4.0 |
| 2 | feature3 | Nan | 1.0 | 4.0 | 8.0 |
| 3 | feature4 | 9.0 | 15.0 | 19.0 | 24.0 |
| 4 | feature5 | 33.0 | 67.0 | 99.0 | 121.0 |
| 5 | feature6 | 12.0 | 15.0 | 17.0 | 19.0 |
| 6 | feature7 | 1.0 | 8.0 | 15.0 | 20.0 |
| 7 | feature8 | Nan | Nan | 1.0 | 9.0 |
I would like to convert the totals to the monthly change. The feature column should be excluded as I need to keep the feature names. The yearM in the index is a result of pivoting a dataframe to get the YYYYMM on the column level.
This is how the output would look like:
| yearM | feature | 201902 | 201903 | 201904 | 201905 |... ... ... 202009
|-------|----------|--------|--------|--------|--------|
| 0 | feature1 | Nan | 0.0 | 9.0 | 23.0 |
| 1 | feature2 | 1.0 | 0.0 | 0.0 | 3.0 |
| 2 | feature3 | Nan | 1.0 | 3.0 | 5.0 |
| 3 | feature4 | 9.0 | 6.0 | 4.0 | 5 |
| 4 | feature5 | 33.0 | 34.0 | 32.0 | 22.0 |
| 5 | feature6 | 12.0 | 3.0 | 2.0 | 2.0 |
| 6 | feature7 | 1.0 | 7.0 | 7.0 | 5.0 |
| 7 | feature8 | Nan | 0.0 | 1.0 | 8.0 |
The row level values now represent the change compared to the previous month instead of having the total for the month.
I know that I should start by filling the NaN rows in the starting column 201902 with 0:
df['201902'] = df['201902'].fillna(0)
I could also calculate them one by one with something similar to this:
df['201902'] = df['201902'].fillna(0) - df['201901'].fillna(0)
df['201903'] = df['201903'].fillna(0) - df['201902'].fillna(0)
df['201904'] = df['201904'].fillna(0) - df['201903'].fillna(0)
...
...
Hopefully there's a smarter solution though
use iloc or drop to access the other columns, then diff with axis=1 for row-wise differences.
monthly_change = df.iloc[:, 1:].fillna(0).diff(axis=1)
# or
# monthly_change = df.drop(['feature'], axis=1).fillna(0).diff(axis=1)

Pandas finding intervals (of n-Days) and capturing start/end dates

This started its life as a list of activities. I first built a matrix similar to the one below to represent all activities, which I inverted to show all inactivity, before building the following matrix, where zero indicates an activity, and anything greater than zero indicates the number of days before the next activity.
+------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| Item | 01/08/2020 | 02/08/2020 | 03/08/2020 | 04/08/2020 | 05/08/2020 | 06/08/2020 | 07/08/2020 | 08/08/2020 | 09/08/2020 |
+------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B | 3 | 2 | 1 | 0 | 0 | 3 | 2 | 1 | 0 |
| C | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| D | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | 0 |
| E | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 |
+------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
Now I need to find suitable intervals for each Item. For instance, in this case I want to find all intervals with a minimum duration of 3-days.
+------+------------+------------+------------+------------+
| Item | 1_START | 1_END | 2_START | 2_END |
+------+------------+------------+------------+------------+
| A | NaN | NaN | NaN | NaN |
| B | 01/08/2020 | 03/08/2020 | 06/08/2020 | 08/08/2020 |
| C | NaN | NaN | NaN | NaN |
| D | 01/08/2020 | 07/08/2020 | NaN | NaN |
| E | 01/08/2020 | NaN | NaN | NaN |
+------+------------+------------+------------+------------+
In reality the data is 700+ columns wide and 1,000+ rows. How can I do this efficiently?

Spark Window Functions: calculated once per frame/range?

This is a question about Window Functions in Spark.
Assume I have this DF
DATE_S | ID | STR | VALUE
-------------------------
1 | 1 | A | 0.5
1 | 1 | A | 1.23
1 | 1 | A | -0.4
2 | 1 | A | 2.0
3 | 1 | A | -1.2
3 | 1 | A | 0.523
1 | 2 | A | 1.0
2 | 2 | A | 2.5
3 | 2 | A | 1.32
3 | 2 | A | -3.34
1 | 1 | B | 1.5
1 | 1 | B | 0.23
1 | 1 | B | -0.3
2 | 1 | B | -2.0
3 | 1 | B | 1.32
3 | 1 | B | 523.0
1 | 2 | B | 1.3
2 | 2 | B | -0.5
3 | 2 | B | 4.3243
3 | 2 | B | 3.332
This is just an example! Assume that there are many more DATE_S for each (ID, STR), many more IDs and STRs, and many more entries per (DATE_S, ID, STR). Obviously there are multiple values per Combination (DATE_S, ID, STR)
Now I do this:
val w = Window.partitionBy("ID", "STR").orderBy("DATE_S").rangeBetween(-N, -1)
df.withColumn("RESULT", function("VALUE").over(w))
where N might lead to the inclusion of a large range of rows, from 100 to 100000 and more, depending on ("ID", "STR")
The result will be something like this
DATE_S | ID | STR | VALUE | RESULT
----------------------------------
1 | 1 | A | 0.5 | R1
1 | 1 | A | 1.23 | R1
1 | 1 | A | -0.4 | R1
2 | 1 | A | 2.0 | R2
3 | 1 | A | -1.2 | R3
3 | 1 | A | 0.523 | R3
1 | 2 | A | 1.0 | R4
2 | 2 | A | 2.5 | R5
3 | 2 | A | 1.32 | R6
3 | 2 | A | -3.34 | R7
1 | 1 | B | 1.5 | R8
1 | 1 | B | 0.23 | R8
1 | 1 | B | -0.3 | R9
2 | 1 | B | -2.0 | R10
3 | 1 | B | 1.32 | R11
3 | 1 | B | 523.0 | R11
1 | 2 | B | 1.3 | R12
2 | 2 | B | -0.5 | R13
3 | 2 | B | 4.3243| R14
3 | 2 | B | 3.332 | R14
There are identical "RESULT"s because for every row with identical (DATE_S, ID, ST), the values that go into the calculation of "function" are the same.
My question is this:
Does spark call "function" for each ROW (recalculating the same value multiple times) or calculate it once per range (frame?) of values and just pastes them on all rows that fall in the range?
Thanks for reading :)
From your data the result may not be the same if run twice from what I can see as there is no distinct ordering possibility. But we leave that aside.
Whilst there is codegen optimization, it is nowhere to be found that it checks in the way you state for if the next invocation is the same set of data to process for the next row. I have never read of that type of optimization. There is fusing due to lazy evaluation approach, but that is another matter. So, per row it calculates again.
From a great source: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-windows.html
... At its core, a window function calculates a return value for every
input row of a table based on a group of rows, called the frame. Every
input row can have a unique frame associated with it. ...
... In other words, when executed, a window function computes a value
for each and every row in a window (per window specification). ...
The biggest issue is to have suitable number of partitions for parallel processing, which is expensive, but this is big data. partitionBy("ID", "STR") is the clue here and that is a good thing.

Resources