Calculating the difference in value between columns - python-3.x

I have a dataframe with YYYYMM columns that contain monthly totals on the row level:
| yearM | feature  | 201902 | 201903 | 201904 | 201905 | ... 202009 |
|-------|----------|--------|--------|--------|--------|------------|
| 0     | feature1 | NaN    | NaN    | 9.0    | 32.0   | ...        |
| 1     | feature2 | 1.0    | 1.0    | 1.0    | 4.0    | ...        |
| 2     | feature3 | NaN    | 1.0    | 4.0    | 8.0    | ...        |
| 3     | feature4 | 9.0    | 15.0   | 19.0   | 24.0   | ...        |
| 4     | feature5 | 33.0   | 67.0   | 99.0   | 121.0  | ...        |
| 5     | feature6 | 12.0   | 15.0   | 17.0   | 19.0   | ...        |
| 6     | feature7 | 1.0    | 8.0    | 15.0   | 20.0   | ...        |
| 7     | feature8 | NaN    | NaN    | 1.0    | 9.0    | ...        |
I would like to convert the totals to the monthly change. The feature column should be excluded as I need to keep the feature names. The yearM in the index is a result of pivoting a dataframe to get the YYYYMM on the column level.
This is what the output should look like:
| yearM | feature  | 201902 | 201903 | 201904 | 201905 | ... 202009 |
|-------|----------|--------|--------|--------|--------|------------|
| 0     | feature1 | NaN    | 0.0    | 9.0    | 23.0   | ...        |
| 1     | feature2 | 1.0    | 0.0    | 0.0    | 3.0    | ...        |
| 2     | feature3 | NaN    | 1.0    | 3.0    | 4.0    | ...        |
| 3     | feature4 | 9.0    | 6.0    | 4.0    | 5.0    | ...        |
| 4     | feature5 | 33.0   | 34.0   | 32.0   | 22.0   | ...        |
| 5     | feature6 | 12.0   | 3.0    | 2.0    | 2.0    | ...        |
| 6     | feature7 | 1.0    | 7.0    | 7.0    | 5.0    | ...        |
| 7     | feature8 | NaN    | 0.0    | 1.0    | 8.0    | ...        |
The row level values now represent the change compared to the previous month instead of having the total for the month.
I know that I should start by filling the NaN rows in the starting column 201902 with 0:
df['201902'] = df['201902'].fillna(0)
I could also calculate them one by one with something similar to this:
df['201902'] = df['201902'].fillna(0) - df['201901'].fillna(0)
df['201903'] = df['201903'].fillna(0) - df['201902'].fillna(0)
df['201904'] = df['201904'].fillna(0) - df['201903'].fillna(0)
...
...
Hopefully there's a smarter solution, though.

Use iloc or drop to exclude the feature column, then call diff(axis=1) for row-wise differences.
monthly_change = df.iloc[:, 1:].fillna(0).diff(axis=1)
# or
# monthly_change = df.drop(['feature'], axis=1).fillna(0).diff(axis=1)
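Putting it together as a runnable sketch (assuming a frame shaped like the one in the question, with a feature column plus YYYYMM columns; note that diff leaves the first month as NaN, so it is restored from the original column here to match the desired output):
import numpy as np
import pandas as pd

# two rows from the question's frame, just for illustration
df = pd.DataFrame({'feature': ['feature1', 'feature2'],
                   '201902': [np.nan, 1.0], '201903': [np.nan, 1.0],
                   '201904': [9.0, 1.0], '201905': [32.0, 4.0]})

months = df.columns.drop('feature')           # every YYYYMM column
monthly_change = df[months].fillna(0).diff(axis=1)
monthly_change[months[0]] = df[months[0]]     # diff leaves the first month NaN; keep its original totals
result = pd.concat([df[['feature']], monthly_change], axis=1)
print(result)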

Related

How to create new columns in pandas dataframe using column values?

I'm working in Python with a pandas DataFrame similar to:
| REQUESET_ID | DESCR | TEST | TEST_DESC | RESULT |
|-------------|-------|------|-----------|--------|
| 1           | 1     | T1   | TEST_1    | 2.0    |
| 1           | 2     | T2   | TEST_2    | 92.0   |
| 2           | 1     | T1   | TEST_1    | 8.0    |
| 3           | 3     | T3   | TEST_3    | 12.0   |
| 3           | 4     | T4   | TEST_4    | 45.0   |
What I want is a final dataframe like this:
| REQUESET_ID | DESCR_1 | TEST_1 | TEST_DESC_1 | RESULT_1 | DESCR_2 | TEST_2 | TEST_DESC_2 | RESULT_2 |
|-------------|---------|--------|-------------|----------|---------|--------|-------------|----------|
| 1           | 1       | T1     | TEST_1      | 2.0      | 2       | T2     | TEST_2      | 92.0     |
| 2           | 1       | T1     | TEST_1      | 8.0      | NaN     | NaN    | NaN         | NaN      |
| 3           | 3       | T3     | TEST_3      | 12.0     | 4       | T4     | TEST_4      | 45.0     |
How should I implement this with DataFrame methods? I understand that if I try to do it with a merge, instead of getting 4x2 new columns (because value_counts on REQUESET_ID returns at most 2), it will add the 4 columns for each entry in the request column.
Assign a helper column with cumcount, then set_index + unstack:
s = (df.assign(col=(df.groupby('REQUESET_ID').cumcount() + 1).astype(str))
       .set_index(['REQUESET_ID', 'col'])
       .unstack()
       .sort_index(level=1, axis=1))
s.columns = s.columns.map('_'.join)
s
DESCR_1 RESULT_1 TEST_1 ... RESULT_2 TEST_2 TEST_DESC_2
REQUESET_ID ...
1 1.0 2.0 T1 ... 92.0 T2 TEST_2
2 1.0 8.0 T1 ... NaN NaN NaN
3 3.0 12.0 T3 ... 45.0 T4 TEST_4
[3 rows x 8 columns]
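For what it's worth, the same reshape can also be sketched with pivot instead of set_index + unstack (same helper column; this is just an alternative spelling, not a different result):
# sketch: pivot with no values argument keeps all remaining columns as a MultiIndex
s = df.assign(col=(df.groupby('REQUESET_ID').cumcount() + 1).astype(str))
s = s.pivot(index='REQUESET_ID', columns='col')   # columns become (DESCR, 1), (TEST, 1), ...
s = s.sort_index(level=1, axis=1)                 # group the _1 columns before the _2 columns
s.columns = s.columns.map('_'.join)               # flatten to DESCR_1, TEST_1, ...
s = s.reset_index()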

Pandas finding intervals (of n-Days) and capturing start/end dates

This started life as a list of activities. I first built a matrix similar to the one below to represent all activities, inverted it to show inactivity, and then built the following matrix, where zero indicates an activity and anything greater than zero indicates the number of days until the next activity.
+------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| Item | 01/08/2020 | 02/08/2020 | 03/08/2020 | 04/08/2020 | 05/08/2020 | 06/08/2020 | 07/08/2020 | 08/08/2020 | 09/08/2020 |
+------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B | 3 | 2 | 1 | 0 | 0 | 3 | 2 | 1 | 0 |
| C | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| D | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | 0 |
| E | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 |
+------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
Now I need to find suitable intervals for each Item. For instance, in this case I want to find all intervals with a minimum duration of 3-days.
+------+------------+------------+------------+------------+
| Item | 1_START | 1_END | 2_START | 2_END |
+------+------------+------------+------------+------------+
| A | NaN | NaN | NaN | NaN |
| B | 01/08/2020 | 03/08/2020 | 06/08/2020 | 08/08/2020 |
| C | NaN | NaN | NaN | NaN |
| D | 01/08/2020 | 07/08/2020 | NaN | NaN |
| E | 01/08/2020 | NaN | NaN | NaN |
+------+------------+------------+------------+------------+
In reality the data is 700+ columns wide and 1,000+ rows. How can I do this efficiently?
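No answer is recorded here, but one possible sketch (the function name and layout are assumptions: it expects the matrix above as a DataFrame with Item as the index and one column per date) is to label runs of consecutive non-zero values, keep only the runs lasting at least min_days, and leave the end date empty when a run is still open at the last column:
import numpy as np
import pandas as pd

def inactivity_intervals(gaps, min_days=3):
    # gaps: DataFrame indexed by Item, one column per date, values as in the matrix above
    dates = gaps.columns
    records = {}
    for item, row in gaps.iterrows():
        inactive = row.to_numpy() > 0
        # label runs of consecutive equal values (True = inactive day)
        run_id = np.cumsum(np.r_[True, inactive[1:] != inactive[:-1]])
        flat = {}
        n = 0
        for rid in np.unique(run_id):
            mask = run_id == rid
            if inactive[mask][0] and mask.sum() >= min_days:
                n += 1
                flat[f'{n}_START'] = dates[mask][0]
                # a run that touches the last column has not ended yet -> no end date
                flat[f'{n}_END'] = np.nan if mask[-1] else dates[mask][-1]
        records[item] = flat
    return pd.DataFrame.from_dict(records, orient='index')

# usage sketch: intervals = inactivity_intervals(df.set_index('Item'), min_days=3)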

How do I create a Step Graph in Excel?

Suppose I have a set of values: <0-4.5, 1>, <4.6-9.3, 2>, <9.4-12.2, 3> and I want to display a simple three-step graph. One step would span from zero to 4.5 with a height of 1, the second step from 4.6 to 9.3 with a height of 2, and so on.
How do I do it in Excel?
Edit: A hack would be to generate pairs: <0,1>, <0.1,1>...<4.4,1>, <4.5,1> and use a scatter graph. But, really!
I'm not sure exactly how you want the graph to look, but another workaround is to select a scatter joined by lines. If you want a gap anywhere, put in a missing value:
+------+---+
| x | y |
+------+---+
| 0 | 0 |
| 0 | 1 |
| 4.5 | 1 |
| 4.5 | 0 |
| | |
| 4.6 | 0 |
| 4.6 | 2 |
| 9.3 | 2 |
| 9.3 | 0 |
| | |
| 9.4 | 0 |
| 9.4 | 3 |
| 12.2 | 3 |
| 12.2 | 0 |
+------+---+
Use the following data set:
| x | y |
|------|---|
| 0.1 | 1 |
| 0.2 | 1 |
| 0.3 | 1 |
| 0.4 | 1 |
| 0.5 | 1 |
| 0.6 | 1 |
| 0.7 | 1 |
| 0.8 | 1 |
| 0.9 | 1 |
| 1 | 1 |
| 1.1 | 1 |
| 1.2 | 1 |
| 1.3 | 1 |
| 1.4 | 1 |
| 1.5 | 1 |
| 1.6 | 1 |
| 1.7 | 1 |
| 1.8 | 1 |
| 1.9 | 1 |
| 2 | 1 |
| 2.1 | 1 |
| 2.2 | 1 |
| 2.3 | 1 |
| 2.4 | 1 |
| 2.5 | 1 |
| 2.6 | 1 |
| 2.7 | 1 |
| 2.8 | 1 |
| 2.9 | 1 |
| 3 | 1 |
| 3.1 | 1 |
| 3.2 | 1 |
| 3.3 | 1 |
| 3.4 | 1 |
| 3.5 | 1 |
| 3.6 | 1 |
| 3.7 | 1 |
| 3.8 | 1 |
| 3.9 | 1 |
| 4 | 1 |
| 4.1 | 1 |
| 4.2 | 1 |
| 4.3 | 1 |
| 4.4 | 1 |
| 4.5 | 1 |
| 4.6 | 2 |
| 4.7 | 2 |
| 4.8 | 2 |
| 4.9 | 2 |
| 5 | 2 |
| 5.1 | 2 |
| 5.2 | 2 |
| 5.3 | 2 |
| 5.4 | 2 |
| 5.5 | 2 |
| 5.6 | 2 |
| 5.7 | 2 |
| 5.8 | 2 |
| 5.9 | 2 |
| 6 | 2 |
| 6.1 | 2 |
| 6.2 | 2 |
| 6.3 | 2 |
| 6.4 | 2 |
| 6.5 | 2 |
| 6.6 | 2 |
| 6.7 | 2 |
| 6.8 | 2 |
| 6.9 | 2 |
| 7 | 2 |
| 7.1 | 2 |
| 7.2 | 2 |
| 7.3 | 2 |
| 7.4 | 2 |
| 7.5 | 2 |
| 7.6 | 2 |
| 7.7 | 2 |
| 7.8 | 2 |
| 7.9 | 2 |
| 8 | 2 |
| 8.1 | 2 |
| 8.2 | 2 |
| 8.3 | 2 |
| 8.4 | 2 |
| 8.5 | 2 |
| 8.6 | 2 |
| 8.7 | 2 |
| 8.8 | 2 |
| 8.9 | 2 |
| 9 | 2 |
| 9.1 | 2 |
| 9.2 | 2 |
| 9.3 | 2 |
| 9.4 | 3 |
| 9.5 | 3 |
| 9.6 | 3 |
| 9.7 | 3 |
| 9.8 | 3 |
| 9.9 | 3 |
| 10 | 3 |
| 10.1 | 3 |
| 10.2 | 3 |
| 10.3 | 3 |
| 10.4 | 3 |
| 10.5 | 3 |
| 10.6 | 3 |
| 10.7 | 3 |
| 10.8 | 3 |
| 10.9 | 3 |
| 11 | 3 |
| 11.1 | 3 |
| 11.2 | 3 |
| 11.3 | 3 |
| 11.4 | 3 |
| 11.5 | 3 |
| 11.6 | 3 |
| 11.7 | 3 |
| 11.8 | 3 |
| 11.9 | 3 |
| 12 | 3 |
| 12.1 | 3 |
Highlight the data set, insert a chart via Recommended Charts, and pick the one you prefer. It can be a bar chart or a line chart rather than a scatter chart; the preparation is similar either way. Charting is not that intuitive in Excel and sometimes needs workarounds.
Cheers :)

Detect consecutive timestamps with all rows with NaN values in pandas

I would like to detect in a dataframe the start and end (Datetime) of consecutive sets of rows with all the values being NaN.
What is the best way to store the results in a array of tuples with the start and end of each set of datetimes with NaN values?
For example, using the dataframe below, the list of tuples should look like this:
missing_datetimes = [('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
                     ('2018-10-11 02:00:00', '2018-10-11 02:00:00'),
                     ('2018-10-11 04:00:00', '2018-10-11 04:00:00')]
Example of dataframe:
+------------+---------------------+------------+------------+
| geo_id | Datetime | Variable1 | Variable2 |
+------------+---------------------+------------+------------+
| 1 | 2018-10-10 18:00:00 | 20 | 10 |
| 2 | 2018-10-10 18:00:00 | 22 | 10 |
| 1 | 2018-10-10 19:00:00 | 20 | nan |
| 2 | 2018-10-10 19:00:00 | 21 | nan |
| 1 | 2018-10-10 20:00:00 | 30 | nan |
| 2 | 2018-10-10 20:00:00 | 30 | nan |
| 1 | 2018-10-10 21:00:00 | nan | 5 |
| 2 | 2018-10-10 21:00:00 | nan | 5 |
| 1 | 2018-10-10 22:00:00 | nan | nan |
| 1 | 2018-10-10 23:00:00 | nan | nan |
| 1 | 2018-10-11 00:00:00 | nan | nan |
| 1 | 2018-10-11 01:00:00 | 5 | 2 |
| 1 | 2018-10-11 02:00:00 | nan | nan |
| 1 | 2018-10-11 03:00:00 | 2 | 1 |
| 1 | 2018-10-11 04:00:00 | nan | nan |
+------------+---------------------+------------+------------+
Update: And what if some datetimes are duplicated?
You may need to use groupby with a condition:
# find the rows where every variable is NaN; consecutive ones are pulled into one group
s = df.drop(columns=['geo_id', 'Datetime']).isnull().all(axis=1)
df.loc[s, 'Datetime'].groupby((~s).cumsum()[s]).agg(['first', 'last']).apply(tuple, axis=1).tolist()
Out[89]:
[('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
 ('2018-10-11 02:00:00', '2018-10-11 02:00:00'),
 ('2018-10-11 04:00:00', '2018-10-11 04:00:00')]
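Regarding the update about duplicated datetimes, a rough sketch under one assumption (a timestamp only counts as missing when every row sharing that timestamp is all-NaN) is to collapse the duplicates first and then reuse the same grouping trick:
# collapse duplicate timestamps, then group consecutive missing ones as before
nan_by_time = (df.drop(columns='geo_id')
                 .set_index('Datetime')
                 .isnull().all(axis=1)
                 .groupby(level=0).all()
                 .sort_index())
s = nan_by_time
times = s.index.to_series()
missing_datetimes = (times[s]
                     .groupby((~s).cumsum()[s])
                     .agg(['first', 'last'])
                     .apply(tuple, axis=1)
                     .tolist())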

How to pretty print the csv which has long columns from command line?

I want to view and pretty-print this CSV file from the command line. For this I am using csvlook nupic_out.csv | less -#2 -N -S. The problem is that this CSV file has one very long column (the 5th, multiStepPredictions.1). Everything up to this column is displayed properly:
1 -----------------+--------------------+-----------------------------+------------------------------------------------------------------------------------------------------------------------------------
2 angle | sine | multiStepPredictions.actual | multiStepPredictions.1
3 -----------------+--------------------+-----------------------------+------------------------------------------------------------------------------------------------------------------------------------
4 string | string | string | string
5 | | |
6 0.0 | 0.0 | 0.0 | None
7 0.0314159265359 | 0.0314107590781 | 0.0314107590781 | {0.0: 1.0}
8 0.0628318530718 | 0.0627905195293 | 0.0627905195293 | {0.0: 0.0039840637450199202 0.03141075907812829: 0.99601593625497931}
9 0.0942477796077 | 0.0941083133185 | 0.0941083133185 | {0.03141075907812829: 1.0}
10 0.125663706144 | 0.125333233564 | 0.125333233564 | {0.06279051952931337: 0.98942669172932329 0.03141075907812829: 0.010573308270676691}
11 0.157079632679 | 0.15643446504 | 0.15643446504 | {0.03141075907812829: 0.0040463956041429626 0.09410831331851431: 0.94917381047888194 0.06279051952931337: 0.04677979391
12 0.188495559215 | 0.187381314586 | 0.187381314586 | {0.12533323356430426: 0.85789473684210527 0.09410831331851431: 0.14210526315789476}
13 0.219911485751 | 0.218143241397 | 0.218143241397 | {0.15643446504023087: 0.63177315983686211 0.12533323356430426: 0.26859584385317475 0.09410831331851431: 0.09963099630
14 0.251327412287 | 0.248689887165 | 0.248689887165 | {0.06279051952931337: 0.3300438596491227 0.1873813145857246: 0.47381368550527647 0.15643446504023087: 0.12643231695
15 0.282743338823 | 0.278991106039 | 0.278991106039 | {0.21814324139654254: 0.56140350877192935 0.03141075907812829: 0.0032894736842105313 0.1873813145857246: 0.105263157894
16 0.314159265359 | 0.309016994375 | 0.309016994375 | {0.2486898871648548: 0.8228480378168288 0.03141075907812829: 0.0029688002160632981 0.1873813145857246: 0.022936632244020292
17 0.345575191895 | 0.338737920245 | 0.338737920245 | {0.2486898871648548: 0.13291723147401985 0.2789911060392293: 0.77025390613412514 0.21814324139654254: 0.06654338668
18 0.376991118431 | 0.368124552685 | 0.368124552685 | {0.2486898871648548: 0.10230061459892241 0.2789911060392293: 0.14992465949587844 0.21814324139654254: 0.06517018413
19 0.408407044967 | 0.397147890635 | 0.397147890635 | {0.33873792024529137: 0.67450197451277849 0.2486898871648548: 0.028274124758268366 0.2789911060392293: 0.077399230934
20 0.439822971503 | 0.425779291565 | 0.425779291565 | {0.33873792024529137: 0.17676914536466748 0.3681245526846779: 0.6509556160617509 0.2486898871648548: 0.04784688995215327
21 0.471238898038 | 0.45399049974 | 0.45399049974 | {0.33873792024529137: 0.038582651338955089 0.3681245526846779: 0.14813277049357607 0.2486898871648548: 0.029239766081
22 0.502654824574 | 0.481753674102 | 0.481753674102 | {0.3681245526846779: 0.035163881050575212 0.42577929156507266: 0.61447711863333254 0.2486898871648548: 0.015554881705
23 0.53407075111 | 0.50904141575 | 0.50904141575 | {0.33873792024529137: 0.076923076923077108 0.42577929156507266: 0.11307647489430354 0.45399049973954675: 0.66410206612
24 0.565486677646 | 0.535826794979 | 0.535826794979 | {0.42577929156507266: 0.035628438284964516 0.45399049973954675: 0.22906083786048709 0.3971478906347806: 0.014132015120
25 0.596902604182 | 0.562083377852 | 0.562083377852 | {0.5090414157503713: 0.51578106597362727 0.45399049973954675: 0.095000708551421106 0.06279051952931337: 0.08649420683
26 0.628318530718 | 0.587785252292 | 0.587785252292 | {0.5090414157503713: 0.10561370056909389 0.45399049973954675: 0.063130123291224485 0.5358267949789967: 0.617348556187
27 0.659734457254 | 0.612907053653 | 0.612907053653 | {0.5090414157503713: 0.036017118165629407 0.45399049973954675: 0.013316643552779454 0.5358267949789967: 0.236874795987
28 0.69115038379 | 0.637423989749 | 0.637423989749 | {0.2486898871648548: 0.037593984962406228 0.21814324139654254: 0.033834586466165564 0.5358267949789967: 0.085397996837
29 0.722566310326 | 0.661311865324 | 0.661311865324 | {0.6129070536529765: 0.49088597257034694 0.2486898871648548: 0.072573707671854309 0.06279051952931337: 0.04684445139
30 0.753982236862 | 0.684547105929 | 0.684547105929 | {0.6129070536529765: 0.16399317807418579 0.2486898871648548: 0.066194656736965368 0.2789911060392293: 0.015074193295
But everything displayed after this column is garbage:
1 --------------------------------------------------------------------------------------------------------+--------------+---------------------------------+----------------------------+--------------+---
2 | anomalyScore | multiStepBestPredictions.actual | multiStepBestPredictions.1 | anomalyLabel | mu
3 --------------------------------------------------------------------------------------------------------+--------------+---------------------------------+----------------------------+--------------+---
4 | string | string | string | string | fl
5 | | | | |
6 | 1.0 | 0.0 | None | [] | 0
7 | 1.0 | 0.0314107590781 | 0.0 | [] | 10
8 | 1.0 | 0.0627905195293 | 0.0314107590781 | []
9 | 1.0 | 0.0941083133185 | 0.0314107590781 | [] | 66
10 | 1.0 | 0.125333233564 | 0.0627905195293 | []
11 | 1.0 | 0.15643446504 | 0.0941083133185 | []
12 | 1.0 | 0.187381314586 | 0.125333233564 | []
13 | 1.0 | 0.218143241397 | 0.15643446504 | []
14 | 1.0 | 0.248689887165 | 0.187381314586
15 | 1.0 | 0.278991106039 | 0.218143241397 |
16 | 1.0 | 0.309016994375 | 0.248689887165 | []
17 | 1.0 | 0.338737920245 | 0.278991106039
18 075907812829: 0.0008726186745285988 0.3090169943749474: 0.36571033632089267 0.15643446504023087: 0.15263157894736851} | 1.0 | 0.368124552685 | 0.30
19 69943749474: 0.12243639244611626 0.15643446504023087: 0.076923076923077024} | 1.0 | 0.397147890635 | 0.33873792
20 474: 0.042824288244468607} | 1.0 | 0.425779291565 | 0.368124552685
21 78906347806: 0.72014752277063943 0.3090169943749474: 0.019779736758565116} | 1.0 | 0.45399049974 | 0.39714789
22 323356430426: 0.030959752321981428 0.09410831331851431: 0.027863777089783253} | 1.0 | 0.481753674102 | 0.425779291
23 831331851431: 0.036437246963562819} | 1.0 | 0.50904141575 | 0.45399049974
24 831331851431: 0.011027980232581683} | 1.0 | 0.535826794979 | 0.481753674102
25 929156507266: 0.027856989831229011 0.15643446504023087: 0.02066616653788458 0.09410831331851431: 0.016739594895686508} | 1.0 | 0.562083377852 | 0.5090
26 13145857246: 0.08333333333333337 0.42577929156507266: 0.025020076940584089} | 1.0 | 0.587785252292 | 0.5358
27 075907812829: 0.0025974025974026035 0.5620833778521306: 0.59566175023106149} | 1.0 | 0.612907053653 | 0.5620833778
28 33778521306: 0.19639042255084313} one | 1.0 | 0.637423989749 | 0.587785252292
29 13145857246: 0.0046487548012272466 0.21814324139654254: 0.070071166027997234 0.5620833778521306: 0.087432430700408653} | 1.0 | 0.661311865324 | 0.612
30 39897486896: 0.53158336716673826 0.3090169943749474: 0.016749103661249369 0.5620833778521306: 0.027323827946545261} | 1.0 | 0.684547105929 | 0.6
How can I pretty-print the whole CSV?
PS: The following commands also produce similar garbage (inspiration here):
column -s, -t < nupic_out.csv | less -#2 -N -S
csvtool readable nupic_out.csv | less -#2 -N -S
I believe that csvlook is treating the tab characters in that column just like any other character, and doesn't know about their special behaviour.
The easiest way to get the columns to line up is to minimally expand the tabs:
expand -t1 nupic_out.csv | csvlook
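If the shell tools keep tripping over those embedded tabs, a rough pandas sketch is another option (the 60-character cut-off is arbitrary; nupic_out.csv is the file from the question): replace the tabs and clip the long cells before printing.
import pandas as pd

# read everything as strings, swap tabs for spaces, and clip long cells so columns line up
df = pd.read_csv('nupic_out.csv', dtype=str)
clipped = df.apply(lambda col: col.str.replace('\t', ' ').str.slice(0, 60))
print(clipped.to_string(index=False))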
