Why is pivot_longer creating duplicate rows in my data frame?

I am using pivot_longer to clean up my data frame and it created three duplicate rows for each plant species.
Here is a sample of the data frame:
   latin_name           plant_type treatment       `ounces per plot` weight            `grams per plot`
   <chr>                <chr>      <chr>                       <dbl> <chr>                        <dbl>
 1 Agastache foeniculum wildflower tx._one_rate               0.021  wt_per_plot_one              0.609
 2 Agastache foeniculum wildflower tx._one_rate               0.021  wt_per_plot_two              1.22
 3 Agastache foeniculum wildflower tx._one_rate               0.021  wt_per_plot_three            1.83
 4 Agastache foeniculum wildflower tx._one_rate               0.021  wt_per_plot_four             2.44
 5 Agastache foeniculum wildflower tx._two._rate              0.0430 wt_per_plot_one              0.609
 6 Agastache foeniculum wildflower tx._two._rate              0.0430 wt_per_plot_two              1.22
 7 Agastache foeniculum wildflower tx._two._rate              0.0430 wt_per_plot_three            1.83
 8 Agastache foeniculum wildflower tx._two._rate              0.0430 wt_per_plot_four             2.44
 9 Agastache foeniculum wildflower tx._three._rate            0.0645 wt_per_plot_one              0.609
10 Agastache foeniculum wildflower tx._three._rate            0.0645 wt_per_plot_two              1.22
And here is the code chunk I used:
library(tidyverse)
seed_rate1 <- seed_rate1 %>%
  pivot_longer(
    cols = starts_with("tx"),
    names_to = "treatment",
    values_to = "ounces per plot") %>%
  pivot_longer(
    cols = starts_with("wt"),
    names_to = "weight",
    values_to = "grams per plot")
And here are 12 rows and all columns of the data using dput():
seed_struct <-
structure(structure(
list(
latin_name = c(
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum",
"Agastache foeniculum"
),
plant_type = c(
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower",
"wildflower"
),
treatment = c(
"tx._one_rate",
"tx._one_rate",
"tx._one_rate",
"tx._one_rate",
"tx._two._rate",
"tx._two._rate",
"tx._two._rate",
"tx._two._rate",
"tx._three._rate",
"tx._three._rate",
"tx._three._rate",
"tx._three._rate"
),
`ounces per plot` = c(
0.021,
0.021,
0.021,
0.021,
0.042975207,
0.042975207,
0.042975207,
0.042975207,
0.06446281,
0.06446281,
0.06446281,
0.06446281
),
weight = c(
"wt_per_plot_one",
"wt_per_plot_two",
"wt_per_plot_three",
"wt_per_plot_four",
"wt_per_plot_one",
"wt_per_plot_two",
"wt_per_plot_three",
"wt_per_plot_four",
"wt_per_plot_one",
"wt_per_plot_two",
"wt_per_plot_three",
"wt_per_plot_four"
),
`grams per plot` = c(
0.609,
1.218,
1.827,
2.437,
0.609,
1.218,
1.827,
2.437,
0.609,
1.218,
1.827,
2.437
)
),
row.names = c(NA,-12L),
class = c("tbl_df", "tbl", "data.frame")
))
I can't figure out how to fix this. I tried distinct() with no luck either. I am new to R, so I suspect I am doing something wrong in pivot_longer. If anyone has any input I would greatly appreciate it; very specific, easy-to-understand responses would be much appreciated!

Your dput names the data seed_struct, but your code calls pivot_longer on seed_rate1. As it is not clear where seed_rate1 comes from, I will work with seed_struct from your dput.
cols = starts_with("tx") in pivot_longer selects columns whose names start with "tx". In seed_struct, the tx values are categories stored inside the treatment column, not column names, so pivoting on them does not make sense: that variable is already in long format.
If you want to combine the variables treatment and weight into a long format, you can do this:
library(tidyverse)
seed_struct |>
  pivot_longer(cols = c(treatment, weight))
#> # A tibble: 24 × 6
#> latin_name plant_type `ounces per plot` grams per plo…¹ name value
#> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 Agastache foeniculum wildflower 0.021 0.609 trea… tx._…
#> 2 Agastache foeniculum wildflower 0.021 0.609 weig… wt_p…
#> 3 Agastache foeniculum wildflower 0.021 1.22 trea… tx._…
#> 4 Agastache foeniculum wildflower 0.021 1.22 weig… wt_p…
#> 5 Agastache foeniculum wildflower 0.021 1.83 trea… tx._…
#> 6 Agastache foeniculum wildflower 0.021 1.83 weig… wt_p…
#> 7 Agastache foeniculum wildflower 0.021 2.44 trea… tx._…
#> 8 Agastache foeniculum wildflower 0.021 2.44 weig… wt_p…
#> 9 Agastache foeniculum wildflower 0.0430 0.609 trea… tx._…
#> 10 Agastache foeniculum wildflower 0.0430 0.609 weig… wt_p…
#> # … with 14 more rows, and abbreviated variable name ¹​`grams per plot`
Because two columns are stacked into one, your original data frame of 12 rows turns into one with 24 rows.
Alternatively, you may actually want to make the data frame wide instead of long. This can be done with pivot_wider, taking the new column names from treatment and weight and the values from their matching measurement columns.
seed_struct |>
  pivot_wider(names_from = treatment, values_from = `ounces per plot`) |>
  pivot_wider(names_from = weight, values_from = `grams per plot`)
#> # A tibble: 1 × 9
#> latin_name plant…¹ tx._o…² tx._t…³ tx._t…⁴ wt_pe…⁵ wt_pe…⁶ wt_pe…⁷ wt_pe…⁸
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Agastache foe… wildfl… 0.021 0.0430 0.0645 0.609 1.22 1.83 2.44
#> # … with abbreviated variable names ¹​plant_type, ²​tx._one_rate, ³​tx._two._rate,
#> # ⁴​tx._three._rate, ⁵​wt_per_plot_one, ⁶​wt_per_plot_two, ⁷​wt_per_plot_three,
#> # ⁸​wt_per_plot_four
This collapses your original data frame into a single row that holds all of the treatment and weight values.


Finding the Max of Excel Matrix Data based on Criteria from Matrix

I have data in a matrix, and I also have criteria in a matrix of the same shape; see below.
Data matrix
Period  0.0    30     45     60     75     90     105    120    135    150    180
6.0     0.356  0.443  0.469  0.505  0.579  0.525  0.516  0.475  0.342  0.271  0.171
7.0     0.439  0.541  0.558  0.678  0.802  0.642  0.747  0.499  0.436  0.336  0.232
8.0     0.505  0.544  0.591  0.694  0.759  0.747  0.736  0.584  0.560  0.467  0.269
9.0     0.489  0.614  0.618  0.630  0.791  0.687  0.631  0.577  0.507  0.562  0.340
10.0    0.538  0.603  0.572  0.580  0.703  0.643  0.619  0.556  0.489  0.459  0.399
11.0    0.503  0.491  0.513  0.578  0.585  0.630  0.587  0.542  0.439  0.459  0.345
12.0    0.517  0.446  0.539  0.588  0.546  0.564  0.552  0.497  0.411  0.412  0.355
13.0    0.470  0.439  0.545  0.534  0.530  0.482  0.510  0.470  0.422  0.404  0.329
14.0    0.399  0.427  0.469  0.442  0.462  0.434  0.409  0.425  0.382  0.395  0.340
15.0    0.370  0.390  0.388  0.397  0.421  0.393  0.355  0.387  0.355  0.341  0.331
Criteria for the matrix
Period  0.0  30  45  60  75  90  105  120  135  150  180
6.0     3    5   5   6   7   6   6    5    3    2    0
7.0     5    6   7   9   10  8   10   6    5    3    1
8.0     6    6   7   9   10  10  9    7    7    5    2
9.0     6    8   8   8   10  9   8    7    6    7    3
10.0    6    7   7   7   9   8   8    7    6    5    4
11.0    6    6   6   7   7   8   7    6    5    5    3
12.0    6    5   6   7   6   7   7    6    4    4    3
13.0    5    5   6   6   6   5   6    5    4    4    3
14.0    4    5   5   5   5   5   4    5    4    4    3
15.0    4    4   4   4   4   4   3    4    3    3    3
Is there any way to find the maximum for a given criteria number, e.g. 3 or 10? The number should be located in the criteria matrix, and the maximum should be taken from the data matrix at the matching positions.
For example, for number 10 the maximum should be taken from the data matrix at [7,75], [7,105], [8,75], [8,90] and [9,75].
I am looking for an Excel function or VBA to find the maximum of those values.
Thanks a lot for your help and thoughts on this.
Excel Function or Excel VBA
Assume both tables start (with header row and header column) in cell A1 of two sheets named Criteria and Data. SUMPRODUCT on its own would add up all matching values, so take the maximum inside it:
=SUMPRODUCT(MAX((Criteria!B2:L11=10)*(Data!B2:L11)))
Max in Matrix Using Criteria Matrix
If you have Microsoft 365 and the criteria are in the range N2:N12, in cell O2 of sheet Criteria you could use:
=MAX(TOCOL(($B$2:$L$11=N2)*Data!$B$2:$L$11))
or, equivalently, with the LET function (a good way to get familiar with it), in cell P2:
=LET(tCriteria,$B$2:$L$11,tData,Data!$B$2:$L$11,Criteria,N2,
MAX(TOCOL((tCriteria=Criteria)*tData)))
Copy either formula down.
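As a cross-check outside Excel, the same "maximum of the data cells whose criteria cell matches" logic can be sketched in plain Python. The names periods, columns, data, criteria and max_where are made up for this illustration; the numbers are copied from the two tables in the question. Unlike the multiply-by-mask formulas, this sketch filters the cells instead of zeroing them, so it would also behave sensibly if all matching values were negative.

```python
# Row labels (Period) and column labels, copied from the question's tables.
periods = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
columns = [0, 30, 45, 60, 75, 90, 105, 120, 135, 150, 180]

data = [
    [0.356, 0.443, 0.469, 0.505, 0.579, 0.525, 0.516, 0.475, 0.342, 0.271, 0.171],
    [0.439, 0.541, 0.558, 0.678, 0.802, 0.642, 0.747, 0.499, 0.436, 0.336, 0.232],
    [0.505, 0.544, 0.591, 0.694, 0.759, 0.747, 0.736, 0.584, 0.560, 0.467, 0.269],
    [0.489, 0.614, 0.618, 0.630, 0.791, 0.687, 0.631, 0.577, 0.507, 0.562, 0.340],
    [0.538, 0.603, 0.572, 0.580, 0.703, 0.643, 0.619, 0.556, 0.489, 0.459, 0.399],
    [0.503, 0.491, 0.513, 0.578, 0.585, 0.630, 0.587, 0.542, 0.439, 0.459, 0.345],
    [0.517, 0.446, 0.539, 0.588, 0.546, 0.564, 0.552, 0.497, 0.411, 0.412, 0.355],
    [0.470, 0.439, 0.545, 0.534, 0.530, 0.482, 0.510, 0.470, 0.422, 0.404, 0.329],
    [0.399, 0.427, 0.469, 0.442, 0.462, 0.434, 0.409, 0.425, 0.382, 0.395, 0.340],
    [0.370, 0.390, 0.388, 0.397, 0.421, 0.393, 0.355, 0.387, 0.355, 0.341, 0.331],
]

criteria = [
    [3, 5, 5, 6, 7, 6, 6, 5, 3, 2, 0],
    [5, 6, 7, 9, 10, 8, 10, 6, 5, 3, 1],
    [6, 6, 7, 9, 10, 10, 9, 7, 7, 5, 2],
    [6, 8, 8, 8, 10, 9, 8, 7, 6, 7, 3],
    [6, 7, 7, 7, 9, 8, 8, 7, 6, 5, 4],
    [6, 6, 6, 7, 7, 8, 7, 6, 5, 5, 3],
    [6, 5, 6, 7, 6, 7, 7, 6, 4, 4, 3],
    [5, 5, 6, 6, 6, 5, 6, 5, 4, 4, 3],
    [4, 5, 5, 5, 5, 5, 4, 5, 4, 4, 3],
    [4, 4, 4, 4, 4, 4, 3, 4, 3, 3, 3],
]


def max_where(target):
    """Return (value, period, column) of the largest data cell whose
    criteria cell equals target, or None if nothing matches."""
    matches = [
        (data[i][j], periods[i], columns[j])
        for i in range(len(periods))
        for j in range(len(columns))
        if criteria[i][j] == target
    ]
    # Tuples compare by their first element (the data value) first.
    return max(matches) if matches else None


print(max_where(10))
print(max_where(3))
```

For the tables in the question, criterion 10 selects the five cells listed by the asker, and the largest of them is 0.802 at period 7, column 75.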

How to subtract X rows in a dataframe with first value from another dataframe?

I am using pandas for this work.
I have two datasets. The first has approximately 6 million rows and 6 columns. For example, it looks something like this:
Date        Time  U  V  W  T
2020-12-30  2:34  3  4  5  7
2020-12-30  2:35  2  3  6  5
2020-12-30  2:36  1  5  8  5
2020-12-30  2:37  2  3  0  8
2020-12-30  2:38  4  4  5  7
2020-12-30  2:39  5  6  5  9
This is just the raw data collected from the machine.
The second dataset holds the average of every three consecutive rows for each column (U, V, W, T):
U     V     W     T
2     4     6.33  5.67
3.66  4.33  3.33  8
What I am trying to do is calculate the perturbation for each column per second:
U(perturbation) = U(raw) - U(avg)
where U(raw) comes from dataset 1 and U(avg) from dataset 2.
In other words, take the first three rows of the first column of dataset 1 and subtract from each of them the first value in the first column of dataset 2; then take the next three rows and subtract the second value, and so on. Do the same for all four columns (U, V, W, T).
The desired final output is the following:
Date        Time  U      V      W      T
2020-12-30  2:34   1      0     -1.33   1.33
2020-12-30  2:35   0     -1     -0.33  -0.67
2020-12-30  2:36  -1      1      1.67  -0.67
2020-12-30  2:37  -1.66  -1.33  -3.33   0
2020-12-30  2:38   0.34  -0.33   1.67  -1
2020-12-30  2:39   1.34   1.67   1.67   1
I am new to pandas and do not know how to approach this.
I hope it makes sense.
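One option is to merge the two frames on a 3-row group index:
import pandas as pd
# Tag each row of df1 with its 3-row group number, then attach the matching row of averages from df2
a = df1.assign(index = df1.index // 3).merge(df2.reset_index(), on='index')
# Raw columns get the suffix _x and average columns _y; subtract them position by position
b = a.filter(regex = '_x', axis=1) - a.filter(regex = '_y', axis = 1).to_numpy()
# Put the untouched columns (Date, Time, index) back next to the differences
pd.concat([a.filter(regex='^[^_]+$', axis = 1), b], axis = 1)
         Date  Time  index   U_x   V_x   W_x   T_x
0  2020-12-30  2:34      0  1.00  0.00 -1.33  1.33
1  2020-12-30  2:35      0  0.00 -1.00 -0.33 -0.67
2  2020-12-30  2:36      0 -1.00  1.00  1.67 -0.67
3  2020-12-30  2:37      1 -1.66 -1.33 -3.33  0.00
4  2020-12-30  2:38      1  0.34 -0.33  1.67 -1.00
5  2020-12-30  2:39      1  1.34  1.67  1.67  1.00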
You can use numpy:
import numpy as np
df1[df2.columns] -= np.repeat(df2.to_numpy(), 3, axis=0)
NB. This modifies df1 in place. If you want to keep the original, make a copy first (df_final = df1.copy()) and apply the subtraction to the copy. It also assumes that df1 has exactly three rows for every row of df2.
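If the averages are not already stored in a second frame, a variation on the answers above is to compute the 3-row means directly from the raw data with groupby and transform. This is only a sketch under that assumption; the names cols, block, window and perturbation are introduced here for illustration. Because it uses the exact means rather than the rounded values from the question (3.66), some results differ in the last digit, e.g. -1.67 instead of -1.66.

```python
import numpy as np
import pandas as pd

# Raw readings (dataset 1), copied from the question.
df1 = pd.DataFrame({
    "Date": ["2020-12-30"] * 6,
    "Time": ["2:34", "2:35", "2:36", "2:37", "2:38", "2:39"],
    "U": [3, 2, 1, 2, 4, 5],
    "V": [4, 3, 5, 3, 4, 6],
    "W": [5, 6, 8, 0, 5, 5],
    "T": [7, 5, 5, 8, 7, 9],
})

cols = ["U", "V", "W", "T"]
block = 3  # rows per averaging window, as in the question

# Label each row with the window it belongs to: 0, 0, 0, 1, 1, 1, ...
window = np.arange(len(df1)) // block

# Per-window means, repeated so they line up row by row with df1[cols].
window_mean = df1.groupby(window)[cols].transform("mean")

# Work on a copy so the raw data stay untouched.
perturbation = df1.copy()
perturbation[cols] = df1[cols] - window_mean

print(perturbation.round(2))
```

Using a positional window label rather than df1.index // 3 keeps the sketch working even when the index is not a plain 0..n-1 range.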

Color Outliers in LineChart

I am generating line charts with the following syntax:
df2 = df2[['runtime','per','dev','var']]
op = f"/tmp/image.png"
fig, ax = plt.subplots(facecolor='darkslategrey')
df2.plot(x='runtime',xlabel="Date", kind='line', marker='o',linewidth=2,alpha=.7,subplots=True,color=['khaki', 'lightcyan','thistle'])
plt.style.use('dark_background')
plt.suptitle(f'Historical Data:', fontsize=12,fontname = 'monospace')
#file output
plt.savefig(op, transparent=False,bbox_inches="tight")
plt.close('all')
Here is a sample of the df2 dataframe:
runtime per dev var
1 2021-05-28 50.85 2.11 2.13
1 2021-05-30 50.85 2.11 2.13
1 2021-06-02 51.13 2.16 2.11
1 2021-06-04 51.13 2.16 2.11
1 2021-06-07 51.13 2.16 2.11
1 2021-06-09 51.11 2.13 2.10
1 2021-06-10 51.11 2.13 2.10
1 2021-06-14 51.11 2.13 2.10
1 2021-06-16 51.34 2.12 2.10
1 2021-06-18 51.34 2.12 2.10
1 2021-06-21 51.34 2.12 2.10
1 2021-06-23 51.69 1.97 2.17
1 2021-06-25 51.69 1.97 2.17
1 2021-06-28 51.69 1.97 2.17
1 2021-06-30 56.46 1.74 2.14
1 2021-07-02 56.46 1.74 2.14
1 2021-07-05 56.46 1.74 2.14
1 2021-07-07 55.10 1.84 2.08
1 2021-07-09 55.10 1.84 2.08
1 2021-07-12 55.10 1.84 2.08
1 2021-07-14 54.58 1.85 2.07
1 2021-07-16 54.58 1.85 2.07
1 2021-07-19 54.58 1.85 2.07
1 2021-07-21 54.33 1.87 2.06
1 2021-07-23 54.33 1.87 2.06
1 2021-07-26 54.33 1.87 2.06
1 2021-07-28 54.98 1.91 2.19
1 2021-07-30 54.98 1.91 2.19
This works great.
Now I would like to color a point red if its value is "abnormal": specifically if per < 90.00 or per > 10.00, or if dev < 10.00, or if var < 10.00.
Is this possible?
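Instead of drawing the three subplots in one call, draw them one by one. First draw each line plot as before, then draw a scatter plot on top of it that contains only the "abnormal" points. zorder=3 makes sure that the red scatter dots appear on top of the existing markers. The example assumes a normal range of 10 to 90 for per and treats values above 10 as abnormal for dev and var; adjust the (min_normal, max_normal) pairs to your own thresholds.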
Here is some example code:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df2 = pd.DataFrame({'runtime': pd.date_range('20210101', freq='D', periods=100),
                    'per': np.random.uniform(1, 99, 100),
                    'dev': np.random.uniform(1, 11, 100),
                    'var': np.random.uniform(2, 11, 100)})
fig, axs = plt.subplots(nrows=3, figsize=(6, 10), facecolor='darkslategrey', sharex=True)
for ax, column, color, (min_normal, max_normal) in zip(axs,
                                                       ['per', 'dev', 'var'],
                                                       ['khaki', 'lightcyan', 'thistle'],
                                                       [(10, 90), (-np.inf, 10), (-np.inf, 10)]):
    df2.plot(x='runtime', xlabel="Date", y=column, ylabel=column,
             kind='line', marker='o', linewidth=2, alpha=.7, color=color, legend=False, ax=ax)
    df_abnormal = df2[(df2[column] < min_normal) | (df2[column] > max_normal)]
    df_abnormal.plot(x='runtime', xlabel="Date", y=column, ylabel=column,
                     kind='scatter', marker='o', color='red', legend=False, zorder=3, ax=ax)
plt.style.use('dark_background')
plt.suptitle(f'Historical Data:', fontsize=12, fontname='monospace')
plt.tight_layout()
plt.show()

How to compute prices from daily returns?

I have a dataframe with daily returns. I want to add another column with price series calculated based on daily returns.
This is the dataframe:
date daily
0 2020-09-01 0.0000
1 2020-09-02 0.0012
2 2020-09-03 -0.0005
3 2020-09-04 -0.0004
4 2020-09-07 0.0032
5 2020-09-08 -0.0015
6 2020-09-09 0.0005
7 2020-09-10 0.0003
8 2020-09-11 0.0001
9 2020-09-14 0.0043
10 2020-09-15 0.0037
11 2020-09-16 -0.0008
and this is the column of prices that I want to add:
prices
0 100.000000
1 100.120000
2 100.069940
3 100.029912
4 100.350008
5 100.199483
6 100.249582
7 100.279657
8 100.289685
9 100.720931
10 101.093598
11 101.012724
I've tried to loop over the column 'daily' and calculate the price, but the loop does not carry each new price forward into the next step of the list prz:
prz = []
for row in df['daily']:
    prz.append(100 * (1 + row))
First add 1, then use Series.cumprod, and finally multiply by 100. To invert the calculation, use Series.pct_change and replace the first NaN with 0:
df['prices'] = df['daily'].add(1).cumprod().mul(100)
df['back'] = df['prices'].pct_change().fillna(0)
print (df)
date daily prices back
0 2020-09-01 0.0000 100.000000 0.0000
1 2020-09-02 0.0012 100.120000 0.0012
2 2020-09-03 -0.0005 100.069940 -0.0005
3 2020-09-04 -0.0004 100.029912 -0.0004
4 2020-09-07 0.0032 100.350008 0.0032
5 2020-09-08 -0.0015 100.199483 -0.0015
6 2020-09-09 0.0005 100.249582 0.0005
7 2020-09-10 0.0003 100.279657 0.0003
8 2020-09-11 0.0001 100.289685 0.0001
9 2020-09-14 0.0043 100.720931 0.0043
10 2020-09-15 0.0037 101.093598 0.0037
11 2020-09-16 -0.0008 101.012724 -0.0008
You can also use numpy.cumprod() here (the older alias numpy.cumproduct() was removed in NumPy 2.0):
In [1349]: import numpy as np
In [1350]: df['prices'] = np.cumprod(df.daily + 1) * 100
In [1351]: df
Out[1351]:
date daily prices
0 2020-09-01 0.0000 100.000000
1 2020-09-02 0.0012 100.120000
2 2020-09-03 -0.0005 100.069940
3 2020-09-04 -0.0004 100.029912
4 2020-09-07 0.0032 100.350008
5 2020-09-08 -0.0015 100.199483
6 2020-09-09 0.0005 100.249582
7 2020-09-10 0.0003 100.279657
8 2020-09-11 0.0001 100.289685
9 2020-09-14 0.0043 100.720931
10 2020-09-15 0.0037 101.093598
11 2020-09-16 -0.0008 101.012724
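To see why the loop in the question does not work and what cumprod does under the hood, here is a small plain-Python sketch. The loop in the question applies every return to the fixed base of 100, whereas compounding has to apply each return to the previous price. The names base, price and growth are introduced here for illustration; the returns are copied from the question.

```python
from itertools import accumulate
from operator import mul

# Daily returns from the question.
daily = [0.0000, 0.0012, -0.0005, -0.0004, 0.0032, -0.0015,
         0.0005, 0.0003, 0.0001, 0.0043, 0.0037, -0.0008]

base = 100.0  # starting price assumed in the question's expected output

# Corrected loop: carry the running price forward instead of restarting at 100.
prz = []
price = base
for r in daily:
    price = price * (1 + r)
    prz.append(price)

# Same idea without an explicit loop: running product of (1 + r), scaled by
# the base. This mirrors df['daily'].add(1).cumprod().mul(100) in pandas.
growth = accumulate((1 + r for r in daily), mul)
prices = [base * g for g in growth]

for p in prz:
    print(f"{p:.6f}")
```

Both lists hold the same compounded series, up to floating-point rounding, as the prices column shown in the answers above.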

plot data organized by rows

The data are organized by rows: the first column is the year, and the next 12 columns are the monthly means for that year. I need to plot the month on the x-axis and the monthly mean on the y-axis. The data look like this:
1871 -0.107 0.004 -0.503 -0.650 -0.379 0.025 0.317 0.027 -0.732 -0.685 0.037 0.566
1872 0.376 -0.241 -0.904 -1.019 0.367 0.282 -0.061 0.597 0.779 0.818 1.070 1.203
1873 0.831 0.762 0.379 -0.028 0.014 0.349 0.189 0.428 -0.170 0.643 0.859 0.317
1874 0.063 0.125 -0.068 -0.124 0.365 0.535 0.693 1.298 0.554 0.566 0.889 0.185
1875 -0.369 -0.764 -1.238 0.111 0.683 0.696 0.505 1.008 1.210 0.945 -0.307 -0.184
Similar to this graph:
