Can't get pivot_wider values_fn to use multiple functions - pivot

I'm trying to use the values_fn argument in pivot_wider to apply a function to specific columns when there are multiple values. Using the iris dataset as an example:
iris.long <- iris %>%
  mutate(seq = rep(1:75, each = 2)) %>%
  select(-Petal.Length, -Petal.Width) %>%
  pivot_longer(-c("Species", "seq"), names_to = "Name", values_to = "Value")
> print(iris.long)
# A tibble: 300 x 4
Species seq Name Value
<fct> <int> <chr> <dbl>
1 setosa 1 Sepal.Length 5.1
2 setosa 1 Sepal.Width 3.5
3 setosa 1 Sepal.Length 4.9
4 setosa 1 Sepal.Width 3
5 setosa 2 Sepal.Length 4.7
6 setosa 2 Sepal.Width 3.2
7 setosa 2 Sepal.Length 4.6
8 setosa 2 Sepal.Width 3.1
9 setosa 3 Sepal.Length 5
10 setosa 3 Sepal.Width 3.6
# ... with 290 more rows
Now when I try to use a named list for values_fn, it still gives me list-col output instead of aggregating the multiple values:
iris.long %>%
  pivot_wider(id_cols = c("Species", "seq"), names_from = "Name", values_from = "Value",
              values_fn = list(Sepal.Length = mean, Sepal.Width = min))
# A tibble: 75 x 4
Species seq Sepal.Length Sepal.Width
<fct> <int> <list> <list>
1 setosa 1 <dbl [2]> <dbl [2]>
2 setosa 2 <dbl [2]> <dbl [2]>
3 setosa 3 <dbl [2]> <dbl [2]>
4 setosa 4 <dbl [2]> <dbl [2]>
5 setosa 5 <dbl [2]> <dbl [2]>
6 setosa 6 <dbl [2]> <dbl [2]>
7 setosa 7 <dbl [2]> <dbl [2]>
8 setosa 8 <dbl [2]> <dbl [2]>
9 setosa 9 <dbl [2]> <dbl [2]>
10 setosa 10 <dbl [2]> <dbl [2]>
# ... with 65 more rows
Warning message:
Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates
Can anyone help, please? Much appreciated.
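For reference, a sketch of one workaround (assuming tidyr 1.x and dplyr >= 1.0): the names in a values_fn list refer to the values_from columns (here Value), not to the new columns created from names_from, which is why the list above is silently unused and list-cols appear. You can pass a single function for every output column, or aggregate before pivoting when each output column needs its own summary:
library(dplyr)
library(tidyr)

# Same summary function for every output column:
iris.long %>%
  pivot_wider(id_cols = c("Species", "seq"), names_from = "Name",
              values_from = "Value", values_fn = list(Value = mean))

# A different summary per output column: collapse duplicates first, then pivot.
iris.long %>%
  group_by(Species, seq, Name) %>%
  summarise(Value = if (first(Name) == "Sepal.Length") mean(Value) else min(Value),
            .groups = "drop") %>%  # .groups needs dplyr >= 1.0
  pivot_wider(names_from = "Name", values_from = "Value")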

Related

How to groupby a key and return min/max values in other columns on a single row?

I have a set of data that I am trying to group together based on a common key in column A and I want it to return a single row of information per grouped key value. Grouping is easy, but I am having issues with my other columns returning the values that I need. Here is the dataframe:
df = pd.DataFrame({'A': [1,2,1,2,3,3,3,4,5,6,6,4,5,5],
'B': [1.1,2.1,1.2,2.2,3.1,3.2,3.3,4.1,5.1,6.1,6.2,4.2,5.2,5.3],
'C':[10.1,20.1,10.1,20.1,30.1,30.1,30.1,40.1,50.1,60.1,60.1,40.1,50.1,50.1],
'D':['','',10.2,20.2,'','',30.2,'','','',60.2,40.2,'',50.2]
})
df
--------------------------------------------------------------------------------------------------
A B C D
0 1 1.1 10.1
1 2 2.1 20.1
2 1 1.2 10.1 10.2
3 2 2.2 20.1 20.2
4 3 3.1 30.1
5 3 3.2 30.1
6 3 3.3 30.1 30.2
7 4 4.1 40.1
8 5 5.1 50.1
9 6 6.1 60.1
10 6 6.2 60.1 60.2
11 4 4.2 40.1 40.2
12 5 5.2 50.1
13 5 5.3 50.1 50.2
I want to group by column "A", have column "B" display the minimum value, and have column "D" return the maximum value. My desired output would look something like this:
A B C D
0 1 1.1 10.1 10.2
1 2 2.1 20.1 20.2
2 3 3.1 30.1 30.2
3 4 4.1 40.1 40.2
4 5 5.1 50.1 50.2
5 6 6.1 60.1 60.2
I have tried grouping by column "A" and keeping, for each key, only the row with the minimum value of column "B", so that the remaining column values from that row are shown on a single line, but that keeps the empty (NaN) values in column "D". Currently the output of the code looks like this:
df = df.loc[df.groupby('A')['B'].idxmin()]
df
------------------------------------------------------------------------------------------------
A B C D
0 1 1.1 10.1
1 2 2.1 20.1
4 3 3.1 30.1
7 4 4.1 40.1
8 5 5.1 50.1
9 6 6.1 60.1
I also tried using groupby with lambda and ffill().tail(1), and got the result I wanted for column "D" but column "B" isn't the minimum/lowest value. Here is the code and output for that:
out = df.replace({'': pd.NA}) \
        .groupby("A", as_index=False) \
        .apply(lambda x: x.ffill().tail(1)) \
        .reset_index(level=0, drop=True)
df = out
df
-------------------------------------------------------------------------------------------------
A B C D
2 1 1.2 10.1 10.2
3 2 2.2 20.1 20.2
6 3 3.3 30.1 30.2
11 4 4.2 40.1 40.2
13 5 5.3 50.1 50.2
10 6 6.2 60.1 60.2
Any ideas how I can combine these two pieces of code so that I get the minimum value of column "B" and the maximum value of column "D" in the same row, based on the common key in column "A"?
Any help is appreciated.
Try the replace() method first:
df['D'] = df['D'].replace(r'^\s*$', float('NaN'), regex=True)
# replace the '' (or whitespace-only) entries in D with NaN
Finally use groupby() and agg():
out = df.groupby('A', as_index=False).agg({'B': 'min', 'C': 'first', 'D': 'max'})
# group by A and aggregate each column according to your needs
output of out:
A B C D
0 1 1.1 10.1 10.2
1 2 2.1 20.1 20.2
2 3 3.1 30.1 30.2
3 4 4.1 40.1 40.2
4 5 5.1 50.1 50.2
5 6 6.1 60.1 60.2
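For completeness, here is the same approach as one self-contained script (data copied from the question; the empty strings are turned into NaN with a whitespace-only regex, which has the same effect as the replace step above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 3, 3, 3, 4, 5, 6, 6, 4, 5, 5],
                   'B': [1.1, 2.1, 1.2, 2.2, 3.1, 3.2, 3.3, 4.1, 5.1, 6.1, 6.2, 4.2, 5.2, 5.3],
                   'C': [10.1, 20.1, 10.1, 20.1, 30.1, 30.1, 30.1, 40.1, 50.1, 60.1, 60.1, 40.1, 50.1, 50.1],
                   'D': ['', '', 10.2, 20.2, '', '', 30.2, '', '', '', 60.2, 40.2, '', 50.2]})

# Empty strings become NaN so that max() skips them, then aggregate per key:
df['D'] = df['D'].replace(r'^\s*$', np.nan, regex=True).astype(float)
out = df.groupby('A', as_index=False).agg({'B': 'min', 'C': 'first', 'D': 'max'})
print(out)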

Fill missing rows in a python pandas dataframe with repetitive pattern

I am trying to fix missing rows in a pandas DataFrame like this:
import pandas as pd
df = pd.DataFrame([[1, 1.2, 3.4], [2, 4.5, 6.7], [3, 1.3, 2.5], [4, 5.6, 7.3],
[1, 3.4, 5.8], [2, 5.7, 8.9], [4, 2.4, 2.6], [1, 6.7, 8.4],
[3, 6.9, 4.2], [4, 4.2, 1.2]], columns = ['#', 'foo', 'bar'])
The above code gives me a pandas DataFrame like this:
Out[10]:
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 4 2.4 2.6
7 1 6.7 8.4
8 3 6.9 4.2
9 4 4.2 1.2
As you probably noticed, the values in the '#' column follow a repetitive pattern 1, 2, 3, 4, 1, 2, 3, 4 ... but with some rows missing (in this instance, the 3 before row 6 and the 2 before row 8). My question is: is there any built-in method (function) in pandas to fill the missing rows in this dataframe according to the repetitive pattern of the '#' column? The values in the other columns of the filled rows can be NaN, or the interpolation/extrapolation/average of the values before and/or after the filled rows. In other words, what I want is this:
Out[16]:
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2
I tried to set the '#' column as the index of the dataframe and reindex it with the regular pattern without missing values, but reindex doesn't work with duplicate index values. I know I could always go the traditional way and iterate over the rows in a loop to fix it, but I am afraid this would be time-consuming when working with large data.
I would appreciate it if anyone could give me a hint on this.
You need to create group identifiers somehow. Here the consecutive differences of the '#' column are compared with 1 via Series.lt to detect where a new cycle starts, the result is accumulated with cumsum, and then GroupBy.apply is used with DataFrame.reindex:
df1 = (df.groupby(df['#'].diff().lt(1).cumsum())
         .apply(lambda x: x.set_index('#').reindex(range(1, 5)))
         .reset_index(level=0, drop=True)
         .reset_index())
print (df1)
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2
Another idea is to create a MultiIndex and reshape with unstack and stack:
import numpy as np

df = (df.set_index(['#', df['#'].diff().lt(1).cumsum()])
        .unstack()
        .reindex(np.arange(4) + 1)
        .stack(dropna=False)
        .sort_index(level=1)
        .reset_index(level=1, drop=True)
        .reset_index())
print (df)
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2
We can mark each cycle of 1, 2, 3, 4 with eq(4), shift and cumsum.
Then we group by these markers, reindex each piece, and finally concat them back together.
s = df['#'].eq(4).shift().cumsum().bfill()
pd.concat(
    [d.set_index('#').reindex(np.arange(4) + 1) for _, d in df.groupby(s)]
).reset_index()
Output
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2
Note: if a 4 were missing from your # column, this method would fail.
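A quick way to see that caveat, with a hypothetical frame whose first cycle is missing its 4: the marker series never increments at the cycle boundary, so both cycles collapse into one group and the later reindex on '#' would hit duplicate labels.
import pandas as pd

# First cycle is 1, 2, 3 (its 4 is missing); second cycle is 1, 2, 3, 4.
broken = pd.DataFrame({'#': [1, 2, 3, 1, 2, 3, 4]})
s = broken['#'].eq(4).shift().cumsum().bfill()
print(s.tolist())  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] -> a single group instead of two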
This is similar to @jezrael's approach, sans the reindex and sort_index:
df['rep'] = df['#'].diff().le(0).cumsum()
(df.set_index(['rep', '#'])
   .unstack('#')
   .stack('#', dropna=False)
   .reset_index('#')
   .reset_index(drop=True)
)
Output:
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2
You could use the complete function from pyjanitor to expose the missing values:
# pip install pyjanitor
import pandas as pd
import janitor as jn
# cumsum creates identifiers for the groups in `#`
(df.assign(counter = df['#'].eq(1).cumsum())
   .complete('#', 'counter')
   # sorting can be ignored, if order is not important
   .sort_values('counter', ignore_index = True)
   .drop(columns='counter'))
# foo bar
0 1 1.2 3.4
1 2 4.5 6.7
2 3 1.3 2.5
3 4 5.6 7.3
4 1 3.4 5.8
5 2 5.7 8.9
6 3 NaN NaN
7 4 2.4 2.6
8 1 6.7 8.4
9 2 NaN NaN
10 3 6.9 4.2
11 4 4.2 1.2

How to select rows and assign them new values with SparkR?

In the R programming language I can do the following:
x <- c(1, 8, 3, 5, 6)
y <- rep("Down",5)
y[x>5] <- "Up"
This would result in a y vector being ("Down", "Up", "Down", "Down", "Up")
Now my x is the output of the predict function on a linear model fit. The predict function in R returns a vector, while the predict function in Spark returns a DataFrame containing the columns of the test dataset plus the columns label and prediction.
By running
y[x$prediction > .5]
I get the error:
Error in y[x$prediction > 0.5] : invalid subscript type 'S4'
How would I solve this problem?
On selecting rows:
Your approach will not work, since x, as the product of Spark's predict, is a Spark (and not R) DataFrame, so x$prediction > 0.5 yields an S4 Column object rather than a logical vector; you should use the filter function of SparkR instead. Here is a reproducible example using the iris dataset:
library(SparkR)
sparkR.version()
# "2.2.1"
df <- as.DataFrame(iris)
df
# SparkDataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string]
nrow(df)
# 150
# Let's keep only the records with Petal_Width > 0.2:
df2 <- filter(df, df$Petal_Width > 0.2)
nrow(df2)
# 116
Check also the example in the docs.
On replacing row values:
The standard practice for replacing row values in Spark dataframes is to first create a new column based on the required condition, and then possibly drop the old column; here is an example where we replace values of Petal_Width greater than 0.2 with 0 in the df defined above:
newDF <- withColumn(df, "new_PetalWidth", ifelse(df$Petal_Width > 0.2, 0, df$Petal_Width))
head(newDF)
# result:
Sepal_Length Sepal_Width Petal_Length Petal_Width Species new_PetalWidth
1 5.1 3.5 1.4 0.2 setosa 0.2
2 4.9 3.0 1.4 0.2 setosa 0.2
3 4.7 3.2 1.3 0.2 setosa 0.2
4 4.6 3.1 1.5 0.2 setosa 0.2
5 5.0 3.6 1.4 0.2 setosa 0.2
6 5.4 3.9 1.7 0.4 setosa 0.0 # <- value changed
# drop the old column:
newDF <- drop(newDF, "Petal_Width")
head(newDF)
# result:
Sepal_Length Sepal_Width Petal_Length Species new_PetalWidth
1 5.1 3.5 1.4 setosa 0.2
2 4.9 3.0 1.4 setosa 0.2
3 4.7 3.2 1.3 setosa 0.2
4 4.6 3.1 1.5 setosa 0.2
5 5.0 3.6 1.4 setosa 0.2
6 5.4 3.9 1.7 setosa 0.0
The method also works across different columns; here is an example of a new column taking the value 0 or Petal_Width, depending on a condition on Petal_Length:
newDF2 <- withColumn(df, "something_here", ifelse(df$Petal_Length > 1.4, 0, df$Petal_Width))
head(newDF2)
# result:
Sepal_Length Sepal_Width Petal_Length Petal_Width Species something_here
1 5.1 3.5 1.4 0.2 setosa 0.2
2 4.9 3.0 1.4 0.2 setosa 0.2
3 4.7 3.2 1.3 0.2 setosa 0.2
4 4.6 3.1 1.5 0.2 setosa 0.0
5 5.0 3.6 1.4 0.2 setosa 0.2
6 5.4 3.9 1.7 0.4 setosa 0.0
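Applied back to the original question, the same pattern avoids subsetting an R vector altogether; here is a small sketch that reuses the df defined above as a stand-in for the output of predict, with Petal_Width standing in for the prediction column:
# Add the label as a new column instead of modifying a separate vector:
labelled <- withColumn(df, "y", ifelse(df$Petal_Width > 0.5, "Up", "Down"))
head(select(labelled, "Petal_Width", "y"))
With the actual prediction DataFrame the call would be analogous, e.g. ifelse(x$prediction > 0.5, "Up", "Down") inside withColumn.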

Type mismatch error for filter function with dplyr over a spark data frame

I am currently working in RStudio on a RHEL cluster.
I use Spark 2.0.2 over a YARN client and have installed the following versions of sparklyr and dplyr:
sparklyr_0.5.4
dplyr_0.5.0
A simple test along the following lines results in an error:
data = copy_to(sc, iris)
filter(data, Sepal_Length > 5)
Error in filter(data, Sepal_Length > 5) :
(list) object cannot be coerced to type 'double'
I checked the data that was read in and all looks fine:
head(data)
Source: query [6 x 5]
Database: spark connection master=yarn-client app=sparklyr local=FALSE
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Is this a known bug, and are there known fixes for it?
It's not a bug. You have to specify that you want to use the filter function from the dplyr package. Probably you are using the filter function from the stats package. That's why you get that error. You can specify the right version with this: dplyr::filter
res <- dplyr::filter(data, Sepal_Length > 5) %>% dplyr::collect()
head(res)
# A tibble: 6 x 5
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 5.4 3.9 1.7 0.4 setosa
3 5.4 3.7 1.5 0.2 setosa
4 5.8 4.0 1.2 0.2 setosa
5 5.7 4.4 1.5 0.4 setosa
6 5.4 3.9 1.3 0.4 setosa
To be sure, in the RStudio console, just type filter (or any other function) and check the popup with the function name that appears. On the right you can see the package that is going to be used if you don't explicitly qualify the name with ::.
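If you prefer checking this programmatically rather than via the IDE popup, base R's find() lists every attached package that defines a given name, in search-path order (a small sketch; the exact output depends on which packages you have loaded):
find("filter")
# e.g. "package:dplyr" "package:stats"; the first entry is what an unqualified call uses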

Overwrite Value in Dataframe with checking Line before

So the DataFrame is:
1 28.3
2 27.9
3 22.4
4 18.1
5 15.5
6 7.1
7 5.1
8 12.0
9 15.1
10 10.1
Now I want to replace all values over 25 with HSE and all values below 8 with LSE. Everything else is "Middle". But I want to know whether it was over 25 or below 8 before it became "Middle". So if it was over 25 before, I would replace the value with "fHtM", and if it was below 8 before, I would replace it with "fLtM".
Thank you in advance.
Desired output:
Maybe something like this:
1 S4
2 S4
3 S4
4 dS3 (down to class S3)
5 dS3
6 dS2
7 dS1
8 uS2 (up to class S2)
9 uS3
10 dS2
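The snippets below assume the data sits in a DataFrame roughly like the following (a minimal reconstruction from the printed numbers; the column names a and value are guesses based on the output shown further down):
import numpy as np  # used by the binning code below
import pandas as pd

df = pd.DataFrame({'a': range(1, 11),
                   'value': [28.3, 27.9, 22.4, 18.1, 15.5, 7.1, 5.1, 12.0, 15.1, 10.1]})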
You can use cut:
bins = [-np.inf, 6, 13, 19, np.inf]
labels=['S1','S2','S3','S4']
df['label'] = pd.cut(df['value'], bins=bins, labels=labels)
print (df)
a value label
0 1 28.3 S4
1 2 27.9 S4
2 3 22.4 S4
3 4 18.1 S3
4 5 15.5 S3
5 6 7.1 S2
6 7 5.1 S1
7 8 12.0 S2
8 9 15.1 S3
9 10 10.1 S2
And if you need to add the trend, use diff.
Explanation:
First take the second character of the label column with str[1], convert it to an integer and take the diff. Consecutive duplicates give 0, so replace those with NaN and forward-fill with ffill().
dif = (df.label.str[1].astype(int).diff().replace(0,np.nan).ffill())
print (dif)
0 NaN
1 NaN
2 NaN
3 -1.0
4 -1.0
5 -1.0
6 -1.0
7 1.0
8 1.0
9 -1.0
Name: label, dtype: float64
Then use numpy.where to create 'u' where the value is 1 and 'd' where it is -1 (and an empty string otherwise), and prepend the result to the label column:
df['label'] = dif.where(dif.isnull(), np.where(dif == 1.0, 'u', 'd')).fillna('') + df.label.astype(str)
print (df)
a value label
0 1 28.3 S4
1 2 27.9 S4
2 3 22.4 S4
3 4 18.1 dS3
4 5 15.5 dS3
5 6 7.1 dS2
6 7 5.1 dS1
7 8 12.0 uS2
8 9 15.1 uS3
9 10 10.1 dS2
