Performance comparison of stringr str_replace_all calls

Just wondering if there will be performance differences in running str_replace_all.
For example:
text <- c("a","b", "c")
str_replace_all(text, c("a", "b", "c"), c("d", "e", "f"))
and
str_replace_all(text, "a", "d")
str_replace_all(text, "b", "e")
str_replace_all(text, "c", "f")
Both get me the same result but I was wondering what would be faster if I was doing the same procedure for close to 200,000 documents and if each document file was longer?

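A single str_replace_all call should perform better, because the multi-call version has to re-assign text after every replacement, and each extra call adds overhead. Note that the two forms only agree for this particular input: with vector arguments, str_replace_all pairs each pattern with the string at the same position, so "a" is replaced only in the first element, "b" only in the second, and so on. To apply every replacement to every string in one call, pass a named vector such as c(a = "d", b = "e", c = "f") as the pattern.
Here is a test with 3 functions: f1 makes three separate calls (the second approach in the question), f2 makes a single vectorised call (the first approach), and f3 is just a nested ("chained") version of f1: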
> library(stringr)
> library(microbenchmark)
> text <- c("a", "b", "c")
> f1 <- function(text) { text=str_replace_all(text, "a", "d"); text = str_replace_all(text, "b", "e"); text=str_replace_all(text, "c", "f"); return(text) }
> f1(text)
[1] "d" "e" "f"
> f2 <- function(text) { return(str_replace_all(text, c("a", "b", "c"), c("d", "e", "f"))) }
> f2(text)
[1] "d" "e" "f"
> f3 <- function(text) { return(str_replace_all(str_replace_all(str_replace_all(text, "c", "f"), "b", "e"), "a", "d")) }
> f3(text)
[1] "d" "e" "f"
> test <- microbenchmark( f1(text), f2(text), f3(text), times = 50000 )
> test
Unit: microseconds
expr min lq mean median uq max neval
f1(text) 225.788 233.335 257.2998 239.673 262.313 25071.76 50000
f2(text) 182.321 187.755 207.1858 191.980 210.393 24844.76 50000
f3(text) 224.581 231.825 255.2167 237.863 259.898 24531.74 50000
With times = 50000, each function was run 50,000 times. f2 has the lowest median as well as the lowest lower quartile (lq) and upper quartile (uq), which indicates that a single str_replace_all call is the fastest here. autoplot(test) (from the ggplot2 package) plots the same timing distributions.
Finally, if you only need to replace literal strings, stri_replace_all_fixed from the stringi package is a better choice. Like str_replace_all, it pairs patterns with strings element-wise by default; pass vectorize_all = FALSE to apply every pattern to every string. Here is the same benchmark with it:
> library(stringi)
> f1 <- function(text) { text=stri_replace_all_fixed(text, "a", "d"); text = stri_replace_all_fixed(text, "b", "e"); text=stri_replace_all_fixed(text, "c", "f"); return(text) }
> f2 <- function(text) { return(stri_replace_all_fixed(text, c("a", "b", "c"), c("d", "e", "f"))) }
> f3 <- function(text) { return(stri_replace_all_fixed(stri_replace_all_fixed(stri_replace_all_fixed(text, "c", "f"), "b", "e"), "a", "d")) }
> test <- microbenchmark( f1(text), f2(text), f3(text), times = 50000 )
> test
Unit: microseconds
expr min lq mean median uq max neval cld
f1(text) 7.547 7.849 9.197490 8.151 8.453 1008.800 50000 b
f2(text) 3.321 3.623 4.420453 3.925 3.925 2053.821 50000 a
f3(text) 7.245 7.547 9.802766 7.849 8.151 50816.654 50000 b

Related

String Permutations of Different Lengths

I have been trying to wrap my head around something and can't seem to find an answer. I know how to get all the permutations of a string as it is fairly easy. What I want to try and do is get all the permutations of the string in different sizes. For example:
Given "ABCD" and a lower limit of 3 chars I would want to get back ABC, ABD, ACB, ACD, ADB, ADC, ... , ABCD, ACBD, ADBC, .. etc.
I'm not quite sure how to accomplish that. I have it in my head that it is something that could be very complicated or very simple. Any help pointing me in a direction is appreciated. Thanks.
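If you've already got the full-length permutations, you can drop elements off the front or back of each one and insert the results into a set. This yields every length down to 1, so filter on count afterwards if you need a lower limit.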
XCTAssertEqual(
Permutations(["A", "B", "C"]).reduce( into: Set() ) { set, permutation in
permutation.indices.forEach {
set.insert( permutation.dropLast($0) )
}
},
[ ["A", "B", "C"],
["A", "C", "B"],
["B", "C", "A"],
["B", "A", "C"],
["C", "A", "B"],
["C", "B", "A"],
["B", "C"],
["C", "B"],
["C", "A"],
["A", "C"],
["A", "B"],
["B", "A"],
["A"],
["B"],
["C"]
]
)
public struct Permutations<Sequence: Swift.Sequence>: Swift.Sequence, IteratorProtocol {
public typealias Array = [Sequence.Element]
private let array: Array
private var iteration = 0
public init(_ sequence: Sequence) {
array = Array(sequence)
}
public mutating func next() -> Array? {
guard iteration < array.count.factorial!
else { return nil }
defer { iteration += 1 }
return array.indices.reduce(into: array) { permutation, index in
let shift =
iteration / (array.count - 1 - index).factorial!
% (array.count - index)
permutation.replaceSubrange(
index...,
with: permutation.dropFirst(index).shifted(by: shift)
)
}
}
}
public extension Collection where SubSequence: RangeReplaceableCollection {
func shifted(by shift: Int) -> SubSequence {
let drops =
shift > 0
? (shift, count - shift)
: (count + shift, -shift)
return dropFirst(drops.0) + dropLast(drops.1)
}
}
public extension BinaryInteger where Stride: SignedInteger {
/// - Note: `nil` for negative numbers
var factorial: Self? {
switch self {
case ..<0:
return nil
case 0...1:
return 1
default:
return (2...self).reduce(1, *)
}
}
}
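The lower limit from the question can also be handled directly, by generating the arrangements of each length from the limit up to the full length. The following is a minimal sketch in Go rather than Swift, under the assumption that the input elements are distinct; the names kPermutations and permutationsFrom are hypothetical and not part of the answer above.

```go
package main

import (
	"fmt"
	"strings"
)

// kPermutations returns every ordered arrangement of k distinct
// positions of items, built by depth-first backtracking.
func kPermutations(items []string, k int) [][]string {
	var out [][]string
	used := make([]bool, len(items))
	current := make([]string, 0, len(items))

	var walk func()
	walk = func() {
		if len(current) == k {
			// Copy, because current is reused while backtracking.
			out = append(out, append([]string(nil), current...))
			return
		}
		for i, item := range items {
			if used[i] {
				continue
			}
			used[i] = true
			current = append(current, item)
			walk()
			current = current[:len(current)-1]
			used[i] = false
		}
	}
	walk()
	return out
}

// permutationsFrom collects the arrangements of every length
// from minLen up to len(items), shortest first.
func permutationsFrom(items []string, minLen int) [][]string {
	var out [][]string
	for k := minLen; k <= len(items); k++ {
		out = append(out, kPermutations(items, k)...)
	}
	return out
}

func main() {
	for _, p := range permutationsFrom([]string{"A", "B", "C", "D"}, 3) {
		fmt.Println(strings.Join(p, ""))
	}
}
```

For "ABCD" with a lower limit of 3 this produces 24 arrangements of length 3 followed by 24 of length 4, starting with ABC, ABD, ACB as in the question.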

Pass row to UDF and select column based on pattern match

How can I achieve the following by passing a row to a UDF?
val df1 = df.withColumn("col_Z",
when($"col_x" === "a", $"col_A")
.when($"col_x" === "b", $"col_B")
.when($"col_x" === "c", $"col_C")
.when($"col_x" === "d", $"col_D")
.when($"col_x" === "e", $"col_E")
.when($"col_x" === "f", $"col_F")
.when($"col_x" === "g", $"col_G")
)
As I understand it, only columns can be passed as arguments to a UDF in Scala Spark.
I have taken a look at this question:
How to pass whole Row to UDF - Spark DataFrame filter
and tried to implement this udf:
def myUDF(r:Row) = udf {
val z : Float = r.getAs("col_x") match {
case "a" => r.getAs("col_A")
case "b" => r.getAs("col_B")
case other => lit(0.0)
}
z
}
but I'm getting a type mismatch error:
error: type mismatch;
found : String("a")
required: Nothing
case "a" => r.getAs("col_A")
^
What am I doing wrong?

Golang: print string array in an unique way

I want a function func format(s []string) string such that for two string slices s1 and s2, if reflect.DeepEqual(s1, s2) == false, then format(s1) != format(s2).
If I simply use fmt.Sprint, the slices ["a", "b", "c"] and ["a b", "c"] are both printed as [a b c], which is undesirable; there is also the problem of string([]byte{'4', 0, '2'}) having the same visible representation as "42".
Use a format verb that shows the data structure, like %#v. In this case %q works well too because the primitive types are all strings.
fmt.Printf("%#v\n", []string{"a", "b", "c"})
fmt.Printf("%#v\n", []string{"a b", "c"})
// prints
// []string{"a", "b", "c"}
// []string{"a b", "c"}
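Wrapped into the function signature the question asks for, the %#v approach might look like the sketch below. It assumes the Go-syntax rendering is acceptable as the output format. One reason to prefer %#v over %q here is that reflect.DeepEqual treats a nil slice and an empty slice as different, and %#v renders them differently as well.

```go
package main

import "fmt"

// format renders s in Go syntax. Every element is quoted and escaped,
// so element boundaries and invisible bytes stay unambiguous.
func format(s []string) string {
	return fmt.Sprintf("%#v", s)
}

func main() {
	fmt.Println(format([]string{"a", "b", "c"}))
	fmt.Println(format([]string{"a b", "c"}))
	fmt.Println(format([]string{string([]byte{'4', 0, '2'})}))
	fmt.Println(format([]string{"42"}))
	fmt.Println(format(nil))        // nil slice
	fmt.Println(format([]string{})) // empty, non-nil slice
}
```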
You may use:
func format(s1, s2 []string) string {
if reflect.DeepEqual(s1, s2) {
return "%v\n"
}
return "%q\n"
}
Like this working sample (The Go Playground):
package main
import (
"fmt"
"reflect"
)
func main() {
s1, s2 := []string{"a", "b", "c"}, []string{"a b", "c"}
frmat := format(s1, s2)
fmt.Printf(frmat, s1) // ["a" "b" "c"]
fmt.Printf(frmat, s2) // ["a b" "c"]
s2 = []string{"a", "b", "c"}
frmat = format(s1, s2)
fmt.Printf(frmat, s1) // [a b c]
fmt.Printf(frmat, s2) // [a b c]
}
func format(s1, s2 []string) string {
if reflect.DeepEqual(s1, s2) {
return "%v\n"
}
return "%q\n"
}
output:
["a" "b" "c"]
["a b" "c"]
[a b c]
[a b c]

Golang Alphabetic representation of a number

Is there an easy way to convert a number to a letter?
For example,
3 => "C" and 23 => "W"?
For simplicity, range checks are omitted from the solutions below.
They all can be tried on the Go Playground.
Number -> rune
Simply add the number to the constant 'A' - 1, so that adding 1 gives 'A', adding 2 gives 'B', and so on:
func toChar(i int) rune {
return rune('A' - 1 + i)
}
Testing it:
for _, i := range []int{1, 2, 23, 26} {
fmt.Printf("%d %q\n", i, toChar(i))
}
Output:
1 'A'
2 'B'
23 'W'
26 'Z'
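The solutions here leave out range checks. If inputs outside 1 to 26 are possible, a checked variant could look like the following sketch; toCharChecked and errOutOfRange are hypothetical names introduced only for illustration, and returning an error (rather than panicking or clamping) is an assumption about what the caller wants.

```go
package main

import (
	"errors"
	"fmt"
)

// errOutOfRange is returned for numbers that have no letter.
var errOutOfRange = errors.New("number must be between 1 and 26")

// toCharChecked maps 1..26 to 'A'..'Z' and rejects everything else.
func toCharChecked(i int) (rune, error) {
	if i < 1 || i > 26 {
		return 0, errOutOfRange
	}
	return rune('A' - 1 + i), nil
}

func main() {
	for _, i := range []int{1, 23, 27} {
		r, err := toCharChecked(i)
		if err != nil {
			fmt.Printf("%d: %v\n", i, err)
			continue
		}
		fmt.Printf("%d %q\n", i, r)
	}
}
```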
Number -> string
Or if you want it as a string:
func toCharStr(i int) string {
// Convert via rune: go vet flags a direct string(int) conversion.
return string(rune('A' - 1 + i))
}
Output:
1 "A"
2 "B"
23 "W"
26 "Z"
This last one (converting a number to string) is documented in the Spec: Conversions to and from a string type:
Converting a signed or unsigned integer value to a string type yields a string containing the UTF-8 representation of the integer.
Number -> string (cached)
If you need to do this a lot of times, it is profitable to store the strings in an array for example, and just return the string from that:
var arr = [...]string{"A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
"N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"}
func toCharStrArr(i int) string {
return arr[i-1]
}
Note: a slice (instead of the array) would also be fine.
Note #2: you may improve this if you add a dummy first character so you don't have to subtract 1 from i:
var arr = [...]string{".", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
"N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"}
func toCharStrArr(i int) string { return arr[i] }
Number -> string (slicing a string constant)
Also another interesting solution:
const abc = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
func toCharStrConst(i int) string {
return abc[i-1 : i]
}
Slicing a string is efficient: the new string will share the backing array (it can be done because strings are immutable).
If you need a string rather than a rune, and possibly more than one letter (e.g. for an Excel column name):
package main
import (
"fmt"
)
func IntToLetters(number int32) (letters string){
number--
if firstLetter := number/26; firstLetter >0{
letters += IntToLetters(firstLetter)
letters += string('A' + number%26)
} else {
letters += string('A' + number)
}
return
}
func main() {
fmt.Println(IntToLetters(1))// print A
fmt.Println(IntToLetters(26))// print Z
fmt.Println(IntToLetters(27))// print AA
fmt.Println(IntToLetters(1999))// print BXW
}
preview here: https://play.golang.org/p/GAWebM_QCKi
I made also package with this: https://github.com/arturwwl/gointtoletters
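The same conversion can be written without recursion. The sketch below is an iterative variant under the same 1-based convention (1 is A, 27 is AA); columnName is a hypothetical name and is not taken from the package linked above.

```go
package main

import (
	"fmt"
)

// columnName converts a 1-based number to letters the way spreadsheet
// columns are labelled: 1 -> A, 26 -> Z, 27 -> AA.
func columnName(n int) (string, error) {
	if n < 1 {
		return "", fmt.Errorf("columnName: %d is not a positive number", n)
	}
	var buf []byte
	for n > 0 {
		n-- // shift to 0-based so that 26 maps to Z, not to A0
		buf = append(buf, byte('A'+n%26))
		n /= 26
	}
	// Letters were produced least significant first, so reverse them.
	for i, j := 0, len(buf)-1; i < j; i, j = i+1, j-1 {
		buf[i], buf[j] = buf[j], buf[i]
	}
	return string(buf), nil
}

func main() {
	for _, n := range []int{1, 26, 27, 1999} {
		name, err := columnName(n)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println(n, name)
	}
}
```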
The simplest solution would be
func stringValueOf(i int) string {
var foo = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
return string(foo[i-1])
}
Hope this will help you to solve your problem. Happy Coding!!

constructing an identifier string for each row in data

I have the following data:
library(data.table)
d = data.table(a = c(1:3), b = c(2:4))
and would like to get this result (in a way that would work with arbitrary number of columns):
d[, c := paste0('a_', a, '_b_', b)]
d
# a b c
#1: 1 2 a_1_b_2
#2: 2 3 a_2_b_3
#3: 3 4 a_3_b_4
The following works, but I'm hoping to find something shorter and more legible.
d = data.table(a = c(1:3), b = c(2:4))
d[, c := apply(mapply(paste, names(.SD), .SD, MoreArgs = list(sep = "_")),
1, paste, collapse = "_")]
One way, only slightly cleaner:
d[, c := apply(d, 1, function(x) paste(names(d), x, sep="_", collapse="_")) ]
a b c
1: 1 2 a_1_b_2
2: 2 3 a_2_b_3
3: 3 4 a_3_b_4
Here is an approach using do.call('paste') that needs only a single call to paste.
I will benchmark on a situation where the columns are integers (as this seems a more sensible test case):
N <- 1e4
d <- setnames(as.data.table(replicate(5, sample(N), simplify = FALSE)), letters[seq_len(5)])
f5 <- function(d){
l <- length(d)
# interleave names and columns: 1, l + 1, 2, l + 2, ...
o <- c(1L, l + 1L) + rep(seq_len(l) - 1L, each = 2L)
do.call('paste', c((c(as.list(names(d)), d))[o], sep = '_'))
}
microbenchmark(f1(d), f2(d),f5(d))
Unit: milliseconds
expr min lq median uq max neval
f1(d) 41.51040 43.88348 44.60718 45.29426 52.83682 100
f2(d) 193.94656 207.20362 210.88062 216.31977 252.11668 100
f5(d) 30.73359 31.80593 32.09787 32.64103 45.68245 100
To avoid looping through rows, you can use this:
do.call(paste, c(lapply(names(d), function(n)paste0(n,"_",d[[n]])), sep="_"))
Benchmarking:
N <- 1e4
d <- data.table(a=runif(N),b=runif(N),c=runif(N),d=runif(N),e=runif(N))
f1 <- function(d)
{
do.call(paste, c(lapply(names(d), function(n)paste0(n,"_",d[[n]])), sep="_"))
}
f2 <- function(d)
{
apply(d, 1, function(x) paste(names(d), x, sep="_", collapse="_"))
}
require(microbenchmark)
microbenchmark(f1(d), f2(d))
Note: f2 is inspired by @Ricardo's answer.
Results:
Unit: milliseconds
expr min lq median uq max neval
f1(d) 195.8832 213.5017 216.3817 225.4292 254.3549 100
f2(d) 418.3302 442.0676 451.0714 467.5824 567.7051 100
Edit note: the previous benchmark with N <- 1e3 didn't show much difference in times. Thanks again @eddi.
