Does the rate of retries per thread rise as more threads change a Clojure ref? - multithreading

I am a little worried about this.
Imagine the simplest version control scheme: programmers copy the whole directory from the master repository, and after changing a file they copy it back, but only if the master repository is still unchanged. If someone else has changed it in the meantime, they must start over.
As the number of programmers increases, it is natural that retries also increase, but the growth might not be proportional to the number of programmers.
If ten programmers each have an hour of work, at least ten hours are needed to complete everything, and if they all keep trying, about 9 + 8 + 7 + ... + 1 = 45 man-hours come to nothing.
With a hundred programmers, about 99 + 98 + ... + 1 = 4950 man-hours come to nothing; in general the worst case is n(n-1)/2 wasted man-hours for n programmers, so the waste grows roughly quadratically.
I tried to count the number of retries and got the results below.
Source
(defn fib [n]
  (if (or (zero? n) (= n 1))
    1
    (+ (fib (dec n)) (fib (- n 2)))))

(defn calc! [r counter-A counter-B counter-C n]
  (dosync
    (swap! counter-A inc)
    ;;(Thread/sleep n)
    (fib n)
    (swap! counter-B inc)
    (alter r inc)
    (swap! counter-C inc)))

(defn main [thread-num n]
  (let [r (ref 0)
        counter-A (atom 0)
        counter-B (atom 0)
        counter-C (atom 0)]
    (doall (pmap deref
                 (for [_ (take thread-num (repeat nil))]
                   (future (calc! r counter-A counter-B counter-C n)))))
    (println thread-num " Thread. #ref:" @r)
    (println "A:" @counter-A ", B:" @counter-B ", C:" @counter-C)))
CPU: 2.93GHz Quad-Core Intel Core i7
result
user> (time (main 10 25))
10 Thread. #ref: 10
A: 53 , B: 53 , C: 10
"Elapsed time: 94.412 msecs"
nil
user> (time (main 100 25))
100 Thread. #ref: 100
A: 545 , B: 545 , C: 100
"Elapsed time: 966.141 msecs"
nil
user> (time (main 1000 25))
1000 Thread. #ref: 1000
A: 5507 , B: 5507 , C: 1000
"Elapsed time: 9555.165 msecs"
nil
I changed the job to (Thread/sleep n) instead of (fib n) and got similar results.
user> (time (main 10 20))
10 Thread. #ref: 10
A: 55 , B: 55 , C: 10
"Elapsed time: 220.616 msecs"
nil
user> (time (main 100 20))
100 Thread. #ref: 100
A: 689 , B: 689 , C: 117
"Elapsed time: 2013.729 msecs"
nil
user> (time (main 1000 20))
1000 Thread. #ref: 1000
A: 6911 , B: 6911 , C: 1127
"Elapsed time: 20243.214 msecs"
nil
In the Thread/sleep case I expected retries to increase even more than these results show, because the CPU is mostly idle.
Why don't the retries increase as expected?
Thanks.

Because you are not actually spawning 10, 100 or 1000 threads! Creating a future does not always create a new thread. It uses a thread pool behind the scenes where it keeps queuing the jobs (or Runnables, to be technical). The thread pool is a cached thread pool which reuses threads for running the jobs.
So in your case, you are not actually spawning 1000 threads. If you want to see the retries in action, drop a level below future: create your own thread pool and push Runnables into it.
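For example, a minimal sketch (mine, not part of the answer) that starts one real java.lang.Thread per transaction, reusing calc! from the question, so every dosync really runs on its own thread; a dedicated thread pool would work the same way:
(defn main-threads [thread-num n]
  (let [r (ref 0)
        counter-A (atom 0)
        counter-B (atom 0)
        counter-C (atom 0)
        ;; start one real Thread per job; Clojure fns are Runnable
        threads (doall (for [_ (range thread-num)]
                         (doto (Thread. #(calc! r counter-A counter-B counter-C n))
                           (.start))))]
    (doseq [^Thread t threads]
      (.join t))
    (println thread-num " Thread. #ref:" @r)
    (println "A:" @counter-A ", B:" @counter-B ", C:" @counter-C)))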

self answer
I modified the main function not to use pmap and got the results below, which work out as calculated.
(defn main [thread-num n]
  (let [r (ref 0)
        counter-A (atom 0)
        counter-B (atom 0)
        counter-C (atom 0)]
    (doall (map deref (doall (for [_ (take thread-num (repeat nil))]
                               (future (calc! r counter-A counter-B counter-C n))))))
    (println thread-num " Thread. #ref:" @r)
    (println "A:" @counter-A ", B:" @counter-B ", C:" @counter-C)))
fib
user=> (main 10 25)
10 Thread. #ref: 10
A: 55 , B: 55 , C: 10
nil
user=> (main 100 25)
100 Thread. #ref: 100
A: 1213 , B: 1213 , C: 100
nil
user=> (main 1000 25)
1000 Thread. #ref: 1000
A: 19992 , B: 19992 , C: 1001
nil
Thread/sleep
user=> (main 10 20)
10 Thread. #ref: 10
A: 55 , B: 55 , C: 10
nil
user=> (main 100 20)
100 Thread. #ref: 100
A: 4979 , B: 4979 , C: 102
nil
user=> (main 1000 20)
1000 Thread. #ref: 1000
A: 491223 , B: 491223 , C: 1008
nil

Related

Clojure: running multiple threads with functions that take zero or one input

I'm experimenting in Clojure with running independent threads and I'm getting different behaviors I don't understand.
For my code editor I'm using Atom (not emacs), REPL is Chlorine.
I'm testing a really simple function that just prints numbers.
This one prints from 100 to 1 and takes no inputs:
(defn pl100 []
  "pl100 = Print Loop from 100 to 1"
  (loop [counter 100]
    (when (pos? counter)
      (do
        (Thread/sleep 100)
        (println (str "counter: " counter))
        (recur (dec counter))))))
This one does the exact same thing, except it takes an input:
(defn pl-n [n]
  "pl-n = Print Loop from n to 1"
  (loop [counter n]
    (when (pos? counter)
      (do
        (Thread/sleep 100)
        (println (str "counter: " counter))
        (recur (dec counter))))))
When I use
(.start (Thread. #(.run pl100)))
; --> prints to console REPL
; --> runs with no errors
this code prints to the console REPL (where I call lein) and runs with no errors.
When I use
(.start (Thread. #(.run (pl-n 100))))
; prints to console REPL
; --> java.lang.NullPointerException: Cannot invoke "Object.getClass()" because "target" is null
this code prints to the console REPL and ends with the above exception.
When I use
(.start (Thread. pl100))
; --> prints to the console REPL
; --> runs with no errors
this code prints to the console REPL and runs with no errors.
When I use
(.start (Thread. (pl-n 100)))
; --> prints to Atom REPL, not console REPL!
; ends with exception
; Execution error (NullPointerException) at java.lang.Thread/<init> (Thread.java:396).
; name cannot be null
; class java.lang.NullPointerException
this code prints to the Atom REPL (I'm using Atom, not emacs), not to the console REPL like the others, and ends with an exception.
So, can someone please help me understand:
Why is it that when I run a function that takes an input, Java gives an error? Why are the function calls not equivalent?
What is (.run ...) doing?
Why is it that sometimes the code prints to the console and other times to Atom/Chlorine?
To answer in brief: Thread.run requires a function (a Runnable). Your first exhibit gives it a function, pl100, and works as you expect:
#(.run pl100)
Something altogether different would happen if you gave .run not a function, but instead the value returned by calling the pl100 function. In fact, pl100 returns nil, so Thread.run would throw a NullPointerException:
#(.run (pl100)) ;; NullPointerException
That explains why your second exhibit did not do what you expected. pl-n returned nil, and then you got an exception when you passed nil to Thread.run:
#(.run (pl-n 100)) ;; NullPointerException
To bridge the gap between Thread.run, which requires a function of no arguments, and your function pl-n, which requires an argument, you can introduce a function of no arguments (to satisfy Thread.run) that calls pl-n with the desired argument. Idiomatically this would be an anonymous function. Unfortunately, you can't nest #() within #(), so you will have to use the more verbose (fn [] ...) syntax for one of the anonymous functions, most likely the outer one.
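For instance, a minimal sketch of that bridge (not the answer's exact code):
(.start (Thread. (fn [] (pl-n 100))))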
Try something like this:
(ns tst.demo.core
  (:use demo.core tupelo.core tupelo.test)
  (:require [tupelo.string :as str]))

(defn pl100 []
  "pl100 = Print Loop from 100 to 1"
  (loop [counter 10]
    (when (pos? counter)
      (do
        (Thread/sleep 100)
        (println (str "pl100 counter: " counter))
        (recur (dec counter))))))

(defn pl-n [n]
  "pl-n = Print Loop from n to 1"
  (loop [counter n]
    (when (pos? counter)
      (do
        (Thread/sleep 100)
        (println (str "pl-n counter: " counter))
        (recur (dec counter))))))

(dotest
  (newline)
  (.start (Thread. pl100))
  (Thread/sleep (* 2 1000))
  (newline)
  (.start (Thread. #(pl-n 5)))
  (Thread/sleep (* 2 1000))
  (newline)
  (println :done))
Clojure functions are already instances of Runnable, so you don't need the #(.run xxx) syntax (see the quick check after the output below). Result:
--------------------------------------
Clojure 1.10.2-alpha1 Java 15
--------------------------------------
Testing tst.demo.core
pl100 counter: 10
pl100 counter: 9
pl100 counter: 8
pl100 counter: 7
pl100 counter: 6
pl100 counter: 5
pl100 counter: 4
pl100 counter: 3
pl100 counter: 2
pl100 counter: 1
pl-n counter: 5
pl-n counter: 4
pl-n counter: 3
pl-n counter: 2
pl-n counter: 1
:done
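As a quick REPL check of that Runnable point (a sketch, not part of the original test output):
(instance? java.lang.Runnable pl100)            ;=> true
(instance? java.util.concurrent.Callable pl100) ;=> true (this is what future uses)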
To make it even simpler, just use a Clojure future:
(future (pl100))
(Thread/sleep (* 2 1000))
(newline)
(future (pl-n 5))
(Thread/sleep (* 2 1000))
(newline)
(println :done)
If you remove the Thread/sleep, you can see them running in parallel:
(future (pl100))
(future (pl-n 5))
(Thread/sleep (* 2 1000))
(newline)
(println :done)
with result
pl100 counter: 10
pl-n counter: 5
pl100 counter: 9
pl-n counter: 4
pl100 counter: 8
pl-n counter: 3
pl100 counter: 7pl-n counter: 2
pl100 counter: 6pl-n counter: 1
pl100 counter: 5
pl100 counter: 4
pl100 counter: 3
pl100 counter: 2
pl100 counter: 1

SPIN assert not triggered

I am trying to understand why the assert in this model isn't triggered.
ltl { !A@wa U B@sb && !B@wb U A@sa }

byte p = 0
byte q = 0
int x = 0

inline signal(sem) { sem++ }
inline wait (sem) { atomic { sem > 0 ; sem-- } }

proctype A() {
    x = 10*x + 1
    signal(p)
sa: wait(q)
wa: x = 10*x + 2
}

proctype B() {
    x = 10*x + 3
    signal(q)
sb: wait(p)
wb: x = 10*x + 4
}

init {
    atomic { run A(); run B() }
    _nr_pr == 1
    assert(x != 1324)
}
Clearly, there is an order of operations that produces the final value x = 1324:
Initially x = 0
A sets x = 10*0 + 1 = 1
B sets x = 10*1 + 3 = 13
A and B allow each other to proceed
A sets x = 10*13 + 2 = 132
B sets x = 10*132 + 4 = 1324
The assertion isn't triggered because it is "never reached" when the solver proves that the property
ltl { !A@wa U B@sb && !B@wb U A@sa }
is true.
Take a look at the output given by the solver; it clearly states that:
it checks assertion violations, but only if they are within the scope of the claim:
Full statespace search for:
never claim + (ltl_0)
assertion violations + (if within scope of claim)
the assertion isn't reached:
unreached in init
t.pml:27, state 5, "assert((x!=1324))"
t.pml:28, state 6, "-end-"
(2 of 6 states)
You can use the option -noclaim to check the model only for the assertion, which is then easily proven false:
~$ spin -search -noclaim t.pml
ltl ltl_0: ((! ((A@wa))) U ((B@sb))) && ((! ((B@wb))) U ((A@sa)))
pan:1: assertion violated (x!=1324) (at depth 13)
pan: wrote t.pml.trail
(Spin Version 6.4.8 -- 2 March 2018)
Warning: Search not completed
+ Partial Order Reduction
Full statespace search for:
never claim - (not selected)
assertion violations +
cycle checks - (disabled by -DSAFETY)
invalid end states +
State-vector 36 byte, depth reached 15, errors: 1
48 states, stored
6 states, matched
54 transitions (= stored+matched)
1 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.003 equivalent memory usage for states (stored*(State-vector + overhead))
0.286 actual memory usage for states
128.000 memory used for hash table (-w24)
0.534 memory used for DFS stack (-m10000)
128.730 total actual memory usage
pan: elapsed time 0 seconds

Clojure proxy multithreading issue

I'm trying to create a proxy for ArrayBlockingQueue that intercepts calls to it for monitoring
(ns clj-super-bug.core
  (:import [java.util.concurrent ArrayBlockingQueue Executors]))

(let [thread-count 10
      put-count 100
      executor (Executors/newFixedThreadPool thread-count)
      puts (atom 0)
      queue (proxy [ArrayBlockingQueue] [1000]
              (put [el]
                (proxy-super put el)
                (swap! puts inc)))]
  (.invokeAll executor (repeat put-count #(.put queue 0)))
  (assert (= (.size queue) put-count) "should have put in put-count items")
  (println @puts))
I would expect this code to always print 100, but occasionally it's something else, like 51. Am I using proxy or proxy-super wrong?
I debugged this to the point where it seems the proxy method is sometimes not called at all, just the base method (the items do show up in the queue, as the assert confirms). It also appears to be multithreading related, because with thread-count = 1 it's always 100.
Turns out this is a known issue with proxy-super: https://dev.clojure.org/jira/browse/CLJ-2201
"If you have a proxy with method M, which invokes proxy-super, then while that proxy-super is running all calls to M on that proxy object will immediately invoke the super M not the proxied M." That's exactly what's happening.
I would not do the subclass via proxy.
If you subclass ArrayBlockingQueue, you are saying your code is an instance of ABQ. So, you are making a specialized version of ABQ, and must take responsibility for all of the implementation details of the ABQ source code.
However, you don't need to be an instance of ABQ. All you really need is to use an instance of ABQ, which is easily done by composition.
So, we write a wrapper function which delegates to an ABQ:
(ns tst.demo.core
  (:use demo.core tupelo.core tupelo.test)
  (:require
    [clojure.string :as str]
    [clojure.java.io :as io])
  (:import [java.util.concurrent ArrayBlockingQueue Executors TimeUnit]))

(dotest
  (let [N 100
        puts-done (atom 0)
        abq (ArrayBlockingQueue. (+ 3 N))
        putter (fn []
                 (.put abq 0)
                 (swap! puts-done inc))]
    (dotimes [_ N]
      (future (putter)))
    (Thread/sleep 1000)
    (println (format "N: %d puts-done: %d" N @puts-done))
    (assert (= N @puts-done)
      (format "should have put in puts-done items; N = %d puts-done = %d" N @puts-done))))
result:
N: 100 puts-done: 100
Using the executor:
(dotest
  (let [N 100
        puts-done (atom 0)
        thread-count 10
        executor (Executors/newFixedThreadPool thread-count)
        abq (ArrayBlockingQueue. (+ 3 N))
        putter (fn []
                 (.put abq 0)
                 (swap! puts-done inc))
        putters (repeat N #(putter))]
    (.invokeAll executor putters)
    (println (format "N: %d puts-done: %d" N @puts-done))
    (assert (= N @puts-done)
      (format "should have put in puts-done items; N = %d puts-done = %d" N @puts-done))))
result:
N: 100 puts-done: 100
Update #1
Regarding the cause, I'm not sure. I tried to fix the original version with locking, but no joy:
(def lock-obj (Object.))

(dotest
  (let [N 100
        puts-done (atom 0)
        thread-count 10
        executor (Executors/newFixedThreadPool thread-count)
        abq (proxy [ArrayBlockingQueue]
                   [(+ 3 N)]
              (put [el]
                (locking lock-obj
                  (proxy-super put el)
                  (swap! puts-done inc))))]
    (.invokeAll executor (repeat N #(.put abq 0)))
    (println (format "N: %d puts-done: %d" N @puts-done))))
with results:
N: 100 puts-done: 46
N: 100 puts-done: 71
N: 100 puts-done: 85
N: 100 puts-done: 83
Update #2
Tried some more tests using a Java subclass of ABQ:
package demo;

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class Que<E> extends ArrayBlockingQueue<E> {
    public static AtomicInteger numPuts = new AtomicInteger(0);
    public static Que<Integer> queInt = new Que<>(999);

    public Que(int size) { super(size); }

    public void put(E element) {
        synchronized (numPuts) {
            try {
                super.put(element);
                numPuts.getAndIncrement();
            } catch (Exception ex) {
                System.out.println("caught " + ex);
            }
        }
    }
}
...
  (:import [java.util.concurrent Executors TimeUnit]
           [demo Que]))

(dotest
  (let [N 100
        puts-done (atom 0)
        thread-count 10
        executor (Executors/newFixedThreadPool thread-count)]
    (.invokeAll executor (repeat N #(.put Que/queInt 0)))
    (println (format "N: %d puts-done: %d" N (.get Que/numPuts)))))
results (repeated runs => accumulation):
N: 100 puts-done: 100
N: 100 puts-done: 200
N: 100 puts-done: 300
N: 100 puts-done: 400
N: 100 puts-done: 500
So it works great with a Java subclass, and I get the same results with or without the synchronized block.
It looks to be something in the Clojure proxy area.
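If the calling code really needs something that still looks like a BlockingQueue, one alternative to proxy (not from the answers above, just a sketch) is to delegate with reify, implementing only the methods this example actually calls:
(ns example.queue ;; hypothetical namespace for this sketch
  (:import [java.util.concurrent ArrayBlockingQueue BlockingQueue]))

(defn counting-queue
  "Returns a BlockingQueue that delegates to an inner ArrayBlockingQueue
  and counts successful puts in the puts atom. Only put and size are
  implemented; calling any other BlockingQueue method will fail."
  [capacity puts]
  (let [inner (ArrayBlockingQueue. capacity)]
    (reify BlockingQueue
      (put [_ el]
        (.put inner el)
        (swap! puts inc))
      (size [_]
        (.size inner)))))
The interception then happens by composition, so proxy-super and CLJ-2201 never come into play.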

Truncate string in Julia

Is there a convenience function for truncating strings to a certain length?
It would be equivalent to something like this:
test_str = "test"
if length(test_str) > 8
    out_str = test_str[1:8]
else
    out_str = test_str
end
In the naive ASCII world:
truncate_ascii(s,n) = s[1:min(sizeof(s),n)]
would do. If it's preferable to share memory with original string and avoid copying SubString can be used:
truncate_ascii(s,n) = SubString(s,1,min(sizeof(s),n))
But in a Unicode world (and it is a Unicode world) this is better:
truncate_utf8(s,n) = SubString(s, 1, (eo=endof(s) ; neo=0 ;
                                      for i=1:n
                                          if neo<eo neo=nextind(s,neo) ; else break ; end ;
                                      end ; neo) )
Finally, @IsmaelVenegasCastelló reminded us of grapheme complexity (arrrgh), and then this is what's needed:
function truncate_grapheme(s,n)
    eo = endof(s) ; tt = 0 ; neo = 0
    for i=1:n
        if (neo<eo)
            tt = nextind(s,neo)
            while neo>0 && tt<eo && !Base.UTF8proc.isgraphemebreak(s[neo],s[tt])
                (neo,tt) = (tt,nextind(s,tt))
            end
            neo = tt
        else
            break
        end
    end
    return SubString(s,1,neo)
end
These last two implementations try to avoid calculating the length (which can be slow) or allocating/copying, or even just looping n times when the length is shorter.
This answer draws on contributions of @MichaelOhlrogge, @FengyangWang, @Oxinabox and @IsmaelVenegasCastelló
I would do strtruncate(str, n) = join(take(str, n)).
Example:
julia> strtruncate("αβγδ", 3)
"αβγ"
julia> strtruncate("αβγδ", 5)
"αβγδ"
Note that your code is not fully valid for Unicode strings.
If the string is ASCII, this is pretty efficient:
String(resize!(str.data, n))
Or in-place:
resize!(str.data, n)
For Unicode, @Fengyang Wang's method is very fast, but converting to a Char array can be slightly faster if you only truncate the very end of the string:
trunc1(str::String, n) = String(collect(take(str, n)))
trunc2(str::String, n) = String(Vector{Char}(str)[1:n])
trunc3(str::String, n) = String(resize!(Vector{Char}(str), n))
trunc4(str::String, n::Int)::String = join(collect(graphemes(str))[1:n])
function trunc5(str::String, n)
    if isascii(str)
        return String(resize!(str.data, n))
    else
        trunc1(str, n)
    end
end
Timing:
julia> time_trunc(100, 100000, 25)
0.112851 seconds (700.00 k allocations: 42.725 MB, 7.75% gc time)
0.165806 seconds (700.00 k allocations: 91.553 MB, 11.84% gc time)
0.160116 seconds (600.00 k allocations: 73.242 MB, 11.58% gc time)
1.167706 seconds (31.60 M allocations: 1.049 GB, 11.12% gc time)
0.017833 seconds (100.00 k allocations: 1.526 MB)
true
julia> time_trunc(100, 100000, 98)
0.367191 seconds (700.00 k allocations: 83.923 MB, 5.23% gc time)
0.318507 seconds (700.00 k allocations: 132.751 MB, 9.08% gc time)
0.301685 seconds (600.00 k allocations: 80.872 MB, 6.19% gc time)
1.561337 seconds (31.80 M allocations: 1.122 GB, 9.86% gc time)
0.061827 seconds (100.00 k allocations: 1.526 MB)
true
Edit: Whoops... I just realized that I'm actually destroying the original string in trunc5. This should be correct, but with somewhat worse performance:
function trunc5(str::String, n)
    if isascii(str)
        return String(str.data[1:n])
    else
        trunc1(str, n)
    end
end
New timings:
julia> time_trunc(100, 100000, 25)
0.123629 seconds (700.00 k allocations: 42.725 MB, 7.70% gc time)
0.162332 seconds (700.00 k allocations: 91.553 MB, 11.41% gc time)
0.152473 seconds (600.00 k allocations: 73.242 MB, 9.19% gc time)
1.152640 seconds (31.60 M allocations: 1.049 GB, 11.54% gc time)
0.066662 seconds (200.00 k allocations: 12.207 MB)
true
julia> time_trunc(100, 100000, 98)
0.369576 seconds (700.00 k allocations: 83.923 MB, 5.10% gc time)
0.312237 seconds (700.00 k allocations: 132.751 MB, 9.42% gc time)
0.297736 seconds (600.00 k allocations: 80.872 MB, 5.95% gc time)
1.545329 seconds (31.80 M allocations: 1.122 GB, 10.02% gc time)
0.080399 seconds (200.00 k allocations: 19.836 MB, 5.07% gc time)
true
Aaand new edit: Aargh, forgot the timing function. I'm inputting an ASCII string:
function time_trunc(m, n, m_)
    str = randstring(m)
    @time for _ in 1:n trunc1(str, m_) end
    @time for _ in 1:n trunc2(str, m_) end
    @time for _ in 1:n trunc3(str, m_) end
    @time for _ in 1:n trunc4(str, m_) end
    @time for _ in 1:n trunc5(str, m_) end
    trunc1(str, m_) == trunc2(str, m_) == trunc3(str, m_) == trunc4(str, m_) == trunc5(str, m_)
end
Final edit (I hope):
Trying out @Dan Getz's truncate_grapheme and using Unicode strings:
function time_trunc(m, n, m_)
    # str = randstring(m)
    str = join(["αβγπϕ1t_Ω₃!" for i in 1:100])
    @time for _ in 1:n trunc1(str, m_) end
    @time for _ in 1:n trunc2(str, m_) end
    @time for _ in 1:n trunc3(str, m_) end
    # @time for _ in 1:n trunc4(str, m_) end # too slow
    @time for _ in 1:n trunc5(str, m_) end
    @time for _ in 1:n truncate_grapheme(str, m_) end
    trunc1(str, m_) == trunc2(str, m_) == trunc3(str, m_) == trunc5(str, m_) == truncate_grapheme(str, m_)
end
Timing:
julia> time_trunc(100, 100000, 98)
0.690399 seconds (800.00 k allocations: 103.760 MB, 3.69% gc time)
1.828437 seconds (800.00 k allocations: 534.058 MB, 3.66% gc time)
1.795005 seconds (700.00 k allocations: 482.178 MB, 3.19% gc time)
0.667831 seconds (800.00 k allocations: 103.760 MB, 3.17% gc time)
0.347953 seconds (100.00 k allocations: 3.052 MB)
true
julia> time_trunc(100, 100000, 25)
0.282922 seconds (800.00 k allocations: 48.828 MB, 4.01% gc time)
1.576374 seconds (800.00 k allocations: 479.126 MB, 3.98% gc time)
1.643700 seconds (700.00 k allocations: 460.815 MB, 3.70% gc time)
0.276586 seconds (800.00 k allocations: 48.828 MB, 4.59% gc time)
0.091773 seconds (100.00 k allocations: 3.052 MB)
true
So the last one seems clearly the best (and this post is now way too long.)
You could use the graphemes function:
C:\Users\Ismael
λ julia5
(Julia startup banner: Version 0.5.0-rc3+0 (2016-08-22 23:43 UTC), official x86_64-w64-mingw32 release)
help?> graphemes
search: graphemes
graphemes(s) -> iterator over substrings of s
Returns an iterator over substrings of s that correspond to the extended
graphemes in the string, as defined by Unicode UAX #29.
(Roughly, these are what users would perceive as single characters, even
though they may contain more than one codepoint; for example a letter
combined with an accent mark is a single grapheme.)
Example:
julia> s = "αβγπϕ1t_Ω₃!"; n = 8;
julia> length(s)
11
julia> graphemes(s)
length-11 GraphemeIterator{String} for "αβγπϕ1t_Ω₃!"
julia> collect(ans)[1:n]
8-element Array{SubString{String},1}:
"α"
"β"
"γ"
"π"
"ϕ"
"1"
"t"
"_"
julia> join(ans)
"αβγπϕ1t_"
Check out the truncate function:
julia> methods(truncate)
# 2 methods for generic function "truncate":
truncate(s::IOStream, n::Integer) at iostream.jl:43
truncate(io::Base.AbstractIOBuffer, n::Integer) at iobuffer.jl:140
help?> truncate
search: truncate
truncate(file,n)
Resize the file or buffer given by the first argument to exactly n bytes,
filling previously unallocated space with '\0' if the file or buffer is
grown.
So the solution could look like this:
julia> @doc """
       truncate(s::String, n::Int)::String
       truncate a `String`; `s` up to `n` graphemes.
       # Example
       ```julia
       julia> truncate("αβγπϕ1t_Ω₃!", 8)
       "αβγπϕ1t_"
       julia> truncate("test", 8)
       "test"
       ```
       """ ->
       function Base.truncate(s::String, n::Int)::String
           if length(s) > n
               join(collect(graphemes(s))[1:n])
           else
               s
           end
       end
Base.truncate
Test it:
julia> methods(truncate)
# 3 methods for generic function "truncate":
truncate(s::String, n::Int64)
truncate(s::IOStream, n::Integer) at iostream.jl:43
truncate(io::Base.AbstractIOBuffer, n::Integer) at iobuffer.jl:140
help?> truncate
truncate(file,n)
Resize the file or buffer given by the first argument to exactly n bytes,
filling previously unallocated space with '\0' if the file or buffer is
grown.
truncate(s::String, n::Int)::String
truncate a String; s up to n graphemes.
Example
≡≡≡≡≡≡≡≡≡
julia> truncate("αβγπϕ1t_Ω₃!", 8)
"αβγπϕ1t_"
julia> truncate("test", 8)
"test"
julia> truncate("αβγπϕ1t_Ω₃!", n)
"αβγπϕ1t_"
julia> truncate("test", n)
"test"
Profile it:
julia> Pkg.add("BenchmarkTools")
INFO: Nothing to be done
INFO: METADATA is out-of-date — you may not have the latest version of BenchmarkTools
INFO: Use `Pkg.update()` to get the latest versions of your packages
julia> using BenchmarkTools
julia> @benchmark truncate("αβγπϕ1t_Ω₃!", 8)
BenchmarkTools.Trial:
samples: 10000
evals/sample: 9
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 1.72 kb
allocs estimate: 48
minimum time: 1.96 μs (0.00% GC)
median time: 2.10 μs (0.00% GC)
mean time: 2.45 μs (7.80% GC)
maximum time: 353.75 μs (98.40% GC)
julia> Sys.cpu_info()[]
Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz:
speed user nice sys idle irq ticks
2494 MHz 937640 0 762890 11104468 144671 ticks
You could use:
"test"[1:min(end,8)]
Also
SubString("test", 1, 8)
Here's one that can handle any UTF-8 string:
function trim_str(str, max_length)
    edge = nextind(str, 0, max_length)
    if edge >= ncodeunits(str)
        str
    else
        str[1:edge]
    end
end

Constructing an identifier string for each row in data

I have the following data:
library(data.table)
d = data.table(a = c(1:3), b = c(2:4))
and would like to get this result (in a way that would work with arbitrary number of columns):
d[, c := paste0('a_', a, '_b_', b)]
d
# a b c
#1: 1 2 a_1_b_2
#2: 2 3 a_2_b_3
#3: 3 4 a_3_b_4
The following works, but I'm hoping to find something shorter and more legible.
d = data.table(a = c(1:3), b = c(2:4))
d[, c := apply(mapply(paste, names(.SD), .SD, MoreArgs = list(sep = "_")),
1, paste, collapse = "_")]
one way, only slightly cleaner:
d[, c := apply(d, 1, function(x) paste(names(d), x, sep="_", collapse="_")) ]
a b c
1: 1 2 a_1_b_2
2: 2 3 a_2_b_3
3: 3 4 a_3_b_4
Here is an approach using do.call('paste'), but requiring only a single call to paste.
I will benchmark a situation where the columns are integers (as this seems a more sensible test case).
N <- 1e4
d <- setnames(as.data.table(replicate(5, sample(N), simplify = FALSE)), letters[seq_len(5)])
f5 <- function(d) {
  l <- length(d)
  o <- c(1L, l + 1L) + sort(rep_len(seq_len(l) - 1L, 2L * l))
  do.call('paste', c((c(as.list(names(d)), d))[o], sep = '_'))
}
microbenchmark(f1(d), f2(d),f5(d))
Unit: milliseconds
expr min lq median uq max neval
f1(d) 41.51040 43.88348 44.60718 45.29426 52.83682 100
f2(d) 193.94656 207.20362 210.88062 216.31977 252.11668 100
f5(d) 30.73359 31.80593 32.09787 32.64103 45.68245 100
To avoid looping through rows, you can use this:
do.call(paste, c(lapply(names(d), function(n)paste0(n,"_",d[[n]])), sep="_"))
Benchmarking:
N <- 1e4
d <- data.table(a=runif(N),b=runif(N),c=runif(N),d=runif(N),e=runif(N))
f1 <- function(d)
{
  do.call(paste, c(lapply(names(d), function(n) paste0(n, "_", d[[n]])), sep="_"))
}
f2 <- function(d)
{
  apply(d, 1, function(x) paste(names(d), x, sep="_", collapse="_"))
}
require(microbenchmark)
microbenchmark(f1(d), f2(d))
Note: f2 is inspired by @Ricardo's answer.
Results:
Unit: milliseconds
expr min lq median uq max neval
f1(d) 195.8832 213.5017 216.3817 225.4292 254.3549 100
f2(d) 418.3302 442.0676 451.0714 467.5824 567.7051 100
Edit note: previous benchmarking with N <- 1e3 didn't show much difference in times. Thanks again @eddi.
