Tail call optimization in Racket

Tail call optimization in Racket - garbage-collection

I was doing SICP exercise 2.28 and stumbled upon a strange behaviour of the following code:
(define (fringe tree)
(cond
((null? tree) '())
((not (pair? tree)) (list tree))
(else (append (fringe (car tree)) (fringe (cdr tree))))))
(define (fringe-tail tree)
(define (fringe-iter tree result)
(cond
((null? tree) result)
((not (pair? tree)) (list tree))
(else (fringe-iter (cdr tree) (append result (fringe-tail (car tree)))))))
(fringe-iter tree '()))
(define x (make-list (expt 10 4) 4))
(time (fringe x))
(time (fringe-tail x))
Ordinary fringe runs much faster than its iterative version fringe-tail:
cpu time: 4 real time: 2 gc time: 0
vs.
cpu time: 1063 real time: 1071 gc time: 191
It looks like fringe was optimized into loop and avoids any allocations, while fringe-tail runs much slower and spends time creating and destroying objects.
Can anyone explain this to me?
(Just in case I'm using racket 5.2.1)

If you replace the last clause with:
(else (fringe-iter (cdr tree) (append (fringe-tail (car tree)) result)))
then they run at the same speed for that input, and the tail-recursive version is faster for larger input.
The problem is that you're appending the much longer list for the cdr on to the front, which traverses and allocates much more than the naive version, which appends the fringe of the car on to the front.

The given code has applications in non-tail position, so the function is not iterative, despite its name. :)
Try this:
(define (fringe-tail tree)
(define (iter tree k)
(cond
[(null? tree)
(k '())]
[(not (pair? tree))
(k (list tree))]
[else
(iter (car tree)
(lambda (v1)
(iter (cdr tree)
(lambda (v2)
(k (append v1 v2))))))]))
(iter tree (lambda (a-fringe) a-fringe)))
However, it still uses append which is as expensive as the length of its first argument. Certain degenerate inputs into fringe and fringe-tail will cause a lot of computational suffering.
Let's give an example of such degenerate inputs:
(define (build-evil-struct n)
(if (= n 0)
(list 0)
(list (list (build-evil-struct (sub1 n)))
(build-evil-struct (sub1 n))
(list n))))
(define evil-struct (build-evil-struct 20))
When applied to both fringe and fringe-iter, you'll see very bad performance: I observe seconds of compute time on my own system for fringe and fringe-tail. These tests were run under DrRacket with debugging disabled. If you enable debugging, your numbers will be significantly different.
> (time (void (fringe evil-struct)))
cpu time: 2600 real time: 2602 gc time: 1212
> (time (void (fringe-tail evil-struct)))
cpu time: 4156 real time: 4155 gc time: 2740
With both of these, the use of append is what makes these susceptible to certain degenerate inputs. If we write an accumulating version of fringe, we can eliminate that cost, since we then get to use the constant-time cons operation:
(define (fringe/acc tree)
(define (iter tree acc)
(cond [(null? tree)
acc]
[(not (pair? tree))
(cons tree acc)]
[else
(iter (car tree) (iter (cdr tree) acc))]))
(iter tree '()))
Let's look at the performance of fringe/acc on this structure:
> (time (void (fringe/acc evil-struct)))
cpu time: 272 real time: 274 gc time: 92
Much better! And it's a straightforward matter to turn all the calls here to tail calls.
(define (fringe/acc/tail tree)
(define (iter tree acc k)
(cond [(null? tree)
(k acc)]
[(not (pair? tree))
(k (cons tree acc))]
[else
(iter (cdr tree) acc
(lambda (v1)
(iter (car tree) v1 k)))]))
(iter tree '() (lambda (v) v)))
> (time (void (fringe/acc/tail evil-struct)))
cpu time: 488 real time: 488 gc time: 280
Racket's implementation of the stack is, in this particular case, a bit faster than our reified stack we're representing in the continuations, so fringe/acc is faster than fringe/acc/tail. Still, both of these are significantly better than fringe because they avoid append.
All this being said: this function is already built-into Racket as the flatten function! So you might as well just use that if you don't want to reinvent the wheel. :)

Related

How to implement multithreading

Game rules
Consider a simple 2-player game played as follows: An even number of coins is laid out in a row. Taking turns, each player removes the coin on one of the ends of the row. The object is to have the highest value in coins when all coins have been taken.
Player one finds the sum of all of the even-numbered coins, and all the odd-numbered coins. If the sum of the odd numbered coins is higher, player one takes the leftmost coin; otherwise he takes the rightmost.
Player two has a choice, with an odd number of coins. So he tries taking a coin from both ends, to see which option would leave player 1 worse off.
The problem
I basically want to implement multitheading to this program. I am still very new to Clojure, and I couldn't really find any good material online, regarding multithreading, that could be applied to my program.
The code
(ns game.core
(:gen-class))
(defn vector-from-string [s]
(drop 1 (map read-string (clojure.string/split (clojure.string/trim-newline s) #" "))))
(defn string-from-file [f]
(slurp f))
(defn sum-of-evens [v]
(def evens (vector))
(loop [v v, index 1]
(when (seq v)
(if (even? index)
(def evens (conj evens (first v))))
(recur (rest v) (inc index))))
(reduce + evens))
(defn sum-of-odds [v]
(def odds (vector))
(loop [v v, index 1]
(when (seq v)
(if (odd? index)
(def odds (conj odds (first v))))
(recur (rest v) (inc index))))
(reduce + odds))
(defn player-two [v p1score p2score]
(if (not (empty? v))
(if (> (max (sum-of-odds (drop 1 v)) (sum-of-evens (drop 1 v))) (max (sum-of-odds (drop-last v)) (sum-of-evens (drop-last v))))
(player-one (drop-last v) p1score (+ p2score(last v)))
(player-one (drop 1 v) p1score (+ p2score (first v))))
(println "p1score" p1score "p2score" p2score)))
(defn player-one [v p1score p2score]
(if (not (empty? v))
(if (> (sum-of-odds v) (sum-of-evens v))
(player-two (drop 1 v) (+ p1score (first v)) p2score)
(player-two (drop-last v) (+ p1score (last v)) p2score))
(println "p1score" p1score "p2score" p2score)))
(defn -main [& args]
(let [v (vector-from-string (string-from-file "numbers.txt")) ]
(player-one v 0 0)))
So -main runs the player-one function first, and player-one calls player-two, and they both continue on until the end of the program. I would like to somehow implement multithreading to speed up the executing of this game with a higher amount of starting coins.

Your code is currently very unidiomatic.
A few remarks that hopefully help you getting into the right direction:
A def inside a defn (or def) is (almost) always wrong. You're thinking in terms of variable assignment and mutable variables here. This is not how Clojure works. Use variables in your recur instead, if you absolutely must, use a local atom (also almost always wrong, but less often wrong than def inside defn).
Your loops are unnecessarily complicated. You want to sum over the elements at even or odd indices? Use a combination of reduce, take-nth and rest:
(take-nth 2 [1 2 3])
;=> (1 3)
(take-nth 2 (rest [1 2 3 4]))
;=> (2 4)
The whole things looks like you're compiling this over and over again and then run the JVM with it. Am I right? The preferred way is to work at the REPL. How to access it, depends on which editing environment you use. There are many beginner-friendly REPLs out there. Gorilla REPL is one example.
Once you got your code and development workflow in better shape, you may want to explore functions like pmap and future for easy access to multi-threading. More advanced stuff involves a library called core.async, but that's probably not the ideal route for the beginner. You can also fall back to Java interop to create your threads. Again something that, while not really hard to do, requires a bit of experience with Clojure.
Hope that helps, even it is not a direct answer to your question.

First let's look at some issues in your example that will need to be addressed before parallelizing this code.
sum-of-evens is using def inside a function, which is almost always a mistake. This might seem to have the effect you want, but it's not the right way to achieve it. defs are typically used for namespace-level (at the same level as your function defns) values. We can refactor sum-of-evens to not rely on unintentionally side-effecty behavior via def:
(defn sum-of-evens [v]
(loop [v v
index 1
evens []]
(if (seq v)
(recur (rest v)
(inc index)
(if (even? index) ;; add a binding to loop, not a def
(conj evens (first v))
evens)) ;; pass unchanged value when necessary
(reduce + evens))))
But we can further simplify this function with keep-indexed:
(defn sum-of-evens [coll]
(->> coll
(keep-indexed (fn [i v] (when (even? (inc i))
v)))
(apply +)))
And when we do the same for sum-of-odds, we can see the functions are nearly identical except for the condition they use: odd? vs. even?. We can make another function that takes a predicate function:
(defn sum-by-index-pred [f coll]
(->> coll
(keep-indexed (fn [i v] (when (f i) v)))
(apply +)))
;; using partial application and function composition
(def sum-of-evens (partial sum-by-index-pred (comp even? inc)))
(def sum-of-odds (partial sum-by-index-pred (comp odd? inc)))
Looking at the implementation of player-one and player-two, they seem to be mutually recursive. I don't see how you could parallelize this to make it any faster because each turn is dependent on the previous turn's outcome; there's nothing to parallelize.
I'd suggest refactoring this so that your game rules and state are computed in one place, rather than mutually recursive functions.
(loop [scores (array-map :player-1 0 :player-2 0)
turns (cycle (keys scores))
vs (shuffle (range 100))]
(if (seq vs)
(let [higher-odds? (> (sum-of-odds vs) (sum-of-evens vs))
scores (if higher-odds?
(update scores (first turns) + (first vs))
(update scores (first turns) + (last vs)))
remain (if higher-odds?
(rest vs)
(butlast vs))]
(recur scores (rest turns) remain))
(prn scores)))
;; {:player-1 2624, :player-2 2326}
I'm not sure if this preserves your original game logic but it should be close, and it does generalize it for more than two players. Try adding :player-3 0 to the starting scores.

NLP with Racket

I am studying NLP with Racket and Dr. Racket.
I am working with this code:
#lang racket
(define english-1
'((Initial (1))
(Final (9))
(From 1 to 3 by NP)
(From 1 to 2 by DET)
(From 2 to 3 by N)
(From 3 to 4 by BV)
(From 4 to 5 by ADV)
(From 4 to 5 by |#|)
(From 5 to 6 by DET)
(From 5 to 7 by DET)
(From 5 to 8 by |#|)
(From 6 to 7 by ADJ)
(From 6 to 6 by MOD)
(From 7 to 9 by N)
(From 8 to 8 by MOD)
(From 8 to 9 by ADJ)
(From 9 to 4 by CNJ)
(From 9 to 1 by CNJ)))
(define (getf x y)
(if (eq? (car x) y)
(cadr x)
(getf (cdr x) y)))
(define (initial-nodes network)
(list-ref (assoc 'Initial network) 1))
(define (final-nodes network)
(list-ref (assoc 'Final network) 1))
(define (transitions network)
(filter (lambda (x) (eq? (car x) 'From)) network))
(define (trans-node transition)
(getf transition 'From))
(define(trans-newnode transition)
(getf transition 'to))
(define (trans-label transition)
(getf transition 'by))
(define abbreviations
'((NP kim sandy lee)
(DET a the her)
(N consumer man woman)
(BV is was)
(CNJ and or)
(ADJ happy stupid)
(MOD very)
(ADV often always sometimes)))
(define (recognize network tape)
;; returns t if sucessfully recognizes tape - nil otherwise
(call/cc (lambda (return)
(define (recognize-next node tape network)
(if (and (null? tape) (member node (final-nodes network)))
(return #t) ; success
(for ([transition (transitions network)])
;; try each transition of the network
(when (equal? node (trans-node transition)) ; if it starts at the right node
(for ([newtape (recognize-move (trans-label transition) tape)])
;; try each possible new value of tape
(recognize-next (trans-newnode transition) newtape network))))))
(for ([initialnode (initial-nodes network)])
(recognize-next initialnode tape network))
null))) ; failed to recognize
(define (recognize-move label tape)
(if (or (eq? label (car tape))
(member (car tape) (assoc label abbreviations)))
(list (cdr tape))
(if (eq? label '|#|)
(list tape)
null)))
(require racket/trace)
(trace recognize-move)
(recognize-move english-1 '(hahaha))
The code seems to be mostly fine. However, I keep getting a error messaging related to the recognize-move function:
member: not a proper list: #f
And I thought I was dealing with lists... How can I solve this?

The problem is with this form:
(member (car tape) (assoc label abbreviations))
If assoc does not find anything the result is #f. (member 'anything #f) will not work. In Common Lisp false is the same as an empty list so member on false will work, but not in Scheme. You can perhaps make sure it's a list like this:
(member (car tape) (or (assoc label abbreviations) '()))

This is code translated from Common Lisp. In CL, nil is false and is also the empty list, (). In Racket, #f is false and is not the same as (). assoc wants to return false if it does not find a match: because of the way CL puns on false and the empty list, this means that (member ... (assoc ...)) will always work. In Racket it won't: you need to check to see if assoc failed to find a match.

Trying to get a Scheme Depth-First or Breadth-First Search Currying function to work

I'm new here, and I need help with a function that I'm writing in scheme.
Basically, it involves a search function that can work for either Breadth-First search or Depth-First Search. I think I got the Depth-First-Merge and Breadth-First Merge to work.
However, the problem is to modify the main Search to work as a "currying function" so that when algorithm-specific merge procedures (such as the depth-first-merge or the breadth-first-merge) are passed in as arguments, the search uses that specific type of search. The return
There's two files I have with it. Coins is okay, but search needs fixed. How do I modify the search function here to work as a curried version?
Here's my codes below. first one for search.ss. I made a search2 as an early attempt, but it didn't work. I need to make either search or search2 work as the curried search, (then delete the other). I'm not sure, but I think the merges and two searches are working.
;;;
;;; SEARCH:
;;; -- Non-curried version of generic search algorithm
;;; -- Can be customized for depth-first and breadth-first search
;;; -- You must convert it to a curried version so that
;;; - the function accepts 1 algorithm specific parameter and returns a function
;;; - that accepts 3 problem-specific parameters and returns a function
;;; - that accepths 1 instance specific parameter and performs the search
;;; -- The 5 parameters are described below
;;;
;;; Input:
;;; merge-queue
;;; -- algorithm specific
;;; -- procedure that takes a list of new paths and a queue
;;; and returns a new queue
;;; extend
;;; -- problem-specific
;;; -- procedure that takes a state and a list of visited states,
;;; and returns a list of states that are reachable in one move
;;; from the given state
;;; goal?
;;; -- problem-specific
;;; -- predicate that takes a state and returns true if the
;;; state is a goal state, false otherwise
;;; print-path
;;; -- problem-specific
;;; -- procedure that takes a state and prints out a state nicely
;;; init-state
;;; -- problem instance-specific
;;; -- an initial state to start the search from
;;;
;;; OUTPUT:
;;; -- When succeeded, a path from the initial state to a goal state
;;; -- When failed, #f
;;;
;;Either this or search2 needs to be rewritten into a curried version
;;To accept either depth-first-merge or breadth-first merge as merge procedures into merge-queue
(define search
(lambda (merge-queue init-config extend goal? print-state)
(letrec
((helper
(lambda (queue)
(newline)
(for-each
(lambda (p) (print-path p print-state))
queue)
(cond ((null? queue) #f)
((goal? (caar queue))
(print-state (caar queue))
(newline)
(let ((ans (reverse (car queue))))
(for-each (lambda (x) (print-state x) (newline)) ans)
ans))
(else
(let ((successors (extend (caar queue))))
(print-state (caar queue)) (newline)
(cond ((null? successors)
(helper (cdr queue)))
(else
(for-each (lambda (x) (print-state x) (newline))
successors)
(helper
(merge-queue (cdr queue)
(extend-path successors (car queue))))))))))))
(helper
(list (list (config->state init-config ))))))
(define search2
(lambda (merge-queue extend goal? print-path init-state)
(letrec
((search-helper
(lambda (queue visited)
(cond
((null? queue) #f)
((goal? (caar queue))
(begin
(print-path (car queue))
(car queue)))
(else
(let ((successors (extend (caar queue) visited)))
(cond
((null? successors)
(search-helper (cdr queue) visited))
(else
(let ((new-paths (extend-path successors (car queue))))
(search-helper
(merge-queue queue new-paths)
(cond
(merge-queue))
(append successors visited)))))))))))
(search-helper
(list (list init-state)) ; initial queue
(list init-state))))) ; initial visited
(define extend-path
(lambda (successors path)
(if (null? successors)
'()
(cons (cons (car successors) path)
(extend-path (cdr successors) path)))))
;; merge new extended paths to queue for depth first search
;; - uncomment and define your merge for depth first search
(define depth-first-merge
(lambda (queue paths)
(append! paths queue)))
;; merge new extended paths to queue for breadth first search
;; - uncomment and define your merge for breadth first search
(define breadth-first-merge
(lambda (queue paths)
(append! queue paths)))
;; customize the generic search for depth first search
;; - uncomment and define your depth-first-search in terms of your
;; curried version of search and depth-first-merge
;; Curry Methods are helpful to this.
(define depth-first-search (search depth-first-merge))
(lambda (extend goal? print-path)
(search (depth-first-merge extend goal? print-path))))
;; customize the generic search for breadth first search
;; - uncomment and define your breadth-first-search in terms of your
;; curried version of search and breadth-first-merge
(define breadth-first-search (search breadth-first-merge))
(lambda (extend goal? print-path)
(search (breadth-first-merge extend goal? print-path))))
And this is the Coins file code that is used to compliment the search code. They are in separate files and it loads search.ss (the above one) to work.
;; load algorithm specific code for search
(load "search.ss")
;;; Problem specific code for solving the old British coin problems
;;; using the curried version of the simple search procedure.
;;; The old British coin problem was discussed in the lecture.
;;;
;;; To solve the problem, load this file and run
;;; (coin-depth-first amount)
;;; or
;;; (coin-breadth-first amount)
;;; where, amount is replaced with some number, e.g., 48.
;;;
;;; Here, a state is represented as follows:
;;; (amount (coin1 coin2 ...))
;;;
;;; The car of the state represents how much change you need to pay further.
;;; The cadr of the state represents the combination of coins you used
;;; to pay so far. For example,
;;; (48 ())
;;; is the initial state for the amount of 48 cents and
;;; (0 (24 24)
;;; can be one of the goal states using two 24-cent coins.
;; There are 7 kinds of old British coins
(define old-british-coins '(120 30 24 12 6 3 1))
;; Or, you can do the same for US coins
(define us-coins '(100 50 25 10 5 1))
;; Here, we will do the old British coins
(define *coins* old-british-coins)
;; Is a state the goal state?
(define goal?
(lambda (state)
(zero? (car state))))
;; returns children of a state
(define extend
(lambda (state visited)
(let ((coins (applicable-coins state visited *coins*)))
(map
(lambda (coin)
(list (- (car state) coin)
(append (cadr state) (list coin))))
coins))))
;; find all applicable coins from a state
(define applicable-coins
(lambda (state visited coins)
(cond
((null? coins) '())
((<= (car coins) (car state))
(if (visited? state visited (car coins))
(applicable-coins state visited (cdr coins))
(cons (car coins) (applicable-coins state visited (cdr coins)))))
(else (applicable-coins state visited (cdr coins))))))
;; see if a state has been visited before
(define visited?
(lambda (state visited coin)
(cond
((null? visited) #f)
((= (- (car state) coin) (caar visited)) #t)
(else (visited? state (cdr visited) coin)))))
;; pretty-print a state
(define pretty-print-path
(lambda (path)
(pretty-print-state (car path))))
(define pretty-print-state
(lambda (state)
(let ((change (car state))
(coins (cadr state))
(total (apply + (cadr state))))
(printf
"===> Total of ~a paid with ~a, with remainder of ~a <===~%"
total coins change))))
;; customize the generic depth-first-search for coin problem
(define coin-depth-first-search
(depth-first-search extend goal? pretty-print-path))
;; instance of a coin problem using depth-first search
(define coin-depth-first
(lambda (amount)
(coin-depth-first-search (list amount '()))))
;; customize the generic breadth-first-search for coin problem
(define coin-breadth-first-search
(breadth-first-search extend goal? pretty-print-path))
;; instance of a coin problem with breadth-first search
(define coin-breadth-first
(lambda (amount)
(coin-breadth-first-search (list amount '()))))
Can someone please help me? I think all I need to get it to work is to find out how to make the search or search2 code become a curried version.

To curry a function means to redefine it in such a way that it takes a number of parameters less than the current definition and returns a new function that takes the rest of the parameters and perform the work of the first one. For instance, you can curry the following two-parameters summing function:
(define add
(lambda (a b)
(+ a b)))
(add 7 10) ;; => 17
in the following way:
(define add-to
(lambda (a)
(lambda (b)
(+ a b))))
((add-to 7) 10) ;; => 17
(define add-to-7 (add-to 7)) ;; we give a name to the function that add 7 to its argument
(add-to-7 8) ;; => 15
(add-to-7 9) ;; => 16
So, to transform the search2 function (you must extend that function since its last parameter is the problem instance specific one):
(define search2
(lambda (merge-queue extend goal? print-path init-state)
...body of search2...
as required, you could simply write something like this:
(define search2
(lambda (merge-queue)
(lambda (extend goal? print-path)
(lambda (init-state)
...body of search2...
and then, calling it with the correct number of parameters, you could obtain “partial” functions to be called later. For instance you can define a generic depth first search as:
(define depth-first-search (search2 depth-first-merge))
then you can define the depth first search specialized for the coin problem, given appropriate definitions for the coin functions:
(define coin-depth-first (depth-first-search coin-extend coin-goal? coin-print-path))
and finally you can call it with a certain amount to solve the problem:
(coin-depth-first 100)

lisp hash symbol replaces returned values

I've been trying to build something similar to a breadth-first tree-like structure for a graph, which contains all possible paths from a given node. I didn't have problem with the algorithm as much as I do with some sort of error that pops up. Here's the relevant code below:
(set 'my-graph '((A (B C))
(B (D E))
(C (F G))
(D (E))
(E (H))
(F (H I))
(G (I))
(H (J))
(I (J))
(J ())))
(defun search-tree(graph traversed visited)
(cond
((null traversed) NIL)
(:else (let*
((new-visited (append visited (list (car traversed))))
(children (add-children graph (car traversed)
(append (cdr traversed) new-visited))))
(cond
((null children) (list (car traversed)))
(:else
(cons (car traversed)
(mapcar (lambda(x) (search-tree graph (list x) new-visited)) children)))
)
)
)
)
)
;;; Selects the node to pick returned children from
(defun add-children(graph node visited)
(cond
((null graph) NIL)
((equal (caar graph) node) (new-nodes (cadar graph) visited))
(:else (add-children (cdr graph) node visited))
)
)
;;; Returns new, unvisited nodes from the children of a node
(defun new-nodes(children visited)
(cond
((null children) NIL)
((member (car children) visited) (new-nodes (cdr children) visited))
(:else (cons (car children) (new-nodes (cdr children) visited)))
)
)
Function search tree is called as (search-tree my-graph '(A) '()) and it returns almost everything as I want correctly, but the first terminal node which is replaced with a # symbol (it should be (J)). What could be the problem in here? That's the returned value.
(A (B (D (E (H #))) (E (H (J)))) (C (F (H (J)) (I (J))) (G (I (J)))))
I've tried tracing the code, but I still don't understand why is the (J) list swapped in mid-recursion with a # symbol.

Usually I would guess that it has something to do with *print-level*.
This variable controls how deep nested lists are printed. Set it to a number for the level. Lists in a deeper level are replaced with a # character.
If setting that to NIL does not help, then you might also want to consult the Allegro CL manual - I can remotely remember that the IDE also has its own settings.

Which amount of code execution should I parallelize?

If I want to parallelize the execution of an algorithm what are the smalls chunks of code that I should split?
A classic example is a sorting algorithm. For what element size or typical execution time does it make sense to split the sorting between multiple threads? Or when is the overhead for waiting on another thread larger than the execution time on a single thread?
Are there any simple rules? Does this depend on the OS?

The key rule is "fork only when the forking overhead is much smaller than the amount of work the fork will do". Since forking overhead is a property of the specific technology you use, and so is the effort to do the work, you in some sense have to determine this empirically. You'll likely end up with some threshold tuning constant in your code to represent this tradeoff.
What you will discover in practice is that finding seperable chunks of work is actually hard. If you make the work chunk small, it hasn't got a lot of dependencies and you can schedule it once all its input dataflows are ready. But small chunks usually mean small work, and the forking overhead usually negates the gain. If you try to make the chunks big, they have so many dependences that you can't break them out to schedule them.
Some people are lucky and can find such big chunks; we call most of those people physicists and/or Fortran programmers and they are taking advantage of data parallelism induced by dividing the world into as many tiny pieces as they can.
The only decent cure I know of is to use a spectacularly fast forking mechanism, so that you can find the smallest practical chunks. Unfortunately, the parallelism libraries offered to do this are ... libraries, invoked dynamically, with corresponding dynamic invocation overhead. Typical libraries containing parallelism primitives takes 100s to thousands of cycles to implement a "fork"; this is bad news if your chunk of work is 100 machine instructions.
I believe strongly that to get such fast forking mechanisms, the language compiler has to know that you are doing the fork, e.g., "fork" (however spelled :-) has be a keyword in the language. Then the compiler can see the forks, and preallocate everything needed to minimize the time to accomplish this, and generate special code to manage the forking (and joining) steps.
The PARLANSE language that I designed, and that we use at Semantic Designs is one such language.
It is a Lisp-like language in syntax (but not in semantics). Its parallelism operator is spelled "(|| ... )". You can see it below in the Quicksort module we use daily, below.
You can also see the explicit QuickSortParallelThreshold value, determined empirically.
This Quicksort scales linearly to 8 cores on an Intel x86 system.
(define QuickSort
(module
(;; (define Value nu)
(compileifthen (~ (defined QuickSortWithParlanseBuiltInOrderingOfNu))
(define QuickSortWithParlanseBuiltInOrderingOfNu ~f) ; use PARLANSE comparison operators
)compileifthen
(compileifthen (~ (defined QuickSortParallelThreshold))
(define QuickSortParallelThreshold 100)
)compileifthen
(compileifthen (~ (defined QuickSortThreshold))
(compileifthenelse QuickSortWithParlanseBuiltInOrderingOfNu
(define QuickSortThreshold 16)
(define QuickSortThreshold 8)
)compileifthenelse
)compileifthen
(compileifthenelse (~ (defined QuickSortWithCompareByReference))
(define QuickSortWithCompareByReference ~f)
(compileifthen QuickSortWithParlanseBuiltInOrderingOfNu
(define QuickSortWithCompareByReference ~f)
)compileifthen
)compileifthenelse
(define SortRange
(action (procedure (structure (compileifthen (~ QuickSortWithParlanseBuiltInOrderingOfNu)
(compileifthenelse (~ QuickSortWithCompareByReference)
[compare (function (sort integer (range -1 +1)) (structure [value1 Value] [value2 Value]))]
[compare (function (sort integer (range -1 +1)) (structure [value1 (reference Value)] [value2 (reference Value)]))]
)compileifthenelse
)compileifthen
[a (reference (array Value 1 dynamic))]
[from natural]
[to natural]
)structure
)procedure
(local (;; (define quicksort
(action (procedure (structure [l integer] [r integer])))
)define
(define quicksort
(action (procedure (structure [l integer] [r integer]))
(ifthenelse (<= (- r l) (coerce integer QuickSortThreshold))
(do [i integer] (++ l) r +1
(local (= [exch Value] a:i)
(block exit_if_inserted
(;; (do [j integer] (-- i) l -1
(ifthenelse (compileifthenelse QuickSortWithParlanseBuiltInOrderingOfNu
(> a:j exch)
(compileifthenelse (~ QuickSortWithCompareByReference)
(== (compare a:j exch) +1)
(== (compare (. a:j) (. exch)) +1)
)compileifthenelse
)compileifthenelse
(= a:(++ j) a:j)
(;; (= a:(++ j) exch)
(exitblock exit_if_inserted)
);;
)ifthenelse
)do
(= a:l exch)
);;
)block
)local
)do
(local (;; (= [i integer] l)
(= [j integer] r)
(= [p integer] l)
(= [q integer] r)
[exch Value]
);;
(;;
`use middle element as pivot':
(local (= [m integer] (// (+ l r) +2))
(;; (= exch a:m)
(= a:m a:r)
(= a:r exch)
);;
)local
`4-way partitioning = < > =':
(loop exit_if_partitioned
(;;
`find element greater than pivot':
(loop exit_if_greater_than_found
(;; (compileifthenelse QuickSortWithParlanseBuiltInOrderingOfNu
(ifthenelse (< a:i a:r)
(consume ~t)
(ifthenelse (> a:i a:r)
(exitblock exit_if_greater_than_found)
(;; (ifthen (>= i j)
(exitblock exit_if_partitioned)
)ifthen
(= exch a:p)
(= a:p a:i)
(= a:i exch)
(+= p 1)
);;
)ifthenelse
)ifthenelse
(case (compileifthenelse (~ QuickSortWithCompareByReference)
(compare a:i a:r)
(compare (. a:i) (. a:r))
)compileifthenelse
-1
(consume ~t)
+1
(exitblock exit_if_greater_than_found)
else (;; (ifthen (>= i j)
(exitblock exit_if_partitioned)
)ifthen
(= exch a:p)
(= a:p a:i)
(= a:i exch)
(+= p 1)
);;
)case
)compileifthenelse
(+= i 1)
);;
)loop
`find element less than to pivot':
(loop exit_if_less_than_found
(;; (-= j 1)
(ifthen (>= i j)
(exitblock exit_if_partitioned)
)ifthen
(compileifthenelse QuickSortWithParlanseBuiltInOrderingOfNu
(ifthenelse (< a:j a:r)
(exitblock exit_if_less_than_found)
(ifthenelse (> a:j a:r)
(consume ~t)
(;; (-= q 1)
(= exch a:j)
(= a:j a:q)
(= a:q exch)
);;
)ifthenelse
)ifthenelse
(case (compileifthenelse (~ QuickSortWithCompareByReference)
(compare a:j a:r)
(compare (. a:j) (. a:r))
)compileifthenelse
-1
(exitblock exit_if_less_than_found)
+1
(consume ~t)
else (;; (-= q 1)
(= exch a:j)
(= a:j a:q)
(= a:q exch)
);;
)case
)compileifthenelse
);;
)loop
`move found elements to proper partitions':
(;; (= exch a:i)
(= a:i a:j)
(= a:j exch)
);;
`increment index':
(+= i 1)
);;
)loop
`3-way partitioning < = >':
(;;
`move pivot to final location':
(;; (= exch a:i)
(= a:i a:r)
(= a:r exch)
(= j (-- i))
(= i (++ i))
);;
`move elements equal to pivot to final locations':
(;; (do [k integer] l (-- p) +1
(;; (= exch a:k)
(= a:k a:j)
(= a:j exch)
(-= j 1)
);;
)do
(do [k integer] (-- r) q -1
(;; (= exch a:i)
(= a:i a:k)
(= a:k exch)
(+= i 1)
);;
)do
);;
);;
`sort partitions not equal to pivot':
(ifthenelse (<= (- r l) (coerce integer QuickSortParallelThreshold))
(;; (quicksort l j)
(quicksort i r)
);;
(|| (quicksort l j)
(quicksort i r)
)||
)ifthenelse
);;
)local
)ifthenelse
)action
)define
);;
(;; (quicksort (coerce integer from) (coerce integer to))
(ifdebug (do [i integer] (coerce integer from) (-- (coerce integer to)) +1
(trust (compileifthenelse QuickSortWithParlanseBuiltInOrderingOfNu
(<= a:i a:(++ i))
(compileifthenelse (~ QuickSortWithCompareByReference)
(<= (compare a:i a:(++ i)) +0)
(<= (compare (. a:i) (. a:(++ i))) +0)
)compileifthenelse
)compileifthenelse
`QuickSort:Sort -> The array is not sorted.'
)trust
)do
)ifdebug
);;
)local
)action
)define
(define Sort
(action (procedure (structure (compileifthen (~ QuickSortWithParlanseBuiltInOrderingOfNu)
(compileifthenelse (~ QuickSortWithCompareByReference)
[compare (function (sort integer (range -1 +1)) (structure [value1 Value] [value2 Value]))]
[compare (function (sort integer (range -1 +1)) (structure [value1 (reference Value)] [value2 (reference Value)]))]
)compileifthenelse
)compileifthen
[a (reference (array Value 1 dynamic))]
)structure
)procedure
(compileifthenelse (~ QuickSortWithParlanseBuiltInOrderingOfNu)
(SortRange compare a (coerce natural (lowerbound (# a) 1)) (coerce natural (upperbound (# a) 1)))
(SortRange a (coerce natural (lowerbound (# a) 1)) (coerce natural (upperbound (# a) 1)))
)compileifthenelse
)action
)define
);;
)module
)define

It depends on the overhead of the inter-thread communication. I tested openMP with image processing, and there a line of pixels was convenient, as well giving good speedups. My image was a megapixel, so there were 1000 tasks, which is probably more than enough to keep today's manycore machines busy. You also don't need to limit yourself to jobs that take more than a second or so. In this example the speedups of jobs of the order of 10 milliseconds where clearly visible.
Now this was a pleasant algorithm because it was not recursive, so there were no dependencies of one task on the other, and all the tasks were automatically the same size.
Sorting algorithms will be harder, due to varying task sizes. You'd want to be able to experiment with this, and maybe choose a sort that is easier to paralellize.

Take couple of courses in concurrent and parallel programming. Learn several technologies like plain old fork & forget or "manual" multithreading (Java threads or pthreads), MPI, OpenMP, BSP, maybe even CUDA or OpenCL. Then either decide to be an expert or let the experts design and implement efficient and correct parallel algorithms. The "parallel" part is easy, the "efficient" and "correct" parts are not, when both are needed. Even Java concurrent Vector collection, designed and implemented by experts, was not free from bugs in the first versions. The mere definition of memory model was not clear in the first versions of Java standard!
The simplest rule: use ready-to-use components designed and implemented by experts and don't try to achieve both correctness and efficiency designing your own parallel algorithms unless you're an expert.

Solving this problem programmatically is one of the holy grails of parallel computing, and there are many libraries that can approximate the optimal parallelism for particular problems (e.g., Data Parallel Haskell).
Anyhow, to do this by hand, you need to understand:
The algorithm that you wish to parallelize (Is it parallelizable?)
The characteristics of the data, e.g., sizes, location (on disk, in memory), etc.
The hardware that you're running on, e.g., number cores, memory latency, cache sizes/lines/associavity, etc.
The threading model of both the implementation language (coroutines, green threads, OS threads) and OS.
Cost of spawning and context-switching between threads.
Assuming that the algorithm is parallelizable, your goal is to find the number of threads and the relative chunk size of the data, such that you can make optimal use of the hardware to generate a solution.
This is quite hard to do without lots of experimentation. My preferred way of figuring this out is by running lots of benchmarks, and getting performance data as a function of one or more combinations of the following:
Number of threads.
Buffer sizes (if the data is not in RAM) incrementing at some reasonable value (e.g., block size, packet size, cache size, etc.)
Varying chunk sizes (if you can process the data incrementally).
Various tuning knobs for the OS or language runtime.
Pinning threads to CPUs to improve locality.
Etc.
Anyhow, this is no easy task, and there are tools and libraries to help you squeeze as much performance as is possible out of your parallelizable problems. The only reasonable way you can do this correctly by having a good understanding of your data, your code, and your runtime environment.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string