How do I optimise this Haskell limit order book (with code, reports, graphs)?

I've written a Haskell version of a limit order book, referencing this version written in C:
https://github.com/jordanbaucke/Limit-Order-Book/blob/master/Others/C%2B%2B/engine.c
A limit order book is the mechanism many stock and currency exchanges use for matching buy and sell orders.
This Haskell version (source code further down) submits 2000 random limit orders to the order book and calculates the average execution price.
main = do
    orders <- randomOrders
    let (orderBook, events) = foldr (\order (book, ev) -> let (b, e) = processOrder order book in (b, ev++e)) (empty, [])
                                    (take 2000 orders)
    let (total, count) = (fromIntegral $ sum $ map executePrice events, fromIntegral $ length events)
    print $ "Average execution price: " ++ show (total / count) ++ ", " ++ show count ++ " executions"
I've compiled it with -O2, and running the program without profiling takes almost 10 seconds.
time ./main
"Average execution price: 15137.667036215817, 2706.0 executions"
./main 9.90s user 0.09s system 89% cpu 11.205 total
Setting the program to process 10000 orders instead takes about 160 seconds.
time ./main
"Average execution price: 15047.099824996354, 13714.0 executions"
./main 161.99s user 2.08s system 57% cpu 4:44.16 total
What can I do to make it dramatically faster without sacrificing functionality? Do you think it is possible to get it to process 10000 orders per second?
Here are the memory usage charts (with the 2000 orders), generated with +RTS -hc/-hd/-hy and hp2ps:
Here is the source code:
import Data.Array
import Data.List
import Data.Word
import Data.Maybe
import Data.Tuple
import Debug.Trace
import System.Random
import Control.Monad (replicateM)
-- Price is measured in smallest divisible unit of currency.
type Price = Word64
maximumPrice = 30000
type Quantity = Word64
type Trader a = a
type Entry a = (Quantity, Trader a)
type PricePoint a = [Entry a]
data OrderBook a = OrderBook {
    pricePoints :: Array Price (PricePoint a),
    minAsk :: Price,
    maxBid :: Price
} deriving (Show)
data Side = Buy | Sell deriving (Eq, Show, Read, Enum, Bounded)
instance Random Side where
    randomR (a, b) g =
        case randomR (fromEnum a, fromEnum b) g of
            (x, g') -> (toEnum x, g')
    random g = randomR (minBound, maxBound) g
data Order a = Order {
    side :: Side,
    price :: Price,
    size :: Quantity,
    trader :: Trader a
} deriving (Show)
data Event a =
    Execution {
        buyer :: Trader a,
        seller :: Trader a,
        executePrice :: Price,
        executeQuantity :: Quantity
    } deriving (Show)
empty :: OrderBook a
empty = OrderBook {
    pricePoints = array (1, maximumPrice) [(i, []) | i <- [1..maximumPrice]],
    minAsk = maximumPrice,
    maxBid = 0
}
insertOrder :: Order a -> OrderBook a -> OrderBook a
insertOrder (Order side price size t) (OrderBook pricePoints minAsk maxBid) =
    OrderBook {
        pricePoints = pricePoints // [(price, (pricePoints!price) ++ [(size, t)])],
        maxBid = if side == Buy && maxBid < price then price else maxBid,
        minAsk = if side == Sell && minAsk > price then price else minAsk
    }
processOrder :: Order a -> OrderBook a -> (OrderBook a, [Event a])
processOrder order orderBook
    | size /= 0 && price `comp` current =
        let (_order, _ob, _events) = executeForPrice order{price=current} _orderBook
        in (\(a,b) c -> (a,c++b)) (processOrder _order{price=price} _ob) _events
    | otherwise = (insertOrder order orderBook, [])
    where
        Order side price size _ = order
        (current, comp, _orderBook)
            | side == Buy  = (minAsk orderBook, (>=), orderBook{minAsk=current+1})
            | side == Sell = (maxBid orderBook, (<=), orderBook{maxBid=current-1})
executeForPrice :: Order a -> OrderBook a -> (Order a, OrderBook a, [Event a])
executeForPrice order orderBook
    | null pricePoint = (order, orderBook, [])
    | entrySize < size = (\(a, b, c) d -> (a, b, d:c))
        (executeForPrice order{size=size-entrySize} (set rest)) (execute entrySize)
    | otherwise =
        let entries
                | entrySize > size = (entrySize-size, entryTrader):rest
                | otherwise = rest
        in (order{size=0}, set entries, [execute size])
    where
        pricePoint = (pricePoints orderBook)!price
        (entrySize, entryTrader):rest = pricePoint
        Order side price size trader = order
        set = \p -> orderBook{pricePoints=(pricePoints orderBook)//[(price, p)]}
        (buyer, seller) = (if side == Buy then id else swap) (trader, entryTrader)
        execute = Execution buyer seller price
randomTraders :: IO [Int]
randomTraders = do
    g <- newStdGen
    return (randomRs (1, 3) g)

randomPrices :: IO [Word64]
randomPrices = do
    g <- newStdGen
    return (map fromIntegral $ randomRs (1 :: Int, fromIntegral maximumPrice) g)

randomSizes :: IO [Word64]
randomSizes = do
    g <- newStdGen
    return (map fromIntegral $ randomRs (1 :: Int, 10) g)

randomSides :: IO [Side]
randomSides = do
    g <- newStdGen
    return (randomRs (Buy, Sell) g)

randomOrders = do
    sides <- randomSides
    prices <- randomPrices
    sizes <- randomSizes
    traders <- randomTraders
    let zipped = zip4 sides prices sizes traders
    let orders = map (\(side, price, size, trader) -> Order side price size trader) zipped
    return orders
main = do
    orders <- randomOrders
    let (orderBook, events) = foldr (\order (book, ev) -> let (b, e) = processOrder order book in (b, ev++e)) (empty, [])
                                    (take 2000 orders)
    let (total, count) = (fromIntegral $ sum $ map executePrice events, fromIntegral $ length events)
    print $ "Average execution price: " ++ show (total / count) ++ ", " ++ show count ++ " executions"
Here are the profiling reports:
ghc -rtsopts --make -O2 OrderBook.hs -o main -prof -auto-all -caf-all -fforce-recomp
time ./main +RTS -sstderr +RTS -hd -p -K100M && hp2ps -e8in -c main.hp
./main +RTS -sstderr -hd -p -K100M
"Average execution price: 15110.97202536367, 2681.0 executions"
3,184,295,808 bytes allocated in the heap
338,666,300 bytes copied during GC
5,017,560 bytes maximum residency (149 sample(s))
196,620 bytes maximum slop
14 MB total memory in use (2 MB lost due to fragmentation)
Generation 0: 4876 collections, 0 parallel, 1.98s, 2.01s elapsed
Generation 1: 149 collections, 0 parallel, 1.02s, 1.07s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 5.16s ( 5.24s elapsed)
GC time 3.00s ( 3.08s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.01s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 8.17s ( 8.33s elapsed)
%GC time 36.7% (36.9% elapsed)
Alloc rate 617,232,166 bytes per MUT second
Productivity 63.1% of total user, 61.9% of total elapsed
./main +RTS -sstderr +RTS -hd -p -K100M 8.17s user 0.06s system 98% cpu 8.349 total
cat main.prof
Sun Feb 9 12:03 2014 Time and Allocation Profiling Report (Final)
main +RTS -sstderr -hd -p -K100M -RTS
total time = 0.64 secs (32 ticks @ 20 ms)
total alloc = 1,655,532,980 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
processOrder Main 46.9 81.2
insertOrder Main 21.9 0.0
executeForPrice Main 18.8 9.7
randomPrices Main 9.4 0.1
main Main 3.1 4.5
minAsk Main 0.0 2.1
maxBid Main 0.0 2.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
main Main 392 3 3.1 4.5 100.0 99.8
executePrice Main 417 2681 0.0 0.0 0.0 0.0
processOrder Main 398 5695463 46.9 81.2 87.5 95.0
executeForPrice Main 412 5695252 18.8 9.7 18.8 9.7
pricePoints Main 413 5695252 0.0 0.0 0.0 0.0
insertOrder Main 406 1999 21.9 0.0 21.9 0.0
minAsk Main 405 0 0.0 2.1 0.0 2.1
maxBid Main 400 0 0.0 2.0 0.0 2.0
randomOrders Main 393 1 0.0 0.0 9.4 0.2
randomTraders Main 397 1 0.0 0.0 0.0 0.0
randomSizes Main 396 2 0.0 0.1 0.0 0.1
randomPrices Main 395 2 9.4 0.1 9.4 0.1
randomSides Main 394 1 0.0 0.1 0.0 0.1
CAF:main14 Main 383 1 0.0 0.0 0.0 0.0
randomPrices Main 401 0 0.0 0.0 0.0 0.0
CAF:lvl42_r2wH Main 382 1 0.0 0.0 0.0 0.0
main Main 418 0 0.0 0.0 0.0 0.0
CAF:empty_rqz Main 381 1 0.0 0.0 0.0 0.0
empty Main 403 1 0.0 0.0 0.0 0.0
CAF:lvl40_r2wB Main 380 1 0.0 0.0 0.0 0.0
empty Main 407 0 0.0 0.0 0.0 0.0
CAF:lvl39_r2wz Main 379 1 0.0 0.0 0.0 0.1
empty Main 409 0 0.0 0.1 0.0 0.1
CAF:lvl38_r2wv Main 378 1 0.0 0.0 0.0 0.1
empty Main 410 0 0.0 0.1 0.0 0.1
CAF:maximumPrice Main 377 1 0.0 0.0 0.0 0.0
maximumPrice Main 402 1 0.0 0.0 0.0 0.0
CAF:lvl14_r2vF Main 350 1 0.0 0.0 0.0 0.0
executeForPrice Main 414 0 0.0 0.0 0.0 0.0
CAF:lvl12_r2vB Main 349 1 0.0 0.0 0.0 0.0
processOrder Main 415 0 0.0 0.0 0.0 0.0
CAF:lvl10_r2vx Main 348 1 0.0 0.0 0.0 0.0
processOrder Main 416 0 0.0 0.0 0.0 0.0
CAF:lvl8_r2vt Main 347 1 0.0 0.0 0.0 0.0
processOrder Main 399 0 0.0 0.0 0.0 0.0
CAF:lvl6_r2vp Main 346 1 0.0 0.0 0.0 0.0
empty Main 408 0 0.0 0.0 0.0 0.0
CAF:lvl4_r2vl Main 345 1 0.0 0.0 0.0 0.0
empty Main 411 0 0.0 0.0 0.0 0.0
CAF:lvl2_r2vh Main 344 1 0.0 0.0 0.0 0.0
empty Main 404 0 0.0 0.0 0.0 0.0
CAF GHC.Float 319 8 0.0 0.0 0.0 0.0
CAF GHC.Int 304 2 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 278 2 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 239 2 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 232 1 0.0 0.0 0.0 0.0
CAF System.Random 222 1 0.0 0.0 0.0 0.0
CAF Data.Fixed 217 3 0.0 0.0 0.0 0.0
CAF Data.Time.Clock.POSIX 214 2 0.0 0.0 0.0 0.0
I'm a newbie in Haskell. How do I interpret these reports, what do they mean, and what can I do to make my code faster?

There are two things we can note from the profiling you've done: there seem to be a lot of arrays in memory, and also a fair amount of tuples, or rather tuple projection functions. Those seem like good targets for optimization.
I first tried replacing arrays with Data.Map, and for me that cut execution time in half. This is a much bigger win than you reported in one of the comments to your question. You didn't say exactly how you used maps, but one thing I did was make sure that the initial map is empty, i.e. I didn't initialize it with lots of empty price points. For this to work I used findWithDefault from Data.Map and let it return an empty list whenever the key wasn't present (roughly as sketched below). If you didn't do that, that might be why I got a much better speedup than you.
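For reference, a minimal sketch of the shape this can take, reusing the Price, PricePoint and Entry types from the question (lookupPricePoint and insertEntry are helper names I'm making up here, not functions from Data.Map):
import qualified Data.Map as Map

-- Sketch only: store price levels sparsely; a missing price means an empty level.
type PricePoints a = Map.Map Price (PricePoint a)

lookupPricePoint :: Price -> PricePoints a -> PricePoint a
lookupPricePoint = Map.findWithDefault []

-- Append an entry at the back of a price level, creating the level if absent.
insertEntry :: Price -> Entry a -> PricePoints a -> PricePoints a
insertEntry p e = Map.insertWith (flip (++)) p [e]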
I went on to investigate the tuple selector functions. One common trick when writing high-performance Haskell is to make sure that things are properly unboxed. Returning tuples from functions can be costly, and you do that in two of the most-called functions, executeForPrice and processOrder. Before rewriting the code I looked at GHC's intermediate code to see whether GHC had managed to unbox the tuples by itself. See this post for information on how to look at GHC's intermediate representation: Reading GHC Core. The thing to look for is whether the functions have return type (OrderBook a, [Event a]) or (# OrderBook a, [Event a] #). The latter is good, the former is bad.
What I found was that GHC had not been able to unbox the tuples, so I started by unboxing the return type of processOrder by hand. To do so I had to replace the foldr in main with a specialized loop (see the sketch below), since foldr cannot deal with unboxed tuples. That gave a modest gain. Then I tried to unbox executeForPrice, but that ran into the following GHC bug: https://ghc.haskell.org/trac/ghc/ticket/8762. There might be a way to avoid it, but I didn't pursue it further.
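To give an idea of the shape of such a loop: the sketch below is not the unboxed-tuple version, it only replaces foldr with an explicit, strict loop that threads the book left to right (processAll is a name I'm inventing here):
{-# LANGUAGE BangPatterns #-}

-- Sketch only: an explicit loop over the orders instead of foldr.
-- The bang on the book forces it at every step; event lists are collected
-- in reverse and flattened once at the end.
processAll :: [Order a] -> OrderBook a -> (OrderBook a, [Event a])
processAll orders book0 = go orders book0 []
  where
    go []     !book acc = (book, concat (reverse acc))
    go (o:os) !book acc =
        let (book', evs) = processOrder o book
        in go os book' (evs : acc)
In main it could then be called as let (orderBook, events) = processAll (take 2000 orders) empty.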
Another small improvement: make all the fields you can in the OrderBook and Order types strict and unboxed. It gave me a small gain.
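Concretely, that means strictness annotations plus UNPACK pragmas on the word-sized fields, along the lines of this sketch of the modified declarations (UNPACK only takes effect when compiling with optimization):
data OrderBook a = OrderBook {
    pricePoints :: !(Array Price (PricePoint a)),
    minAsk      :: {-# UNPACK #-} !Price,
    maxBid      :: {-# UNPACK #-} !Price
} deriving (Show)

data Order a = Order {
    side   :: !Side,
    price  :: {-# UNPACK #-} !Price,
    size   :: {-# UNPACK #-} !Quantity,
    trader :: !(Trader a)
} deriving (Show)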
I hope this helps. Good luck with optimizing your Haskell programs.

Related

I was working on movie sentiment analysis, but I'm facing issues in my code related to processing words

My data set has 42,000 rows. This is the code I used to clean my text before vectorizing it. The problem is that it has a nested for loop, which I guess makes it very slow, and I'm not able to use it with more than 1500 rows. Can someone please suggest a better way to do this?
filtered = []
for i in range(2):
    rev = re.sub('[^a-zA-Z]', ' ', df['text'][i])
    rev = rev.lower()
    rev = rev.split()
    filtered = []
    for word in rev:
        if word not in stopwords.words("english"):
            word = PorterStemmer().stem(word)
            filtered.append(word)
    filtered = " ".join(filtered)
    corpus.append(filtered)
I've used line_profiler to measure the speed of the code you posted.
The measurement results are as follows.
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def profile_nltk():
10 1 435819.0 435819.0 0.3 df = pd.read_csv('IMDB_Dataset.csv') # (50000, 2)
11 1 1.0 1.0 0.0 filtered = []
12 1 247.0 247.0 0.0 reviews = df['review'][:4000]
13 1 0.0 0.0 0.0 corpus = []
14 4001 216341.0 54.1 0.1 for i in range(len(reviews)):
15 4000 221885.0 55.5 0.2 rev = re.sub('[^a-zA-Z]', ' ', df['review'][i])
16 4000 3878.0 1.0 0.0 rev = rev.lower()
17 4000 30209.0 7.6 0.0 rev = rev.split()
18 4000 1097.0 0.3 0.0 filtered = []
19 950808 235589.0 0.2 0.2 for word in rev:
20 946808 115658060.0 122.2 78.2 if word not in stopwords.words("english"):
21 486614 30898223.0 63.5 20.9 word = PorterStemmer().stem(word)
22 486614 149604.0 0.3 0.1 filtered.append(word)
23 4000 11290.0 2.8 0.0 filtered = " ".join(filtered)
24 4000 1429.0 0.4 0.0 corpus.append(filtered)
As @parsa-abbasi pointed out, the process of checking the stopwords accounts for about 80% of the total.
The measurement results for the modified script are as follows. The same process has been reduced to about 1/100th of the processing time.
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 @profile
9 def profile_nltk():
10 1 441467.0 441467.0 1.4 df = pd.read_csv('IMDB_Dataset.csv') # (50000, 2)
11 1 1.0 1.0 0.0 filtered = []
12 1 335.0 335.0 0.0 reviews = df['review'][:4000]
13 1 1.0 1.0 0.0 corpus = []
14 1 2696.0 2696.0 0.0 stopwords_set = stopwords.words('english')
15 4001 59013.0 14.7 0.2 for i in range(len(reviews)):
16 4000 186393.0 46.6 0.6 rev = re.sub('[^a-zA-Z]', ' ', df['review'][i])
17 4000 3657.0 0.9 0.0 rev = rev.lower()
18 4000 27357.0 6.8 0.1 rev = rev.split()
19 4000 999.0 0.2 0.0 filtered = []
20 950808 220673.0 0.2 0.7 for word in rev:
21 # if word not in stopwords.words("english"):
22 946808 1201271.0 1.3 3.8 if word not in stopwords_set:
23 486614 29479712.0 60.6 92.8 word = PorterStemmer().stem(word)
24 486614 141242.0 0.3 0.4 filtered.append(word)
25 4000 10412.0 2.6 0.0 filtered = " ".join(filtered)
26 4000 1329.0 0.3 0.0 corpus.append(filtered)
I hope this is helpful.
The most time-consuming part of the written code is the stopwords part.
It will call the library to get the list of stopwords each time the loop iterates.
Therefore it's better to get the stopwords set once and use the same set at each iteration.
I rewrote the code as follows (other differences are made just for the sake of readability):
corpus = []
texts = df['text']
stopwords_set = stopwords.words("english")
stemmer = PorterStemmer()
for i in range(len(texts)):
    rev = re.sub('[^a-zA-Z]', ' ', texts[i])
    rev = rev.lower()
    rev = rev.split()
    filtered = [stemmer.stem(word) for word in rev if word not in stopwords_set]
    filtered = " ".join(filtered)
    corpus.append(filtered)

Improving speed when using a for loop for each user groups

Suppose we have following dataset with the output window_num:
index user1 date different_months org_different_months window_num
1690289 2670088 2006-08-01 243.0 243.0 1
1772121 2717874 2005-12-01 0.0 0.0 1
1772123 2717874 2005-12-01 0.0 0.0 1
1772125 2717874 2005-12-01 0.0 0.0 1
1772130 2717874 2005-12-01 0.0 0.0 1
1772136 2717874 2006-01-01 0.0 0.0 1
1772132 2717874 2006-02-01 0.0 2099.0 1
1772134 2717874 2020-08-27 0.0 0.0 4
1772117 2717874 0.0 0.0 4
1772118 2717874 0.0 0.0 4
1772128 2717874 2019-11-01 300.0 300.0 3
1772127 2717874 2011-11-01 2922.0 2922.0 2
1774815 2719456 2006-09-01 0.0 0.0 2
1774809 2719456 2006-10-01 0.0 1949.0 2
1774821 2719456 2020-05-20 0.0 0.0 7
1774803 2719456 0.0 0.0 7
1774806 2719456 0.0 0.0 7
1774819 2719456 2019-08-29 265.0 265.0 6
1774825 2719456 2014-10-01 384.0 384.0 4
1774812 2719456 2005-07-01 427.0 427.0 1
1774816 2719456 2012-02-01 973.0 973.0 3
1774824 2719456 2015-10-20 1409.0 1409.0 5
The user number is represented by user1. The output is window_num, which is generated using the different_months and org_different_months columns. The different_months column is the difference in days between date[n] and date[n+1].
Previously I was using groupby.apply to produce window_num; however, it became extremely slow as the dataset grew. The code was improved considerably by using shift on the entire dataset to calculate the different_months and org_different_months columns, as well as applying the sort to the entire dataset, as seen below:
data = data.sort_values(by=['user','ContractInceptionDateClean'], ascending=[True,True])
#data['user1'] =data['user']
data['different_months'] = (abs((data['ContractInceptionDateClean'].shift(-1)-data['ContractInceptionDateClean'] ).dt.days)).fillna(0)
data.different_months[data['different_months'] < 91] =0
data['shift_different_months']=data['different_months'].shift(1)
data['org_different_months']=data['different_months']
data.loc[((data['different_months'] == 0) | (data['shift_different_months'] == 0)),'different_months']=0
data = salesswindow_cal(data,list(data.user.unique()))
The code that I am currently struggling to improve the speed on is shown below:
def salesswindow_cal(data_, users):
    temp = pd.DataFrame()
    for u in range(0, len(users)):
        df = data_[data_['user'] == users[u]]
        df['different_months'].values[0] = df['org_different_months'].values[0]
        df['window_num'] = (df['different_months'].diff() != 0).cumsum()
        temp = pd.concat([df, temp], axis=0)
    return pd.DataFrame(temp)
A rule of thumb is not to loop through the users and extract df = data_[data_['user']==user]. Instead, do a groupby:
for u, df in data_.groupby('user'):
    do_some_stuff
Another issue: don't concatenate data iteratively. Collect the pieces in a list and concatenate once at the end:
data_out = []
for user, df in data.groupby('user'):
    do_some_stuff
    data_out.append(sub_data)
out = pd.concat(data_out)
In your case, you can define a function and use groupby().apply(), and pandas will concatenate the data for you.
def group_func(df):
    d = df.copy()
    d['different_months'].values[0] = d['org_different_months'].values[0]
    d['window_num'] = d['different_months'].diff().ne(0).cumsum()
    return d
data.groupby('user').apply(group_func)
Update:
Let's try this vectorized approach, which modifies your data in place:
# update the first `different_months` of each user
mask = ~data['user'].duplicated()
data.loc[mask, 'different_months'] = data.loc[mask, 'org_different_months']
groups = data.groupby('user')
data['diff'] = groups['different_months'].diff().ne(0)
data['window_num'] = groups['diff'].cumsum()

Haskell: can't understand the bottleneck

I solved a Project Euler problem and then compared my solution with the one on the Haskell wiki. They were pretty similar, but mine was taking 7.5 seconds while the other took 0.6! I compiled them both.
Mine looks as follows:
main = print . maximumBy (compare `on` cycleLength) $ [1..999]
    where cycleLength d = remainders d 10 []
and the one from the wiki:
main = print . fst $ maximumBy (comparing snd) [(n, cycleLength n) | n <- [1..999]]
    where cycleLength d = remainders d 10 []
I also tried replacing compare `on` cycleLength with comparing cycleLength, but the performance remained the same.
So I must conclude that all the difference lies in computing the values on the fly vs. doing the transformation in the list comprehension.
The difference in time is pretty huge though: the second version has a 12.5x speedup!
The maximumBy function will check the same numbers in your list multiple times, and every time it checks a number it has to re-compute cycleLength. That's an expensive operation!
The wiki algorithm therefore uses a technique known as decorate-sort-undecorate. You're not sorting here, but it's close enough. You first precompute the cycleLength values for all the numbers (i.e. you build a cache), then you take the maximum, and then you undecorate (using fst). That way you save yourself a lot of computation.
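Written as a standalone helper, the pattern looks roughly like this (my own example, not the wiki code; cost stands in for any expensive scoring function such as cycleLength):
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Decorate each element with its expensive score, pick the maximum by the
-- score, then undecorate with fst. Each score is computed only once.
bestBy :: Ord b => (a -> b) -> [a] -> a
bestBy cost xs = fst $ maximumBy (comparing snd) [(x, cost x) | x <- xs]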
EDIT: to illustrate it, have a look at the maximumBy function in the Data.List source:
-- | The 'maximumBy' function takes a comparison function and a list
-- and returns the greatest element of the list by the comparison function.
-- The list must be finite and non-empty.
maximumBy :: (a -> a -> Ordering) -> [a] -> a
maximumBy _ [] = error "List.maximumBy: empty list"
maximumBy cmp xs = foldl1 maxBy xs
    where
        maxBy x y = case cmp x y of
            GT -> x
            _  -> y
It moves in a window of 2; every number is requested (and, in your case, computed) twice.
This means that for 999 iterations, your version calls cycleLength 1996 times (n*2-2), whereas the wiki version calls it 999 (n) times.
That alone doesn't explain the full delay, though: it only accounts for a factor of 2, and the observed factor was closer to 10.
Here's the profile of your version,
COST CENTRE entries %time %alloc %time %alloc
MAIN 0 0.0 0.0 100.0 100.0
CAF 0 0.0 0.0 100.0 100.0
main 1 0.0 0.0 100.0 100.0
f 1 0.0 0.0 100.0 100.0
maximumBy 1 0.0 0.0 100.0 99.9
maximumBy.maxBy 998 0.0 0.1 100.0 99.9
cycleLength 1996 0.1 0.2 100.0 99.8
remainders 581323 99.3 94.4 99.9 99.7
remainders.r' 581294 0.7 5.2 0.7 5.2
and the wiki version:
COST CENTRE entries %time %alloc %time %alloc
MAIN 0 0.0 0.0 100.0 100.0
CAF 0 0.0 0.0 100.0 99.9
main 1 0.0 0.1 100.0 99.9
f' 1 0.0 0.8 100.0 99.8
cycleLength 999 0.2 0.5 100.0 98.6
remainders 95845 98.3 93.0 99.8 98.2
remainders.r' 95817 1.5 5.2 1.5 5.2
maximumBy 1 0.0 0.1 0.0 0.4
maximumBy.maxBy 998 0.0 0.2 0.0 0.2
Looking at the profiles, your version goes through a lot more allocations (around 10-12 times as many), but doesn't use much more RAM overall. So we still need to explain a cumulative factor of 5 or 6 in terms of allocations.
remainders is recursive. In your version it gets called 581294 times; in the wiki version it gets called 95817 times. There's our 5-6 fold increase!
So I think the compare call here is also a problem, since it applies cycleLength to both things being compared. In the wiki version, cycleLength gets applied to every number once, but here it is applied to every number twice, and the comparison is evaluated more often. That hurts especially for the bigger numbers, since remainders scales badly (it looks exponential, but I'm not sure).
Since the maximum memory consumption of the two programs isn't dramatically different, I don't think this has anything to do with the heap.

Optimizing sum, ZipList, Vector, and unboxed types

I have identified the following hotspot function that is currently 25% of my program execution time:
type EncodeLookup = [(GF256Elm, [GF256Elm])]

-- | Given the coefficients of a Polynomial and an x value,
-- calculate its corresponding y value.
-- Coefficients are in ascending order [d, a1, a2, a3...] where y = d + a1*x + a2*x^2 ...
calc :: EncodeLookup -> [GF256Elm] -> [GF256Elm]
calc xsLookup poly =
    map f xsLookup
    where
        f (_, pList) =
            let zl = (*) <$> ZipList pList <*> ZipList poly
                lst = getZipList zl
                s = sum lst
            in s
Where GF256Elm is just a newtype wrapper around an Int, with custom * and other operations defined for it (FiniteFields).
Here are the related profiling data for this function:
individual inherited
COST CENTRE no. entries %time %alloc %time %alloc
calc 544 794418 1.6 3.1 27.5 19.7
calc.f 698 3972090 0.9 0.0 25.9 16.6
calc.f.s 709 3972090 7.4 6.2 11.0 7.8
fromInteger 711 3972090 0.7 0.0 0.7 0.0
+ 710 11916270 2.9 1.6 2.9 1.6
calc.f.lst 708 3972090 0.0 0.0 0.0 0.0
calc.f.zl 699 3972090 6.8 8.8 14.0 8.8
* 712 11916270 7.2 0.0 7.2 0.0
My observations:
sum's slowness probably comes from lazy list thunks.
The * calculation takes a while too. You can see the full implementation of GF256Elm here. * is basically a lookup into a pregenerated vector table plus some bit flipping.
ZipList seems to take a significant amount of time as well.
My questions:
How would I go about optimizing this function? Especially regarding sum: would using deepseq on the list make it faster?
Should I be using the unboxed Int# type for GF256Elm? What other ways can I improve the speed of GF256Elm's operations?
Thanks!
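One direction that might be worth experimenting with (my own sketch, assuming GF256Elm's Num instance from the linked code, so 0, + and * are the field operations) is to drop the ZipList wrapper and accumulate the sum strictly, so no chain of lazy (+) thunks builds up:
import Data.List (foldl')

-- Sketch only: pairwise multiply and accumulate with a strict left fold.
dotGF :: [GF256Elm] -> [GF256Elm] -> GF256Elm
dotGF xs ys = foldl' (+) 0 (zipWith (*) xs ys)

calc :: EncodeLookup -> [GF256Elm] -> [GF256Elm]
calc xsLookup poly = map (\(_, pList) -> dotGF pList poly) xsLookup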

Why do obj. files contain normals [closed]

Simple question: why do .obj files contain normals? You can just calculate the normals, right?
If I'm correct, I'd just have to take the cross product between the vectors point1-point2 and point1-point3, which would save me the time of reading them from the file.
EDIT:
Trying to be more specific, this is a file I've found and which I want to use:
g cube
v 0.0 0.0 0.0
v 0.0 0.0 1.0
v 0.0 1.0 0.0
v 0.0 1.0 1.0
v 1.0 0.0 0.0
v 1.0 0.0 1.0
v 1.0 1.0 0.0
v 1.0 1.0 1.0
vn 0.0 0.0 1.0
vn 0.0 0.0 -1.0
vn 0.0 1.0 0.0
vn 0.0 -1.0 0.0
vn 1.0 0.0 0.0
vn -1.0 0.0 0.0
f 1//2 7//2 5//2
f 1//2 3//2 7//2
f 1//6 4//6 3//6
f 1//6 2//6 4//6
f 3//3 8//3 7//3
f 3//3 4//3 8//3
f 5//5 7//5 8//5
f 5//5 8//5 6//5
f 1//4 5//4 6//4
f 1//4 6//4 2//4
f 2//1 6//1 8//1
f 2//1 8//1 4//1
EDIT 2:
Because people complained:
http://en.wikipedia.org/wiki/Wavefront_.obj_file
You can calculate normals, but it takes time to compute them. When you have a lot of meshes and have to render at 60 fps (or more), it's more performant to load precomputed normals into the GPU. Also, the cross product between the vectors point1-point2 and point1-point3 only gives the face normal; to get the per-vertex normals required for Gouraud shading, you have to average the face normals at every vertex. So you can see the computation gets deeper.
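For illustration, a small sketch of the first step (in Haskell, since that's the language used elsewhere on this page; V3, sub, cross and faceNormal are names I'm inventing): the cross product of two edge vectors gives the face normal, and per-vertex normals would then be averages of the face normals sharing each vertex.
type V3 = (Double, Double, Double)

sub :: V3 -> V3 -> V3
sub (ax, ay, az) (bx, by, bz) = (ax - bx, ay - by, az - bz)

cross :: V3 -> V3 -> V3
cross (ax, ay, az) (bx, by, bz) =
    (ay * bz - az * by, az * bx - ax * bz, ax * by - ay * bx)

-- Face normal of a triangle (not normalized, and not yet averaged per vertex).
faceNormal :: V3 -> V3 -> V3 -> V3
faceNormal p1 p2 p3 = cross (sub p2 p1) (sub p3 p1)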
