Case conversion of large dictionary word set - haskell

I'm trying to match words from a dictionary, case-insensitively. My initial approach
looks like so:
read dict; convert all words to lowercase, store in set.
check new word for membership in set
Is there a better (more efficient) way to achieve this? I'm new to Haskell.
import System.IO
import Data.Text (toLower, pack, unpack)
import Data.Set (fromList, member)
main = do
  let path = "/usr/share/dict/american-english"
  h <- openFile path ReadMode
  hSetEncoding h utf8
  contents <- hGetContents h
  let mySet = (fromList . map (unpack . toLower . pack) . lines) contents
  putStrLn $ show $ member "acadia" mySet

I would just work with Text directly instead of converting to/from Strings.
Data.Text.IO contains versions of hGetContents, readFile, etc. for reading Text from files, and Data.Text has lines for Text.
{-# LANGUAGE OverloadedStrings #-}
import System.IO
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Set as S
main = do
  let path = "/usr/share/dict/american-english"
  h <- openFile path ReadMode
  hSetEncoding h utf8
  contents <- T.hGetContents h
  let mySet = (S.fromList . map T.toLower . T.lines) contents
  putStrLn $ show $ S.member "acadia" mySet
By using T.toLower and T.lines we avoid explicit pack/unpack calls.
mySet is now a set of Text values rather than of Strings. By using
the OverloadedStrings pragma the literal "acadia" will be interpreted
as a Text value.

Yes, what you propose is reasonable. A few remarks, mostly unrelated to the main question:
It would be more efficient to restrict yourself to using only Text rather than String.
Prefer toCaseFold to toLower; it is more appropriate for case-insensitive matching (see the sketch below).
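For instance, a minimal sketch of what a case-folded dictionary set could look like (the names here are illustrative, not from the question):

import qualified Data.Set as S
import qualified Data.Text as T

-- Build the set with case-folded keys...
mkDict :: T.Text -> S.Set T.Text
mkDict = S.fromList . map T.toCaseFold . T.lines

-- ...and case-fold the query before the membership test.
lookupWord :: S.Set T.Text -> T.Text -> Bool
lookupWord dict w = T.toCaseFold w `S.member` dict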

Even though you found my first answer helpful, let me propose another approach...
A boggle solver I wrote simply reads in the entire dictionary as a single ByteString, and to look up words performs a binary search on that ByteString.
The dictionary must already be in sorted order and normalized to lower case, but usually this is not a problem since the dictionary is static
and known in advance.
Of course, when you compute (lo+hi)/2 in performing the binary search you might land in the middle of a word, so you simply back up to the beginning of the current word.
The main advantage of this is that loading the dictionary is extremely fast and it is memory efficient. Moreover, the search algorithm has good memory locality. I haven't measured it, but I wouldn't be surprised if creating a Data.Set more than doubled the size of the raw data.
The code is available here: https://github.com/erantapaa/hoggle
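For reference, a rough sketch of how such a lookup can work (illustrative only, not the code from that repository); it assumes the dictionary is a newline-separated, lowercased, sorted ByteString:

import qualified Data.ByteString.Char8 as B

-- Back up from an arbitrary index to the start of the word containing it.
wordStart :: B.ByteString -> Int -> Int
wordStart bs i = maybe 0 (+1) (B.elemIndexEnd '\n' (B.take i bs))

-- The word beginning at index i (up to the next newline or end of input).
wordAt :: B.ByteString -> Int -> B.ByteString
wordAt bs i = B.takeWhile (/= '\n') (B.drop i bs)

-- Binary search directly over the raw bytes of the dictionary.
memberDict :: B.ByteString -> B.ByteString -> Bool
memberDict dict w = go 0 (B.length dict)
  where
    go lo hi
      | lo >= hi  = False
      | otherwise =
          let mid = wordStart dict ((lo + hi) `div` 2)
              w'  = wordAt dict mid
          in case compare w w' of
               EQ -> True
               LT -> go lo mid
               GT -> go (mid + B.length w' + 1) hi

A lookup is then memberDict dict (B.pack "acadia"); no separate Set is ever built.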

Related

Haskell remove trailing and leading whitespace from a file

Let's say I have a file
mary had a little lamb
 It's fleece was white as snow
   Everywhere
  the child went
    The lamb, the lamb was sure to go, yeah
How would I read the file as a string, and remove the trailing and leading whitespace? It could be spaces or tabs. It would print like this after removing whitespace:
mary had a little lamb
It's fleece was white as snow
Everywhere
the child went
The lamb, the lamb was sure to go, yeah
Here's what I have currently:
import Data.Text as T
readTheFile = do
    handle <- openFile "mary.txt" ReadMode
    contents <- hGetContents handle
    putStrLn contents
    hClose handle
    return(contents)

main :: IO ()
main = do
    file <- readTheFile
    file2 <- (T.strip file)
    return()
Your code suggests a few misunderstandings about Haskell so let's go through your code before getting to the solution.
import Data.Text as T
You're using Text, great! I suggest you also use the IO operations that read and write Text types instead of what is provided by the prelude which works on Strings (linked lists of characters). That is, import Data.Text.IO as T
readTheFile = do
    handle <- openFile "mary.txt" ReadMode
    contents <- hGetContents handle
    putStrLn contents
    hClose handle
    return(contents)
Oh, hey, the use of hGetContents and manually opening and closing a file can be error prone. Consider readTheFile = T.readFile "mary.txt".
main :: IO ()
main = do
    file <- readTheFile
    file2 <- (T.strip file)
    return()
Two issues here.
Issue one Notice here you have used strip as though it's an IO action... but it isn't. I suggest you learn more about IO and binding (do notation) vs let-bound variables. strip computes a new value of type Text and presumably you want to do something useful with that value, like write it.
Issue two Stripping the whole file is different than stripping each line one at a time. I suggest you read mathk's answer.
So in the end I think you want:
-- Qualified imports are accessed via `T.someSymbol`
import qualified Data.Text.IO as T
import qualified Data.Text as T

-- Not really needed as a separate function unless you want to also
-- put the stripping here too.
readTheFile :: IO T.Text
readTheFile = T.readFile "mary.txt"

-- First read, then strip each line, then write a new file.
main :: IO ()
main =
  do file <- readTheFile
     let strippedFile = T.unlines $ map T.strip $ T.lines file
     T.writeFile "newfile.txt" (T.strip strippedFile)
Here is a possible solution for what you are looking for:
import qualified Data.Text as T
main = do
    trimmedFile <- (T.unlines . map T.strip . T.lines) <$> T.readFile "mary.txt"
    T.putStr trimmedFile
strip from Data.Text does the job.
Read the file and process it one line at a time, then:
> intercalate " " . words $ " The lamb, the lamb was sure to go, yeah "
"The lamb, the lamb was sure to go, yeah"
But unwords is simpler than intercalate " ", and it doesn't need to be imported.
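For example, the same cleanup with unwords:
> (unwords . words) " The lamb, the lamb was sure to go, yeah "
"The lamb, the lamb was sure to go, yeah"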

Haskell read/write binary files complete working example

I wish if someone gives a complete working code that allows to do the following in Haskell:
Read a very large sequence (more than 1 billion elements) of 32-bit
int values from a binary file into an appropriate container (e.g.
certainly not a list, for performance issues) and doubling each number
if it's less than 1000 (decimal) and then write the resulting 32-bit
int values to another binary file. I may not want to read the entire
contents of the binary file in the memory at once. I want to read one
chunk after the previous.
I am confused because I could find very little documentation about this. Data.Binary, ByteString, Word8 and what not; it just adds to the confusion. There is a pretty straightforward solution to such problems in C/C++: take an array (e.g. of unsigned int) of the desired size, use the read/write library calls, and be done with it. In Haskell it didn't seem so easy, at least to me.
I'd appreciate if your solution uses the best possible standard packages that are available with mainstream Haskell (> GHC 7.10) and not some obscure/obsolete ones.
I read from these pages
https://wiki.haskell.org/Binary_IO
https://wiki.haskell.org/Dealing_with_binary_data
If you're doing binary I/O, you almost certainly want ByteString for the actual input/output part. Have a look at the hGet and hPut functions it provides. (Or, if you only need strictly linear access, you can try using lazy I/O, but it's easy to get that wrong.)
Of course, a byte string is just an array of bytes; your next problem is interpreting those bytes as character / integers / doubles / whatever else they're supposed to be. There are a couple of packages for that, but Data.Binary seems to be the most mainstream one.
The documentation for binary seems to want to steer you towards using the Binary class, where you write code to serialise and deserialise whole objects. But you can use the functions in Data.Binary.Get and Data.Binary.Put to deal with individual items. There you will find functions such as getWord32be (get Word32 big-endian) and so forth.
I don't have time to write a working code example right now, but basically look at the functions I mention above and ignore everything else, and you should get some idea.
Now with working code:
module Main where
import Data.Word
import qualified Data.ByteString.Lazy as BIN
import Data.Binary.Get
import Data.Binary.Put
import Control.Monad
import System.IO
main = do
  h_in <- openFile "Foo.bin" ReadMode
  h_out <- openFile "Bar.bin" WriteMode
  replicateM 1000 (process_chunk h_in h_out)
  hClose h_in
  hClose h_out

chunk_size = 1000
int_size = 4

process_chunk h_in h_out = do
  bin1 <- BIN.hGet h_in chunk_size
  let ints1 = runGet (replicateM (chunk_size `div` int_size) getWord32le) bin1
  let ints2 = map (\ x -> if x < 1000 then 2*x else x) ints1
  let bin2 = runPut (mapM_ putWord32le ints2)
  BIN.hPut h_out bin2
This, I believe, does what you asked for. It reads 1000 chunks of chunk_size bytes, converts each one into a list of Word32 (so it only ever has chunk_size / 4 integers in memory at once), does the calculation you specified, and writes the result back out again.
Obviously if you did this "for real" you'd want EOF checking and such.
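For example, the EOF handling might look something like this (a sketch reusing the imports and the int_size definition above, not part of the original program); main would then call it once instead of running replicateM 1000:

processUntilEOF h_in h_out = do
  bin1 <- BIN.hGet h_in chunk_size
  if BIN.null bin1
    then return ()  -- nothing left to read
    else do
      -- decode only as many Word32s as this (possibly short) chunk holds
      let n     = fromIntegral (BIN.length bin1) `div` int_size
          ints1 = runGet (replicateM n getWord32le) bin1
          ints2 = map (\x -> if x < 1000 then 2*x else x) ints1
      BIN.hPut h_out (runPut (mapM_ putWord32le ints2))
      processUntilEOF h_in h_out  -- keep going until EOF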
The best way to work with binary I/O in Haskell is by using bytestrings. Lazy bytestrings provide buffered I/O, so you don't even need to care about it.
The code below assumes that the chunk size is a multiple of 32 bits (which it is).
module Main where
import Data.Word
import Control.Monad
import Data.Binary.Get
import Data.Binary.Put
import qualified Data.ByteString.Lazy as BS
import qualified Data.ByteString as BStrict
-- Convert one bytestring chunk to a list of integers
-- and append the result of converting the later chunks.
-- It actually appends only a closure which will evaluate the next
-- block of numbers on demand.
toNumbers :: BStrict.ByteString -> [Word32] -> [Word32]
toNumbers chunk rest = chunkNumbers ++ rest
  where
    getNumberList = replicateM (BStrict.length chunk `div` 4) getWord32le
    chunkNumbers = runGet getNumberList (BS.fromStrict chunk)

main :: IO()
main = do
  -- every operation below is done lazily, consuming input as necessary
  input <- BS.readFile "in.dat"
  let inNumbers = BS.foldrChunks toNumbers [] input
  let outNumbers = map (\x -> if x < 1000 then 2*x else x) inNumbers
  let output = runPut (mapM_ putWord32le outNumbers)
  -- Here the lazy bytestring output is evaluated and saved chunk
  -- by chunk, pulling data from the input file, decoding, processing
  -- and encoding it back one chunk at a time
  BS.writeFile "out.dat" output
Here is a loop to process one line at a time from stdin:
import System.IO
loop = do b <- hIsEOF stdin
          if b then return ()
               else do str <- hGetLine stdin
                       let str' = ...process str...
                       hPutStrLn stdout str'
Now just replace hGetLine with something that reads 4 bytes, etc.
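For instance, here is one way that 4-byte loop might look (a sketch, assuming little-endian Word32 values and an input length that is a multiple of 4):

import System.IO
import Control.Monad (unless)
import Data.Binary.Get (runGet, getWord32le)
import Data.Binary.Put (runPut, putWord32le)
import qualified Data.ByteString.Lazy as BL

-- Read one 32-bit value at a time, transform it, and write it back out.
loop32 :: Handle -> Handle -> IO ()
loop32 h_in h_out = do
  eof <- hIsEOF h_in
  unless eof $ do
    bytes <- BL.hGet h_in 4
    let x  = runGet getWord32le bytes
        x' = if x < 1000 then 2 * x else x
    BL.hPut h_out (runPut (putWord32le x'))
    loop32 h_in h_out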
Here is the I/O section for Data.ByteString:
https://hackage.haskell.org/package/bytestring-0.10.6.0/docs/Data-ByteString.html#g:29

Haskell requires more memory than Python when I read map from file. Why?

I have this simple code in Python:
input = open("baseforms.txt","r",encoding='utf8')
S = {}
for i in input:
    words = i.split()
    S.update( {j:words[0] for j in words} )
print(S.get("sometext","not found"))
print(len(S))
It requires 300 MB to run. "baseforms.txt" is 123 MB in size.
I've written the same code in Haskell:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map as M
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Text.Lazy.Encoding(decodeUtf8)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as I
import Control.Monad(liftM)
main = do
    text <- B.readFile "baseforms.txt"
    let m = (M.fromList . (concatMap (parseLine.decodeUtf8))) (B.lines text)
    print (M.lookup "sometext" m)
    print (M.size m)
  where
    parseLine line = let base:forms = T.words line in [(f,base)| f<-forms]
It requires 544 MB and it's slower than the Python version. Why? Is it possible to optimise the Haskell version?
There is a lot happening in the Haskell version that's not happening in the Python version.
readFile uses lazy IO, which is a bit weird in general. I would generally avoid lazy IO.
The file, as a bytestring, is broken into lines which are then decoded as UTF-8. This seems a little unnecessary, given the existence of Text IO functions.
The Haskell version is using a tree (Data.Map) whereas the Python version is using a hash table.
The strings are all lazy, which is probably not necessary if they're relatively short. Lazy strings have a couple words of overhead per string, which can add up. You could fuse the lazy strings, or you could read the file all at once, or you could use something like conduit.
GHC uses a copying collector, whereas the default Python implementation uses malloc() with reference counting and the occasional GC. This fact alone can account for large differences in memory usage, depending on your program.
Who knows how many thunks are getting created in the Haskell version.
It's unknown whether you've enabled optimizations.
It's unknown how much slower the Haskell version is.
We don't have your data file so we can't really test it ourselves.
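To make those points concrete, here is a rough sketch of a leaner variant (strict Text IO, no manual decoding, and a hash map from the unordered-containers package instead of Data.Map); it is an illustration of the suggestions above, not a measured benchmark:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.HashMap.Strict as HM
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

main :: IO ()
main = do
  -- strict read: the whole file is decoded once, no lazy IO
  text <- TIO.readFile "baseforms.txt"
  let m = HM.fromList (concatMap parseLine (T.lines text))
  print (HM.lookup "sometext" m)
  print (HM.size m)
  where
    parseLine line = case T.words line of
                       (base:forms) -> [(f, base) | f <- forms]
                       []           -> []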
It's a bit late, but I studied this a little and think Dietrich Epp's account is right, but can be simplified a little. Notice that there doesn't seem to be any real python programming going on in the python file: it is orchestrating a very simple sequence of calls to C string operations and then to a C hash table implementation. (This is often a problem with really simple python v. Haskell benchmarks.) The Haskell, by contrast, is building an immense persistent Map, which is a fancy tree. So the main points of opposition here are C vs Haskell, and hashtable-with-destructive-update vs persistent map. Since there is little overlap in the input file, the tree you are constructing includes all the information in the input string, some of it repeated, and then rearranged with a pile of Haskell constructors. This is I think the source of the alarm you are experiencing, but it can be explained.
Compare these two files, one using ByteString:
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
main = do m <- fmap proc (B.readFile "baseforms.txt")
          print (M.lookup (B.pack "sometext") m)
          print (M.size m)

proc = M.fromList . concatMap (\(a:bs) -> map (flip (,) a) bs)
     . map B.words . B.lines
and the other a Text-ified equivalent:
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
import Data.Text.Encoding(decodeUtf8)
import qualified Data.Text as T
main = do
    m <- fmap proc (B.readFile "baseforms.txt")
    print (M.lookup (T.pack "sometext") m)
    print (M.size m)

proc = M.fromList . concatMap (\(a:bs) -> map (flip (,) a) bs)
     . map T.words . T.lines . decodeUtf8
On my machine, the python/C takes just under 6 seconds, the bytestring file takes 8 seconds, and the text file just over 10.
The bytestring implementation seems to use a bit more memory than the python, the text implementation distinctly more. The text implementation takes more time because, of course, it adds a conversion to text and then uses text operations to break the string and text comparisons to build the map.
Here is a go at analyzing the memory phenomena in the text case. First we have the bytestring in memory (130m). Once the text is constructed (~250m, to judge unscientifically from what's going on in top), the bytestring is garbage collected while we construct the tree. In the end the text tree (~380m it looks like) uses more memory than the bytestring tree (~260m) because the text fragments in the tree are bigger. The program as a whole uses more because the text held in memory during the tree construction is itself bigger. To put it crudely: each bit of white-space is being turned into a tree constructor and two text constructors together with the text version of whatever the first 'word' of the line was and whatever the text representation of the next word is. The weight of the constructors seems in either case to be about 130m, so at the last moment of the construction of the tree we are using something like 130m + 130m + 130m = 390m in the bytestring case, and 250m + 130m + 250m = 630m in the text case.

Building a histogram with haskell, many times slower than with python

I was going to test naive Bayes classification. One part of it was going to be building a histogram of the training data. The problem is, I am using a large training set: the haskell-cafe mailing list going back a couple of years, and there are over 20k files in the folder.
It takes a little over two minutes to create the histogram with Python, and a little over 8 minutes with Haskell. I'm using Data.Map (insertWith'), enumerators and text. What else can I do to speed up the program?
Haskell:
import qualified Data.Text as T
import qualified Data.Text.IO as TI
import System.Directory
import Control.Applicative
import Control.Monad (filterM, foldM)
import System.FilePath.Posix ((</>))
import qualified Data.Map as M
import Data.Map (Map)
import Data.List (foldl')
import Control.Exception.Base (bracket)
import System.IO (Handle, openFile, hClose, hSetEncoding, IOMode(ReadMode), latin1)
import qualified Data.Enumerator as E
import Data.Enumerator (($$), (>==>), (<==<), (==<<), (>>==), ($=), (=$))
import qualified Data.Enumerator.List as EL
import qualified Data.Enumerator.Text as ET
withFile' :: (Handle -> IO c) -> FilePath -> IO c
withFile' f fp = do
  bracket
    (do
      h <- openFile fp ReadMode
      hSetEncoding h latin1
      return h)
    hClose
    (f)

buildClassHistogram c = do
  files <- filterM doesFileExist =<< map (c </>) <$> getDirectoryContents c
  foldM fileHistogram M.empty files

fileHistogram m file = withFile' (\h -> E.run_ $ enumHist h) file
  where
    enumHist h = ET.enumHandle h $$ EL.fold (\m' l -> foldl' (\m'' w -> M.insertWith' (const (+1)) w 1 m'') m' $ T.words l) m
Python:
for filename in listdir(root):
    filepath = root + "/" + filename
    # print(filepath)
    fp = open(filepath, "r", encoding="latin-1")
    for word in fp.read().split():
        if word in histogram:
            histogram[word] = histogram[word]+1
        else:
            histogram[word] = 1
Edit: Added imports
You could try using imperative hash maps from the hashtables package: http://hackage.haskell.org/package/hashtables
I remember I once got a moderate speedup compared to Data.Map. I wouldn't expect anything spectacular though.
UPDATE
I simplified your python code so I could test it on a single big file (100 million lines):
import sys
histogram={}
for word in sys.stdin.readlines():
    if word in histogram:
        histogram[word] = histogram[word]+1
    else:
        histogram[word] = 1
print histogram.get("the")
Takes 6.06 seconds
Haskell translation using hashtables:
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as T
import qualified Data.HashTable.IO as HT
main = do
  ls <- T.lines `fmap` T.getContents
  h <- HT.new :: IO (HT.BasicHashTable T.ByteString Int)
  flip mapM_ ls $ \w -> do
    r <- HT.lookup h w
    case r of
      Nothing -> HT.insert h w (1::Int)
      Just c  -> HT.insert h w (c+1)
  HT.lookup h "the" >>= print
Run with a large allocation area: histogram +RTS -A500M
Takes 9.3 seconds, with 2.4% GC. Still quite a bit slower than Python but not too bad.
According to the GHC user guide, you can change the RTS options while compiling:
GHC lets you change the default RTS options for a program at compile
time, using the -with-rtsopts flag (Section 4.12.6, “Options affecting
linking”). A common use for this is to give your program a default
heap and/or stack size that is greater than the default. For example,
to set -H128m -K64m, link with -with-rtsopts="-H128m -K64m".
Your Haskell and Python implementations are using maps with different complexities. Python dictionaries are hash maps so the expected time for each operation (membership test, lookup, and insertion) is O(1). The Haskell version uses Data.Map which is a balanced binary search tree so the same operations take O(lg n) time. If you change your Haskell version to use a different map implementation, say a hash table or some sort of trie, it should get a lot quicker. However, I'm not familiar enough with the different modules implementing these data structures to say which is best. I'd start with the Data category on Hackage and look for one that you like. You might also look for a map that allows destructive updates like STArray does.
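As a concrete illustration of that suggestion (not a benchmark), the per-file counting fold could look roughly like this with Data.HashMap.Strict from the unordered-containers package:

import Data.List (foldl')
import qualified Data.HashMap.Strict as HM
import qualified Data.Text as T

-- Count word occurrences with a hash map instead of Data.Map,
-- so each insert is expected O(1) rather than O(log n).
histogram :: [T.Text] -> HM.HashMap T.Text Int
histogram = foldl' step HM.empty
  where
    step m w = HM.insertWith (+) w 1 m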
We need more information:
How long does it take both programs to process the words from the input, with no data structure for maintaining counts?
How many distinct words are there, so we can judge whether the extra log N cost for balanced trees is a consideration?
What does GHC's profiler say? In particular, how much time is spent in allocation? It's possible that the Haskell version is spending most of its time allocating tree nodes that quickly become obsolete.
UPDATE: I missed that lowercase "text" might mean Data.Text. You may be comparing apples and oranges. Python's Latin1 encoding uses one byte per char. Although it tries to be efficient, Data.Text must allow for the possibility of more than 256 characters. What happens if you switch to String, or better, Data.ByteString?
Depending on what these indicators say, here are a couple of things to try:
If analyzing the input is a bottleneck, try driving all your I/O and analysis from Data.ByteString instead of Text.
If the data structure is a bottleneck, Bentley and Sedgewick's ternary search trees are purely functional but perform competitively with hash tables. There is a TernaryTrees package on Hackage.

Mysterious word ("LPS") appears in a list of Haskell output

I am new to Haskell and trying to fiddle with some test cases I usually run into in the real world. Say I have the text file "foo.txt" which contains the following:
45.4 34.3 377.8
33.2 98.4 456.7
99.1 44.2 395.3
I am trying to produce the output
[[45.4,34.3,377.8],[33.2,98.4,456.7],[99.1,44.2,395.3]]
My code is below, but I'm getting some bogus "LPS" in the output... not sure what it represents.
import qualified Data.ByteString.Lazy.Char8 as BStr
import qualified Data.Map as Map
readDatafile = (map (BStr.words) . BStr.lines)
testFunc path = do
    contents <- BStr.readFile path
    print (readDatafile contents)
When invoked with testFunc "foo.txt" the output is
[[LPS ["45.4"],LPS ["34.3"],LPS ["377.8"]],[LPS ["33.2"],LPS ["98.4"],LPS ["456.7"]],[LPS ["99.1"],LPS ["44.2"],LPS ["395.3"]]]
Any help is appreciated! Thanks. PS: Using ByteString as this will be used on massive files in the future.
EDIT:
I am also puzzled as to why the output list is grouped as above (with each number wrapped in []), when in ghci the line below gives a different arrangement.
*Main> (map words . lines) "45.4 34.3 377.8\n33.2 98.4 456.7\n99.1 44.2 395.3"
[["45.4","34.3","377.8"],["33.2","98.4","456.7"],["99.1","44.2","395.3"]]
What you're seeing is indeed a constructor. When you read the file, the result is of course a list of lists of ByteStrings, but what you want is a list of lists of Floats.
What you could do :
readDatafile :: BStr.ByteString -> [[Float]]
readDatafile = (map ((map (read . BStr.unpack)) . BStr.words)) . BStr.lines
This unpacks the Bytestring (i.e. converts it to a string). The read converts the string to a float.
Not sure if using bytestrings here even helps your performance though.
This is an indication of the internal lazy bytestring representation type pre-1.4.4.3 (search the page for "LPS"). LPS is a constructor.
readDatafile is returning a [[ByteString]], and what you are seeing is the 'packed' representation of all those characters you read.
readDatafile = map (map BStr.unpack . BStr.words) . BStr.lines
Here's an example ghci run demonstrating the problem. My output is different than yours because I'm using GHC 6.10.4:
*Data.ByteString.Lazy.Char8> let myString = "45.4"
*Data.ByteString.Lazy.Char8> let myByteString = pack "45.4"
*Data.ByteString.Lazy.Char8> :t myString
myString :: [Char]
*Data.ByteString.Lazy.Char8> :t myByteString
myByteString :: ByteString
*Data.ByteString.Lazy.Char8> myString
"45.4"
*Data.ByteString.Lazy.Char8> myByteString
Chunk "45.4" Empty
*Data.ByteString.Lazy.Char8> unpack myByteString
"45.4"
This is just the lazy bytestring constructor. You're not parsing those strings into numbers yet, so you'll see the underlying string. Note that lazy bytestrings are not the same as String, so they have a different printed representation when 'Show'n.
LPS was the old constructor for the old Lazy ByteString newtype. It has since been replaced with an explicit data type, so the current behavior is slightly different.
When you call Show on a Lazy ByteString it prints out the code that would generate approximately the same lazy bytestring you gave it. However, the usual import for working with ByteStrings doesn't export the LPS -- or in later revisions, the Chunk/Empty constructors. So it shows it with the LPS constructor wrapped around a list of strict bytestring chunks, which print themselves as strings.
On the other hand, I wonder if the lazy ByteString Show instance should do the same thing that most other show instances for complicated data structures do and say something like:
fromChunks ["foo","bar","baz"]
or even:
fromChunks [pack "foo",pack "bar", pack "baz"]
since the former seems to rely on {-# LANGUAGE OverloadedStrings #-} for the resulting code fragment to be really parseable as Haskell code. On the other-other hand, printing bytestrings as if they were strings is really convenient. Alas, both options are more verbose than the old LPS syntax, but they are more terse than the current Chunk "Foo" Empty. In the end, Show just needs to be left invertible by Read, so it's probably best not to muck around changing things lest it randomly break a ton of serialized data. ;)
As for your problem, you are getting a [[ByteString]] instead of [[Float]] by mapping words over your lines. You need to unpack that ByteString and then call read on the resulting string to generate your floating point numbers.
