I'm trying to make large TSV files with JSON in the 5th column suitable for import to mongoDB.
In particular I want to change top level and only top level key fields to _id. This is what I have so far, it seems to work but is slow:
{-# LANGUAGE OverloadedStrings #-}
import System.Environment (getArgs)
import Data.Conduit.Binary (sourceFile, sinkFile)
import Data.Conduit
import qualified Data.Conduit.Text as CT
import qualified Data.Conduit.List as CL
import qualified Data.Text as T
import Data.Monoid ((<>))
import Data.Attoparsec.Text as APT
import Control.Applicative
main = do
(inputFile : outputFile : _) <- getArgs
runResourceT $ sourceFile inputFile
$= CT.decode CT.utf8 $= CT.lines $= CL.map jsonify
$= CT.encode CT.utf8 $$ sinkFile outputFile
jsonify :: T.Text -> T.Text
jsonify = go . T.splitOn "\t"
where
go (_ : _ : _ : _ : content : _) = case parseOnly keyTo_id content of
Right res -> res <> "\n"
_ -> ""
go _ = ""
keyTo_id :: Parser T.Text
keyTo_id = skipWhile(/='{') >> T.snoc <$>
(T.cons <$> (char '{')
<*> (T.concat <$> many1 ( bracket
<|> (string "\"key\":" >> return "\"_id\":")
<|> APT.takeWhile1(\x -> x /= '{' && x /= '}' && x/= '"')
<|> T.singleton <$> satisfy (/= '}')
)))
<*> char '}'
bracket :: Parser T.Text
bracket = T.cons <$> char '{'
<*> scan 1 test
where
test :: Int -> Char -> Maybe Int
test 0 _ = Nothing
test i '}'= Just (i-1)
test i '{' = Just (i+1)
test i _ = Just i
According to the profiler 58.7% of the time is spent in bracket, 19.6% in keyTo_id, 17.1% in main.
Surely there's a better way to return bracketed terms unchanged if the brackets match up?
I briefly looked at attoparsec-conduit, but I have no idea how to use that library and can't even tell whether this is the sort of thing it can be used for.
EDIT: Updated the code. The data is from openlibrary.org, e. g. http://openlibrary.org/data/ol_dump_authors_latest.txt.gz
Use the scan function. It allows you to scan over a string maintaing a state. In your case the state will be a number — the difference of opening and closing braces that you've encountered so far.
When your state is 0, that means that braces match inside the current substring.
The trick is that you don't deconstruct and reconstruct the string this way, so it should be faster.
Also, you could gain some performance even with your current algorithm by using lazy Text — the concat function would work more efficiently.
Related
I wanted to replace sed and awk with Parsec. For example, extract number from strings like unknown structure but containing the number 42 and maybe some other stuff.
I run into "unexpected end of input". I'm looking for equivalent of non-greedy .*([0-9]+).*.
module Main where
import Text.Parsec
parser :: Parsec String () Int
parser = do
_ <- many anyToken
x <- read <$> many1 digit
_ <- many anyToken
return x
main :: IO ()
main = interact (show . parse parser "STDIN")
This can be easily done with my library regex-applicative. It gives you both the combinator interface and the features of regular expressions that you seem to want.
Here's a working version that's closest to your example:
{-# LANGUAGE ApplicativeDo #-}
import Text.Regex.Applicative
import Text.Regex.Applicative.Common (decimal)
parser :: RE Char Int
parser = do
_ <- few anySym
x <- decimal
_ <- many anySym
return x
main :: IO ()
main = interact (show . match parser)
Here's an even shorter version, using findFirstInfix:
import Text.Regex.Applicative
import Text.Regex.Applicative.Common (decimal)
main :: IO ()
main = interact (snd3 . findFirstInfix decimal)
where snd3 (_, r, _) = r
If you want to perform actual tokenization (e.g. skip 93 in foo93bar), then take a look at lexer-applicative, a tokenizer based on regex-applicative.
Replacing sed and awk with parsers is what the
replace-megaparsec
library is all about.
Extract numbers from unstructured strings with the
sepCap
parser combinator.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char.Lexer
parseTest (sepCap (decimal :: Parsec Void String Int))
$ "unknown structure but containing the number 42 and maybe some other stuff"
[ Left "unknown structure but containing the number "
, Right 42
, Left " and maybe some other stuff"
]
This cannot work, since anyToken accepts and consumes - as its names says - any token, including digits. And you apply it many times. Therefore the attempt to read digits with the second parser must fail. There simply cannot be any tokens left.
Instead make your first parser accept any character, that is not a digit (using isDigit from module Data.Char):
parser :: Parsec String () Int
parser = do
_ <- many $ satisfy (not . isDigit)
x <- read <$> many1 digit
_ <- many anyToken
return x
I'm very new to Haskell and I'm trying to parse a map file, just for practice. My code will compile, but it gives me the wrong result. All I get is "Right []" - which I don't understand.
My code is very similar to the tutorial here, but I rewrote it to serve my needs.
My file looks like this (I removed most of the lines to save space here):
#test map 2
0,0:1;
1,0:1;
2,0:1;
3,0:1;
My code:
import Data.Word
import Data.Time
import Data.Attoparsec.Char8
import Control.Applicative
import qualified Data.ByteString as B
-- Types --
data Tile = Tile Int Int Int deriving Show
data MapLine =
MapLine { tile :: Tile } deriving Show
-- Parsing --
parseTile :: Parser Tile
parseTile = do
x <- decimal
char ','
y <- decimal
char ':'
t <- decimal
char ';'
return $ Tile x y t
mapLineParser :: Parser MapLine
mapLineParser = do
t <- parseTile
return $ MapLine t
fileParser :: Parser [MapLine]
fileParser = many $ mapLineParser <* endOfLine
-- Main --
main :: IO()
--main = B.readFile "map.hexmap" >>= print . parseOnly fileParser
main = do
print "Parsing map..."
let x = B.readFile "map.hexmap"
x >>= print . parseOnly fileParser
print "Done."
Thanks for the help.
Your parser "successfully parses" a list of MapLines of length zero before failing at the first line. Remove that line (and make sure your file doesn't include any non-parsable bytes at the start like a BOM) and it should work. Or write a parser for lines starting with a # that ignores the result, then combine.
I want to parse text like "John","Kate","Ruddiger" into list of Strings.
I tried to start with parsing "John", to Name (alias for String) but it already fails with Fail "\"," [","] "Failed reading: satisfyWith".
Question A: Why does this error occur and how can I fix it? (I didn't find call to satisfyWith in attoparsec's source code)
Question B: How can I make the parser to not require a comma after the last name?
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Char8 as P
import qualified Data.ByteString.Char8 as BS
import Control.Applicative(many)
data Name = Name String deriving Show
readName = P.takeWhile (/='"')
entryParser :: Parser Name
entryParser = do
P.char '"'
name <- readName
P.char ','
return $ Name (BS.unpack name)
someEntry :: IO BS.ByteString
someEntry = do
return $ BS.pack "\"John\","
main :: IO()
main = do
someEntry >>= print . parse entryParser
I am using GHC 7.6.3 and attoparsec-0.11.3.4.
Question A: Why does this error occur and how can I fix it? (I didn't find call to satisfyWith in attoparsec's source code)
readName = P.takeWhile (/='"')
takeWhile consumes as long as the predicate is true. Therefor, after you read the name, " hasn't been consumed. This is easy to see if we remove P.char ',' from the entryParser:
entryParser = P.char '"' >> fmap (Name . BS.unpack) readName
$ runhaskell SO.hs
Done "\"," Name "John"
You need to consume the ":
entryParser :: Parser Name
entryParser = do
P.char '"'
name <- readName
P.char '"' -- <<<<<<<<<<<<<<<<<<<<<<
P.char ','
return $ Name (BS.unpack name)
Question B: How can I make the parser to not require a comma after the last name?
Use sepBy.
Now your questions has been cleared up, lets make things a little bit easier. Don't consume the , at all in entryParser, instead, only take the name:
entryParser = P.char '"' *> fmap ( Name . BS.unpack ) readName <* P.char '"'
In case you don't know (*>) and (<*), they're both from Control.Applicative, and they basically mean "discard whatever is on the asterisks side".
Now, in order to parse all comma separated entries, we use sepBy entryParser (P.char ','). However, this will lead into attoparsec returning a Partial:
$ runhaskell SO.hs
Partial _
That's actually a feature of attoparsec you have to keep in mind:
Attoparsec supports incremental input, meaning that you can feed it a bytestring that represents only part of the expected total amount of data to parse. If your parser reaches the end of a fragment of input and could consume more input, it will suspend parsing and return a Partial continuation.
If you do want to use incremental input, use parse and feed. Otherwise use parseOnly. The complete code for your example would be something like
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Char8 as P
import qualified Data.ByteString.Char8 as BS
import Control.Applicative(many, (*>), (<*))
data Name = Name String deriving Show
readName = P.takeWhile (/='"')
entryParser :: Parser Name
entryParser = P.char '"' *> fmap ( Name . BS.unpack ) readName <* P.char '"'
allEntriesParser = sepBy entryParser (P.char ',')
testString = "\"John\",\"Martha\",\"test\""
main = print . parseOnly allEntriesParser $ testString
$ runhaskell SO.hs
Right [Name "John",Name "Martha",Name "test"]
While I can use reify to get information about most other syntactic constructs, I couldn't find anything that would give some information about a module.
Unfortunately Template Haskell currently has no such capabilities. All the solutions involve parsing of the module's source-code. However the location and loc_filename functions of TH make it easy to locate the module with the calling splice.
Here is a solution extracted from the source code of one of my projects:
{-# LANGUAGE LambdaCase, TupleSections #-}
import Language.Haskell.TH
import qualified Data.Attoparsec.Text as AP
import qualified Data.Text.IO as Text
import qualified Data.Text as Text
import qualified Data.Char as Char
import Data.Maybe
import Data.List
import Control.Applicative
import Data.Traversable
import Prelude hiding (mapM)
reifyLocalFunctions :: Q [(Name, Type)]
reifyLocalFunctions =
listTopLevelFunctionLikeNames >>=
mapM (\name -> reifyFunction name >>= mapM (return . (name, ))) >>=
return . catMaybes
where
listTopLevelFunctionLikeNames = do
loc <- location
text <- runIO $ Text.readFile $ loc_filename loc
return $ map (mkName . Text.unpack) $ nub $ parse text
where
parse text =
either (error . ("Local function name parsing failure: " ++)) id $
AP.parseOnly parser text
where
parser =
AP.sepBy (optional topLevelFunctionP <* AP.skipWhile (not . AP.isEndOfLine))
AP.endOfLine >>=
return . catMaybes
where
topLevelFunctionP = do
head <- AP.satisfy Char.isLower
tail <- many (AP.satisfy (\c -> Char.isAlphaNum c || c `elem` ['_', '\'']))
return $ Text.pack $ head : tail
reifyFunction :: Name -> Q (Maybe Type)
reifyFunction name = do
tryToReify name >>= \case
Just (VarI _ t _ _) -> return $ Just $ t
_ -> return Nothing
tryToReify :: Name -> Q (Maybe Info)
tryToReify n = recover (return Nothing) (fmap Just $ reify n)
I need to parse and process a text file that is a nested list of integer. The file is about 250mb large. This already leads to performace problems my naive solution takes 20GB or more of RAM.
The question is related to another question.
I have written about the memory problems and the suggestion was to use Data.Vector to get rtid of the memory problems.
So the goal is to process a nested list of integers and, say, filter the values so that only values larger than 30 get printed out.
Test file "myfile.tx":
11,22,33,44,55
66,77,88,99,10
Here is my code using Attoparsec, adapted from attoparsec-csv:
{-# Language OverloadedStrings #-}
-- adapted from https://github.com/robinbb/attoparsec-csv
module Text.ParseCSV
(
parseCSV
) where
import Prelude hiding (concat, takeWhile)
import Control.Applicative ((<$>), (<|>), (<*>), (<*), (*>), many)
import Control.Monad (void, liftM)
import Data.Attoparsec.Text
import qualified Data.Text as T (Text, concat, cons, append, pack, lines)
import qualified Data.Text.IO as IO (readFile, putStr)
import qualified Data.ByteString.Char8 as BSCH (readInteger)
lineEnd :: Parser ()
lineEnd =
void (char '\n') <|> void (string "\r\n") <|> void (char '\r')
<?> "end of line"
parserInt :: Parser Integer
parserInt = (signed decimal)
record :: Parser [Integer]
record =
parserInt `sepBy1` char ','
<?> "record"
file :: Parser [[Integer]]
file =
(:) <$> record
<*> manyTill (lineEnd *> record)
(endOfInput <|> lineEnd *> endOfInput)
<?> "file"
parseCSV :: T.Text -> Either String [[Integer]]
parseCSV =
parseOnly file
getValues :: Either String [[Integer]] -> [Integer]
getValues (Right [x]) = x
getValues _ = []
getLines :: FilePath -> IO [T.Text]
getLines = liftM T.lines . IO.readFile
parseAndFilter :: T.Text -> [Integer]
parseAndFilter = ((\x -> filter (>30) x) . getValues . parseCSV)
main = do
list <- getLines "myfile.txt"
putStr $ show $ map parseAndFilter list
But instead of using a list [Integer] I would like to use Data.Vector.
I found a relevant part in the Data.Vector tutorial:
--The simplest way to parse a file of Int or Integer types is with a strict or lazy --ByteString, and the readInt or readInteger functions:
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.Vector as U
import System.Environment
main = do
[f] <- getArgs
s <- L.readFile f
print . U.sum . parse $ s
-- Fill a new vector from a file containing a list of numbers.
parse = U.unfoldr step
where
step !s = case L.readInt s of
Nothing -> Nothing
Just (!k, !t) -> Just (k, L.tail t)
However, this is regular, not a nested list of integers.
I tried to adapt my code but it did not work.
How can I change my code to
use a nested Vector (or Vector of Vectors) instead of [Integer] (i.e., while also running the Filter of >30 on the Vector).
There is an important question you don't mention in the posting.... Do you need everything in memory at once. If the processing is local, or if you can summarize all the data up to a point in the file with a few values, you can solve the performance problems by streaming the data through and throwing away all but the current line. This will usually run way faster and allow you to process orders of magnitude larger files. And it usually doesn't even matter (as much) what data structure you use to parse the values.
Here is an example:
import Text.Regex
process::[Int]->String
process = (++"\n") . show . sum --put whatever you want here.
main = interact (concat . map (process . map read . splitRegex (mkRegex ",")) . lines)
The whole program runs lazily, so it processes line by line as the data comes in and frees up the memory for old data (you can check this by typing in data by hand and watch the output come out). There is a performance hit by using the unpacked structures, but this isn't as big a problem as pulling everything into memory.
Many problems that don't seem to fit this criteria at first can be modified to do so (you may have to sort the data first, but there are many performance effective ways to do this).... I rewrote the full online stats system for a gaming company once following this principle, and was able to take a stats crunching time from hours to a couple of minutes (with even more metrics).
Because of its lazy nature, Haskell is a good language to stream data through.
I found a post that there is no easy way to parse with attoparsec to a vector.
See this forum post and thread.
But the good new is that the overhead of Data.Vector.fromList isn't so bad.
Attoparsec seems to be quite fast for parsing.
I keep the whole data in memory and this doesn't seem a speed overhead. It's more flexible, as perhaps later I need to have the whole data in memory, altough currently it is not needed per se for my problem.
Currently the code runs in ~30 seconds and about 1.5GB RAM for a 150MB text file. Now the memory consumption is quite little versus 20GB of before and I only need to focus on improving the speed.
Here are the changes from the code of my question my post, commented out code is using lists, functions with Vector in the type are new (this is not production code or meant to be good code yet):
{-
getValues :: Either String [[Integer]] -> [Integer]
getValues (Right [x]) = x
getValues _ = []
-}
getValues :: Either String [[Integer]] -> Vector Integer
getValues (Right [x]) = V.fromList x
getValues _ = V.fromList [999999,9999999,99999,999999] --- represents an ERROR
getLines :: FilePath -> IO [T.Text]
getLines = liftM T.lines . IO.readFile
{-
parseAndFilter :: T.Text -> [Integer]
parseAndFilter = ((\x -> filter (>30) x) . getValues . parseCSV)
-}
filterLarger :: Vector Integer -> Vector Integer
filterLarger = \x -> V.filter (>37) x
parseVector :: T.Text -> Vector Integer
parseVector = (getValues . parseCSV)
-- mystr = T.pack "3, 6, 7" --, 13, 14, 15, 17, 21, 22, 23, 24, 25, 28, 29, 30, 32, 33, 35, 36"
main = do
list <- getLines "mydata.txt"
--putStr $ show $ parseCSV $ mystr
putStr $ show $ V.map filterLarger $ V.map parseVector $ V.fromList list
--show $ parseOnly parserInt $ T.pack "123"
Thanks to jamshidh and all the comments that pointed me to the right direction.
Here is the final solution. Switching to ByteString and Int in the code, it now runs twice as fast and a bit less memory consumtion (time is now ~14 Seconds).
{-# Language OverloadedStrings #-}
-- adapted from https://github.com/robinbb/attoparsec-csv
module Main
(
parseCSV, main
) where
import Data.Vector as V (Vector, fromList, map, head, filter)
import Prelude hiding (concat, takeWhile)
import Control.Applicative ((<$>), (<|>), (<*>), (<*), (*>), many)
import Control.Monad (void, liftM)
import Data.Attoparsec.Char8
import qualified Data.ByteString.Char8 as B
lineEnd :: Parser ()
lineEnd =
void (char '\n') <|> void (string "\r\n") <|> void (char '\r')
<?> "end of line"
parserInt :: Parser Int
parserInt = skipSpace *> signed decimal
record :: Parser [Int]
record =
parserInt `sepBy1` char ','
<?> "record"
file :: Parser [[Int]]
file =
(:) <$> record
<*> manyTill (lineEnd *> record)
(endOfInput <|> lineEnd *> endOfInput)
<?> "file"
parseCSV :: B.ByteString -> Either String [[Int]]
parseCSV =
parseOnly file
getValues :: Either String [[Int]] -> Vector Int
getValues (Right [x]) = V.fromList x
getValues _ = error "ERROR in getValues function!"
filterLarger :: Vector Int -> Vector Int
filterLarger = \x -> V.filter (>36) x
parseVector :: B.ByteString -> Vector Int
parseVector = (getValues . parseCSV)
-- MAIN
main = do
fContent <- B.readFile "myfile.txt"
putStr $ show $ V.map filterLarger $ V.map parseVector $ V.fromList $ B.lines fContent