Haskell manipulating file contents - string

I'm a student from Portugal with some questions about a project I'm working on.
My final objective is to create a PDF catalog with LaTeX that stores information about files, collected with exiftool.
So far, I've managed to separate audio from video files and store their exiftool output in a file, but all the entries end up lumped together.
For example:
======== Cartoon Battle.mp3
-ExifToolVersion=8.60
-FileName=Cartoon Battle.mp3
-Directory=.
-FileSize=4.0 MB
-FileModifyDate=2011:12:13 09:46:25+00:00
-FilePermissions=rw-rw-r--
-FileType=MP3
-MIMEType=audio/mpeg
-MPEGAudioVersion=1
-AudioLayer=3
-AudioBitrate=320 kbps
-SampleRate=48000
-ChannelMode=Stereo
-MSStereo=Off
-IntensityStereo=Off
-CopyrightFlag=False
-OriginalMedia=False
-Emphasis=None
-ID3Size=113441
-Title=Cartoon Battle
-Artist=Kevin MacLeod
-Year=2007
-BeatsPerMinute=130
-Genre=Unclassifiable
-Comment=(iTunPGAP) 0
-EncodedBy=iTunes v7.0.2.16
-Comment=(iTunNORM) 000001F7 0000014B 00001DBD 00000B18 000154C8 00000780 00008169 00008180 00000780 00000780
-Comment=(iTunSMPB) 00000000 00000210 00000A84 00000000004A606C 00000000 003DE780 00000000 00000000 00000000 00000000 00000000 00000000
-Album=Far East
-Composer=Kevin MacLeod
-PictureFormat=JPG
-PictureType=Other
-PictureDescription=
-Picture=(Binary data 91855 bytes, use -b option to extract)
-DateTimeOriginal=2007
-Duration=0:01:41 (approx)
======== Comic Plodding.mp3
-ExifToolVersion=8.60
-FileName=Comic Plodding.mp3
-Directory=.
-FileSize=3.8 MB
-FileModifyDate=2011:12:13 09:46:24+00:00
-FilePermissions=rw-rw-r--
-FileType=MP3
-MIMEType=audio/mpeg
-MPEGAudioVersion=1
-AudioLayer=3
-AudioBitrate=320 kbps
-SampleRate=44100
-ChannelMode=Joint Stereo
-MSStereo=Off
-IntensityStereo=Off
-CopyrightFlag=False
-OriginalMedia=False
-Emphasis=None
-ID3Size=105099
-EncoderSettings=Logic Pro 8.0.1
-Comment=(iTunNORM) 000001AE 00000181 000026DF 0000365B 0001100A 00016CE5 00007D33 00007ECF 00010FF0 00016CE5
-Comment=(iTunSMPB) 00000000 00000210 000009D6 000000000040DA1A 00000000 003ABCBC 00000000 00000000 00000000 00000000 00000000 00000000
-Artist=Kevin MacLeod
-Composer=Kevin MacLeod
-Year=2008
-Genre=Silent Film Score
-PictureFormat=JPG
-PictureType=Other
-PictureDescription=
-Picture=(Binary data 84880 bytes, use -b option to extract)
-Album=Scoring - Silent Film: Dark
-DateTimeOriginal=2008
-Duration=0:01:36 (approx)
What I'd like to do is:
first, split the two songs in that file into a list of some sort;
then, pick out some of the information for each file, like the FileName, FileSize and so on.
So far, I've come up with this piece of code, but it isn't correct:
mymain = do{
a <- readFile "audio.txt" ; -- file that has all the infos collected by exiftool
ml <- splitRegex (mkRegex "========") a ; -- I expect this to separate each song and place their corresponding information on a single string
Can anyone give me a hint? I want to store some of the information in a File structure I've created, but first I need to split it up by song and then pick out what I want, right?
Thanks for the help, and excuse me for my bad french!
PS: I'm not that used to Haskell (just starting)

A minimal fix is:
import Text.Regex
main = do {
a <- readFile "audio.txt" ;
print $ splitRegex (mkRegex "========") a ;
}
The arrow (<-) extracts a value from a monadic value, that is, from a value of type m a where m is a monad and a is an arbitrary type. readFile returns a monadic value (of type IO String), but splitRegex accepts a plain value of type String, so <- can be used to extract a String from the IO String. But splitRegex returns a non-monadic value, so <- cannot extract anything from it.
I suggest splitting your code into IO code and non-IO code, and using the layout syntax without ; and {}:
import Text.Regex
processData text = x where
    x = splitRegex (mkRegex "========") y
    y = text
    ...
main = do
    a <- readFile "audio.txt"
    print $ processData a
So IO code will use do and <-, and non-IO code will use where and =.
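For the second half of the question (picking out fields such as FileName), the non-IO part could look roughly like the sketch below. It uses only the Prelude and Data.List instead of Text.Regex, and it assumes the dump always has the shape shown in the question: a header line starting with "======== " followed by "-Tag=Value" lines. The names Record, headerMark, parseDump and parseTag are made up for this illustration.

```haskell
import Data.List (isPrefixOf)

-- One record per file: the name from the "========" header line,
-- plus the "-Tag=Value" lines that follow it as (tag, value) pairs.
type Record = (String, [(String, String)])

-- Hypothetical helper: the header marker seen in the sample dump.
headerMark :: String
headerMark = "======== "

-- Split the whole dump into records, starting a new one at each header line.
-- Anything before the first header is ignored.
parseDump :: String -> [Record]
parseDump = go . dropWhile (not . isHeader) . lines
  where
    isHeader = isPrefixOf headerMark
    go [] = []
    go (h : rest) =
      let (body, more) = break isHeader rest
      in (drop (length headerMark) h, [kv | Just kv <- map parseTag body]) : go more

-- Turn "-Tag=Value" into Just ("Tag", "Value"); any other line is skipped.
parseTag :: String -> Maybe (String, String)
parseTag ('-' : s) = case break (== '=') s of
  (tag, '=' : value) -> Just (tag, value)
  _                  -> Nothing
parseTag _ = Nothing
```

With this, main stays tiny: read the file, call parseDump, and use lookup "FileName" on the tag list of each record. Note that lookup returns only the first match, so repeated tags such as Comment would need filter instead.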


How to save, append and read a List of tuple including Lists into a File using Data.Serialize and ByteString

Hello, I am having problems reading back a list of tuples of lists after saving and appending it to a file.
Saving something into a file works without problems.
I am saving into a file with
import qualified Data.ByteString as BS
import qualified Data.Serialize as S (decode, encode)
import Data.Either
toFile path = do
    let a = take 1000 [100..] :: [Float]
    let b = take 100 [1..] :: [Float]
    BS.appendFile path $ S.encode (a,b)
and reading with
fromFile path = do
    bstr <- BS.readFile path
    let d = S.decode bstr :: Either String ([Float],[Float])
    return (Right d)
but reading from that file with fromFile only gives me one element, although I appended to the file multiple times.
Since I'm appending to the file, it should have multiple elements inside it, so I'm missing something like a map in my fromFile function, but I couldn't work out how.
I appreciate any help or any other solutions, so using Data.Serialize and ByteString is not a must. Another possibility I thought of is JSON files with Data.Aeson, if I can't get it to work with Serialize.
Edit:
I realized that I made a mistake in the decoding type in fromFile
let d = S.decode bstr :: Either String ([Float],[Float])
it should be like this
let d = S.decode bstr :: Either String [([Float],[Float])]
The Problem In Brief
The default format used by serialize (or binary) encoding isn't trivially append-able.
The Problem (Longer)
You say you appended:
S.encode (a,b)
to the same file "multiple times". So the format of the file is now:
[ 64 bit length field | # floats encoded | 64 bit length field | # floats encoded ]
Repeated however many times you appended to the file. That is, each append will add new length fields and list of floats while leaving the old values in place.
After that you returned to read the file and decode some floats using, morally, S.decode <$> BS.readFile path. This will decode the first two lists of floats by first reading the length field (of the first time you wrote to the file) then the following floats and the second length field followed by its related floats. After reading the stated length worth of floats the decoder will stop.
It should now be clear that just because you appended more data does not make your encoding or decoding script look for any additional data. The default format used by serialize (or binary) encoding isn't trivially append-able.
Solutions
You mentioned switching to Aeson, but using JSON to encode instead of binary won't help you. Decoding two appended JSON strings like { "first": [1], "second": [2]}{ "first": [3], "second": [4]} is logically the same as your current problem. You have some unknown number of interleaved chunks of lists - just write a decoder to keep trying:
import qualified Data.ByteString as BS
import Data.Serialize as S
import Data.Serialize.Get as S
fromFile :: FilePath -> IO (Either String ([Float],[Float]))
fromFile path = do
    bstr <- BS.readFile path
    -- runGet already returns an Either, so there is no need to wrap it again
    return (S.runGet getMultiChunks bstr)
getMultiChunks :: Get ([Float],[Float])
getMultiChunks = go ([], [])
  where
    go (l,r) = do
        b <- isEmpty
        if b
            then pure (l,r) -- no input left: return everything decoded so far
            else do
                (lNext, rNext) <- S.get
                go (l ++ lNext, r ++ rNext) -- inefficient
So we've written our own getter (untested) that checks whether any bytes remain and, if so, decodes another pair of lists of floats. Each time it decodes a new chunk it appends it to the chunks decoded so far (which is inefficient; use something like a dlist if you want it to be respectable).
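The "keep decoding until the input is empty" idea does not depend on cereal. As a rough illustration only, here is the same loop over a made-up length-prefixed format on plain lists of Int. This is not the cereal wire format, and encodeList, decodeList and decodeAll are hypothetical names.

```haskell
-- Toy stand-in for an encoder: a list is written as its length followed by
-- its elements. This is NOT the cereal wire format, only an illustration.
encodeList :: [Int] -> [Int]
encodeList xs = length xs : xs

-- Decode one length-prefixed list, returning it with the unread remainder.
decodeList :: [Int] -> Either String ([Int], [Int])
decodeList [] = Left "unexpected end of input"
decodeList (n : rest)
  | n < 0 || length rest < n = Left "truncated chunk"
  | otherwise                = Right (splitAt n rest)

-- Keep decoding chunks until nothing is left, collecting them in order.
decodeAll :: [Int] -> Either String [[Int]]
decodeAll [] = Right []
decodeAll input = do
  (chunk, rest) <- decodeList input
  chunks <- decodeAll rest
  return (chunk : chunks)
```

Appending two encoded lists and calling decodeList once yields only the first list plus a remainder, which mirrors the behaviour described in the question; decodeAll is the loop that consumes the remainder as well.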

Haskell 2D character array

I have a question about Haskell. I need to store a bunch of characters in a 2D array. How can I store them? I have the characters in a 10 x 10 layout in a text file, and I want to store them in a 2D character array in Haskell. Thank you.
Here is the code I tried; in it I am trying to store the value of x in the list named listofchar:
module TreasureFile where
import System.IO
main = do
    hdl <- openFile "map.txt" ReadMode
    readbychar hdl
readbychar hdl = do
    t <- hIsEOF hdl
    if t
        then return ()
        else do
            let listofchar=[]
            x <- hGetChar hdl
            if x =='\n'
                then putChar '!'--return()
                else listofchar x
            readbychar hdl
Try this:
import System.IO
main = do
    textContents <- readFile "map.txt"
    let map = format textContents
    print $ map
format text = lines text
Let's step through this program:
First, readFile reads the file and binds its contents to textContents.
Next, we format the contents by splitting the text at every newline with lines, which gives one String per row.
Done! Now we can do whatever we want with our "map".
A small note on the side:
It may seem strange that our map is displayed like this:
["aaaaaaaaaa","bbbbbbbbbb",..] -- doesn't look like a 2D map
which is just syntactic sugar for:
[['a','a','a',..],['b','b','b',..],..] -- looks more like a map now
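Once the file is a list of rows, reading a single cell is a matter of indexing twice. A small sketch, assuming zero-based (row, column) coordinates; Grid, parseGrid and cellAt are names invented here, and the lookup returns Nothing instead of crashing when the position is outside the grid.

```haskell
-- A grid is a list of rows; each row is a String, i.e. a list of Char.
type Grid = [String]

-- Build the grid from the file contents, one row per line.
parseGrid :: String -> Grid
parseGrid = lines

-- Look up the character at (row, column), counting from zero.
-- Returns Nothing when the position is outside the grid.
cellAt :: Grid -> Int -> Int -> Maybe Char
cellAt grid r c
  | r < 0 || c < 0 = Nothing
  | otherwise = case drop r grid of
      (row : _) -> case drop c row of
        (ch : _) -> Just ch
        []       -> Nothing
      [] -> Nothing
```

Lists make each lookup linear in the row and column index, which is fine for a 10 x 10 map. For larger grids with many random lookups, an array type would be the more usual choice.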

Parsing an input file in Haskell

Is there any fast way in Haskell to convert an input file like the one below into the corresponding types? For example, a function that takes a string and produces a list of Ints? Or do I need to parse it manually using getLine and parse the string?
10.
10.
[4, 3, 2, 1].
[(5,8,'~'), (6,4,'*'), (7,10,'~'), (8,2,'o')].
[4,0,9,4,7,5,7,4,6,4].
[4,10,0,6,6,5,6,5,6,2].
Yes, the read function.
Once you read in the file with readFile for example, you can read each line to convert it to the type you want. You'll have to get rid of the periods first, though. So for example:
main = do
    text <- readFile "test.txt"
    let cases = lines text
        -- to get rid of the periods at the end of each line
        strs = map init cases
        lastLine = read $ last strs
    print (map (+5) lastLine)
This will take your example file, read in a list of numbers from the last line, add 5 to each element and print the result.
If every line were the same type, you could just map read over all the lines to get all of them. If there are different types, like in your example, you'd have to put in some logic to figure out what type is on each line, and then call an appropriate function to deal with that type.
To build on Jeff Burka's answer, here's the specific code you would use for your particular file:
main = do
    [l1, l2, l3, l4, l5, l6] <- fmap (map init . lines) $ readFile "myFile.txt"
    let myVal :: (Int, Int, [Int], [(Int, Int, Char)], [Int], [Int])
        myVal = (read l1, read l2, read l3, read l4, read l5, read l6)
    print myVal
This will print out the parsed tuple.
The init part is to get rid of the trailing period you have at the end of each line.
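One caveat with read is that it throws an exception on malformed input. If that matters, readMaybe from Text.Read (part of base) returns a Maybe instead. A minimal sketch follows; stripDot, parseInts and parseTriples are names chosen for this example, and stripDot removes the period only when one is actually there.

```haskell
import Text.Read (readMaybe)

-- Drop one trailing period, if present, before handing the text to readMaybe.
stripDot :: String -> String
stripDot s = case reverse s of
  ('.' : rest) -> reverse rest
  _            -> s

-- Parse a line such as "[4, 3, 2, 1]." into a list of Ints.
parseInts :: String -> Maybe [Int]
parseInts = readMaybe . stripDot

-- Parse a line such as "[(5,8,'~'), (6,4,'*')]." into triples.
parseTriples :: String -> Maybe [(Int, Int, Char)]
parseTriples = readMaybe . stripDot
```

The result type picks the parser, exactly as with read, so a line holding a single number can be handled with readMaybe (stripDot line) :: Maybe Int.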

Haskell read first n lines

I'm trying to learn Haskell to get used to functional programming languages. I've decided to try a few problems at interviewstreet to start out. I'm having trouble reading from stdin and doing io in general with haskell's lazy io.
Most of the problems have data coming from stdin in the following form:
n
data line 1
data line 2
data line 3
...
data line n
where n is the number of following lines coming from stdin and the next lines are the data.
How do I run my program on each of the n lines one at a time and return the solution to stdout?
I know the stdin input won't be very large but I'm asking about evaluating each line one at a time pretending the input is larger than what can fit in memory just to learn how to use haskell.
You can use interact, in conjunction with lines to process data from stdin one line at a time. Here's an example program that uses interact to access stdin, lines to split the data on each newline, a list comprehension to apply the function perLine to each line of the input, and unlines to put the output from perLine back together again.
main = interact processInput
processInput input = unlines [perLine line | line <- lines input]
perLine line = reverse line -- do whatever you want to 'line' here!
You don't need to worry about the size of the data you're getting over stdin; Haskell's laziness ensures that you only keep the parts you're actually working on in memory at any time.
EDIT: if you still want to work on only the first n lines, you can use the take function in the above example, like this:
processInput input = unlines [perLine line | line <- take 10 (lines input)]
This will terminate the program after the first ten lines have been read and processed.
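Since the input format puts the line count on the first line, the count can also be read from the input rather than hard-coded. A sketch under that assumption; perLine is the same placeholder as above, and a non-numeric first line simply yields no output here, which is a choice made for this example.

```haskell
import Text.Read (readMaybe)

-- Placeholder per-line work, as in the answer above.
perLine :: String -> String
perLine = reverse

-- Read the count from the first line, then process exactly that many lines.
-- If the first line is not a number, produce no output.
processInput :: String -> String
processInput input = case lines input of
  (first : rest) | Just n <- readMaybe first -> unlines (map perLine (take n rest))
  _ -> ""
```

It plugs into the same entry point as before: main = interact processInput.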
You can also use a simple recursion:
getMultipleLines :: Int -> IO [String]
getMultipleLines n
    | n <= 0 = return []
    | otherwise = do
        x <- getLine
        xs <- getMultipleLines (n-1)
        return (x:xs)
And then use it in your main:
main :: IO ()
main = do
    line <- getLine
    let numLines = read line :: Int
    inputs <- getMultipleLines numLines
    mapM_ putStrLn inputs -- replace with your own processing of the lines
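The recursion in getMultipleLines is the pattern that replicateM from Control.Monad already provides. A short sketch, written against any monad so it can be tried without stdin; readCounted is a name invented for this example.

```haskell
import Control.Monad (replicateM)

-- Read the count with the first action, then run the second action
-- that many times and collect the results in order.
readCounted :: Monad m => m Int -> m a -> m [a]
readCounted getCount getItem = do
  n <- getCount
  replicateM n getItem
```

In IO, readCounted readLn getLine reads the count from the first line and then that many lines. A count of zero or less gives an empty list.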

How to get good performance when writing a list of integers from 1 to 10 million to a file?

question
I want a program that will write a sequence like,
1
...
10000000
to a file. What's the simplest code one can write, and get decent performance? My intuition is that there is some lack-of-buffering problem. My C code runs at 100 MB/s, whereas by reference the Linux command line utility dd runs at 3 GB/s (corrected from 9 GB/s; sorry for the imprecision, see comments. I'm more interested in the big-picture orders of magnitude though).
One would think this would be a solved problem by now ... i.e. any modern compiler would make it immediate to write such programs that perform reasonably well ...
C code
#include <stdio.h>
int main(int argc, char **argv) {
    int len = 10000000;
    for (int a = 1; a <= len; a++) {
        printf ("%d\n", a);
    }
    return 0;
}
I'm compiling with clang -O3. A performance skeleton which calls putchar('\n') 8 times gets comparable performance.
Haskell code
A naive Haskell implementation runs at 13 MiB/sec, compiling with ghc -O2 -optc-O3 -optc-ffast-math -fllvm -fforce-recomp -funbox-strict-fields. (I haven't recompiled my libraries with -fllvm, perhaps I need to do that.) Code:
import Control.Monad
main = forM [1..10000000 :: Int] $ \j -> putStrLn (show j)
My best stab with Haskell is only slightly faster, at 17 MiB/sec. The problem is I can't find a good way to convert Vectors into ByteStrings (perhaps there's a solution using iteratees?).
import qualified Data.Vector.Unboxed as V
import Data.Vector.Unboxed (Vector, Unbox, (!))
import qualified System.IO
writeVector :: (Unbox a, Show a) => Vector a -> IO ()
writeVector v = V.mapM_ (System.IO.putStrLn . show) v
main = writeVector (V.generate 10000000 id)
It seems that writing ByteStrings is fast, as demonstrated by this code, which writes an equivalent number of characters:
import qualified Data.ByteString.Char8 as B
main = B.putStrLn (B.replicate 76000000 '\n')
This gets 1.3 GB/s, which isn't as fast as dd, but obviously much better.
Some completely unscientific benchmarking first:
All programmes have been compiled with the default optimisation level (-O3 for gcc, -O2 for GHC) and run with
time ./prog > outfile
As a baseline, the C programme took 1.07s to produce a ~76MB (78888897 bytes) file, roughly 70MB/s throughput.
1. The "naive" Haskell programme (forM [1 .. 10000000] $ \j -> putStrLn (show j)) took 8.64s, about 8.8MB/s.
2. The same with forM_ instead of forM took 5.64s, about 13.5MB/s.
3. The ByteString version from dflemstr's answer took 9.13s, about 8.3MB/s.
4. The Text version from dflemstr's answer took 5.64s, about 13.5MB/s.
5. The Vector version from the question took 5.54s, about 13.7MB/s.
6. main = mapM_ (C.putStrLn . C.pack . show) $ [1 :: Int .. 10000000], where C is Data.ByteString.Char8, took 4.25s, about 17.9MB/s.
7. putStr . unlines . map show $ [1 :: Int .. 10000000] took 3.06s, about 24.8MB/s.
8. A manual loop,
   main = putStr $ go 1
     where
       go :: Int -> String
       go i
         | i > 10000000 = ""
         | otherwise = shows i . showChar '\n' $ go (i+1)
   took 2.32s, about 32.75MB/s.
9. main = putStrLn $ replicate 78888896 'a' took 1.15s, about 66MB/s.
10. main = C.putStrLn $ C.replicate 78888896 'a', where C is Data.ByteString.Char8, took 0.143s, about 530MB/s, roughly the same figures for lazy ByteStrings.
What can we learn from that?
First, don't use forM or mapM unless you really want to collect the results. Performancewise, that sucks.
Then, ByteString output can be very fast (10.), but if the construction of the ByteString to output is slow (3.), you end up with slower code than the naive String output.
What's so terrible about 3.? Well, all the involved Strings are very short. So you get a list of
Chunk "1234567" Empty
and between any two such, a Chunk "\n" Empty is put, then the resulting list is concatenated, which means all these Emptys are tossed away when a ... (Chunk "1234567" (Chunk "\n" (Chunk "1234568" (...)))) is built. That's a lot of wasteful construct-deconstruct-reconstruct going on. Speed comparable to that of the Text and the fixed "naive" String version can be achieved by packing to strict ByteStrings and using fromChunks (and Data.List.intersperse for the newlines). Better performance, slightly better than 6., can be obtained by eliminating the costly singletons. If you glue the newlines to the Strings, using \k -> shows k "\n" instead of show, the concatenation has to deal with half as many slightly longer ByteStrings, which pays off.
I'm not familiar enough with the internals of either text or vector to offer more than a semi-educated guess concerning the reasons for the observed performance, so I'll leave them out. Suffice it to say that the performance gain is marginal at best compared to the fixed naive String version.
Now, 6. shows that ByteString output is faster than String output, enough that in this case the additional work of packing is more than compensated. However, don't be fooled by that to believe that is always so. If the Strings to pack are long, the packing can take more time than the String output.
But ten million invocations of putStrLn, be it the String or the ByteString version, take a lot of time. It's faster to grab the stdout Handle just once and construct the output String in non-IO code. unlines already does well, but we still suffer from the construction of the list map show [1 .. 10^7]. Unfortunately, the compiler didn't manage to eliminate that (but it eliminated [1 .. 10^7], that's already pretty good). So let's do it ourselves, leading to 8. That's not too terrible, but still takes more than twice as long as the C programme.
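To make the difference between the unlines version and the manual loop concrete, both can be written as pure functions of the upper bound, which also makes it easy to check on a small n that they produce the same text. The names viaUnlines and viaShows are invented for this sketch, and no timing claim is attached to it.

```haskell
-- Variant with an intermediate list of Strings: show each number, then join.
viaUnlines :: Int -> String
viaUnlines n = unlines (map show [1 .. n])

-- Manual loop: shows takes the rest of the output as its second argument,
-- so each number is put in front of the remaining text without first
-- building a separate list of Strings.
viaShows :: Int -> String
viaShows n = go 1
  where
    go i
      | i > n     = ""
      | otherwise = shows i (showChar '\n' (go (i + 1)))
```

Either one is then written with a single call, for example putStr (viaShows 10000000), so stdout is grabbed once rather than once per number.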
One can make a faster Haskell programme by going low-level and directly filling ByteStrings without going through String via show, but I don't know if the C speed is reachable. Anyway, that low-level code isn't very pretty, so I'll spare you what I have, but sometimes one has to get one's hands dirty if speed matters.
Using lazy byte strings gives you some buffering, because the string will be written instantly and more numbers will only be produced as they are needed. This code shows the basic idea (there might be some optimizations that could be made):
import qualified Data.ByteString.Lazy.Char8 as ByteString
main =
    ByteString.putStrLn .
    ByteString.intercalate (ByteString.singleton '\n') .
    map (ByteString.pack . show) $
    ([1..10000000] :: [Int])
I still use Strings for the numbers here, which leads to horrible slowdowns. If we switch to the text library instead of the bytestring library, we get access to "native" show functions for ints, and can do this:
import Data.Monoid
import Data.List
import Data.Text.Lazy.IO as Text
import Data.Text.Lazy.Builder as Text
import Data.Text.Lazy.Builder.Int as Text
main :: IO ()
main =
    Text.putStrLn .
    Text.toLazyText .
    mconcat .
    intersperse (Text.singleton '\n') .
    map Text.decimal $
    ([1..10000000] :: [Int])
I don't know how you are measuring the "speed" of these programs (with the pv tool?) but I imagine that one of these procedures will be the fastest trivial program you can get.
If you are going for maximum performance, then it helps to take a holistic view; i.e., you want to write a function that maps from [Int] to series of system calls that write chunks of memory to a file.
Lazy bytestrings are a good representation for a sequence of chunks of memory. Mapping a lazy bytestring to a series of system calls that write chunks of memory is what L.hPut is doing (assuming an import qualified Data.ByteString.Lazy as L). Hence, we just need a means to efficiently construct the corresponding lazy bytestring. This is what lazy bytestring builders are good at. With the new bytestring builder, the following code does the job.
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Lazy.Builder (toLazyByteString, charUtf8)
import Data.ByteString.Lazy.Builder.ASCII (intDec)
import Data.Foldable (foldMap)
import Data.Monoid (mappend)
import System.IO (openFile, IOMode(..))
main :: IO ()
main = do
    h <- openFile "/dev/null" WriteMode
    L.hPut h $ toLazyByteString $
        foldMap ((charUtf8 '\n' `mappend`) . intDec) [1..10000000]
Note that I output to /dev/null to avoid interference by the disk driver. The effort of moving the data to the OS remains the same. On my machine, the above code runs in 0.45 seconds, which is 12 times faster than the 5.4 seconds of your original code. This implies a throughput of 168 MB/s. We can squeeze out an additional 30% speed (220 MB/s) using bounded encodings.
import qualified Data.ByteString.Lazy.Builder.BasicEncoding as E
L.hPut h $ toLazyByteString $
    E.encodeListWithB
        ((\x -> (x, '\n')) E.>$< E.intDec `E.pairB` E.charUtf8)
        [1..10000000]
Their syntax looks a bit quirky because a BoundedEncoding a specifies the conversion of a Haskell value of type a to a bounded-length sequence of bytes such that the bound can be computed at compile-time. This allows functions such as E.encodeListWithB to perform some additional optimizations for implementing the actual filling of the buffer. See the documentation of Data.ByteString.Lazy.Builder.BasicEncoding for more information.
Here is the source of all my benchmarks.
The conclusion is that we can get very good performance from a declarative solution provided that we understand the cost model of our implementation and use the right datastructures. Whenever constructing a packed sequence of values (e.g., a sequence of bytes represented as a bytestring), then the right datastructure to use is a bytestring Builder.
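The module names used above come from an early builder release. In the bytestring versions shipped with recent GHCs, the builder lives in Data.ByteString.Builder, and the same idea can be sketched as below. This assumes a reasonably recent base and bytestring (for <> in the Prelude and for char7); numberLine, numberLines and render are names made up for the example, and unlike the code above it puts the newline after each number rather than before it. No benchmark figures are implied.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Data.ByteString.Builder (Builder, char7, intDec, toLazyByteString)

-- One decimal number followed by a newline.
numberLine :: Int -> Builder
numberLine k = intDec k <> char7 '\n'

-- All numbers from 1 to n, one per line, as a single Builder.
numberLines :: Int -> Builder
numberLines n = foldMap numberLine [1 .. n]

-- Turn the Builder into a lazy ByteString, i.e. a sequence of chunks.
render :: Int -> L.ByteString
render = toLazyByteString . numberLines
```

To write the result, L.putStr (render 10000000) works, as does hPutBuilder stdout (numberLines 10000000) with hPutBuilder from Data.ByteString.Builder and stdout from System.IO.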
