Using BioHaskell, how can I read a FASTA file containing amino acid sequences?
I want to be able to:
Get a list of String sequences
Get a Map String String (from Data.Map) from the FASTA comment (assumed to be unique) to the sequence String
Use the sequences in algorithms implemented in BioHaskell.
Note: This question intentionally does not show research effort as it was immediately answered in a Q&A-style manner.
Extracting raw sequence strings
We will assume from now on that the file aa.fa contains some amino acid FASTA sequences. Let's start with a simple example that extracts a list of sequences.
import Bio.Sequence.Fasta (readFasta)
import Bio.Sequence.SeqData (seqdata)
import qualified Data.ByteString.Lazy.Char8 as LB

main = do
    sequences <- readFasta "aa.fa"
    let listOfSequences = map (LB.unpack . seqdata) sequences :: [String]
    -- Just for show, we will print one sequence per line here.
    -- This basically executes putStrLn for each sequence.
    mapM_ putStrLn listOfSequences
readFasta returns IO [Sequence Unknown]. Basically, that means there is no information about whether the sequences contain amino acids or nucleotides.
Note that we use LB.unpack instead of show here, because show adds double quotes (") at the beginning and the end of the resulting String. Using LB.unpack works because in the current BioHaskell version 0.5.3, SeqData is just defined as a lazy ByteString.
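A quick GHCi session illustrates the difference (illustrative only; LB is the qualified import from above):
ghci> show (LB.pack "MGLIF")
"\"MGLIF\""
ghci> LB.unpack (LB.pack "MGLIF")
"MGLIF"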
Converting to AA/Nucleotide sequences
We can fix the Unknown typing by using castToAmino or castToNuc:
let aaSequences = map castToAmino sequences :: [Sequence Amino]
Note that these functions currently (as of BioHaskell version 0.5.3) do not perform any validity checks. You can then use the resulting [Sequence Amino] or [Sequence Nuc] in the BioHaskell algorithms.
Lookup sequence by FASTA header
We will now assume that our aa.fa contains a sequence
>abc123
MGLIFARATNA...
Now, we will build a Map String String (we will use Data.Map.Strict in this example) from the FASTA file. We can use this map to look up the sequence by its header.
The lookup will yield a Maybe String. The intended behaviour in this example is to print the sequence if it was found, or not to print anything if nothing was found in the Map.
As Maybe is Foldable, we can use Data.Foldable.mapM_ for this task.
import Prelude hiding (mapM_)  -- avoid ambiguity with Data.Foldable's mapM_ (needed on GHC < 7.10)
import Bio.Sequence.Fasta (readFasta)
import Bio.Sequence.SeqData (Sequence, seqdata, seqheader)
import qualified Data.ByteString.Lazy.Char8 as LB
import Data.Foldable (mapM_)
import qualified Data.Map.Strict as Map

-- | Convert a Sequence to a String tuple (sequence label, sequence)
sequenceToMapTuple :: Sequence a -> (String, String)
sequenceToMapTuple s = (LB.unpack $ seqheader s, LB.unpack $ seqdata s)

main = do
    sequences <- readFasta "aa.fa"
    -- Build the sequence map (keyed by header)
    let sequenceMap = Map.fromList $ map sequenceToMapTuple sequences
    -- Look up the sequence for the "abc123" header
    mapM_ print $ Map.lookup "abc123" sequenceMap
Edit: Thanks to @GabrielGonzalez's suggestion, the final example now uses Data.Foldable.mapM_ instead of Data.Maybe.fromJust.
Related
Hello, I am having problems reading back a list of tuples of lists after saving and appending it to a file.
Saving something into a file works without problems.
I am saving into a file with
import qualified Data.ByteString as BS
import qualified Data.Serialize as S (decode, encode)
import Data.Either
toFile path = do
    let a = take 1000 [100..] :: [Float]
    let b = take 100 [1..] :: [Float]
    BS.appendFile path $ S.encode (a, b)
and reading with
fromFile path = do
    bstr <- BS.readFile path
    let d = S.decode bstr :: Either String ([Float], [Float])
    return (Right d)
but reading from that file with fromFile only gives me one element of it, although I append to that file multiple times.
Since I'm appending to the file, it should have multiple elements inside it, so I'm missing something like a map in my fromFile function, but I couldn't work out how.
I appreciate any help or any other solutions, so using Data.Serialize and ByteString is not a must. Other possibilities I thought of are JSON files with Data.Aeson if I can't get it to work with Serialize.
Edit:
I realized that I made a mistake in the decoding type in fromFile
let d = S.decode bstr :: Either String ([Float],[Float])
It should be like this:
let d = S.decode bstr :: Either String [([Float],[Float])]
The Problem in Brief
The default format used by serialize (or binary) encoding isn't trivially append-able.
The Problem (Longer)
You say you appended:
S.encode (a,b)
to the same file "multiple times". So the format of the file is now:
[ 64-bit length field | encoded floats | 64-bit length field | encoded floats ]
Repeated however many times you appended to the file. That is, each append adds new length fields and lists of floats while leaving the old values in place.
After that you returned to read the file and decode some floats using, morally, S.decode <$> BS.readFile path. This will decode the first two lists of floats by reading the first length field (from the first time you wrote to the file), then the following floats, then the second length field followed by its related floats. After reading the stated lengths' worth of floats, the decoder stops.
It should now be clear that appending more data does not make your encoding or decoding script look for any additional data. The default format used by serialize (or binary) encoding isn't trivially append-able.
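A small GHCi sketch of that behaviour (illustrative and untested; cereal's decode simply ignores any trailing bytes):
ghci> import qualified Data.ByteString as BS
ghci> import qualified Data.Serialize as S
ghci> let bs = BS.append (S.encode ([1,2] :: [Float], [3] :: [Float])) (S.encode ([4] :: [Float], [5] :: [Float]))
ghci> S.decode bs :: Either String ([Float], [Float])
Right ([1.0,2.0],[3.0])
Only the first appended chunk is decoded; the second is silently ignored.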
Solutions
You mentioned switching to Aeson, but using JSON to encode instead of binary won't help you. Decoding two appended JSON strings like { "first": [1], "second": [2]}{ "first": [3], "second": [4]} is logically the same as your current problem. You have some unknown number of appended chunks of lists, so just write a decoder that keeps trying:
import Data.Serialize as S
import Data.Serialize.Get as S
import qualified Data.ByteString as BS

fromFile path = do
    bstr <- BS.readFile path
    let d = S.runGet getMultiChunks bstr :: Either String ([Float], [Float])
    return d  -- d is already an Either, so don't wrap it in Right again

getMultiChunks :: Get ([Float], [Float])
getMultiChunks = go ([], [])
  where
    go (l, r) = do
        b <- isEmpty
        if b
            then pure (l, r)  -- end of input: return everything accumulated
            else do
                (lNext, rNext) <- S.get
                go (l ++ lNext, r ++ rNext)  -- inefficient
So we've written our own getter (untested) that checks whether any bytes remain and, if so, decodes another pair of lists of floats. Each time it decodes a new chunk, it appends it to the previously accumulated lists (which is inefficient; use something like a difference list if you want it to be respectable).
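For completeness, a hedged usage sketch (untested; assumes the toFile from the question and that test.bin starts out empty):

main :: IO ()
main = do
    toFile "test.bin"   -- append once ...
    toFile "test.bin"   -- ... and again
    r <- fromFile "test.bin"
    case r of
        Left err       -> putStrLn ("decode failed: " ++ err)
        Right (ls, rs) -> print (length ls, length rs)  -- expect (2000,200)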
I was wondering if there's an easy way to get lines one at a time out of a file without eventually loading the whole file into memory. I'd like to do a fold over the lines with an attoparsec parser. I tried using Data.Text.Lazy.IO with hGetLine and that blows through my memory. I later read that it eventually loads the whole file.
I also tried using pipes-text with folds and view lines:
s <- Pipes.sum $
    folds (\i _ -> (i + 1)) 0 id (view Text.lines (Text.fromHandle handle))
print s
to just count the number of lines, and it seems to be doing some wonky stuff, "hGetChunk: invalid argument (invalid byte sequence)", and it takes 11 minutes whereas wc -l takes 1 minute. I heard that pipes-text might have some issues with gigantic lines? (Each line is about 1 GB.)
I'm really open to any suggestions, can't find much searching except for newbie readLine how-tos.
Thanks!
The following code uses Conduit, and will:
UTF8-decode standard input
Run the lineC combinator as long as there is more data available
For each line, simply yield the value 1 and discard the line content, without ever reading the entire line into memory at once
Sum up the 1s yielded and print it
You can replace the yield 1 code with something that processes the individual lines; a sketch follows the program below.
#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit
main :: IO ()
main = (runConduit
     $ stdinC
    .| decodeUtf8C
    .| peekForeverE (lineC (yield (1 :: Int)))
    .| sumC) >>= print
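For example, here is a hedged variant of the same program that yields each line's length instead of 1 (untested; lengthCE is assumed to be re-exported by the Conduit module):

#!/usr/bin/env stack
-- stack --resolver lts-8.4 --install-ghc runghc --package conduit-combinators
import Conduit

-- Print the length of every line on standard input,
-- still without holding a whole line in memory at once
main :: IO ()
main = runConduit
     $ stdinC
    .| decodeUtf8C
    .| peekForeverE (lineC (lengthCE >>= yield))
    .| mapM_C print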
This is probably easiest as a fold over the decoded text stream:
{-# LANGUAGE BangPatterns #-}
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Text.Encoding as PT
import qualified Control.Foldl as L
import qualified Control.Foldl.Text as LT
import Control.Monad (void)

main = do
    n <- L.purely P.fold (LT.count '\n') $ void $ PT.decodeUtf8 PB.stdin
    print n
It takes about 14% longer than wc -l for the file I produced, which was just long lines of commas and digits. IO should properly be done with Pipes.ByteString, as the documentation says; the rest is conveniences of various sorts.
You can map an attoparsec parser over each line, distinguished by view lines, but keep in mind that an attoparsec parser can accumulate the whole text as it pleases and this might not be a great idea over a 1 gigabyte chunk of text. If there is a repeated figure on each line (e.g. word separated numbers) you can use Pipes.Attoparsec.parsed to stream them.
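A hedged sketch of that last idea (untested; it assumes pipes-attoparsec's parsed and attoparsec's double, and it simply stops at the first parse failure or end of input):

import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as PB
import qualified Pipes.Text.Encoding as PT
import qualified Pipes.Attoparsec as PA
import Data.Attoparsec.Text (double, skipSpace)
import Control.Monad (void)

-- Stream whitespace-separated numbers from stdin one at a time,
-- never accumulating a whole line in memory
main :: IO ()
main = runEffect $
    void (PA.parsed (skipSpace *> double) (void (PT.decodeUtf8 PB.stdin)))
    >-> P.print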
I need to manipulate the binary encoding, as '0' and '1' characters, of simple strings given as input, using 7-bit ASCII.
For the encoding I have used the function Data.ByteString.Lazy.Builder.string7 :: String -> Builder.
However, I have not found a way to convert the resulting Builder object back into a string of '0' and '1'. Is it possible? Is there another way?
Subsidiary question: and what if I wanted it in hexadecimal form, as text?
There's an unpackChars function in Data.ByteString.Lazy.Internal. There's also a non-lazy counterpart in Data.ByteString.Internal.
import qualified Data.ByteString.Lazy.Builder as Build
import qualified Data.ByteString.Lazy as BS
import qualified Data.ByteString.Lazy.Internal as BSI

ghci> BSI.unpackChars $ Build.toLazyByteString $ Build.string7 "010101"
"010101"
You can also use map (chr . fromIntegral) . BS.unpack (with chr from Data.Char) instead of unpackChars, but unpackChars is probably faster.
Alternatively, as Michael Snoyman commented below, you could use Data.ByteString.Char8 or its lazy version and you'll get the right conversions to begin with.
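As for the subsidiary question, here is a minimal sketch, assuming a bytestring version whose Data.ByteString.Builder exports lazyByteStringHex:

import qualified Data.ByteString.Builder as B
import qualified Data.ByteString.Lazy.Char8 as LBC

-- Render the bytes of a lazy ByteString as hexadecimal text
toHexString :: LBC.ByteString -> String
toHexString = LBC.unpack . B.toLazyByteString . B.lazyByteStringHex

-- ghci> toHexString (B.toLazyByteString (B.string7 "01"))
-- "3031"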
I am using the System.FilePath.Find module of filemanip to recursively find all files I need to process (here I will just be printing to the console as the action to perform, in order not to confuse things). Now, this code:
import System.Environment (getArgs)
import System.FilePath (FilePath)
import System.Directory (doesDirectoryExist, getDirectoryContents, doesFileExist)
import Control.Monad
import System.FilePath.Find (find, always, fileType, (==?), FileType(..), (&&?), extension)

main = do
    [dbFile, input] <- getArgs
    files <- findFiles input
    mapM_ putStrLn files
    return ()

searchExtension :: String
searchExtension = ".hs"

findFiles :: FilePath -> IO [String]
findFiles = find always (fileType ==? RegularFile &&? extension ==? searchExtension)
works well with this call
./myprog tet .
In this case, the first argument is ignored (it will be the output database file later) and the second argument is searched recursively for matching files. It also allows me to specify just a single file, which is just perfect!
BUT, I would like to be able to specify
./myprog tet path1 path2 path4 file1
but this of course fails in the pattern matching:
./myprog tet . .
myprog: user error (Pattern match failure in do expression at myprog.hs:11:9-22)
Now, how do I make this program more flexible, so that I can take more than two arguments?
Sorry for asking this, actually, but my Haskell knowledge is limited, though increasing with every new thing I have to do in my first project.
Well, you can use a different pattern like:
(dbFile:inputs) <- getArgs
where dbFile will match the first argument passed, while inputs will match any number of file names (even zero; if you want at least one path name, use inputs@(_:_) instead of the simple inputs).
Then you can use mapM to call findFiles for each path in inputs:
files <- mapM findFiles inputs
mapM_ putStrLn $ concat files
Instead of mapM you could modify findFiles to accept a [FilePath] argument instead of a simple FilePath, as sketched below.
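For instance, a minimal sketch of that variant (the name findAllFiles is hypothetical):

-- Search several starting points, reusing findFiles from the question
findAllFiles :: [FilePath] -> IO [String]
findAllFiles = fmap concat . mapM findFiles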
Note that to parse command-line arguments you could consider using a module like System.Console.GetOpt. You should also read this page about argument handling.
Old
Suppose we have a string, e.g. "1101000111001110".
Now I have a pattern to be searched for, e.g. "1101". I am new to the Haskell world and trying it at my end. I am able to do it in C but need to do it in Haskell.
Given pattern: "1101000111001110"
Pattern to be searched: "1101"
Desired output: "Pattern Found"
New
import Data.List (isInfixOf)

main = do
    x <- readFile "read.txt"
    putStr x

isSubb :: [Char] -> [Char] -> Bool
isSubb sub str = isInfixOf sub str
This code reads a file named "read.txt", which contains the string 110100001101. Using isInfixOf you can check for the pattern "1101" in the string, and the result will be True.
But the problem is I am not able to search for "1101" in the string present in "read.txt".
I need to compare the string in "read.txt" with a user-provided string, i.e.
one string is in the file "read.txt"
and the second string the user will provide (user-defined), and we perform a search to find whether the user-defined string is present in the string from "read.txt".
Answer to new:
To achieve this, you have to use readLn:
sub <- readLn
readLn reads input until a \n is encountered, and <- binds the result to sub. Watch out: if the input is to be read as a String, you have to explicitly type the quotation marks (") around your string.
Alternatively, if you do not feel like typing the quotation marks every time, you can use getLine in place of readLn; it has the type IO String, which becomes a String after being bound to sub.
For further information on all functions included in the standard libraries of Haskell see Hoogle. Using Hoogle you can search functions by various criteria and will often find functions which suit your needs.
Answer to old:
Use the isInfixOf function from Data.List to search for the pattern:
import Data.List (isInfixOf)
isInfixOf "1101" "1101000111001110" -- outputs true
It returns true if the first sequence exists in the second and false otherwise.
To read a file and get its contents use readFile:
contents <- readFile "filename.txt"
You will get the whole file as one string, which you can now perform standard functions on.
Outputting "Pattern found" should be trivial then.
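For instance, a minimal end-to-end sketch (the file name and messages are taken from the question):

import Data.List (isInfixOf)

-- Read the stored string, read the user's pattern, report a match
main :: IO ()
main = do
    contents <- readFile "read.txt"
    sub <- getLine  -- pattern typed by the user, no quotes needed
    if sub `isInfixOf` contents
        then putStrLn "Pattern Found"
        else putStrLn "Pattern Not Found"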