Using `FilePattern`/wildcards in the Shake functions `want` and `need`

The functions `want` and `need` both require that their input is of type `FilePath` rather than `FilePattern`. The filenames of my outputs and inputs follow a distinct pattern: the outputs are `_build/*checker.sh` and the inputs are `./*.py`. Therefore I would rather write a `want` of the form
want ["_build/*checkers.sh"]
than
want ["_build/dbchecker.sh", "_build/henk_checker.sh", ..., "_build/derp_checker.sh"]
I tried building a more complex want by combining getDirectoryFiles, action, and need, but that doesn't work, since getDirectoryFiles returns Action [FilePath] rather than [FilePath].
What would be the proper solution to this problem?

Action, which is what getDirectoryFiles returns, is a Monad, so you can use do-notation:
do paths <- getDirectoryFiles "" ["_build//*checkers.sh"]
   want paths
or just
getDirectoryFiles "" ["_build//*checkers.sh"] >>= want
or
want =<< getDirectoryFiles "" ["_build//*checkers.sh"]
EDIT: as per Neil Mitchell's remarks, want needs to be replaced with need.

Erik's remark about Action being a monad has proven to be very helpful. The problem is that action already resides within the Rules monad, not the other way around (sorry Erik, for not being specific enough). The following is the code I eventually settled on.
import Development.Shake
import Development.Shake.Command
import Development.Shake.FilePath
import Development.Shake.Util

main :: IO ()
main = shakeArgs shakeOptions{shakeFiles="_build"} $
    action $ do
        dependencies <- getDirectoryFiles "" ["*checker.py"]
        let scripts = map (\file -> "_build" </> file -<.> "sh") dependencies
        need scripts
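Note that `need scripts` only requests the outputs; Shake still has to know how to build each `_build/*checker.sh`. A minimal sketch of a matching rule, where the actual generation step is an assumption on my part (the `cmd_ "python" ...` call is hypothetical; substitute whatever command produces your shell scripts):

```haskell
import Development.Shake
import Development.Shake.FilePath

main :: IO ()
main = shakeArgs shakeOptions{shakeFiles = "_build"} $ do
    -- request every _build/<name>checker.sh derived from a <name>checker.py
    action $ do
        sources <- getDirectoryFiles "" ["*checker.py"]
        need ["_build" </> src -<.> "sh" | src <- sources]
    -- rule describing how to build any one of those outputs
    "_build/*checker.sh" %> \out -> do
        let src = dropDirectory1 out -<.> "py"  -- _build/foochecker.sh -> foochecker.py
        need [src]
        cmd_ "python" [src, "--emit-shell", out]  -- hypothetical generation command
```

With the rule in place, Shake also tracks the `.py` files as dependencies, so editing a source rebuilds only the affected script.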


Capture group names with regex

I'm trying to use the regex package (with TDFA), and the only reference I can find for named capture groups is a quasi-quoter (cp) with no explanation.
Basically I have regexes in a config file, compiled at runtime with compileRegex and applied to lines, and I'm looking to extract a couple of specific capture groups from the result, with names I do know at compile time.
So I'm looking for a function that would take the Matches a and a string (the capture group name), and would return, I guess, a Maybe a, depending on whether that capture group matched or not. Or even just a way to extract the capture group names from the Match a; that would be enough.
The docs mention the quasi-quoter cp, but give no explanation of how it's supposed to be used, and I'm not even sure I could use it even if I knew how, because I compile my regexes at runtime.
Would anyone have examples with named capture groups?
Thanks
You'll find the API you need in the Text.RE.Replace docs; I think one of the functions whose names start with capture will be what you're looking for. Here's an example:
module Main where

import Data.Text (pack)
import Text.RE.Replace
import Text.RE.TDFA.String

needle :: CaptureID
needle = IsCaptureName . CaptureName . pack $ "needle"

main :: IO ()
main = do
    re <- compileRegex "-${needle}([a-z]+)-"
    let match = "haystack contains the -word- you want" ?=~ re
    if matched match
        then print $ captureText needle match
        else putStrLn "(no match)"

How to extract the hidden layer features in H2ODeepLearningEstimator?

I found H2O has the function h2o.deepfeatures in R to pull the hidden layer features
https://www.rdocumentation.org/packages/h2o/versions/3.20.0.8/topics/h2o.deepfeatures
train_features <- h2o.deepfeatures(model_nn, train, layer=3)
But I didn't find any example in Python. Can anyone provide some sample code?
Most Python/R API functions are wrappers around REST calls. See http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/model/model_base.html#ModelBase.deepfeatures
So, to convert an R example to a Python one, make the model the object the method is called on (the R function's first argument becomes Python's self), and all the other arguments shuffle along. I.e. the example from the manual becomes (with dots in variable names changed to underscores):
prostate_hex = ...
prostate_dl = ...
prostate_deepfeatures_layer1 = prostate_dl.deepfeatures(prostate_hex, 1)
prostate_deepfeatures_layer2 = prostate_dl.deepfeatures(prostate_hex, 2)
Sometimes the function name changes slightly (e.g. h2o.importFile() vs. h2o.import_file()), so you need to hunt for it at http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html
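As a rough aid when hunting, the usual renaming is just camelCase to snake_case with the h2o. prefix dropped (since the Python versions are methods on objects). A small heuristic helper for guessing the Python name; this is my own sketch, not an official mapping, and some names differ in other ways:

```python
import re

def r_name_to_python(name):
    """Guess the Python name of an R h2o function: drop the 'h2o.' prefix
    and convert camelCase to snake_case. Only a heuristic for searching
    the Python docs, not a guaranteed mapping."""
    if name.startswith("h2o."):
        name = name[len("h2o."):]
    # insert an underscore before each uppercase letter that follows a
    # lowercase letter or digit, then lowercase it
    return re.sub(r"(?<=[a-z0-9])([A-Z])", lambda m: "_" + m.group(1).lower(), name)

print(r_name_to_python("h2o.importFile"))    # import_file
print(r_name_to_python("h2o.deepfeatures"))  # deepfeatures
```

When the guess turns up nothing, fall back to searching the Python docs index directly.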

decision tree in R- extract data from a specific branch

I am trying to build a classification decision tree using rpart and partykit, and I am wondering whether there is any function within those packages (or any package, for that matter) that would allow me to create a dataset containing the data from a specific subtree or branch.
I know that I can manually create the subset from the original data set with the decision tree rules, but I am trying to automate certain processes, and finding such a function would help me immensely.
Example:
library(rpart)
library(partykit)
data("Titanic", package = "datasets")
ttnc <- as.data.frame(Titanic)
ttnc <- ttnc[rep(1:nrow(ttnc), ttnc$Freq), 1:4]
names(ttnc)[2] <- "Gender"
rp <- rpart(Survived ~ Gender + Age + Class, data = ttnc)
prp <- as.party(rp)
prp[5]
Let's say that I want to extract the data from subtree #5. Is there any function within those packages that allows me to do that?
Thank you!
In addition to the solution posted by @JakobGepp, you can use the data_party() function provided by partykit:
data_party(prp, id = 5)
Essentially, this does the same thing internally that Jakob did explicitly by hand.
I don't know if this is what you meant by using the DT rules, but you could use the predict() function of the partykit package to predict the node / branch and then split the data according to your subtree.
ttnc$Node <- predict(prp, newdata = ttnc, type = "node")
subtree <- subset(ttnc, Node == 5)

Dynamic linking Haskell

I'm looking for a way to do dynamic linking.
The outline:
Let's say we have an app with many data filters that all share the same outline (function names, internally used datatypes, some exported datatypes of a single class).
It would be great to check the present .so files, load only those needed (based on command-line arguments), and run them.
I don't want to change or recompile the app every time a new module is added.
Is something like this possible today?
I tried some hacking with System.Plugins and failed every time. Sometimes one hates strong typechecking.
EDIT
If I write something like this directly and give a type hint at the call to makeChunks, it works; otherwise nothing does:
-- SharedClass: should create the common interface
class (Eq c, Show c) => Chunks c where
    makeChunks :: String -> [c]

-- Plugin: one concrete implementation
import SharedClass

data MyChunk = SmallChunk Char deriving (Eq)

instance Chunks MyChunk where
    makeChunks _ = [SmallChunk 's']

instance Show MyChunk where
    show (SmallChunk s) = [s]

-- main
import SharedClass

--load plugin somehow
print $ makeChunks "abcd"
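One way around the type-hint problem, sketched here with the hint package (my suggestion; the question only mentions System.Plugins): load the plugin module in an embedded interpreter at runtime and ask it for a monomorphic value, e.g. the already-shown strings, rather than a value of an unknown Chunks instance. The module and file names below are hypothetical:

```haskell
import Language.Haskell.Interpreter

-- Load Plugin.hs at runtime and evaluate makeChunks there, requesting a
-- monomorphic result ([String], via show) so the interpreter can resolve
-- the concrete Chunks instance defined inside the plugin itself.
main :: IO ()
main = do
    result <- runInterpreter $ do
        loadModules ["Plugin.hs"]        -- hypothetical plugin source file
        setTopLevelModules ["Plugin"]
        setImports ["Prelude"]
        interpret "map show (makeChunks \"abcd\")" (as :: [String])
    either print (mapM_ putStrLn) result
```

The price is that the plugin is interpreted from source rather than dynamically linked as a compiled .so, but the host app never needs recompiling when a new plugin appears.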

R text mining - Combining paragraphs one after the other without sentences mixing up

I'm a beginner in R and text mining, currently using the tm package.
I am trying to append the texts of two different documents in a corpus together.
When I use a statement like
c(corpus.doc[[1]], corpus.doc[[2]])
or a paste statement
paste(corpus.doc[[1]], corpus.doc[[2]])
the texts get combined line by line.
For example:
if
> corpus.doc[[1]]
He visits very often
and
sometimes more
> corpus.doc[[2]]
She also
stays
What I get with these statements is something like
He visits very often She also
and stays
sometimes more
How can I prevent that and instead get
He visits very often
and
sometimes more
She also
stays
Or is there an easy way to combine documents in the R tm package? Thank you in advance!
Additional info
When I use
a <- c( corpus.doc[[1]], corpus.doc[[2]], recursive=TRUE)
I get that a becomes a corpus with two documents, so the texts of these documents are still not combined. I would like
a[[1]]
gives me the combined text of corpus.doc[[1]] and corpus.doc[[2]].
str(corpus.doc)
Shows something like this
List of 4270
$ CREC-2011-01-05-pt1-PgE1-2.htm :Classes 'PlainTextDocument', 'TextDocument',
'character' atomic [1:74] html head titlecongression record volume issue
head ...
.. ..- attr(*, "Author")= chr(0)
.. ..- attr(*, "DateTimeStamp")= POSIXlt[1:1], format: "2009-01-17 15:45:25"
.. ..- attr(*, "Description")= chr(0)
.. ..- attr(*, "Heading")= chr(0)
.. ..- attr(*, "ID")= chr "CREC-2011-01-05-pt1-PgE1-2.htm"
And it keeps going on...
The help in pkg:tm says there is a c.Corpus function whose default setting for 'recursive' is FALSE, but which, if set to TRUE, may result in an "intelligent" merger. If you think corpus.doc is a list of Corpus-class objects, you might try:
c( corpus.doc[[1]], corpus.doc[[2]], recursive=TRUE)
... but it is not clear that you really do have "Corpus"-class objects.
str(corpus.doc) # see above
So the first element in that very long list is not a Corpus-classed object, but rather a PlainTextDocument.
Further to my comment, how about if you combine your plain text documents in R before creating the corpus? For example, if 1.txt, 2.txt and 3.txt are plain text files, you can read them into R like so
a <- readLines(file("C:/Users/X/Desktop/1.txt"))
b <- readLines(file("C:/Users/X/Desktop/2.txt"))
c <- readLines(file("C:/Users/X/Desktop/3.txt"))
and then you could combine them, similar to your example above
abc <- c(a, b, c)
That will stack the documents up in order and preserve line-by-line format in a single data object, as you request. However, if you then make this into a corpus with
abc.corpus <- Corpus(VectorSource(abc)) # not what you want
then you'll get a corpus with as many documents as lines, which doesn't sound like what you want. Instead what you need to do is combine the text objects into a single string first, like this
abc.paste <- paste(c(a, b, c), collapse=' ') # this is what you want
so that the resulting abc.paste object is a single line. Then when you make a corpus using
abc.corpus <- Corpus(VectorSource(abc.paste))
the result will be "A corpus with 1 text document", which you can then analyse with functions in the tm package.
It should be straightforward to extend this into a function to efficiently concatenate your 7000+ plain text documents and then make a corpus from the resulting data object. Does that get you any closer to what you want to do?
