How to run Haskell code by its name?

First of all, I'm new to Haskell, and I'm curious how I can implement something that I have working in Java.
A bit of background:
I have a Java-based project with workers. Workers can be started periodically to do some work. A worker is just a Java class with some functionality implemented. New workers can be added and configured. Workers' names and parameters are stored in a database. Periodically, the application gathers the workers' details (names, params) from the DB and starts them. In Java, to start a worker by its name, I do the following:
import java.lang.reflect.Constructor;

// Load the class by its fully qualified name, grab its no-arg constructor,
// instantiate it, and run it (MyClass is the common worker supertype)
Class<?> myClass = Class.forName("com.mycompany.superworker");
Constructor<?> constructor = myClass.getConstructor();
MyClass myInstance = (MyClass) constructor.newInstance();
myInstance.run("param1", "param2");
Hence, the application gets the worker's name (the actual class name) from the DB, gets the class and its constructor, creates a new instance of the class, and runs it.
Now, the question: how could I implement something similar in Haskell? That is, if I have some function/module/class/whatever implemented in Haskell, and its name is stored as plain text, how can I run this code by its name (from the main Haskell-based application, of course)?
Thanks
UPDATE
A bit more about the app...
I have an application that grabs some data from the Internet, does some parsing work, and puts the result in a DB. We grab data from a variety of websites, so we have a variety of parsers. These parsers are workers. A user can implement their own worker (a Java class) and add its details to the DB via the UI. So we store the workers' names (and their params) in the DB. When it's time, we fetch the workers' class names from the DB, then instantiate and start each worker.
Workers do not need to communicate with each other, and the application does not need to communicate with the workers. The application just starts a worker; the worker grabs data from the web, does some parsing, and puts the result into the DB. That's it.
So a worker can be launched as a separate process.
The main problem (as I see it) is that we don't have a fixed set of workers. A user can implement their own worker, compile it, restart the application, and the application should know how to start this new worker. So we store the workers' class names in the DB and use Java reflection to launch them.
I'm looking for how such an app could be written in Haskell - in a Haskell way, not necessarily by copying the existing Java approach.

Related

What is the intended usage of Qt threads in conjunction with dependency injection?

Let's have a worker thread which is accessed from a wide variety of objects. This worker object has some public slots, so anyone who connects their signals to the worker's slots can use emit to trigger the worker thread's useful tasks.
This worker thread needs to be almost global, in the sense that several different classes use it, some of them deep in the hierarchy (a child of a child of a child of the main application).
I guess there are two major ways of doing this:
1. All the methods of the child classes pass their messages up the hierarchy via their return values, and the main (e.g. the GUI) object handles all the emitting.
2. All the classes that require the services of the worker thread hold a pointer to the Worker object (which is a member of the main class), and they all connect() to it in their constructors. Every such class then does the emitting by itself. Basically, dependency injection.
Option 2 seems much cleaner and more flexible to me; I'm only worried that it will create a huge number of connections. For example, if I have an array of objects that need the thread, I will have a separate connection for each element of the array.
Is there an "official" way of doing this, as the creators of Qt intended it?
There is no magic silver bullet for this. You'll need to consider many factors, such as:
Why do those objects emit the data in the first place? Is it because they need to get something done, i.e. the emission is a "command"? Then maybe they could call some sort of service to do the job, without even worrying about whether it happens in another thread or not. Or is it because they are informing about an event? In that case they probably should just emit signals but not connect them; it's up to the consuming code to decide what to do with the events.
How many objects are we talking about? Some performance tests are needed; maybe it's not even an issue.
If there is an array of objects, what purpose does it serve? Perhaps instead of a plain array some sort of "container" class is needed? Then the container could handle the emission and connection, and the objects could just do something like container()->handle(data). You'd then have only one connection per container (see the sketch below).
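As a rough, framework-agnostic sketch of that container idea (plain Java rather than Qt/C++, with a Consumer standing in for the signal/slot connection; all names here are illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Elements delegate to their container, and only the container holds the
// single connection to the worker -- one connection per container, not one
// per element.
class Container<T> {
    private final List<T> items = new ArrayList<>();
    private final Consumer<T> workerConnection; // stands in for the signal/slot connection

    Container(Consumer<T> workerConnection) {
        this.workerConnection = workerConnection;
    }

    void add(T item) {
        items.add(item);
    }

    // Elements call container.handle(data) instead of emitting themselves
    void handle(T data) {
        workerConnection.accept(data);
    }
}

The point of the design is that adding an element to the array no longer adds a connection; only constructing a new container does.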

Using Google map objects within a web worker?

The situation:
Too much work is running in the main thread of a page that builds a Google map with overlays representing ZIP territories (from US census data) and, at the client's request, groups territories into discrete groups. While there is no major issue on desktops, mobile devices (the iPad) decide that the thread is taking too long (a max of 6 seconds after the data returns) and therefore must have crashed.
Solution: Offload the looping function that gathers the points for each shape from each row to a web worker, which can work as fast or slow as resources allow on a mobile device. (Three for loops: the 1st to select the row, the 2nd to select the column, the 3rd for each point within the column. Execution time: a matter of 3-6 seconds total for 2000+ rows with numerous points.)
The catch: For this to be properly efficient, the points must be assembled into a shape (polygon) within the web worker. HOWEVER, since it is a google.maps.Polygon object made up of google.maps.LatLng objects, the web worker needs to have some knowledge of what those items are. Web workers cannot use window or the DOM, so the worker must import the script, and the intent was to pass back just the object as a JSON-encoded item. The code fails on any reference to Google objects, even with importScripts(), because those objects rely on the window element.
Further complications: Google's API is technically proprietary. The web app code this is for is bound by an NDA, so pointed questions can be asked, but not a copy/paste of all the code.
The solution/any vague ideas:???
TL;DR: I need to access the google.maps.LatLng object and create new instances of it (minimally) within a web worker. The web worker should either return objects ready to be popped into a google.maps.Polygon or return a google.maps.Polygon itself. How do I reference the Google Maps API if I cannot use the default method of importing scripts, due to its dependence on the window object?
UPDATE: Since writing this, I've managed to offload the majority of the grunt work from the main thread to the web worker, allowing it to parse through the data asynchronously and assign the data to a custom-made LatLng object.
The catch now is getting the returned values to run the function in the proper context, to see if the custom LatLng is sufficient for google.maps.Polygon to work its magic.
Excerpt from the file that calls the web worker and listens for its response (CoffeeScript):
shapeWorker.onmessage = (event) ->
    console.log "--------------------TESTING---------------"
    data = JSON.parse(event.data)
    console.log data
    generateShapes(data.poly, data.center, data.zipNum)
For some reason, it is trying to evaluate generateShapes in the context of the web worker rather than in the context of the class it's in.
Once again, it was a complication of too many things going on at once. The scope was restricted due to the usage of -> rather than =>, which widens the scope to include the parent class's functions.
Apparently the remaining issue resided with the version of iOS this web app needed to run on, and a bug that set the storage limit arbitrarily low (a tenth of its previous size). With some shrinking of the data and a fix for the iOS version in question, I was able to get it running without web workers. One day I may come back to it with web workers to increase efficiency.

Storm Bolt Database Connection

I am using Storm (Java) with Cassandra.
One of my bolts inserts data into Cassandra. Is there any way to hold the connection to Cassandra open between instantiations of this bolt?
My application writes at a high rate. The bolt needs to run several times a second, and performance is being hindered by the fact that it connects to Cassandra each time.
It would run a lot faster if I could have a static connection that was held open, but I am not sure how to achieve this in Storm.
To clarify the question:
What is the scope of a static connection in a Storm topology?
Unlike other messaging systems, which have workers where the "work" goes on in a loop or callback that can make use of a variable (maybe a static connection) outside this loop, Storm's bolts seem to be instantiated each time they are called and cannot have parameters passed in to them. So how can I use the same connection to Cassandra?
"Storm's bolts seem to be instantiated each time they are called and cannot have parameters passed in to them"
It's not exactly right to say that Storm bolts get instantiated each time they are called. For example, the prepare method only gets called during the initialization phase, i.e. only once. The doc says it is "called when a task for this component is initialized within a worker on the cluster. It provides the bolt with the environment in which the bolt executes."
So the best bet would be to put the initialization code in the prepare method (or open, in the case of spouts), as these are called when the tasks start up. But you need to make it thread-safe, as it will be called by every task concurrently, each in its own thread.
The execute(Tuple tuple) method, on the other hand, is responsible for the actual processing logic and is called every time a tuple arrives from the corresponding spouts or bolts (so this is what actually gets called every single time the bolt runs).
The cleanup method is called when an IBolt is about to be shut down; the documentation says: "There is no guarantee that cleanup will be called, because the supervisor kill -9's worker processes on the cluster. The one context where cleanup is guaranteed to be called is when a topology is killed when running Storm in local mode."
So it's not true that you can't pass a variable to a bolt: you can initialize instance variables in the prepare method and then use them during processing, as sketched below.
Regarding the DB connection, I am not exactly sure about your use case as you have not posted any code, but maintaining a pool of connections sounds like a good choice to me.
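Putting that together, here is a minimal sketch of a bolt that opens its connection once in prepare() and reuses it in execute(). This assumes the classic backtype.storm API; CassandraClient is a hypothetical wrapper standing in for whatever client library you actually use:

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class CassandraWriterBolt extends BaseRichBolt {

    // transient: created on the worker in prepare(), never serialized with the topology
    private transient CassandraClient client; // hypothetical connection wrapper
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        // Runs once per task when it starts on a worker: open the
        // connection here and reuse it for every tuple
        this.collector = collector;
        this.client = new CassandraClient((String) conf.get("cassandra.host"));
    }

    @Override
    public void execute(Tuple tuple) {
        // Runs for every incoming tuple, reusing the connection from prepare()
        client.insert(tuple.getStringByField("key"), tuple.getStringByField("value"));
        collector.ack(tuple);
    }

    @Override
    public void cleanup() {
        // Not guaranteed to run on a cluster, but close the connection when it does
        if (client != null) {
            client.close();
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This bolt is a sink and emits nothing
    }
}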

Storing job-specific data

In my Play application I have several Jobs, and I have a singleton class.
What I would like is for each Job to store data in the singleton class, and to be able to retrieve from the singleton class, via yet another class, the data that corresponds to the currently executing Job.
In other words I would like to have something like this:
- Job 1 stores "Job1Data" in the singleton class
- Job 2 stores "Job2Data" in the singleton class
- Another class asks the singleton class for the data of the currently executing job (on the current thread, I guess) and uses it
To make this work, I assumed each Job runs on a different thread. So the data each Job stores in the singleton class goes into a Map that maps the current thread id to the data.
However, I'm not sure this is the way I should do it, because it may not be thread-safe (although Hashtable is said to be thread-safe), and perhaps a new thread is created each time a Job executes, which would make my Map grow a lot and never clear itself.
I thought of another way to do what I want: maybe I could use the ThreadLocal class in my singleton, to be sure it's thread-safe and stores thread-specific data. However, I don't know if it will work well if a different thread is used each time a Job executes. Furthermore, I read somewhere that ThreadLocal creates memory leaks if the data is not removed, and the problem is that I don't know when I can remove the data.
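For concreteness, here is a minimal sketch of the ThreadLocal variant I'm describing (JobDataHolder and its methods are illustrative names, not an existing API):

// Singleton-style holder: each thread sees only its own data.
public final class JobDataHolder {
    private static final ThreadLocal<Object> CURRENT = new ThreadLocal<Object>();

    private JobDataHolder() {}

    public static void set(Object jobData) {
        CURRENT.set(jobData);
    }

    public static Object get() {
        return CURRENT.get();
    }

    // Call from a finally block at the end of the Job to avoid the
    // memory leak mentioned above
    public static void clear() {
        CURRENT.remove();
    }
}

Used from a Job, the end of the job body is the natural point to remove the data:

// Inside the Job:
JobDataHolder.set("Job1Data");
try {
    // ... job body; any class running on this thread can call JobDataHolder.get()
} finally {
    JobDataHolder.clear();
}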
So, would anybody have a solution for my issue? I would like to be sure the data I store during Job execution is kept in a global class and can be accessed by another class (with access to the data of the correct Job, and thus the correct thread, I guess).
Thank you for your help.

How to process rows of a CSV file using Groovy/GPars most efficiently?

The question is a simple one and I am surprised it did not pop up immediately when I searched for it.
I have a CSV file, a potentially really large one, that needs to be processed. Each line should be handed to a processor until all rows are processed. For reading the CSV file, I'll be using OpenCSV, which essentially provides a readNext() method that gives me the next row. If no more rows are available, all processors should terminate.
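For reference, the plain readNext() loop looks roughly like this (assuming OpenCSV's classic au.com.bytecode.opencsv package; process() is a stand-in for the actual row handling):

import java.io.FileReader;
import au.com.bytecode.opencsv.CSVReader;

public class SequentialRead {
    public static void main(String[] args) throws Exception {
        CSVReader reader = new CSVReader(new FileReader("data.csv"));
        String[] row;
        // readNext() returns one row per call, or null once the file is exhausted
        while ((row = reader.readNext()) != null) {
            process(row); // hand the row to a processor
        }
        reader.close();
    }

    static void process(String[] row) { /* ... */ }
}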
For this I created a really simple Groovy script, defined a synchronized readNext() method (as reading the next line is not really time consuming), and then created a few threads that read the next line and process it. It works fine, but...
Shouldn't there be a built-in solution that I could just use? It's not GPars' collection processing, because that always assumes an existing collection in memory. I cannot afford to read the whole file into memory and then process it; that would lead to OutOfMemory exceptions.
So.... anyone having a nice template for processing a CSV file "line by line" using a couple of worker threads?
Concurrently accessing a file might not be a good idea, and GPars' fork/join processing is only meant for in-memory data (collections). My suggestion would be to read the file sequentially into a list. When the list reaches a certain size, process the entries in the list concurrently using GPars, clear the list, and then move on with reading lines.
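In plain Java (the same shape applies in Groovy), that batching idea looks roughly like this; processRow() stands in for whatever per-row work is needed:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkedCsvProcessor {
    static final int CHUNK_SIZE = 1000;

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        List<String> chunk = new ArrayList<String>(CHUNK_SIZE);
        String line;
        while ((line = in.readLine()) != null) {
            chunk.add(line);
            if (chunk.size() == CHUNK_SIZE) {
                processChunk(pool, chunk); // blocks until the whole chunk is done
                chunk.clear();
            }
        }
        if (!chunk.isEmpty()) {
            processChunk(pool, chunk); // leftover lines
        }
        in.close();
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    static void processChunk(ExecutorService pool, List<String> lines) throws InterruptedException {
        List<Callable<Void>> tasks = new ArrayList<Callable<Void>>();
        for (final String row : lines) {
            tasks.add(new Callable<Void>() {
                public Void call() {
                    processRow(row);
                    return null;
                }
            });
        }
        pool.invokeAll(tasks); // waits for every row in the chunk to finish
    }

    static void processRow(String row) { /* parse and handle one CSV row */ }
}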
This might be a good problem for actors. A synchronous reader actor could hand off CSV lines to parallel processor actors. For example:
@Grab(group='org.codehaus.gpars', module='gpars', version='0.12')
import groovyx.gpars.actor.DefaultActor
import groovyx.gpars.actor.Actor

// readCsv() and processCsv() are placeholders for the actual file I/O
// and per-row work
class CsvReader extends DefaultActor {
    void act() {
        loop {
            react {
                reply readCsv()
            }
        }
    }
}

class CsvProcessor extends DefaultActor {
    Actor reader
    void act() {
        loop {
            reader.send(null)
            react {
                if (it == null)
                    terminate()
                else
                    processCsv(it)
            }
        }
    }
}

def N_PROCESSORS = 10
def reader = new CsvReader().start()
(0..<N_PROCESSORS).collect { new CsvProcessor(reader: reader).start() }*.join()
I'm just wrapping up an implementation of a problem just like this in Grails (you don't specify whether you're using Grails, plain Hibernate, plain JDBC, or something else).
There isn't anything out of the box that I'm aware of. You could look at integrating with Spring Batch, but the last time I looked at it, it felt very heavy to me (and not very Groovy).
If you're using plain JDBC, doing what Christoph recommends is probably the easiest thing to do (read in N rows and use GPars to spin through those rows concurrently).
If you're using Grails or Hibernate and want your worker threads to have access to the Spring context for dependency injection, things get a bit more complicated.
The way I solved it is by using the Grails Redis plugin (disclaimer: I'm the author) and the Jesque plugin, which is a Java implementation of Resque.
The Jesque plugin lets you create "Job" classes that have a "process" method with arbitrary parameters that are used to process work enqueued on a Jesque queue. You can spin up as many workers as you want.
I have a file upload endpoint that an admin user can post a file to; it saves the file to disk and enqueues a job for the ProducerJob that I've created. That ProducerJob spins through the file and, for each line, enqueues a message for a ConsumerJob to pick up. The message is simply a map of the values read from the CSV line.
The ConsumerJob takes those values, creates the appropriate domain object for its line, and saves it to the database. (See the sketch below.)
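Roughly, the two jobs have this shape. This is a sketch only: enqueue(), parseCsvLine(), and saveDomainObject() are hypothetical stand-ins, not the plugin's actual API:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// ProducerJob walks the uploaded file and enqueues one message per line
class ProducerJob {
    public void process(String filePath) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(filePath));
        String line;
        while ((line = in.readLine()) != null) {
            // one CSV row -> map of values, enqueued for a ConsumerJob
            enqueue("csvRows", ConsumerJob.class, parseCsvLine(line));
        }
        in.close();
    }

    Map<String, String> parseCsvLine(String line) {
        Map<String, String> values = new HashMap<String, String>();
        // split the line and fill the map (column handling elided)
        return values;
    }

    void enqueue(String queue, Class<?> jobClass, Map<String, String> args) {
        // stand-in for the plugin's enqueue call
    }
}

// ConsumerJob turns each message into a domain object and persists it
class ConsumerJob {
    public void process(Map<String, String> values) {
        // create the appropriate domain object for this row and save it
    }
}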
We were already using Redis in production, so using it as a queueing mechanism made sense. We had an old synchronous load that ran through file loads serially. I'm currently using one producer worker and 4 consumer workers, and loading things this way is over 100x faster than the old load was (with much better progress feedback to the end user).
I agree with the original question that there is probably room for something like this to be packaged up, as it is a relatively common need.
UPDATE: I put up a blog post with a simple example doing imports with Redis + Jesque.
