Does sklearn have any model type metadata either in the project or outside?

For example, it could be useful to have information in the library that lets one select all tree-based ensemble models that work on regression/classification tasks with more than one output.
I think users could gradually create this metadata in the library if it doesn't already exist.
So something like:
[model_entry for model_entry in sklearn.meta_info if model_entry.y_2d and model_entry.ensemble]
but with better names.

You can always make use of the estimator tags to get such information: https://scikit-learn.org/dev/developers/develop.html#estimator-tags
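For example, combining the tags with sklearn.utils.all_estimators gives roughly the query from the question. This is only a sketch: all_estimators and the multioutput tag exist in current releases, but _get_tags() is a private API and tag names vary between versions, so adjust to your release:

# Rough sketch only: _get_tags() is private and tag names differ across
# scikit-learn versions.
from sklearn.utils import all_estimators
from sklearn.ensemble import BaseEnsemble

multioutput_ensembles = []
for name, Est in all_estimators(type_filter="regressor"):
    if not issubclass(Est, BaseEnsemble):
        continue  # keep only ensemble estimators
    try:
        tags = Est()._get_tags()  # some estimators need constructor arguments
    except Exception:
        continue
    if tags.get("multioutput", False):
        multioutput_ensembles.append(name)

print(multioutput_ensembles)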

Related

Django Wagtail dynamically create form without new model

How would I allow my primary users to dynamically create forms that they can issue to their end clients? Each of my primary users has their own unique information they would like to collect, which I do not know beforehand. I would like to avoid creating new models in code for their dynamic needs and then having to migrate the models.
I came across this, which had an interesting response, but it starts with a disclaimer:
The flexibility of Python and Django allow developers to dynamically create models to store and access data using Django’s ORM. But you need to be careful if you go down this road, especially if your models are set to change at runtime. This documentation will cover a number of things to consider when making use of runtime dynamic models.
Which leads me to believe a lot can go wrong.
However, because I'm using Wagtail, I believe there is probably a way to use StructBlocks & StreamFields to accomplish it.
Any guidance would be helpful.
Wagtail provides a form builder module for this purpose.
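For example, a minimal form-builder page looks roughly like this (a sketch; the import paths are those of recent Wagtail releases, while older versions used wagtail.wagtailforms and wagtail.admin.edit_handlers, so adjust to your version):

# Sketch of Wagtail's form builder: editors add fields through the admin UI,
# no new Django model or migration per form.
from django.db import models
from modelcluster.fields import ParentalKey
from wagtail.admin.panels import FieldPanel, InlinePanel
from wagtail.contrib.forms.models import AbstractEmailForm, AbstractFormField


class FormField(AbstractFormField):
    # Each row is one field the editor added to the form.
    page = ParentalKey("FormPage", on_delete=models.CASCADE, related_name="form_fields")


class FormPage(AbstractEmailForm):
    thank_you_text = models.TextField(blank=True)

    content_panels = AbstractEmailForm.content_panels + [
        InlinePanel("form_fields", label="Form fields"),
        FieldPanel("thank_you_text"),
    ]

Submissions are stored by Wagtail and can be viewed or exported from the admin, so the end clients' answers never require a schema change either.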
I have two possible solutions for you, although it should be said that there is probably some Django library I don't know about that already does this. That said:
Prompt your user for the fields they want and each field's type.
Pass this as a dictionary to some function that generates the HTML for the form.
When the form is submitted, instead of worrying about storing the fields separately, store a dictionary on the model. There are a couple of ways to do that; one is sketched below.
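As a rough sketch of that first approach (the model names, field-spec format and helper function are all just illustrative, and JSONField requires Django 3.1+):

# Store a field spec per form and build a plain Django Form class at runtime.
from django import forms
from django.db import models


class DynamicForm(models.Model):
    # e.g. [{"name": "favourite_colour", "type": "char", "required": true}]
    field_spec = models.JSONField(default=list)


class DynamicResponse(models.Model):
    form = models.ForeignKey(DynamicForm, on_delete=models.CASCADE)
    # The submitted answers, keyed by field name.
    answers = models.JSONField(default=dict)


FIELD_TYPES = {
    "char": forms.CharField,
    "int": forms.IntegerField,
    "bool": forms.BooleanField,
}


def build_form_class(spec):
    """Turn a stored field spec into a Django Form class."""
    fields = {
        item["name"]: FIELD_TYPES[item["type"]](required=item.get("required", True))
        for item in spec
    }
    return type("RuntimeForm", (forms.Form,), fields)

A view can then do form = build_form_class(dynamic_form.field_spec)(request.POST) and dump form.cleaned_data into DynamicResponse.answers, so no per-form migration is ever needed.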
Another way you could do this, albeit more convoluted but perhaps better suited to your needs, is to use MongoDB as the database for Django instead. Because it is unstructured, it might be a better fit for your use case. Instructions on using MongoDB with Django are here.

Ontology Populating

Hello everyone,
Because of my lack of experience with ontologies and the semantic web, I have a conceptual misunderstanding. When we refer to 'ontology population', do we make clones of the ontology with our concrete data, or do we map our concrete data to the ontology? And if so, how is it done? My intention is to build a knowledge graph using an ontology (the FIBO ontology for the loans domain), and I also have an Excel file with loans data. Not every entry in my Excel file corresponds to the predefined ontology classes, but I suppose that is not a major problem. So, to make myself clearer: how do I practically populate the ontology?
I would also like to note that I am using Neo4j as the graph database and Python as my implementation language, so the ontology population would ideally be done with Python libraries.
Thanks in advance for your time!
This video could inform your understanding of modelling and imports for graph database design: https://www.youtube.com/watch?v=oXziS-PPIUA
He steps through importing a CSV into Neo4j and uses Python.
The terms ontology and semantic web (OWL) are probably not what you're really asking about (your domain being loans/finance rather than the web); furthermore, the semantic web is not taken very seriously by practitioners these days.
"Graph database modelling" is probably a useful area of research for solving your problem.
I can recommend using Apache Jena to populate your ontology from the data source; you can use either Java or Python. The process is roughly:
1. Extract triples from the loaded data according to the RDF schema, which is the basis of the triple extraction. The parser used in this step depends on the data source; in your case that is the Excel file.
2. Map the extracted triples to an intermediate data model (IDM). The IDM can be any format that is convenient for mapping, such as JSON.
3. Load the individuals from the intermediate data model into the RDF schema used earlier, so that the schema now contains the individuals as well.
4. Review the updated schema to check whether it needs more data, then run a logic reasoner to evaluate and correct possible problems and inconsistencies.
If the reasoner runs without errors, the RDF schema now contains all the possible individuals and you can use it for visualisation with Neo4j.
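If you do the extraction step in Python, a library such as rdflib (rather than Jena) can handle the row-to-triples part. The sketch below is illustrative only: the class/property IRIs, file names and spreadsheet columns are placeholders, not real FIBO terms:

# Illustrative only: map rows of a loans spreadsheet to RDF individuals
# with pandas + rdflib. IRIs and column names are placeholders.
import pandas as pd
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/loans#")

g = Graph()
g.parse("fibo-loans.ttl", format="turtle")  # the ontology/schema, exported as Turtle

df = pd.read_excel("loans.xlsx")
for _, row in df.iterrows():
    loan = EX["loan-" + str(row["loan_id"])]
    g.add((loan, RDF.type, EX.Loan))  # the row becomes an individual of a class
    g.add((loan, EX.principalAmount, Literal(float(row["amount"]))))

# The populated graph can then be loaded into Neo4j (e.g. via the n10s plugin).
g.serialize("loans-populated.ttl", format="turtle")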

Are the new fields on Processes and Tasks in Viewflow 1.6.0 for library users, or for internal use only

Viewflow 1.6.0 introduces new fields: "data", a JSON field, and "artifact", a generic foreign key. They are present on both Processes and Tasks.
Are these intended to be available to library users, or are they Viewflow internal-use-only? I did not see anything in the docs or the github issues list to clarify the matter, so a pointer would be appreciated if I missed it.
Yep, it's for library users; it allows using proxy models instead of real tables for keeping process-only data.
The data field is JSON, so it can be combined with a jsonstore field - https://github.com/viewflow/jsonstore - which exposes the JSON data as real Django fields, so it can be used with ModelForms as usual.
Ex: https://github.com/viewflow/viewflow/blob/master/demo/helloworld/models.py#L6
The artifact field allows linking a process to your own data models without creating a separate table for that.
All of this makes it possible to avoid joins when building the list of all tasks from different flows for a user.
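Along the lines of the linked demo, a proxy Process with jsonstore fields looks roughly like this (a sketch assuming django-jsonstore's default of storing values in the model's "data" JSON column; the field names are illustrative):

# Sketch: process-specific data kept in the built-in "data" JSON field,
# exposed as ordinary model fields via django-jsonstore, no extra table.
import jsonstore
from viewflow.models import Process


class MyProcess(Process):
    text = jsonstore.CharField(max_length=250)
    approved = jsonstore.BooleanField(default=False)

    class Meta:
        proxy = True  # no new table; rows live in the viewflow Process table

Because MyProcess is a proxy, the fields still work with ModelForms and the admin while everything is stored in the single JSON column.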

Building a collaborative filtering recommendation engine using Spark mlLib

I am trying to build a recommendation engine based on collaborative filtering using Apache Spark. I have been able to run recommendation_example.py on my data with quite good results (MSE ~ 0.9). The specific questions that I have are:
How do I make recommendations for users who have not done any activity on the site? Isn't there some API call for popular items that would give me the most popular items based on user actions? One way to do this is to identify the popular items ourselves, catch the java.util.NoSuchElementException, and return those popular items.
How do I reload the model after some data has been added to the input file? I am trying to reload the model in another function, which tries to save the model, but it fails with org.apache.hadoop.mapred.FileAlreadyExistsException. One way to do this is to listen for the incoming data on a parallel thread, save it using model.save(sc, "target/tmp/<some target>"), and then reload the model after significant data has been received. I am lost here on how to achieve that.
It would be very helpful, if I could get some direction here.
For the first part, you can count, for each item_id, the number of times that item_id appears; Spark's map and reduceByKey functions work for that. Then take the top 10/20 items with the highest counts. You can also weight items depending on their recency.
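A minimal PySpark sketch of that counting step (the ratings file path and layout are assumptions):

# Most-popular-items fallback for users with no activity yet.
from pyspark import SparkContext

sc = SparkContext(appName="popular-items")

# Each line is assumed to be: user_id,item_id,rating,timestamp
ratings = sc.textFile("data/ratings.csv")

top_items = (
    ratings.map(lambda line: (line.split(",")[1], 1))  # (item_id, 1)
           .reduceByKey(lambda a, b: a + b)            # (item_id, count)
           .takeOrdered(20, key=lambda kv: -kv[1])     # top 20 by count
)
print(top_items)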
For the second part, you can save the model under a new name every time. I generally create a folder name on the fly using the current date and time, and use the same name to reload the model from the saved folder. You will always have to train the model again, using the past data plus the newly received data, and then use that model to predict.
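A sketch of the save-under-a-fresh-path idea with the MLlib RDD API (paths, input layout and ALS parameters are assumptions):

# Retrain on old + new ratings, save under a timestamped path, reload to serve.
import time
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

sc = SparkContext(appName="als-retrain")

def parse(line):
    user, item, rating = line.split(",")[:3]
    return Rating(int(user), int(item), float(rating))

# Old data plus whatever has been appended since the last run.
ratings = sc.textFile("data/ratings.csv").map(parse)

model = ALS.train(ratings, rank=10, iterations=10)

# A fresh, timestamped path avoids FileAlreadyExistsException on save.
model_path = "target/models/als-%d" % int(time.time())
model.save(sc, model_path)

# Serve recommendations from the most recently saved model.
serving_model = MatrixFactorizationModel.load(sc, model_path)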
Independent of platforms like Spark, there are some very good link-prediction techniques (for example, non-negative matrix factorization) that predict links between two sets.
Other very effective recommendation techniques are:
1. Thompson sampling, 2. multi-armed bandits (MAB). A lot depends on the raw dataset and how it is distributed. I would recommend applying the above methods to 5% of the raw dataset, building a hypothesis, using A/B testing, predicting links, and moving forward.
Again, all of these techniques are independent of the platform. I would also recommend starting from scratch instead of using platforms like Spark, which are mainly useful for large datasets; you can always move to such platforms later for scalability.
Hope it helps!

Data relationships as a context for search in Marklogic

I am using MarkLogic's search functionality to create a search page. Right now, I'm running an XQuery to get search results through search:search. As a bare-bones example, see this code:
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
at "/MarkLogic/appservices/search/search.xqy";
search:search('test',
<options xmlns='http://marklogic.com/appservices/search'></options>)
This searches all content in the database, which is fine in many cases. In other cases, I search based on collections with cts:collection-query. The collections serve as great contexts for my searches.
Now, I would like to limit my search results based on a relationship of data in a "main" document. This "main" document has all the relationships in an object model. If that object model has a reference to a document, I want that document included in the search. Essentially, the "main"/model document is the context of the search.
I was trying to brainstorm some ideas of the best way to do this. Here's what I've come up with thus far, but I was hoping someone more familiar with MarkLogic (I've only been working with it for 6 months) could point me in a good direction:
Add all documents referenced in the model document to a unique collection. Then query search based on that collection. However, the collections would have to be updated as the model changed.
Load the model document into my code and get a list of all the references and add them to a query by cts:document-query (or the like).
Restructure my concept of a "model" somehow in my XML documents.
Thanks for any input or suggestions.
I would start with (2) and see if the performance is good enough. That will depend on your use case, but I expect it should be fine for thousands or even hundreds of thousands of references.
Be sure to use a single cts:document-query($list-of-references) covering the whole list. That will be faster than cts:or-query(for $ref in $list-of-references return cts:document-query($ref)), because the index lookup can be a single pass instead of N separate lookups.
All of these ideas would work fine. Deciding which to use depends on particulars of your application, such as how often the main document changes (and whether you are in control of that) and how hard it would be to remodel your XML.
Another thing to consider is that you can set a trigger on document updates that performs the collection changes automatically.
-David Lee

Resources