What is the best way to use Apache Open NLP with node.js?
Specifically, I want to use the Named Entity Extraction API. Here is what it says about it - the documentation is terrible (new project, I think):
http://opennlp.apache.org/documentation/manual/opennlp.html#tools.namefind
From the docs:
To use the Name Finder in a production system it's strongly recommended to embed it directly into the application instead of using the command line interface. First the name finder model must be loaded into memory from disk or another source. In the sample below it's loaded from disk.
InputStream modelIn = new FileInputStream("en-ner-person.bin");

try {
    TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
}
catch (IOException e) {
    e.printStackTrace();
}
finally {
    if (modelIn != null) {
        try {
            modelIn.close();
        }
        catch (IOException e) {
        }
    }
}
There are a number of reasons why model loading can fail:
Issues with the underlying I/O
The version of the model is not compatible with the OpenNLP version
The model is loaded into the wrong component, for example a tokenizer model is loaded with the TokenNameFinderModel class.
The model content is not valid for some other reason
After the model is loaded the NameFinderME can be instantiated.
NameFinderME nameFinder = new NameFinderME(model);
The initialization is now finished and the Name Finder can be used.
The NameFinderME class is not thread-safe; it must only be called from one thread. To use multiple threads, multiple NameFinderME instances sharing the same model instance can be created. The input text should be segmented into documents, sentences and tokens. To perform entity detection an application calls the find method for every sentence in the document. After every document clearAdaptiveData must be called to clear the adaptive data in the feature generators. Not calling clearAdaptiveData can lead to a sharp drop in the detection rate after a few documents. The following code illustrates that:
for (String document[][] : documents) {
    for (String[] sentence : document) {
        Span nameSpans[] = nameFinder.find(sentence);
        // do something with the names
    }
    nameFinder.clearAdaptiveData();
}
The following snippet shows a call to find:
String[] sentence = new String[]{
    "Pierre",
    "Vinken",
    "is",
    "61",
    "years",
    "old",
    "."
};
Span nameSpans[] = nameFinder.find(sentence);
The nameSpans array now contains exactly one Span which marks the name Pierre Vinken. The elements between the begin and end offsets are the name tokens. In this case the begin offset is 0 and the end offset is 2. The Span object also knows the type of the entity; in this case it's person (defined by the model). It can be retrieved with a call to Span.getType(). In addition to the statistical Name Finder, OpenNLP also offers a dictionary and a regular expression name finder implementation.
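As a quick illustration (not part of the docs excerpt), reading those spans back out in Java could look like this, using Span's getStart/getEnd/getType accessors and assuming java.util.Arrays is imported:

for (Span span : nameSpans) {
    // The tokens covered by the span form the entity; getType() returns e.g. "person".
    String[] tokens = Arrays.copyOfRange(sentence, span.getStart(), span.getEnd());
    System.out.println(span.getType() + ": " + String.join(" ", tokens));
}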
Check out this NodeJS library:
https://github.com/mbejda/Node-OpenNLP
https://www.npmjs.com/package/opennlp
Just do npm install opennlp and look at the examples on GitHub:
var openNLP = require("opennlp");

var sentence = "Pierre Vinken is 61 years old.";
var nameFinder = new openNLP().nameFinder;
nameFinder.find(sentence, function(err, results) {
    console.log(results);
});
I have a configuration application in Nodejs. It has a Component with name and uuid. A Component can have many Schemas. A Schema has a uuid, name, componentId, json. A Schema can have many Configurations. A Configuration has name, schemaId, json and uuid. A Schema can contain references to many other Schemas in it. Now I want to create a functionality of exporting all the data from one instance of the application and importing it into another. What would be the simplest way to do it? A few questions:
How to tell the application what to export. For now I think there should be separate arrays for components, schemas and configurations, like:
{
    components: ['id1', 'id2'],
    schemas: ['s1', 's2'],
    configurations: ['c1', 'c2']
}
This data would be sent to the application, which returns a file with all the information that will later be used for importing into another instance.
The real question is how my export file should look, keeping in mind that dependencies are also involved and dependencies can overlap. For example, a schema can have many other schemas referenced in its json field: schema1 has schema2 and schema4 as its dependencies, and there is another schema, schema5, that also requires schema2. So while importing we have to make sure that schema2 is saved before schema1 and schema5. How do I represent such a file that captures order as well as overlapping dependencies, while making sure that schema2 is not saved twice during import? The json of schema1 is shown below as an example:
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "p1": {
        "$ref": "link-to-schema2"
    },
    "p2": {
        "$ref": "link-to-schema4"
    }
}
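To make the requirement concrete, one possible layout (purely illustrative, field names invented) is a single, topologically ordered array in which each object appears exactly once, so the importer can simply save entries in order:

{
    "exportOrder": [
        { "type": "schema", "uuid": "schema2", "json": "..." },
        { "type": "schema", "uuid": "schema4", "json": "..." },
        { "type": "schema", "uuid": "schema1", "json": "...refs schema2 and schema4..." },
        { "type": "schema", "uuid": "schema5", "json": "...refs schema2..." }
    ]
}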
What step-wise pseudo-algorithm should I follow while importing?
This is a perfect occasion for a topological sort.
Taking away components, schemas and configurations terminology, what you have is objects (of various kinds) which depend on other objects existing first. A topological sort will create an order that has only forward dependencies (assuming you don't have circular ones, in which case it is impossible).
But the complication is that you have dependency information in a mix of directions. A component has to be created before its schema. A schema has to be created after the schemas that it depends on. It is not impossible that those schemas may belong to other components that have to be created as well.
The first step is to write a function that takes an object and returns a set of dependency relationships discoverable from the object itself. So we want dependencyRelations(object1) to give something like [[object1, object2], [object3, object1], [object1, object4]], where a pair [from, to] means that from depends on to existing. (Note: object1 will be in each pair but can be first or second.)
If every object has a method named uniqueName that uniquely identifies it then we can write a method that works something like this (apologies, all code was typed here and not tested, there are probably syntax errors but the idea is right):
function dependencyInfo (startingObject) {
    const nameToObject = {};
    const dependencyOf = {};
    const todo = [startingObject];
    const visited = {};
    while (0 < todo.length) {
        let obj = todo.pop();
        let objName = obj.uniqueName();
        if (! visited[objName]) {
            visited[objName] = true;
            nameToObject[objName] = obj;
            dependencyRelations(obj).forEach((pair) => {
                const [from, to] = pair;
                // It is OK to put things in todo that are visited, we just don't process again.
                todo.push(from);
                todo.push(to);
                if (! dependencyOf[from.uniqueName()]) {
                    dependencyOf[from.uniqueName()] = {};
                }
                // Record the edge in a set keyed by name so duplicate edges collapse.
                dependencyOf[from.uniqueName()][to.uniqueName()] = true;
            });
        }
    }
    return [nameToObject, dependencyOf];
}
This function will construct the dependency graph. But we still need to do a topological sort to get dependencies first.
function objectsInOrder (nameToObject, dependencyOf) {
    const answer = [];
    const visited = {};
    // Trick for a recursive function local to my environment.
    let addObject = undefined;
    addObject = function (objName) {
        if (! visited[objName]) {
            visited[objName] = true; // Only process once.
            // Add dependencies first. Leaf objects may have no entry at all, hence the fallback.
            Object.keys(dependencyOf[objName] || {}).forEach(addObject);
            answer.push(nameToObject[objName]);
        }
    };
    // Walk every known object so that objects with no dependencies are included too.
    Object.keys(nameToObject).forEach(addObject);
    return answer;
}
And now we have an array of objects such that each depends on the previous ones only. Send that, and at the other end you just inflate each object in turn.
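Tying it together, a minimal sketch (uniqueName/dependencyRelations are the assumed object methods described above; saveObject stands in for whatever the importing instance does to persist one object):

// Exporting instance: build the graph, then order it dependencies-first.
const [nameToObject, dependencyOf] = dependencyInfo(startingObject);
const exportList = objectsInOrder(nameToObject, dependencyOf);

// Importing instance: each object appears once and only depends on
// objects earlier in the array, so saving in order is always safe.
exportList.forEach(obj => saveObject(obj));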
I am specifically using breezejs, and the server code for breezejs converts the dbcontext into a form which is usable on the client side using EdmxWriter.WriteEdmx. There are many properties to which I have added JsonIgnore attributes so that they don't get passed to the client side. However, the metadata that is generated (and passed to the client side) from EdmxWriter.WriteEdmx still has those properties. Is there any additional attribute that I can add to those properties so that they are ignored by EdmxWriter.WriteEdmx? Or would I need to make a separate method so as not to have any other unintended side effects?
You can sub-class your DbContext with a more restrictive variant that you use solely for metadata generation. You can continue to use your base context for persistence purposes.
The DocCode sample illustrates this technique with its NorthwindMetadataContext which hides the UserSessionId property from the metadata.
It's just a few extra lines of code that do the trick.
public class NorthwindMetadataContext : NorthwindContext
{
    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        base.OnModelCreating(modelBuilder);

        // Hide from clients
        modelBuilder.Entity<Customer>().Ignore(t => t.CustomerID_OLD);

        // Ignore UserSessionId in metadata (but keep it in base DbContext)
        modelBuilder.Entity<Customer>().Ignore(t => t.UserSessionId);
        modelBuilder.Entity<Employee>().Ignore(t => t.UserSessionId);
        modelBuilder.Entity<Order>().Ignore(t => t.UserSessionId);

        // ... more of the same ...
    }
}
The Web API controller delegates to the NorthwindRepository where you'll see that the Metadata property gets metadata from the NorthwindMetadataContext while the other repository members reference an EFContextProvider for the full NorthwindContext.
public class NorthwindRepository
{
    private readonly EFContextProvider<NorthwindContext> _contextProvider;

    public NorthwindRepository()
    {
        _contextProvider = new EFContextProvider<NorthwindContext>();
    }

    public string Metadata
    {
        get
        {
            // Returns metadata from a dedicated DbContext that is different from
            // the DbContext used for other operations
            // See NorthwindMetadataContext for more about the scenario behind this.
            var metaContextProvider = new EFContextProvider<NorthwindMetadataContext>();
            return metaContextProvider.Metadata();
        }
    }

    public SaveResult SaveChanges(JObject saveBundle)
    {
        PrepareSaveGuard();
        return _contextProvider.SaveChanges(saveBundle);
    }

    public IQueryable<Category> Categories {
        get { return Context.Categories; }
    }

    // The EFContextProvider exposes the underlying DbContext used by the query members.
    private NorthwindContext Context { get { return _contextProvider.Context; } }

    // ... more members ...
}
Pretty clever, eh?
Just remember that the UserSessionId is still on the server-side class model and could be set by a rogue client's saveChanges requests. DocCode guards against that risk in its SaveChanges validation processing.
If you use the [NotMapped] attribute on a property, then it should be ignored by the EDMX process.
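For example (a minimal sketch; the entity and property names are illustrative):

using System.ComponentModel.DataAnnotations.Schema;

public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }

    [NotMapped] // excluded from the EF model, so it never reaches the EDMX metadata
    public string UserSessionId { get; set; }
}

Keep in mind that [NotMapped] removes the property from the EF model entirely, so EF will no longer persist it either; the metadata-only subclass shown above is the better choice when the property should stay mapped on the server.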
I have a widget with a list of the latest news; how do I cache only the widget's output?
The OutputCache module caches the whole page, and only for anonymous users, but I need to cache only one shape's output.
What solution would work here?
It's not a good idea to cache the Shape object itself, but you can capture the HTML output from a Shape and cache that.
Every Orchard Shape has a corresponding object called the Metadata. This object contains, among other things, some event handlers that can run when the Shape is displaying or after it has been displayed. By using these event handlers, it is possible to cache the output of the Shape on the first call to a driver. Then for future calls to the driver, we can display the cached copy of the output instead of running through the expensive parts of the driver or template rendering.
Example:
using System.Web;
using DemoModule.Models;
using Orchard.Caching;
using Orchard.ContentManagement.Drivers;
using Orchard.DisplayManagement.Shapes;
namespace DemoModule.Drivers {
    public class MyWidgetPartDriver : ContentPartDriver<MyWidgetPart> {
        private readonly ICacheManager _cacheManager;
        private readonly ISignals _signals;

        public MyWidgetPartDriver(
            ICacheManager cacheManager,
            ISignals signals
        ) {
            _cacheManager = cacheManager;
            _signals = signals;
        }

        public class CachedOutput {
            public IHtmlString Output { get; set; }
        }

        protected override DriverResult Display(MyWidgetPart part, string displayType, dynamic shapeHelper) {
            return ContentShape("Parts_MyWidget", () => {
                // The cache key. Build it using whatever is needed to differentiate the output.
                var cacheKey = /* e.g. */ string.Format("MyWidget-{0}", part.Id);

                // Standard Orchard cache manager. Notice we get this object by reference,
                // so we can write to its field to save our cached HTML output.
                var cachedOutput = _cacheManager.Get(cacheKey, ctx => {
                    // Use whatever signals are needed to invalidate the cache.
                    _signals.When(/* e.g. */ "ExpireCache");
                    return new CachedOutput();
                });

                dynamic shape;
                if (cachedOutput.Output == null) {
                    // Output has not yet been cached, so we are going to build the shape normally
                    // and then cache the output.
                    /*
                        ... Do normal (potentially expensive) things (call DBs, call services, etc.)
                        to prep shape ...
                    */

                    // Create shape object.
                    shape = shapeHelper.Parts_MyWidget(/*...*/);

                    // Hook up an event handler such that after rendering the (potentially expensive)
                    // shape template, we capture the output to the cached output object.
                    ((ShapeMetadata) shape.Metadata).OnDisplayed(displayed => cachedOutput.Output = displayed.ChildContent);
                } else {
                    // Found cached output, so simply output it instead of building
                    // the shape normally.

                    // This is a dummy shape, the name doesn't matter.
                    shape = shapeHelper.CachedShape();

                    // Hook up an event handler to fill the output of this shape with the cached output.
                    ((ShapeMetadata) shape.Metadata).OnDisplaying(displaying => displaying.ChildContent = cachedOutput.Output);
                    // Replacing the ChildContent of the displaying context will cause the display manager
                    // to simply use that HTML output and skip template rendering.
                }
                return shape;
            });
        }
    }
}
EDIT:
Note that this only caches the HTML that is generated from your shape output. Things like Script.Require(), Capture(), and other side effects that you perform in your shape templates will not be played back. This actually bit me because I tried to cache a template that required its own stylesheet, but the stylesheets would only be brought in the first time.
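For completeness: the cached output above is invalidated by firing the signal that the cache entry watches. A minimal sketch, assuming the "ExpireCache" signal from the driver above; NewsPart and NewsPartHandler are hypothetical names:

using Orchard.Caching;
using Orchard.ContentManagement.Handlers;

namespace DemoModule.Handlers {
    public class NewsPartHandler : ContentHandler {
        public NewsPartHandler(ISignals signals) {
            // Publishing a news item triggers the signal, which expires every
            // cache entry that was registered with _signals.When("ExpireCache").
            OnPublished<NewsPart>((context, part) => signals.Trigger("ExpireCache"));
        }
    }
}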
Orchard supplies a service called the CacheManager, which is awesome and cool and makes caching super easy. It is mentioned in the docs, but the description of how to use it isn't particularly helpful (http://docs.orchardproject.net/Documentation/Caching). The best place to see examples would be the Orchard core code and third-party modules such as Favicon and the Twitter widgets (all of them, one would hope).
Luckily, other nice people have gone to the effort of searching Orchard's code for you and writing nice little blog posts about it. The developer of the LatestTwitter widget wrote a neat post: http://blog.maartenballiauw.be/post/2011/01/21/Writing-an-Orchard-widget-LatestTwitter.aspx . So did Richard of NogginBox: http://www.nogginbox.co.uk/blog/orchard-caching-by-time . And of course Bertrand has a helpful post on the subject as well: http://weblogs.asp.net/bleroy/archive/2011/02/16/caching-items-in-orchard.aspx
I have a DSL in Xtext, and I would like to reuse the rules, terminals, etc. defined in my .xtext file to generate a configuration file for some other tool involved in the project. The config file uses syntax similar to BNF, so it is very similar to the actual Xtext content and it requires minimal transformations. In theory I could easily write a script that would parse Xtext and spit out my config...
The question is, how do I go about implementing it so that it fits with the whole ecosystem? In other words - how to do a Model to Model transform in Xtext/EMF?
If you have both metamodels (ecore, xsd, ...), your best shot is to use ATL ( http://www.eclipse.org/atl/ ).
If I understand you correctly, you want to go from an Xtext model to its EMF model. Here is a code example that achieves this; substitute your model specifics where necessary.
public static BeachScript loadScript(String file) throws BeachScriptLoaderException {
    try {
        Injector injector = new BeachStandaloneSetup().createInjectorAndDoEMFRegistration();
        XtextResourceSet resourceSet = injector.getInstance(XtextResourceSet.class);
        resourceSet.addLoadOption(XtextResource.OPTION_RESOLVE_ALL, Boolean.TRUE);
        Resource resource = resourceSet.createResource(URI.createURI("test.beach"));
        InputStream in = new ByteArrayInputStream(file.getBytes());
        resource.load(in, resourceSet.getLoadOptions());
        BeachScript model = (BeachScript) resource.getContents().get(0);
        return model;
    } catch (Exception e) {
        throw new BeachScriptLoaderException("Exception Loading Beach Script " + e.toString(), e);
    }
}
I have an interesting need for an extension method on the IEnumerable interface - the same thing as List.ConvertAll. This has been covered before here and I found one solution here. What I don't like about that solution is that he builds a List to hold the converted objects and then returns it. I suspect LINQ wasn't available when he wrote his article, so my implementation is this:
public static class IEnumerableExtension
{
    public static IEnumerable<TOutput> ConvertAll<T, TOutput>(this IEnumerable<T> collection, Func<T, TOutput> converter)
    {
        if (null == converter)
            throw new ArgumentNullException("converter");

        return from item in collection
               select converter(item);
    }
}
What I like better about this is that I convert 'on the fly' without having to load the entire list of whatever the TOutputs are. Note that I also changed the type of the delegate - from Converter to Func. The compiled result is the same, but I think it makes my intent clearer - I don't mean for this to be ONLY type conversion.
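A quick illustration of that 'on the fly' behavior (the converter below is a stand-in lambda, not a real repository call):

var ids = new[] { 1, 2, 3 };
var converted = ids.ConvertAll(id => { Console.WriteLine("converting " + id); return id * 10; });

// Nothing has run yet; each item is converted only as it is pulled.
var first = converted.First(); // prints "converting 1" only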
Which leads me to my question: in my repository layer I have a lot of queries that return lists of IDs - IDs of entities. I used to have several classes that 'converted' these IDs to entities in various ways. With this extension method I am able to boil all that down to code like this:
IEnumerable<Part> GetBlueParts()
{
    IEnumerable<int> keys = GetBluePartKeys();
    return keys.ConvertAll<Part>(PartRepository.Find);
}
where the 'converter' is really the repository's Find-by-ID method. In my case, the 'converter' is potentially doing quite a bit. Does anyone see any problems with this approach?
The main issue I see with this approach is that it's completely unnecessary.
Your ConvertAll method is no different from Enumerable.Select<TSource,TResult>(IEnumerable<TSource>, Func<TSource,TResult>), which is a standard LINQ operator. There's no reason to write an extension method for something that already exists in the framework.
You can just do:
IEnumerable<Part> GetBlueParts()
{
    IEnumerable<int> keys = GetBluePartKeys();
    return keys.Select<int,Part>(PartRepository.Find);
}
Note: your method would require <int,Part> as well to compile, unless PartRepository.Find only works on int, and only returns Part instances. If you want to avoid that, you can probably do:
IEnumerable<Part> GetBlueParts()
{
    IEnumerable<int> keys = GetBluePartKeys();
    return keys.Select(i => PartRepository.Find<Part>(i)); // I'm assuming that fits your "Find" syntax...
}
Why not utilize the yield keyword (and only convert each item as it is needed)?
public static class IEnumerableExtension
{
    public static IEnumerable<TOutput> ConvertAll<T, TOutput>
        (this IEnumerable<T> collection, Func<T, TOutput> converter)
    {
        // Note: because this is an iterator block, this check is deferred until
        // the sequence is first enumerated, not when ConvertAll is called.
        if (null == converter)
            throw new ArgumentNullException("converter");

        foreach (T item in collection)
            yield return converter(item);
    }
}