I am new to this field, so this question might look dumb to some of you, but please bear with me.
I have created a Keras model which works well. I can save it in .hdf5 format using model.save("model_name.hdf5"), which is good.
My main question is whether there is any other format I can save my model in so that it can be used in C++/Android/JavaScript. Is this even possible? If you are wondering why I am asking about all three, it's because I have 3 projects, each of which uses one of them.
Thanks for any help in advance.
The answer depends on what you are going to save and use in another language.
If you just need the architecture of the model to be saved, you may save it as JSON, which can later be used in any other platform and language you are going to use.
model_json = model.to_json()
If you also need the weights and biases, I do not know any specific tool, but you can simply read the stored data in Python, create a multidimensional array, and then save it in a file format appropriate for any of the languages you need. For example, the weights of the third layer (index 2, since layers are zero-indexed) can be found in model.layers[2].get_weights().
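A minimal sketch of that export, assuming `model` is a trained Keras model. The function name `export_model` and both file paths are made up for illustration; only `to_json()`, `layers`, `name`, and `get_weights()` are taken from Keras, and `tolist()` is the NumPy array method.

```python
import json


def export_model(model, arch_path, weights_path):
    """Write the architecture and the per-layer weights to two JSON files."""
    # model.to_json() returns the architecture as a JSON string.
    with open(arch_path, "w") as f:
        f.write(model.to_json())

    # get_weights() returns a list of NumPy arrays per layer
    # (for a Dense layer: kernel first, then bias).
    # tolist() turns each array into nested Python lists that JSON can store.
    weights = {
        layer.name: [w.tolist() for w in layer.get_weights()]
        for layer in model.layers
    }
    with open(weights_path, "w") as f:
        json.dump(weights, f)
```

JSON is chosen here only because C++, Java/Kotlin, and JavaScript can all parse it; for large models a binary format would be more compact.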
Finally, if you want to run the model in another language, you need to implement the for-loops that perform the computation (the forward pass) yourself. You might find some conversion tools for your target language. For example, for C, this tool can help.
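To give an idea of what those for-loops look like, here is a rough sketch of a single fully connected layer. It is written in Python only to keep the example short; the same two loops port directly to C++, Java, or JavaScript. The function name `dense_forward` is hypothetical, and it assumes the kernel is laid out as `kernel[input_index][unit_index]`, which is how a Keras Dense layer stores it.

```python
import math


def dense_forward(inputs, kernel, bias, activation=None):
    """One dense layer: out[j] = bias[j] + sum_i inputs[i] * kernel[i][j]."""
    outputs = []
    for j in range(len(bias)):            # one iteration per output unit
        total = bias[j]
        for i in range(len(inputs)):      # accumulate the weighted inputs
            total += inputs[i] * kernel[i][j]
        outputs.append(total)

    # Only two activations are handled in this sketch.
    if activation == "relu":
        outputs = [max(0.0, v) for v in outputs]
    elif activation == "sigmoid":
        outputs = [1.0 / (1.0 + math.exp(-v)) for v in outputs]
    return outputs
```

Chaining one such call per layer, feeding each output into the next, gives the prediction for a plain feed-forward network. Convolutional or recurrent layers need their own loops.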
I'm new to NLP. I am looking for recommendations for an Annotation tool to create a labeled NER dataset from raw texts.
In detail:
I'm trying to create a labeled data set for specific types of Entities in order to develop my own NER project (rule based at first).
I assumed there would be some friendly frameworks that allow you to create tagging projects, tag text data, create a labeled dataset, and even share projects so several people could work on the same project, but I'm struggling to find one (I admit "friendly" and "intuitive" are subjective, yet this is my experience).
So far I've tried several Frameworks:
I tried LightTag. It makes the tagging itself fast and easy (i.e. marking the words and giving them labels), but the entire process of creating a useful dataset is not as intuitive as I expected (i.e. uploading the text files, splitting them into different tagging objects, saving the tags, etc.).
I've installed and tried LabelStudio and found it less mature than LightTag (don't mean to judge here :))
I've also read about spaCy's Prodigy, which offers a paid annotation tool. I would consider purchasing it, but their website only offers a live demo of the tagging phase, and I can't assess whether their product is superior to the other two products above.
Even on StackOverflow, the latest question I found on the matter is from over 5 years ago.
Do you have any recommendation for a tool to create a labeled NER dataset from raw text?
⚠️ Disclaimer
I am the author of Acharya. I would limit my answers to the points raised in the question.
Based on your question, Acharya would help you create the project, upload your raw text data, and annotate it to create a labeled dataset.
It would allow you to mark records individually for train or test in the dataset and would give data-centric reports to identify and fix annotation/labeling errors.
It allows you to add different algorithms (bring your own algorithm) to the project and train the model regularly. Once trained, it can give annotation suggestions from the trained models on untagged data to make the labeling process faster.
If you want to train in a different setup, it allows you to export the labeled dataset in multiple supported formats.
Currently, it does not support sharing of projects.
Acharya community edition is in alpha release.
github page (https://github.com/astutic/Acharya)
website (https://acharya.astutic.com/)
Doccano is another open-source annotation tool that you can check out https://github.com/doccano/doccano
I have used both DOCCANO (https://github.com/doccano/doccano) and BRAT (https://brat.nlplab.org/).
I find the latter very good, and it supports more functions. Both are free to use.
I have used gensim.utils.simple_preprocess(str(sentence)) to create a dictionary of words that I want to use for topic modelling. However, this also filters out important numbers (house resolutions, bill numbers, etc.) that I really need. How do I overcome this? Possibly by replacing digits with their word form. How do I go about it, though?
You don't have to use simple_preprocess() - it's not doing much, it's not that configurable or sophisticated, and typically the other Gensim algorithms just need lists-of-tokens.
So, choose your own tokenization - which in some cases, depending on your source data, could be as simple as a .split() on whitespace.
If you want to look at what simple_preprocess() does, as a model, you can view its Python source at:
https://github.com/RaRe-Technologies/gensim/blob/351456b4f7d597e5a4522e71acedf785b2128ca1/gensim/utils.py#L288
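As a rough illustration of "your own tokenization", here is a small sketch that keeps numbers. The `tokenize` name, the regex, and the `min_len` parameter are my own choices for this example, not anything Gensim provides.

```python
import re

# Runs of ASCII letters or digits. Digits are kept on purpose,
# so bill and resolution numbers survive as tokens.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text, min_len=1):
    """Lowercase the text and return alphanumeric tokens, numbers included."""
    return [t for t in TOKEN_RE.findall(text.lower()) if len(t) >= min_len]
```

You would then build the token lists with something like `texts = [tokenize(doc) for doc in documents]` and hand them to `gensim.corpora.Dictionary(texts)` as usual. If non-English text or hyphenated identifiers matter to you, the pattern needs adjusting.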
First a little bit of context: I'm trying to identify street addresses in a corpus of documents, and we decided that the obvious solution would be to use an NLP tool (Apache OpenNLP in this case). So far everything looks great, although we still need to train the model with a lot of documents, but that's not really an issue. We improved the solution by adding an extra address-validation step using the USAddress parser from Datamade. My biggest issue is that the addresses by themselves are nothing without a location next to them. Sometimes the location is specified in the text, and we will assume that this happens quite often.
Here comes my question: Is there some way to use coreference to associate the entities in the text? Or better yet, is there a way to annotate arbitrary words in the text and identify them as being one entity?
I've been looking at the Apache OpenNLP documentation but...it's pretty thin and I think it still needs some work.
If you want to use coreference for this problem, you can have a look at this blog
But a simpler solution would be using a sentence detector + RegEx, or a location NER + sentence detector (presuming addresses are on a single line).
I think US addresses can be identified using a regular expression, and once the regex matches, you can use OpenNLP's sentence detector to print the whole address line.
Similarly, you can use the NER model provided by OpenNLP to find locations and print the sentence you want.
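To make the regex + sentence detector idea concrete, here is a small sketch. It is in Python rather than Java, the address pattern is deliberately simplified (it will miss many real US addresses), and the naive `split_sentences` function only stands in for OpenNLP's sentence detector. All names here are made up for the example.

```python
import re

# Simplified, illustrative pattern: house number, capitalized words, street suffix.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+)+"
    r"(?:Street|Avenue|Road|Boulevard|Drive|Lane)\b"
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def find_address_sentences(text):
    """Return (address, sentence) pairs for every sentence containing a match."""
    results = []
    for sentence in split_sentences(text):
        for match in ADDRESS_RE.finditer(sentence):
            results.append((match.group(0), sentence))
    return results
```

Keeping the whole sentence alongside the match is what lets you look for a city or state near the address afterwards. Note that the naive splitter breaks on abbreviations such as "St.", which is one reason to prefer a trained sentence detector in practice.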
Hope this helps!
edit
this Github Repo made it simple for us. Check it out!
OpenNLP does not provide a coreference resolution module. You have to use either Stanford or Illinois or Berkeley system to accomplish the task. They may not work out of the box, you may have to do some parameter tuning or supervised training to achieve reasonable performance.
#edit
Thanks @Alaye for pointing out that OpenNLP does have a coref module; for more details see his answer.
Thanks
Ok, several months later! It wasn't Coref that I was after... what I was actually looking for was Relation Extraction (Information Extraction). I used MITIE (BinaryRelation) and that did the trick. I trained my own model using the Brat annotation tool and got an F1 score of 0.81. Pretty neat...
Don't know where to start on this one, so hopefully you guys can clear up my question. I have a project where email will be searched for specific words/patterns and the results stored in a structured manner, similar to what TripIt does.
The article states that they developed a DataMapper
The DataMapper is responsible for taking inbound email messages
addressed to plans [at] tripit.com and transforming them from the
semi-structured format you see in your mail reader into a highly
structured XML document.
There is a comment that also states
If you're looking to build this yourself, reading a little bit about
Wrappers and Wrapper Induction might be helpful
I Googled and read about wrapper induction but it was just too broad of a definition and didn't help me understand how one would go about solving such problem.
Is there some open source project out there that does similar things?
There are a couple of different ways and things you can do to accomplish this.
The first part, which involves getting access to the email content, I'll not answer here. Basically, I'll assume that you have access to the text of the emails, and if you don't, there are some libraries that allow you to connect Java to a mailbox, like Camel (http://camel.apache.org/mail.html).
So now you've got the email; then what?
A handy thing that could help is that LingPipe (http://alias-i.com/lingpipe/) has an entity recognizer that you can populate with your own terms. Specifically, look at some of their extraction tutorials and their dictionary extractor (http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html). Inside the LingPipe dictionary extractor (http://alias-i.com/lingpipe/docs/api/com/aliasi/dict/ExactDictionaryChunker.html) you'd simply import the terms you're interested in and use that to associate labels with an email.
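To show the general idea of dictionary-based extraction (not LingPipe's actual API, which is Java), here is a minimal Python sketch. The function name and the sample dictionary entries are invented for illustration.

```python
import re


def dictionary_chunks(text, dictionary):
    """Return (start, end, matched_text, label) for every exact dictionary hit."""
    chunks = []
    for phrase, label in dictionary.items():
        # Word boundaries keep "Delta" from matching inside "Deltaville".
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            chunks.append((m.start(), m.end(), m.group(0), label))
    return sorted(chunks)


# Hypothetical terms and labels, just to show the call shape.
TERMS = {"Delta": "AIRLINE", "Boston": "CITY"}
hits = dictionary_chunks("Your Delta flight departs from Boston.", TERMS)
```

Looping over every phrase is fine for a few hundred terms. For large dictionaries a trie or Aho-Corasick style matcher is the usual choice, which is roughly what dedicated chunkers do for you.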
You might also find the following question helpful: Dictionary-Based Named Entity Recognition with zero edit distance: LingPipe, Lucene or what?
Really a very broad question, but I can try to give you some general ideas, which might be enough to get started. Basically, it sounds like you're talking about an elaborate parsing problem - scanning through the text and looking to apply meaning to specific chunks. Depending on what exactly you're looking for, you might get some good mileage out of a few regular expressions to start - things like phone numbers, email addresses, and dates have fairly standard structures that should be matchable. Other data points might benefit from some indicator words - the phrase "departing from" might indicate that what follows is an address. The natural language processing community also has a large tool set available for text processing - check out things like part-of-speech taggers and semantic analyzers if they're appropriate to what you're trying to do.
Armed with those techniques, you can follow a basic iterative development process: For each data point in your expected output structure, define some simple rules for how to capture it. Then, run the application over a batch of test data and see which samples didn't capture that datum. Look at the samples and revise your rules to catch those samples. Repeat until the extractor reaches an acceptable level of accuracy.
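Here is a small sketch of that loop, assuming three output fields. The field names, the patterns, and both function names are illustrative guesses, not a recommended rule set.

```python
import re

# One simple rule per output field; these patterns are illustrative only.
RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    # Indicator phrase: capture what follows "departing from".
    "origin": re.compile(r"departing from\s+([A-Z][\w ]*?)(?=[.,\n]|$)"),
}


def extract(text):
    """Apply every rule and return the first hit per field (or None)."""
    record = {}
    for field, pattern in RULES.items():
        m = pattern.search(text)
        if m is None:
            record[field] = None
        else:
            # Use the capture group if the rule has one, else the whole match.
            record[field] = m.group(m.lastindex or 0)
    return record


def missed_samples(samples):
    """Map each field to the indices of samples where its rule found nothing."""
    misses = {field: [] for field in RULES}
    for i, text in enumerate(samples):
        for field, value in extract(text).items():
            if value is None:
                misses[field].append(i)
    return misses
```

Running `missed_samples` over a batch of test emails tells you which samples to read next; you then loosen or add a rule and rerun until the miss lists are short enough.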
Depending on the specifics of your problem, there may be machine learning techniques that can automate much of that process for you.
We've been working with the NLTK library in a recent project where we're
mainly interested in the named entities part.
In general we're getting good results using the NEChunkParser class.
However, we're trying to find a way to provide our own terms to the
parser, without success.
For example, we have a test document where my name (Shay) appears in
several places. The library finds me as GPE while I'd like it to find
me as PERSON...
Is there a way to provide some kind of a custom file/
code so the parser will be able to interpret the named entity as I
want it to?
Thanks!
The easy solution is to compile a list of entities that you know are misclassified, then filter the NEChunkParser output in a postprocessing module and replace these entities' tags with the tags you want them to have.
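A minimal sketch of such a postprocessing step. It assumes you have already flattened the chunker output into (entity text, label) pairs; with NLTK's tree output that typically means joining the leaves of each labeled subtree and reading its `label()`. The override table and the `relabel` name are hypothetical.

```python
# Hypothetical override table: entity text -> label it should have.
OVERRIDES = {"Shay": "PERSON"}


def relabel(entities, overrides=OVERRIDES):
    """Replace the label of every entity whose text is in the override table."""
    return [(text, overrides.get(text, label)) for text, label in entities]


fixed = relabel([("Shay", "GPE"), ("London", "GPE")])
```

This only patches entities you already know about; anything not in the table keeps whatever label the chunker gave it.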
The proper solution is to retrain the NE tagger. If you look at the source code for NLTK, you'll see that the NEChunkParser is based on a MaxEnt classifier, i.e. a machine learning algorithm. You'll have to compile and annotate a corpus (dataset) that is representative of the kind of data you want to work with, then retrain the NE tagger on this corpus. (This is hard, time-consuming and potentially expensive.)