Tabular data using spaCy - NLP

I'm using spaCy and need some help training our model with custom entities that are given in tabular format in a Word/PDF document.
I'm able to train it with a custom entity based on the ANIMAL example, and it works fine. In that case, we provide the start and end index of the custom entity in a given text:
("Horses are too tall and they pretend to care about your feelings", {
'entities': [(0, 6, 'ANIMAL')]
}),
My question is about the tabular format:
How can I give indexes like in the ANIMAL example when the entities come from a table?
Can anyone please guide me?

After a lot of research and reading, I found a way to make it work:
Convert the table to text.
The conversion will introduce lots of extra whitespace (tabs, newlines and so on); replace it with single spaces.
This turns the table into a paragraph.
Now you can give indexes over that text just as you would for sentences, and train your model.
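For example, here is a minimal sketch of these steps in Python, assuming the table has already been extracted as rows of (key, value) strings; the row contents and entity labels are made up for illustration:

# Hypothetical rows extracted from the table in the document.
rows = [("Animal", "Horse"), ("Country", "India")]

# Flatten the table into a single whitespace-normalised text.
text = " ".join(key + " " + value for key, value in rows)

# Compute character offsets for each value and build spaCy-style examples,
# with (start, end, label) offsets just like the ANIMAL example above.
entities = []
for key, value in rows:
    start = text.find(value)  # naive lookup; assumes each value occurs once
    entities.append((start, start + len(value), key.upper()))

train_example = (text, {"entities": entities})
print(train_example)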
Further, you can use a dependency parser to find which value is linked to which head (in case a value belongs to multiple keys).

You can also simply use pd.read_html(<pass your html here>), which returns a list of DataFrames you can use.
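A quick hedged example of that step (the HTML snippet is made up; pandas needs lxml or html5lib installed for read_html):

import io
import pandas as pd

html = "<table><tr><th>Animal</th></tr><tr><td>Horse</td></tr></table>"
dfs = pd.read_html(io.StringIO(html))  # returns one DataFrame per <table>
print(dfs[0])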
Thanks.

Related

finding organization and industry/sector from string in dbpedia

I am generating a short list of 10 to 20 strings that I want to look up on DBpedia, to see if they have an organization tag and, if so, return the industry/sector tag. I have been looking at the SPARQLWrapper queries on their website but am having trouble constructing one that returns the organization and sector/industry for my string. Is there a way to do this?
If I use the code below, I get what I think is a list of types rather than the industry of the company.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?industry WHERE
    { <http://dbpedia.org/resource/IBM> a ?industry }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
Instead of looking at queries that are meant to help you understand the querying tool, you should start by looking at the data being queried. For instance, just click http://dbpedia.org/resource/IBM and look at the properties (the left-hand column) to see its rdf:type values (of which there are MANY)!
Note that IBM is not described as a ?industry. IBM is described as a <http://dbpedia.org/resource/Public_company> (among other things). On the other hand, IBM is also described as having three values for <http://dbpedia.org/ontology/industry>:
<http://dbpedia.org/resource/Cloud_computing>
<http://dbpedia.org/resource/Information_technology>
<http://dbpedia.org/resource/Cognitive_computing>
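Based on that, here is a sketch of a query that asks for those dbo:industry values directly, using the same SPARQLWrapper pattern as in the question:

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?industry WHERE
    { <http://dbpedia.org/resource/IBM> <http://dbpedia.org/ontology/industry> ?industry }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Each binding holds one industry URI.
for binding in results["results"]["bindings"]:
    print(binding["industry"]["value"])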
I don't know whether these are what you're actually looking for or not, but hopefully what I've done above will start you down the right path to whatever you do want to get out of DBpedia.

Tokenization not working the same in both cases

I have a document
doc = nlp('x-xxmessage-id:')
When I want to extract the tokens of this one, I get 'x', 'xx', 'message', 'id' and ':'. Everything goes well.
Then I create a new document
test_doc = nlp('id')
If I try to extract the tokens of test_doc, I get 'i' and 'd'. Is there any way to get past this problem? I want to get the same token as above, and this is creating problems in the text processing.
Just like language itself, tokenization is context-dependent and the language-specific data defines rules that tell spaCy how to split the text based on the surrounding characters. spaCy's defaults are also optimised for general-purpose text, like news text, web texts and other modern writing.
In your example, you've come across an interesting case: the string "x-xxmessage-id:" is split on punctuation, while the isolated lowercase string "id" is split into "i" and "d", because in written text it is most commonly an alternate spelling of "I'd" or "i'd" ("I would", "I had", etc.). You can find the respective rules here.
If you're dealing with specific texts that are substantially different from regular natural language texts, you usually want to customise the tokenization rules or possibly even add a Language subclass for your own custom "dialect". If there's a fixed number of cases you want to tokenize differently that can be expressed by rules, another option would be to add a component to your pipeline that merges the split tokens back together.
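For instance, here is a minimal sketch of the first route, overriding the tokenizer's special-case rule so that "id" stays a single token (this assumes an installed English pipeline such as en_core_web_sm):

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")

# Override the built-in exception that splits "id" into "i" + "d".
nlp.tokenizer.add_special_case("id", [{ORTH: "id"}])

print([t.text for t in nlp("x-xxmessage-id: id")])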
Finally, you could also try using the language-independent xx / MultiLanguage class instead. It still includes very basic tokenization rules, like splitting on punctuation, but none of the rules specific to the English language.
from spacy.lang.xx import MultiLanguage
nlp = MultiLanguage()
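# Quick check on the strings from the question; without the English-specific
# exceptions, "id" is no longer split into "i" + "d".
doc = nlp('x-xxmessage-id: id')
print([token.text for token in doc])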

how to use pos tag as feature in Stanford NER training?

I am trying to use useTags and related features when training a Stanford NER CRF model. However, although I have specified in the .prop file that I want this feature, CoreAnnotations.PartOfSpeechAnnotation.class does not seem to return anything, and hence training does not use this feature at all. Is there something I did wrong? Thanks!
You need to specify which column in your training/test data holds the POS tag, and add the POS tags to the CoNLL training file.
You specify that column in this part of the properties:
map = word=0,answer=1,tag=2
(for example if you added the tags in the 3rd column)
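To illustrate, here is a hedged sketch of what that could look like; the tokens, labels and file name are made up. Tab-separated training data with the POS tag in the third column:

John	PERSON	NNP
lives	O	VBZ
in	O	IN
Boston	LOCATION	NNP

and the matching lines in the .prop file (useTags switches the feature on):

trainFile = train.tsv
map = word=0,answer=1,tag=2
useTags = true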

RDF vocabulary to define filters over a dataset?

I have a totally arbitrary data set with objects and their properties. The data set can contain pretty much anything. I want to explicitly mark some of the properties as searchable/filterable (I will use it when generating user interface on top of the data set). For example, let's say my data set contains people:
<http://www.jonhdoe.com> a schema:Person ;
schema:name "John Doe" .
Now I want to state that in my data set, objects can be searched using schema:name. So, something like this:
schema:name a filters:Filter ;
rdfs:label "Name of a person" .
Based on this definition, I can now generate a form field with given label and let the user search the data set using this field.
Is there an existing vocabulary that would allow me to define such metadata over my data set? I tried several vocabulary searches, but they weren't giving me good results.
It's not a 100% fit, but I think the Fresnel vocabulary might be close to what you're looking for. It allows you to specify information on how to display RDF data, using the notion of 'lenses' and 'formats'. Lenses define which properties for a given resource/class should be considered for display, formats in turn define how things should be rendered/displayed.
You can use this to define a 'searchable' lens, which defines the properties you want to allow search on.
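A hedged sketch of what such a lens could look like in Turtle (the ex: prefix and lens name are made up; fresnel:showProperties lists the properties you would treat as searchable):

@prefix fresnel: <http://www.w3.org/2004/09/fresnel#> .
@prefix schema:  <http://schema.org/> .
@prefix ex:      <http://example.org/lenses#> .

ex:searchableLens a fresnel:Lens ;
    fresnel:classLensDomain schema:Person ;
    fresnel:showProperties ( schema:name ) .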

Extracting Important words from a sentence using Node

I admit that I haven't searched extensively in the SO database. I tried reading the natural npm package, but it doesn't seem to provide this feature. I would like to know if the requirement below is somewhat possible.
I have a database with a list of all the cities of a country. I also have ratings for these cities (best place to live, worst place to live, best-rated city, worst-rated city, etc.). Now, from the user interface, I would like to let the user enter free text, and from there I should be able to search my database.
For example: Best place to live in California
or places near California
or places in California
From the above sentence, I want to extract only the nouns (maybe), as these will be the names of the city or country that I can search for.
Then extract 'best', which means I can sort in a particular order, etc.
Any suggestions or directions to look for?
I risk the chance that the question will be marked as 'debatable', but the reason I posted it is to get some direction to proceed.
[I came across this question whilst looking for some use cases to test a module I'm working on. Obviously the question is a little old, but since my module addresses the question I thought I might as well add some information here for future searchers.]
You should be able to do what you want with a POS chunker. I've recently released one for Node that is modelled on the chunkers provided by the NLTK (Python) and Stanford NLP (Java) libraries (the chunk() and TokensRegex() methods, respectively).
The module processes strings that already contain parts-of-speech, so first you'll need to run your text through a parts-of-speech tagger, such as pos:
var pos = require('pos');
var words = new pos.Lexer().lex('Best place to live in California');
var tags = new pos.Tagger()
.tag(words)
.map(function(tag){return tag[0] + '/' + tag[1];})
.join(' ');
This will give you:
Best/JJS place/NN to/TO live/VB in/IN California/NNP ./.
Now you can use pos-chunker to find all proper nouns:
var chunker = require('pos-chunker');
var places = chunker.chunk(tags, '[{ tag: NNP }]');
This will give you:
Best/JJS place/NN to/TO live/VB in/IN {California/NNP} ./.
Similarly you could extract verbs to understand what people want to do ('live', 'swim', 'eat', etc.):
var verbs = chunker.chunk(tags, '[{ tag: VB }]');
Which would yield:
Best/JJS place/NN to/TO {live/VB} in/IN California/NNP ./.
You can also match words, sequences of words and tags, use lookahead, group sequences together to create chunks (and then match on those), and other such things.
You probably don't have to identify what is a noun. Since you already have a list of city and country names that your system can handle, you just have to check whether the user input contains one of these names.
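A minimal sketch of that lookup approach in plain Node (the city list is hypothetical, e.g. loaded from your database):

var cities = ['California', 'Boston', 'Austin'];

function findKnownPlaces(input) {
  var lower = input.toLowerCase();
  // Keep every known place name that occurs in the user's text.
  return cities.filter(function (city) {
    return lower.indexOf(city.toLowerCase()) !== -1;
  });
}

console.log(findKnownPlaces('Best place to live in California'));  // ['California']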
Well, firstly you'll need to find a way to identify nouns. There is no core Node module or anything that can do this for you. You need to loop through all the words in the string and compare them against some kind of dictionary database, so you can look up each word and check whether it's a noun.
I found this API, which looks pretty promising. You query the API for a word and it sends you back a blob of data like this:
<?xml version="1.0" encoding="UTF-8"?>
<results>
  <result>
    <term>consistent, uniform</term>
    <definition>the same throughout in structure or composition</definition>
    <partofspeech>adj</partofspeech>
    <example>bituminous coal is often treated as a consistent and homogeneous product</example>
  </result>
</results>
You can see that it includes a partofspeech member which tells you that the word "consistent" is an adjective.
Another (and better) option if you have control over the text being stored is to use some kind of markup language to identify important parts of the string before you save it. Something like BBCode. I even found a BBCode node module that will help you do this.
Then you can save your strings to the database like this:
Best place to live in [city]California[/city] or places near [city]California[/city] or places in [city]California[/city].
or
My name is [first]Alex[/first] [last]Ford[/last].
If you're letting users type whole sentences of text and then trying to figure out which parts of those sentences are data you should use in your app, you're making things unnecessarily hard on yourself. You should either ask them to input the important pieces of data into their own text boxes, or give them a formatting language such as the aforementioned BBCode syntax so they can identify the important bits for you. The job of finding out which parts of a string are important is going to be a huge one, I think.
