Storing data obtained from Information Extraction - NLP

I have some experience with Java and I am a student doing my final year project.
I need to work on a project in Natural Language Processing. I am currently trying out the Stanford NLP libraries (but I am not locked into them, I can change my tool), so answers may refer to any tool suited to my problem.
I have planned to work on Information Extraction (IE) and have seen some pages/PDFs that explain how it works with various NLP techniques. The data will be processed with NLP, and I need to perform Information Retrieval (IR) on the processed data.
My problem now is: what data structure or storage medium should I use to store the data I have extracted using NLP techniques?
The data store must be able to support queries.
XML and JSON do not look like ideal candidates (I could be wrong); if they can work, some guidance on the best way to use them would be helpful.
My current view is to convert/store the parse tree in a data format that can be read directly for querying. (A parse tree is a diagrammatic representation of the parsed structure of a sentence or string.)
As a sample of the type of data that needs to be stored, for the text "My project is based on NLP." the dependencies would be as below:
root(ROOT-0, based-4)
poss(project-2, My-1)
nsubjpass(based-4, project-2)
auxpass(based-4, is-3)
prep(based-4, on-5)
pobj(on-5, NLP-6)

Have you already extracted the information or are you trying to store the parse tree? If the former, this is still an open question in NLP. See, for example, the book by Jurafsky and Martin, which discusses many ways to do this.
Basically, we can't answer until we know what you're trying to store. If it's super simple information, you might be able to get away with a simple relational database.
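If a simple relational database turns out to be enough, the following is a minimal sketch (assuming Python and its built-in sqlite3 module; the table and column names are made up for illustration) of storing the dependency triples shown above as rows and querying them with SQL:

import sqlite3

# Hypothetical schema: one row per dependency edge of a sentence.
conn = sqlite3.connect("parses.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS dependency (
        sentence_id INTEGER,
        relation    TEXT,   -- e.g. nsubjpass, auxpass, prep
        governor    TEXT,   -- head word, e.g. based-4
        dependent   TEXT    -- dependent word, e.g. project-2
    )
""")

edges = [
    ("root",      "ROOT-0",    "based-4"),
    ("poss",      "project-2", "My-1"),
    ("nsubjpass", "based-4",   "project-2"),
    ("auxpass",   "based-4",   "is-3"),
    ("prep",      "based-4",   "on-5"),
    ("pobj",      "on-5",      "NLP-6"),
]
conn.executemany("INSERT INTO dependency VALUES (1, ?, ?, ?)", edges)
conn.commit()

# Example query: which word is the passive subject of "based"?
for (dependent,) in conn.execute(
    "SELECT dependent FROM dependency "
    "WHERE relation = 'nsubjpass' AND governor LIKE 'based-%'"
):
    print(dependent)   # project-2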

Related

Transformation of tabular data into natural language for indexing for a search engine

How do I transform tabular data with various columns/rows, as shown below, into a more readable (natural language) form so that it can be indexed for the downstream tasks of a search engine? I am aware of TAPAS (TAPAS: Weakly Supervised Table Parsing via Pre-training), a variant of BERT (Google) that is specifically designed for QnA (question answering) over tabular data. But the problem is that we have an existing search service hosted in the cloud that is capable of reading natural language and answering based on that text. Therefore, while indexing the whole data (text, tables), we are losing valuable information in tables, as the inherent relationships between rows and columns are lost. The result is poor-quality answers for the information inside the table, or no answer at all.
Following is an example:
Which transformation of the tabular data into a readable (natural language) format is better for semantic search without losing context? Currently we do have a working solution, but the context is lost because the relationship inherent between the elements of columns/rows is lost, producing poor-quality answers or no answers. If we could somehow preserve this inherent relationship while feeding the data as natural language to the semantic search, it would improve the answer quality.
Please refer to the below table example.
Sample 1:
Question: How much of a feature 2 is allowed at PREMIUM_COMPANY for Name 4
Answer: Integer value
Sample 2:
Question: Is feature 2 allowed at PREMIUM_COMPANY for Name 7 / Name 8
Answer: Allowed in a list 1 / Not allowed at Name 8
While answering manually, we are able to preserve the relationship between two parameters within a column/row, whereas it is lost when we convert these HTML tables into normal text for indexing. Our problem here is to address that. There is a considerable amount of tabular data that is valuable.
A possible idea, though tough to integrate into the existing service, is to create a separate data structure (index) for the tabular data and apply TAPAS to it to retrieve the answers. We would still need to know how to flag tabular data so that it is triggered when a possible answer exists for a question.
Could you please answer if you have any expertise in this area?
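One common workaround, independent of TAPAS, is to linearize each table row into a self-contained sentence before indexing, so that the row/column relationships stay explicit in the indexed text. A minimal sketch in Python (the column names name, company, feature and value are hypothetical, since the referenced table itself is not shown):

# Hypothetical rows; the real table from the question is not shown.
rows = [
    {"name": "Name 4", "company": "PREMIUM_COMPANY", "feature": "feature 2", "value": "10"},
    {"name": "Name 7", "company": "PREMIUM_COMPANY", "feature": "feature 2", "value": "Allowed in a list 1"},
]

def linearize(row):
    # Turn one row into a sentence that keeps the column/value
    # relationships explicit for the downstream search index.
    return (f"For {row['name']} at {row['company']}, "
            f"{row['feature']} is {row['value']}.")

for row in rows:
    print(linearize(row))
# For Name 4 at PREMIUM_COMPANY, feature 2 is 10.
# For Name 7 at PREMIUM_COMPANY, feature 2 is Allowed in a list 1.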

Ontology Population

Hello everyone,
Because of my lack of experience with ontologies and Web semantics, I have a conceptual misunderstanding. When we refer to 'ontology population', do we make clones of the ontology with our concrete data, or do we map our concrete data to the ontology? If so, how is it done? My intention is to build a knowledge graph using an ontology (the FIBO ontology for the loans domain), and I also have an Excel file with loan data. Not every entry in my Excel file corresponds to the predefined ontology classes, but I suppose that is not a major problem. So, to make myself more clear: how do I practically populate the ontology?
Also, I would like to note that I am using Neo4j as a graph database and Python as my implementation language, so the population of the ontology would be done using its libraries.
Thanks in advance for your time!
This video could inform your understanding of modelling and imports for graph database design: https://www.youtube.com/watch?v=oXziS-PPIUA
He steps through importing a CSV into Neo4j and uses Python.
The terms ontology and Web semantics (OWL) are probably not what you're asking about (your domain being loans/finance rather than the Web). Furthermore, Web semantics is not taken very seriously by professionals these days.
"Graph database modelling" is probably a useful area of research to solve your problem.
I can recommend using Apache Jena to populate your ontology with the data source; you can use either Java or Python. The first step is extracting triples from the loaded data according to the RDF schema, which is the basis of triple extraction. The parser used in this step may differ depending on the data source; in your case it is the Excel file. After extracting the triples, an intermediate data model (IDM) is used for mapping from the triple format. The IDM could be in any format that is convenient for mapping, such as JSON. After mapping, the next step is loading the individuals from the intermediate data model into the RDF schema that was used previously, so that the schema now contains the individuals too. At this phase you should review the updated schema to check whether it needs more data, then run a logic reasoner to evaluate and correct possible problems and inconsistencies. If the reasoner runs without any error, the RDF schema now contains all the possible individuals and you can use it for visualisation in Neo4j.
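As a rough sketch of the population step described above, here is what it can look like in Python with pandas and rdflib instead of Jena (the namespace, class and property IRIs are placeholders rather than real FIBO terms, and the Excel column names are made up):

import pandas as pd
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# Placeholder namespace standing in for the real FIBO IRIs.
LOAN = Namespace("http://example.org/fibo-like/loans#")

g = Graph()
g.parse("fibo_loans_subset.ttl", format="turtle")   # the ontology/schema (hypothetical file)

df = pd.read_excel("loans.xlsx")   # hypothetical columns: loan_id, borrower, amount
for _, row in df.iterrows():
    loan = LOAN[f"loan_{row['loan_id']}"]
    g.add((loan, RDF.type, LOAN.Loan))                         # individual of class Loan
    g.add((loan, LOAN.hasBorrower, Literal(row["borrower"])))
    g.add((loan, LOAN.hasPrincipalAmount, Literal(float(row["amount"]))))

# The populated graph can then be checked with a reasoner and
# imported into Neo4j (e.g. via the neosemantics plugin).
g.serialize(destination="populated_loans.ttl", format="turtle")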

Coding convention: Matching backend with frontend labels

In my backend, I have data attributes labeled in camelCase:
customerStats: {
ownedProducts: 100,
usedProducts: 50,
},
My UI code is set up in a way that an array of ["label", data] pairs works best most of the time, i.e. it is most convenient for frontend coding. In my frontend, I need these labels to be in proper English spelling so they can be used as-is in the UI:
customerStats: [
["Owned products", 100],
["Used products", 50],
],
My question is about best practices or standards in web development. I have been inconsistent in my past projects where I would convert the data at random places, sometimes client-side, sometimes right on the backend, sometimes converting it one way and then back again because I needed the JSON data structure.
Is there a coding convention for how the data should be supplied to the frontend?
Right now all my data is transferred to the frontend as JSON. Is it best practice to convert the data to the form that is needed on the frontend, or on the backend? What if I need the JSON attributes for further calculations right on the client?
Technologies I am using:
Frontend: JavaScript / React
Backend: JavaScript / Node.js + Java / Java Spring
Is there a coding convention for how to transfer data to the front end?
If your front end is JavaScript based, then JSON (JavaScript Object Notation) is the simplest form to consume; it is a stringified version of the objects in memory. See this healthy discussion for more information on JSON.
Given that the most popular frontend development language these days is JavaScript (see the latest SO Survey on technology), it is very common and widely accepted to use the JSON format to transfer data between the back end and front end of solutions. The decision to use JSON in non-JavaScript-based solutions is influenced by the development and deployment tools that you use; since more developers are using JavaScript, most of our tools are engineered to support JavaScript in some capacity.
It is however equally acceptable to use other structured formats, like XML.
JSON is generally more light-weight than XML as there is less provision made to transfer meta-data about the schema. For in-house data streams, it can be redundant to transfer a fully specced XML schema with each data transmission, so JSON is a good choice where the structure of the data is not in question between the sender and receiver.
XML is a more formal format for data transmission, it can include full schema documentation that can allow receivers to utilize the information with little or no additional documentation.
CSV or other custom formats can reduce the bytes sent across the wire, but make it hard to visually inspect the data when you need to, and there is an overhead at both the sending and receiving end to format and parse the data.
Is it best practice to convert the data to the form that is needed on the frontend or backend?
The best practice is to reduce the number of times that a data element needs to be converted. Ideally you never have to convert between a label and the data property name... This is also the primary reason for using JSON as the data transfer format.
Because JSON can be natively interpreted in a JavaScript front end, we can essentially reduce conversion to just the server-side boundary where data is serialized/deserialized. There is in effect no conversion in the front end at all.
How to refer to data by the property name and not the visual label
The generally accepted convention in this space is to separate the concerns of the data model from those of the user experience, the view. Importantly, the view is the layer closest to the user; it represents a given 'point of view' of the data model.
It is hard to tailor a code solution for the OP without any language or code provided for context. In an abstract sense, applying this concept means not trying to have the data model carry the final information about how the data should be displayed; instead, another piece of code provides the information needed to present the data.
In different technologies and platforms we refer to this in different ways, but the core concept of separating the Model from the View or Presentation is consistently represented through these design patterns:
Exploring the MVC, MVP, and MVVM design patterns
MVP vs MVC vs MVVM vs VIPER
For OP's specific scenario, this might involve a mapping structure like the following:
customerStatsLabels: {
ownedProducts: "Owned products",
usedProducts: "Used products",
}
If this question is updated with some code around how the UI is constructed I will update this response with something more specific.
NOTE:
In JavaScript, an object can easily be treated as an array of [key, value] pairs (for example via Object.entries()), so it is very easy to tweak existing code that is based on arrays into code based on objects, and vice versa.

How to collect RDF triples for a simple knowledge graph?

When building a knowledge graph, the first step (if I understand it correctly) is to collect structured data, mainly RDF triples written using some ontology, for example Schema.org. Now, what is the best way to collect these RDF triples?
It seems there are two things we can do:
Use a crawler to crawl the web content and, for a specific page, search for RDF triples on that page. If we find them, collect them; if not, move on to the next page.
For the current page, instead of looking for existing RDF triples, use some NLP tools to understand the page content (such as NELL, see http://rtw.ml.cmu.edu/rtw/).
Now, is my understanding above (basically/almost) correct? If so, why do we use NLP? Why not just rely on the existing RDF triples? It seems like NLP is not as good/reliable as we are hoping… I could be completely wrong.
Here is another attempt at asking the same question.
Let us say we want to create RDF triples by using the third method mentioned by @AKSW, i.e., extracting RDF triples from some web pages (text).
For example, take this page. If you open it and use "view source", you can see quite a few semantic markups there (using OGP and Schema.org). So my crawler can simply do this: only crawl/parse these markups, easily convert them into RDF triples, then declare success and move on to the next page.
So what the crawler does on this text page is very simple: it only collects the semantic markup and creates RDF triples from it. It is simple and efficient.
The other choice is to use NLP tools to automatically extract structured semantic data from the same text (maybe we are not satisfied with the existing markup). Once we extract the structured information, we then create RDF triples from it. This is obviously much harder to do, and we are not sure about its accuracy either.
What is the best practice here, and what are the pros/cons? I would prefer the easy/simple way: simply collect the existing markup and convert it into RDF content, instead of using NLP tools.
I am not sure how many people would agree with this. Is this the best practice? Or is it simply a question of how far our requirements lead us?
Your question is unclear, because you did not state your data source, and all the answers on this page assumed it to be web markup. This is not necessarily the case, because if you are interested in structured data published according to best practices (called Linked Data), you can use so-called SPARQL endpoints to query Linked Open Data (LOD) datasets and generate your knowledge graph via federated queries. If you want to collect structured data from website markup, you have to parse markup to find and retrieve lightweight annotations written in RDFa, HTML5 Microdata, or JSON-LD. The availability of such annotations may be limited on a large share of websites, but for structured data expressed in RDF you should not use NLP at all, because RDF statements are machine-interpretable and easier to process than unstructured data, such as textual website content. The best way to create the triples you referred to depends on what you try to achieve.
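For the "collect the existing markup" route, here is a minimal sketch (assuming Python with requests, BeautifulSoup and an rdflib version with JSON-LD support; the URL is a placeholder) that pulls JSON-LD annotations out of a page and turns them into RDF triples:

import requests
from bs4 import BeautifulSoup
from rdflib import Graph

url = "https://example.org/some-page"        # placeholder URL
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

g = Graph()
# Schema.org annotations are often embedded as JSON-LD script blocks.
for script in soup.find_all("script", type="application/ld+json"):
    if script.string:
        g.parse(data=script.string, format="json-ld")   # needs JSON-LD support in rdflib

for s, p, o in g:
    print(s, p, o)                 # the extracted RDF triples
print(len(g), "triples collected")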

Which (in-memory) graph DB if data modeling is the focus?

I am out of ideas and hope to get some useful input. I am using this question to compress my experiences and share them, hoping to inspire some vendors to take the next step in treating data modeling for graph databases as a first-class concern.
I've been evaluating graph database solutions usable from Node.js for a few weeks. My use case is to save the interactions of different social network user accounts. The goal is to use CPU and memory in the most efficient way.
My most important requirements are:
in-memory (at least for indexing)
open source (and free to use)
JavaScript/Node.js treated as a first-class citizen (comparable performance)
comfortable query and modeling language
Neo4J
I really like Cypher, so my best choice would be Neo4j.
But the major issue with Neo4j is that JavaScript access is non-native. It uses the REST API, which is about ten times (10x) slower than direct Java access. So I took a look at node-neo4j-embedded, but it has been inactive for more than two years. It looks like its author isn't active at all (a bad sign).
ArangoDB
The really nice core developers of ArangoDB answered my question about internals. In short, JavaScript is a first-class citizen because native queries can be pushed out of JS. Looking at the open-source benchmarks, I think they are fair. But I am afraid they didn't use node-neo4j-embedded for their benchmark; the benchmarks compare the REST APIs (edited because of @weinberger's comment). I wish they had compared the native APIs (maybe someone is snoopy enough to give it a try! - let us know!). Update: as I have now noticed, OrientDB has answered the benchmark with a new node.js driver (using the Command Cache by starting the server with -Dcommand.cache.enabled=true -Dcommand.cache.minExecutionTime=3, which isn't fair, because it wasn't a query-cache benchmark!)
Because I would like to use ArangoDB as a graph database, I would have three choices (source: FAQ):
traverse JS objects
using AQL's graph functions
using the REST API
In general it isn't as comfortable as Cypher. And I am not sure how to compare them, or what the right way of modeling data is (Neo4j explains this very well). I'd love to have something like that for ArangoDB graphs. It feels like ArangoDB is focused on graph operations, while Neo4j better fits the needs of using graphs when you have more relations than rows (the reason to use graphs instead of relations with joins).
MongoDB
The document-based MongoDB isn't optimized for graph operations, but it has lately gained an experimental in-memory storage engine. There are also some projects, either in-memory or graph related, but nothing is really compelling. And judging from this discussion, it looks like MongoDB isn't what I want to use.
OrientDB
Because a comparison of OrientDB vs. MongoDB is available (from OrientDB), I thought about using this one. "OrientDB has a hybrid Document-Graph engine" using SQL. I am a former PHP/MySQL expert. But where is the modeling part? Their chapter on working with graphs is not Cypher-like; it is like using SQL for graphs. There is nothing wrong with that, but having used Cypher before, I miss the modeling feel.
If someone has gone through a modeling process with OrientDB and graphs, maybe you could write a tutorial like the one Neo4j has.
Update: about JavaScript access as a first-class citizen, there is news:
"In the next release the speed of this driver will be comparable to the native Java one." The forked node.js driver was fixed in the last few days.
Update: before choosing OrientDB one might want to read the article about some issues, and the discussions linked from there. The article touches a sensitive issue and should be approached with a critical mind. Note from the author of this update: I'm new to editing SO and don't have enough reputation to put this in the comments. I believe this information is a valid point for discussion; I'm not sure how to place it here according to SO rules.
LokiJS
Before I looked at Neo4j, ArangoDB and MongoDB, I played around with a JavaScript-based in-memory database called LokiJS, which seems to follow the strategy of ignoring everything that slows down performance and efficiency. LokiJS is trying to match the Mongo style (roadmap). The major issue is its poor ability to scale. Of course it isn't a graph database, but it was an interesting solution at the beginning of my project. It also wasn't a great experience to hunt down all the scattered documentation (maybe they should reboot with GitBook).
All in all, LokiJS is a very interesting project and I hope it will move forward!
LevelDB
Previously, when I wrote my degree paper, I looked at LevelDB. Remembering this while writing this post, I searched for an in-memory LevelDB and found a promising result called MemDown (see also). I haven't tested this find, but maybe someone has experience working with and modeling for this solution. It might be the most efficient option if none of the others fit, because I would simply write a lightweight Cypher clone with the goal of staying as lightweight as I can.
Edit: due to a comment, here is a link to LevelGraph. As an idea for implementing a Cypher parser for LevelGraph/LevelDB, your starting point would be to compare
Cypher:
CREATE (subject:Subject {name: "a"})-[predicate:PREDICATE]->(object:Object {name: "c"})
RETURN subject, predicate, object
LevelGraph:
db.put({ subject: "a", predicate: "b", object: "c" }, function (err) {
  // ...
});
Conclusion
As you have likely noticed, I am not a superhero when it comes to graphs. But this is my initial dive into the topic and I'm trying to get an overview. I assume there are a lot of people out there who want to ask the same questions as me but don't have the time. I hope this post will help a lot of people and will evolve through comments and answers into a well-done overview of how to model data for graphs.
#editors: You are welcome.
#commenters: This is the result of my personal research. If you have made a journey like mine, please answer with a short summary like the ones I have given for each DB I've evaluated (don't forget to address my four requirements).
The idea of combining Node-style performance through the native features (e.g. streams) with a high-level query language like Cypher is actually quite neat.
What you likely won't get is any kind of low-level API, since that is rather rare among DB authors and, presumably, not wanted in their design patterns. So long-running TCP connections should serve just fine.
cypher-stream seems to incorporate all of this, while (judged superficially) maintaining a good style.
Since you likely won't get any further with searching, I'd suggest sending its author a pull request if any other features are needed :)
You should take a look at GunDB: https://github.com/amark/gun
It's open source and has a very active and helpful lead developer.
Join us at https://gitter.im/amark/gun
