Ontology Populating - python-3.x

Hello everyone,
Because of my lack of experience with ontologies and web semantics, I have a conceptual misunderstanding. When we refer to 'ontology population', do we make clones of the ontology with our concrete data, or do we map our concrete data to the ontology? Whichever it is, how is it done? My intention is to build a knowledge graph using an ontology (the FIBO ontology for the loans domain), and I also have an Excel file with loan data. Not every entry in my Excel file corresponds to a predefined ontology class, but I suppose that is not a major problem. To be clear: how do I populate the ontology in practice?
Also, I would like to note that I am using Neo4j as the graph database and Python as my implementation language, so the ontology population would be done using their libraries.
Thanks in advance for your time!

This video could inform your understanding about modelling and imports for graph database design: https://www.youtube.com/watch?v=oXziS-PPIUA
He steps through importing a CSV into Neo4j using Python.
The terms ontology and web semantics (OWL) are probably not what you're asking about (your domain being loans/finance rather than the web). Furthermore, web semantics is not taken very seriously by professionals these days.
"Graph database modelling" is probably a useful area of research to solve your problem.

I recommend using Apache Jena to populate your ontology from the data source. Jena itself is a Java framework; from Python, an RDF library such as RDFLib plays a similar role. The first step is to extract triples from the loaded data according to the RDF schema, which is the basis of triple extraction. The parser used in this step depends on the data source, which in your case is the Excel file. After extracting triples, an intermediate data model (IDM) is used for the mapping from the triple format. The IDM can be in any format that is convenient for mapping, such as JSON. After mapping, the next step is to load the individuals from the intermediate data model into the RDF schema that was used earlier, so that the schema now contains the individuals as well. At this stage, review the updated schema to check whether it needs more data, then run a reasoner to detect and correct possible problems and inconsistencies. If the reasoner runs without errors, the RDF schema contains all the possible individuals and you can use it for visualisation in Neo4j.
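As a rough illustration of the "map rows to individuals" idea, here is a minimal Python sketch that uses only the standard library. It assumes the Excel sheet has been exported to CSV; the column names, the `example.org` namespaces, and the class and property names (`Loan`, `hasPrincipalAmount`, `hasBorrower`) are hypothetical placeholders and not real FIBO IRIs. The ontology is not cloned: each row becomes an individual typed with an ontology class, and columns with no mapping are simply skipped.

```python
import csv
import io

# Hypothetical namespaces for illustration only; these are NOT real FIBO IRIs.
ONT = "http://example.org/loan-ontology#"
DATA = "http://example.org/loans/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"

# Spreadsheet column -> ontology property. Columns without a mapping are ignored.
COLUMN_TO_PROPERTY = {
    "amount": ONT + "hasPrincipalAmount",
    "borrower": ONT + "hasBorrower",
}


def row_to_triples(row):
    """Turn one spreadsheet row into triples describing one Loan individual."""
    subject = f"<{DATA}loan/{row['loan_id']}>"
    triples = [(subject, f"<{RDF_TYPE}>", f"<{ONT}Loan>")]
    for column, prop in COLUMN_TO_PROPERTY.items():
        value = row.get(column)
        if not value:
            continue  # empty cell: emit nothing for this property
        if column == "amount":
            obj = f'"{value}"^^<{XSD_DECIMAL}>'  # typed literal
        else:
            obj = f"<{DATA}party/{value}>"  # link to another individual
        triples.append((subject, f"<{prop}>", obj))
    return triples


def populate(csv_text):
    """Read CSV text and return the triples for every row."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        triples.extend(row_to_triples(row))
    return triples


def to_ntriples(triples):
    """Serialise triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Made-up sample data; "branch" has no mapping and is therefore skipped.
SAMPLE = "loan_id,amount,borrower,branch\nL001,25000,acme,north\nL002,,beta,south\n"

if __name__ == "__main__":
    print(to_ntriples(populate(SAMPLE)))
```

In practice an RDF library would handle escaping and serialisation, and the resulting file could then be loaded into Neo4j, for example with the neosemantics (n10s) plugin.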

Related

Transformation of tabular data into natural language for indexing for a search engine

How can tabular data with various columns and rows, as shown below, be transformed into a more readable (natural-language) form so that it can be indexed for the downstream tasks of a search engine? I am aware of TAPAS (TAPAS: Weakly Supervised Table Parsing via Pre-training), a BERT variant from Google designed specifically for question answering over tabular data. The problem is that we have an existing search service hosted in the cloud that reads natural language and answers questions based on that text. When we index the whole data set (text and tables), we lose valuable information in the tables because the inherent relationships between rows and columns are lost. The result is poor-quality answers, or no answer at all, for information inside the tables.
Following is an example:
Which transformation of tabular data into a readable (natural-language) format works best for semantic search without losing context? We currently have a working solution, but the context is lost because the relationships among the elements of columns and rows are lost, which produces poor-quality answers or none. If we could somehow preserve these relationships while feeding the content to semantic search as natural language, answer quality would improve.
Please refer to the below table example.
Sample 1:
Question: How much of a feature 2 is allowed at PREMIUM_COMPANY for Name 4
Answer: Integer value
Sample 2:
Question: Is feature 2 allowed at PREMIUM_COMPANY for Name 7 / Name 8
Answer: Allowed in a list 1 / Not allowed at Name 8
When answering manually, we can preserve the relationship between two parameters within a column or row, whereas it is lost when we convert these HTML tables into plain text for indexing. That is the problem we need to address, because a considerable amount of valuable data is tabular.
One possible idea, though hard to integrate into the existing service, is to create a separate data structure (index) for the tabular data and apply TAPAS to it to retrieve answers. We would still need to know how to flag tabular data so that this path is triggered when a possible answer to a question exists there.
Please answer if you have expertise in this area.
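One simple way to keep the row/column relationship when flattening a table is to emit one sentence per cell that repeats both the row key and the column header. The sketch below is only an illustration of that idea in Python: the function name is hypothetical, and the header and cell values are made up, because the original table is not reproduced here.

```python
def verbalize_table(header, rows, row_key=0):
    """Return one sentence per non-empty cell, naming both its row and its column."""
    sentences = []
    for row in rows:
        for i, cell in enumerate(row):
            if i == row_key or cell == "":
                continue  # skip the key column itself and empty cells
            sentences.append(f"{header[i]} for {row[row_key]} is {cell}.")
    return sentences


# Made-up table contents, loosely modelled on the sample questions above.
header = ["Name", "Feature 1 at PREMIUM_COMPANY", "Feature 2 at PREMIUM_COMPANY"]
rows = [
    ["Name 4", "Allowed", "12"],
    ["Name 8", "", "Not allowed"],
]

if __name__ == "__main__":
    for sentence in verbalize_table(header, rows):
        print(sentence)
```

Each sentence is self-contained, so a text-only index still sees which name and which feature a value belongs to. Whether this improves answers in a particular search service would have to be measured.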

Resource files as repositories in DDD

I have a system that needs to compute some operations based on complex formulas. Some of them require choosing values from tabular data (like an Excel sheet). Hard-coding this tabular data is a mess, so I have decided to keep the data tables in CSV files and look up their values when I need them. This data does not represent any entity, so I am unsure whether these files should be exposed as DDD repositories.
The only reason to use CSV files is code clarity. If I have to wrap each file in a repository interface and inject it into an entity every time I have to compute something, the code loses readability.
On the other hand, in DDD, entities have to be independent of any architecture implementation. If the entities read the CSV files themselves, I'm coupling them to an "external resource".
Putting CSV reading code in a repository is exactly what you should do. I don’t understand the part about this making the code less readable – it should make things more readable because there would be a single method call on a repository whose purpose is well known.
Using a repository has other advantages as well. For unit tests you do not have to create CSV files specifically for a particular test. Instead you can mock the repository to return any value necessary for the test. Also, should the implementation of that data change at some point in the future it will be much easier to update to the new data source if the code is in a repository.
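A minimal Python sketch of this arrangement follows. All names (`CoefficientRepository`, `compute_charge`, the `category` and `coefficient` columns) are hypothetical and chosen only for illustration. The domain function depends on a small interface, one implementation reads the CSV data, and a second in-memory implementation stands in for it in unit tests.

```python
import csv
import io
from typing import Protocol


class CoefficientRepository(Protocol):
    """Interface the domain code depends on; it says nothing about CSV."""

    def coefficient_for(self, category: str) -> float: ...


class CsvCoefficientRepository:
    """Loads the lookup table once from any file-like object containing CSV."""

    def __init__(self, csv_file):
        reader = csv.DictReader(csv_file)
        self._table = {row["category"]: float(row["coefficient"]) for row in reader}

    def coefficient_for(self, category: str) -> float:
        return self._table[category]


class InMemoryCoefficientRepository:
    """Test double: no files needed, values are supplied directly."""

    def __init__(self, table):
        self._table = dict(table)

    def coefficient_for(self, category: str) -> float:
        return self._table[category]


def compute_charge(base: float, category: str, repo: CoefficientRepository) -> float:
    """Domain calculation: a single, well-named call replaces the table lookup code."""
    return base * repo.coefficient_for(category)


if __name__ == "__main__":
    csv_data = io.StringIO("category,coefficient\nA,1.5\nB,2.0\n")
    print(compute_charge(100.0, "B", CsvCoefficientRepository(csv_data)))
    print(compute_charge(100.0, "A", InMemoryCoefficientRepository({"A": 1.5})))
```

The calculation never sees a file path, so replacing the CSV files with another data source later only requires a new repository implementation.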

How to collect RDF triples for a simple knowledge graph?

When building a knowledge graph, the first step (if I understand it correctly), is to collect structured data, mainly RDF triples written by using some ontology, for example, Schema.org. Now, what is the best way to collect these RDF triples?
It seems there are two things we can do.
Use a crawler to crawl the web content and, for each page, search for RDF triples on that page. If we find them, collect them. If not, move on to the next page.
For the current page, instead of looking for existing RDF triples, use some NLP tools to understand the page content (such as using NELL, see http://rtw.ml.cmu.edu/rtw/).
Now, is my understanding above (basically/almost) correct? If so, why do we use NLP? Why not just rely on the existing RDF triples? It seems that NLP is not as good or reliable as we would hope… I could be completely wrong.
Here is another attempt at asking the same question.
Let us say we want to create RDF triples by using the 3rd method mentioned by #AKSW, i.e., extract RDF triples from some web pages (text).
For example, this page. If you open it and use "view source", you can see quite a few semantic markups there (using OGP and Schema.org). So my crawler can simply do this: crawl and parse ONLY these markups, convert them into RDF triples, declare success, and move on to the next page.
So what the crawler has done on this text page is very simple: only collect semantic markups and create RDF triples from these markup. It is simple and efficient.
The other choice is to use NLP tools to automatically extract structured semantic data from this same text (maybe we are not satisfied with the existing markups). Once we extract the structured information, we then create RDF triples from it. This is obviously much harder to do, and we are not sure about its accuracy either (?).
What is the best practice here, and what are the pros and cons? I would prefer the easy, simple way: collect the existing markup and convert it into RDF content, instead of using NLP tools.
I am not sure how many people would agree with this. Is this the best practice? Or is it simply a question of how far our requirements lead us?
Your question is unclear, because you did not state your data source, and all the answers on this page assumed it to be web markup. This is not necessarily the case, because if you are interested in structured data published according to best practices (called Linked Data), you can use so-called SPARQL endpoints to query Linked Open Data (LOD) datasets and generate your knowledge graph via federated queries. If you want to collect structured data from website markup, you have to parse markup to find and retrieve lightweight annotations written in RDFa, HTML5 Microdata, or JSON-LD. The availability of such annotations may be limited on a large share of websites, but for structured data expressed in RDF you should not use NLP at all, because RDF statements are machine-interpretable and easier to process than unstructured data, such as textual website content. The best way to create the triples you referred to depends on what you try to achieve.
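For the markup route, a hedged Python sketch of collecting JSON-LD annotations from a page is shown below, using only the standard library. The HTML sample and its `example.org` identifier are invented for illustration, and `flat_triples` is a deliberately naive flattening that ignores `@context` expansion and nested objects; a real JSON-LD processor does considerably more. RDFa and Microdata would need their own parsers.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def flat_triples(doc):
    """Naive flattening: top-level scalar values only, no context expansion."""
    subject = doc.get("@id", "_:b0")  # blank-node label if the object has no @id
    triples = []
    for key, value in doc.items():
        if key in ("@context", "@id"):
            continue
        predicate = "rdf:type" if key == "@type" else key
        if isinstance(value, (str, int, float)):
            triples.append((subject, predicate, value))
    return triples


def extract_triples(html):
    """Parse an HTML string and return triples from every JSON-LD object found."""
    parser = JsonLdExtractor()
    parser.feed(html)
    triples = []
    for doc in parser.documents:
        if isinstance(doc, dict):
            triples.extend(flat_triples(doc))
    return triples


# Invented sample page for illustration.
PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@id": "https://example.org/book/1",
 "@type": "Book", "name": "Example Title"}
</script></head><body>Text</body></html>"""

if __name__ == "__main__":
    for triple in extract_triples(PAGE):
        print(triple)
```

A crawler built this way moves on when a page yields no documents, which corresponds to the simple "collect existing markup" strategy described in the question.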

Storing data obtained from Information Extraction

I have some experience with Java and am a student working on my final-year project.
I need to work on a project in natural language processing. I am currently trying the Stanford NLP libraries (but I am not locked into them and can change tools), so answers can be for any tool suited to my problem.
I plan to work on information extraction (IE) and have read some pages and PDFs that explain how it works with various NLP techniques. The data will be processed with NLP, and I then need to perform information retrieval (IR) on the processed data.
My problem now is: what data structure or storage medium should I use to store the data I have obtained using NLP techniques?
The data store must support queries.
XML and JSON do not look like ideal candidates (I could be wrong); if they can be used, some guidance on the best way to do it would be helpful.
My current view is to convert the parse tree into a data format that can be read directly for querying (parse tree: a diagrammatic representation of the parsed structure of a sentence or string).
As a sample of the type of data that needs to be stored, for the text "My project is based on NLP." the dependencies would be as follows:
root(ROOT-0, based-4)
poss(project-2, My-1)
nsubjpass(based-4, project-2)
auxpass(based-4, is-3)
prep(based-4, on-5)
pobj(on-5, NLP-6)
Have you already extracted the information or are you trying to store the parse tree? If the former, this is still an open question in NLP. See, for example, the book by Jurafsky and Martin, which discusses many ways to do this.
Basically, we can't answer until we know what you're trying to store. If it's super simple information, you might be able to get away with a simple relational database.
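If the goal is simply to store and query the dependency relations listed in the question, a relational table is enough. The following Python sketch uses the standard-library `sqlite3` module with an in-memory database; the table and column names are assumptions made for illustration, and the regular expression only targets the `relation(governor-index, dependent-index)` format shown above.

```python
import re
import sqlite3

# The dependency lines from the question.
DEPENDENCIES = """root(ROOT-0, based-4)
poss(project-2, My-1)
nsubjpass(based-4, project-2)
auxpass(based-4, is-3)
prep(based-4, on-5)
pobj(on-5, NLP-6)"""

# relation(governor-index, dependent-index)
PATTERN = re.compile(r"(\w+)\((.+)-(\d+), (.+)-(\d+)\)")


def parse(line):
    """Split one dependency line into (relation, governor, gov_idx, dependent, dep_idx)."""
    match = PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognised dependency line: {line!r}")
    rel, gov, gov_idx, dep, dep_idx = match.groups()
    return rel, gov, int(gov_idx), dep, int(dep_idx)


def build_store(text):
    """Create an in-memory SQLite database holding one row per dependency."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dependency ("
        "relation TEXT, governor TEXT, governor_idx INTEGER, "
        "dependent TEXT, dependent_idx INTEGER)"
    )
    conn.executemany(
        "INSERT INTO dependency VALUES (?, ?, ?, ?, ?)",
        [parse(line) for line in text.splitlines()],
    )
    return conn


def dependents_of(conn, word):
    """Return (relation, dependent) pairs governed by `word`, in sentence order."""
    cursor = conn.execute(
        "SELECT relation, dependent FROM dependency "
        "WHERE governor = ? ORDER BY dependent_idx",
        (word,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    store = build_store(DEPENDENCIES)
    print(dependents_of(store, "based"))
```

For several sentences or documents, the table would also need sentence and document identifiers, which are omitted here.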

Concerns about Core Data

I'm getting ready to dive into my first Core Data adventure. While evaluating the framework, two questions came up that made me wonder whether to use Core Data at all for this project or to stick with SQLite.
My app will heavily rely upon importing data from an external source. I'm aware that one can import into Core Data but handling complex relationships seems complicated and tedious. Is there an easy way to accomplish complex imports?
The app has to be able to execute complex queries spanning multiple tables or having multiple conditions. Building these predicates and expressions simply scares me...
Is it worth taking the plunge and using Core Data, or should I stick with SQLite?
As I and others have said before, Core Data is really an object-graph management framework. It manages the relationships between model objects, including constraints on their cardinality, and manages cascading deletes etc. It also manages constraints on individual attributes. Core Data just happens to also be able to persist that object graph to disk. It can do this in a number of formats, including XML, binary, and via SQLite. Thus, Core Data is really orthogonal to SQLite. If your task is dealing with an embedded SQL-compatible database, go with SQLite. If your task is managing the model layer of an MVC app, go with Core Data. In specific answers to your questions:
There is no magic that can automatically import complex data into any model. That said, it is relatively easy in Core Data. Taking a multi-pass approach and using the SQLite backend can help with memory consumption by allowing you to keep only a subset of the data in memory at a time. If the data sets can be kept in memory, you can write a custom persistent store format that reads/writes directly to your legacy data format from within Core Data (see the Atomic Store Programming Guide).
Building a complex NSPredicate declaratively is somewhat verbose but shouldn't scare you. The Predicate Programming Guide is a good place to start. You can, of course, also write predicates using a string format, much like a string-formatted SQL statement. It's worth noting that, as described above, the predicates in Core Data are on the objects and object graph, not on the SQL tables. If you really want to think at the level of tables, stick with SQLite and write your own wrapper.
I can't really speak to your first point.
However, regarding your second point, using Core Data means you don't have to really worry about complex queries since you can just pretend that all the relationships are properly established in memory already (Apple's implementation details aside). It doesn't matter how complex a join it might be in a database environment because you really aren't in a database environment. If you need to get the fourth child of the grandparent of your current object and then find that child's pet's name and breed, all you do is traverse up the object tree in code using a series of messages or properties. No worries about joins or anything. The only problem is it might be really slow depending on your objects' relationships, but I can't really speak accurately to that since I haven't actually implemented anything using Core Data (I've just read about it extensively on Apple's and others' websites).
If the data importer for the external source is written against the same Core Data model (for the destination side of the import), nothing is conceptually different from using or updating the same data through the Core Data stack in your actual application.
If you create the data importer without using the Core Data stack, make sure you thoroughly learn the database schema that the Core Data model generates and expects. There is no magic there; just make sure you follow how the cross-entity relationships are implemented and how entity hierarchies are stored.
I recently had to create a data importer from an Access database into a Core Data-based SQLite store as a .NET app. Once my destination Core Data model was defined, I created a small app that populated the SQLite store with randomly generated entities (including all the expected relationships). Then I reverse engineered how Core Data actually created the SQLite store for the model and how it handles the relationships by studying the generated and persisted data. Then I implemented the .NET-based importer/data transformer according to my observations. In the end, I got a perfectly Core Data-friendly data store that could be opened and modified by the application that used the Core Data stack on Mac OS X.
