How to modify a parse tree in ANTLR4? - antlr4

I have used ANTLR4 to write a Fortran parser. Now I have the parse tree (there is no AST in ANTLR4). My next task is to modify the parse tree according to my needs, such as inserting new data declaration statements and replacing current statements. I looked for addChild in the ANTLR Java API documentation, but there seems to be no such method in RuleNode. So what should I do?

One way is to embed your code in the grammar file. This makes things rather messy.
Another way is to write your classes in a separate file, create the required objects in the @parser::members {...} section or in the action parts of your rules, and use them to gather details from the grammar. This way, you get the information out of the grammar and can model your data with your own classes.
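Since the question is about inserting and replacing nodes, here is a minimal, self-contained sketch of that "separate classes" idea in plain Python (no ANTLR runtime; all class and method names are hypothetical): mirror the parse tree into your own mutable nodes, edit those, and unparse the result.

```python
class Node:
    """A mutable tree node mirroring a parse-tree rule context."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = list(children or [])

    def add_child(self, node, index=None):
        """Append a child, or insert it at a given position."""
        if index is None:
            self.children.append(node)
        else:
            self.children.insert(index, node)

    def replace_child(self, index, node):
        self.children[index] = node

    def unparse(self):
        """Emit source text: leaves carry text, inner nodes join children."""
        if not self.children:
            return self.label
        return "\n".join(c.unparse() for c in self.children)


# Mirror of a tiny Fortran program unit (shape is illustrative only).
program = Node("program", [
    Node("INTEGER :: i"),
    Node("i = 1"),
])

# Insert a new declaration and replace an existing statement.
program.add_child(Node("REAL :: x"), index=1)
program.replace_child(2, Node("i = 2"))

print(program.unparse())
```

A listener or visitor walking the real ANTLR parse tree would populate and edit nodes like these, and the modified source is regenerated from your own tree rather than from ANTLR's.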
Good Luck!

Related

Ontology Populating

Hello everyone,
Because of my lack of experience with ontologies and web semantics, I have a conceptual misunderstanding. When we refer to 'ontology population', do we make clones of the ontology with our concrete data, or do we map our concrete data to the ontology? If the latter, how is it done? My intention is to build a knowledge graph using an ontology (the FIBO ontology for the loans domain), and I also have an Excel file with loans data. Not every entry in my Excel file corresponds to the predefined ontology classes; however, I suppose that is not a major problem. So, to make myself clear: how do I practically populate the ontology?
Also, I would like to note that I am using Neo4j as a graph database and Python as my implementation language, so the population of the ontology would be done using Python libraries.
Thanks in advance for your time!
This video could inform your understanding of modelling and imports for graph database design: https://www.youtube.com/watch?v=oXziS-PPIUA
He steps through importing a CSV into Neo4j and uses Python.
The terms ontology and web semantics (OWL) are probably not what you're asking about (this being the loans/finance domain rather than the web). Furthermore, web semantics is not taken very seriously by professionals these days.
"Graph database modelling" is probably a useful area of research to solve your problem.
I can recommend using Apache Jena to populate your ontology with the data source; you can use either Java or Python. The first step is extracting triples from the loaded data according to the RDF schema, which is the basis of triple extraction. The parser used in this step may differ depending on the data source; in your case it is the Excel file. After extracting the triples, an intermediate data model (IDM) is used for mapping from the triple format. The IDM could be in any format useful for mapping, such as JSON. After mapping, the next step is loading the individuals from the intermediate data model into the RDF schema that was previously used, so that the schema now contains the individuals too. At this phase you should review the updated schema to check whether it needs more data, then run a logic reasoner to evaluate and correct possible problems and inconsistencies. If the reasoner runs without any error, the RDF schema now contains all the possible individuals, and you can use it for visualisation in Neo4j.
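The extract-map-load pipeline above can be sketched as follows. This is a self-contained illustration in plain Python (no Jena); the FIBO-style class and property names and the row layout are assumptions for illustration, not the real ontology.

```python
import json

# Hypothetical loan rows as they might come out of the Excel file.
rows = [
    {"loan_id": "L001", "borrower": "Acme Ltd", "amount": 250000},
    {"loan_id": "L002", "borrower": "Beta GmbH", "amount": 90000},
]

# Step 1: extract (subject, predicate, object) triples per row.
def extract_triples(row):
    subject = f"ex:Loan/{row['loan_id']}"
    yield (subject, "rdf:type", "fibo:Loan")
    yield (subject, "fibo:hasBorrower", row["borrower"])
    yield (subject, "fibo:hasPrincipalAmount", row["amount"])

triples = [t for row in rows for t in extract_triples(row)]

# Step 2: intermediate data model (IDM) as JSON for the mapping step.
idm = json.dumps([{"s": s, "p": p, "o": o} for s, p, o in triples])

# Step 3: load the individuals from the IDM into an in-memory
# graph keyed by subject (a real run would load into the RDF schema).
graph = {}
for entry in json.loads(idm):
    graph.setdefault(entry["s"], []).append((entry["p"], entry["o"]))

print(len(graph))  # → 2 loan individuals
```

Each stage is replaceable: the extractor changes with the data source, and the JSON IDM decouples extraction from loading.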

Workflow for interpreting linked data in .ttl files with Python RDFLib

I am using turtle files containing biographical information for historical research. Those files are provided by a major library and most of the information in the files is not explicit. While people's professions, for instance, are sometimes stated alongside links to the library's URIs, I only have URIs in the majority of cases. This is why I will need to retrieve the information behind them at some point in my workflow, and I would appreciate some advice.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
I have also seen that there are ways to convert RDFs directly to CSV, but although CSV is nice to work with, I would get a lot of unwanted "background noise" by simply converting all the data.
What would you recommend?
RDFLib is all about working with RDF data. If you have RDF data, my suggestion is to do as much RDF-native work as you can and only export to CSV if you want to do something like print tabular results or load them into Pandas DataFrames. Of course, there is always more than one way to do things, so you could manipulate the data in CSV; but RDF, by design, carries far more information than a CSV file can, so when you manipulate RDF data directly you have more to get hold of.
most of the information in the files is not explicit
Better phrased: most of the information is indicated with objects identified by URIs, not given as literal values.
I want to use Python's RDFLib for parsing the .ttl files. What is your recommended workflow? Should I read the prefixes I am interested in first, then store the results in .txt (?) and then write a script to retrieve the actual information from the web, replacing the URIs?
No! You should store the .ttl files you can get, and then you may indeed retrieve all the other data referred to by URI. Presumably that data is also in RDF form, so you should download it into the same graph you loaded the initial .ttl files into; then you have the full graph, with links and literal values in it, at your disposal to manipulate with SPARQL queries.
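A minimal sketch of that workflow, with plain Python sets standing in for an rdflib Graph and the remote lookup (the URIs and predicates are made up for illustration; in a real run, rdflib would parse the .ttl files and the dereferenced URIs would return RDF):

```python
# Local .ttl contents, already parsed into triples (rdflib's job).
local = {
    ("ex:person/1", "ex:profession", "lib:prof/42"),
    ("ex:person/1", "ex:name", '"Ada"'),
}

# Stand-in for dereferencing a library URI on the web.
REMOTE = {
    "lib:prof/42": {("lib:prof/42", "rdfs:label", '"mathematician"')},
}

def enrich(graph):
    """Merge the triples behind every object URI into the same graph."""
    merged = set(graph)
    for _, _, obj in graph:
        merged |= REMOTE.get(obj, set())
    return merged

graph = enrich(local)

# A join over the full graph now resolves each URI to its label,
# which is what a SPARQL query would do over the merged rdflib Graph.
labels = {
    (s, o2)
    for s, p, o in graph if p == "ex:profession"
    for s2, p2, o2 in graph if s2 == o and p2 == "rdfs:label"
}
print(labels)
```

The point is that there is no intermediate .txt step: everything lands in one graph, and the URI-to-literal resolution becomes a query rather than a text-replacement pass.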

How to collect RDF triples for a simple knowledge graph?

When building a knowledge graph, the first step (if I understand it correctly), is to collect structured data, mainly RDF triples written by using some ontology, for example, Schema.org. Now, what is the best way to collect these RDF triples?
It seems there are two things we can do.
Use a crawler to crawl web content, and for a specific page, search for RDF triples on that page. If we find them, collect them; if not, move on to the next page.
For the current page, instead of looking for existing RDF triples, use some NLP tools to understand the page content (such as using NELL, see http://rtw.ml.cmu.edu/rtw/).
Now, is my understanding above (basically/almost) correct? If so, why do we use NLP at all? Why not just rely on the existing RDF triples? NLP does not seem as good/reliable as we are hoping… I could be completely wrong.
Here is another attempt at asking the same question.
Let us say we want to create RDF triples using the third method mentioned by @AKSW, i.e., extract RDF triples from some web pages (text).
For example, this page. If you open it and use "view source", you can see quite a few semantic mark-ups there (using OGP and Schema.org). So my crawler can simply do this: crawl/parse ONLY these mark-ups, easily convert them into RDF triples, declare success, and move on to the next page.
So what the crawler has done on this text page is very simple: it only collects the semantic markup and creates RDF triples from it. That is simple and efficient.
The other choice is to use NLP tools to automatically extract structured semantic data from the same text (maybe we are not satisfied with the existing markup). Once we extract the structured information, we then create RDF triples from it. This is obviously a much harder thing to do, and we cannot be sure about its accuracy either.
What is the best practice here, and what are the pros and cons? I would prefer the easy/simple way: simply collect the existing markup and turn it into RDF content, instead of using NLP tools.
I am not sure how many people would agree with this. Is it the best practice? Or is it simply a question of how far our requirements lead us?
Your question is unclear, because you did not state your data source, and all the answers on this page assumed it to be web markup. This is not necessarily the case, because if you are interested in structured data published according to best practices (called Linked Data), you can use so-called SPARQL endpoints to query Linked Open Data (LOD) datasets and generate your knowledge graph via federated queries. If you want to collect structured data from website markup, you have to parse the markup to find and retrieve lightweight annotations written in RDFa, HTML5 Microdata, or JSON-LD. The availability of such annotations may be limited on a large share of websites, but for structured data expressed in RDF you should not use NLP at all, because RDF statements are machine-interpretable and easier to process than unstructured data such as textual website content. The best way to create the triples you referred to depends on what you are trying to achieve.
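For the markup-only approach, here is a self-contained sketch using only the Python standard library: it pulls JSON-LD blocks out of an HTML page and flattens them into naive triples. The blank-node subject `_:b0` and the one-level flattening are simplifications for illustration, not a full JSON-LD-to-RDF algorithm.

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self.in_jsonld = True

    def handle_data(self, data):
        if self.in_jsonld:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            text = "".join(self.buffer).strip()
            if text:
                self.blocks.append(json.loads(text))
            self.buffer = []
            self.in_jsonld = False


html = """<html><head>
<script type="application/ld+json">
{"@type": "Article", "name": "Example", "author": "Jane Doe"}
</script></head><body>page text the crawler ignores</body></html>"""

parser = JSONLDExtractor()
parser.feed(html)

# Flatten each JSON-LD object into naive (subject, predicate, object) triples.
triples = [("_:b0", k, v) for obj in parser.blocks for k, v in obj.items()]
print(triples)
```

This is the cheap path the asker prefers: no NLP, just harvesting annotations that publishers already embedded.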

Storing data obtained from Information Extration

I have some experience with Java, and I am a student doing my final-year project.
I need to work on a project in natural language processing. I am currently trying to work with the stanford-nlp libraries (but I am not locked to them; I can change my tool), so answers can assume any tool appropriate for my problem.
I have planned to work on Information Extraction (IE) and have seen some pages/PDFs that explain how it works with various NLP techniques. The data will be processed with NLP, and I need to perform Information Retrieval (IR) on the processed data.
My problem now is: what data structure or storage medium should I use to store the data I have retrieved using NLP techniques?
The data store must be able to support queries.
XML and JSON do not look like ideal candidates (I could be wrong); if they can work, some help/guidance on the best way to do it would be welcome.
My current view is to convert/store the parse tree in a data format that can be read directly for querying. (Parse tree: a diagrammatic representation of the parsed structure of a sentence or string.)
A sample of the type of data that needs to be stored: for the text "My project is based on NLP.", the dependencies would be as below.
root(ROOT-0, based-4)
poss(project-2, My-1)
nsubjpass(based-4, project-2)
auxpass(based-4, is-3)
prep(based-4, on-5)
pobj(on-5, NLP-6)
Have you already extracted the information or are you trying to store the parse tree? If the former, this is still an open question in NLP. See, for example, the book by Jurafsky and Martin, which discusses many ways to do this.
Basically, we can't answer until we know what you're trying to store. If it's super simple information, you might be able to get away with a simple relational database.
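For the simple-relational-database route, the dependency triples from the question map naturally onto a single table. A minimal sketch using Python's built-in sqlite3 (the table and column names are just illustrative):

```python
import sqlite3

# The Stanford dependencies from the question, as (relation, head, dependent).
deps = [
    ("root", "ROOT-0", "based-4"),
    ("poss", "project-2", "My-1"),
    ("nsubjpass", "based-4", "project-2"),
    ("auxpass", "based-4", "is-3"),
    ("prep", "based-4", "on-5"),
    ("pobj", "on-5", "NLP-6"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dependency (relation TEXT, head TEXT, dependent TEXT)"
)
conn.executemany("INSERT INTO dependency VALUES (?, ?, ?)", deps)

# Query: every dependent hanging off the token "based-4".
rows = conn.execute(
    "SELECT relation, dependent FROM dependency WHERE head = ?",
    ("based-4",),
).fetchall()
print(rows)
```

Each sentence's parse flattens into rows, and the "support query" requirement becomes plain SQL (add a sentence-id column to store more than one sentence).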

What is the act of creating objects from XML data called?

In the context of loading an XML file, what's a good name for the step in which you create internal data structures (be they objects, structs, or whatever) to hold the data in memory? What do you usually call the other steps?
LOAD, OPEN, or READ the xml, by opening a file.
PARSE the xml, with some XML parser.
??? the xml, creating data structures.
Options that have come to mind for step 3 are: handle, create_foobars, create_foobars_from_xml, or even read, load, or parse.
One other option that comes to mind is to have an object's constructor take an xml entity, but I'm not fond of coupling the objects to the xml schema like that.
Deserialization is the correct term for the "???" part of your question. If you want to convert the object back to XML, then that would be (you guessed it) serialization.
Deserialization or unmarshalling.
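A minimal sketch of step 3 as deserialization, using Python's standard library (`Foobar` and the element/attribute names are hypothetical):

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class Foobar:
    name: str
    size: int

def deserialize(xml_text):
    """Parse the XML, then build internal objects from the parsed tree."""
    root = ET.fromstring(xml_text)  # step 2: PARSE
    return [                        # step 3: DESERIALIZE
        Foobar(name=el.get("name"), size=int(el.get("size")))
        for el in root.iter("foobar")
    ]

xml_text = (
    '<foobars>'
    '<foobar name="a" size="1"/>'
    '<foobar name="b" size="2"/>'
    '</foobars>'
)
items = deserialize(xml_text)
print(items)
```

Note the objects never see the XML schema directly: the `deserialize` function is the only place that knows about element names, which avoids the constructor-coupling the asker wanted to avoid.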
