I am trying to build a topic hierarchy by following these two DBpedia properties:
skos:broader property
dcterms:subject property
My intention is, given a word, to identify its topics. For example, given the term 'support vector machine', I want to identify topics such as classification algorithm, machine learning, etc.
However, I am a bit confused about how to build the topic hierarchy, as I get more than 5 URIs for the subject property and many URIs for the broader property. Is there a way to measure strength (or something similar) so that I can discard the extra URIs I get from DBpedia and keep only the most probable one?
It seems there are two questions here:
How to limit the number of DBpedia Spotlight results.
How to limit the number of subjects and categories for a particular result.
My current code is as follows.
from SPARQLWrapper import SPARQLWrapper, JSON
import requests
import urllib.parse

## initial consts
BASE_URL = 'http://api.dbpedia-spotlight.org/en/annotate?text={text}&confidence={confidence}&support={support}'
TEXT = 'First documented in the 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third Reich (1933–45). Berlin in the 1920s was the third largest municipality in the world. After World War II, the city became divided into East Berlin -- the capital of East Germany -- and West Berlin, a West German exclave surrounded by the Berlin Wall from 1961–89. Following German reunification in 1990, the city regained its status as the capital of Germany, hosting 147 foreign embassies.'
CONFIDENCE = '0.5'
SUPPORT = '120'

REQUEST = BASE_URL.format(
    text=urllib.parse.quote_plus(TEXT),
    confidence=CONFIDENCE,
    support=SUPPORT
)
HEADERS = {'Accept': 'application/json'}
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
all_urls = []

# annotate the text with DBpedia Spotlight and collect the resource URIs
r = requests.get(url=REQUEST, headers=HEADERS)
response = r.json()
resources = response['Resources']
for res in resources:
    all_urls.append(res['@URI'])  # Spotlight's JSON keys are prefixed with '@'

# for every resource, query DBpedia for its subjects and broader categories
for url in all_urls:
    sparql.setQuery("""
        SELECT * WHERE {
            <""" + url + """> skos:broader|dct:subject ?resource
        }
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for result in results["results"]["bindings"]:
        print('resource ---- ', result['resource']['value'])
I am happy to provide more examples if needed.
It seems you are trying to retrieve Wikipedia categories relevant to a given paragraph.
Minor suggestions
First, I'd suggest performing a single SPARQL request, collecting the DBpedia Spotlight results into a VALUES clause, for example like this:
values = '(<{0}>)'.format('>) (<'.join(all_urls))
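For illustration, if Spotlight returned (hypothetically) just the Berlin and Germany resources, the resulting string would be:
# all_urls == ['http://dbpedia.org/resource/Berlin', 'http://dbpedia.org/resource/Germany']
# values   == '(<http://dbpedia.org/resource/Berlin>) (<http://dbpedia.org/resource/Germany>)'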
Second, if you are talking about a topic hierarchy, you should use SPARQL 1.1 property paths.
These two suggestions are slightly incompatible: Virtuoso is very inefficient when a query contains both multiple starting points (i.e. VALUES) and arbitrary-length paths (i.e. the * and + operators).
Below I'm using the dct:subject/skos:broader property path, i.e. retrieving the 'next-level' categories.
Approach 1
The first way is to order resources by their general popularity, e.g. their PageRank:
values = '(<{0}>)'.format('>) (<'.join(all_urls))
sparql.setQuery(
    """PREFIX vrank:<http://purl.org/voc/vrank#>
    SELECT DISTINCT ?resource ?rank
    FROM <http://dbpedia.org>
    FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
    WHERE {
        VALUES (?s) {""" + values + """ }
        ?s dct:subject/skos:broader ?resource .
        ?resource vrank:hasRank/vrank:rankValue ?rank .
    } ORDER BY DESC(?rank)
    LIMIT 10
    """)
Results are:
dbc:Member_states_of_the_United_Nations
dbc:Country_subdivisions_of_Europe
dbc:Republics
dbc:Demography
dbc:Population
dbc:Countries_in_Europe
dbc:Third-level_administrative_country_subdivisions
dbc:International_law
dbc:Former_countries_in_Europe
dbc:History_of_the_Soviet_Union_and_Soviet_Russia
Approach 2
The second way is to count how frequently each category occurs among the resources found in the given text:
values = '(<{0}>)'.format('>) (<'.join(all_urls))
sparql.setQuery(
    """SELECT ?resource count(?resource) AS ?count WHERE {
        VALUES (?s) {""" + values + """ }
        ?s dct:subject/skos:broader ?resource
    } GROUP BY ?resource
    # https://github.com/openlink/virtuoso-opensource/issues/254
    HAVING (count(?resource) > 1)
    ORDER BY DESC(count(?resource))
    LIMIT 10
    """)
Results are:
dbc:Wars_by_country
dbc:Wars_involving_the_states_and_peoples_of_Europe
dbc:Wars_involving_the_states_and_peoples_of_Asia
dbc:Wars_involving_the_states_and_peoples_of_North_America
dbc:20th_century_in_Germany
dbc:Modern_history_of_Germany
dbc:Wars_involving_the_Balkans
dbc:Decades_in_Germany
dbc:Modern_Europe
dbc:Wars_involving_the_states_and_peoples_of_South_America
With dct:subject instead of dct:subject/skos:broader, results are better:
dbc:Former_polities_of_the_Cold_War
dbc:Former_republics
dbc:States_and_territories_established_in_1949
dbc:20th_century_in_Germany_by_period
dbc:1930s_in_Germany
dbc:Modern_history_of_Germany
dbc:1990_disestablishments_in_West_Germany
dbc:1933_disestablishments_in_Germany
dbc:1949_establishments_in_West_Germany
dbc:1949_establishments_in_Germany
Conclusion
The results are not very good. I see two reasons: DBpedia categories are quite arbitrary, and these tools are quite primitive. Perhaps better results could be achieved by combining approaches 1 and 2. In any case, experiments with a large corpus are needed.
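A minimal sketch of such a combination (untested), reusing the PageRank graphs from approach 1 and the grouping from approach 2, and ordering by in-text frequency first and PageRank second:
sparql.setQuery(
    """PREFIX vrank:<http://purl.org/voc/vrank#>
    SELECT ?resource (COUNT(?s) AS ?count) (MAX(?rank) AS ?maxRank)
    FROM <http://dbpedia.org>
    FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
    WHERE {
        VALUES (?s) {""" + values + """ }
        ?s dct:subject ?resource .
        ?resource vrank:hasRank/vrank:rankValue ?rank .
    } GROUP BY ?resource
    ORDER BY DESC(?count) DESC(?maxRank)
    LIMIT 10
    """)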
I want to get the latitude and longitude of the companies listed in an already-cleaned dataframe, but the only information I have is the company name and the country (in this case just the UK).
DataFrame
After trying different things, I have obtained some of the latitudes and longitudes, but in most cases they are not located in the UK.
This is the code I tried:
base_url= "https://maps.googleapis.com/maps/api/geocode/json?"
AUTH_KEY = "AI**************QTk"
geolocator = GoogleV3(api_key = AUTH_KEY)
parameters = {"address": "Revolut, London",
"key": AUTH_KEY}
print(f"{base_url}{urllib.parse.urlencode(parameters)}")
r = requests.get(f"{base_url}{urllib.parse.urlencode(parameters)}")
data = json.loads(r.content)
data.get("results")[0].get("geometry").get("location") #That works for the first company
df["loc"] = df["Company name for communication"].apply(geolocator.geocode)
df["point"]= df["loc"].apply(lambda loc: tuple(loc.point) if loc else None)
df[['lat', 'lon', 'altitude']] = pd.DataFrame(df['point'].to_list(), index=df.index)
DataFrame with long and lat wrong
I would really appreciate any help. Let me know if my explanation is not clear and I will provide more details. Thank you!
If you are only trying to get Geocoding API results in the UK, then you would want to make use of component filtering.
The Geocoding API can return address results restricted to a specific area. You can specify the restriction using the components filter. For more information, see Component Filtering. Specifically, you would want to include the country.
Note that the value should be a country name or a two-letter ISO 3166-1 country code. The API follows the ISO standard for defining countries, and the filtering works best when using the corresponding ISO code of the country.
For example, here is a sample Geocoding web request with country component filtering for the UK:
https://maps.googleapis.com/maps/api/geocode/json?address=high+st+hasting&components=country:gb&key=YOUR_API_KEY
This will only return results located in the UK, and will return zero results if none are available.
You may also want to take a look at region biasing.
Note that if you bias for a region, the API prefers results in that country but does not restrict them to it, and may still return results for an address elsewhere. Unlike component filtering, region biasing takes a ccTLD (country code top-level domain) argument specifying the region bias. Most ccTLD codes are identical to ISO 3166-1 codes, with some notable exceptions. For example, the United Kingdom's ccTLD is "uk" (.co.uk) while its ISO 3166-1 code is "gb" (technically for the entity of "The United Kingdom of Great Britain and Northern Ireland").
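For example, a region-biased web request for the same address would look something like this (again, it only prefers results in the region, it does not restrict to it):
https://maps.googleapis.com/maps/api/geocode/json?address=high+st+hasting&region=uk&key=YOUR_API_KEY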
Please also take a look at the Geocoding API Best Practices.
I got the results using component filtering with this code:
# Get the location of each company, restricted to the UK
base_url = "https://maps.googleapis.com/maps/api/geocode/json?"
AUTH_KEY = "AI********************Tk"
geolocator = GoogleV3(api_key=AUTH_KEY)
components = [('country', 'GB')]

def get_location(x):
    return geolocator.geocode(x, components=components)

df["loc"] = df["Company name for communication"].apply(get_location)
df["point"] = df["loc"].apply(lambda loc: tuple(loc.point) if loc else None)
df[['lat', 'lon', 'altitude']] = pd.DataFrame(df['point'].to_list(), index=df.index)
df
DataFrame with lat and lon
I am playing with WordNet and trying to solve an NLP task.
I was wondering whether there is any way to get a list of words belonging to some large set, such as "animals" (e.g. dog, cat, cow), "countries", "electronics", etc.
I believe that it should be possible to somehow get this list by exploiting hypernyms.
Bonus question: do you know of any other way to classify words into very large classes, besides "noun", "adjective" and "verb"? For example, classes like "prepositions", "conjunctions", etc.
Yes, you just check if the category is a hypernym of the given word.
from nltk.corpus import wordnet as wn

def has_hypernym(word, category):
    # Assume the category always uses the most popular sense
    cat_syn = wn.synsets(category)[0]
    # For the input, check all senses
    for syn in wn.synsets(word):
        for match in syn.lowest_common_hypernyms(cat_syn):
            if match == cat_syn:
                return True
    return False
has_hypernym('dog', 'animal') # => True
has_hypernym('bucket', 'animal') # => False
If the broader word (the "category" here) is the lowest common hypernym, that means it is an ancestor (a transitive hypernym) of the query word, so the query word is in the category.
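Equivalently (a sketch, not part of the original code, with a hypothetical helper name), you could walk each sense's full hypernym closure and check whether the category synset appears among the ancestors:
def has_hypernym_closure(word, category):
    # The category matches if it is any transitive hypernym (ancestor)
    # of at least one sense of the word.
    cat_syn = wn.synsets(category)[0]
    return any(cat_syn in syn.closure(lambda s: s.hypernyms())
               for syn in wn.synsets(word))

has_hypernym_closure('dog', 'animal')  # => True, same as above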
Regarding your bonus question, I have no idea what you mean. Maybe you should look at NER or open a new question.
With some help from polm23, I found this solution, which exploits similarity between words, and prevents wrong results when the class name is ambiguous.
The idea is that WordNet can be used to compare a list of words with the string animal and compute a similarity score. From the nltk.org webpage:
Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).
from nltk.corpus import wordnet as wn

def keep_similar(words, similarity_thr):
    # keep only the words whose first noun sense is similar enough to 'animal'
    similar_words = []
    w2 = wn.synset('animal.n.01')
    for word in words:
        if wn.synset(word + '.n.01').wup_similarity(w2) > similarity_thr:
            similar_words.append(word)
    return similar_words
For example, if word_list = ['dog', 'car', 'train', 'dinosaur', 'London', 'cheese', 'radon'], the corresponding scores are:
0.875
0.4444444444444444
0.5
0.7
0.3333333333333333
0.3076923076923077
0.3076923076923077
This can easily be used to generate a list of animals by setting a proper value of similarity_thr.
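For instance, given the scores above, a threshold of 0.6 keeps only the animal-like words:
word_list = ['dog', 'car', 'train', 'dinosaur', 'London', 'cheese', 'radon']
keep_similar(word_list, 0.6)  # => ['dog', 'dinosaur'] (scores 0.875 and 0.7)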
I have two text files. One is a list of key-value pairs and the other is an input file against which the key-value pairs are to be matched. If a match is found, it is marked with its corresponding value in the input file.
For example:
my list file:
food = ###food123
food microbiology = ###food mircobiology456
mirco organism = ###micro organims789
geo tagging = ###geo tagging614
gross income = ###gross income630
fermentation = fermentation###929
contamination = contamination##878
Salmonella species = Salmonella species###786
Lactic acid bacteria = Lactic acid bacteria###654
input file:
There are certain attributes necessary for fermentation of meat.
It should be fresh, with low level of contamination, the processing should be hygienic and the refrigeration should be resorted at different stages.
Elimination of pathogens like Coliform, Staphylococci, Salmonella species may be undertaken either by heat or by irradiation. There is need to lower the water activity and this can be achieved by either drying or addition of the salts.
Inoculation with an effective, efficient inoculum consisting of Lactic acid bacteria and, or Micrococci which produces lactic acid and also contributes to the flavor development of the product.
Effective controlled time, temperature humidity during the production is essential.
And, Salt ensures the low pH value and extends the shelf-life of the fermented meats like Sausages.
Expected Output:
There are certain attributes necessary for ((fermentation###929)) of meat.
It should be fresh, with low level of ((contamination##878)), the processing should be hygienic and the refrigeration should be resorted at different stages.
Elimination of pathogens like Coliform, Staphylococci, ((Salmonella species###786)) may be undertaken either by heat or by irradiation. There is need to lower the water activity and this can be achieved by either drying or addition of the salts.
Inoculation with an effective, efficient inoculum consisting of ((Lactic acid bacteria###654)) and, or Micrococci which produces lactic acid and also contributes to the flavor development of the product.
Effective controlled time, temperature humidity during the production is essential.
And, Salt ensures the low pH value and extends the shelf-life of the fermented meats like Sausages.
For this I am using Python 3, parsing the list file and storing it in a hash (dict). The hash holds all the elements of the list as key-value pairs. Then each line of the input file is matched against every key in the hash, and when a match is found the key is replaced with the corresponding hash value, as shown in the output.
This method works fine when the input and the list are small, but as both grow it takes a lot of time.
How can I improve the time complexity of this matching method?
Algorithm I am using:
import re

# parse the list file (one "key = value" pair per line) and store it in a dict;
# `list` holds the lines of the list file, `lines` the lines of the input file
hash = {}
for l in list:
    ll = l.split("=")
    hash[ll[0].strip()] = ll[1].strip()  # strip the whitespace around '='

# iterate over the input lines and match each line against every key
keys = hash.keys()
for line in lines:
    if line != "":
        for key in keys:
            my_regex = r"([,\"\'\( \/\-\|])" + key + r"([ ,\.!\"।\'\/\-)])"
            if re.search(my_regex, line, re.IGNORECASE | re.UNICODE):
                line = re.sub(my_regex, r"\1" + "((" + hash[key] + "))" + r"\2", line)
I have 2 CSV files. One file, called Control.csv, contains a lot of data: words, numbers and also symbols. The other, called Dictionary.csv, contains only words. Below is the code I have tried so far:
import io
import csv

f1 = io.open('Control.csv', 'r', errors='ignore').readlines()
f2 = io.open('Diseases.csv', 'r', errors='ignore').readlines()
f3 = io.open('Compare.csv', 'w')
c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)
dictionary = list(c2)

for Control_row in c1:
    row = 1
    found = False
    for Diseases_row in dictionary:
        Compare_row = Control_row
        if Control_row == Diseases_row:
            Compare_row.append('FOUND in dictionary (row ' + str(row) + ')')
            found = True
            break
        row = row + 1
    if not found:
        Compare_row.append('NOT FOUND in dictionary')
    c3.writerow(Compare_row)
I want to compare the 2 CSV files: if any row in Control.csv contains any word that exists in Dictionary.csv, that row should be transferred to a new file called Compare.csv. Each row in Control.csv has a lot of words, numbers and symbols.
Problem: I keep getting NOT FOUND in dictionary even though there are rows in Control.csv that contain words from Dictionary.csv. Also, rows that do not contain any words from Dictionary.csv are transferred into Compare.csv as well.
Example:
Control.csv
1. The National Nutrition and Physical Activity Guidelines For Childcare Centres is an action to reduce the rate of obesity among children. The action takes into consideration that preventive measures and the practise of a healthy lifestyle should be instilled earlier from school, towards producing a health literate generation who appreciates good health.
2. In line with further development of the country's tourism industry as a core goal of organising Regatta 2019, the event also attracted the participant of foreign teams.The participation of teams from abroad shows the traditional sports is also capable of uniting the local community and visitors from other countries. The participants took the opportunity to enjoy the excitement and joy of rowing in waters.
Dictionary.csv
obesity
diseases
Compare.csv (should have)
1 The National Nutrition and Physical Activity Guidelines For Childcare Centres is an action to reduce the rate of obesity among children. The action takes into consideration that preventive measures and the practise of a healthy lifestyle should be instilled earlier from school, towards producing a health literate generation who appreciates good health.
I'm testing the following sentence to extract entity values:
s = "Height: 3m, width: 4.0m, others: 3.4 m, 4m, 5 meters, 10 m. Quantity: 6."
sent = nlp(s)
for ent in sent.ents:
print(ent.text, ent.label_)
And I got some misleading values:
3 CARDINAL
4.0m CARDINAL
3.4 m CARDINAL
4m CARDINAL
5 meters QUANTITY
10 m QUANTITY
6 CARDINAL
Namely, in "3m" the number 3 is not paired with its unit m. This happens in many of my examples, so I can't rely on this engine when I want to separate metre measurements from plain quantities.
Should I do this manually?
One potential difficulty in your example is that it's not very close to natural language. The pre-trained English models were trained on ~2m words of general web and news text, so they're not always going to perform perfectly out of the box on text with a very different structure.
While you could update the model with more examples of QUANTITY in your specific texts, I think that a rule-based approach might actually be a better and more efficient solution here.
The example in this blog post is actually very close to what you're trying to do:
import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")

weights_pattern = [
    {"LIKE_NUM": True},
    {"LOWER": {"IN": ["g", "kg", "grams", "kilograms", "lb", "lbs", "pounds"]}}
]
patterns = [{"label": "QUANTITY", "pattern": weights_pattern}]
ruler = EntityRuler(nlp, patterns=patterns)
nlp.add_pipe(ruler, before="ner")

doc = nlp("U.S. average was 2 lbs.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('U.S.', 'GPE'), ('2 lbs', 'QUANTITY')]
The statistical named entity recognizer respects pre-defined entities and will "predict around" them. So if you're adding the EntityRuler before it in the pipeline, your custom QUANTITY entities will be assigned first and will be taken into account when the entity recognizer predicts labels for the remaining tokens.
Note that this example is using the latest version of spaCy, v2.1.x. You might also want to add more patterns to cover different constructions. For more details and inspiration, check out the documentation on the EntityRuler, combining models and rules and the token match pattern syntax.
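For the metre examples in the question, the extra patterns might look roughly like this (a sketch, untested; the single-token regex branch is an assumption for unspaced forms such as "3m", which the tokenizer may keep as one token). The resulting patterns can be passed to the EntityRuler exactly as above:
length_patterns = [
    # a number token followed by a unit token, e.g. "5 meters", "10 m"
    [{"LIKE_NUM": True},
     {"LOWER": {"IN": ["m", "meter", "meters", "metre", "metres"]}}],
    # a single token such as "3m" or "4.0m", where number and unit are not split
    [{"TEXT": {"REGEX": r"^\d+(\.\d+)?m$"}}],
]
patterns = [{"label": "QUANTITY", "pattern": p} for p in length_patterns]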