dbpedia - are only English articles indexed? - dbpedia

I would like to query dbpedia for articles in different languages, e.g. Hungarian. Here is an example query: it searches for articles with the name 'Budapest' (capital of Hungary).
http://dbpedia.org/sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbprop: <http://dbpedia.org/property/>
PREFIX db: <http://dbpedia.org/resource/>
SELECT ?article ?url ?name WHERE {
  ?article foaf:isPrimaryTopicOf ?url .
  ?article foaf:name ?name .
  FILTER regex(?name, 'Budapest')
}
LIMIT 100
Note: the query takes a while to execute because of the regex matching.
There are Wikipedia articles with this name in both English and Hungarian; however, the query returns English articles only (all URLs are under the en.wikipedia.org domain).
Are articles in other languages indexed in DBpedia? If so, how can I modify the query to find the Hungarian articles too?

Yes, only English literals (including abstracts) are in the public endpoint.
If you want to query abstracts in other languages:
Prepare a triple store on your localhost (e.g. Virtuoso).
Load the long-abstracts_hu.ttl.bz2 file (Hungarian DBpedia) into a graph of your choice. (Note: depending on the triple store, you might have to extract the .bz2 file or convert it to .gz first.)
Run a federated query over the public DBpedia endpoint and your local store.
If you run into trouble, feel free to ask for assistance.
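The last step could look roughly like the sketch below, which only builds the federated query text and the request URL for the local store. Several things here are assumptions, not facts from the answer: the local endpoint URL (Virtuoso's usual default port), the graph name (hypothetical), the use of dbo:abstract as the abstract property, and that the loaded dump uses the same resource URIs as the public endpoint, since otherwise the join on ?article matches nothing.

```python
from urllib.parse import urlencode

PUBLIC_ENDPOINT = "http://dbpedia.org/sparql"      # endpoint from the question
LOCAL_ENDPOINT = "http://localhost:8890/sparql"    # assumption: local Virtuoso on its default port
LOCAL_GRAPH = "http://example.org/dbpedia-hu"      # hypothetical graph holding the Hungarian abstracts


def build_federated_query(name, lang="hu", limit=100):
    """Return a SPARQL 1.1 query joining local abstracts with the public endpoint.

    The name is interpolated as-is, so this is only suitable for trusted input.
    """
    return f"""PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?article ?url ?abstract WHERE {{
  GRAPH <{LOCAL_GRAPH}> {{
    ?article dbo:abstract ?abstract .
    FILTER (lang(?abstract) = "{lang}")
  }}
  SERVICE <{PUBLIC_ENDPOINT}> {{
    ?article foaf:isPrimaryTopicOf ?url ;
             foaf:name ?name .
    FILTER regex(?name, "{name}")
  }}
}}
LIMIT {limit}"""


def build_request_url(query):
    """Build a GET URL that sends the query to the local store."""
    return LOCAL_ENDPOINT + "?" + urlencode({"query": query})


query = build_federated_query("Budapest")
url = build_request_url(query)
```

The SERVICE block is what makes the query federated: the local store evaluates the GRAPH pattern itself and forwards the inner pattern to the public endpoint.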

Related

Azure Cognitive Search - No results using wildcards on content with DOT

I'm using Azure Cognitive Search to build a rich search experience inside a web application, but I'm facing the following issue: one field of the index contains codes like "Z.A.01.12", "A.A.44.11" and so on. I'm trying to use the wildcard * as a suffix to find all results that start with a given value, for example Z.A.
"Z.A.01.12" -> Z.A* => No results found.
"Z.A.01.12" -> Z.A.* => No results found.
"Z.A.01.12" -> Z\.A* => No results found.
I have tried different analyzers (standard Lucene, en.microsoft, whitespace and keyword). Even when only a single token containing the entire content is produced (for example with the whitespace analyzer), a wildcard query still returns "No results found".
I have already set queryType=full and searchMode=any. I also tried escaping the . with "\", but the result is always empty. Is there anything I can do to handle these cases?

Combining phrases from list of words Python3

I'm doing my best to extract information from a lot of PDF files. I have the data in a dictionary where each key is a date and each value is a list of occupations.
It looks like this when correct:
'12/29/2014': [['COUNSELING',
'NURSING',
'NURSING',
'NURSING',
'NURSING',
'NURSING']]
However, occasionally there are occupations with several words which cannot be reliably understood in single word-form, such as this:
'11/03/2014': [['DENTISTRY',
'OSTEOPATHIC',
'MEDICINE',
'SURGERY',
'SOCIAL',
'SPEECH-LANGUAGE',
'PATHOLOGY']]
Notice that "osteopathic medicine & surgery" and "speech-language pathology" are the full text for two of these entries. This gets hairier when we also have examples of just "osteopathic medicine" or even "medicine."
So my question is this - How should I go about testing combinations of these words to see if they match more complex occupational titles? I can use the same order of the words, as I have maintained that from the source.
Thanks!
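One possible approach, since the word order is preserved, is a greedy longest-match pass: at each position, try the longest run of consecutive words first and fall back to shorter runs. This is only a sketch under assumptions. The function name merge_titles and the KNOWN_TITLES list are hypothetical, and you would need your own list of valid multi-word titles (written here without the "&", because the extracted words do not contain it).

```python
def merge_titles(words, known_titles, max_len=4):
    """Greedily merge consecutive words into the longest known title."""
    # Store known titles as tuples of upper-case words for exact comparison.
    known = {tuple(title.upper().split()) for title in known_titles}
    merged = []
    i = 0
    while i < len(words):
        # Try the longest window first so "OSTEOPATHIC MEDICINE SURGERY"
        # wins over "OSTEOPATHIC MEDICINE" or plain "MEDICINE".
        for size in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(word.upper() for word in words[i:i + size])
            # A single word is always accepted as a fallback.
            if size == 1 or candidate in known:
                merged.append(" ".join(candidate))
                i += size
                break
    return merged


# Hypothetical list of multi-word titles; replace it with your own.
KNOWN_TITLES = [
    "OSTEOPATHIC MEDICINE SURGERY",
    "OSTEOPATHIC MEDICINE",
    "SPEECH-LANGUAGE PATHOLOGY",
]

words = ["DENTISTRY", "OSTEOPATHIC", "MEDICINE", "SURGERY",
         "SOCIAL", "SPEECH-LANGUAGE", "PATHOLOGY"]
result = merge_titles(words, KNOWN_TITLES)
print(result)
```

Words that do not start any known title, such as "SOCIAL" here, are passed through unchanged, which makes leftover fragments easy to spot and review by hand.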

Finding Related Topics using Google Knowledge Graph API

I'm currently working on a behavioral targeting application and I need a reasonably large keyword database, tool, or provider that lets my app look up keywords similar to a given keyword. I recently found that Freebase provided a similar service before Google acquired it and integrated it into the Knowledge Graph. I was wondering if it's possible to get a list of related topics/keywords for a given entity.
import json
import urllib.parse
import urllib.request

api_key = 'API_KEY_HERE'
query = 'Yoga'
service_url = 'https://kgsearch.googleapis.com/v1/entities:search'
params = {
    'query': query,
    'limit': 10,
    'indent': True,
    'key': api_key,
}
# Build the request URL and decode the JSON response.
url = service_url + '?' + urllib.parse.urlencode(params)
response = json.loads(urllib.request.urlopen(url).read())
# Print each entity name with its result score.
for element in response['itemListElement']:
    print(element['result']['name'] + ' (' + str(element['resultScore']) + ')')
The script above returns the results below, but I'd like to get topics related to yoga, such as health, fitness, gym and so on, rather than things that have the word "Yoga" in their name.
Yoga Sutras of Patanjali (71.245544)
Yōga, Tokyo (28.808222)
Sri Aurobindo (28.727333)
Yoga Vasistha (28.637642)
Yoga Hosers (28.253984)
Yoga Lin (27.524054)
Patanjali (27.061115)
Yoga Journal (26.635073)
Kripalu Center (26.074436)
Yōga Station (25.10318)
I'd really appreciate any suggestions, and I'm also open to using any other API if there is any that I could make use of. Cheers.
I see your point :) Here's the script I use for that, built on Serpstat's API. Here's how it works:
The script collects the keywords from Serpstat's database
Then it collects search suggestions from Serpstat's database
Finally, it collects search suggestions from Google's suggestions
Note that for the script to work correctly, it's best to fill in all input boxes, although not all of them are required.
Keyword — required keyword
Search Engine — a search engine for which the analysis will be carried out. For example, for the US Google, you need to set the g_us. The entire list of available search engines can be found here.
Limit the maximum number of phrases from the organic results that take part in the analysis. You cannot set more than 1000 here.
Default keys — list of two-word keywords. You should give each of them some "weight" to receive some kind of result if something goes wrong.
Format: type; keyword; weight (semicolon-separated, as in the examples below). Write each keyword on a new line.
Types:
w — one word
p — two words
Examples:
"w; bottle; 50" — initial weight of word bottle is 50.
"p; plastic bottle; 30" — initial weight of phrase plastic bottle is 30.
"w; plastic bottle; 20" — incorrect. You cannot use a two-word phrase for the "w" type.
Bad words — comma-separated list of words you want the script to exclude from the results.
Token — here you need to enter your token for API access. It can be found on your profile page.
You can download the source code for script here
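To make the "Default keys" format concrete, here is a small parser for the lines described above. It is not taken from the Serpstat script. It is only a sketch that assumes the semicolon-separated layout shown in the examples, and the function name parse_default_keys is hypothetical.

```python
def parse_default_keys(text):
    """Parse 'type; keyword; weight' lines into (type, keyword, weight) tuples."""
    # Number of words each type allows: "w" is one word, "p" is two words.
    words_per_type = {"w": 1, "p": 2}
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = [part.strip() for part in line.split(";")]
        if len(parts) != 3:
            raise ValueError(f"expected 'type; keyword; weight', got: {line!r}")
        kind, keyword, weight = parts
        if kind not in words_per_type:
            raise ValueError(f"unknown type {kind!r} in: {line!r}")
        if len(keyword.split()) != words_per_type[kind]:
            raise ValueError(f"keyword does not fit type {kind!r}: {line!r}")
        entries.append((kind, keyword, int(weight)))
    return entries


entries = parse_default_keys("w; bottle; 50\np; plastic bottle; 30")
print(entries)
```

The third example above ("w; plastic bottle; 20") is rejected with a ValueError, because a two-word phrase does not fit the "w" type.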

Get default stop word list in elastic search

I am trying to find out what the predefined stop word lists for Elasticsearch are, but I have found no documented read API for this.
So, I want to find the word lists for these predefined variables (_arabic_, _armenian_, _basque_, _brazilian_, _bulgarian_, _catalan_, _czech_, _danish_, _dutch_, _english_, _finnish_, _french_, _galician_, _german_, _greek_, _hindi_, _hungarian_, _indonesian_, _irish_, _italian_, _latvian_, _norwegian_, _persian_, _portuguese_, _romanian_, _russian_, _sorani_, _spanish_, _swedish_, _thai_, _turkish_)
I found the English stop word list in the documentation, but I want to check whether it is the one my server really uses, and also check the stop word lists for other languages.
The stop words used by the English Analyzer are the same as the ones defined in the Standard Analyzer, namely the ones you found in the documentation.
The stop word files for all other languages can be found in the Lucene repository in the analysis/common/src/resources/org/apache/lucene/analysis folder.
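If you also want to check what your own server does, one option is to run sample text through the _analyze API with an inline stop filter and see which tokens survive. The sketch below assumes a reachable cluster URL (the base_url argument is a placeholder you supply) and an Elasticsearch version that accepts inline filter definitions in _analyze. The helper names are hypothetical.

```python
import json
import urllib.request


def build_analyze_body(text, stopwords="_english_"):
    """Build an _analyze request body using a stop filter with a predefined list."""
    return {
        "tokenizer": "standard",
        # Lowercase first, because the stop filter is case-sensitive by default.
        "filter": ["lowercase", {"type": "stop", "stopwords": stopwords}],
        "text": text,
    }


def surviving_tokens(base_url, text, stopwords="_english_"):
    """Return the tokens that are NOT removed by the given stop word list."""
    body = json.dumps(build_analyze_body(text, stopwords)).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]


# Only the request body is built here; no request is sent.
body = build_analyze_body("the quick brown fox", "_english_")
```

Calling surviving_tokens with your cluster URL and the words from a Lucene stop word file should return an empty list if the server uses the same list.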

notesdocumentcollection.ftsearch and a search query with special characters

I'm trying to write a search function in SSJS that looks like this:
notesdocumentcollection.ftsearch('"*' + searchword + '*"');
I have a document with the field value "Dr. Max Muster".
If I search for "dr" I get a result.
If I search for "dr. max" I don't get a result.
If I remove the wildcards and type "dr. max" I get a result.
I also tried it like this:
notesdocumentcollection.ftsearch('*' + searchword + '*');
Is there any way to get a result with wildcards and special characters in the search query?
P.S.
If I try this in the Notes client in the view, it works.
EDIT:
For the query "dr. ma" I got these debug results from the server:
IN FTGSearch option = 0x400089
[12CC:000A-1A30] Query: dr. ma
[12CC:000A-1A30] Engine Query: ("drma")
[12CC:000A-1A30] OUT FTGSearch error = F22
[12CC:000A-1A30] FTGSearch: found=0, returne
[12CC:000A-1A30] IN FTGSearch option = 0x40008C
[12CC:000A-1A30] Query: *"dr**ma"*
[12CC:000A-1A30] Engine Query: ("*dr**ma*")
[12CC:000A-1A30] OUT FTGSearch error = F22
[12CC:000A-1A30] FTGSearch: found=0, returned=0, start=0, count=0, limit=0
OK, first up: the search engine uses a trigram system, so searching for 2 characters will not work as expected. The wildcards may help, but there is no guarantee they will match everything.
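To illustrate why two characters are a problem for a trigram index in general, here is a tiny sketch that splits text into overlapping three-character pieces. It is a generic illustration only and makes no claim about how the Domino engine tokenizes internally.

```python
def trigrams(text):
    """Return all overlapping three-character substrings of the lower-cased text."""
    normalized = text.lower()
    return [normalized[i:i + 3] for i in range(len(normalized) - 2)]


print(trigrams("dr"))       # a two-character query yields no trigram at all
print(trigrams("dr. max"))  # longer input yields several trigrams to look up
```

A query shorter than three characters produces nothing to look up in such an index, so any match has to come from wildcard handling instead.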
So, as I understand the next part: if you manually type the following into the Full Text Search bar in the Notes client (quotes included), it works?
"*dr. max*"
One thing to be aware of in the Notes client is that it has two different search modes, Web query and Notes query (switched in the basic preferences).
By default Web query is on (IIRC), so you search as you would in a standard internet search engine.
If you have switched to Notes query, or the search starts with an all-capitals word, it uses the syntax that Notes has used previously.
So it is possible you are seeing differences between the client and XPages because of that.
To test this you can debug as follows. On the Domino server console, type the following.
set config DEBUG_THREADID=1
set config CONSOLE_LOG_ENABLED=1
set config Debug_FTV_Search=1
Now do a search in the notes client and the XPage. It will generate something like the following on the Domino Console (note: I added the numbers at the start for the important lines).
IN FTGSearch
[07FC:0048-0A94] option = 0x400219
1. [07FC:0048-0A94] Query: ("*test*")
2. [07FC:0048-0A94] Engine Query: ("*test*"%STEM)
3. [07FC:0048-0A94] GTR query performed in 6 ms. 5 documents found
4. [07FC:0048-0A94] 0 documents disualified by deletion
5. [07FC:0048-0A94] 0 documents disqualified by ACL
6. [07FC:0048-0A94] 0 documents disqualified by IDTable
7. [07FC:0048-0A94] 0 documents disqualified by NIF
8. [07FC:0048-0A94] Results marshalled in 3 ms. 5 documents left
9. [07FC:0048-0A94] OUT FTGSearch error = 0
[07FC:0048-0A94] FTGSearch: found=5, returned=5, start=0, count=0, limit=0
[07FC:0048-0A94] Total search time 10 ms.
Explanation of each numbered line:
1. The string you sent to the search engine. In this case it was "*test*" (with quotes).
2. The compiled search string.
3. How long it took and the total number of documents found.
4. Total discarded because they were flagged as deleted.
5. Total discarded because you did not have the rights to view them.
6. Total discarded because of the IDTable results.
7. Total discarded because they would not appear in the view you are searching from.
8. Time it took and remaining documents.
9. Whether any errors occurred.
So generate those two search results and post them if it is still not obvious why the search didn't work.
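If you want to compare the two debug outputs side by side, a small script can pull out the interesting fields. This is only a sketch that assumes the log lines look like the samples in this thread. The function name summarize_ft_debug is hypothetical.

```python
import re


def summarize_ft_debug(log_text):
    """Extract query, engine query, error code and found count from FTGSearch debug output."""
    summary = {}
    for line in log_text.splitlines():
        # Drop an optional "1. " style number and the "[xxxx:xxxx-xxxx]" thread prefix.
        body = re.sub(r"^\s*(?:\d+\.\s*)?(?:\[[^\]]*\]\s*)?", "", line)
        if body.startswith("Engine Query:"):
            summary["engine_query"] = body[len("Engine Query:"):].strip()
        elif body.startswith("Query:"):
            summary["query"] = body[len("Query:"):].strip()
        else:
            error = re.search(r"OUT FTGSearch error = (\S+)", body)
            if error:
                summary["error"] = error.group(1)
            found = re.search(r"found=(\d+)", body)
            if found:
                summary["found"] = int(found.group(1))
    return summary


sample = """[12CC:000A-1A30] Query: dr. ma
[12CC:000A-1A30] Engine Query: ("drma")
[12CC:000A-1A30] OUT FTGSearch error = F22
[12CC:000A-1A30] FTGSearch: found=0, returned=0, start=0, count=0, limit=0"""
summary = summarize_ft_debug(sample)
print(summary)
```

Running it on the client output and on the XPage output makes differences in the compiled engine query easy to spot.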
The documentation for FTSearch says to enclose words and phrases in quotes. So try this (where you enclose the searchword variable in quotes - and not the wildcard star):
notesdocumentcollection.ftsearch('*"' + searchword + '"*');
The Notes full-text query syntax is a better-kept secret than the Disney timeshare apartments (if you have ever been to Disney, you get the drift).
The official syntax guide is here: http://www-10.lotus.com/ldd/dominowiki.nsf/dx/full-text-syntax
What helped me a lot was to take the searchsite.ntf and rip it apart. Inside, all the concepts of FTSearch have been implemented in a working fashion (code that works beats documentation any time).
Hope that helps

Resources