Query wikipedia - nlp

I would like to query two or three terms in order to locate them in Wikipedia´s entries. Specifically, I´m trying to see if some terms get repeated in the first paragraphs (abstract) across entries. Could be direct or through dbpedia. Thanks

Using Mediawiki API you can find articles that contain those keywords.
Try the API:Search documentation.
For doing what you want to do, also, you'd probably need to find the articles that have those keywords and then parse the text to check if they are in the first paragraphs.
With this:
?action=parse&page=Nicolas_Cage&prop=text&section=0
you can get the HTML of the first section of a page (see this post).

Related

Wikipedia wildcard search not working?

I'm trying to do a wildcard search on Wikipedia but the search is not behaving the way the instructions say it should. Here's the advanced search help page:
https://en.wikipedia.org/wiki/Help:Advanced_search
As an example, it says this regarding a Wildcard search:
the query *stan will match Kazakhstan or Afghanistan or Stan Kenton.
However, when I attempt to do that search (or even click on the embedded link to that search), I only get
the page *stan does not exist
and it just lists a bunch of "Stan" entries starting with "Stan Laurel filmography."
Why would this feature not work? Am I missing something?
It does work, however because direct matches for "stan" are scored higher than words with it, Kazakhstan is waaaay down in results. You can try slightly narrowing the results with intitle:*stan however this is still bad. However, a quick check with k*stan shows that it works.
Conclusion: user-written help page has a bad example.

Is there a good way to retrieve company summary from wikipedia?

My question is not about parsing.
I have been looking through the wikipedia API. I need to search for companies and get a one sentence summary. It's working good, the only problem I have is when I need to disambiguate. It's hard for my code to know whether "dropbox (service)" or "dropbox (band)" is the dropbox company my user is looking for.
I tried to put the word "company" in the query, expecting it to work like a google search, but unfortunately it didn't.
so my question is: is there an easy way to disambiguate the results I get by telling wikipedia it is a "company" that I want?
If you're looking for companies only then consider using their full names instead of short forms. In case of Dropbox, the name of the company is Dropbox, Inc. If you search for Dropbox, Inc in Wikipedia you will be redirected to the page Dropbox(Service) which i believe is the page youre looking for.
If you dont have the resources to have the name of the company in the perfect format, then consider using Category:Companies to refine your results further.
When you get to the page, you can mine for the extract of the company by using the Mediawiki API as follows
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=Dropbox%20(service)
Note: The extract is called section0 in MediaWiki
I recommend trying Wikidata. Wikidata are a multilingual factual database of everything, and they have a query interface at query.wikidata.org. The language the interface uses is called SPARQL. For instance, if you're interested in a list of well-known cats, https://w.wiki/W4W is your query. More details can be found at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service.
import wikipedia
print(wikipedia.summary("COMPANY_NAME"))
Try to filter out the companies by categories - there is a list provided in the end of the page:
xx = wikipedia.page("Dropbox")
xx.title
print(xx.categories)

wikipedia api search titles generator

Trying to search tiles through the api using a generator.
I notice that there are two possible generators, with both I have problems:
prefix search - doesn't work well if I have multiple words and the order is reversed in the query (for example "brian adams" would return an answer, however "adams brian" does not
search - seems to not allow searching by titles, only by text which returns low-quality results.
Anyone knows of a way around this?
"srwhat=title" is disabled, so you should use "intitle:" in your search query:
https://en.wikipedia.org/w/api.php?action=query&list=search&srnamespace=0&srprop=timestamp&&srsearch=intitle:adams%20brian
for more info please check this page:
https://www.mediawiki.org/wiki/API:Search

DataView component in SPD attaches needless strings to field values

In SharePoint Designer I use some lists as sources and then link them together with an operation GetListItems (I fetch items from multiple lists on different site collections for rollup/aggregation):
alt text http://img151.imageshack.us/img151/1807/ss20090428101310.png
Now something is fine as I managed to get the result: alt text http://img410.imageshack.us/img410/4835/ss20090428101013.png
But the strings that are attached to field result (6;#, 2;#) is... disturbing.
How can I get rid from those attached strings? They are not attached to all fields, but to some (important ones):
alt text http://img168.imageshack.us/img168/1647/ss20090428100732.png
Ahh, well usally that happens - you keep searching for answer, then seek for help and find it yourself.
I used substring xsl function, to strip away those first characters. Messy, if i want to add links to that table, but works.
alt text http://img2.imageshack.us/img2/3117/ss20090428102714.png
By the way, the main question how to rollup content from multiple site collections has been journey to me for several days already. If anyone is in the same situation, I recommend (well because I found myself an answer there) these:
How-To Rollup two lists in two site
collections on a page
Or a better way to use for a single
site collection: SharePoint
Customisation Tricks: Use The
SPDataSource, Luke! (Good links
inside that article).
Something I didn't touch, because I
didn't need such an advanced method,
but maybe someone does: Populating
data sources in code

How to get a description of a URL

I have a list of URLs and am trying to collect their "descriptions." By description I mean what comes up, for example, if you Googled the link. For example, http://stackoverflow.com">Google: http://stackoverflow.com shows the description as
A language-independent collaboratively
edited question and answer site for
programmers. Questions and answers
displayed by user votes and tags.
This the data I'm trying to accumulate for the URLs I have.
I tried parsing the URL's meta-descriptions, however most of them are lacking a meta-description (yet Google and other search engines manage to get a description somehow).
Any ideas? Should I just "google" each link and scrape the data? I have a feeling Google wouldn't like this...
Thanks guys.
Different search engines have different algorithms to get the description out of the page if/when they are lacking the description meta tag. Some ignore the tag even it it's there.
If you want the description Google has, the most accurate way to get it would be to scrape it. Otherwise, you could write your own or look around on the web for code that does it.
These are called snippets.
Google use proprietary (and possibly patented) methods to garner this information, so there is no simple answer.
As you suggest, they will use meta-description information if it is there. (How to set the meta-information to help Google.)
They will also honour requests from the page authors to NOT include snippets. (How to prevent Google from displaying snippets) You should probably respect this too (as well as robots.txt, of course.)
You may have some luck with existing auto-summary packages, such as OTS.
You may want to check AboutUs.org (i.e. http://www.aboutus.org/StackOverflow.com).
But, there's little chance that the site will have an aboutus page and not have a meta description.
Some info that might explain how google does this:
Webmasters/Site owners Help
Adding a URL to google
I am not familiar with Google APIs, but perhaps there is an official way to get such information.
Interesting. some sources are better than others.
For "audiotuts.com" google has a worse description than AboutUs.com.
Google
Nov 18th in General by Joel Falconer ·
1. Recently, an AUDIOTUTS reader asked me about creative process. While this
is a topic that can’t be made into a
...
AboutUs.com:
AUDIOTUTS is a blog/tutorial site for
musicians, producers and audio
junkies! It is the sister site of the
popular PSDTUTS, VECTORTUTS and
NETTUTS.
I hate problems like these... they should be trivial but they aren't!
If you can assume English content, you can first look for Meta Description, and if that doesn't work, you can look for the first two or three sentence-like word sequences.
A product I worked on looked for the first P or DIV that contained more than one sequence of > n "words" delimited by periods. It would use the two or three sentence-like sequences, up to x total words, as a summary paragraph. It wasn't 100% accurate, but good enough for the average case. The number of words was adjusted a few times to eliminate things like navigation elements.

Resources