Is there a good way to retrieve a company summary from Wikipedia? - search

My question is not about parsing.
I have been looking through the Wikipedia API. I need to search for companies and get a one-sentence summary. It's working well; the only problem I have is when I need to disambiguate. It's hard for my code to know whether "Dropbox (service)" or "Dropbox (band)" is the Dropbox company my user is looking for.
I tried to put the word "company" in the query, expecting it to work like a Google search, but unfortunately it didn't.
So my question is: is there an easy way to disambiguate the results I get by telling Wikipedia it is a "company" that I want?

If you're looking for companies only, then consider using their full names instead of short forms. In the case of Dropbox, the name of the company is Dropbox, Inc. If you search for "Dropbox, Inc" on Wikipedia, you will be redirected to the page "Dropbox (service)", which I believe is the page you're looking for.
If you don't have the resources to get the company name in the perfect format, then consider using Category:Companies to refine your results further.
When you get to the page, you can retrieve the extract for the company by using the MediaWiki API as follows:
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=Dropbox%20(service)
Note: The intro extract corresponds to what MediaWiki calls section 0 (the lead section).
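As a rough sketch of how the URL above could be used from code, the snippet below builds the same query and pulls the first sentence out of a decoded JSON response. The function names are made up for illustration, the response shape is assumed to be the default `query` / `pages` / `extract` layout, and the sentence split is deliberately naive. The TextExtracts documentation also describes an `exsentences` parameter that may be a better fit.

```python
from typing import Optional
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title: str) -> str:
    """Build the query URL for the plain-text intro extract of one page."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exintro": "",      # only the lead section
        "explaintext": "",  # plain text instead of HTML
        "titles": title,
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def first_sentence(response: dict) -> Optional[str]:
    """Return the first sentence of the first non-empty page extract, if any."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        extract = page.get("extract", "").strip()
        if extract:
            # Naive split: abbreviations followed by a space will cut the sentence short.
            return extract.split(". ")[0].rstrip(".") + "."
    return None
```

Fetching the URL (for example with `urllib.request` or `requests`) and decoding the JSON body is left out so that the sketch stays independent of the network.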

I recommend trying Wikidata. Wikidata is a multilingual factual database of everything, and it has a query interface at query.wikidata.org. The language the interface uses is called SPARQL. For instance, if you're interested in a list of well-known cats, https://w.wiki/W4W is your query. More details can be found at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service.
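For the company case, a query along these lines might work. This is only a sketch: it assumes that P31 is "instance of", P279 is "subclass of" and Q4830453 is the "business" item, which is worth double-checking on wikidata.org, and it matches on the exact English label. The helper names are hypothetical, and the subclass path can be slow on the public endpoint.

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def company_query(name: str, language: str = "en") -> str:
    """Build a SPARQL query for items labelled `name` that are some kind of business."""
    # Escape backslashes and double quotes so the name is a valid SPARQL string literal.
    safe = name.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
SELECT ?item ?itemLabel ?itemDescription WHERE {{
  ?item rdfs:label "{safe}"@{language} ;
        wdt:P31/wdt:P279* wd:Q4830453 .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}". }}
}}
LIMIT 10
""".strip()


def query_url(name: str) -> str:
    """Build a GET URL that asks the query service for JSON results."""
    params = {"query": company_query(name), "format": "json"}
    return f"{SPARQL_ENDPOINT}?{urlencode(params)}"
```

The item descriptions returned by such a query (when present) can then be used as the one-sentence summary, or the item can be mapped back to its Wikipedia article.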

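import wikipedia

# Print the plain-text summary of the page for the given title.
print(wikipedia.summary("COMPANY_NAME"))
Try to filter the companies by category; the categories are listed at the end of each page:
page = wikipedia.page("Dropbox")
print(page.title)
# Categories the page belongs to, which can be checked for company-related ones.
print(page.categories)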
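One possible way to use those categories for disambiguation is sketched below. The keyword list and the sample category names are invented for illustration; real category lists would come from `page.categories` or from the API.

```python
from typing import Dict, Iterable, List

# Assumed keywords; tune these against the categories you actually see.
COMPANY_KEYWORDS = ("companies", "company")


def looks_like_company(categories: Iterable[str]) -> bool:
    """True if any category name mentions one of the company keywords."""
    return any(
        keyword in category.lower()
        for category in categories
        for keyword in COMPANY_KEYWORDS
    )


def pick_companies(candidates: Dict[str, List[str]]) -> List[str]:
    """Keep only the candidate titles whose categories look company-related."""
    return [title for title, cats in candidates.items() if looks_like_company(cats)]


# Hypothetical category data for two disambiguation candidates.
candidates = {
    "Dropbox (service)": ["Cloud storage", "Companies based in San Francisco"],
    "Dropbox (band)": ["American rock music groups"],
}
print(pick_companies(candidates))
```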

Related

Searching across multiple languages -- how to?

TL;DR: I want to build multi-language search on my website à la Pinterest. How do I do that?
I am starting a website, where people can publish content that gets metadata typed by the user. People can then interact with the content by looking at it, liking it, commenting on it, sharing it to social media. Also content discovery is mostly done through search.
I do not wish to create geographic boundaries on my website. I would like people who speak any language to find content that is relevant to them in any language. This requirement makes sense because the content is highly visual, à la Pinterest. So even if I don't understand that the word "car" is written in French in the description, it's fine because I'll mostly be interested in seeing the car.
Pinterest is really good at search across languages. For example, on uk.pinterest.com I typed "coupe carrée", which is French for "bob haircut", and all the results were visually relevant, even though the pin metadata is in English and the original website is all in English.
How is that possible? How was Pinterest able to match my French search query to content whose text is all in English? Is there translation at some step: coupe carrée > bob haircut > content containing "bob haircut"?
I looked at their engineering blog and all I found is tech to detect the original country and language of a website. Nothing about managing language in search.
Please let me know if this is the wrong place to ask this how-it-works question.
Thanks in advance for any help/pointers you will be able to share!
The general strategy in this case is to index your content with every language translation you wish to search.
This would require a language translation API at index time, along with a language identification model. Here's a Solr example.
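Independently of Solr, the index-time idea can be sketched with a toy inverted index. Everything here is a stand-in: `toy_translate` replaces a real translation API with a one-entry lookup table, and language identification is skipped by assuming all source text is English.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set

# Toy stand-in for a translation API: (source text, target language) -> translation.
TOY_TRANSLATIONS = {
    ("bob haircut", "fr"): "coupe carrée",
}


def toy_translate(text: str, target_lang: str) -> str:
    """Return a known translation, or the text unchanged if none is known."""
    return TOY_TRANSLATIONS.get((text.lower(), target_lang), text)


def build_index(
    docs: Dict[str, str],
    languages: Iterable[str],
    translate: Callable[[str, str], str],
) -> Dict[str, Set[str]]:
    """Index every document under the tokens of its text and of each translation."""
    index: Dict[str, Set[str]] = defaultdict(set)
    languages = list(languages)
    for doc_id, text in docs.items():
        variants = {text} | {translate(text, lang) for lang in languages}
        for variant in variants:
            for token in variant.lower().split():
                index[token].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the documents that contain every token of the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    return set.intersection(*(index.get(token, set()) for token in tokens))


docs = {"pin1": "bob haircut", "pin2": "red car"}
index = build_index(docs, ["fr"], toy_translate)
print(search(index, "coupe carrée"))
```

The trade-off of this approach is index size and translation cost at write time, in exchange for queries that need no translation step.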

Search a specific section of a journal article based on the user type

I have this requirement:
We have a journal article and we wish to have sections that contain content for internal and external users of the application.
We are able to hide the content from rendering by implementing a custom template on the web content display and using a simple custom field on the user, which helps us classify them.
Having said that, when we search for something as an external user, the search portlet is able to fetch an article where the search text is part of the internal user content, and due to the above-mentioned template that content is not visible.
In short, from the user's perspective the resulting article does not match the search term.
I am looking for pointers on whether there is a mechanism to ensure that when an external user searches for something, we only search the dynamic-element of the document that matches the user type.
We have thousands of such articles, and creating multiple copies of the same article does not seem to be a viable solution, so any pointers would be a great help.
Liferay version: 6.2 GA4 CE
Thanks!
AJ
First of all: not finding a search term in a document can be a sign of well-working synonym resolution in the search engine. It's questionable whether this behaviour is always wrong or only in this particular case. Remember Google bombs?
That being said, I believe that this architecture of half-visible documents is flawed from the beginning. Ideally I'd suggest changing it, for example by splitting the information into two articles, so that you can use the standard permissions to resolve visibility. If you link both, you can determine how/which article or template to use. It's not an ideal solution, but might be a workaround.
Another workaround might be to change Liferay's indexer component and index two different versions of the article, with two different permissions. Of course, you'll have to change the search side as well, so that you'll find each article at most once, even though it's now in the search engine twice.
Again, not ideal, but it might be the quickest fix that you can get right now without changing the underlying architecture. However, changing the underlying architecture is my actual recommendation.
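To make the second workaround more concrete, here is a language-neutral sketch of the idea in Python. It is not Liferay code and uses no Liferay API; the data layout, the function names, and the assumption that internal users may see every section are all illustrative.

```python
from typing import Dict, List, Tuple

Section = Tuple[str, str]  # (audience, text), audience is "internal" or "external"


def build_entries(article_id: str, sections: List[Section]) -> List[Dict[str, str]]:
    """Create one index entry per audience, holding only the text that audience may see."""
    entries = []
    for audience in ("internal", "external"):
        # Assumption: internal users see every section, external users only their own.
        visible = [
            text for who, text in sections
            if audience == "internal" or who == audience
        ]
        entries.append(
            {"article_id": article_id, "audience": audience, "text": " ".join(visible)}
        )
    return entries


def search(entries: List[Dict[str, str]], term: str, audience: str) -> List[str]:
    """Search only the entries for this audience, returning each article at most once."""
    seen, hits = set(), []
    for entry in entries:
        if entry["audience"] != audience:
            continue
        if term.lower() in entry["text"].lower() and entry["article_id"] not in seen:
            seen.add(entry["article_id"])
            hits.append(entry["article_id"])
    return hits


entries = build_entries(
    "a1", [("external", "public pricing"), ("internal", "secret margin")]
)
print(search(entries, "margin", "external"))
```

Filtering by audience on the search side is what keeps each article from showing up twice even though it is indexed twice.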

Plone: creating and using document tags?

For an academic Plone site I am creating, it is desirable to support document tags (see below).
There are multiple users for this site, and each user has a (long) list of publications that they alone can add / edit.
In its simplest form, a publication entry consists of a hyperlink or even just plain text. For instance:
A. Baynes, J. Watson and S. Holmes, "The role of observation and deduction in forensics", Applied Crime Solving, 221, 210-243 (1901). doi: 10.1032/acsolv2714
(The above is a fictitious article, but it has all the elements one expects in most citations.)
For those unfamiliar with DOI links, these are fixed text strings that can be resolved to the page for the article in question using dx.doi.org. Further, copyright / license terms often prohibit the authors from providing a full PDF / HTML for their articles on their websites. The articles often lie behind a paywall (usually accessible from most Universities / major research labs). So, running full text searches on the article itself is NOT an option.
Returning to the problem definition, I am assuming that the users will add their publications as links, but I want to give them the ability to specify a comma separated list of words / phrases (or tags) that more closely identify what the article is about.
For the above article, an appropriate list of tags would be:
forensics, haemoglobin, degradation of evidence
After each user appends such tags to the article, I want to create a backend that will allow visitors to the site to simply be able to enter these tags in a search field and find all publications that pertain to, say, haemoglobin.
That search should pull all publications that list haemoglobin as a tag, for all users of the site.
I intentionally used haemoglobin as a tag to illustrate that relevant tags need not be (and usually aren't) part of the text specified in the title of the article.
Further, the Plone "Collections" feature is not an adequate solution to this problem. Collections are typically generated by the admin. That means that a) admin intervention for something like this is essential and b) tags are best defined by users, not the admin.
When adding any content type (File, Folder, Page, Link, Collection, ...) in Plone, you can apply any number of tags to the content. This is done in the "Categorization" tab when editing/creating the content.
Visitors/users can search the site based on tags just like normal searches (using the search box or accessing the /@@search URL).
Moreover, you can use "tag cloud" portlets to visualise the tags' frequencies. Check the following to get an idea:
1. A tag cloud portlet that rotates tags in 3D using a Flash movie
2. TagCloud
Don't forget to check the Plone documentation, and especially the Plone user manual, to get yourself acquainted with the way Plone works.
@user2751530
I would like to know whether you are still working on this specific project. I am currently developing a similar one using Plone v4, documentviewer v3 and an as yet nonexistent frontend. I would like to discuss different approaches to the tagging-by-user problem; you can contact me through Skype (dawitt19) or, preferably, Twitter (@japhigu).

Does SharePoint Search support range tags?

I am working on a project to digitize approximately 1 million images for which metadata will be added to facilitate search.
Each image is, for example, a page in a dictionary. But not text. Just a static scanned image. OCR is not an option :(
My objective is to emulate the current search procedure, which consists of looking through the alphabetical entries until the correct page is found. In the absence of machine-readable text, I am looking at tagging each page with a dictionary range tag, for example (Apple-Canada). So if someone searches for "Banana", it should hit the (Apple-Canada) range tag.
Is this supported in SharePoint out of the box? If not, is there an add-on product which provides this functionality, or am I looking at building a customized extension?
Any help will be appreciated :)
Installing the IFilter for TIF files is done with a couple of clicks and gives you free OCR along the way. Very good for scanned pages.
On your question though: No, SharePoint does not have any kind of "range" tags or fields. The only vaguely similar thing to what you are requesting is the Thesaurus of the search. There you could define acronyms and synonyms for words and it would actually search for something else. So you could enter Banana but it would actually search for Apple. Some examples here: How to: Customize the Thesaurus in SharePoint Search and Search Server.
Other than that I can only think of a custom implemented search provider giving you the flexibility you need.
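Whatever form a custom provider takes, the range lookup at its core is small. The sketch below is plain Python rather than anything SharePoint-specific; the page list is invented, and it assumes the ranges are sorted by their first word and do not overlap.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical range tags: (first word on page, last word on page, page id).
PAGES: List[Tuple[str, str, str]] = [
    ("Apple", "Canada", "page-001"),
    ("Candle", "Eagle", "page-002"),
    ("Ear", "Grape", "page-003"),
]


def find_page(word: str, pages: List[Tuple[str, str, str]]) -> Optional[str]:
    """Return the id of the page whose range contains `word`, or None."""
    needle = word.lower()
    starts = [start.lower() for start, _, _ in pages]
    # Index of the last range that starts at or before the search word.
    i = bisect_right(starts, needle) - 1
    if i < 0:
        return None
    _, end, page_id = pages[i]
    return page_id if needle <= end.lower() else None


print(find_page("Banana", PAGES))
```

Words that sort between two ranges (or outside all of them) return `None`; a real lookup might instead fall back to the nearest page.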

How to get a description of a URL

I have a list of URLs and am trying to collect their "descriptions." By description I mean what comes up, for example, if you Googled the link. For example, Googling http://stackoverflow.com shows the description as
A language-independent collaboratively
edited question and answer site for
programmers. Questions and answers
displayed by user votes and tags.
This is the data I'm trying to accumulate for the URLs I have.
I tried parsing the URL's meta-descriptions, however most of them are lacking a meta-description (yet Google and other search engines manage to get a description somehow).
Any ideas? Should I just "google" each link and scrape the data? I have a feeling Google wouldn't like this...
Thanks guys.
Different search engines have different algorithms to get the description out of the page if/when it lacks the description meta tag. Some ignore the tag even if it's there.
If you want the description Google has, the most accurate way to get it would be to scrape it. Otherwise, you could write your own or look around on the web for code that does it.
These are called snippets.
Google use proprietary (and possibly patented) methods to garner this information, so there is no simple answer.
As you suggest, they will use meta-description information if it is there. (How to set the meta-information to help Google.)
They will also honour requests from the page authors to NOT include snippets. (How to prevent Google from displaying snippets) You should probably respect this too (as well as robots.txt, of course.)
You may have some luck with existing auto-summary packages, such as OTS.
You may want to check AboutUs.org (e.g. http://www.aboutus.org/StackOverflow.com).
But, there's little chance that the site will have an aboutus page and not have a meta description.
Some info that might explain how Google does this:
Webmasters/Site owners Help
Adding a URL to google
I am not familiar with Google APIs, but perhaps there is an official way to get such information.
Interesting: some sources are better than others.
For "audiotuts.com", Google has a worse description than AboutUs.com.
Google:
Nov 18th in General by Joel Falconer ·
1. Recently, an AUDIOTUTS reader asked me about creative process. While this
is a topic that can’t be made into a
...
AboutUs.com:
AUDIOTUTS is a blog/tutorial site for
musicians, producers and audio
junkies! It is the sister site of the
popular PSDTUTS, VECTORTUTS and
NETTUTS.
I hate problems like these... they should be trivial but they aren't!
If you can assume English content, you can first look for Meta Description, and if that doesn't work, you can look for the first two or three sentence-like word sequences.
A product I worked on looked for the first P or DIV that contained more than one sequence of > n "words" delimited by periods. It would use the two or three sentence-like sequences, up to x total words, as a summary paragraph. It wasn't 100% accurate, but good enough for the average case. The number of words was adjusted a few times to eliminate things like navigation elements.
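A rough reconstruction of that heuristic, using only the standard library, might look like the following. The thresholds (`min_words`, `max_sentences`, `max_words`) are placeholders for the n and x mentioned above, the class and function names are made up, and treating every P or DIV boundary as a block break is a simplification rather than what the original product did.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional


class BlockCollector(HTMLParser):
    """Collect the meta description and the text between P/DIV boundaries."""

    def __init__(self) -> None:
        super().__init__()
        self.meta_description: Optional[str] = None
        self.blocks: List[str] = []
        self._buffer: List[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def flush_block(self) -> None:
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "description":
            content = (attributes.get("content") or "").strip()
            if content and self.meta_description is None:
                self.meta_description = content
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in ("p", "div"):
            self.flush_block()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in ("p", "div"):
            self.flush_block()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buffer.append(data)


def summarize(html: str, min_words: int = 4, max_sentences: int = 3,
              max_words: int = 60) -> Optional[str]:
    """Prefer the meta description; otherwise use the first sentence-like block."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush_block()
    if parser.meta_description:
        return parser.meta_description
    for block in parser.blocks:
        # Sentence-like: a period-delimited run of more than `min_words` words.
        sentences = [s.strip() for s in re.split(r"(?<=\.)\s+", block)
                     if len(s.split()) > min_words]
        if len(sentences) > 1:
            words = " ".join(sentences[:max_sentences]).split()
            return " ".join(words[:max_words])
    return None
```

Short navigation blocks such as "Home About Contact" are skipped because they do not contain more than one sentence-like sequence.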
