I'm using the Wikipedia API to get the infoboxes from certain pages.
An example would be Imperial College London.
My problem is the value I'm getting for the student population: HESA student population|INSTID=0132. I was hoping to get just the number of students, but instead I'm getting the ID above. How can I get the values of the infoboxes present in a page?
Moreover, if you check the wiki page, there are two infoboxes (main and rankings). How can I get both of them?
There's an alternative REST API you could use to access Wikipedia content. To get well-structured HTML for an article, you would request:
https://en.wikipedia.org/api/rest_v1/page/html/Imperial_College_London
The HTML is produced by the Parsoid service, which emits HTML/RDFa content following the DOM Spec. Infoboxes will be HTML table elements with the class `infobox`, so you can easily locate all infoboxes on the page.
Infoboxes are normally created by complex templates, so it might be easier for you to just parse the table HTML.
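As a rough illustration of parsing the table HTML, here is a minimal sketch that uses only Python's standard library. It assumes the page HTML has already been downloaded from the REST URL above, and that each infobox is a table whose class list contains `infobox`. The names `InfoboxParser` and `extract_infoboxes` and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collect the rows of every <table> whose class list contains "infobox"."""

    def __init__(self):
        super().__init__()
        self.infoboxes = []  # one list of rows per infobox found
        self._depth = 0      # table nesting depth inside the current infobox
        self._row = None     # cells of the row currently being read
        self._cell = None    # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self._depth:
                self._depth += 1  # a table nested inside an infobox
            elif "infobox" in classes:
                self._depth = 1
                self.infoboxes.append([])
        elif self._depth == 1 and tag == "tr":
            self._row = []
        elif self._depth == 1 and tag in ("th", "td"):
            if self._row is None:  # tolerate a missing <tr>
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if not self._depth:
            return
        if tag == "table":
            self._depth -= 1
        elif self._depth == 1:
            if tag in ("th", "td") and self._cell is not None:
                # Join the fragments and collapse runs of whitespace.
                self._row.append(" ".join("".join(self._cell).split()))
                self._cell = None
            elif tag == "tr" and self._row is not None:
                if self._row:
                    self.infoboxes[-1].append(self._row)
                self._row = None

    def handle_data(self, data):
        # Text inside nested tables is kept as part of the enclosing cell.
        if self._cell is not None:
            self._cell.append(data)


def extract_infoboxes(html):
    """Return a list of infoboxes, each a list of rows of cell texts."""
    parser = InfoboxParser()
    parser.feed(html)
    parser.close()
    return parser.infoboxes


# Made-up markup standing in for the downloaded page.
SAMPLE = """
<table class="infobox vcard"><tbody>
<tr><th>Students</th><td>12,<b>345</b></td></tr>
</tbody></table>
<table class="wikitable"><tr><td>skip</td></tr></table>
<table class="infobox"><tr><th>Ranking</th><td>8</td></tr></table>
"""

for box in extract_infoboxes(SAMPLE):
    print(box)
```

Because every matching table is collected, a page with two infoboxes (main and rankings) would yield two entries in the returned list.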
Is there a way to publish the number of views/edits of a page on the page itself in xWiki?
I've used parts of this article to create an accessible form to see usage of different spaces. But I would also like to publish the views/edits on the pages themselves.
Thanks in advance!
Richard
I would suggest you use a UI Extension Point (see https://extensions.xwiki.org/xwiki/bin/view/Extension/UIExtension%20Module and the tutorial https://www.xwiki.org/xwiki/bin/view/Documentation/DevGuide/Tutorials/UIXTutorial/) from the list of available ones (https://www.xwiki.org/xwiki/bin/view/Documentation/DevGuide/ExtensionPoint/) to add the extra information to be displayed on each page.
I guess the most suitable UIXP would be the Content Footer one (https://www.xwiki.org/xwiki/bin/view/Documentation/DevGuide/ExtensionPoint/ContentFooterUIX/).
Inside the UIX you add, you can do a simple query to fetch the view/edit values from the statistics module, either with the API (if one exists) or with an HQL query, like in the example you've mentioned.
I'm looking to scrape the main description of a Wikipedia page, i.e. the text shown at the top when you first open the page (the same text that appears in the embed when you post a Wikipedia link on a website).
I initially tried the MediaWiki API, but it mostly returned irrelevant data that did not include the description/embed text.
Is there an endpoint that Wikipedia provides that returns the data I'm after?
I asked this question over on the Kentico devnet but haven't had a definitive answer.
I have a particular requirement for a Kentico 8.2 implementation where, in code, given a specific TreeNode, I'd like to first find all the zones on the template being used and then, for each zone, get the details of all web parts and widgets used in that zone.
In my case I do not need to worry about template inheritance. None of my pages implement template inheritance.
I found this post on the old Kentico forums, which suggested I might be able to use PageInfoProvider to get a PageInfo object for the relevant TreeNode and then use its PageTemplateInfo property to gain access to what I need.
However, I don't see a PageTemplateInfo property of CMS.DocumentEngine.PageInfo. There is DesignPageTemplateInfo and UsedPageTemplateInfo. I thought maybe UsedPageTemplateInfo would be the one, and it does indeed include the correct zones in its WebPartZones collection. But I don't see the web parts (actually, widgets) I'm expecting in the zones' WebParts collections.
I guess what I'm asking is: how can I use the API to access the content of the DocumentWebParts column from dbo.CMS_Document as a structured object? I've realised I can get the information I need by calling .GetProperty("DocumentWebParts") on TreeNode, but this is unstructured XML. I presume that somewhere in the API I can get this information as a structured object.
Does anyone know how I might access the details I need? Thanks.
P.S. My template uses the ASPX+portal model.
As I mentioned in my second answer on the DevNet, you cannot specifically use the cms_document table, simply because web parts belong to templates and not to a specific page. Widgets, on the other hand, are specific to each page, even if the page has the same template as another page. Take a look at the example I provided on the DevNet; it should get you what you're looking for.
Does anyone know how to extract data from a webpage using Import.io when the data is loaded into the page via Ajax?
I am unable to extract data from the page mentioned below.
There is no issue extracting data from the first page of results, but how do I move on to extract data from the second page?
The URL is given below.
<http://www.amazon.com/gp/aag/main?ie=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=ATVPDKIKX0DER&orderID=&seller=A13JB7253Q5S1B>
The data on that page is delivered using an interesting mix of technologies; it relies heavily on server-side code and JavaScript. That type of page can be a challenge, but there are always ways to get the data. For example, some sellers have a page like this:
http://www.amazon.co.uk/gp/node/index.html?ie=UTF8&marketplaceID=ATVPDKIKX0DER&me=A2WO1PQ2OIOIGM&merchant=A2WO1PQ2OIOIGM
Which is very easy to extract data from, even using the magic algorithm - https://magic.import.io/?site=http:%2F%2Fwww.amazon.co.uk%2Fgp%2Fnode%2Findex.html%3Fie%3DUTF8%26marketplaceID%3DA1F83G8C2ARO7P%26me%3DA2WO1PQ2OIOIGM%26merchant%3DA2WO1PQ2OIOIGM
I had to take off the redirect=true from the URLs before it would work - just an FYI.
Other times a store doesn't have such a URL, which is a bit of a pain, and their URLs can be tough to figure out.
We do help some of our enterprise customers build bespoke APIs when the data is very important to them, so do feel free to get in touch. I imagine a larger-scale workaround would be to create a dataset/API based on the categories you are interested in and then to filter that larger dataset down (Python or CSV style) by seller name. That would probably work!
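To give an idea of the "filter by seller name" step, here is a minimal Python sketch. It assumes the larger category dataset has been exported as CSV and that it has a column holding the seller name. The column name `seller`, the function name `filter_by_seller`, and the sample rows are all hypothetical, so adjust them to match the real export.

```python
import csv
import io


def filter_by_seller(csv_text, seller, column="seller"):
    """Return the rows whose seller column matches `seller` (case-insensitive)."""
    wanted = seller.strip().lower()
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if (row.get(column) or "").strip().lower() == wanted
    ]


# Made-up rows standing in for an exported category dataset.
SAMPLE_CSV = """seller,title,price
Acme Books,Example title A,9.99
Other Shop,Example title B,4.50
acme books,Example title C,12.00
"""

for row in filter_by_seller(SAMPLE_CSV, "Acme Books"):
    print(row["title"], row["price"])
```

For a file on disk, the same function can be fed the file's contents, e.g. `filter_by_seller(open("export.csv", newline="").read(), "Acme Books")`, where `export.csv` is again a placeholder name.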
I managed to get a static dataset but no API. You can find that dataset at the following GUID: c7c63f1c-7081-4d4a-ad91-afe9789a6620
Thanks
Currently Google displays elements in the result excerpts that belong to the functional part of the site. Is there a way to exclude these elements from being crawled/displayed in Google?
Like eEdit, eDelete, etc in the example above.
To exclude the pages from Google's index, block them using the robots.txt file, or, if it is just the content, use the rel="nofollow" attribute.
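If you go the robots.txt route, it can help to check which URLs a given set of rules would actually block before deploying them. The sketch below uses Python's standard `urllib.robotparser` for that. The rules, the `/admin/` path, and the example.com URLs are placeholders and are not taken from the site in question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep every crawler out of the /admin/ section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawlable(ROBOTS_TXT, "https://example.com/admin/edit"))
print(is_crawlable(ROBOTS_TXT, "https://example.com/products/1"))
```

Note that this only works at the level of whole URLs; it says nothing about individual elements within a page.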
Hope this helps.
Update on my particular situation: I just found out that the frontend code had been generated in such a way that the title and the meta description were identical.
Google is smart enough to recognise that if a piece of copy is already displayed in the title of the search result, there's no reason to add it to the excerpt as well, so instead it looks for content it believes to be valuable on the actual page.
Lessons learned:
there's no way to hide elements from Google but keep them visible for your users
if you'd like to have control over the content displayed in Google searches, avoid using the same copy in your title and description
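A quick way to catch the duplicate title/description problem described above is to compare the two values in the generated HTML. This is a minimal sketch using only Python's standard library; the class name `HeadTextParser`, the function name `has_duplicate_title_description`, and the sample pages are invented for illustration.

```python
from html.parser import HTMLParser


class HeadTextParser(HTMLParser):
    """Record the <title> text and the content of <meta name="description">."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "description":
                self.description = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def has_duplicate_title_description(html):
    """Return True if the title and meta description carry the same copy."""
    parser = HeadTextParser()
    parser.feed(html)
    parser.close()
    if parser.description is None:
        return False

    def normalize(text):
        # Ignore case and whitespace differences when comparing.
        return " ".join(text.split()).lower()

    return normalize(parser.title) == normalize(parser.description)


# Made-up pages: the first repeats the title in the description.
DUPLICATE = ('<html><head><title>Acme Widgets</title>'
             '<meta name="description" content="acme  widgets"></head></html>')
DISTINCT = ('<html><head><title>Acme Widgets</title>'
            '<meta name="description" content="Hand-made widgets."></head></html>')

print(has_duplicate_title_description(DUPLICATE))
print(has_duplicate_title_description(DISTINCT))
```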