XPath Data Scraping From Online Community

I recently read this article on how to scrape the Inbound.org community members' profiles using Excel. You can also watch the video here if you prefer it that way.
Since the release of this tutorial, the Inbound website structure has changed a bit. If you attempt to copy the XPath of the social media icons as shown at minute 11:00 in the video, it now appears slightly different, and because of this I haven't been able to extract that information.
Here's what I get now:
/html/body/div[3]/div/div/div[1]/div/div[2]/a[1]/i
This is how I wrote the syntax in Excel:
=XPathOnUrl(A2,"//a[@class='twitter']","href")
And then like this:
=XPathOnUrl(A2,"//a[contains(@class,twitter)]/@href")
Although I tried in many different ways, none of them showed me the link to the member's social media profile.
I even tried changing the xpath in multiple ways to get different data from the page, but none of it was the social media information:
=XPathOnUrl(A2,"//*[contains(@class,member-banner-tagline)]/div[2]/div/div/div[1]/div/div[1]")
=XPathOnUrl(A2,"//*[contains(@class,member-banner-tagline)]/div[2]/div/div/div[1]/div/h1")
I honestly don't know what to try anymore, something's wrong and I can't figure it out. Anybody have enough experience with this or can pinpoint the problem here with my syntax?
Thanks a lot

The first formula you tried looks fine, but this is the one that works for me (SEO Tools version 4.3.4):
=Dump(XPathOnUrl(A2;"//a[@class='twitter']";"href";HttpSettings(TRUE)))
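To see why an exact match on the class attribute can come back empty after a site redesign, here is a minimal Python sketch. The markup is hypothetical and is not the real Inbound.org page; it only assumes that a link might carry more than one class. Note also that in XPath 1.0 an unquoted twitter inside contains() is read as a child element name rather than a string, so the second argument should be quoted, as in contains(@class,'twitter').

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration only, NOT the real Inbound.org page.
SAMPLE = """
<div class="member-banner">
  <a class="twitter" href="https://twitter.com/example"><i/></a>
  <a class="social-link facebook" href="https://facebook.com/example"><i/></a>
  <a class="about" href="/about">About</a>
</div>
""".strip()

root = ET.fromstring(SAMPLE)


def hrefs_exact(root, cls):
    """Like //a[@class='cls']/@href: the whole class attribute must equal cls."""
    return [a.get("href") for a in root.findall(f".//a[@class='{cls}']")]


def hrefs_with_token(root, token):
    """Close to //a[contains(@class,'token')]/@href.

    ElementTree's XPath subset has no contains(), so the check is done in
    Python. It matches whole class tokens, which is slightly stricter than
    the substring test that contains() performs.
    """
    return [
        a.get("href")
        for a in root.iter("a")
        if token in a.get("class", "").split()
    ]


print(hrefs_exact(root, "twitter"))        # single class: exact match works
print(hrefs_exact(root, "facebook"))       # two classes: exact match finds nothing
print(hrefs_with_token(root, "facebook"))  # token match still finds the link
```

If the exact-match formula returns nothing in Excel, checking the page source for extra classes on the link is a cheap first step.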


Does Office 365 image search work? If so, how?

According to Microsoft ("Image Analysis" in https://techcommunity.microsoft.com/t5/Microsoft-SharePoint-Blog/Enrich-your-SharePoint-Content-with-Intelligence-and-Automation/ba-p/194174, from May 21, 2018), we should be able to search for text within images.
Is this working for you/anyone? If so, I would like to know what you had to do to get it to work.
I have a SharePoint modern team site with PNG images that contain clearly readable text...but search will not find anything. I have requested re-indexing.
I have had a Microsoft Support request (#10638094) open since June 27 with this question/issue, and no one--even after escalation--has been able to answer it.
Based on the article above, it appears that "MediaService" column(s) should be added to the library to support this; however, I can find no such columns in the environment (using PnP export to review).
Naomi Moneypenny and Kathrine Hammervold highlighted this functionality at Ignite 2017 (https://channel9.msdn.com/Events/Ignite/Microsoft-Ignite-Orlando-2017/BRK2181, about 27:00), but it doesn't seem to be available/working (at least not for me).
August 24: So, after more research and digging, I have an escalated support ticket at Microsoft (#10638094, unsolved), and there are conversations at https://techcommunity.microsoft.com/t5/Intelligent-Search-Discovery/Search-for-words-in-your-images-in-Office-365/ba-p/135703, https://techcommunity.microsoft.com/t5/Microsoft-SharePoint-Blog/Enrich-your-SharePoint-Content-with-Intelligence-and-Automation/bc-p/236625, and the thread "Does Office 365 image search work? If so, how?". I have yet to hear of this functionality working for anyone. I will keep digging, and I will certainly post if I hear anything.

After some digging, official sources suggest this was already released at the end of 2017. However, there is no related documentation or official guide for the text-in-image search function.
There are two ways I can think of to make text in images searchable:
Perform OCR yourself on the image before uploading it, and embed the text in the image metadata.
Use a supported image type; if I recall correctly, TIF images are recognized.
In your case, you can upload the image, add another column to the list/library that contains the text, and apply that metadata to the image.
OneDrive, on the other hand, also has this function. For example, searching for something like "cat" should pull up most pictures you have of cats. It is more likely using tags as labels for the image than reading the picture itself.
Also, I believe OneNote indexes recognizable text and handwriting. Maybe this can point you in the right direction.
Microsoft Azure's Computer Vision service can also recognize text in images. Maybe this can help.

"Is this working for you/anyone?" Yes. I responded to this post elsewhere and see it posted here as well. Unfortunately, I cannot tell you HOW to get it to work or how to verify that it is correctly configured. I can only suggest a test for you to see if it is working for you, as it works for me. I have not tested every way in which it could or should work. I have only discovered it working with PNGs I inserted into Wiki Pages in SharePoint Online. Those PNGs are generated using Snag-It to take screen captures, and I do not see where Snag-It would be doing any OCR on the image to embed anything. OCR is not even in the Snag-It help file, so I believe the PNG files are just simple PNGs. I insert them into the SharePoint Wiki page, which uploads them to the Site Assets library. When I search for a word in the image, the image is returned as a result, not the Wiki page. So I suggest you try a simple test: insert a PNG with text in it into a Wiki Page and give the index a bit of time to run to see if it works for you.

It seems like the functionality has matured recently. I have been testing it more thoroughly, and I have documented the results in my blog at http://www.collaboration-foundry.com/SharePointImageAnalysis.
Bottom line: It works for me in OneDrive and SharePoint (modern and classic), but I've only seen it work on the out-of-the-box Document content type, which limits custom solutions somewhat.
It's cool functionality when it works. Looking forward to seeing Microsoft build on this.
John

Searching for Old YouTube Videos

I'm trying to find all of the YouTube videos created by IGN's channel during the month of February 2014. IGN currently has 118,000+ videos uploaded, so going back through all of them is not possible. I previously used the following Google search string and a custom date range to find them:
site:youtube.com ignentertainment
This doesn't work anymore for some reason. I'd be much obliged if anyone has any ideas of how to do this. I have no idea what an API is, but if there's a VERY simple way of using that to do what I want that can be explained briefly, I'm willing to go that route.
Thanks.

You can use Google to limit the period that it fetches search hits from.
Start by searching for "site:youtube.com ignentertainment" or simply "ignentertainment", then click the Tools button. A new bar appears between the search box and the results, which lets you limit the time range, among other things.
Open the time-related options, choose to enter a custom period, and you're all done.
Edit: oh, and the query site:youtube.com ignentertainment worked for me.
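Since the question also asks about an API route, here is a minimal sketch of how the same date-limited search could be expressed as a YouTube Data API v3 search request. It only builds the request URL and makes no network call. The function name, the channel ID placeholder, and the API key placeholder are assumptions for illustration; you would need your own API key and the channel's real ID, and the exact boundaries of the date window may need adjusting.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(channel_id, published_after, published_before,
                     api_key, page_token=None):
    """Build a search.list URL for one channel's videos in a date window.

    Dates are RFC 3339 timestamps such as "2014-02-01T00:00:00Z".
    """
    params = {
        "part": "snippet",
        "channelId": channel_id,
        "type": "video",
        "order": "date",
        "publishedAfter": published_after,
        "publishedBefore": published_before,
        "maxResults": 50,
        "key": api_key,
    }
    if page_token:
        # Results are paged; pass the nextPageToken from the previous response.
        params["pageToken"] = page_token
    return SEARCH_ENDPOINT + "?" + urlencode(params)


# Placeholders below are hypothetical and must be replaced with real values.
url = build_search_url(
    "CHANNEL_ID_PLACEHOLDER",
    "2014-02-01T00:00:00Z",
    "2014-03-01T00:00:00Z",
    "YOUR_API_KEY",
)
print(url)
```

Opening the printed URL in a browser (with a valid key) returns JSON listing the matching videos, which is about as simple as the API route gets.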

can you have "variables" in text in google sites?

Sorry, this is a bad question. I don't even know what the title should be. I'm a total noob at making websites so this might be easy to find but I just don't know the terminology to search for. I cannot find anything about how to do this...
What I want to do is have something like references/variables that I can use in a block of text and that automatically get replaced with whatever value should be there. The best way I can think of to describe it: if I was using the site as a design doc for a game or something, I would be able to type in [Title] or something similar on any page, and when it loads, that text would be replaced with whatever my Title is. That way, if I ever change titles, names, classes, races, places, items, etc., they would only have to be changed in one place and the change would be reflected everywhere.
I notice if I add a link to a page it will automatically use the Title of that page as the text of the link. That is almost exactly what I want. Except when I change the Title of the other page the text of the link remains as the original text. It doesn't get updated to the new Title and that is not at all what I want.
Also, I want to do this in Google Sites and as simply as possible. I don't really want to use a database. I was hoping Google Sites would have some kind of functionality for this.

I don't believe this is possible (on Google Sites) and likely you need to consider a hosted solution.
Quoting the answer from this relevant post:
You should consider hosting your solution using Google's App Engine instead of Google Sites. You can set it up so it uses PHP (see link below), you can configure it to use your domain name and you get enough CPU, disk and bandwidth allowance to serve around five million page views for free each month, if you are serving more than that, their prices are extremely competitive.
Google App Engine: http://code.google.com/appengine/docs/whatisgoogleappengine.html
How to setup PHP using Google App Engine: http://blog.caucho.com/?p=187
Also I'm not sure how your PHP skills are but if you're unfamiliar with it then this should help to get you started.
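Whatever hosting you end up with, the substitution itself is small. Here is a minimal sketch in Python rather than PHP, assuming placeholders are written as [Name] as in the question. The dictionary contents and the function name are made up for illustration.

```python
import re

# Single source of truth: change a value here and every page picks it up.
# These names and values are hypothetical examples.
VALUES = {
    "Title": "Skyfall Chronicles",
    "Hero": "Aria",
}

# Matches placeholders such as [Title] or [Hero].
PLACEHOLDER = re.compile(r"\[(\w+)\]")


def render(text, values):
    """Replace [Name] with values["Name"]; leave unknown placeholders untouched."""
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), text)


page = "Welcome to the [Title] design doc. [Hero] is the lead. See [Unknown]."
print(render(page, VALUES))
```

Leaving unknown placeholders in place, rather than deleting them, makes typos in a placeholder name easy to spot on the rendered page.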

Google docs viewer url parameters

Is there any sort of documentation on exactly what parameters you can put in the url of Google viewer?
Originally, I thought it was just url, embedded and chrome, but I've recently come across other funny ones like a, pagenumber, and a few others for authentication etc.
Any clues?

One I know is "chrome".
If you've got https://docs.google.com/viewer?........&chrome=true
then you see a fairly heavy UI version of that doc; with "chrome=false", however, you get a compact version.
But indeed, I'd like a complete list myself!

I know this question is very old and perhaps you already solved your issue, but for anyone on the internet who might be looking for an answer...
I have been looking for this recently, following a guide I found on GitHub Gist
https://gist.github.com/tzmartin/1cf85dc3d975f94cfddc04bc0dd399be
More specifically, I was looking for the option to embed a certain page of a PDF using
<iframe src="https://docs.google.com/viewer?srcid=[put your file id here]&pid=explorer&efh=false&a=v&chrome=false&embedded=true" width="580px" height="480px"></iframe>
The best I could find was this article (which I suppose is quite old by now)
https://weekly-geekly.github.io/articles/111647/index.html
HOWEVER, I tried modifying the attributes and the result was simply a redirect to
https://drive.google.com/file/d/[ID]/edit
https://drive.google.com/file/d/[ID]/preview or
https://drive.google.com/file/d/[ID]/view
AS OF MAY 2020, THIS SOLUTION PROBABLY DOESN'T WORK

I'm also on a quest to discover some of the parameters of the viewer. The "chrome" parameter doesn't seem to do anything, though. Is this supposed to be the same as embedded=true?
Parameters I know of:
url= (obviously)
embedded= (obviously)
hl= set language of UI (tooltips)
#:0.page.1 = jump to page 2 (page 1 is numbered 0) - this is unreliable and often requires a refresh after the first load, defeating the purpose.
That said, when I use the Google Docs viewer on my site, "fit page to screen" is the default view without any parameters. So maybe I'm misunderstanding your question.
Source: For convenience, this is a full quote of the sole answer (from user k3david) to the crosspost of this question that @Doc posted to the Google support forum in 2011.

You can pass q=whatever to pass a search query to the viewer.
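Putting the parameters mentioned in this thread together, here is a small sketch that assembles a viewer URL. It uses only parameter names reported above (url, embedded, hl, q). None of them are officially documented, so treat their behavior as an assumption that may have changed; the function name and the example document URL are made up.

```python
from urllib.parse import urlencode

VIEWER = "https://docs.google.com/viewer"


def viewer_url(doc_url, embedded=True, lang=None, query=None):
    """Assemble a viewer URL from the parameters reported in this thread."""
    params = {"url": doc_url}          # document to display
    if embedded:
        params["embedded"] = "true"    # embeddable view, e.g. for an iframe
    if lang:
        params["hl"] = lang            # UI language (tooltips)
    if query:
        params["q"] = query            # search query passed to the viewer
    return VIEWER + "?" + urlencode(params)


print(viewer_url("https://example.com/report.pdf", lang="en", query="budget"))
```

urlencode takes care of escaping the document URL, which is easy to get wrong when pasting one URL inside another by hand.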

How to get a description of a URL

I have a list of URLs and am trying to collect their "descriptions." By description I mean what comes up, for example, if you Googled the link. For example, Googling http://stackoverflow.com shows the description as
A language-independent collaboratively
edited question and answer site for
programmers. Questions and answers
displayed by user votes and tags.
This is the data I'm trying to accumulate for the URLs I have.
I tried parsing the URL's meta-descriptions, however most of them are lacking a meta-description (yet Google and other search engines manage to get a description somehow).
Any ideas? Should I just "google" each link and scrape the data? I have a feeling Google wouldn't like this...
Thanks guys.

Different search engines have different algorithms to get the description out of the page if/when it lacks the description meta tag. Some ignore the tag even if it's there.
If you want the description Google has, the most accurate way to get it would be to scrape it. Otherwise, you could write your own or look around on the web for code that does it.

These are called snippets.
Google use proprietary (and possibly patented) methods to garner this information, so there is no simple answer.
As you suggest, they will use meta-description information if it is there. (How to set the meta-information to help Google.)
They will also honour requests from the page authors to NOT include snippets. (How to prevent Google from displaying snippets) You should probably respect this too (as well as robots.txt, of course.)
You may have some luck with existing auto-summary packages, such as OTS.
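If you do end up fetching pages yourself, the robots.txt check mentioned above can be done with Python's standard library before each request. This is a minimal sketch: the robots.txt content, the bot name, and the example URLs are hypothetical, and the rules are parsed from a string so that nothing is fetched.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be downloaded from
# the site's /robots.txt before crawling.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_text):
    """Parse robots.txt text and return a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


checker = make_checker(ROBOTS_TXT)

# "DescriptionBot" is a made-up user agent name for this example.
print(checker.can_fetch("DescriptionBot", "https://example.com/private/page.html"))
print(checker.can_fetch("DescriptionBot", "https://example.com/about.html"))
```

Pages that opt out of snippets do so in their own markup rather than in robots.txt, so that check would have to happen after the page is fetched.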

You may want to check AboutUs.org (i.e. http://www.aboutus.org/StackOverflow.com).
But, there's little chance that the site will have an aboutus page and not have a meta description.

Some info that might explain how google does this:
Webmasters/Site owners Help
Adding a URL to google

I am not familiar with Google APIs, but perhaps there is an official way to get such information.

Interesting. Some sources are better than others.
For "audiotuts.com" google has a worse description than AboutUs.com.
Google
Nov 18th in General by Joel Falconer ·
1. Recently, an AUDIOTUTS reader asked me about creative process. While this
is a topic that can’t be made into a
...
AboutUs.com:
AUDIOTUTS is a blog/tutorial site for
musicians, producers and audio
junkies! It is the sister site of the
popular PSDTUTS, VECTORTUTS and
NETTUTS.
I hate problems like these... they should be trivial but they aren't!

If you can assume English content, you can first look for Meta Description, and if that doesn't work, you can look for the first two or three sentence-like word sequences.
A product I worked on looked for the first P or DIV that contained more than one sequence of > n "words" delimited by periods. It would use the two or three sentence-like sequences, up to x total words, as a summary paragraph. It wasn't 100% accurate, but good enough for the average case. The number of words was adjusted a few times to eliminate things like navigation elements.
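The heuristic described above can be sketched roughly as follows. This is an approximation under stated assumptions and not the original product's code: it only looks at P elements (not DIV), treats ".", "!" and "?" as sentence ends, and the thresholds min_words (the "n" above) and max_words (the "x" above) are arbitrary defaults chosen for illustration.

```python
import re
from html.parser import HTMLParser


class SummaryParser(HTMLParser):
    """Collects the meta description and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.meta_description = None
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = (attrs.get("content") or "").strip() or None
        elif tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the collected text and collapse runs of whitespace.
            self.paragraphs.append(" ".join("".join(self._buffer).split()))
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def sentences(text, min_words):
    """Split after sentence-ending punctuation; keep chunks longer than min_words."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p for p in parts if len(p.split()) > min_words]


def describe(html, min_words=4, max_sentences=3, max_words=40):
    """Return the meta description, else the first paragraph that looks like prose."""
    parser = SummaryParser()
    parser.feed(html)
    if parser.meta_description:
        return parser.meta_description
    for para in parser.paragraphs:
        found = sentences(para, min_words)
        if len(found) > 1:  # more than one sentence-like sequence
            words = " ".join(found[:max_sentences]).split()
            return " ".join(words[:max_words])
    return None


# Hypothetical page without a meta description.
PAGE = """
<html><head><title>Example</title></head><body>
<p>Home | About | Contact</p>
<p>This site collects questions from working programmers. Answers are ranked by community votes and tags.</p>
</body></html>
"""
print(describe(PAGE))
```

The navigation paragraph is skipped because it contains only one sentence-like chunk, which is the same reason the original answer gives for tuning the word count.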
