How can I get a website report showing the links on each page?

I want to get a report that specifies which links are on each page of the website. I have tried different pieces of software, but the problem is that they just give me all the links without showing exactly which links are on which page. The website I am trying to report on is also very unstructured, so it's not possible to classify links just by the forward slashes in their URLs. For example, taking all links that start with https://example.com/blog will not give me all the links inside the 'https://example.com/blog' page, because links inside the 'https://example.com/blog' page don't necessarily start with 'https://example.com/blog/'.
What can I do about this?
Thanks.

In Google Analytics, there is no such concept as the next page.
Rather, it only knows the previous page.
This is due to the disconnected nature of the web.
You can, however, use the previous page to trace back and get the data you want.
Instead of looking for all links inside https://example.com/blog, you would look for all pages where the previous page is https://example.com/blog.
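If Analytics data is not available, a small crawler can also produce the per-page report directly. Here is a minimal sketch, assuming Python with the requests and beautifulsoup4 packages installed; https://example.com/ stands in for the actual site:

    # Crawl a site and record which links appear on each page.
    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    START = "https://example.com/"
    DOMAIN = urlparse(START).netloc

    def crawl(start):
        seen = {start}
        queue = deque([start])
        report = {}  # page URL -> list of links found on that page
        while queue:
            page = queue.popleft()
            try:
                response = requests.get(page, timeout=10)
            except requests.RequestException:
                continue
            soup = BeautifulSoup(response.text, "html.parser")
            links = [urljoin(page, a["href"]) for a in soup.find_all("a", href=True)]
            report[page] = links
            for link in links:
                # Follow only same-host links; external links are still recorded above.
                if urlparse(link).netloc == DOMAIN and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return report

    if __name__ == "__main__":
        for page, links in crawl(START).items():
            print(page)
            for link in links:
                print("    " + link)

Because the report is keyed by the page each link was found on, it does not matter how unstructured the URLs themselves are.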

Back to search results

On the website I run, we have a single search field where you can enter a name or profession. When you search, you are served a page full of results that come from three separate sources.
Once you click on one result, e.g. John Do, you are taken to his page. On that page we have a "back to search" link, but it goes to a blank screen.
I want it to go back to the actual search results so the person doesn't have to do it all again, but I'm not sure where to start. Any suggestions?
That's a tricky situation.
There can be many solutions to this issue, but I'll name some of them.
Activate caching of the pages (a quick trick, not suitable for websites that rely on logged-in users): the visitor can go back and the form will still be there with its results, without any issue.
Load John Do's page via Ajax and use #hash references, so you don't reload the page but just manage the state of the HTML (can be done with plain JS or a framework such as React).
Depending on which platform you are working on, carry the search variable using the Post/Redirect/Get concept; a minimal sketch follows below.
Hope that helps!
Cheers.
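A minimal Post/Redirect/Get sketch, assuming a Python/Flask backend (the question does not say which platform is in use; the route names, the search_backends() helper and the inline HTML are placeholders, and HTML escaping is omitted for brevity):

    from flask import Flask, request, redirect, url_for

    app = Flask(__name__)

    def search_backends(query):
        # Placeholder for querying the three result sources and merging them.
        return [{"id": "john-do", "name": "John Do"}] if query else []

    @app.route("/search", methods=["GET", "POST"])
    def search():
        if request.method == "POST":
            # Post/Redirect/Get: turn the form POST into a bookmarkable GET URL.
            return redirect(url_for("search", q=request.form.get("q", "")))
        query = request.args.get("q", "")
        items = "".join(
            f'<li><a href="{url_for("person", person_id=r["id"], q=query)}">{r["name"]}</a></li>'
            for r in search_backends(query)
        )
        return (f'<form method="post"><input name="q" value="{query}">'
                f'<button>Search</button></form><ul>{items}</ul>')

    @app.route("/person/<person_id>")
    def person(person_id):
        # The query travels along in the URL, so "back to search results"
        # is just a link to /search?q=... instead of a blank page.
        back = url_for("search", q=request.args.get("q", ""))
        return f'<h1>{person_id}</h1><p><a href="{back}">Back to search results</a></p>'

The point is that the search ends up in a bookmarkable GET URL, so the back link simply re-issues the same query instead of landing on a blank page.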

How can I prevent certain elements from being displayed in the Google search excerpt?

Currently Google displays elements in the result excerpts that belong to the functional parts of the site, like the Edit and Delete links in the example above. Is there a way to exclude these elements from being crawled/displayed in Google?
To exclude whole pages from Google's index, block them using the robots.txt file; if it is just specific content, use the rel="nofollow" attribute on the links in question.
Hope this helps.
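For reference, a minimal robots.txt rule that keeps a whole path away from Google's crawler could look like this (the /admin/ path is only a placeholder, since the question does not name the actual URLs):

    User-agent: Googlebot
    Disallow: /admin/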
Update on my particular situation here: I just found out that the frontend code had been generated in a way where the title and the meta description were identical.
Google is smart enough to realise that if a piece of copy is already displayed in the title of the search result, there's no reason to add it to the excerpt as well; instead it looks for content it believes to be valuable from the actual page.
Lessons learned:
there's no way to hide elements from Google while keeping them visible to your users
if you'd like to have control over the content displayed in Google searches, avoid using the same copy in your title and meta description

Search results: How to show only pages, not individual content items?

We are using Liferay as a classic CMS, meaning that we compose pages using web content articles. There is an issue with Liferay's internal search that I could not yet find a proper answer for:
Because web content articles are pretty much just building blocks for pages, we don't want the search to show them as distinct items. The user should only get a list of pages that contain their search keywords, including all the articles placed on those pages.
At the moment we can see two different approaches and both come with certain problems we could not solve yet:
Idea 1
We modify the journal indexer and try to obtain all URLs of the pages (how?) on which the article has been placed. We then add them to the document to be indexed. In the search results we can then access the URLs and collect them. In the end we make sure every URL is only shown once.
Idea 2
At some point Liferay renders the entire page before sending it to the browser. If we could somehow hook an indexer in there, we could index the entire page and then limit the search to these special "page documents". Getting the fully rendered page would be the main issue, because either we would have to run a crawler to frequently trigger this indexing, or we would need to find a way to trigger page rendering from within an indexer, or something like that.
I have been carrying this problem around for quite a while now and still could not find an idea good enough to spend time trying it out. If any of you has some input on these two ideas, or maybe an entirely different approach, I would be extremely grateful.
I'll just answer myself, because by now we found a suitable solution to solve our problem:
In addition to the default search portlet, there is also a "Web Content Search Portlet" shipped with Liferay. It seems to have been part of Liferay for quite a while, but it's somewhat hard to find because there is hardly any documentation for it (I only found the Liferay wiki page, which isn't really anything at all). It searches only within web content articles and shows links to the pages rather than just a link to an isolated view of the article. It has far fewer configuration options than the default search portlet, however. Pretty much all it lets you change is whether articles actually have to be placed on at least one page to show up in the results.
So there is no need for any kind of custom indexer or any other "hack"... all we need to do is use the correct portlet. We will only need to write a hook that changes the appearance of the results page.
What you ask is interesting, but your ideas are heading in the wrong direction.
Idea 2 in particular is wrong, because you cannot do indexing work while a page is being rendered; think about the performance alone.
In Liferay, pages and assets are not directly linked: pages have portlets, and portlets display assets (web content and more).
Liferay's indexing reads and scans the assets' content, not the rendered output of the assets. Think about permissions: the same page can display different content depending on the user who is looking at it.
Bye

SEO: Google fetch returns blank page (but rendered HTML seems correct)

I usually find all the answers to my questions, but this time I could not find any. This is actually the first time I have posted on Stack Overflow!
Here is my problem.
The root of my website: "www.example.com" returns a blank (not empty) page when I use Google Webmaster Tools to fetch my website. When I look at the rendered HTML, it is exactly what it should be, but the preview of the page is just blank.
All the other pages of my website like "www.example.com/sample_page.html" seem to give the proper preview though. I have even tried to make a redirection (htaccess) of the root domain "www.example.com" to "www.example.com/sample_page.html" but it also gives a blank preview.
I use cached HTML files, so it has nothing to do with JavaScript being enabled or anything like that. Furthermore, as I said, the rendered HTML looks fine; it's just the preview that does not show anything.
Any hint is greatly appreciated, as I have been trying to find a solution for a few weeks now.

MediaWiki / Excel: Hyperlink from Excel to a non-existent wiki page gives a 404 - how can I fix or work around this?

I suspect this could be something faulty with Excel (although I keep an open mind), but I wondered if anyone knew how I could get around this apparent bug:
I wish to create Excel spreadsheets which link to pages in a local wiki (running MW 1.14.0, full details below) where those pages don't yet all exist.
The idea is that over time we will fill in the details of the pages, but we would like to create the links now (because copies of the Excel files will get sent out to various internal users, and it will not be feasible to track them down and add the links later once the pages are created).
The problem is that when I create such a hyperlink in Excel and then go to follow the hyperlink, I get a message back indicating that the page does not exist. The full text of the message is:
"Unable to open http://. The Internet site reports that the item you requested could not be found. (HTTP/1.0 404)"
This happens on our site, or in fact if you link to a non-existent page on Wikipedia (e.g. http://en.wikipedia.org/wiki/Swed53rf). Whereas if you put such a link into a browser, you get the correct response (which is to be taken to a page indicating that there is no such page, but that you can create it by following the usual link).
Is there some setting on Apache that I might need to configure / override to make sure it returns a valid server response to Excel?
Creating links to existing pages works fine. I appreciate that in theory we could go around creating all the pages that are required, but some of the people involved in the project (creating the initial Excel files) do not / cannot use our wiki, and it would be better if this just worked as it appears it should, rather than having to add steps to work around it in this way.
I also wondered if it was anything to do with the short URL rewriting. Our wiki, like Wikipedia, has short URLs, e.g.:
http://server/w/index.php?title=User:Joe_Blogs/Sandbox
can be reached from
http://server/wiki/User:Joe_Blogs/Sandbox
but including hyperlinks to the full name versions of the pages does not resolve the issue.
The version of Excel being used is Excel 2003 (SP3)
I have discovered that this also happens with Word 2003 (I imagine they are using the same code). However, the desired behaviour occurs with Lotus Notes (a miracle, as it's rubbish in so many other ways!).
I have not done any significant development on Apache, but I could consider some form of custom page that redirects to the non-existent wiki page if MediaWiki changes were deemed too complex/tricky (although I'm not particularly sure where I'd start with this idea; I'm guessing some sort of URL parameter to accept the destination page name might be a possible approach).
Any helpful suggestions gratefully received!!
[FYI: I have posted a question on MWUsers forum (www.mwusers.com) too after Googling this to no avail! I'll update the forum response there if I get an answer here or vice versa]
Many thanks,
Neil
Running on Ubuntu Server 8.10
Product Version:
MediaWiki 1.14.0
PHP 5.2.4-2ubuntu5.6 (apache2handler)
MySQL 5.0.51a-3ubuntu5.4
Installed extensions:
CategoryTree (Version r44056)
Renameuser
ImageMap (Version r35980)
ParserFunctions (Version 1.1.1)
StringFunctions (Version 2.0.2)
Not sure how to get Excel to let you go to a page which turns out to be a 404, but as a temporary workaround, you can hack out MediaWiki's 404 reporting on missing pages...
In MediaWiki 1.14 or 1.15 releases this will be in Article::view() in includes/Article.php:
// Remove or comment out this header call if you don't want MediaWiki
// to send a 404 status for pages that don't exist yet:
if( $return404 ) {
    $wgRequest->response()->header( "HTTP/1.x 404 Not Found" );
}
Note that the latest dev code is a little different, but you can find it where it sends the same header in the same file. :)
Wikipedia returns a 404 with a redirect which gets you to the page you want; my guess would be that Excel's rendering engine is not following the redirect.
You could try capturing the conversation in Wireshark, both with a browser and with Excel. That might show you what's happening differently.
Surely once you roll out the new pages, the links would start working, though?
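If Wireshark feels like overkill, the status line and headers can also be inspected with a few lines of Python (a sketch, assuming the requests package; the Wikipedia URL is the example from the question):

    # Fetch a non-existent wiki page without following redirects and print
    # exactly what the server returns, to compare against what Excel sees.
    import requests

    response = requests.get(
        "http://en.wikipedia.org/wiki/Swed53rf",
        allow_redirects=False,
        timeout=10,
    )
    print(response.status_code, response.reason)
    for name, value in response.headers.items():
        print(name + ": " + value)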
