Download/Save a webpage including all its linked sites - web

In Chrome there's this option of saving a complete webpage. I would like to save a complete webpage but that the pages which are linked should also be saved. Is this possible? I.e. I want to go one step further than simply saving the page I'm looking at. Is it possible to go two steps further? I.e. saving all the linked pages in the linked pages? Is it possible to generalized this to N-steps? I realize this would need a lot of memory space, but is there a code available to do this task?

Related

Can someone guide me how to collect a list of url address in the tab using python?

I'm trying to collect a list of "https://..." and hope to store them in csv file. I can do them manually such as use excel, copy the urls from the website of interest and paste them one by one. But it's tedious and definitely would take lot of time.
can someone suggest and guide for a faster way?
If you just need the addresses quickly from one page you could run this javascript snippet document.links.forEach(link=>console.log(link.href)) in the console of your browser, this will output all of the links on that page.
If you want to use python to scrape the page I would suggest taking a look at this question on stackoverflow, this uses the beautifulsoup framework.
If there is dynamic content loaded on the page with javascript it's probably better to use something like Selenium, relevant stackoverflow question

How can i get website report showing links from each page

I want to get a report which specifies what all links are there in each page of the website.I tried using different softwares,but the problem is they are just giving all links without showing exactly which links are there in each page.Also the website i am trying to make a report on is very unstructured,so it's not possible to just classify links,based on url forward slashes.For example,links starting with https://example.com/blog, will not give me all links inside the
'https://example.com/blog' page,because links inside 'https://example.com/blog' page can contains links without 'https://example.com/blog/' in the beginning of the link.
What can i do about this?
Thanks.
In Google analytics, there is no such concept as the next page.
Rather, it only knows the previous page.
It is due to the disconnected nature of the web.
You can, however, use the previous page to trace back to get the data you want.
Instead of looking for all links inside the https://example.com/blog, you will be looking at getting all links where the previous page is https://example.com/blog
More detailed explanation

Transfer weebly site to a new address and keep slideshow

I built a site on weebly and archived and exported all of the files associated with it. When I tried the files with the new site (basically just needed to change addresses), the slideshow no longer worked. That's when I found out via google that this is a known problem, but all of the posts on it are over two years old. Has anyone figured out a way around this?
Usually I would include code on SO, but there is so much that you'll have to let me know what you want to see.
I put my weebly site inside an iframe on the "authorized site". No one can link to any of the individual pages and I couldn't figure out the iframeresizer, so there is awkward space at the end of some pages. It's terrible.
But the rest of the group can easily go into weebly and edit any of the pages, which saves me a lot of time and responsibility. Plus, of course, the precious slide show works.

Search result: How to show only pages, not different content items?

We are using Liferay as a classic CMS meaning that we compose pages using web content articles. There is an issue with Liferay's internal search I could not yet find a proper answer for:
Because web content articles are pretty much only building blocks for pages we don't want the search to show them as distinct items. The user should only get a list of pages that contain their search keywords, including all the articles put onto this page.
At the moment we can see two different approaches and both come with certain problems we could not solve yet:
Idea 1
We modify the journal indexer and try to obtain all URLs of the pages (how?) where the article has been placed on. Then we add them to the document to be indexed. In the search result we then can access the URLs and collect them. In the end we make sure every URL is only shown once.
Idea 2
At some point Liferay renders the entire page before sending it to the browser. If we somehow could put an indexer there, we could index the entire page. We then could limit the search to the special "page documents". Getting the fully rendered page would be the main issue here, because either we would have to run a crawler to frequently trigger this indexing or we would need to find a way to trigger page rendering from within an indexer or something like that.
I have been carrying this problem around for quite a while now and still could not find an idea good enough to spend time trying it out. If anyone of you has some input on those two ideas or maybe an entirely different approach, I would be extremely grateful.
I'll just answer myself, because by now we found a suitable solution to solve our problem:
In addition to the default search portlet there is also a "Web Content Search Portlet" shipped with Liferay. It seems to have been part of Liferay for quite a while now, but it's somewhat hard to find, because there is hardly any documentation for it (I only found the Liferay wiki page, which isn't really anything at all). It searches only within web content articles and shows links to the pages rather than just a link an isolated view of the article. It has much less configuration options than the default search portlet, however. Pretty much all it allows to change is whether articles actually have to be placed on at least one page to show up in the results.
So there is no need for any kind of custom indexer or any other "hack"...all we need to do is use the correct portlet. We will only need to write a hook that changes the appearance of the result page.
What you ask is interesting but your ideas are on the wrong direction.
Specially idea 2 it's particulary wrong because you cannot do indexing work meanwhile a page is rendered. Think about performace only.
In Liferay pages and assets are not directly linked: pages have portlets and portlets display assets (web content and more).
Liferay indexing refers and scans assets content, not refers the display result of the assets. Think about permission: the same page can display different contents depends on the user who looks.
bye

TYPO3: How to count page impressions on every page with an extension

I need to count the page impressions of every page on a TYPO3 site into the db.
So I think I need an extension which is called on every page impression and increase a column 'impressions' in the db of the specific page.
I'm new to typo3 and new to extension development as well. Is there a way to include an extbase-extension on every page so some php-script get called?
(Update)
I want to add more information:
I don't need a counter which counts all PIs. The counter needs to be page-related. So it make sense to extend the pages-table from Typo3. Another need is that the extension should be done with extbase.
I'm new to typo3 and new to extension development as well. Is there a way to include an extbase-extension on every page so some php-script get called?
Once your plugin is configured you can include it with page.1234 < plugin.tx_yourextension_pi1 on any page. 1234 determines the position on your page.
The script should be USER_INT, so it's not being cached (mind you, this will cost loads of performance as previously stated by #norwebian)
As you don't want to output anything, make sure the controller stays empty as well.
Did you do a quick search in the extension repository? Trying a search for "page counter" reveals four relevant extensions.
"Sys_stat" is the closest thing to an "official" solution, it is really just enabling a few settings already existent. It has been reported to fill up the database with too much data, though.
"Generic Visitor Counter" would be my favourite, I believe (if I was going for a page counter at all), it is recently updated and seems simple enough.
You should really consider a proper stats extension, though. Both ics_awstats and ke_stats have been in my toolset.
YMMV. Be aware that if your site is popular, stats gathering quickly gets out of hand. On the other hand, if you go for a simple counter, including uncached extensions will cost performance.
I am not sure if I really understood what you want and need. After all, page impressions are not the same as page views. I wouldn't know the difference "onpage" right now though. So am I right in assuming that you mean page views?
If yes: I would take the following approach:
A separate, autonomous extension with a JavaScript for asynchronous calling of an API and a table for storing page views / page impressions.
Each page globally binds a JavaScript that initializes itself.
Once the DOM is ready, it sends a call to an AJAX API endpoint with the URL of the page as a parameter.
The endpoint takes only the URL.
For each unique URL, a record including counter is created or updated.
Extending the table for the pages doesn't make sense to me. What are you doing with a website that consists of news overviews, news details, press and blog sections, a dealer search and a store with product pages?
I would keep the statistics table standalone.
If you expand the table a bit and add date and time - no simple increment of hits - you can even identify the hottest pages of the week, the month, etc.
--
My approach won't increase/delay page load time much, if at all, and will have little noticeable impact even on heavily requested websites.
With the AJAX endpoint, it's then up to you how you deploy it and how much of the CMS framework you want to load.

Resources