I am writing a set of functions to generate a sitemap for a website. Let's assume that the website is a blog.
By definition, a sitemap lists the pages that are available on a website. For a dynamic website, those pages change quite regularly.
Using the example of a blog, the 'pages' will presumably be the blog posts. Since there is a finite limit on the number of links in a sitemap (ignore sitemap indexes for now), I can't keep adding the latest blog posts indefinitely, because at some point in the future the limit will be exceeded.
I have made two (quite fundamental) assumptions in the above paragraph. They are:
Assumption 1:
A sitemap contains a list of pages in a website. For a dynamic website like a blog, the pages will be the blog posts. Therefore, I can create a sitemap that simply lists the blog posts on the website. (This sounds like a feed to me.)
Assumption 2:
Since there is a hard limit on the number of links in the sitemap file, I can impose some arbitrary limit N and simply generate the file periodically to list the latest N blog posts. (At this stage, this is indistinguishable from a feed.)
My questions then are:
Are the assumptions (i.e. my understanding of what goes inside a sitemap file) valid/correct?
What I described above sounds very much like a feed. Can bots not simply use a feed to index a website (i.e. is a sitemap necessary)?
If I am already generating a file that has the latest changes in it, I don't see the point of also producing a sitemap protocol file - can someone explain this?
Assumption 1 is correct - the sitemap should indeed be a list of the pages on the site. In your case, yes, that would be the blog posts, plus any other pages you have, like a contact page, home page, about page, etc.
Yes, it is a bit like a feed, but a feed generally only contains the latest items, while the sitemap should have everything.
From Google's docs:
Sitemaps are particularly helpful if:
Your site has dynamic content.
Your site has pages that aren't easily discovered by Googlebot during the crawl process—for example, pages featuring rich AJAX or images.
Your site is new and has few links to it. (Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it may be hard for us to discover it.)
Your site has a large archive of content pages that are not well linked to each other, or are not linked at all.
Assumption 2 is a little incorrect - the limit for a sitemap file is 50,000 links / 10 MB uncompressed. If you think you are likely to hit that limit, start by creating a sitemap index file that links to just one sitemap, and then add more sitemaps to it as you go.
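For illustration, a minimal sitemap index pointing at a single child sitemap might look like this (the example.com URL is a placeholder):

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- one child sitemap for now; add more <sitemap> entries as the site grows -->
      <sitemap>
        <loc>http://www.example.com/sitemap-1.xml</loc>
      </sitemap>
    </sitemapindex>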
Google will accept an RSS feed as a sitemap if that's all you have, but points out that feeds usually only contain the most recent links - the value in having a sitemap is that it should cover everything on the site, not just the latest items, which are probably the most discoverable anyway.
I have a Kentico 12 MVC site where the CMS and the "client" site are on the same server but as separate IIS sites. One is called admin.site.com and the other is called dev.site.com.
I'm trying to implement the Smart Search functionality with a Page Crawler index. The reason I want a Page Crawler index is that my content structure is as follows:
Page Container > Page Type "Product"
Then, within the "Product" page type, I'm pulling in content from a different part of the content tree using widgets/page builder functionality in the Page tab. The Content tab of that page has very little actual content.
If I use a Pages index and search on that, it only grabs the page types that live in the content widget section of the site, not the pages that implement the widgets, which are the actual live pages on the site. I implemented the Page Crawler index and tried a search preview, but literally anything I search for comes back with no results. Please let me know what details you'd need from me; I appreciate any help!
Best,
RP
Check the documentation and especially the note:
"We do not recommend using crawler indexes on MVC content-only sites. The crawler only selects pages from the site's content tree in Kentico, which may not match the actual structure of the website (in many cases, content-only pages only store data and do not represent pages on the live site)."
To achieve what you need, you will have to write your own crawler code and combine it with a custom search index.
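Kentico's custom search index integration itself is C#, but just to illustrate the crawling half of the idea - fetching the rendered live page rather than reading the content-only tree - here is a generic sketch in Python (the URL path in the example is invented):

    from urllib.request import urlopen
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect the visible text of a page, skipping script/style blocks."""
        def __init__(self):
            super().__init__()
            self.chunks = []
            self._skip_depth = 0

        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self._skip_depth += 1

        def handle_endtag(self, tag):
            if tag in ("script", "style") and self._skip_depth:
                self._skip_depth -= 1

        def handle_data(self, data):
            if not self._skip_depth and data.strip():
                self.chunks.append(data.strip())

    def crawl_page(url):
        """Fetch a live page and return its visible text for indexing."""
        html = urlopen(url).read().decode("utf-8", errors="replace")
        extractor = TextExtractor()
        extractor.feed(html)
        return " ".join(extractor.chunks)

    # e.g. crawl_page("https://dev.site.com/products/some-product")

The point is only that the crawler must hit the live site (dev.site.com), where the widgets have already been rendered, rather than the admin content tree.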
We are using Liferay as a classic CMS, meaning that we compose pages using web content articles. There is an issue with Liferay's internal search that I have not yet found a proper answer for:
Because web content articles are pretty much only building blocks for pages, we don't want the search to show them as distinct items. The user should only get a list of pages that contain their search keywords, covering all the articles placed on each page.
At the moment we see two different approaches, and both come with certain problems we have not been able to solve yet:
Idea 1
We modify the journal indexer and try to obtain all URLs of the pages (how?) on which the article has been placed. Then we add them to the document to be indexed. In the search results we can then access the URLs and collect them, and in the end make sure every URL is only shown once.
Idea 2
At some point Liferay renders the entire page before sending it to the browser. If we could somehow hook an indexer in there, we could index the entire page and then limit the search to these special "page documents". Getting the fully rendered page would be the main issue, because either we would have to run a crawler to frequently trigger this indexing, or we would need to find a way to trigger page rendering from within an indexer, or something like that.
I have been carrying this problem around for quite a while now and still have not found an idea good enough to spend time trying out. If anyone has some input on those two ideas, or maybe an entirely different approach, I would be extremely grateful.
I'll just answer myself, because by now we have found a suitable solution to our problem:
In addition to the default search portlet, there is also a "Web Content Search Portlet" shipped with Liferay. It seems to have been part of Liferay for quite a while now, but it's somewhat hard to find because there is hardly any documentation for it (I only found the Liferay wiki page, which isn't really anything at all). It searches only within web content articles and shows links to the pages rather than just a link to an isolated view of the article. It has far fewer configuration options than the default search portlet, however. Pretty much all it allows you to change is whether articles actually have to be placed on at least one page to show up in the results.
So there is no need for any kind of custom indexer or any other "hack" - all we need to do is use the correct portlet. We will only need to write a hook that changes the appearance of the results page.
What you ask is interesting, but your ideas are headed in the wrong direction.
Idea 2 in particular is wrong because you cannot do indexing work while a page is being rendered; think about the performance impact alone.
In Liferay, pages and assets are not directly linked: pages contain portlets, and portlets display assets (web content and more).
Liferay's indexing scans the content of the assets, not the rendered output of the assets. Think about permissions: the same page can display different content depending on which user is looking.
bye
I want my website to have indentation in Google search results.
After looking at many websites for reference, I found this one: www.traveloka.com.
Inside that website I can't find any meta keywords at all.
But the website is well indented in the results.
My questions are:
- Are meta keywords really needed to have Google indent my search result?
- If yes, why is www.traveloka.com well indented without meta keywords?
- If no, what matters then, besides having pages with hrefs linking to each other?
UPDATE:
While doing SEO work, I found this website:
chlooe.com
It reports SEO advice: what to change, and so on.
I'll follow the instructions there. Any thoughts?
If by indentation you mean ... then what you're describing is called sitelinks.
Meta tags are no longer important for most search engines. They now rank pages according to content, so use strong keywords in your site's content to get a better ranking.
Having a specific page title helps a lot too.
As for the meta tags, personally I like to leave them in, but they are no longer mandatory.
Google sitelinks are generated automatically by Google depending on your content.
Here are a few tips:
1) Have a sitemap.xml on your website. This tells crawlers which pages are available on your site (see the minimal example after these tips). To generate a sitemap.xml, I use http://www.xml-sitemaps.com/
2) Submit that sitemap to Google Webmaster Tools.
3) Use clean URLs. For example: www.mydomain.com/contact, .../about-us, .../portfolio, etc. These help search engines separate the content and create sitelinks based on the most important content.
4) Most important of all, get traffic to your website... no traffic = poor ranking.
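For reference, a minimal sitemap.xml listing a couple of those pages looks like this (the domain is the placeholder from tip 3, following the sitemaps.org protocol):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.mydomain.com/contact</loc>
      </url>
      <url>
        <loc>http://www.mydomain.com/about-us</loc>
      </url>
    </urlset>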
This is not a full tutorial, just some tips. Search for "Google sitelinks" to learn more.
Hope this helps
https://support.google.com/webmasters/answer/47334?hl=en
I have a search engine that searches albums.
For each music album, I have a page.
So the workflow goes like this:
People search for music titles
The search engine displays a list of albums.
People click on an album to go to a details page.
I want Google to index my front page and the details pages, and I want the details pages to be highly ranked. How can I build a sitemap for this?
By the way, I have about 5 million albums (but I want the top 1000 to be highly ranked on Google).
You would not use a sitemap for that many results. You would want each album to appear as a page with a unique URI referencing that page. That way a search engine can crawl your site by following links, since search bots cannot submit form data. Each of those URIs should be simple, meaning limited to this part of the URI syntax:
scheme://authority_segment/path
Program your web application to remove and throw away any extraneous data, such as the query string or parameters. If you do this, be sure to watch for URI poisoning or SQL injection, even via character encoding.
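As a rough illustration of reducing a URI to scheme://authority/path, here is a small Python sketch (the function name and example URL are made up, not from any particular framework):

    from urllib.parse import urlsplit, urlunsplit

    def clean_uri(uri):
        """Reduce a URI to scheme://authority/path,
        discarding the query string and fragment."""
        parts = urlsplit(uri)
        return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

    print(clean_uri("http://example.com/album/123?session=abc#top"))
    # -> http://example.com/album/123

Note this only normalizes the URI shape; you still need to validate the path against known-good patterns to guard against the poisoning and injection issues mentioned above.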
How can I build a sitemap for this?
By pulling the addresses out of your database and creating an XML file with a high priority for some selected pages. Somehow I think that isn't your real question …
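As a minimal sketch of that approach (assuming a SQLite database; the table and column names here are invented for illustration, and the 50,000-link cap comes from the sitemap protocol):

    import sqlite3
    from xml.sax.saxutils import escape

    def write_sitemap(db_path, out_path, top_n=1000):
        """Write sitemap.xml from an albums table, giving the top N
        albums a higher priority than the rest."""
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            "SELECT slug FROM albums ORDER BY rank LIMIT 50000"
        ).fetchall()
        conn.close()
        with open(out_path, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for i, (slug,) in enumerate(rows):
                priority = "0.9" if i < top_n else "0.5"
                f.write("  <url>\n")
                f.write("    <loc>http://example.com/album/%s</loc>\n" % escape(slug))
                f.write("    <priority>%s</priority>\n" % priority)
                f.write("  </url>\n")
            f.write("</urlset>\n")

    write_sitemap("albums.db", "sitemap.xml")

Regenerate the file periodically (e.g. from cron) so newly added albums get picked up.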
If I wanted to automate building a sitemap for a site like this, I'd employ Python. I'd pretty much write everything from the ground up (except the data store access); the format is quite simple.
I'm not sure I quite understand your question...
I've got two SharePoint site collections. I have a Wiki in one and I want to move it to the second one. I did the move by creating a template and importing it into the new site collection. The problem I've got now is the links: the Wiki links are referring to the old location.
Does anyone have a solution for this?
The problem with the SharePoint Wiki system is that it resolves the Wiki-style links [[link]] at save time into absolute links to the page in the Wiki page list.
I think you will need to write some code that loops through and updates the text in your Wiki pages. Use the WSS object model to find and update each list item that represents a Wiki page.
You can also have a look at www.sharepointproducts.com, where you can download a free version of a tool (CopyMove for SharePoint) that can move Wiki pages across site collections. It does not, however, update the links on the Wiki pages. But it is my tool, so I will give some thought to adding support for this; it is not the first time I have heard about this problem.