With Kimono Web, the crawled payload always included a url and an index field in every source URL's JSON. But with the desktop app, these fields are missing, and my product depends on them entirely.
I've been browsing the source code of Kimono Desktop, but I couldn't find the part responsible for those fields.
The index field is explained here: https://help.kimonolabs.com/hc/en-us/articles/203349674-Add-a-unique-index-to-each-result-object-
Can anyone help me with this?
Thanks
I've had the same issue. I found this workaround for the missing url field with the desktop application: http://mudd.com/blog/how-to-extract-vdp-data-from-your-website/
Also, in case you used the crawl scheduling feature in the Kimono web app: I found that if I edit my APIs and save them again, it lets me choose a crawl frequency. I just discovered this, so I'm crossing my fingers and waiting to see whether it really works.
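As for the missing index field, one stopgap is to add it yourself in a post-processing step over the crawled results. Below is a minimal sketch in Node.js; it assumes you can get the desktop app's results as a JSON array of result objects (the file names and that structure are my assumptions, not Kimono's documented export format).
// post-process.js - add a sequential index to each crawled result object
// Assumes results.json holds an array of result objects (an assumption,
// not Kimono Desktop's documented export format).
const fs = require('fs');

const results = JSON.parse(fs.readFileSync('results.json', 'utf8'));

const indexed = results.map((result, i) => ({
  ...result,
  index: i, // unique, sequential index per result, mirroring what Kimono Web provided
}));

fs.writeFileSync('results-indexed.json', JSON.stringify(indexed, null, 2));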
While creating a knowledge base in Dialogflow from a URL, I am getting the message "Error". However, I am able to see the FAQs on this URL when opening it in a browser. For reference, please find the screenshot below. If feasible, please suggest how I can find the exact reason for this error, as Dialogflow doesn't give any other relevant details.
The URL for which I am configuring the knowledge base is:
https://www.owens.edu/faq/early-alert/
(screenshot: the Dialogflow knowledge base creation error)
The full error message is the following:
"Failed to crawl https://www.owens.edu/faq/early-alert. Please verify that your URL is publicly accessible and is hosted on a site that can be indexed by Google Search."
I tested the FAQ page you shared, and by using Chrome's Developer Tools I was able to see that error message. I suggest you take a look at the "Supported content" documentation for knowledge bases in Dialogflow. There, you can see the following statement:
Files from public URLs must have been crawled by the Google search indexer, so that they exist in the search index. You can check this with the Google Search Console. Note that the indexer does not keep your content fresh. You must explicitly update your knowledge document when the source content changes.
Therefore, make sure to meet all the requirements listed there.
I have a working app built on Node.js + Drywall + OpenShift (sorry, it's in Arabic). Basically, I am looking to improve the service, but I've hit a major roadblock. The site is a classifieds site and I need to optimize it for SEO; however, my links to ads look like this:
http://yobyobi.com/ads/show/55c9ff9dcf68970612ba2d38
55c9ff9dcf68970612ba2d38 is the ad ID in my MongoDB. I also have a field combining the date and the title of the ad, e.g. "Sun-Nov-22-2015-8-pm-2007-camry-for-sale". The goal is to make the URL pretty and understandable by search engines. The end result I want is one of the following:
yobyobi.com/ads/show/55c9ff9dcf68970612ba2d38/Sun-Nov-22-2015-8-pm-2007-camry-for-sale
yobyobi.com/ads/show/Sun-Nov-22-2015-8-pm-2007-camry-for-sale/55c9ff9dcf68970612ba2d38
yobyobi.com/ads/show/Sun-Nov-22-2015-8-pm-2007-camry-for-sale/
Now, option 3 would be ideal, but it would slow down my application if I have to look up ads by title instead of by ID. What I want is similar to what Stack Overflow does (see attached picture).
(screenshot: Stack Overflow URL example)
Code
app.get('/ads/show/:id', require('./views/account/ads/index').read);
The above route returns the ad with all its details, including the title that I want to use, but the problem is that I cannot change the route URL after I receive the title.
I am not sure whether this module would help with what I am trying to do; it's called "named-routes".
Has anyone run across this problem? If so, can you share some insight on how best to tackle it?
Thanks in advance,
Well, the solution was dead simple; do the following:
Add * as a wildcard, like this:
app.get('/ads/show/:id/*', require('./views/account/ads/index').read);
Now, when you create links to an ad, append anything where the * is and it will show the same ad without breaking the page.
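For illustration, here is a rough sketch of how a slug could be built from the ad title and appended when generating links; the slugify helper and the ad object shape ({ _id, title }) are my assumptions, not part of Drywall itself.
// Hypothetical helper: turn an ad title into a URL-friendly slug.
function slugify(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse runs of non-alphanumerics into dashes
    .replace(/^-+|-+$/g, '');    // trim leading/trailing dashes
}

// When rendering a link to an ad (assumed shape: { _id, title }):
function adUrl(ad) {
  return '/ads/show/' + ad._id + '/' + slugify(ad.title);
}

// e.g. adUrl(ad) -> "/ads/show/55c9ff9dcf68970612ba2d38/sun-nov-22-2015-8-pm-2007-camry-for-sale"
The wildcard route still looks the ad up by :id only, so the trailing slug is ignored by the handler and exists purely for readability and SEO.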
Cheers
I am facing a problem while developing an app: I need to display search engine results directly on my app's page without redirecting to www.google.com.
This is how it should work: in the search box I enter the name of a site, and I want the Google search results on my app's page so that I can easily extract the site's RSS feed URL and perform the operation I intended to do.
I only intend to get RSS feeds from a site just by typing its name.
Thank you!
Answer.
Almost working.
Thank you @Chandan, @Suzi.
Check under "2. A Better Approach".
I didn't try it out practically, and I'm not sure whether it's deprecated by now or not.
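In case that approach has since been deprecated, one commonly used alternative for showing search results inside your own app is Google's Custom Search JSON API. A minimal sketch in Node.js is below; YOUR_API_KEY and YOUR_SEARCH_ENGINE_ID are placeholders you would create yourself in the Google developer console, and the query string is just an example.
// Sketch: query the Custom Search JSON API and print result titles and links.
const https = require('https');

const query = encodeURIComponent('example-site rss feed');
const url = 'https://www.googleapis.com/customsearch/v1'
  + '?key=YOUR_API_KEY'            // placeholder API key
  + '&cx=YOUR_SEARCH_ENGINE_ID'    // placeholder custom search engine ID
  + '&q=' + query;

https.get(url, (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    const data = JSON.parse(body);
    // Each result item carries title, link and snippet fields.
    (data.items || []).forEach((item) => console.log(item.title, '->', item.link));
  });
});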
I want to submit my site to Google. How long does it take to crawl a new post on the website?
Also, is there a way to feed a post to the Google crawler as soon as it is created?
Google has three stages for entering a website into its results: discover, crawl, index.
In order to 'discover' your site, Google must be made aware of its existence, normally through backlinks. If your site is brand new, you can use the submit-URL form, but this isn't really a trusted method. You're better off signing up for a Google Webmaster Tools account and submitting your site there. An additional step is to submit an XML sitemap of your site (a quick sketch of pinging Google with your sitemap follows below). If you are publishing to your site in a blogging/posting way, you can also consider PubSubHubbub.
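As a small illustration of the sitemap step: search engines have historically exposed a "ping" endpoint you can hit whenever your sitemap changes. A minimal Node.js sketch, assuming your sitemap lives at https://example.com/sitemap.xml (a placeholder):
// Sketch: notify Google that the sitemap has been updated.
const http = require('http');

const sitemapUrl = encodeURIComponent('https://example.com/sitemap.xml'); // placeholder

http.get('http://www.google.com/ping?sitemap=' + sitemapUrl, (res) => {
  console.log('Sitemap ping responded with HTTP status', res.statusCode);
});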
From there on, crawl frequency is normally based on site popularity (as measured by ye olde PageRank). Depth of crawl (crawl-budget) is also determined by PR.
There are a couple of ways to help "feed" the Google crawler a URL.
The first way is to submit a URL here: www.google.com/webmasters/tools/submit-url/
The second way is to go to your Google Webmaster Tools, click "Fetch as Googlebot", and then input the URL you want to add:
http://i.stack.imgur.com/Q3Iva.png
The URL will then appear similar to this:
http://example.site    Web    Success    URL submitted to index    1/22/12 2:51 AM
As for how long it takes for a new post to appear on Google, many factors play into this.
If the site owners use Google Webmaster Tools, the following setting is available:
http://i.stack.imgur.com/RqvOi.png
For a fast crawl, you should submit your XML sitemap in Google Webmaster Tools and manually request crawling and indexing of your pages' URLs through the webmaster fetch feature.
I also used Google's crawl-and-index method, and this practice gave me the best results.
This is a great resource that really breaks down all the factors that affect a crawl budget and how to optimize your website to increase it. Cleaning up your broken links and removing outdated content, for example, can work wonders. https://prerender.io/crawl-budget-seo/
I acknowledged the error in my response by adding a comment to the original question a long time ago. Now I am updating this post in the interest of keeping future readers from being misguided as I was. Please see the notes from other users below; they are correct. Google does not make use of the revisit-after meta tag. I am keeping the original response text here so that anyone looking for a similar answer will find it along with this note confirming that this meta tag IS NOT VALID! Hope this helps someone.
You may use an HTML meta tag as follows:
<meta name="revisit-after" content="1 day">
Adjust the time period as necessary. There is no guarantee that robots will return within the given time frame, but this is how you tell robots how often a given page is likely to change.
The Revisit Meta Tag is used to tell search engines when to come back next.
I have a songs site, and whatever data it has is being displayed on another site as well.
Even if I echo "hello", the same thing shows up on the other site. Does anybody know how I can prevent that?
Digging a little deeper, I found out that the other site is using file_get_contents(). How can I prevent them from doing that?
Well, you can try to determine their IP address and block it.
You said file_get_contents was being used.
A URL can be used as a filename with this function if the fopen wrappers have been enabled. See fopen() for more details on how to specify the filename. See the Supported Protocols and Wrappers for links to information about what abilities the various wrappers have, notes on their usage, and information on any predefined variables they may provide.
To disable them, there is more information at http://www.php.net/manual/en/filesystem.configuration.php#ini.allow-url-fopen
Edit: If they switch to cURL or an equivalent after this, try to mess with their script by changing the HTML layout, etc. If that doesn't help, try to locate the IP of the script host and make it return nonsense ;)
Edit 2: If they use an iframe, use JavaScript to redirect on iframe detection.
Or you can even generate rubbish information just for that crawler, to mess up the "clone" site.
The first question to be answered is: Have you identified the crawler getting the information from your site?
If so, then you can serve that process anything you want: nothing (ignore/block it), a message telling the owners to stop taking your information, rubbish content, ...
Anyway, the first step is doing things properly. Be sure that your site has a "robots.txt" with the accepted policy for crawlers.
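For illustration, a minimal robots.txt could look like the sketch below. The bot name is a made-up placeholder; well-behaved crawlers honor robots.txt, but a scraper calling file_get_contents() usually ignores it, so this only covers the "doing things properly" part.
# Block a specific (hypothetical) crawler by its user agent
User-agent: BadScraperBot
Disallow: /

# Everyone else may crawl everything
User-agent: *
Disallow: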