How to stop a site from scraping my site - .htaccess

I have a songs site, and whatever data it has is being displayed on another site.
Even if I echo "hello", the same output appears on the other site. Does anybody know how I can prevent that?
Digging a bit deeper, I found out that the other site is using file_get_contents(). How can I prevent them from doing that?

Well, you can try to determine their IP address and block it.
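For example, a minimal .htaccess sketch of an IP block (the address is a placeholder for whatever you identify in your access logs):

Order Allow,Deny
Allow from all
Deny from 203.0.113.42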

You said file_get_contents was being used.
A URL can be used as a filename with this function if the fopen wrappers have been enabled. See fopen() for more details on how to specify the filename. See the Supported Protocols and Wrappers for links to information about what abilities the various wrappers have, notes on their usage, and information on any predefined variables they may provide.
To disable the URL wrappers, see http://www.php.net/manual/en/filesystem.configuration.php#ini.allow-url-fopen
Edit: If they switch to cURL or an equivalent after this, try to mess with their script by changing the HTML layout, etc. If that doesn't help, try to locate the IP of the script host and make it return nonsense ;)
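A rough PHP sketch of that idea, assuming you have already identified the scraper's IP (the address and the nonsense payload are placeholders):

<?php
// Hypothetical list of scraper IPs identified from your access logs
$scraperIps = array('203.0.113.42');

if (in_array($_SERVER['REMOTE_ADDR'], $scraperIps, true)) {
    // Serve nonsense to the scraper instead of the real page
    echo 'lorem ipsum nonsense content';
    exit;
}
?>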
Edit 2: If they use an iframe, use JavaScript to redirect on iframe detection, as in the sketch below.
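A minimal frame-busting sketch along those lines:

<script>
// If this page is loaded inside an iframe, redirect the top window to us
if (window.top !== window.self) {
    window.top.location = window.self.location;
}
</script>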

Or you can even generate rubbish information just for that crawler, just to mess up the "clone" site.
The first question to be answered is: have you identified the crawler getting the information from your site?
If so, then you can serve anything you want to that process: nothing (ignore / block it), a message telling the owners to stop taking your information, rubbish contents, ...
Anyway, the first step is doing things properly. Be sure that your site has a "robots.txt" stating your accepted policy for crawlers.
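A minimal robots.txt sketch (the bot name is a placeholder; well-behaved crawlers honor this, but a scraper can simply ignore it):

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: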

Related

No AdSense Impressions on a site newly converted to Drupal 8

I am scratching my head trying to figure out (and yes, I know there are multiple reasons this could be happening) why my AdSense Impressions have dropped to 0 after changing my site to Drupal 8.6.4.
I have installed the Drupal AdSense module, into which I've put my "pub-XYZ~~" account number.
I left it like that for several days thinking perhaps the crawler hadn't found it. Then I got cold feet and thought perhaps it wasn't working, especially since I didn't see any AdSense code appearing in the source of the page.
So I added the following code via Asset Injector into the head of the page:
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<script>
  (adsbygoogle = window.adsbygoogle || []).push({
    google_ad_client: "ca-pub-239656292892567776",
    enable_page_level_ads: true
  });
</script>
(That's not my real client ID, just random numbers.)
Now I see a line of script in the head of the page:
<script src="/sites/default/files/js/js_Gc2nyd2PQaQJQwlbfhfc8Yz8TwWRl90UGM3vTenwS8s.js"></script>
And that (if I click on it) opens up the Google AdSense code I've written above.
Yet I've waited two or three days more, still not seeing any impressions, page visits, CTR (every metric on my "Performance" report is zero), and I am concerned that maybe I've done something wrong.
So does anyone know, if I'm using the Drupal AdSense module, where do I see the code?
And two, if I'm using the module, where can I see the code appearing in the source? (The Google answer doc says "You can do this by viewing the source of your site from a browser and double-checking that the ad code looks exactly like the code we provide you in your account, and includes every line of the ad code." But in the Drupal AdSense module, the only field is one for that pub-XYZ~~~ number, and nothing else, and as I mentioned, I'm not finding the code anywhere in the site when I view the source.)
Three, if I'm using the module, will it mess things up to have the code above put in via the Asset Injector?
And lastly, am I just too worried and the AdSense module is doing what it should and I should check back in 10 days or 20, rather than in 5 or 7?
Thank you for any help. I had just installed AdSense on the old site (by adding this exact code to the head of the page) before switching to Drupal, and it was definitely working then, so I know the issue isn't that the site isn't approved or the account's invalid or such. It WAS working fine. But after this move to Drupal 8, it's completely failed, and I just don't know which link of the chain I should fix. I have been scouring both the Drupal docs and the AdSense docs for answers and haven't found anything that addresses this... and I really am hoping to find out whether the code side of it is correct.
Again, thank you in advance!
Okay, so for anyone else who needs this info, I'm answering my own question: I never did get the Google AdSense "auto ads" to work on my site, and am pretty sure the reason they didn't is that I was trying the "auto ads" code rather than the on-the-page, placed ads type code. I still don't know if it was simply a matter of time and the crawler hadn't found my site again, or if I had incorrect code, or what.
But I am now seeing an ad on my site, and what worked for me was:
1. Turn off any AdSense code in the head of the page. (I had injected the script via Asset Injector, and I disabled that.)
2. Make sure Drupal's AdSense module is running. Deselect the option that asks people to turn off their ad blocker. The only thing I added in the module's main config window was my "pub-XYZ~" number.
3. Ditch the Google "auto ads" option and use the "Ad units" option, creating an ad in AdSense (AdSense > Ads > Ad units). Do everything there and get your ad ID#.
4. Back in Drupal: either create a new custom block or use one of the Drupal AdSense options to create a block on your site. If you use a Drupal AdSense option, it prompts you for the info needed to display the right ad; you'll need that ad ID# at the very least.
5. Make sure that block is placed on your page. I chose "Responsive", but presumably this works for all the options (fixed size, etc.). I believe you could also simply place the Google ad-unit code directly into a custom block; it seems people do (see the sketch at the end of this answer).
If you've done it right, then when you're logged into your Drupal site with the block placed, it will show placeholder text with your pub-# and ad ID# in a little box. You won't see an actual ad (this is explained in the "Help and Information" option at the top of the AdSense module config). If you're seeing the placeholder box, it's a good sign that everything's going well on the Drupal AdSense module side.
Then wait, and wait, and eventually, logged out, in a private browser window, you should see the ad once the crawler finds it and other magic happens. I waited about 24 hours after setting this whole thing up before seeing an ad appear.
(Please note that this all was with a site that had a working AdSense account and had previously been getting lots of impressions for the ads. So if you don't have those aspects set up initially, none of the above will work either.)
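For reference, a typical AdSense ad-unit snippet of the kind you would paste into a custom block looks roughly like this (the client and slot IDs are placeholders):

<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<ins class="adsbygoogle"
     style="display:block"
     data-ad-client="ca-pub-XXXXXXXXXXXXXXXX"
     data-ad-slot="1234567890"
     data-ad-format="auto"></ins>
<script>
  (adsbygoogle = window.adsbygoogle || []).push({});
</script>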

Kimono Desktop's payload url and index fields missing

With Kimono Web, the crawled payload always had url and index fields in every source URL JSON. But with the desktop app, these fields are missing, and my product totally depends on them.
I'm browsing the source code of Kimono Desktop but I couldn't manage to find that part.
The index field is explained here: https://help.kimonolabs.com/hc/en-us/articles/203349674-Add-a-unique-index-to-each-result-object-
Can anyone help me with it?
Thanks
I've had the same issue. I found this workaround for the missing url field with the desktop application http://mudd.com/blog/how-to-extract-vdp-data-from-your-website/
Also, in case you used the crawl scheduling feature with the Kimono web app, I found that if I edit my APIs and save them again it lets me choose a crawl frequency. I just discovered this so I'm crossing my fingers and waiting to see if it's really going to work.
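Until there's a proper fix, another option is to re-add the fields yourself after the crawl. This is only a sketch, assuming the desktop payload keeps Kimono's usual results/collection structure (the function name and the sourceUrl argument are illustrative):

// Hypothetical post-processing step: re-attach url and index to each result object
function addUrlAndIndex(payload, sourceUrl) {
  var index = 1;
  Object.keys(payload.results).forEach(function (collection) {
    payload.results[collection].forEach(function (row) {
      row.url = sourceUrl; // the page the row was scraped from
      row.index = index++; // unique index across all result objects
    });
  });
  return payload;
}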

tabs permission or content script?

I'm writing an extension that needs to show a page action on amazon.com pages.
Would it be better to request the "tabs" permission or to inject a content script into amazon.com pages?
The tabs permission strikes me as using fewer resources (because it just checks the URL against a regex in the background script), but I think it has a scarier permission message ("access your tabs and browsing activity")?
Injecting a content script into amazon.com pages seems like it would take more resources, but it would only need permission for amazon.com...
This is a generic question, and the answer depends on your users. You have pointed out the pros and cons of each.
I suggest you go with content scripts if your clients are particular about security and privacy; with this approach you add extra load to pages (content scripts and message passing), which may slow down normal page execution.
I suggest you go with the tabs permission if you are all about performance. It is a native API and executes in the background page, so there is no extra load on tabs. Many extensions on the Web Store use the tabs API; I don't think it would scare users, as it is nothing new.
However, it all comes down to your target users.
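For illustration, a minimal sketch of the tabs-permission approach (manifest v2, which was current when this was asked; the match pattern and titles are assumptions):

// manifest.json (relevant parts):
//   "permissions": ["tabs"],
//   "background": { "scripts": ["background.js"] },
//   "page_action": { "default_title": "Amazon helper" }

// background.js
chrome.tabs.onUpdated.addListener(function (tabId, changeInfo, tab) {
  if (tab.url && /^https?:\/\/(www\.)?amazon\.com\//.test(tab.url)) {
    chrome.pageAction.show(tabId); // show the page action on amazon.com pages
  }
});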

How fast does Google take to crawl new page, and can we influence Google's crawler?

I want to submit my site to Google. How much time does it take to crawl a new post on the website?
Also, is there a way to feed this post to Google crawler as soon as a post is created?
Google has three modes of entering a website into its results - discover, crawl, index.
In order to 'discover' your site, it must be made aware of its existence - normally through back-links. If your site is brand new, you can use the submit URL form - but this isn't really a trusted method. You're better off signing up for a Google Webmaster Tools account and submitting your site. An additional step is to submit an XML sitemap of your site (see the sketch at the end of this answer). If you are publishing to your site in a blogging/posting way, you can always consider PubSubHubbub.
From there on, crawl frequency is normally based on site popularity (as measured by ye olde PageRank). Depth of crawl (crawl-budget) is also determined by PR.
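A minimal XML sitemap sketch (the URL and dates are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/my-new-post</loc>
    <lastmod>2012-01-22</lastmod>
    <changefreq>daily</changefreq>
  </url>
</urlset>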
There are a couple ways to help "feed" the Google Crawler a URL.
The first way is to go here and submit a URL ---> www.google.com/webmasters/tools/submit-url/
The second way is to go to your Google Webmasters Tools and clicking "Fetch as GoogleBot"
And then inputting the URL you want to add:
(screenshot: http://i.stack.imgur.com/Q3Iva.png)
The URL will then appear similar to this:
http://example.site    Web    Success    URL submitted to index    1/22/12 2:51 AM
As for how long it takes for a question on here to appear on google, there are many factors that are put in to this.
If the owners of the site use Google Webmasters Tools, the following setting is available:
(screenshot: http://i.stack.imgur.com/RqvOi.png)
For a fast crawl you should submit your XML sitemap in Google Webmaster Tools and manually request crawling and indexing of your pages' URLs through the Fetch as Google feature.
I used this crawl-and-index method myself, and this practice gave me the best results.
This is a great resource that really breaks down all the factors that affect a crawl budget and how to optimize your website to increase it. Cleaning up your broken links and removing outdated content, for example, can work wonders. https://prerender.io/crawl-budget-seo/ 
I acknowledged the error in my response by adding a comment to the original question a long time ago. Now I am updating this post in the interest of keeping future readers from being misguided, as I was. Please see the notes from other users below - they are correct: Google does not make use of the revisit-after meta tag. I am keeping the original response text here so that anyone looking for a similar answer will find it along with this note confirming that this meta tag IS NOT VALID! Hope this helps someone.
You may use HTML meta tag as follows:
<meta name="revisit-after" content="1 day">
Adjust time period as necessary. There is no guarantee that robots will return in given time frame but this is how you are telling robots about how often a given page is likely to change.
The Revisit Meta Tag is used to tell search engines when to come back next.

How to logout of an HTTP authentication (htaccess) that works in Google Chrome?

I got a solution for Firefox and IE but I didn't find any solution for Google Chrome.
Is there a way to do it in Google Chrome?
I know it's a really old post... I mean like friggin 5 years now, but I just found a somewhat good solution.
Inside your protected folder, create another folder, let's call it "logout". Place the same .htaccess file in here as you have in your protected folder, except with a small modification.
instead of:
Require valid-user
now write:
Require user EXIT
And make sure you don't have a user named EXIT! :D
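So the logout folder's .htaccess might look roughly like this (the realm name and file path are placeholders; keep them identical to the ones in your protected folder's .htaccess):

AuthType Basic
AuthName "Protected Area"
AuthUserFile /path/to/.htpasswd
Require user EXIT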
In your protected area, your logout link or button or whatever, should redirect the user to this address: example.com/protectedFolder/logout
Browsers are usually able to keep only one user logged in per site name or realm name... the sign-in attempt for the user EXIT will overwrite everything, so the originally logged-in user would have to log in again to reach the protected area.
But as always, I might be wrong, and you should still close all your browser windows and restart the computer if you want to be sure! :)
Also, it wouldn't hurt to tell your users what is going to happen when they hit logout!
I have tested this in Chrome and in Internet Explorer 11. (It will not work in Edge, and maybe not in others either.)
The solution was found here:
https://www.mavensecurity.com/media/BasicAuthLogOut.pdf
You can't log out of an HTTP authenticated session other than by closing the browser window. Also see the accepted answer on this question for an extensive explanation.
Try redirecting to:
wrong_user:wrong_password@yourdomain.com
I have put together the following article, which explains how I managed to achieve this in Chrome. I hope this helps. https://www.hattonwebsolutions.co.uk/articles/how_to_logout_of_http_sessions
In short: you create a sub folder (as per Gyula's answer) but then send an ajax request to that page (which fails) and trigger a timeout redirect to the logged-out page. This avoids having a secondary popup in the logout folder requesting another username (which would confuse users). My article uses jQuery, but it should be possible to avoid that.
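A rough jQuery sketch of that flow (the URLs and delay are illustrative, not taken verbatim from the article):

// Hit the logout folder with the dummy EXIT credentials; the request fails,
// but the browser's cached credentials for the realm get overwritten.
$.ajax({
  url: '/protectedFolder/logout/',
  username: 'EXIT',
  password: 'wrong',
  complete: function () {
    // Then redirect to a hypothetical logged-out landing page
    setTimeout(function () {
      window.location = '/logged-out.html';
    }, 500);
  }
});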
