Different response when fetching and crawling a URL (Scrapy) - python-3.x

I am trying to scrape a website. When I use the Scrapy shell with a Chrome user agent, I get the same response (HTML content) as viewed in the browser.
But when I crawl the same link with a Scrapy script using the default user_agent, I get a different response with content I don't need. And when I change the user_agent to the same one used in the shell, it shows a 404 error.
Please help me. I am really stuck on it.
I have tried many user agents. I also changed the concurrent requests setting, but nothing is working.
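For reference, this is roughly how a spider can pin the same User-Agent that was used in the shell via custom_settings; the spider name, start URL, and User-Agent string below are placeholders, not the asker's actual values:

import scrapy

class ExampleSpider(scrapy.Spider):
    # Placeholder spider name and start URL -- substitute the real site.
    name = 'example'
    start_urls = ['https://example.com']

    # Reuse the exact User-Agent string that worked in the Scrapy shell session.
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/120.0.0.0 Safari/537.36'),
    }

    def parse(self, response):
        # Yield something simple just to confirm the expected HTML is coming back.
        yield {'title': response.css('title::text').get()}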

Related

Google couldn't fetch my sitemap.xml file

I've got a small Flask site for my old WoW guild, and I have been unsuccessful in getting Google to read my sitemap.xml file. I was able to successfully verify my site using Google's Search Console, and it seems to crawl the site just fine, but when I go to submit my sitemap, it lists the status as "Couldn't fetch". When I click on that for more info, all it says is "Sitemap could not be read" (not helpful).
I originally used a sitemap generator website (forgot which one) to create the file and then added it to my route file like this:
from flask import request, send_from_directory

@main.route('/sitemap.xml')
def static_from_root():
    # Serve the generated sitemap.xml from the app's static folder
    return send_from_directory(app.static_folder, request.path[1:])
If I navigated to www.mysite.us/sitemap.xml it would display the expected results, but Google was unable to fetch it.
I then changed things around and started using flask-sitemap to generate it like this:
@ext.register_generator
def index():
    # Tell flask-sitemap to include the main.index endpoint
    yield 'main.index', {}
This also works fine when I navigate directly to the file, but Google again does not like it.
I'm at a loss. There doesn't seem to be any way to get help from Google on this, and so far my interweb searches aren't turning up anything helpful.
For reference, here is the current sitemap link: www.renewedhope.us/sitemap.xml
I finally got it figured out. This seems to go against what Google advises, but I submitted the sitemap as http://renewedhope.us/sitemap.xml and that finally worked.
From their documentation:
Use consistent, fully-qualified URLs. Google will crawl your URLs exactly as listed. For instance, if your site is at https://www.example.com/, don't specify a URL as https://example.com/ (missing www) or ./mypage.html (a relative URL).
I think that only applies to the sitemap document itself.
When submitting the sitemap to Google, I tried...
http://www.renewedhope.us/sitemap.xml
https://www.renewedhope.us/sitemap.xml
https://renewedhope.us/sitemap.xml
The only format that they were able to fetch the sitemap from was:
http://renewedhope.us/sitemap.xml
Hope this information might help someone else facing the same issue :)
Put this directive in your robots.txt file: Sitemap: https://domainname.com/sitemap.xml (the sitemap URL must be fully qualified). Hope this will be helpful.

Increase website traffic by visiting a website with a Python script and Tor proxy

I want to make a script that opens a URL with Python, stays on the website for a few seconds, and then does this over and over to increase website traffic.
Using Tor and the requests library in Python, I wrote this script, and I configured Tor to change the IP every 5 seconds:
import requests
import time

url = 'https://google.com'
while True:
    # Route the request through the local Tor SOCKS proxy (Tor Browser listens on 9150).
    # The 'https' entry is needed because the target URL uses https; SOCKS support
    # requires the PySocks extra (pip install requests[socks]).
    proxy = {'http': 'socks5://127.0.0.1:9150',
             'https': 'socks5://127.0.0.1:9150'}
    print(requests.get(url, proxies=proxy).text)
    time.sleep(5)
But when I checked my Google Analytics and Alexa accounts, I noticed that the traffic made by this script has no effect.
I wonder how I can generate traffic for a website that actually registers, without tools like Google Analytics detecting that the traffic is fake.
It won't help at all. See how Alexa traffic rankings are determined. Metrics are collected from a panel of users with certain browser extensions installed that report their browsing habits, or by Alexa Javascript code you install on your site.
Given those metrics for collection, visiting your site with Tor and Python code won't have any impact on your ranking.
This is because you are only sending a web request.
If I ran that code and removed the .text, it would give me a 200 response code.
You are just checking whether the website is online via a proxy. Google Analytics is recorded by JavaScript running in a real browser, and requests.get never executes that JavaScript, so the visit is never counted.
So instead of doing it with a plain Python request, you can use Selenium. Selenium drives a real browser and can execute mouse events programmatically; you can also use time delays to put some variation between clicks so it won't look like a bot (see the sketch below). Here you can check how to do it in detail: https://ervidhayak.com/blog/detail/increase-traffic-on-website-with-python-selenium
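A minimal sketch of that Selenium approach, assuming Firefox with geckodriver installed; the URL, visit count, and delay range are placeholders:

import random
import time

from selenium import webdriver

url = 'https://example.com'  # placeholder target site

driver = webdriver.Firefox()  # assumes geckodriver is on PATH
try:
    for _ in range(10):  # arbitrary number of visits
        driver.get(url)  # a real browser load, so the page's analytics JavaScript runs
        time.sleep(random.uniform(5, 15))  # vary the dwell time between visits
finally:
    driver.quit()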

Associate requests with active page

I'm working on a browser extension (compatible with Chrome, FF, Opera, and Edge) and I'm trying to figure out how to associate requests to domains outside of the current page. For example, when you go to google.com a lot of requests to domains other than google.com occur such as to gstatic.com.
An extension like NoScript shows all of the requested domains that a page made and lets you allow or deny. I'm trying to get a similar functionality.
Is this something that can be done in the content script or is there some way to keep state information in the background script that I can then display in the popup? Obviously it's possible but I'm just not seeing which callback I can use.

Tracking down X-Frame-Options header

We've partnered with a company whose website will display our content in an IFRAME. I understand what the header is, what it does, and why; what I need help with is tracking down where it's coming from!
Windows Server 2003/IIS6
Container page: https://testDomain.com/test.asp
IFRAME Content: https://ourDomain.com/index.asp?lots_of_parameters,_wheeeee
Testing in Firefox 24 with Firebug installed. (IE and Chrome do the same thing.) Also running Fiddler so I can watch network traffic while I'm at it.
For simplicity's sake, I created a page with nothing on it but the IFRAME in question - same physical server, different domain/site - and it failed with
Load denied by X-Frame-Options: https://www.google.com/ does not permit cross-origin framing.
(That's in the Firebug console.) I'm confused because:
Google is not referenced anywhere in the containing app, or in the IFRAMEd app. All javascript libraries are kept locally; there is no analytics in the app. No Google, nowhere.
The containing page has NOTHING on it, except the IFRAME. No html tags, no head tag, no body tag. IFRAME. That's it.
The X-FRAME-OPTIONS header does not exist in IIS on the server: not at the "Websites" node, not in the individual sites.
So where the h-e-double-sticks is that coming from? What am I missing?
Interesting point: if I remove the "S" from https in the IFRAME URL, it works. Given the nature of the data, SSL is required.
You might check global.asax.cs; the app could be adding the header to every response automatically. If you search the app for "x-frame-options", you might also find something.

How to control Firefox to open a URL by sending a POST request?

Here is my question. I have a site whose IP is 192.168.80.180. I want to use Firefox to open it and at the same time send it a POST request to log in, so I don't need to input the username and password in the web page; when Firefox starts up, it will log in to the website directly. My platform is Linux. Please give me some ideas.
Update:
Concretely, how do I communicate with Firefox and get it to load a certain URL, which is a dynamic page?
Check this:
http://linux.about.com/od/commands/l/blcmdl1_curl.htm
curl allows you to make requests to any website.
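If you'd rather stay in Python than shell out to curl, here is a rough sketch using the requests library; the login path and form field names are assumptions and must match what the site actually expects:

import requests

# Hypothetical login endpoint and form fields -- adjust to the real site.
login_url = 'http://192.168.80.180/login'
payload = {'username': 'your_user', 'password': 'your_password'}

with requests.Session() as session:
    # Send the POST and keep any session cookies for follow-up requests.
    response = session.post(login_url, data=payload)
    print(response.status_code)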

Resources