How do I find the ip address of a Google search page using Python - python-3.x

I'm new to Python programming and trying to solve a coding project.
I am trying to write a piece of code that will access a subpage within a website. I'm able to access the main page of the site by using its IP address with .connect, and then using .sendall and .recv to get the main page's basic info.
Now I want to move on and capture a search page.
In this specific example: if you type keywords into the address bar (using Chrome at the moment), you get a page of search results. I'm trying to capture the raw data of that page and dump it into a file. I can get the main page's IP address for Google using .gethostbyname, but the URL for the search page is a string of words. I haven't a clue how to write code that will allow access to that page, or how to send the search words to trigger the same response from Google, so that I can capture that data as the reply to .sendall.
Is there a way for me to access this page, which was obviously created and sent back to my web browser, using Python? If I can't do it with simple .connect and .recv code, is there another/better way?
All recommendations appreciated. I've never posted code before, so excuse any etiquette errors:
import socket
import sys
try:
    mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
except socket.error:
    print("Failed to create socket.")
    sys.exit()
try:
    host = (socket.gethostbyname("www.google.com"), 80)
except socket.gaierror:
    print("Failed to get host")
    sys.exit()
print(host)
print(type(host))
mysock.connect(host)
message = b"GET / HTTP/1.1\r\n\r\n"
try:
    mysock.sendall(message)
except socket.error:
    print("Failed to send")
    sys.exit()
data = mysock.recv(5000)
mysock.close()

When you initially create a socket, your operating system reserves a "file" space (in quotes on purpose, not going to go into it now) on your machine and gives you back a file descriptor that refers to it. Once you connect, the operating system also assigns a local port on your system to that socket. This socket is where you send and receive data.
When you run the connect method to connect to one of Google's servers, the connection uses the protocol you chose when you created the socket (SOCK_STREAM means TCP), and the operating system does some initial communication with the server to create a flow. Over this flow you send your request, which is split into packets for you, and receive the response from the server in the same way.
To create the request, which is basically just a string sent to Google's servers that tells them what you want and, more importantly, how you want it, we need to do something extra first: set up an SSL (nowadays called TLS) connection. If you'll notice, the correct URL for Google is https://google.com and not http://google.com (although the latter redirects), because you want to negotiate a session key to encrypt your communication and hide it from others who might see it. Once you have done your connect magic and the SSL handshake, you send your request with the send method. With a higher-level Python library the request is normally created for you automatically; with a raw socket you write it yourself. You then receive your response, which consists of the response headers (names mapped to values, giving you some initial info on what you are getting) followed by the body, which is HTML code.
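As a rough sketch of the steps described above, the following uses only the standard socket and ssl modules. The function names build_request and fetch, the User-Agent value, and the 10-second timeout are made up for illustration, and Google may answer an automated request like this with a redirect, a consent page, or an error rather than search results.

```python
import socket
import ssl


def build_request(host, path):
    """Return a minimal HTTP/1.1 GET request as bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",                   # HTTP/1.1 requires a Host header
        "User-Agent: example-client/0.1",  # hypothetical value
        "Connection: close",               # ask the server to close when done
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host, path, port=443):
    """Open a TLS connection, send one GET request, return the raw reply."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as raw_sock:
        # wrap_socket performs the TLS handshake on top of the TCP connection
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            tls_sock.sendall(build_request(host, path))
            chunks = []
            while True:
                chunk = tls_sock.recv(4096)
                if not chunk:  # empty bytes: the server closed the connection
                    break
                chunks.append(chunk)
    return b"".join(chunks)
```

Calling fetch("www.google.com", "/search?q=new+apple+iphone") and writing the returned bytes to a file opened in "wb" mode would dump the headers and body together. The receive loop matters because a single recv call may return only part of the response.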
Let's delve into the request a bit more. When you submit a search to Google, the search is saved in the URL that you requested. As @user2357112 said, a search for new apple iphone becomes https://www.google.com/search?q=new+apple+iphone&.... In each pair, the part before the equals sign is the name of a GET parameter and the part after it is its value. For your purposes, you only care about the q= portion, which represents the search keywords you entered into the search bar, with the spaces replaced by plus signs. Everything else should remain the same, separated by ampersands (&).
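A small sketch of building that q= portion with the standard library, so you don't have to encode the keywords by hand. The helper name search_url is made up for illustration, and only the q parameter is included here.

```python
from urllib.parse import urlencode


def search_url(keywords):
    """Build a Google search URL for the given keywords."""
    # urlencode turns spaces into '+' and percent-encodes other special characters
    query = urlencode({"q": keywords})
    return "https://www.google.com/search?" + query


print(search_url("new apple iphone"))
# prints https://www.google.com/search?q=new+apple+iphone
```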
Once you have sent a request to that URL and gotten your HTML response, you have to parse it to get the search results. Please make a separate question for that if you have to, since each post should only have one question to answer.

Related

Socket, too many socket requests

Hi, I don't understand why my website makes a new socket request every time I send a message.
In the browser console's Network tab, a new call shows up each time I send a message.
You can test the chat yourself: with every message you send, a request appears in the Network tab.
Testing website
https://www.awesome-easley.37-187-54-25.plesk.page/
Can you tell me if there is a way to stop a new request from appearing in the list every time a message is sent? Do you have any suggestions?
On https://socket.io/demos/chat/, for example, you can send messages and no new requests appear in the console.
Why does a new entry appear in the list on my site for each call, while on the socket.io site no call is generated when a message is sent? Thank you.
I thought it might be because of the cache, but I don't know where to start. Does anyone have any suggestions?

DIFFERENT webpages Same URL

I hope this finds you in good shape.
I'm attempting to scrape data for my colleagues, and I've noticed that different web pages can share the same URL. This has given me problems because I can't get to the data I require. Is there a solution to this?
Colgate's website in question is depicted below. The corporate vice-president tab and the leadership tab share the same URL. Can someone tell me how to scrape their names and roles or tell me how to find their individual URLs?
https://www.colgatepalmolive.com/en-us/who-we-are/our-leadership-team
You’re going to need more complex logic than just screen-scraping. The page is driven by scripts, which means that these links don’t work the way you think they do.
If you imagine a web page as static HTML, then each link is a discrete URL that the web server receives, interprets, and answers with a page to display.
But most web pages aren’t static HTML anymore. When you click on the picture for Joe Smith, you are not sending a message to the web server to retrieve and send another static HTML page that contains Joe’s bio. Rather, your click is sending a message to the “Joe Smith object” and telling it “please display the bio portion of your object.” The message never says “open the Joe Smith bio”; it simply says “open your bio.” How does it know which one to open? The “display your bio” message only gets sent to whichever object the user clicked on. If Joe’s object gets the message, the request is for Joe’s bio. If Jane’s object receives the message, the request is for Jane’s bio.
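To make this concrete, here is a minimal sketch that assumes the content of both tabs is already present in the single document served at that URL. The markup in SAMPLE_PAGE, the class names, and the people in it are entirely hypothetical; the real page's structure would have to be inspected first, and if the bios are loaded by script after a click, this approach would not see them.

```python
from html.parser import HTMLParser

# Hypothetical markup: both "tabs" live in one document, so they share one URL.
SAMPLE_PAGE = """
<div class="tab" id="leadership">
  <div class="person"><span class="name">Joe Smith</span><span class="role">Chief Executive</span></div>
</div>
<div class="tab" id="vice-presidents">
  <div class="person"><span class="name">Jane Doe</span><span class="role">Vice President</span></div>
</div>
"""


class PeopleParser(HTMLParser):
    """Collect (name, role) pairs from <span class="name"> / <span class="role">."""

    def __init__(self):
        super().__init__()
        self.people = []
        self._field = None  # which span we are currently inside, if any
        self._name = None

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "role"):
            self._field = cls

    def handle_data(self, data):
        if self._field == "name":
            self._name = data.strip()
        elif self._field == "role":
            self.people.append((self._name, data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


parser = PeopleParser()
parser.feed(SAMPLE_PAGE)
print(parser.people)
# prints [('Joe Smith', 'Chief Executive'), ('Jane Doe', 'Vice President')]
```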

How to implement logic based on external redirects?

I'm building a website for a client (real estate), and on the website are links to a different website (adverts for properties). My client routinely activates and deactivates these adverts when he rents out a certain property.
The hrefs on my links look something like this:
<a href="https://domain.xx/estate/idxx/des-crip-tion-xx-xx-x-xx/">. If the advert is indeed active, it just takes them to the advert. If it is not active, however, the website in question redirects the user to https://domain.xx/estate-for-rent/city/, effectively sending the users to my client's competition.
I wish to implement some logic where, before handing the users over to the other website, the server checks to see if it is redirected to https://domain.xx/estate-for-rent/city/, or some similar logic, and if so, uses preventDefault, or something, and notifies the user that the advert is not available instead of sending them to the other website.
I wonder if I can use the fact that the resulting URL in the user's browser window (after they've been directed to the other website) matches the URL in my href only if the advert is active. Can I somehow get the server to try to access the URL in my href, have it see where it gets redirected, and then do something based on that? On the back end, I'm running NodeJS with Express by the way, and if it matters, I'm relying heavily on EJS for templating. Thanks in advance for any help!
This sounds more like a problem you could solve on the client as opposed to the server. For example, at a high level here's how I would do it:
Handle the click event for each link (really simple to do a catch-all with jQuery)
Fire off a HEAD request via AJAX to the destination URL (this would be much more efficient than a GET but depends on the external service supporting this verb)
Use the status code to determine what to do next (e.g. 2xx allow redirect, 3xx pop a message and block)
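The HEAD-then-check-status idea in the steps above can be sketched as follows. This is in Python rather than the asker's NodeJS, purely as an illustration of the logic, and the function names head_status and advert_is_active are made up. It sends the request without following redirects, so a 3xx status and its Location header stay visible.

```python
import http.client
from urllib.parse import urlsplit


def head_status(url, timeout=10):
    """Send a HEAD request without following redirects.

    Returns (status code, Location header or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("HEAD", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def advert_is_active(status):
    """2xx: the advert page itself was served. 3xx or anything else: treat as unavailable."""
    return 200 <= status < 300
```

Running a check like this on the server, as the question proposes, sidesteps the browser's cross-origin restrictions. It also avoids the fact that browsers usually follow redirects on script-initiated requests by default, so client-side code tends to see only the final response.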

How to find the XPINC port in Notes Client without opening an xpage

I've used the syntax detailed here to access an image from an attachment in another document, and this works. I've also taken the url that is generated and pasted it into a Notes client form as an img src="...". It works there too. However, I can't find a way of finding the port number that XPages in Notes Client (XPINC) uses.
e.g. my URL starts with http://127.0.0.1:64669/xsp/ but the 64669 changes from session to session, and user to user. In the first instance this port was found by examining the URL of the xpage that contained the link, but as I want to do this in a conventional Notes form, I want some way of finding that port number.
Alternatively, if anyone knows a simple way to embed a very small XPage in a Notes client form...?
Second alternative: something along the lines of "Microsoft.XMLHTTP" that will allow the "Notes://" protocol instead of "http://"?

Post Username & Password To Protected Folder/Site

I'm trying to post a username and password from an HTML form to a protected folder on a website. Is this possible? I thought I could just pass them in the URL using the syntax below, but I'm not having any success:
http://username:password@theurlofthesite.co.uk
I'm still getting the alert pop-up asking for the username and password. I need to be able to log the person in automatically.
Hope someone can help. Thanks
If you log in via an HTML form, then this won't work. This syntax is only for HTTP authentication, which is something completely different.
I don't think many (any?) browsers support being opened to post data. Which leaves you hoping that the site accepts GET-based logins (and they should be shot if they do).
The address part of the URL is parsed by your web server, so the code which handles the HTML form never sees it.
If you want to pass parameters to a form, you must use url?field=value&field2=value2. This only works with forms that use the GET method. For POST, you need a program to generate an encoded document and upload that.
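As an illustration of the difference, here is a minimal sketch using Python's urllib. The URL https://example.com/login and the field values are placeholders, and nothing is actually sent: the code only builds the two request objects.

```python
from urllib.parse import urlencode
from urllib.request import Request

fields = {"username": "alice", "password": "secret"}  # hypothetical values
encoded = urlencode(fields)

# GET: the fields travel in the URL itself.
get_request = Request("https://example.com/login?" + encoded)

# POST: the same encoded fields travel in the request body instead.
post_request = Request(
    "https://example.com/login",
    data=encoded.encode("ascii"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

Passing either object to urllib.request.urlopen would send it. Whether the site accepts it depends entirely on how its login form is implemented.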
In both cases, your user name and password are sent as plain text over the Internet, so the account will be hacked within a few hours. To put it more clearly: there is no way to "protect" the data in this folder this way. This is like adding a door with four locks to your house and keeping the keys on a nail in a post on the street next to the door.
I did exactly what I described in the question, and it works on all browsers except Safari on a Mac.
