Python 3 - Extract IP address and port number from dynamic webpages - python-3.x

I would like to extract the IP addresses and port numbers from this link: http://spys.one/free-proxy-list/FR/. Here is my Python code:
import urllib.request
import re
url = 'http://spys.one/free-proxy-list/FR/'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read().decode('utf-8')
ip = re.findall(r'[0-9]+(?:\.[0-9]+){3}', html)
# ip = re.findall(r'[0-9]+(?:\.[0-9]+){3}:[0-9]+[0-9]', html)  # This is also not working
print(ip)
Output -
['37.59.0.139', '212.47.239.185', '85.248.227.165', '167.114.250.199', '51.15.86.160', '212.83.164.85', '82.224.48.173']
I get only the IP addresses but not the port numbers.
I'm expecting something like this - '37.59.0.139:17658'

Your code does not work because -- aside from several issues with your regex that have been pointed out in other answers -- the website you provided displays the port number of each IP by executing some JavaScript in the underlying HTML code.
In order to capture each IP and its associated port number, you first need to execute the JavaScript so that the port numbers are properly printed in the HTML response (you can follow the guidelines here: Web-scraping JavaScript page with Python). Then you need to extract this information from the JavaScript-computed HTML response.
By inspecting the HTML response, I found out that each port number is preceded by :</font> and followed by <.
A working code snippet can be found below. I took the liberty of slightly modifying your IP regex, as only certain IP addresses are associated with a port number (the others belong to the hostname column and should be discarded); namely, the IPs of interest are those followed by the <script string.
import dryscrape
import re
url = 'http://spys.one/free-proxy-list/FR/'
# get the HTML after executing the JavaScript
session = dryscrape.Session()
session.visit(url)
response = session.body()

# capture the IPs (only those followed by <script are of interest)
IP = re.findall(r'[0-9]+(?:\.[0-9]+){3}(?=<script)', response)

# capture the ports
port = re.findall(r'(?<=:</font>)(.*?)(?=<)', response)

# join each IP with its port
IP_with_ports = []
for i in range(len(IP)):
    IP_with_ports.append(IP[i] + ":" + port[i])
print(IP_with_ports)
OUTPUT: ['178.32.213.128:80', '151.80.207.148:80', '134.119.223.242:80', '37.59.0.139:17459', ..., '37.59.0.139:17658']
Do note that the code above only works for the website you provided, as each website has its own logic for displaying data.

First, a note on one part of your regex: (?: opens a non-capturing group, i.e. the group is matched but not captured. That matters for findall, which returns the captured groups instead of the whole match whenever capturing groups are present.
Second, your regex only looks for four groups of numbers separated by periods. You need up to five groups: 0.0.0.0:0000 = five groups. Try this instead:
re.findall(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}(?::[0-9]{2,5})?', html)
[0-9]{1,3} = between one and three digits
\. = a literal period (escaped, because . means "any character")
{3} = the preceding group repeated exactly three times
(?::[0-9]{2,5}) = a colon followed by a numeric sequence between two and five digits long (ports go up to 65535). This is your port.
? = the port is optional; it will either be there or it won't. Both groups are non-capturing (?:), so findall returns the whole match rather than a tuple of group captures.
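To sanity-check the pattern, a quick run against a made-up sample string (illustrative only, not taken from the site):
import re

pattern = r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}(?::[0-9]{2,5})?'
sample = 'host1 37.59.0.139:17658 host2 212.47.239.185'
print(re.findall(pattern, sample))
# ['37.59.0.139:17658', '212.47.239.185']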

Related

How to loop through each row of Pandas DataFrame

I'm using an API to find information on smart contracts. Each contract has a unique address, and its information can be pulled by plugging the address into an API link. Example:
'https://contract-api.dev/contracts/xxxxxxxxxxxxxxxxxxxxxxxxx'
I have a CSV with 1000 contract addresses. Example:
   Address
0  0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2
1  0xa5409ec958c83c3f309868babaca7c86dcb077c1
2  0xdac17f958d2ee523a2206206994597c13d831ec7
3  0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48
4  0xa2327a938febf5fec13bacfb16ae10ecbc4cbdcf
...
And this code allows me to get exactly what I want for one row,
ADDRESS = df['Address'][0]
total = requests.get(f"https://contract-api.dev/contracts/{ADDRESS}", headers={'CF-Access-Client-Id': 'xxxxxx.access', 'CF-Access-Client-Secret': 'xxxxx'}).json()
total['deployment']['created_by_user']
Where the output is:
'0x4f26ffbe5f04ed43630fdc30a87638d53d0b0876'
I just need to find a way to loop through every row, insert the contract address into the API link, retrieve the "created_by_user" address, then move to the next row.
What do you think the best way to do this is?
import requests

def retrieve_created_by_user(address_):
    response = requests.get(
        f"https://contract-api.dev/contracts/{address_}",
        headers={'CF-Access-Client-Id': 'xxxxxx.access', 'CF-Access-Client-Secret': 'xxxxx'},
    )
    return response.json()['deployment']['created_by_user']

for address in df['Address']:
    created_by_user = retrieve_created_by_user(address)
    ...
Or, since it can probably take a while, you can use a progress bar like tqdm:
from tqdm import tqdm

for address in tqdm(df['Address']):
    created_by_user = retrieve_created_by_user(address)
    ...
There, do whatever you plan to do with that row's data.
If for some reason you want to be able to index the values (e.g. to restart from a manual index later) you can use list(df['Address']).
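If you want to keep each result together with its address, one way (a sketch; the column name and output file name are placeholders, not from the question) is to collect the values into a list and attach them as a new column:
from tqdm import tqdm

# df is the DataFrame with the 'Address' column from the question
created_by_users = []
for address in tqdm(df['Address']):
    created_by_users.append(retrieve_created_by_user(address))

df['created_by_user'] = created_by_users
df.to_csv('contracts_with_creators.csv', index=False)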

Collapsing IP networks with Python ipaddress module

I'm having difficulty using the ipaddress.collapse_addresses() method.
# n is a list of 192.168.0.0/24 networks (1,2,3,4....etc)
def sumnet():
    n = nlist()
    for net in n:
        snet = ipaddress.collapse_addresses(net)
        return snet
I'm only getting back the original list:
Collapsed Networks
[IPv4Network('192.168.0.0/24'), IPv4Network('192.168.1.0/24'),
IPv4Network('192.168.2.0/24'), IPv4Network('192.168.3.0/24'),
IPv4Network('192.168.4.0/24'), IPv4Network('192.168.5.0/24'),
IPv4Network('192.168.6.0/24'), IPv4Network('192.168.7.0/24'),
IPv4Network('192.168.8.0/24')]
Assuming your input is a list of IPv4Networks from ipaddress like...
netlist = [ipaddress.IPv4Network('192.168.0.0/24'),
ipaddress.IPv4Network('192.168.1.0/24'),
ipaddress.IPv4Network('192.168.2.0/24'),
ipaddress.IPv4Network('192.168.3.0/24'),
ipaddress.IPv4Network('192.168.4.0/24'),
ipaddress.IPv4Network('192.168.5.0/24'),
ipaddress.IPv4Network('192.168.6.0/24'),
ipaddress.IPv4Network('192.168.7.0/24'),
ipaddress.IPv4Network('192.168.8.0/24')]
and your desired output is
[IPv4Network('192.168.0.0/21'), IPv4Network('192.168.8.0/24')]
All this can be done with...
import ipaddress
def sumnet(netlist):
return list(ipaddress.collapse_addresses(netlist))
netlist = [ipaddress.IPv4Network('192.168.0.0/24'),
ipaddress.IPv4Network('192.168.1.0/24'),
ipaddress.IPv4Network('192.168.2.0/24'),
ipaddress.IPv4Network('192.168.3.0/24'),
ipaddress.IPv4Network('192.168.4.0/24'),
ipaddress.IPv4Network('192.168.5.0/24'),
ipaddress.IPv4Network('192.168.6.0/24'),
ipaddress.IPv4Network('192.168.7.0/24'),
ipaddress.IPv4Network('192.168.8.0/24')]
print(sumnet(netlist))
The collapse_addresses() function actually takes an entire iterable of networks; you don't have to feed it networks one by one. It returns an iterator over the collapsed networks, but you can just convert that to a list to deal with it more easily.
Let me know if this is not what you were trying to accomplish.
It is a little hard to understand exactly what your code is supposed to do. The following snippet starts a for loop, grabs the first network, collapses it into a generator, and returns that generator containing just that single network, without looking at any of the other networks. This, however, doesn't seem to be consistent with the output your question claims.
for net in n:
    snet = ipaddress.collapse_addresses(net)
    return snet
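If the intent was to collapse the whole list, the loop isn't needed at all; a corrected sketch (assuming nlist() returns the list of networks, as in the question):
import ipaddress

def sumnet():
    # pass the entire list in one call; collapse_addresses does the merging
    return list(ipaddress.collapse_addresses(nlist()))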

Why does re.findall behave in a weird way compared with re.search

Scenario 1: Works as expected
>>> output = 'addr:10.0.2.15'
>>> regnew = re.search(r'addr:(([0-9]+\.){3}[0-9]+)',output)
>>> print(regnew)
<re.Match object; span=(0, 14), match='addr:10.0.2.15'>
>>> print(regnew.group(1))
10.0.2.15
Scenario 2: Works as expected
>>> regnew = re.findall(r'addr:([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)',output)
>>> print(regnew)
['10.0.2.15']
Scenario 3: Does not work as expected. Why is the output not ['10.0.2.15']?
>>> regnew = re.findall(r'addr:([0-9]+\.){3}[0-9]+',output)
>>> print(regnew)
['2.']
Your regex is not correct for what you want:
import re
output = 'addr:10.0.2.15'
regnew = re.findall(r'addr:((?:[0-9]+\.){3}[0-9]+)', output)
print(regnew)
Notice what changed: I wrapped the full IP address in parentheses and added ?: to the first part. ?: makes it a non-capturing group. As stated in the docs, findall() gives a list of the captured groups, which is why you want (?:[0-9]+\.) to be non-capturing and the whole IP address to be in a capturing group.
The difference here between findall and everything else is that findall returns capture groups by default (if any are present) instead of the entire matched expression.
A quick fix is to change your repeated group to a non-capturing group, so findall returns the full match rather than the last result of your capture group.
addr:(?:[0-9]+\.){3}[0-9]+
That will of course include addr: in your match. To get just the IP address, wrap both the pattern and quantifier in a capture group.
addr:((?:[0-9]+\.){3}[0-9]+)
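A quick check of both variants against the sample string from the question:
import re

output = 'addr:10.0.2.15'

# non-capturing group only: findall returns the full match, prefix included
print(re.findall(r'addr:(?:[0-9]+\.){3}[0-9]+', output))    # ['addr:10.0.2.15']

# outer capture group around the IP: findall returns just the address
print(re.findall(r'addr:((?:[0-9]+\.){3}[0-9]+)', output))  # ['10.0.2.15']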

Navigating the html tree with BeautifulSoup and/or Selenium

I've just started using BeautifulSoup and came across an obstacle at the very beginning. I looked up similar posts but didn't find a solution to my specific problem, or perhaps there is something fundamental I'm not understanding. My goal is to extract the Japanese words with their English translations and examples from this page:
https://iknow.jp/courses/566921
and save them in a DataFrame or a CSV file.
I am able to see the parsed output and the content of some tags, but whenever I try requesting something with a class I'm interested in, I get no results. First I’d like to get a list of the Japanese words, and I thought I should be able to do it with:
import urllib.request
from bs4 import BeautifulSoup

url = ["https://iknow.jp/courses/566921"]
data = []
for pg in url:
    r = urllib.request.urlopen(pg)
    soup = BeautifulSoup(r, "html.parser")
    soup.find_all("a", {"class": "cue"})
But I get nothing; the same happens when I search for the response field:
responseList = soup.findAll('p', attrs={"class": "response"})
for word in responseList:
    print(word)
I tried moving down the tree by finding children but couldn’t get to the text I want. I will be grateful for your help. Here are the fields I'm trying to extract:
After great help from jxpython, I've now stumbled upon a new challenge (perhaps this should be a new thread, but it's quite related, so maybe it's OK here). My goal is to create a DataFrame or a CSV file, each row containing a Japanese word, translation and examples with transliterations. With the lists created using:
driver.find_elements_by_class_name()
driver.find_elements_by_xpath()
I get lists with different numbers of elements, so it's not possible to easily create a DataFrame.
# len(cues)             100
# len(responses)        100
# len(transliterations) 279  (strange number, because some words don't have transliterations)
# len(texts)            200
# len(translations)     200
The transliterations list contains a mix of transliterations for single words and for sentences. I think that to populate the first row of my DataFrame I would need to loop through the
<li class="item">
content (XPath /html/body/div[2]/div/div/section/div/section/div/div/ul/li[1]?) and for each item extract the word with its translation, sentences and transliterations... I'm not sure if this would be the best approach though...
As an example, the information I would like to have in the first row of my DataFrame (from the box highlighted in the screenshot) is:
行く, いく, go, 日曜日は図書館に行きます。, にちようび は としょかん に いきます。, I go to the library on Sundays.,私は夏休みにプールに行った。, わたし は なつやすみ に プール に いった。, I went to the pool during summer vacation.
The tags you are trying to scrape are not in the source code, probably because the page is JavaScript-rendered. Try this URL to see for yourself:
view-source:https://iknow.jp/courses/566921
The Python module Selenium solves this problem. If you would like, I could write some code for you to start on.
Here is some code to start on:
from selenium import webdriver
url = 'https://iknow.jp/courses/566921'
driver = webdriver.Chrome()
driver.get(url)
driver.implicitly_wait(2)
cues = driver.find_elements_by_class_name('cue')
cues = [cue.text for cue in cues]
responses = driver.find_elements_by_class_name('response')
responses = [response.text for response in responses]
texts = driver.find_elements_by_xpath('//*[@class="sentence-text"]/p[1]')
texts = [text.text for text in texts]
transliterations = driver.find_elements_by_class_name('transliteration')
transliterations = [transliteration.text for transliteration in transliterations]
translations = driver.find_elements_by_class_name('translation')
translations = [translation.text for translation in translations]
driver.close()
Note: You first need to install a webdriver. I chose Chrome.
Here is a link: https://chromedriver.storage.googleapis.com/index.html?path=2.41/. Also add it to your PATH!
If you have any other questions let me know!
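Once the lists are scraped, pairing the words with their translations is straightforward; a minimal sketch (assuming cues and responses line up one-to-one, as their equal lengths suggest; the sentences and transliterations need per-item traversal to align correctly, since their counts differ from the word counts):
import pandas as pd

df = pd.DataFrame({'cue': cues, 'response': responses})
df.to_csv('vocabulary.csv', index=False)  # 'vocabulary.csv' is a placeholder name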

Can't pull out the information from object using Beautiful Soup 4

I am working on scraping a website for the first time. I am trying to pull the latitude (in decimal degrees) from a website. I have managed to pull out the correct parent node that contains the information, but I am stuck on how to pull out the actual number from this. All of the searching I have done has only told me how to pull it out if I know the string (which I don't) or if the string is in a child node, which it isn't. Any help would be great.
Here is my code:
a_string = soup.find(string="Latitude in decimal degrees")
a_string.find_parents("p")
Out[46]: [<p><b>Latitude in decimal degrees</b><font size="-2">
(<u>see definition</u>)
</font><b>:</b> 35.7584895</p>]
test = a_string.find_parents("p")
print(test)
[<p><b>Latitude in decimal degrees</b><font size="-2"> (<u>see definition</u>)</font>
<b>:</b> 35.7584895</p>]
I need to pull out the 35.7584895 and save it as an object so I can append it to a dataset.
I am using Beautiful Soup 4 and Python 3.
The first thing to notice is that, since you have used the find_parents method (plural), test is a list. You need only the first item of it.
I will simulate your situation by doing this.
>>> import bs4
>>> HTML = '<p><b>Latitude in decimal degrees</b><font size="-2"> (<u>see definition</u>)</font><b>:</b> 35.7584895</p>'
>>> item_soup = bs4.BeautifulSoup(HTML, 'lxml')
The simplest way of recovering the textual content of this is to do this:
>>> item_soup.text
'Latitude in decimal degrees (see definition): 35.7584895'
However, you want the number. You can get this in various ways, two of which come to my mind. I assign the result of the previous statement to a variable (named text here, so as not to shadow the built-in str) so that I can manipulate the result.
>>> text = item_soup.text
One way is to search for the colon.
>>> text[1 + text.rfind(':'):].strip()
'35.7584895'
The other is to use a regex (using the standard re module directly, rather than reaching through bs4.re).
>>> import re
>>> re.search(r'(\d+\.\d+)', text).group(1)
'35.7584895'
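Putting it together against the original soup object, a minimal sketch (assuming the page structure shown above; the float conversion is an extra step so the value can be appended to a dataset as a number):
# grab the parent <p> of the label, then parse the number after the last colon
p = soup.find(string="Latitude in decimal degrees").find_parent("p")
latitude = float(p.get_text().rsplit(':', 1)[1])
print(latitude)  # 35.7584895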
