How can I cut the output from a response using Python requests?
The output looks like:
...\'\n});\nRANDOMDATA\nExt.define...
or
...\'\n});\nOTHERRANDOMDATA\nExt.define...
And I only want to print out the RANDOMDATA.
import requests
req = "https://example.org/endpoint"
response = requests.get(req, verify=False)
print(response.content)
You can use the re module's findall() function to search for all occurrences of a regular expression in a string.
The following code simply searches for the string RANDOMDATA in the response returned by requests.get():
import re
import requests
req = "https://example.org/endpoint"
response = requests.get(req, verify=False)
print(response.content)
# response.text is the decoded body; str(response.content) would search the b'...' repr instead
ar = re.findall('RANDOMDATA', response.text)
if ar:
    print(ar[0])
This link would be helpful to learn about regular expressions
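If the data between the markers changes on every request, searching for a fixed string will not find it. A minimal sketch, assuming the wanted text always sits on its own line between `});` and `Ext.define` and that the body is read as text; the sample string and the helper name `extract_between` are made up for illustration:

```python
import re

def extract_between(text):
    """Return the text between '});' and 'Ext.define', or None if absent."""
    # (.*?) lazily captures everything between the two markers;
    # re.DOTALL lets the capture span several lines if needed.
    match = re.search(r"\}\);\n(.*?)\nExt\.define", text, re.DOTALL)
    return match.group(1) if match else None

# Hypothetical stand-in for response.text
sample = "...'\n});\nRANDOMDATA\nExt.define..."
print(extract_between(sample))
```

With a real response you would pass `response.text` instead of `sample`. If the surrounding markers differ in the actual page, the pattern has to be adjusted accordingly.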
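Additionally, if you have a variable data containing the string to be searched and a variable t containing the string to search for, you can use
import re
arr = re.findall(t, data)
to return all occurrences of t in data, and
idx = data.find(t)
to get the index of the first occurrence of t in data (or -1 if t does not occur). Note that re.findall() treats t as a regular expression, so wrap it in re.escape() if it may contain special characters.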
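As a small illustration of the difference between the two calls, here is a sketch on a made-up string (the values of `data` and `t` are assumptions, not taken from the real response):

```python
import re

data = "abc RANDOMDATA def RANDOMDATA"  # hypothetical text to search in
t = "RANDOMDATA"                        # hypothetical text to search for

# re.findall returns a list with every non-overlapping match;
# re.escape makes sure t is treated as literal text, not as a pattern.
all_hits = re.findall(re.escape(t), data)

# str.find returns the index of the first occurrence, or -1 if there is none.
first_index = data.find(t)
not_found = data.find("missing")

print(all_hits, first_index, not_found)
```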
Related
I have a string, I have to get digits only from that string.
url = "www.mylocalurl.com/edit/1987"
Now from that string, I need to get 1987 only.
I have been trying this approach,
id = [int(i) for i in url.split() if i.isdigit()]
But I am getting [] list only.
You can use regex and get the digit alone in the list.
import re
url = "www.mylocalurl.com/edit/1987"
digit = re.findall(r'\d+', url)
output:
['1987']
Replace all non-digits with blank (effectively "deleting" them):
import re
num = re.sub(r'\D', '', url)
See live demo.
You aren't getting anything because by default the .split() method splits a sentence up where there are spaces. Since you are trying to split a hyperlink that has no spaces, it is not splitting anything up. What you can do is called a capture using regex. For example:
import re
url = "www.mylocalurl.com/edit/1987"
regex = r'(\d+)'
numbers = re.search(regex, url)
captured = numbers.groups()[0]
If you do not know what regular expressions are, the code is basically saying: using the regex string r'(\d+)', which means "capture one or more digits", search through the url. Then captured holds the first captured group, which is 1987.
If you don't want to use regex, you can still use .split(), but this time with / as the separator, for example url.split('/').
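The split-based approach could look roughly like this. It is a sketch that assumes the number is always the last path segment of the URL; the variable names are illustrative:

```python
url = "www.mylocalurl.com/edit/1987"

# Splitting on '/' yields ['www.mylocalurl.com', 'edit', '1987'];
# the last element is the part after the final slash.
last_segment = url.split('/')[-1]

# Convert only if the segment really consists of digits.
record_id = int(last_segment) if last_segment.isdigit() else None
print(record_id)
```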
I want to get the count of any phrase appearing in a URL, say https://en.wikipedia.org/wiki/India.
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/India'
r = requests.get(url)
soup = BeautifulSoup(r.text,'lxml')
Now, I want to get the count of the phrase India is a in the soup. How to go about this?
Please suggest.
This can be done in one of two ways.
First, the common denominator:
texts = soup.find_all(text=True)
cleaned = ["".join(t.strip()) for t in texts]
counter=0
Now, if you want to use regex:
import re
regex = re.compile(r'\bIndia is a\b')
for c in cleaned:
    if regex.match(c) is not None:
        counter+=1
I, personally, don't like using regex except as last resort, so I would go the longer way
phrase = 'India is a'
for c in cleaned:
    if phrase==c or phrase+' ' in c:
        counter+=1
In both cases, print(counter) outputs 6.
Note that, intentionally, these do not count the 3 situations where the phrase is part of a larger phrase (such as India is also); it counts only the exact phrase or the phrase followed by a space.
I tried the code below and it worked fine:
import re
import requests
url = 'https://en.wikipedia.org/wiki/India'
response = requests.get(url)
response_text = response.text
keyword = 'India is a'
match = re.findall(re.escape(keyword), response_text)
count = len(match)
print(count)
Output is 9.
This code will look into <head>, <body> and elsewhere.
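The counts reported in this thread differ partly because they measure different things: a plain substring search also hits longer phrases such as "India is also", while a word-boundary pattern does not, and counting text nodes is not the same as counting occurrences. A minimal sketch on made-up text nodes (the list `text_nodes` is an assumption standing in for the page text):

```python
import re

# Hypothetical text nodes standing in for the parsed page
text_nodes = [
    "India is a country",
    "India is also large",
    "He said India is a republic and India is a union",
]
phrase = "India is a"

# \b requires a word boundary, so "India is also" is not matched.
pattern = re.compile(r"\b" + re.escape(phrase) + r"\b")

# Number of nodes that contain the exact phrase at least once
nodes_with_phrase = sum(1 for node in text_nodes if pattern.search(node))

# Total number of exact-phrase occurrences across all nodes
total_occurrences = sum(len(pattern.findall(node)) for node in text_nodes)

# Plain substring count, which also counts "India is also"
substring_count = sum(node.count(phrase) for node in text_nodes)

print(nodes_with_phrase, total_occurrences, substring_count)
```

Which of the three numbers is the right one depends on what "count of the phrase" is supposed to mean for your use case.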
Here is a URL like https://www.example.com?timestamp={timestamp}&sign={sign}
I have calculated the sign with some algorithm and got the following value:
org_sign = "HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE/x3Xv/KFNE="
Now I want to encode the value and add it to the url.
In urllib.parse.quote_plus, the '/' => '%2F', and '='=>'%3D', so I got 'HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE%2Fx3Xv%2FKFNE%3D' .
But I want %2f and %3d, giving HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE%2fx3Xv%2fKFNE%3d. I cannot simply lowercase the whole sign with sign.lower(), because the sign itself is case-sensitive.
Python 3.7.2
>>> from urllib import parse
>>> org_sign = "HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE/x3Xv/KFNE="
>>> parse.quote_plus(org_sign)
'HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE%2Fx3Xv%2FKFNE%3D'
I read the urllib documentation, and it doesn't mention anything about the case of the percent escapes.
This workaround works for me:
import urllib.parse
# Relies on urllib.parse internals (Quoter, _safe_quoters) as found in Python 3.7;
# these are not documented API and may differ in other versions.
class LowerCaseQuoter(urllib.parse.Quoter):
    def __missing__(self, b):
        # Handle a cache miss. Store quoted string in cache and return.
        res = chr(b) if b in self.safe else '%{:02x}'.format(b)
        self[b] = res
        return res
urllib.parse._safe_quoters[b''] = LowerCaseQuoter(b'').__getitem__
Original string: /media/Filmy/PohadkyCD/Sofie Prvni/Season 1 (2013-2014) + pilot/1x00.Sofie První, Byla jednou jedna princezna (Once Upon a Princess).avi
Encoded string by urllib.parse.quote(filename, safe=''): %2fmedia%2fFilmy%2fPohadkyCD%2fSofie%20Prvni%2fSeason%201%20%282013-2014%29%20%2b%20pilot%2f1x00.Sofie%20Prvn%c3%ad%2c%20Byla%20jednou%20jedna%20princezna%20%28Once%20Upon%20a%20Princess%29.avi
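An alternative that avoids touching urllib internals is to quote normally and then lowercase only the percent escapes. This is a sketch, not part of the workaround above; the helper name `quote_plus_lower` is made up for illustration:

```python
import re
from urllib.parse import quote_plus

def quote_plus_lower(value):
    """Like quote_plus, but with lowercase hex digits in the %XX escapes."""
    quoted = quote_plus(value)
    # quote_plus escapes every literal '%' as %25, so each %XX in the
    # result is an escape sequence and can be lowercased safely without
    # changing the case of the other characters.
    return re.sub(r"%[0-9A-F]{2}", lambda m: m.group(0).lower(), quoted)

org_sign = "HBwZu47FkVdQGTyd7uLjfVAQA8nBSwqvE/x3Xv/KFNE="
print(quote_plus_lower(org_sign))
```

Only the two hex digits after each `%` change case, so the case-sensitive characters of the sign stay intact.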
I'm trying to fetch data inside a (big) script tag within HTML. Using BeautifulSoup I can reach the necessary script, yet I cannot get the data I want.
What I'm looking for inside this tag resides within a list called "Beleidsdekkingsgraad", more specifically:
["Beleidsdekkingsgraad","107,6","107,6","109,1","109,8","110,1","111,5","112,5","113,3","113,3","114,3","115,7","116,3","116,9","117,5","117,8","118,1","118,3","118,4","118,6","118,8","118,9","118,9","118,9","118,5","118,1","117,8","117,6","117,5","117,1","116,7","116,2"] even more specific; the last entry in the list (116,2)
Following 1 or 2 cannot get the case done.
What I've done so far
base='https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed'
url=requests.get(base)
soup=BeautifulSoup(url.text, 'html.parser')
all_scripts = soup.find_all('script')
all_scripts[3].get_text()[1907:2179]
This, however, is not satisfying, since the indexing has to be changed each time new numbers are added.
What I'm looking for is an easy way to extract the list from the script tag and then get the last number of the extracted list (i.e. 116,2).
You could use a regex to pull out the JavaScript object holding that item and then parse it with the json library:
import requests,re,json
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'window\.infographicData=(.*);')
data = json.loads(p.findall(r.text)[0])
result = [i for i in data['elements'][1]['data'][0] if 'Beleidsdekkingsgraad' in i][0][-1]
print(result)
Or do the whole thing with regex:
import requests,re
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'\["Beleidsdekkingsgraad".+?,"([0-9,]+)"\]')
print(p.findall(r.text)[0])
Another option:
import requests,re, json
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'(\["Beleidsdekkingsgraad".+?"\])')
print(json.loads(p.findall(r.text)[0])[-1])
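To see the regex-plus-json idea without hitting the network, here is a sketch that runs on a shortened, made-up script body (the `sample` string is an assumption; the real page contains far more data):

```python
import json
import re

# Hypothetical, shortened stand-in for the script content in r.text
sample = (
    'window.infographicData={"x":[["Datum","jan","feb"],'
    '["Beleidsdekkingsgraad","107,6","116,2"]]};'
)

# Capture the whole JSON array that starts with "Beleidsdekkingsgraad";
# .+? stops lazily at the first closing '"]'.
pattern = re.compile(r'(\["Beleidsdekkingsgraad".+?"\])')
row = json.loads(pattern.search(sample).group(1))

last_value = row[-1]
# The values use a decimal comma, so swap it before converting to float.
last_number = float(last_value.replace(",", "."))
print(last_value, last_number)
```

Because the last element is taken with `[-1]`, no index has to be changed when new numbers are appended to the list.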
# Ex1
# Number of datasets currently listed on data.gov
# http://catalog.data.gov/dataset
import requests
import re
from bs4 import BeautifulSoup
page = requests.get(
"http://catalog.data.gov/dataset")
soup = BeautifulSoup(page.content, 'html.parser')
value = soup.find_all(class_='new-results')
results = re.search([0-9][0-9][0-9],[0-9][0-9][0-9], value
print(value)
The code is above. I want to find text matching the regex [0-9][0-9][0-9],[0-9][0-9][0-9]
inside the text held in the variable 'value'.
How can I do this?
Based on ShellayLee's suggestion I changed it to
import requests
import re
from bs4 import BeautifulSoup
page = requests.get(
"http://catalog.data.gov/dataset")
soup = BeautifulSoup(page.content, 'html.parser')
value = soup.find_all(class_='new-results')
my_match = re.search(r'\d\d\d,\d\d\d', value)
print(my_match)
STILL GETTING ERROR
Traceback (most recent call last):
  File "ex1.py", line 19, in <module>
    my_match = re.search(r'\d\d\d,\d\d\d', value)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py", line 182, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
You need some basics of regex in Python. A regex in Python is represented as a string, and the re module provides functions like match, search and findall, which take such a string as an argument and treat it as a pattern.
In your case, the pattern [0-9][0-9][0-9],[0-9][0-9][0-9] can be represented as:
my_pattern = r'\d\d\d,\d\d\d'
then used like
my_match = re.search(my_pattern, value_text)
where \d means a digit (same as [0-9]). The r prefix means the backslashes in the string are not treated as escape characters.
The search function returns a match object.
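The TypeError in the question arises because find_all() returns a list-like collection of tags rather than a string, so the text has to be extracted first (for example with get_text() on one of the tags). A minimal sketch of the search step itself, where the string assigned to `value_text` is a made-up stand-in for that extracted text:

```python
import re

# Hypothetical stand-in for the text of the 'new-results' element,
# e.g. obtained via value[0].get_text() in the original code
value_text = "283,498 datasets found"

my_pattern = r"\d\d\d,\d\d\d"
my_match = re.search(my_pattern, value_text)

if my_match:
    # group() returns the matched text; strip the comma to get an int.
    matched_text = my_match.group()
    dataset_count = int(matched_text.replace(",", ""))
    print(matched_text, dataset_count)
else:
    print("no match")
```

re.search() returns None when nothing matches, which is why the result is checked before calling group().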
I suggest you walk through some tutorials first to clear up any further confusion. The official HOWTO is well written:
https://docs.python.org/3.6/howto/regex.html