Automatically generate Google search query - python-3.x

I am trying to parse the HTML data of certain patents to gather information, using Python 3.7 and bs4.
My problem, simplified:
Given this URL
https://patents.google.com/patent/X/en?oq=Y
Where:
X = a string automatically generated by Google
Y = my user input (a patent number)
and usually: X == Y (some patent number)
I need to get the value of X.
A more detailed description of my problem:
For 90% of my queries, there is no problem, as I can simply parse using the following code:
import requests
from bs4 import BeautifulSoup

patent_number = "EP1000000B1"
patent_url = "https://patents.google.com/patent/" + patent_number + "/en?oq=" + patent_number
r = requests.get(patent_url)
response = r.content
soup = BeautifulSoup(response, "html.parser")
However, sometimes the query structure varies, for instance:
I try to search for patent number WO198700753A1 using the code above, but I get a 404 error, because the URL
https://patents.google.com/patent/WO198700753A1/en?oq=WO198700753A1
does not exist.
The trailing part (en?oq=" + patent_number) does not seem to be relevant, but the first part of the URL is.
Searching Google Patents by hand reveals that Google automatically redirects my query from WO198700753A1 to WO1987000753A1 (another 0 added).
Is there any way to automatically generate my url (the part in the middle), so my program will always find results?
Thanks for your help ;)

At the moment, the structure of the link has changed and the oq parameter is no longer needed. The patent number cannot be generated arbitrarily, because each patent has its own unique number; if you build a URL for a patent number that does not exist, that page will not exist either.
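One possible workaround, based only on the WO198700753A1 → WO1987000753A1 example above: request the plain /patent/<number>/en URL first, and if it 404s and the number looks like a WO publication, retry with the serial part zero-padded to six digits. This is a sketch; the regex and the padding rule are assumptions, not an official Google Patents numbering scheme:

import re
import requests

def resolve_patent_url(patent_number):
    """Return a working Google Patents URL for patent_number, or None."""
    base = "https://patents.google.com/patent/{}/en"
    candidates = [patent_number]
    # Assumed WO format: "WO" + 4-digit year + serial + kind code (e.g. A1).
    m = re.fullmatch(r"(WO)(\d{4})(\d+)([A-Z]\d?)", patent_number)
    if m:
        prefix, year, serial, kind = m.groups()
        candidates.append(prefix + year + serial.zfill(6) + kind)
    for number in candidates:
        r = requests.get(base.format(number))
        if r.status_code == 200:
            return r.url  # the resolved URL may differ from the input number
    return None

print(resolve_patent_url("WO198700753A1"))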

Related

Scraping a list from a website returns a null result based on XPath

So I'm trying to scrape the job listings off this site: https://www.dsdambuster.com/careers.
I have the following code:
url = "https://www.dsdambuster.com/careers"
page = requests.get(url, verify=False)
tree = html.fromstring(page.content)
path = '/html/body/div[1]/section/div/div/div[2]/div[1]/div/div[2]/div/div[9]/div[1]/div[3]/div[*]/div[1]/a[*]/div/div[1]/div'
jobs = tree.xpath(xpath)
for job in jobs:
Title = (job.text)
print(Title)
Not too sure why it wouldn't work...
I see 2 issues here:
You are using very bad XPath. It is extremely fragile and not reliable.
Instead of
'/html/body/div[1]/section/div/div/div[2]/div[1]/div/div[2]/div/div[9]/div[1]/div[3]/div[*]/div[1]/a[*]/div/div[1]/div'
Please use
'//div[@class="vf-vacancy-title"]'
You are possibly missing a wait / delay.
I'm not familiar with the approach you are using here, but with Selenium, which I am familiar with, you need to wait for the elements to load completely before extracting their text contents.
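For what it's worth, a minimal Selenium sketch of that wait might look like the following; the vf-vacancy-title class comes from the XPath above, and the Chrome driver setup is just an assumption:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # any WebDriver will do
driver.get("https://www.dsdambuster.com/careers")

# Wait up to 10 seconds for the JavaScript-rendered vacancy titles to appear.
wait = WebDriverWait(driver, 10)
jobs = wait.until(
    EC.presence_of_all_elements_located((By.XPATH, '//div[@class="vf-vacancy-title"]'))
)
for job in jobs:
    print(job.text)

driver.quit()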

How to get the right video URL of an Instagram post using Python

I am trying to build a program with a function that takes the URL of a post as input and outputs the links of the images and videos the post contains. It works really well for images. However, when it comes to getting the links of videos, it returns a wrong URL. I have no idea how to handle this situation.
https://scontent-lax3-2.cdninstagram.com/v/t50.2886-16/86731551_2762014420555254_3542675879337307555_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InZ0c192b2RfdXJsZ2VuLjcyMC5jYXJvdXNlbF9pdGVtIiwicWVfZ3JvdXBzIjoiW1wiaWdfd2ViX2RlbGl2ZXJ5X3Z0c19vdGZcIl0ifQ&_nc_ht=scontent-lax3-2.cdninstagram.com&_nc_cat=106&_nc_ohc=WDuXskvIuLEAX9rj7MU&vs=17877888256532912_3147883953&_nc_vs=HBksFQAYJEdCOXJLd1gyMVdhWUNkQUpBS090UGo3eEhDb3hia1lMQUFBRhUAAsgBABUAGCRHTFBXTUFVTXNPaG5XcW9EQU5KUEE5bEZVdVZxYmtZTEFBQUYVAgLIAQAoABgAGwGIB3VzZV9vaWwBMBUAABgAFuD4nJGH9sE%2FFQIoAkMzLBdAEszMzMzMzRgSZGFzaF9iYXNlbGluZV8xX3YxEQB17gcA&_nc_rid=97e769e058&oe=5EDF10A5&oh=3713c35f89fca1aa9554a281aa3421ed
https://scontent-gmp1-1.cdninstagram.com/v/t50.2886-16/0_0_0_\x00.mp4?_nc_ht=scontent-gmp1-1.cdninstagram.com&_nc_cat=100&_nc_ohc=Wnu_-GvKHJoAX9F_ui1&oe=5EDE8214&oh=7920ac3339d5bf313e898c3cbec85fa2
Here are two URLs. The first one is copied from the source of the web page, while the second one is copied from the data scraped by pyquery. They come from the same Instagram post, the same path, but they are totally different. The first one works well, but the second one doesn't. How can I solve this? How can I get the right video URL?
I have been searching the net for a long time, but to no avail. Please help, or try to give some ideas on how to achieve this.
Here is my code related to the question
import json
from pyquery import PyQuery as pq

def getUrls(url):
    URL = str(url)
    html = get_html(URL)  # helper (defined elsewhere) that returns the page's HTML
    doc = pq(html)
    urls = []
    items = doc('script[type="text/javascript"]').items()
    for item in items:
        if item.text().strip().startswith('window._sharedData'):
            js_data = json.loads(item.text()[21:-1])
            shortcode_media = js_data["entry_data"]["PostPage"][0]["graphql"]["shortcode_media"]
            edges = shortcode_media['edge_sidecar_to_children']['edges']
            for edge in edges:
                is_video = edge['node']['is_video']
                if is_video:
                    video_url = edge['node']['video_url']
                    video_url = video_url.replace(r'\u0026', "&")  # str.replace returns a new string
                    urls.append(video_url)
                else:
                    display_url = edge['node']['display_resources'][-1]['src']
                    display_url = display_url.replace(r'\u0026', "&")
                    urls.append(display_url)
    return urls
Thanks in advance.
There's nothing wrong with your code. This is a known intermittent issue with Instagram, and other people have encountered it too: https://github.com/arc298/instagram-scraper/issues/545
There doesn't appear to be a known fix yet.
Also, while unrelated to your question, it's worth mentioning that you don't need to inspect the display_resources object to get the URL of the image:
display_url = edge['node']['display_resources'][-1]['src']
There is already a display_url property available (I'm guessing you saw this, based on the variable name?). So you can simply do:
display_url = edge['node']['display_url']
I've seen this sometimes when using this Python module instead of HTML-scraping. At least with that module, edge["node"]["videos"]["standard_resolution"]["url"] usually (but not always) gives a non-corrupted value.

Is there a way to fetch the URL from a Google search result when a CSV file full of keywords is uploaded in Python?

Is it possible to obtain the URL from the Google search result page, given a keyword? I have a CSV file that contains a lot of company names, and I want their websites, i.e. whatever shows up at the top of the Google search results. When I upload that CSV file, it should fetch each company name/keyword and put it in the search field.
For example, "stack overflow" is one of the entries in my CSV file; it should be fetched, put in the search field, and it should return the best match / first URL from the search results, e.g. www.stackoverflow.com.
The returned result should be stored in the same file that I uploaded, next to the keyword that was searched.
I am not aware much about these concepts, so any help will be very appreciated.
Thanks!
The google package has one dependency, beautifulsoup, which needs to be installed first.
Then install:
pip install google
search(query, tld='com', lang='en', num=10, start=0, stop=None, pause=2.0)
query : query string that we want to search for.
tld : the top-level domain, i.e. whether to search on google.com, google.in, or some other Google domain.
lang : lang stands for language.
num : Number of results we want.
start : First result to retrieve.
stop : Last result to retrieve. Use None to keep searching forever.
pause : the time to wait between HTTP requests. Too short a pause may cause Google to block your IP; a longer pause makes your program slower, but it is the safer option.
Return : Generator (iterator) that yields found URLs. If the stop parameter is None the iterator will loop forever.
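For example, a minimal call matching the signature described above, fetching just the top result for one query, might look like this (the googlesearch module name comes with the pip install google package):

from googlesearch import search

# stop=1 stops the generator after the first (top-ranked) result.
for url in search("stack overflow", tld="com", num=10, stop=1, pause=2):
    print(url)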
The code below is a solution for your question.
import pandas
from googlesearch import search

df = pandas.read_csv('test.csv')
result = []
for i in range(len(df['keys'])):
    for j in search(df['keys'][i], tld="com", num=10, stop=1, pause=2):
        result.append(j)
dict1 = {'keys': df['keys'], 'url': result}
df = pandas.DataFrame(dict1)
df.to_csv('test.csv')

Use pdfplumber to find text in PDF, return page number, then return table

I downloaded 42 PDFs which are each formatted similarly. Each has various tables, one of which is labeled "Campus Reported Incidents." That particular table is on a different page in each PDF. I want to write a function that will search for the page that has "Campus Reported Incidents" and scrape that table so that I can put it into a dataframe.
I figured that I could use PDFPlumber to search for the string "Campus Reported Incidents" and return the page number. I would then write a function that uses the page number to scrape the table I want, and I would loop that function through every PDF. However, I keep on getting the error "argument is not iterable" or "type object is not subscriptable." I looked through the PDFPlumber documentation but it didn't help my problem.
Here is one example of code that I tried:
url = "pdfs/example.pdf"
import pdfplumber
pdf = pdfplumber.open(url)
for page in range[0:len(pdf.pages)]:
    if 'Total number of physical restraints' in pdf.pages[page]:
        print(pdf.page_number)
I see this post was from a while ago but maybe this response will still help you or someone else.
The error looks like it's coming from the way you are looping through the pages: range[...] uses square brackets, which subscripts the range type itself instead of calling it, hence the "type object is not subscriptable" message. Instead, try enumerating through the pages. The "i" gives you access to the index (i.e. the current count in the loop), and the "pg" gives you access to the page object in the PDF pages. I didn't use the "pg" variable below, but you could use it instead of "pages[i]" if you want.
The code below should print the tables from each page, as well as give you access to the tables to manipulate them further.
import pdfplumber

pdf_file = "pdfs/example.pdf"
tables = []
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for i, pg in enumerate(pages):
        tbl = pages[i].extract_tables()
        print(f'{i} --- {tbl}')
        tables.append(tbl)  # keep the tables around for further manipulation
This has nothing to do with pdfplumber.
It should be range(), not range[].
Please try the below:
url = "pdfs/example.pdf"
import pdfplumber
pdf = pdfplumber.open(url)
for page in range(0, len(pdf.pages)):
    if 'Total number of physical restraints' in pdf.pages[page]:
        print(pdf.page_number)
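Note that neither snippet above actually searches the page text: checking 'phrase' in pdf.pages[page] tests membership on a Page object (which is where the "argument is not iterable" error comes from), and page_number belongs to the page, not the pdf. A sketch of the full find-the-page-then-extract-its-tables flow, assuming the phrase appears in the page's extractable text:

import pdfplumber

def tables_for_phrase(pdf_path, phrase):
    """Return (page_number, tables) for the first page whose text contains phrase."""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""  # extract_text() can return None
            if phrase in text:
                return page.page_number, page.extract_tables()
    return None, []

page_number, tables = tables_for_phrase("pdfs/example.pdf", "Campus Reported Incidents")
print(page_number)
for table in tables:
    print(table)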

Web-scraping with recursion - where to put return function

I am trying to scrape a website for text. Each page contains a link to the next page, i.e. the first page has the link "/chapter1/page2.html", which has the link "/chapter1/page3.html", and so on. The last page has no link. I am trying to write a program that accesses a URL, prints the text of the page, searches through the text to find the URL of the next page, and loops until the very last page, which has no URL. I try to use if, else, and return, but I do not understand where I need to put them.
import re
import requests

def scrapy(url):
    result = requests.get(url, timeout=30.0)
    result.encoding = 'cp1251'
    page = result.text
    link = re.findall(r"\bh(.+?)html", page)  # finds link to next page between tags
    print("Continue parsing next page!")
    url = link
    print(url)
    return url
url = "http://mywebpage.com/chapter1/page1.html"
result = requests.get(url, timeout=30.0)
result.encoding = 'cp1251'
page = result.text
link = re.findall(r"\bh(.+?)html", page)
if link == -1:
    print("No url!")
else:
    scrapy(url)
Unfortunately it doesn't work; it makes only one loop. Could you kindly tell me what I am doing wrong?
A couple of things: first, to be recursive, scrapy needs to call itself. Second, recursive functions need branching logic for a base case and a recursive case. In other words, you need part of your function to look like this (pseudocode):
if allDone
    return
else
    recursiveFunction(argument)
For scrapy, you'd want this branching logic below the line where you find the link (the one where you call re.findall). If you don't find a link, then scrapy is done. If you find a link, then you call scrapy again, passing your newly found link. Your scrapy function will probably need a few more small fixes, but hopefully that will get you unstuck with recursion.
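A sketch of that shape in Python, keeping the timeout and encoding from the question; the href-based regex and the urljoin call to resolve the relative links are assumptions (the original \bh(.+?)html pattern is one of those small fixes):

import re
import requests
from urllib.parse import urljoin

def scrapy(url):
    result = requests.get(url, timeout=30.0)
    result.encoding = 'cp1251'
    page = result.text
    print(page)  # print the text of the current page

    # Assumption: the next page is linked as href="...html" somewhere in the page.
    links = re.findall(r'href="([^"]+?\.html)"', page)
    if not links:                      # base case: the last page has no link
        print("No url!")
        return
    next_url = urljoin(url, links[0])  # recursive case: follow the link
    print("Continue parsing next page!", next_url)
    scrapy(next_url)

scrapy("http://mywebpage.com/chapter1/page1.html")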
If you want to get really good at thinking in terms of recursion, this book is a good one: https://www.amazon.com/Little-Schemer-Daniel-P-Friedman/dp/0262560992
