os.listdir() won't go past nine files - python-3.x

Hello, I am trying to make a program that automatically goes to imgur, enters the name that you typed, and downloads the top 10 images. Everything is working except the part using the os module. When I call os.listdir(), it won't show any more files once there are nine of them. I tried googling and found nothing; if you see something that I messed up, please tell me. Thanks in advance.
Here is the code sample:
#! python3
import requests, os, sys
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

os.chdir('imgur/')
broswer = webdriver.Chrome(executable_path=r'C:\Users\{YOUR USERNAME}\Downloads\chromedriver.exe')
broswer.get('https://imgur.com/')
broswer.maximize_window()
search_bar = broswer.find_element_by_tag_name('input')
search_bar.send_keys('happy')
search_bar.send_keys(Keys.ENTER)
pictures = broswer.find_elements_by_tag_name('img')
for i in range(1, 11):
    res = requests.get(pictures[i].get_attribute('src'))
    try:
        res.raise_for_status()
    except:
        print('Link doesnt exist')
    if os.listdir() == []:
        picture = open('picture1.png', 'wb')
    else:
        picture = open('picture' + str(int(os.listdir()[-1][7:-4]) + 1) + '.png', 'wb')
    print(os.listdir())
    for chunk in res.iter_content(100000):
        picture.write(chunk)
    picture.close()

os.listdir(".") #u missed adding address

Related

Selenium can't find a CSS selector

Selenium raises a NoSuchElementException after retrieving exactly 9 entries from the website. I think the problem might be that the page content doesn't have enough time to load, but I'm not sure.
I've written the code following this YouTube tutorial (nineteenth minute).
import requests
import json
import re
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
URL = 'https://www.alibaba.com//trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=white+hoodie'
time.sleep(1)
driver.get(URL)
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
time.sleep(2)
driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
time.sleep(2)

items = driver.find_elements_by_css_selector('.J-offer-wrapper')
num = 1
for i in items:
    print(num)
    product_name = i.find_element_by_css_selector('h4').text
    price = i.find_element_by_css_selector('.elements-offer-price-normal').text
    time.sleep(0.5)
    num += 1
    print(price, product_name)
#driver.close()
If you have a clue why Selenium stops at the 10th entry and how to overcome this issue, please share.
You are getting that because the 10th item is not like the rest. It's an ad thingy and not a hoodie as you've searched for. I suspect you'd want to exclude this so you are left only with the results you are actually interested in.
All you need to do is change the way you identify items (this is just one of the options):
items = driver.find_elements_by_css_selector('.img-switcher-parent')
You need to add error handling, as below:
for i in items:
    print(num)
    try:
        product_name = i.find_element_by_css_selector('h4').text
    except:
        product_name = ''
    try:
        price = i.find_element_by_css_selector('.elements-offer-price-normal').text
    except:
        price = ''
    time.sleep(0.5)
    num += 1
    print(price, product_name)
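If the page really is still loading, as the question suspects, an explicit wait is the usual alternative to fixed time.sleep() calls. A short sketch, assuming the '.img-switcher-parent' selector suggested above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one result card to appear
# before collecting the elements, instead of guessing with sleep().
wait = WebDriverWait(driver, 10)
items = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.img-switcher-parent')))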

How to Download webpage as .mhtml

I am able to successfully open a URL and save the resultant page as a .html file. However, I am unable to determine how to download and save a .mhtml (Web Page, Single File).
My code is:
import urllib.parse, time
from urllib.parse import urlparse
import urllib.request
url = ('https://www.example.com')
encoded_url = urllib.parse.quote(url, safe='')
print(encoded_url)
base_url = ("https://translate.google.co.uk/translate?sl=auto&tl=en&u=")
translation_url = base_url+encoded_url
print(translation_url)
req = urllib.request.Request(translation_url, headers={'User-Agent': 'Mozilla/6.0'})
print(req)
response = urllib.request.urlopen(req)
time.sleep(15)
print(response)
webContent = response.read()
print(webContent)
f = open('GoogleTranslated.html', 'wb')
f.write(webContent)
print(f)
f.close()
I have tried to use wget using the details captured in this question: How to download a webpage (mhtml format) using wget in python, but the details are incomplete (or I am simply unable to understand them).
Any suggestions would be helpful at this stage.
Did you try using Selenium with a Chrome webdriver to save the page?
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import visibility_of_element_located
from selenium.webdriver.support.ui import WebDriverWait
import pyautogui
URL = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
FILE_NAME = ''
# open page with selenium
# (first need to download Chrome webdriver, or a firefox webdriver, etc)
driver = webdriver.Chrome()
driver.get(URL)
# wait until body is loaded
WebDriverWait(driver, 60).until(visibility_of_element_located((By.TAG_NAME, 'body')))
time.sleep(1)
# open 'Save as...' to save html and assets
pyautogui.hotkey('ctrl', 's')
time.sleep(1)
if FILE_NAME != '':
    pyautogui.typewrite(FILE_NAME)
pyautogui.hotkey('enter')
I have a better solution, which does not involve any manual operation and lets you specify the path where the mhtml file is stored. I learned this from a Chinese blog. The key idea is to use a Chrome DevTools command.
The code is shown below as an example.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.qq.com/')
# Execute a Chrome DevTools command to obtain the mhtml snapshot
res = driver.execute_cdp_cmd('Page.captureSnapshot', {})
# Write the snapshot to a local file
with open('./store/qq.mhtml', 'w', newline='') as f:
    f.write(res['data'])
driver.quit()
Hope this will help!
More about the Chrome DevTools Protocol commands.
To save as mhtml, you need to add the argument '--save-page-as-mhtml':
options = webdriver.ChromeOptions()
options.add_argument('--save-page-as-mhtml')
driver = webdriver.Chrome(options=options)
I wrote it just the way it was. Sorry if it's wrong.
I created a class, so you can use it. The example is in the three lines below.
Also, you can change the number of seconds to sleep as you like.
Incidentally, non-English keyboards such as Japanese and Hangul keyboards are also supported.
import time
import chromedriver_binary  # puts the bundled chromedriver on PATH
from selenium import webdriver
import pyautogui
import pyperclip
import uuid

class DownloadMhtml(webdriver.Chrome):
    def __init__(self):
        super().__init__()
        self._first_save = True
        time.sleep(2)

    def save_page(self, url, filename=None):
        self.get(url)
        time.sleep(3)
        # open 'Save as...' to save html and assets
        pyautogui.hotkey('ctrl', 's')
        time.sleep(1)
        if filename is None:
            pyperclip.copy(str(uuid.uuid4()))
        else:
            pyperclip.copy(filename)
        time.sleep(1)
        pyautogui.hotkey('ctrl', 'v')
        time.sleep(2)
        if self._first_save:
            # extra keystrokes are only needed the first time the save dialog appears
            pyautogui.hotkey('tab')
            time.sleep(1)
            pyautogui.press('down')
            time.sleep(1)
            pyautogui.press('up')
            time.sleep(1)
            pyautogui.hotkey('enter')
            time.sleep(1)
            self._first_save = False
        pyautogui.hotkey('enter')
        time.sleep(1)

# example
dm = DownloadMhtml()
dm.save_page('https://en.wikipedia.org/wiki/Python_(programming_language)', 'wikipedia_python')  # creates a file named "wikipedia_python.mhtml"
dm.save_page('https://www.python.org/')  # file named randomly based on uuid4
python3.8.10
selenium==4.4.3

How can I fix encoding problems without a metric-ton of .replace()? Python3 Chrome-Driver BS4?

The print() command prints the scraped text perfectly to the IDLE shell. However, write/writelines/print will not write it to a file without throwing many encoding errors or producing garbled "super-geek-squad" output.
I tried various forms of .encode(encoding='...', errors='...') to no avail.
With many different encodings, the output turned into garbled characters or runs of ?'s inside the text file.
If I were willing to spend forever writing .replace('...','...') calls, as shown where text is defined in the code below, I could get this to work completely:
#! python3
import os
import os.path
from os import path
import requests
import bs4 as BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.chrome.options import Options

def Close():
    driver.stop_client()
    driver.close()
    driver.quit()

CHROMEDRIVER_PATH = 'E:\Downloads\chromedriver_win32\chromedriver.exe'

# start raw html
NovelName = 'Novel/Isekai-Maou-to-Shoukan-Shoujo-Dorei-Majutsu'
BaseURL = 'https://novelplanet.com'
url = '%(U)s/%(N)s' % {'U': BaseURL, "N": NovelName}

options = Options()
options.add_experimental_option("excludeSwitches", ["ignore-certificate-errors"])
#options.add_argument("--headless") # Runs Chrome in headless mode.
#options.add_argument('--no-sandbox') # Bypass OS security model
#options.add_argument('--disable-gpu') # applicable to windows os only
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
#options.add_argument("--disable-extensions")

driver = webdriver.Chrome(CHROMEDRIVER_PATH, options=options)
driver.get(url)

# wait for title not be equal to "Please wait 5 seconds..."
wait = WebDriverWait(driver, 10)
wait.until(lambda driver: driver.title != "Please wait 5 seconds...")
soup = BeautifulSoup.BeautifulSoup(driver.page_source, 'html.parser')
# End raw html

# Start get first chapter html coded
i = 0
for chapterLink in soup.find_all(class_='rowChapter'):
    i += 1
    cLink = chapterLink.find('a').contents[0].strip()
print(driver.title)
# end get first chapter html coded

# start navigate to first chapter
link = driver.find_element_by_link_text(cLink)
link.click()
# end navigate to first chapter

# start copy of chapter and add to a file
def CopyChapter():
    wait = WebDriverWait(driver, 10)
    wait.until(lambda driver: driver.title != "Please wait 5 seconds...")
    print(driver.title)
    soup = BeautifulSoup.BeautifulSoup(driver.page_source, 'html.parser')
    readables = soup.find(id='divReadContent')
    text = readables.text.strip().replace('混','').replace('魔','').replace('族','').replace('デ','').replace('イ','').replace('ー','').replace('マ','').replace('ン','').replace('☆','').replace('ッ','Uh').replace('『','[').replace('』',']').replace('“','"').replace('”','"').replace('…','...').replace('ー','-').replace('○','0').replace('×','x').replace('《',' <<').replace('》','>> ').replace('「','"').replace('」','"')
    name = driver.title
    file_name = (name.replace('Read ', "").replace(' - NovelPlanet', "") + '.txt')
    print(file_name)
    #print(text) # <-- This shows the correct text in the shell with no errors
    with open(file_name, 'a+') as file:
        print(text, file=file) # <- this never works without a bunch of .replace() where text is defined
    global lastURL
    lastURL = driver.current_url
    NextChapter()
# end copy of chapter and add to a file

# start goto next chapter if exists then return to copy chapter else Close()
def NextChapter():
    soup = BeautifulSoup.BeautifulSoup(driver.page_source, 'html.parser')
    a = 0
    main = soup.find(class_='wrapper')
    for container in main.find_all(class_='container'):
        a += 1
        row = container.find(class_='row')
        b = 0
        for chapterLink in row.find_all(class_='4u 12u(small)'):
            b += 1
            cLink = chapterLink.find('a').contents[0].strip()
    link = driver.find_element_by_link_text(cLink)
    link.click()
    wait = WebDriverWait(driver, 10)
    wait.until(lambda driver: driver.title != "Please wait 5 seconds...")
    global currentURL
    currentURL = driver.current_url
    if currentURL != lastURL:
        CopyChapter()
    else:
        print('Finished!!!')
        Close()
# end goto next chapter if exists then return to copy chapter else Close()

CopyChapter()
#EOF
The expected result is a text file whose contents are exactly the same as the IDLE print(text) output, with absolutely no changes. Then I could check that every chapter gets copied for offline viewing and that the script stops at the last chapter posted.
As things stand, unless I keep adding more and more .replace() calls for every novel and chapter, this will never work properly. I wouldn't mind manually removing the ad descriptions with .replace(), but if there is a better way to do that too, how can it be done?
Windows 10
Python 3.7.0
There was some reason for os and os.path in an earlier version of this script but now I don't remember if it is still needed or not.
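One thing that may remove the need for the .replace() chain entirely (a sketch, not a guaranteed fix for this particular site): open() without an encoding argument uses the Windows locale codec, typically cp1252, which cannot represent Japanese characters and raises UnicodeEncodeError. Writing the file as UTF-8 lets the scraped text go in verbatim; this would replace the with-block inside CopyChapter():

# Open the output file as UTF-8 so the Japanese punctuation and
# full-width characters survive instead of failing under cp1252.
with open(file_name, 'a+', encoding='utf-8') as file:
    print(text, file=file)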

How do I stop argparser from printing default search?

Link to my code: https://pastebin.com/y4zLD2Dp
The imports that have not been used yet will be used as I progress through my project; I just like to have all the imports I need ready to go. The goal for this program is a YouTube video downloader, starting with mp3 format. This is my first big project by my standards, and I have only been coding for just over 2 months.
from bs4 import BeautifulSoup
from selenium import webdriver
import requests
import sqlite3
import argparse
import sys
from selenium import webdriver
from apiclient.discovery import build
from apiclient.errors import HttpError
from oauth2client.tools import argparser

#To get a developer key visit https://console.developers.google.com.
DEVELOPER_KEY = ""
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"

def yt_search(options):
    '''Searches for results on youtube and stores them in the database.'''
    youtube = build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION,
                    developerKey=DEVELOPER_KEY)
    search = youtube.search().list(q=options.q, part="id,snippet",
                                   maxResults=options.max_results).execute()
    video_id = []
    #Add results to a list and print them.
    for result in search.get("items", []):
        if result["id"]["kind"] == "youtube#video":
            video_id.append("%s (%s)" % (result["snippet"]["title"],
                                         result["id"]["videoId"]))
        else:
            continue
    print("Videos:\n", "\n".join(video_id), "\n")

def download(args):
    print(args)

"""Arguments for the program."""
if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="This program searches for Youtube links and allows you to download songs from said list. " +
                    "Please remember to be specific in your searches for best results.")
    parser.add_argument("--q", help="Search term", default="Aeryes")
    parser.add_argument("--max-results", help="Max results", default=25)
    parser.add_argument("--d", type=download,
                        help="Download a video from search results.")
    args = parser.parse_args()

    if len(sys.argv) < 2:
        parser.parse_args(['--help'])
        sys.exit(1)

    try:
        yt_search(args)
    except HttpError:
        print("HTTP error")
The problem I am having is that running with --d on the command line works and prints the argument as expected (this is just a test to check that the functions work with the parser), but afterwards it also prints a list of YouTube links from the --q default, which I do not want. How do I stop this from happening? Should I use a subparser, or is there something I am missing?
If anyone has good resources for the argparse module other than the official docs, please share them.
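On the subparser question: one way to keep --d from triggering the default search is to give each action its own subcommand, so the search code only runs when explicitly requested. A rough sketch of the idea; the 'search' and 'download' command names and the video_id argument are made up for illustration, and yt_search/download refer to the functions from the question:

import argparse

parser = argparse.ArgumentParser(description="Search YouTube and download results.")
subparsers = parser.add_subparsers(dest="command")

# Each subcommand gets its own argument set, so defaults from one
# command never leak into another.
search_p = subparsers.add_parser("search", help="Search for videos")
search_p.add_argument("--q", default="Aeryes", help="Search term")
search_p.add_argument("--max-results", type=int, default=25, help="Max results")

download_p = subparsers.add_parser("download", help="Download a video")
download_p.add_argument("video_id", help="ID of the video to download")

args = parser.parse_args()
if args.command == "search":
    yt_search(args)           # runs only for the 'search' subcommand
elif args.command == "download":
    download(args.video_id)
else:
    parser.print_help()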

Save HTML Source Code to File

How can I copy the source code of a website into a text file in Python 3?
EDIT:
To clarify my issue, here's what I have:
import urllib.request

def extractHTML(url):
    f = open('temphtml.txt', 'w')
    page = urllib.request.urlopen(url)
    pagetext = page.read()
    f.write(pagetext)
    f.close()

extractHTML('http:www.google.com')
I get the following error for the f.write() function:
builtins.TypeError: must be str, not bytes
import urllib.request
site = urllib.request.urlopen('http://somesite.com')
data = site.read()
file = open("file.txt","wb") #open file in binary mode
file.writelines(data)
file.close()
Untested but should work.
EDIT: Updated for python3
Try this.
import urllib.request

def extractHTML(url):
    urllib.request.urlretrieve(url, 'temphtml.txt')
That is easier, but if you still want to do it your way, this is the solution:
import urllib.request

def extractHTML(url):
    f = open('temphtml.txt', 'w')
    page = urllib.request.urlopen(url)
    pagetext = str(page.read())
    f.write(pagetext)
    f.close()

extractHTML('https://www.google.com')
Your script gave an error saying it must be str, not bytes: just convert the bytes to a string with str().
Next I got an error saying no host was given. Google is a secured site, so use https: rather than http:, and most importantly you forgot to include the // after it.
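A small caveat about the str() conversion (not from the original answer): str(page.read()) keeps the b'...' wrapper and escape sequences in the written file. Decoding the bytes is the more common route, assuming the page is UTF-8:

pagetext = page.read().decode('utf-8')  # bytes -> str without the b'...' wrapper
f.write(pagetext)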
Probably you wanted to create something like this:
import urllib.request

class ExtractHtml():
    def Page(self):
        print("enter the web page name starting with 'http://': ")
        url = input()
        site = urllib.request.urlopen(url)
        data = site.read()
        file = open("D://python_projects/output.txt", "wb")
        file.write(data)
        file.close()

w = ExtractHtml()
w.Page()
