Download video in Python using requests module results in black video - python-3.x

This is the URL I tried to download: https://www.instagram.com/p/B-jEqo9Bgk9/?utm_source=ig_web_copy_link
This is a minimal reproducible example:
import os
import requests
def main():
    filename = 'test.mp4'
    r = requests.get('https://www.instagram.com/p/B-jEqo9Bgk9/?utm_source=ig_web_copy_link', stream=True)
    with open(os.path.join('.', filename), 'wb') as f:
        print('Dumping "{0}"...'.format(filename))
        for chunk in r.iter_content(chunk_size=1024):
            print(chunk)
            if chunk:
                f.write(chunk)
                f.flush()
if __name__ == '__main__':
    main()
The code runs fine but the video does not play. What am I doing wrong?

Your code is running perfectly fine, but you did not provide the correct link for the video. The link you used is for the Instagram web page, not the video. So you should not save the content as 'test.mp4', but rather as 'test.html'. If you open the file in a text editor (for example Notepad++), you will see that it contains the HTML code of the web page.
You'll need to parse the HTML to acquire the actual video URL, and then you can use the same code to download the video using that URL.
Currently, the <meta property="og:video" content="..."> tag in that HTML holds the actual video URL, but that may change in the future.
(Note that copyright may apply for videos on Instagram. I assume you have the rights to download and save this video.)
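The parsing step described above could be sketched as follows. This is a minimal sketch, assuming the page still exposes an og:video meta tag and serves it without a login; the names OgVideoParser, extract_og_video_url and download_video are hypothetical, and the sample HTML uses a made-up example.com URL. The parser comes from the standard library so the extraction part runs without extra packages.

```python
from html.parser import HTMLParser


class OgVideoParser(HTMLParser):
    """Collects the content of the first <meta property="og:video"> tag."""

    def __init__(self):
        super().__init__()
        self.video_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.video_url is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:video":
            self.video_url = attributes.get("content")


def extract_og_video_url(html):
    """Return the og:video URL found in html, or None if there is none."""
    parser = OgVideoParser()
    parser.feed(html)
    return parser.video_url


def download_video(page_url, filename):
    """Fetch the page, find the video URL, and stream the video to filename."""
    import requests  # imported here so the parsing helper runs without it

    page = requests.get(page_url, timeout=30)
    page.raise_for_status()
    video_url = extract_og_video_url(page.text)
    if video_url is None:
        raise ValueError("no og:video meta tag found in the page")
    with requests.get(video_url, stream=True, timeout=30) as r:
        r.raise_for_status()
        with open(filename, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)


# Made-up sample page, only to show the extraction step.
sample = '<html><head><meta property="og:video" content="https://example.com/v.mp4"></head></html>'
print(extract_og_video_url(sample))
```

Calling download_video(page_url, 'test.mp4') with the page link would then save the video itself rather than the HTML, provided the tag is present in the response.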

Related

Python Request: Image Downloaded via my code becomes corrupted image when opened in photoshop

Attached is my script that I used to pass each link I've collected using Python Selenium and then used requests.get() to download the images.
# Download Images
for i, idv_link in zip(range(len(all_links)), all_links):
    r = requests.get(idv_link, stream=True)
    file_name = f"{str(sku_no) + str(i)}.jpg"
    # print(file_name)
    if r.status_code == 200:
        file_path = os.path.join(SKU_file, str(file_name))
        with open(file_path, 'wb') as f:
            for chunk in r:
                f.write(chunk)
    i += 1
print(f'Downloaded {len(all_links)} images')
The problem is that every image downloaded this way opens on any device and can even be uploaded to Google Drive, but Photoshop reports the files as corrupted and will not open them.
Is this really a requests.get() issue? If so, I may have to re-code this to simulate a right-click on the link and download the image through Python Selenium.
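One way to narrow this down is to look at the first bytes of a downloaded file instead of trusting its .jpg extension. The sketch below rests on an assumption, not a confirmed diagnosis: some servers deliver WebP or PNG data under a .jpg URL, which lenient viewers open and stricter programs may reject. The helper names sniff_image_format and sniff_file are hypothetical.

```python
def sniff_image_format(data):
    """Guess the image format from the leading magic bytes of data."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    # WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return "unknown"


def sniff_file(path):
    """Read the first 12 bytes of the file at path and guess its format."""
    with open(path, "rb") as f:
        return sniff_image_format(f.read(12))
```

If sniff_file reports something other than "jpeg" for the saved files, the download is probably intact and the extension is simply wrong for the data.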

Using Python to save a file that is sent to browser when visiting URL

When I visit the URL below in a browser, it automatically downloads a CSV. As the contents are updated daily, I want to write a Python command to get the latest file each time.
I've tried wget, requests and urllib.request - all without luck.
url = 'https://coronavirus.data.gov.uk/api/v1/data?filters=areaType=overview&structure=%7B%22areaType%22:%22areaType%22,%22areaName%22:%22areaName%22,%22areaCode%22:%22areaCode%22,%22date%22:%22date%22,%22newPeopleVaccinatedFirstDoseByPublishDate%22:%22newPeopleVaccinatedFirstDoseByPublishDate%22,%22cumPeopleVaccinatedFirstDoseByPublishDate%22:%22cumPeopleVaccinatedFirstDoseByPublishDate%22%7D&format=csv'
Anyone got any ideas? TIA
This works just fine for me:
import requests
url = 'https://coronavirus.data.gov.uk/api/v1/data?filters=areaType=overview&structure=%7B%22areaType%22:%22areaType%22,%22areaName%22:%22areaName%22,%22areaCode%22:%22areaCode%22,%22date%22:%22date%22,%22newPeopleVaccinatedFirstDoseByPublishDate%22:%22newPeopleVaccinatedFirstDoseByPublishDate%22,%22cumPeopleVaccinatedFirstDoseByPublishDate%22:%22cumPeopleVaccinatedFirstDoseByPublishDate%22%7D&format=csv'
r = requests.get(url)
with open("uk_data.csv", "wb") as f:
    f.write(r.content)
The content is a bytes object, so you need to open the file in binary mode.
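The point about binary mode can be checked without any network access. In this small sketch the payload variable is a made-up stand-in for r.content, and the file goes to a temporary directory:

```python
import os
import tempfile

# Made-up bytes standing in for r.content.
payload = "date,count\n2021-01-01,5\n".encode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# Text mode expects str, so writing bytes raises TypeError.
try:
    with open(path, "w") as f:
        f.write(payload)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = exc

# Binary mode accepts the bytes unchanged.
with open(path, "wb") as f:
    f.write(payload)

with open(path, "rb") as f:
    round_trip = f.read()

print(type(text_mode_error).__name__, round_trip == payload)
```

The alternative is to write r.text, which is a decoded str, to a file opened in text mode.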

are there alternate ways to download pdfs from the internet without them being corrupted?

I have written code for a web scraper program that is as follows (in python) -
import requests, bs4  # you probably need to install requests and bs4 first; search online for the Beautiful Soup 4 and requests installation instructions
link_list = []
res = requests.get('https://scholar.google.com/scholar?start=0&q=traffic+light+reinforcement+learning&hl=en&as_sdt=1,5&as_ylo=2019&as_vis=1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a'):
    href = link.get('href')
    if href and 'pdf' in href:  # skip <a> tags that have no href
        link_list.append(href)
for x in range(1, 100):
    i = str(x*10)
    url = f'https://scholar.google.com/scholar?start={i}&q=traffic+light+reinforcement+learning&hl=en&as_sdt=1,5&as_ylo=2019&as_vis=1'
    res_2 = requests.get(url)
    soup = bs4.BeautifulSoup(res_2.text, 'html.parser')
    for link in soup.find_all('a'):
        href = link.get('href')
        if href and 'pdf' in href:
            link_list.append(href)
if link_list:
    for x in range(0, len(link_list)):
        res_3 = requests.get(link_list[x])
        # The first argument of open() is a file path that is available only on my computer.
        # Set it to something that is accessible on your computer. Your final path should be
        # Something\something\something\{x}.pdf (do not change the last part!).
        with open(f'/Users/atharvanaik/Desktop/Cursed/{x}.pdf', 'wb') as f:
            f.write(res_3.content)
        print(x)
else:
    print('sorry, unavailable')
For context, I am trying to bulk download PDFs from a Google Scholar search instead of doing it manually.
I managed to download the vast majority of the PDFs, but some of them gave me this message when I tried to open them:
"It may be damaged or use a file format that Preview doesn’t recognise."
As seen in the code above, I am using requests to download the content and write it to the file. Is there a way to work around this?
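A possible explanation, offered as an assumption and not verified against these links, is that some URLs containing 'pdf' return an HTML page (a landing page, a login form, or a block notice) instead of a PDF. Checking the response before saving it would reveal that. The helpers looks_like_pdf and save_if_pdf below are hypothetical names for such a check:

```python
def looks_like_pdf(content, content_type=""):
    """Return True if the body starts with the PDF signature.

    content_type is optional and only used to reject obvious HTML replies.
    """
    if "html" in content_type.lower():
        return False
    # PDF files begin with "%PDF-"; tolerate a little leading whitespace.
    return content[:1024].lstrip()[:5] == b"%PDF-"


def save_if_pdf(response, path):
    """Write response.content to path only when it looks like a PDF.

    response is expected to be a requests.Response (or an object with the
    same status_code, headers and content attributes).
    """
    content_type = response.headers.get("Content-Type", "")
    if response.status_code != 200:
        return False
    if not looks_like_pdf(response.content, content_type):
        return False
    with open(path, "wb") as f:
        f.write(response.content)
    return True
```

In the download loop, save_if_pdf(res_3, path) could replace the unconditional write, and the links for which it returns False can be printed for manual inspection.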

How can I download a PDF file from an URL where the PDF is embedded into the HTML?

What I'm trying to do: I want to scrape a web page to get the amount of a financial transaction from a PDF file that is loaded with javascript from a website. Example website: http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg=
When I click the 'View Document' button, the PDF file loads into my browser's window (I'm using Google Chrome). I can right-click on the PDF and save it to my computer, but I want to automate that process by having Selenium (or a similar package) download the file so that I can then process it with OCR.
If I can get it saved, I will be able to do the OCR part (I hope). I just can't get the file saved.
From here, I found and modified this code:
def download_pdf(lnk):
    from selenium import webdriver
    from time import sleep
    options = webdriver.ChromeOptions()
    download_folder = "C:\\Users\\rickc\\Documents\\Scraper2\\screenshots\\"
    profile = {"plugins.plugins_list": [{"enabled": False,
                                         "name": "Chrome PDF Viewer"}],
               "download.default_directory": download_folder,
               "download.extensions_to_open": ""}
    options.add_experimental_option("prefs", profile)
    print("Downloading file from link: {}".format(lnk))
    driver = webdriver.Chrome(chrome_options = options)
    driver.get(lnk)
    filename = lnk.split("/")[3].split(".aspx")[0]+".pdf"
    print("File: {}".format(filename))
    print("Status: Download Complete.")
    print("Folder: {}".format(download_folder))
    driver.close()
download_pdf('http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9fVs5YdPg=')
But it isn't working. My old college professor once said, "If you've spent more than two hours on the problem and haven't made headway, it's time to look for help elsewhere." So I'm looking for help.
Other info: The link above will take you to a web page, but you can't access the PDF document until you click on the 'View Document' button. I've tried using Selenium's webdriver.find_element_by_ID('btnDocument').click() to make things happen, and it just loads the page but doesn't do anything with it.
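You can download the PDF using the requests and BeautifulSoup libraries. The page is an ASP.NET form, so the code reads its hidden form fields and posts them back together with the value of the btnDocument button. In the code below, replace /Users/../aaa.pdf with the full path where the document should be saved: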
import requests
from bs4 import BeautifulSoup
url = 'http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg='
response = requests.post(url)
page = BeautifulSoup(response.text, "html.parser")
VIEWSTATE = page.select_one("#__VIEWSTATE").attrs["value"]
VIEWSTATEGENERATOR = page.select_one("#__VIEWSTATEGENERATOR").attrs["value"]
EVENTVALIDATION = page.select_one("#__EVENTVALIDATION").attrs["value"]
btnDocument = page.select_one("[name=btnDocument]").attrs["value"]
data = {
    '__VIEWSTATE': VIEWSTATE,
    '__VIEWSTATEGENERATOR': VIEWSTATEGENERATOR,
    '__EVENTVALIDATION': EVENTVALIDATION,
    'btnDocument': btnDocument
}
response = requests.post(url, data=data)
with open('/Users/../aaa.pdf', 'wb') as f:
    f.write(response.content)
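The same idea, collecting every hidden input of the form instead of naming each field, might look like the sketch below. It is an illustration only: HiddenFieldParser and hidden_fields are hypothetical names, the sample page and its field values are made up, and the standard-library parser is used here so the snippet runs without extra packages.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in html."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Made-up page fragment with made-up values, only to show the result shape.
sample_page = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz" />
  <input type="submit" name="btnDocument" value="View Document" />
</form>
"""
data = hidden_fields(sample_page)
data["btnDocument"] = "View Document"  # the submit button is not hidden, so add it by hand
print(data)
```

The resulting dict has the same shape as the data dict posted above, so it could be passed to requests.post(url, data=data) in the same way.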

In python 3, requests.get().content works to download images, but not for this type of url

I've been using different versions of a web scraper to download anime images from a number of websites I like using beautifulsoup, urllib, and requests.
When I have the image link, I use requests.get(name_of_url).content and write the file to a directory on my computer. This has been working for other sites but not for this new one. With the new site the program runs without errors, but the file is not written correctly: I cannot open it with any image viewer. Here is my code without all of the HTML parsing, just the URL-to-image download section:
import requests
import os
img_data = requests.get("https://cs.sankakucomplex.com/data/ba/bc/babc83a0361198bb43a9b367273b3ef7.jpg?e=1510027320&m=euskBFzOAk-YJJjfbP-26A").content
completename = os.path.join('C:\\', 'Users', 'jesse', '.spyder-py3', 'Image_scraper','sankaku', 'testtesttest.jpg')
with open(completename, 'wb') as handler:
    handler.write(img_data)
I'm fairly certain that the issue comes from the different URL structure this site has. Notice that there is more URL information after the ".jpg?", which the other sites I worked with did not have. I'm open to using urllib2 or another library; I've only been learning to use Python to interface with HTML for the last 2-3 weeks. Any ideas or suggestions are appreciated~
thank you
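The extra part after ".jpg?" is an ordinary query string, and requests sends it along unchanged, so it is worth looking at what it contains. The sketch below only takes the URL apart. Reading e as a Unix expiry timestamp and m as a signature is an assumption, not something the site documents here:

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

url = ("https://cs.sankakucomplex.com/data/ba/bc/"
       "babc83a0361198bb43a9b367273b3ef7.jpg"
       "?e=1510027320&m=euskBFzOAk-YJJjfbP-26A")

parts = urlsplit(url)
query = parse_qs(parts.query)

# The path still ends in .jpg; the query string is a separate component.
print(parts.path)
print(query)

# Assumption: "e" is a Unix timestamp after which the link stops working.
expires = datetime.fromtimestamp(int(query["e"][0]), tz=timezone.utc)
print(expires.isoformat())
```

If that reading is right, a link copied earlier may already be stale by the time the script runs, and the bytes written to disk would be an error page instead of an image. Printing the response's status_code and its Content-Type header before writing the file would confirm or rule this out.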
