For loop to download and extract zip files from url - python-3.x

Does anyone have suggestions for why I can't get this code to do what I want? I'm trying to write a script that will save me several hours each week. I need to download 83 zip files, extract them, import them into ArcGIS Pro, run them through a series of geoprocessing tools, and then compile the results. Right now I'm doing this manually, and I'd love to automate the process as much as possible.
I can use the following snippet of code to download and extract one file. I can't seem to get it to work with a for loop though.
import requests, zipfile
from io import BytesIO
url = 'https://www.deq.state.mi.us/gis-data/downloads/waterwells/Alcona_WaterWells.zip'
filename = url.split('/')[-1]
req = requests.get(url)
zipfile = zipfile.ZipFile(BytesIO(req.content))
zipfile.extractall(r'C:\Users\UserName\Downloads\Water_Wells')
I have created a list of all 83 URLs. The URLs themselves don't change, but their content is updated regularly. This for loop only returns the first county, just like the snippet above. I'm only including a few of the URLs here.
url_list = ['https://www.deq.state.mi.us/gis-data/downloads/waterwells/Alcona_WaterWells.zip',
            'https://www.deq.state.mi.us/gis-data/downloads/waterwells/Alger_WaterWells.zip',
            'https://www.deq.state.mi.us/gis-data/downloads/waterwells/Allegan_WaterWells.zip']
for link in url_list:
    filename = url.split('/')[-1]
    req = requests.get(url)
    zipfile = zipfile.ZipFile(BytesIO(req.content))
    zipfile.extractall(r'C:\Users\UserName\Downloads\Water_Wells')
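For reference, here is a sketch of how the loop likely needs to look for every file to download: the body should reference the loop variable link rather than url, and binding the ZipFile object to a new name avoids shadowing the zipfile module after the first pass.
import requests, zipfile
from io import BytesIO

url_list = ['https://www.deq.state.mi.us/gis-data/downloads/waterwells/Alcona_WaterWells.zip',
            'https://www.deq.state.mi.us/gis-data/downloads/waterwells/Alger_WaterWells.zip',
            'https://www.deq.state.mi.us/gis-data/downloads/waterwells/Allegan_WaterWells.zip']

for link in url_list:
    filename = link.split('/')[-1]               # use the loop variable, not url
    req = requests.get(link)
    zf = zipfile.ZipFile(BytesIO(req.content))   # a new name, so the zipfile module isn't shadowed
    zf.extractall(r'C:\Users\UserName\Downloads\Water_Wells')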

Related

When I parse a large XML sitemap with BeautifulSoup in Python, it only parses part of the file

I have written code that pulls the URLs out of a very large sitemap XML file (10 MB) using Beautiful Soup, and it works exactly how I want, but it only seems to process a small portion of the overall file. This is my code:
`sitemap = "sitemap1.xml"
from bs4 import BeautifulSoup as bs
import lxml
content = []
with open(sitemap, "r") as file:
# Read each line in the file, readlines() returns a list of lines
content = file.readlines()
# Combine the lines in the list into a string
content = "".join(content)
bs_content = bs(content, "xml")
result = bs_content.find_all("loc")
for result in result:
print(result.text)
`
I have changed my IDE settings to allow for larger files; it just seems to start the process at a random point towards the end of the XML file and only extracts from there on.
I just wanted to say I ended up sorting this out. I used the read_xml function in pandas and it worked well; the original XML file was corrupted.
... I also realised that the console was just printing from a certain point because it's such a large file, and it was actually still processing the whole file.
Sorry about this - I'm new :)
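For reference, a minimal sketch of that pandas approach (assuming pandas 1.3+ and the standard sitemap namespace; read_xml flattens each url entry into a row with a loc column):
import pandas as pd

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}   # standard sitemap namespace (an assumption about this file)
df = pd.read_xml("sitemap1.xml", xpath="//sm:url", namespaces=ns)
for url in df["loc"]:
    print(url)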

Are there alternate ways to download PDFs from the internet without them being corrupted?

I have written code for a web scraper program, as follows (in Python):
import requests, bs4  # you probably need to install requests and bs4; search online for "beautiful soup 4 installation" and "requests installation"

link_list = []
res = requests.get('https://scholar.google.com/scholar?start=0&q=traffic+light+reinforcement+learning&hl=en&as_sdt=1,5&as_ylo=2019&as_vis=1')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a'):
    if('pdf' in link.get('href')):
        link_list.append(link.get('href'))
for x in range(1,100):
    i = str(x*10)
    url = f'https://scholar.google.com/scholar?start={i}&q=traffic+light+reinforcement+learning&hl=en&as_sdt=1,5&as_ylo=2019&as_vis=1'
    res_2 = requests.get(url)
    soup = bs4.BeautifulSoup(res_2.text, 'html.parser')
    for link in soup.find_all('a'):
        if('pdf' in link.get('href')):
            link_list.append(link.get('href'))
if(link_list):
    for x in range(0,len(link_list)):
        res_3 = requests.get(link_list[x])
        # the path below is specific to my computer; set it to a path that exists
        # on yours, keeping the {x}.pdf at the end
        with open(f'/Users/atharvanaik/Desktop/Cursed/{x}.pdf', 'wb') as f:
            f.write(res_3.content)
        print(x)
else:
    print('sorry, unavailable')
For context, I am trying to bulk download PDFs from a Google Scholar search instead of doing it manually.
I managed to download the vast majority of the PDFs, but some of them, when I tried opening them, gave me this message:
"It may be damaged or use a file format that Preview doesn’t recognise."
As seen in the above code, I am using requests to download the content and write it to the file. Is there a way to work around this?
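One thing worth checking (a sketch, not a guaranteed fix): many links found this way point at HTML landing or abstract pages rather than raw PDF data, and writing that HTML into a .pdf file produces exactly the "damaged" error Preview reports. Checking the response before saving separates the two cases:
import requests

def save_if_pdf(url, path):
    # download, but only save the response if it actually looks like a PDF
    resp = requests.get(url, timeout=30)
    content_type = resp.headers.get("Content-Type", "")
    if "pdf" in content_type.lower() or resp.content[:4] == b"%PDF":   # real PDFs start with the %PDF magic bytes
        with open(path, "wb") as f:
            f.write(resp.content)
        return True
    return False

# usage, e.g.: save_if_pdf(link_list[x], f'{x}.pdf')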

Download multiple Dropbox zip files from csv file

I have a .csv file that contains ~100 links to Dropbox files. The current method I have downloads the files but drops the ?dl=0 suffix, which seems to be critical.
# import packages
import pandas as pd
import wget

# read the .csv file, iterate through each row and download it
data = pd.read_csv("BRAIN_IMAGING_SUMSTATS.csv")
for index, row in data.iterrows():
    print(row['Links'])
    filename = row['Links']
    wget.download(filename)
Output:
https://www.dropbox.com/s/xjtu071g7o6gimg/metal_roi_volume_dec12_2018_pheno1.txt.zip?dl=0
https://www.dropbox.com/s/9oc9j8zhd4mn113/metal_roi_volume_dec12_2018_pheno2.txt.zip?dl=0
https://www.dropbox.com/s/0jkdrb76i7rixa5/metal_roi_volume_dec12_2018_pheno3.txt.zip?dl=0
https://www.dropbox.com/s/gu5p46bakgvozs5/metal_roi_volume_dec12_2018_pheno4.txt.zip?dl=0
https://www.dropbox.com/s/8zfpfscp8kdwu3h/metal_roi_volume_dec12_2018_pheno5.txt.zip?dl=0
These look like the correct links, but the downloaded files are named metal_roi_volume_dec12_2018_pheno1.txt.zip instead of metal_roi_volume_dec12_2018_pheno1.txt.zip?dl=0, so I cannot unzip them. Any ideas how to download the actual Dropbox files?
By default (without extra URL parameters, or with dl=0 like in your example), Dropbox shared links point to an HTML preview page for the linked file, not the file data itself. Your code as-is will download the HTML, not the actual zip file data.
You can modify these links for direct file access though, as documented in this Dropbox help center article.
So, you should modify the link, e.g., to use raw=1 instead of dl=0, before calling wget.download on it.
Quick fix would be something like:
# import packages
import pandas as pd
import wget
import os
from urllib.parse import urlparse

# read the .csv file, iterate through each row and download it
data = pd.read_csv("BRAIN_IMAGING_SUMSTATS.csv")
for index, row in data.iterrows():
    print(row['Links'])
    filename = row['Links']
    # switch the shared link to direct file access, as described above
    filename = filename.replace("dl=0", "raw=1")
    # use the path component of the URL as the local file name
    parsed = urlparse(filename)
    fname = os.path.basename(parsed.path)
    wget.download(filename, fname)
Basically, you switch the link to raw=1 for direct access, extract the filename from the URL path, and then use that filename as the output param in the wget.download fn.
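For reference, the same idea also works without wget; this sketch uses requests and dl=1, which Dropbox likewise documents as forcing a direct download (the CSV name and Links column are taken from the question):
import os
import pandas as pd
import requests
from urllib.parse import urlparse

data = pd.read_csv("BRAIN_IMAGING_SUMSTATS.csv")
for link in data["Links"]:
    direct = link.replace("dl=0", "dl=1")              # dl=1 forces a direct download instead of the preview page
    fname = os.path.basename(urlparse(direct).path)    # e.g. metal_roi_volume_dec12_2018_pheno1.txt.zip
    resp = requests.get(direct)
    resp.raise_for_status()
    with open(fname, "wb") as f:
        f.write(resp.content)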

How to specify the link of the file I want to download in Python 3

I found several questions related to my question but none helped me.
I have to download .hdf files covering several years. For this year alone there are at least 3 files per month.
I tried looping in the shell, but it would take forever to download all the files, and besides that it would eat up my bandwidth.
When I access the page, I need to find only the link of the file I want to download. That link varies (date, time, Julian day, and seconds), but its beginning and ending are the same every year.
In the link below I show the site:
view-source:https://e4ftl01.cr.usgs.gov/MOLT/MOD09A1.005/2010.02.02/
Where it varies:
https://e4ftl01.cr.usgs.gov/MOLT/MOD09A1.005/ year . month . day/
Below I show the link that I need to get.
MOD09A1.A2010033.h13v12.005.2010042232737.hdf
However, this file name varies greatly.
MOD09A1. varies .h13v12.005. varies .hdf
One of my problems is how to make Python match just the parts that do not vary and fill in the parts that do. Another is how to download an .hdf file. And finally, how do I supply a login and password (they are required to download the file)? I'm new to Python and I believe this can be done, but if not, I'll unfortunately have to download everything by hand.
My script is:
import requests
import wget
import os
from urllib.parse import urlparse
from bs4 import BeautifulSoup, SoupStrainer
url = 'https://e4ftl01.cr.usgs.gov/MOLT/MOD09A1.005/2013.08.21/'
resp = requests.get(url)
html = resp.text
soup = BeautifulSoup(html, 'lxml')
tags = soup.find_all('a')
t = tags[471] # where is the name I want
down = t.attrs['href']
full_path = url + down
r = requests.post(full_path, auth=('lucas', 'xxxx'), verify=True)
r.status_code
with open(full_path, 'wb') as fd:
    for chunk in r.iter_content():
        fd.write(chunk)
But my r.status_code is still 401; it is not connecting and not downloading the .hdf file.
Thanks!
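A sketch of one way to approach the file-name matching part: anchor the parts of the name that never change in a regular expression and let the variable parts match anything. The credentials below are placeholders, and this assumes plain HTTP basic auth is enough; NASA Earthdata logins often need extra redirect or session handling, which could explain the 401 and is not shown here.
import re
import requests
from bs4 import BeautifulSoup

url = 'https://e4ftl01.cr.usgs.gov/MOLT/MOD09A1.005/2010.02.02/'
pattern = re.compile(r'MOD09A1\..+\.h13v12\.005\..+\.hdf$')   # fixed parts literal, variable parts as .+

resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'lxml')
names = [a['href'] for a in soup.find_all('a', href=True) if pattern.match(a['href'])]

for name in names:
    r = requests.get(url + name, auth=('user', 'password'), stream=True)   # placeholder credentials
    r.raise_for_status()
    with open(name, 'wb') as fd:          # save locally under the file's own name, not the full URL
        for chunk in r.iter_content(chunk_size=8192):
            fd.write(chunk)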

Read text files from website with Python

Hello, I have a problem. I want to get all the data from a web page, but it is too huge to save to a variable. I currently get the data like this:
r = urlopen("http://download.cathdb.info/cath/releases/all-releases/v4_2_0/cath-classification-data/cath-domain-list-v4_2_0.txt")
r = BeautifulSoup(r, "lxml")
r = r.p.get_text()
some operations
This was working well until I had to get data from this website:
http://download.cathdb.info/cath/releases/all-releases/v4_2_0/cath-classification-data/cath-domain-description-file-v4_2_0.txt
When I run the same code as above on this page, my program stops at the line
r = BeautifulSoup(r, "lxml")
and it takes forever; nothing happens. I don't know how to get all of this data without saving it to a file, so that I can search it for keywords and print them. I can't save it to a file; I have to get it from the website.
I will be very thankful for any help.
I think the code below can do what you want. As mentioned in a comment by @alecxe, you don't need to use BeautifulSoup at all. This is really a matter of reading the contents of an online text file, which is answered in "Given a URL to a text file, what is the simplest way to read the contents of the text file?"
from urllib.request import urlopen

r = urlopen("http://download.cathdb.info/cath/releases/all-releases/v4_2_0/cath-classification-data/cath-domain-list-v4_2_0.txt")
for line in r:
    do_something(line)  # each line arrives as a bytes object; decode it if you need str
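If the goal is to search for keywords while streaming, the same loop can do the filtering directly; a sketch (the keyword here is only a placeholder):
from urllib.request import urlopen

url = "http://download.cathdb.info/cath/releases/all-releases/v4_2_0/cath-classification-data/cath-domain-description-file-v4_2_0.txt"
keyword = "DOMAIN"   # placeholder keyword

with urlopen(url) as r:
    for raw_line in r:
        line = raw_line.decode("utf-8", errors="replace")   # lines arrive as bytes
        if keyword in line:
            print(line.rstrip())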
