Download multiple Dropbox zip files from csv file - python-3.x

I have a .csv file that contains ~100 links to Dropbox files. The method I currently use downloads the files without the ?dl=0 suffix, which seems to be critical:
#import packages
import pandas as pd
import wget
#read the .csv file, iterate through each row and download it
data = pd.read_csv("BRAIN_IMAGING_SUMSTATS.csv")
for index, row in data.iterrows():
    print(row['Links'])
    filename = row['Links']
    wget.download(filename)
Output:
https://www.dropbox.com/s/xjtu071g7o6gimg/metal_roi_volume_dec12_2018_pheno1.txt.zip?dl=0
https://www.dropbox.com/s/9oc9j8zhd4mn113/metal_roi_volume_dec12_2018_pheno2.txt.zip?dl=0
https://www.dropbox.com/s/0jkdrb76i7rixa5/metal_roi_volume_dec12_2018_pheno3.txt.zip?dl=0
https://www.dropbox.com/s/gu5p46bakgvozs5/metal_roi_volume_dec12_2018_pheno4.txt.zip?dl=0
https://www.dropbox.com/s/8zfpfscp8kdwu3h/metal_roi_volume_dec12_2018_pheno5.txt.zip?dl=0
These look like the correct links, but the downloaded files are named metal_roi_volume_dec12_2018_pheno1.txt.zip instead of metal_roi_volume_dec12_2018_pheno1.txt.zip?dl=0, so I cannot unzip them. Any ideas how to download the actual Dropbox files?

By default (without extra URL parameters, or with dl=0 like in your example), Dropbox shared links point to an HTML preview page for the linked file, not the file data itself. Your code as-is will download the HTML, not the actual zip file data.
You can modify these links for direct file access though, as documented in this Dropbox help center article.
So, you should modify the link, e.g., to use raw=1 instead of dl=0, before calling wget.download on it.

A quick fix would be something like:
#import packages
import pandas as pd
import wget
import os
from urllib.parse import urlparse
#read the .csv file, iterate through each row and download it
data = pd.read_csv("BRAIN_IMAGING_SUMSTATS.csv")
for index, row in data.iterrows():
    print(row['Links'])
    filename = row['Links']
    parsed = urlparse(filename)
    fname = os.path.basename(parsed.path)  # e.g. metal_roi_volume_dec12_2018_pheno1.txt.zip
    wget.download(filename, fname)
Basically, you extract the filename from the URL and then use that filename as the output parameter in the wget.download call.
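Putting the two pieces together, a sketch of the whole loop might look like this: rewrite dl=0 to dl=1 (or raw=1) so Dropbox serves the file data rather than the preview page, and pass an explicit output filename to wget.download. This assumes the CSV column is named Links, as in the question, and is a starting point rather than a drop-in solution.
#a sketch, not a drop-in solution: assumes the CSV column is named 'Links'
import os
from urllib.parse import urlparse
import pandas as pd
import wget

data = pd.read_csv("BRAIN_IMAGING_SUMSTATS.csv")
for _, row in data.iterrows():
    link = row['Links']
    direct_link = link.replace("dl=0", "dl=1")  # or raw=1, per the Dropbox help article
    fname = os.path.basename(urlparse(direct_link).path)  # e.g. metal_roi_volume_dec12_2018_pheno1.txt.zip
    wget.download(direct_link, fname)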

Related

For loop to download and extract zip files from url

Does anyone have suggestions for why I can't get this code to do what I want? I'm trying to write a script that will save me several hours each week. I need to download 83 zip files, extract them, import them into ArcGIS Pro, run them through a series of geoprocessing tools, and then compile the results. Right now I'm doing this manually, and I'd love to automate the process as much as possible.
I can use the following snippet of code to download and extract one file. I can't seem to get it to work with a for loop though.
import requests, zipfile
from io import BytesIO
url = 'https://www.deq.state.mi.us/gis-data/downloads/waterwells/Alcona_WaterWells.zip'
filename = url.split('/')[-1]
req = requests.get(url)
zipfile = zipfile.ZipFile(BytesIO(req.content))
zipfile.extractall(r'C:\Users\UserName\Downloads\Water_Wells')
I have created a list of all 83 urls. These don't change, and their content is updated regularly. This for loop only returns the first county, just like the snippet above. I'm only including a few of the urls here.
url_list = ['https://www.deq.state.mi.us/gis-data/downloads/waterwells/Alcona_WaterWells.zip',
'https://www.deq.state.mi.us/gis-data/downloads/waterwells/Alger_WaterWells.zip',
'https://www.deq.state.mi.us/gis-data/downloads/waterwells/Allegan_WaterWells.zip']
for link in url_list:
    filename = url.split('/')[-1]
    req = requests.get(url)
    zipfile = zipfile.ZipFile(BytesIO(req.content))
    zipfile.extractall(r'C:\Users\UserName\Downloads\Water_Wells')
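For what it's worth, a likely explanation is visible in the snippet itself: the loop body still references url (left over from the single-file example) instead of the loop variable link, so every pass fetches the same Alcona file, and rebinding the name zipfile to a ZipFile instance shadows the imported module after the first iteration. A minimal corrected sketch, keeping the same extraction folder:
import requests, zipfile
from io import BytesIO

for link in url_list:
    req = requests.get(link)                    # use the loop variable, not the old url
    zf = zipfile.ZipFile(BytesIO(req.content))  # avoid shadowing the zipfile module
    zf.extractall(r'C:\Users\UserName\Downloads\Water_Wells')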

Python: Access a zipped XL file without extracting it

Is there a way I can open and process the Excel file within a zip file without first extracting it? I am not interested in modifying it.
from zipfile import ZipFile
from openpyxl import load_workbook
procFile ="C:\\Temp2\\XLFile-Demo-PW123.zip"
xl_file = "XLFile-Demo.xlsx"
myzip = ZipFile(procFile)
myzip.setpassword(bytes('123', 'utf-8'))
# line below returns an error
with load_workbook(myzip.open(xl_file)) as wb_obj:
    print(wb_obj.sheetnames)
Most of the examples that do this only open text files directly.
I would like to simulate the behaviour of archiving programs such as WinRar and 7zip.
Thanks
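One possible way to do this without extracting to disk, sketched under two assumptions: the zip uses the legacy ZipCrypto password scheme (Python's zipfile cannot decrypt AES-encrypted archives such as 7-Zip's default), and reading the member fully into memory is acceptable. The with statement around load_workbook is also dropped here, since the workbook object is not a context manager as far as I know, which may be the source of the error.
from io import BytesIO
from zipfile import ZipFile
from openpyxl import load_workbook

procFile = "C:\\Temp2\\XLFile-Demo-PW123.zip"
xl_file = "XLFile-Demo.xlsx"

with ZipFile(procFile) as myzip:
    # read the encrypted member into memory (works only for ZipCrypto, not AES)
    data = myzip.read(xl_file, pwd=bytes('123', 'utf-8'))

wb_obj = load_workbook(BytesIO(data), read_only=True)
print(wb_obj.sheetnames)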

pandas : read_csv not accepting relative path

I have Python code in a Jupyter notebook and the accompanying data in the same folder. I will be bundling both the code and the data into a zip file and submitting it for evaluation. I am trying to read the data inside the notebook using pandas.read_csv with a relative path, and that's not working; the API doesn't seem to work with a relative path. What is the correct way to handle this?
Update:
My findings so far seem to suggest that I should be using os.chdir() to set the current working directory. But I won't know where the zip file will get extracted. The code is supposed to be read-only, so I cannot expect the receiver to update the path as appropriate.
You could join the current working directory with the relative path to avoid the problem, like so:
import os
import pandas as pd
BASE_DIR = os.getcwd()
csv_path = "csvname.csv"
df = pd.read_csv(os.path.join(BASE_DIR, csv_path))
where csv_path is the relative path.
I think you first need to unzip the file, then you can run it.
You can use the code below to unzip the file:
from zipfile import ZipFile
file_name = "folder_name.zip"
with ZipFile(file_name, 'r') as zip:
    zip.extractall()
    print("Done !")

How to match a list to file names, then move matched files to new directory in Python?

I have a folder of 90,000 PDF documents with sequential numeric titles (e.g. 02.100294.PDF). I have a list of around 70,000 article titles drawn from this folder. I want to build a Python program that matches titles from the list to titles in the folder and then moves the matched files to a new folder.
For example, say I have the following files in "FOLDER":
1.100.PDF
1.200.PDF
1.300.PDF
1.400.PDF
Then, I have a list of the following titles:
1.200.PDF
1.400.PDF
I want a program that matches the two document titles from the list (1.200 and 1.400) to the documents in FOLDER, and then moves these two files to "NEW_FOLDER".
Any idea how to do this in Python?
Thank you!
EDIT: This is the code I currently have. The source directory is 'scr', and 'dst' is the new destination. 'conden_art' is the list of files I want to move. I am trying to see whether each file in 'scr' matches a name listed in 'conden_art', and if it does, I want to move it to 'dst'. Right now, the code finds no matches and only prints 'done'. This issue is different from just moving files, because I need to match file names against a list and then move them.
import shutil
import os
for file in scr:
    if filename in conden_art:
        shutil.copy(scr, dst)
    else:
        print('done')
SOLVED!
Here is the code I used that ended up working. Thanks for all of your help!
import shutil
import os
import pandas as pd
scr = filepath-1
dst = filepath-2
files = os.listdir(scr)
for f in files:
    if f in conden_art:
        shutil.move(scr + '\\' + f, dst)
Here's a way to do it -
from os import listdir
from os.path import isfile, join
import shutil
files = [f for f in listdir(src) if isfile(join(src, f))] # this is your list of files at the source path
for i in Conden_art:
    if i in files:
        shutil.move(i, dst + i)  # moving the files in conden_art to dst/
src and dst here are your paths for source and destination. Make sure you are at the src path before running the for loop. Otherwise, python will be unable to find the file.
Rather than looping through the files in the source directory it would be quicker to loop through the filenames you already have. You can use os.path.exists() to check if a file is available to be moved.
from os import path
import shutil
for filename in conden_art:
    src_fp, dst_fp = path.join(src, filename), path.join(dst, filename)
    if path.exists(src_fp):
        shutil.move(src_fp, dst_fp)
        print(f'{src_fp} moved to {dst}')
    else:
        print(f'{src_fp} does not exist')
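A small addition that may help with any of these variants (assuming dst is a directory path that might not exist yet): create the destination folder up front so the moves have somewhere to land.
import os
os.makedirs(dst, exist_ok=True)  # create the destination folder if it is missing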

read_csv one file from several files in a gzip?

I have several files in my tar.gz archive. I want to read only one of them into a pandas data frame. Is there any way to do that?
Pandas can read a file inside a gz, but there seems to be no way to tell it to read a specific one when there are several files inside.
Would appreciate any thoughts.
Babak
To read a specific file in a compressed archive, we just need to give its name or position. For example, to read a specific csv file inside a zip, we can open that member and read its content:
from zipfile import ZipFile
import pandas as pd
# opening the zip file in READ mode
with ZipFile("results.zip") as z:
    read = pd.read_csv(z.open(z.infolist()[2].filename))
    print(read)
Here is what the contents of results.zip look like; I want to read test.csv:
$ data_description.txt sample_submission.csv test.csv train.csv
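The question mentions a tar.gz rather than a zip; for that case the standard library tarfile module can hand pandas a file object for a single member. A sketch, assuming the archive is named results.tar.gz and contains a member called test.csv:
import tarfile
import pandas as pd

# open the archive, pull out one member as a file object, and let pandas read it
with tarfile.open("results.tar.gz", "r:gz") as tar:
    with tar.extractfile("test.csv") as f:  # returns a readable file-like object
        df = pd.read_csv(f)
print(df.head())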
If you use pardata, you can do this in one line:
import pardata
data = pardata.load_dataset_from_location('path-to-zip.zip')['table/csv']
The returned data variable should be a dictionary of all csv files in the zip archive.
Disclaimer: I'm one of the main co-authors of pardata.
