I'm new to Python and I'm trying to build a program that downloads and extracts zip files from various websites. I've pasted the two programs I've written to do this. The first program is a "child" program named "urls", which I import into the second program. I'm trying to iterate through each of the urls, and within each url iterate through each data file, and finally check whether anything in the "keywords" list is part of the file name; if yes, download and extract that file. I'm getting stuck on the part where I need to loop through the list of "keywords" to check against the file names I want to download. Would you be able to help? I appreciate any of your suggestions or guidance. Thank you. Andy
**Program #1 called "urls":**
urls = [
    "https://www.dentoncad.com/content/data-extracts/1-appraisal-data-extracts/1-2019/1-preliminary/2019-preliminary"
    "-protax-data.zip",
    "http://www.dallascad.org/ViewPDFs.aspx?type=3&id=//DCAD.ORG\WEB\WEBDATA\WEBFORMS\DATA%20PRODUCTS\DCAD2020_"
    "CURRENT.ZIP"
]
keywords = [
    "APPRAISAL_ENTITY_INFO",
    "SalesExport",
    "account_info",
    "account_apprl_year",
    "res_detail",
    "applied_std_exempt",
    "land",
    "acct_exempt_value"
]
**Program #2 (primary program):**
import requests
import zipfile
import os

import urls


def main():
    print_header()
    dwnld_zfiles_from_web()


def print_header():
    print('---------------------------------------------------------------------')
    print('              DOWNLOAD ZIP FILES FROM THE WEB APP')
    print('---------------------------------------------------------------------')
    print()


def dwnld_zfiles_from_web():
    file_num = 0
    dest_folder = "C:/Users/agbpi/OneDrive/Desktop/test//"
    # loop through each url within the url list, assigning it a unique file number each iteration
    for url in urls.urls:
        file_num = file_num + 1
        url_resp = requests.get(url, allow_redirects=True, timeout=5)
        if url_resp.status_code == 200:
            saved_archive = os.path.basename(url)
            with open(saved_archive, 'wb') as f:
                f.write(url_resp.content)
            # for match in urls.keywords:
            print("Extracting...", url_resp.url)
            with zipfile.ZipFile('file{0}'.format(str(file_num)), "r") as z:
                zip_files = z.namelist()
                # print(zip_files)
                for content in zip_files:
                    while urls.keywords in content:
                        z.extract(path=dest_folder, member=content)
                # while urls.keywords in zip_files:
                #     for content in zip_files:
                #         z.extract(path=dest_folder, member=content)
    print("Finished!")


if __name__ == '__main__':
    main()
Okay, updated answer based on updated question.
Your code is fine until this part:
with zipfile.ZipFile('file{0}'.format(str(file_num)), "r") as z:
    zip_files = z.namelist()
    # print(zip_files)
    for content in zip_files:
        while urls.keywords in content:
            z.extract(path=dest_folder, member=content)
Issue 1
You already have the zip file name as saved_archive, but you try to open something else as a zipfile. Why 'file{0}'.format(str(file_num))? You should just use with zipfile.ZipFile(saved_archive, "r") as z: instead.
Issue 2
while looks a bit like an if statement, but it does not work as a filter (which seems to be what you wanted). What while does is check whether the condition after the while keyword is truthy and, if so, execute the indented code, then re-check the condition and repeat; as soon as the first falsy evaluation kicks in, code execution moves on. So if your condition evaluations were to yield [True, False, True], the first would trigger the indented code to run, the second would cause an exit, and the third would be ignored because of the previous exit.
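For illustration, here is a minimal sketch (with made-up data) of the difference:

words = ["apple", "banana", "apple"]

for w in words:
    if w == "apple":        # evaluated independently for each item: a filter
        print("match:", w)  # runs for the first and third item

# A while whose condition never changes inside its body is not a filter;
# it either never runs or never stops:
# while w == "apple":
#     print("match:", w)   # w is never reassigned, so this would loop forever

But the condition in your code is invalid anyway, which leads to: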
Issue 3
urls.keywords is a list and content is a str. A list in a string will never make sense: it is like asking whether ['apple', 'banana'] is contained in 'b', and 'b' cannot contain such members. You could reverse the logic, but keep in mind that 'b' in ['apple', 'banana'] will be False, while 'banana' in ['apple', 'banana'] will be True.
Which means in your case that the condition '_SalesExport.txt' in urls.keywords will be False! Why? Because urls.keywords is:
[
    "APPRAISAL_ENTITY_INFO",
    "SalesExport",
    "account_info",
    "account_apprl_year",
    "res_detail",
    "applied_std_exempt",
    "land",
    "acct_exempt_value"
]
and SalesExport is not _SalesExport.txt.
To achieve partial match check, you need to compare list items (strings) against a string. "SalesExport" in "_SalesExport.txt" is True, but "SalesExport" in ["_SalesExport.txt"] is False because SalesExport is not a member of the list.
There are three things you could do:
update your keywords list to exact filenames so that content in urls.keywords could work (this means that if there is a directory structure in the zip file, you must include that too):
for content in zip_files:
    if content in urls.keywords:
        z.extract(path=dest_folder, member=content)
nest one for loop inside another:
for content in zip_files:
    for kw in urls.keywords:
        if kw in content:
            z.extract(path=dest_folder, member=content)
use a list comprehension (with a generator expression inside any()):
matches = [x for x in zip_files if any(y in x for y in urls.keywords)]
for m in matches:
    z.extract(path=dest_folder, member=m)
Finally, a recommendation:
Timeouts
Be careful with
url_resp = requests.get(url, allow_redirects=True, timeout=5).
"timeout" controls two things, connection timeout and read timeout. Since response may take longer than 5 sec, you may want a longer read timeout. You can specify timeout as tuple: (connect timeout, read timeout). So a better parameter would be:
url_resp = requests.get(url, allow_redirects=True, timeout=(5, 120))
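As a side note, when a timeout does trigger, requests raises an exception rather than returning a response. A hedged sketch of guarding the call (assuming the same url variable as in your loop):

import requests

try:
    url_resp = requests.get(url, allow_redirects=True, timeout=(5, 120))
    url_resp.raise_for_status()  # turn 4xx/5xx responses into exceptions
except requests.exceptions.Timeout:
    print("Request timed out:", url)
except requests.exceptions.HTTPError as err:
    print("Bad response:", err)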
In this program I am iterating the function recursively and adding the result into the file. It works fine, no issue whatsoever, but when I try to take the value from the return of the last call, it returns nothing, even though the variable is not empty, because the else part only runs a single time.
# this is an ipynb file so spacing means they are getting executed from different blocks
import csv
import pandas as pd

def intersection(pre, i=0, point=0, count=0, result=dt):
    index = -1
    prefer = []
    # print(i)
    if (0 < i):
        url = "../data/result.csv"
        result = pd.read_csv(url, names=["a", "b", "c", "d", "e"])
    if (i < len(pre)):
        for j in result[pre[i]]:
            index = index + 1
            if (demand[pre[i]][1] >= j):
                prefer.append(result.iloc[index, :])
        i = i + 1
        file = open('../data/result.csv', 'w+', newline='')
        header = ["a", "b", "c", "d", "e"]
        writer = csv.DictWriter(file, fieldnames=header)
        # writing data row-wise into the csv file
        writer.writeheader()
        # writing the data into the file
        with file:
            write = csv.writer(file)
            write.writerows(prefer)
        count = count + 1
        # print(prefer, count)  # print the outputs step by step
        intersection(pre, i, point, count, result)
    else:
        print("Else Part", type(result))
        print(result)
        return result

#
pre = ["a", "b", "c"]
rec = intersection(pre)
print(rec)
Output
It prints all the values of result from the else part. I have excluded them from the snapshot because the output was too vast, and I have only a few fields here, but that will not affect the problem I am getting. Please answer if you know how I can get the value of result into rec.
OK. The code is a bit more complex than I thought. I was trying to work through it just now, and I hit some bugs. Maybe you can clear them up for me.
In the function definition, def intersection(pre,i=0,point=0,count=0,result=dt):, dt isn't defined. What should it be?
On the fourth line, 0<i - the default value of i is zero so, unless i is given a value on calling the function, this piece of code will never run.
I notice that the file being read and the file being written are the same: ../data/result.csv - is this correct?
There's another undefined variable, demand, on line 14. Can you fill that in?
Let's see where we are after that.
I'm trying to create a metadata scraper to enrich my e-book collection, but I'm experiencing some problems. I want to create a dict (or whatever gets the job done) to store the index (only while testing), the path, and the series name. This is the code I've written so far:
from bs4 import BeautifulSoup

def get_opf_path():
    opffile = variables.items
    pathdict = {'index': [], 'path': [], 'series': []}
    safe = []
    x = 0
    for f in opffile:
        x += 1
        pathdict['path'] = f
        pathdict['index'] = x
        with open(f, 'r') as fi:
            soup = BeautifulSoup(fi, 'lxml')
            for meta in soup.find_all('meta'):
                if meta.get('name') == 'calibre:series':
                    pathdict['series'] = meta.get('content')
        safe.append(pathdict)
        print(pathdict)
    print(safe)
This code is able to go through all the opf files and get the series, index, and path; I'm sure of this, since the console output shows the correct values (screenshot omitted). However, when I try to store the pathdict in safe, no matter where I put the safe.append(pathdict), the output is not what I expect (screenshots omitted). What do I have to do so that safe=[] holds the data shown in the first screenshot?
I have tried everything I could think of, but nothing worked.
Any help is appreciated.
I believe this is the correct way:
from bs4 import BeautifulSoup

def get_opf_path():
    opffile = variables.items
    pathdict = {'index': [], 'path': [], 'series': []}
    safe = []
    x = 0
    for f in opffile:
        x += 1
        pathdict['path'] = f
        pathdict['index'] = x
        with open(f, 'r') as fi:
            soup = BeautifulSoup(fi, 'lxml')
            for meta in soup.find_all('meta'):
                if meta.get('name') == 'calibre:series':
                    pathdict['series'] = meta.get('content')
                    print(pathdict)
                    safe.append(pathdict.copy())
    print(safe)
For two main reasons:
When you do:
pathdict['series'] = meta.get('content')
you are overwriting the last value in pathdict['series'] so I believe this is where you should save.
You also need to make a copy of it; if you don't, it will also change in the list. When you store the dict you are really storing a reference to it (in this case, a reference to the variable pathdict).
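Here is a minimal, self-contained demonstration of that reference behaviour:

d = {'x': 1}
stored = []
stored.append(d)              # stores a reference, not a snapshot
d['x'] = 2
print(stored)                 # [{'x': 2}] -- the stored item changed too

stored_copy = []
stored_copy.append(d.copy())  # stores an independent shallow copy
d['x'] = 3
print(stored_copy)            # [{'x': 2}] -- unaffected by the later change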
Note
If you want to print the elements of the list in separated lines you can do something like this:
print(*safe, sep="\n")
I'm trying to rewrite a program that I wrote, getting rid of all the for loops.
The original code reads a file with thousands of lines that are structured like:
Ex. 2 lines of a file:
LPPD;LEMD;...
DAAE;LFML;...
As you can see, the first line starts with LPPD;LEMD and the second line starts with DAAE;LFML. I'm only interested in the very first and second element of each line.
The original code I wrote is:
# Libraries
import sys
from collections import Counter
import collections
from itertools import chain
from collections import defaultdict
import time

# START
# #time=0
start = time.time()

# Defining default program argument
if len(sys.argv) == 1:
    fileName = "file.txt"
else:
    fileName = sys.argv[1]

takeOffAirport = []
landingAirport = []

# Reading file
lines = 0  # Counter for file lines
try:
    with open(fileName) as file:
        for line in file:
            words = line.split(';')
            # Relevant data, item1 and item2 from each file line
            origin = words[0]
            destination = words[1]
            # Populating lists
            landingAirport.append(destination)
            takeOffAirport.append(origin)
            lines += 1
except IOError:
    print("\n\033[0;31mIoError: could not open the file:\033[00m %s" % fileName)

airports_dict = defaultdict(list)
# Merge lists into a dictionary key:value
for key, value in chain(Counter(takeOffAirport).items(),
                        Counter(landingAirport).items()):
    # 'AIRPORT_NAME': [num_takeOffs, num_landings]
    airports_dict[key].append(value)

# Sum key values and add it as another value
for key, value in airports_dict.items():
    # 'AIRPORT_NAME': [num_totalMovements, [num_takeOffs, num_landings]]
    airports_dict[key] = [sum(value), value]

# Sort dictionary by the top 10 total movements
airports_dict = sorted(airports_dict.items(),
                       key=lambda kv: kv[1], reverse=True)[:10]
airports_dict = collections.OrderedDict(airports_dict)

# Print results
print("\nAIRPORT" + "\t\t#TOTAL_MOVEMENTS" + "\t#TAKEOFFS" + "\t#LANDINGS")
for k in airports_dict:
    print(k, "\t\t", airports_dict[k][0],
          "\t\t\t", airports_dict[k][1][1],
          "\t\t", airports_dict[k][1][0])

# #time=1
end = time.time() - start
print("\nAlgorithm execution time: %0.5f" % end)
print("Total number of lines read in the file: %u\n" % lines)

airports_dict.clear()
takeOffAirport.clear()
landingAirport.clear()
My goal is to simplify the program using map, reduce and filter. So far I have sorted out the creation of the two independent lists, one with the first element of each file line and another with the second element, by using:
# Creates two independent lists with the first and second element from each line
takeOff_Airport = list(map(lambda sub: (sub[0].split(';')[0]), lines))
landing_Airport = list(map(lambda sub: (sub[0].split(';')[1]), lines))
I was hoping to find a way to open the file and achieve the exact same result as the original code by opening the file through a map() function, so I could pass each list to the maps defined above, takeOff_Airport and landing_Airport.
So if we have a file as such
line 1
line 2
line 3
line 4
and we do like this
open(file_name).read().split('\n')
we get this
['line 1', 'line 2', 'line 3', 'line 4', '']
Is this what you wanted?
Edit 1
I feel this is somewhat redundant, but since map applies a function to each element of an iterable, we will have to put our file name in a list, and we of course define our function:
def open_read(file_name):
    return open(file_name).read().split('\n')

print(list(map(open_read, ['test.txt'])))
This gets us
>>> [['line 1', 'line 2', 'line 3', 'line 4', '']]
So first off, calling split('\n') on each line is silly; the line is guaranteed to have at most one newline, at the end, and nothing after it, so you'd end up with a bunch of ['all of line', ''] lists. To avoid the empty string, just strip the newline. This won't leave each line wrapped in a list, but frankly, I can't imagine why you'd want a list of one-element lists containing a single string each.
So I'm just going to demonstrate using map+strip to get rid of the newlines, using operator.methodcaller to perform the strip on each line:
from operator import methodcaller
def readFile(fileName):
    try:
        with open(fileName) as file:
            return list(map(methodcaller('strip', '\n'), file))
    except IOError:
        print("\n\033[0;31mIoError: could not open the file:\033[00m %s" % fileName)
Sadly, since your file is context managed (a good thing, just inconvenient here), you do have to listify the result; map is lazy, and if you didn't listify before the return, the with statement would close the file, and pulling data from the map object would die with an exception.
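To make the failure concrete, here is a minimal sketch of what happens without the list() call (the file name is a placeholder):

from operator import methodcaller

def broken_readFile(fileName):
    with open(fileName) as file:
        return map(methodcaller('strip', '\n'), file)  # file closes on return

lines = broken_readFile("test.txt")
# next(lines)  # would raise ValueError: I/O operation on closed file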
To get around that, you can implement it as a trivial generator function, so the generator context keeps the file open until the generator is exhausted (or explicitly closed, or garbage collected):
def readFile(fileName):
    try:
        with open(fileName) as file:
            yield from map(methodcaller('strip', '\n'), file)
    except IOError:
        print("\n\033[0;31mIoError: could not open the file:\033[00m %s" % fileName)
yield from will introduce a tiny amount of overhead over directly iterating the map, but not much, and now you don't have to slurp the whole file if you don't want to; the caller can just iterate the result and get a stripped line on each iteration without pulling the whole file into memory. It does have the slight weakness that opening the file is done lazily, so you won't see the exception (if there is one) until you begin iterating. This can be worked around, but it's not worth the trouble if you don't really need it.
I'd generally recommend the latter implementation as it gives the caller flexibility. If they want a list anyway, they just wrap the call in list and get the list result (with a tiny amount of overhead). If they don't, they can begin processing faster, and have much lower memory demands.
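Hypothetical usage of the generator version, streaming one stripped line at a time (the field handling is assumed from your original program):

for line in readFile("file.txt"):
    origin, destination = line.split(';')[:2]
    # process one record at a time without loading the whole file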
Mind you, this whole function is fairly odd; replacing IOErrors with prints and (implicitly) returning None is hostile to API consumers (they now have to check return values, and can't actually tell what went wrong). In real code, I'd probably just skip the function and insert:
with open(fileName) as file:
    for line in map(methodcaller('strip', '\n'), file):
        # do stuff with line (with newline pre-stripped)
inline in the caller; maybe define strip_newline = methodcaller('strip', '\n') globally to use a friendlier name. It's not that much code, and I can't imagine that this specific behavior is needed in that many independent parts of your file, and inlining it removes the concerns about when the file is opened and closed.
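For example, the module-level helper might look like this (a sketch; the helper name is my own):

from operator import methodcaller

strip_newline = methodcaller('strip', '\n')  # strip_newline(s) == s.strip('\n')

with open("file.txt") as file:  # "file.txt" is a placeholder name
    for line in map(strip_newline, file):
        print(line)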
I'm trying to make a Python script that renames files randomly from a list. I used numbers.remove(place) on it, but it keeps choosing values that are supposed to have been removed.
I used to just use random.randint, but now I have moved to choosing from a list and then removing the chosen value from the list, but it seems to keep choosing already-chosen values.
from os import chdir, listdir, rename
from random import choice

def main():
    chdir('C:\\Users\\user\\Desktop\\Folders\\Music')
    for f in listdir():
        if f.endswith('.mp4'):
            numbers = [str(x) for x in range(0, 100)]
            had = []
            print(f'numbers = {numbers}')
            place = choice(numbers)
            print(f'place = {place}')
            numbers.remove(place)
            print(f'numbers = {numbers}')
            while place in had:
                input('Place has been had.')
                place = choice(numbers)
            had.append(place)
            name = place + '.mp4'
            print(f'name = {name}')
            print(f'\n\nRenaming {f} to {name}.\n\n')
            try:
                rename(f, name)
            except FileExistsError:
                pass

if __name__ == '__main__':
    main()
It should randomly number the files without choosing the same value for a file twice, but it keeps choosing the same values, and I have no idea why.
When you call listdir() the first time, that's the same list that you're iterating over the entire time. Yes, you're changing the contents of the directory, but python doesn't really care about that because you only asked for the contents of the directory at a specific point in time - before you began modifying it.
I would do this in two separate steps:
import os
import random

# get the current list of files in the directory
dirlist = os.listdir()

# choose a new name for each file
to_rename = zip(
    dirlist,
    [f'{num}.mp4' for num in random.sample(range(100), len(dirlist))]
)

# actually rename each file
for oldname, newname in to_rename:
    try:
        os.rename(oldname, newname)
    except FileExistsError:
        pass
This method is more concise than the one you're using. First, I use random.sample() on the iterable range(100) to generate non-overlapping numbers from that range (without the extra bookkeeping of the had list you're using now). I generate exactly as many numbers as I need, and then use the built-in zip() function to pair the original filenames with these new numbers.
Then, I do the rename() operations all at once.
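As a small illustration of why random.sample() makes the had bookkeeping unnecessary (it draws without replacement, so no value repeats):

import random

picks = random.sample(range(100), 5)
print(picks)                          # e.g. [42, 7, 93, 0, 61]
assert len(set(picks)) == len(picks)  # always true: all picks are distinct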
I am trying to write a Python program that takes n text files; each file contains names, one name per line, like this:
Steve
Mark
Sarah
What the program does is print out only the names that exist in all of the inputted files.
I am new to programming, so I don't really know how to implement this idea, but I thought of recursion. Still, the program seems to run in an infinite loop, and I am not sure what the problem is. Is the implementation wrong? If so, do you have a better idea of how to implement it?
import sys

arguments = sys.argv[1:]
files = {}
file = iter(arguments)
for number in range(len(sys.argv[1:])):
    files[number] = open(next(file))

def close_files():
    for num in files:
        files[num].close()

def start_next_file(line, files, orderOfFile):
    print('starting next file')
    if orderOfFile < len(files):  # to avoid IndexError
        for line_searched in files[orderOfFile]:
            if line_searched.strip():
                line_searched = line_searched[:-1]
                print('searched line = ' + line_searched)
                print('searched compared to = ' + line)
                if line_searched == line:
                    # good, now see if that name exists in the other files as well
                    start_next_file(line, files, orderOfFile + 1)
    elif orderOfFile >= len(files):  # when you finish searching all the files
        print('got ya ' + line)  # print the name that exists in all the files
        for file in files:
            # to make sure the cursor is at the beginning of the read files
            # so we can loop through them again
            files[file].seek(0)

def start_find_match(files):
    orderOfFile = 0
    for line in files[orderOfFile]:
        # for each name in the file see if it exists in all other files
        if line.strip():
            line = line[:-1]
            print('starting line = ' + line)
            start_next_file(line, files, orderOfFile + 1)

start_find_match(files)
close_files()
I'm not sure how to fix your code exactly but here's one conceptual way to think about it.
listdir gets all the files in the directory as a list. We narrow that down to only .txt files. Next, we open, read, split on newlines, and lowercase each file's contents to build a larger list, so files will be a list of lists. Last, we find the intersection across all the lists using some set logic.
import os

folder = [f for f in os.listdir() if f[-4:] == '.txt']
files = []
for i, file in enumerate(folder):
    with open(file) as f:
        files.append([name.lower() for name in f.read().splitlines()])

result = set.intersection(*map(set, files))
Example:
#file1.txt
john
smith
mary
sue
pretesh
ashton
olaf
Elsa
#file2.txt
David
Lorenzo
Cassy
Grant
elsa
Felica
Salvador
Candance
Fidel
olaf
Tammi
Pasquale
#file3.txt
Jaleesa
Domenic
Shala
Berry
Pamelia
Kenneth
Georgina
Olaf
Kenton
Milly
Morgan
elsa
Returns:
{'olaf', 'elsa'}
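For reference, the same set logic applied to inline data (a minimal sketch):

files = [['olaf', 'elsa', 'john'],
         ['elsa', 'olaf', 'tammi'],
         ['olaf', 'elsa', 'milly']]
print(set.intersection(*map(set, files)))  # {'elsa', 'olaf'} (order may vary)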