Good afternoon, I am new to Stack Overflow, so I apologize in advance if my question is not in the right format.
I have a list of URLs such as these (but many more):
master_urls =
['https://www.sec.gov/Archives/edgar/daily-index/2020/QTR1/master.20190102.idx',
 'https://www.sec.gov/Archives/edgar/daily-index/2020/QTR1/master.20190103.idx']
and I want to write the content of all of them into one single .txt file.
Using one of these URLs works perfectly fine. I do the steps below to achieve it:
import requests

file_url = r"https://www.sec.gov/Archives/edgar/daily-index/2019/QTR2/master.20190401.idx"
content = requests.get(file_url).content
with open('master_20190401.txt', 'wb') as f:
    f.write(content)
The .txt file looks like this (this is just a small sample; the rest of the file has the same structure, just with different company names, etc.):
CIK|Company Name|Form Type|Date Filed|File Name
--------------------------------------------------------------------------------
1000045|NICHOLAS FINANCIAL INC|8-K|20190401|edgar/data/1000045/0001193125-19-093800.txt
1000209|MEDALLION FINANCIAL CORP|SC 13D/A|20190401|edgar/data/1000209/0001193125-19-094732.txt
1000228|HENRY SCHEIN INC|4|20190401|edgar/data/1000228/0001209191-19-021970.txt
1000275|ROYAL BANK OF CANADA|424B2|20190401|edgar/data/1000275/0001140361-19-006199.txt
I tried the following code to get the content of all the URLs into one text file:
for file in master_urls:
    content = requests.get(file).content
    with open('complete_list.txt', 'w') as f:
        f.write(content)
but it does not work.
Can anyone help me get the content of each URL in my list of URLs into one single text file?
Thank you in advance.
Since you are opening your file inside the loop, it gets truncated and overwritten for every URL, so at best you end up with only the last response. (Also, requests.get(url).content returns bytes, so the file has to be opened in binary mode, 'wb', not 'w'.) Open the file once before the loop instead.
Try this:
import requests

with open('complete_list.txt', 'wb') as f:
    for url in master_urls:
        content = requests.get(url).content
        f.write(content)
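Because the file is opened once, in binary mode, before the loop, each response's bytes are appended after the previous one. If you want a separator between the contents of different URLs, you can write one yourself, e.g. f.write(b'\n') after each f.write(content).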
I have written code that pulls the URLs out of a very large sitemap XML file (10 MB) using Beautiful Soup, and it works exactly how I want it to, but it only seems to process a small portion of the overall file. This is my code:
from bs4 import BeautifulSoup as bs

sitemap = "sitemap1.xml"

with open(sitemap, "r") as file:
    # Read each line in the file; readlines() returns a list of lines
    content = file.readlines()
    # Combine the lines in the list into a single string
    content = "".join(content)

bs_content = bs(content, "xml")
results = bs_content.find_all("loc")
for result in results:
    print(result.text)
I have changed my IDE settings to allow for larger files, but it just seems to start the process at a random point towards the end of the XML file and only extract from there on.
I just wanted to say I ended up sorting this out. I used the read_xml function in pandas and it worked well. The original XML file was corrupted.
... I also realised that the console was only printing from a certain point because it's such a large file; it was actually still processing the whole file.
Sorry about this - I'm new :)
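For reference, a minimal sketch of the pandas approach mentioned above (a sketch, not the original poster's code; the file name sitemap1.xml comes from the question, the sitemap namespace is an assumption, and pandas >= 1.3 with lxml is required):
import pandas as pd

# read_xml turns each matched element into a row; for a standard sitemap,
# every <url> element becomes a row with its children (<loc>, ...) as columns.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}  # assumed namespace
df = pd.read_xml("sitemap1.xml", xpath="//sm:url", namespaces=ns)
print(df["loc"])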
I have been trying all day.
# successfully writes the data from line 17 and the next lines
# to a new (temp) file named and saved in the os
import os
import glob

files = glob.glob('/Users/path/Documents/test/*.txt')

for myspec in files:
    temp_filename = 'foo.temp.txt'
    with open(myspec) as f:
        # skip (consume) the first 17 header lines
        for n in range(17):
            f.readline()
        # write the remaining lines to the temp file
        with open(temp_filename, 'w') as w:
            w.writelines(f)
    # delete the original file and rename the temp file so it replaces the original
    os.remove(myspec)
    os.rename(temp_filename, myspec)

print("done")
The above works and it works well! I love it. I am very happy.
But the code below does NOT work (same files; I am preprocessing them):
# trying unsuccessfully to remove the last line, which is line
# 2048 in all files, and save again like above
import os
import glob

files = glob.glob('/Users/path/Documents/test/*.txt')

for myspec in files:
    temp_filename = 'foo.temp.txt'
    with open(myspec) as f:
        for n in range(-1):
            f.readline()
        with open(temp_filename, 'w') as w:
            w.writelines(f)
    # delete the original file and rename the temp file so it replaces the original
    os.remove(myspec)
    os.rename(temp_filename, myspec)

print("done")
This does not work. It doesn't give an error, and it prints done, but it does not change the file. I have tried range(-1) all the way up to range(-7), thinking maybe there were blank lines at the end I could not see. This is the only difference between the two blocks of code. If anyone could help, that would be great.
To summarize: I have permanently gotten rid of the headers, but I still have a one-line footer I cannot get rid of permanently.
Thank you so much for any help. I need to write permanently edited files, because I have a ton of code that expects 2- or 3-column files without all the header/footer junk, and the junk and file types vary widely. If I lose the junk permanently, ASCII can guess the file types correctly. And I really do not want to rewrite that code right now; it is very complicated, involves uncertainty, and took me months to get working correctly. I don't read the files until I'm inside a function, and there are many files displayed in multiple drop-downs. Thank you! I have been at this all day and have tried other methods; I'd like to make THIS method above work: to pop off the last line and write the result back to a permanent file. It doesn't like the -1. Right now it is just one specific line (specifically line 2048 after the header is removed), so just removing line 2048 would be fine too. It is the last line of the files, which are a batch of TSV files that are CCD readouts. Thanks in advance!
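For what it's worth, a hedged sketch of a likely fix (not from the original thread): range(-1) is an empty range, so the loop above never consumes a single line, and w.writelines(f) just copies the file unchanged. Reading all the lines and writing back everything except the last one removes the one-line footer:
import os
import glob

files = glob.glob('/Users/path/Documents/test/*.txt')

for myspec in files:
    temp_filename = 'foo.temp.txt'
    with open(myspec) as f:
        lines = f.readlines()
    with open(temp_filename, 'w') as w:
        # write every line except the final one (the one-line footer)
        w.writelines(lines[:-1])
    os.remove(myspec)
    os.rename(temp_filename, myspec)

print("done")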
Apologies in advance, as I am very new to programming (and Stack Overflow) and was not sure how to phrase the title for this question.
I have a db.sql file with some outdated URLS.
I would like to update those URLs to new ones using a Python script.
The old URLs don't share any similarities with the new URLs.
Example: "http://old.com/gs7Ubvg" needs to become "http://new.com/static.file"
I have created two text files, one with the old URLs and one with the new URLs.
So far I have managed to make a python script that uses the text files to associate the "old" URL with the "new", but am completely stumped at how to write this change to the db.sql file.
Here is what I have so far...
with open('old_urls.txt') as file:
    old_url = file.readlines()
file.close()

with open('new_urls.txt') as file:
    new_url = file.readlines()
file.close()

with open("db.sql", "rt") as fin:
    with open("db2.sql", "wt") as fout:
        for x in range(len(old_url)):
            old = old_url[x]
            new = new_url[x]
            print(old, "is now", new, end="")
            for line in fin:
                fout.write(line.replace(old, new))

fout.close()
fin.close()
The "print" part is a test and it works. The old URL and new URL match up. But trying to write that change to db2.sql does nothing. If I change it to...
fout.write(line.replace("literal","text"))
... it replaces the text in db2.sql just fine. How do I do this using the contents of the variables "old" and "new" instead of literal text?
Am I missing a loop?
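For what it's worth, a hedged sketch of a likely fix (not from the original thread): readlines() keeps the trailing '\n' on every URL, so old and new never match the text in db.sql, and fin is exhausted after the first pass through the inner loop. Stripping the newlines and applying every replacement in a single pass over the file avoids both problems:
with open('old_urls.txt') as f:
    old_urls = [line.strip() for line in f]   # strip() removes the trailing '\n'
with open('new_urls.txt') as f:
    new_urls = [line.strip() for line in f]

with open('db.sql', 'rt') as fin, open('db2.sql', 'wt') as fout:
    for line in fin:
        # apply every old -> new replacement to each line
        for old, new in zip(old_urls, new_urls):
            line = line.replace(old, new)
        fout.write(line)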
Hello, I have a problem. I want to get all the data from a web page, but it is too huge to save to a variable. I save the data like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen("http://download.cathdb.info/cath/releases/all-releases/v4_2_0/cath-classification-data/cath-domain-list-v4_2_0.txt")
r = BeautifulSoup(r, "lxml")
r = r.p.get_text()
# ... some operations ...
This was working well until I had to get data from this website:
http://download.cathdb.info/cath/releases/all-releases/v4_2_0/cath-classification-data/cath-domain-description-file-v4_2_0.txt
When I run the same code as above on this page, my program stops at the line
r = BeautifulSoup(r, "lxml")
and takes forever; nothing happens. I don't know how to get all of this data, without saving it to a file, so that I can search it for keywords and print them. I can't save it to a file; I have to get it straight from the website.
I will be very thankful for any help.
I think the code below can do what you want. As mentioned in a comment by @alecxe, you don't need to use BeautifulSoup. This is really just a problem of retrieving the contents of a text file online, and it is answered in: Given a URL to a text file, what is the simplest way to read the contents of the text file?
from urllib.request import urlopen

r = urlopen("http://download.cathdb.info/cath/releases/all-releases/v4_2_0/cath-classification-data/cath-domain-list-v4_2_0.txt")
for line in r:
    do_something(line)  # placeholder for your own processing
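Iterating over the response object streams the file one line at a time, so the whole file never has to be held in memory at once. Note that each line arrives as bytes; call line.decode() first if you need a str.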
I am trying to download a vcard to a specific location on my desktop, with a specific file name (which I define).
I have code that can download the file to my desktop.
import os
from selenium import webdriver

url = "http://www.kirkland.com/vcard.cfm?itemid=10485&editstatus=0"

fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", os.getcwd())
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "text/x-vcard")

browser = webdriver.Firefox(firefox_profile=fp)
browser.get(url)
Note, the URL above is a link to a vcard.
This is saving to the same directory where the code itself exists, and using a file name that was generated by the site I am downloading from.
I want to specify the directory where the file goes, and the name of the file.
Specifically, I would like to call the file something.txt
Also note, I realize there are much easier ways to do this (using urllib, or urllib2). I need to do it this specific way (if possible) because some links are JavaScript, which requires me to use Selenium. I used the above URL as an example to simplify the situation. I can provide other examples/code to show more complex situations if necessary.
Finally, thank you very much for the help I am sure I will get for this post, and for all the help you have provided me over the last year. I don't know how I would have learned all I have learned in this last year had it not been for this community.
I have code that works. It's more of a hack than a solution, but here it is:
import os
from selenium import webdriver

# SET FIREFOX PROFILE
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", os.getcwd())
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "text/x-vcard")

# OPEN URL (url as defined in the question above)
browser = webdriver.Firefox(firefox_profile=fp)
browser.get(url)

# FIND THE MOST RECENT FILE IN (YOUR) DIR AND RENAME IT
os.chdir("DIR-STRING")
files = filter(os.path.isfile, os.listdir("DIR-STRING"))
files = [os.path.join("DIR-STRING", f) for f in files]
files.sort(key=lambda x: os.path.getmtime(x))
newest_file = files[-1]
os.rename(newest_file, "NEW-FILE-NAME" + "EXTENSION")

# GET THE STRING, AND DELETE THE FILE
f = open("DIR-STRING" + "NEW-FILE-NAME" + "EXTENSION", "r")
string = f.read()
# DO WHATEVER YOU WANT WITH THE STRING/TEXT FROM THE DOWNLOAD
f.close()
os.remove("DIR-STRING" + "NEW-FILE-NAME" + "EXTENSION")
DIR-STRING is the path to the directory where the file is saved.
NEW-FILE-NAME is the name you want for the file.
EXTENSION is the extension, e.g. .txt.
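For example, a hypothetical filling-in of those placeholders (all values below are made up for illustration; note the trailing slash on the directory, since the snippet concatenates the pieces directly):
import os

dir_string = "/Users/you/Downloads/"  # DIR-STRING, with a trailing slash
new_file_name = "something"           # NEW-FILE-NAME
extension = ".txt"                    # EXTENSION

os.chdir(dir_string)
files = [os.path.join(dir_string, f) for f in os.listdir(dir_string)
         if os.path.isfile(os.path.join(dir_string, f))]
files.sort(key=os.path.getmtime)
newest_file = files[-1]
os.rename(newest_file, new_file_name + extension)

with open(dir_string + new_file_name + extension) as f:
    string = f.read()
os.remove(dir_string + new_file_name + extension)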