Python3 – RecursionError: maximum recursion depth exceeded during compilation - python-3.x

I am running a scraper that parses data from specified websites and populates the HTML of a webpage to be later updated. The amount of data that is being scraped and parsed is rather large, and the String containing the HTML of the webpage is a couple of hundred lines long.
I am using lists of dictionaries to insert the correct values into the HTML String through concatenation. After adding a certain amount of concatenations, I tried running my script to confirm that it was working properly, and I encountered this error:
RecursionError: maximum recursion depth exceeded during compilation
When I delete the added lines which add to the String containing the HTML of the page, the script runs fine.
I tried using this solution, but nothing changed.
I also tried to separate the String concatenation into a separate .py file and then importing that file and calling the function, but it continues to throw the RecursionError.
On a third shot, I attempted to use subprocess.run(), but I received ValueErrors because dictionaries cannot be passed as command line arguments.
I am running MacOS 11.4. Any help would be greatly appreciated :)
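One possible cause, assuming the HTML is built as a single very long chain of `+` concatenations, is that the compiler has to process one deeply nested expression. A minimal sketch of an alternative is to collect the pieces in a list and join them once. The record keys `name` and `value` and both function names are hypothetical, chosen only for illustration:

```python
from html import escape


def build_rows(records):
    """Build one <tr> per record instead of chaining '+' in one expression."""
    parts = []
    for record in records:
        # 'name' and 'value' are hypothetical keys used only in this sketch.
        parts.append(
            "<tr><td>{}</td><td>{}</td></tr>".format(
                escape(str(record["name"])), escape(str(record["value"]))
            )
        )
    return "\n".join(parts)


def build_page(records):
    """Wrap the generated rows in a table element."""
    return "<table>\n" + build_rows(records) + "\n</table>"
```

Because each statement stays short, the size of the page no longer depends on the length of any single expression.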

Related

Ending pdf to txt conversion if process exceeds a given time threshold

I am trying to convert a corpus of .pdf documents into a corpus of .txt documents using the pdfminer pdf2txt package. The process works well on most documents, but some of the PDFs are taking an exceptionally long time to convert. Some never actually seem to finish converting, and the process gets stuck. I'm trying to figure out how to stop the conversion if it exceeds a few minutes of processing time. I can create a timer function, but how do I get pdf2txt to skip a document that is taking too long and move on to the next document?
I've included the code for my for loop here without any timer function.
import os
import subprocess as sp
import requests
documents = <list of .pdf filenames>
dir = '../data/'
for doc in documents:
    # swap the 'pdf' extension for 'txt'
    txt = dir+doc[0:-3]+'txt'
    cmd = "pdf2txt.py "+dir+doc+" > "+txt
    sp.run(cmd, shell=True)
A large number of these documents are scans, so not text-based PDFs. pdf2txt is able to handle most of those, but for a few the code is getting stuck on the shell command.
subprocess.check_output has a timeout parameter, and so does subprocess.run.
Documentation Code Example
To further improve your processing time, you can do asynchronous process calls instead of waiting for processing each file before processing the next.
Code Example(Check Update2 in the question)
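As a rough sketch of the timeout idea, the helper below runs a command, writes its standard output to a file, and gives up after a set number of seconds. The name `run_to_file` and its return convention are hypothetical. It passes the command as a list without `shell=True`, so the timeout acts on the converter process itself rather than on an intermediate shell:

```python
import subprocess


def run_to_file(cmd, out_path, timeout_s):
    """Run cmd, write its stdout to out_path, return False on timeout or failure."""
    try:
        with open(out_path, "w") as out:
            # subprocess.run kills the child and raises TimeoutExpired
            # if it is still running after timeout_s seconds.
            subprocess.run(cmd, stdout=out, check=True, timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        return False
    except subprocess.CalledProcessError:
        return False
```

In the loop from the question this might be called as `run_to_file(["pdf2txt.py", dir + doc], txt, 180)`, skipping to the next document whenever it returns False.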

How to use a conditional statement while scraping?

I'm trying to scrape the MTA website and need a little help scraping the "Train Lines Row." (Website for reference: https://advisory.mtanyct.info/EEoutage/EEOutageReport.aspx?StationID=All)
The train line information is stored as image files (1 line subway, A line subway, etc) describing each line that's accessible through a particular station. I've had success scraping info out of rows in which only one train passes through, but I'm having difficulty figuring out how to iterate through the columns which have multiple trains passing through it...using a conditional statement to test for whether it has one line or multiple lines.
tableElements = table.find_elements_by_tag_name('tr')
That's the table I'm iterating through.
tableElements[2].find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt')
this successfully gives me the values if only one value exists in the particular column
tableElements[8].find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')
this successfully gives me a list of values I can successfully iterate through to extract my needed values.
Now I try to combine these lines of code in a for loop to extract all the information without stopping.
for info in tableElements[1:]:
    if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
        for images in info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img'):
            print(images.get_attribute('alt'))
    else:
        print(info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt'))
I'm getting the error message: "list index out of range." I don't know why, as every iteration done in isolation seems to work. My hunch is that I haven't used the boolean operation correctly here. My idea was that if find_elements_by_tag_name had an index of [1], that would mean there are multiple image texts for me to iterate through. Hence why I want to use this boolean operation.
Hi All, thanks so much for your help. I've uploaded my full code to Github and attached a link for your reference: https://github.com/tsp2123/MTA-Scraping/blob/master/MTA.ElevatorData.ipynb
The end goal is to put this information into a dataframe using some formulation of the following, with a for loop that extracts the image information I want.
dataframe = []
for elements in tableElements:
    row = {}
    columnName1 = find_element_by_class_name('td')
    ..
Your logic isn't off here.
"My hunch is I haven't correctly used the boolean operation properly here. My idea was that if find_elements_by_tag_name had an index of [1] that would mean multiple image text for me to iterate through."
The problem is it can't check if the statement is True if there's nothing in index position [1]. Hence the error at this point.
if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
What you want to do is use try. So something like:
for info in tableElements[1:]:
    try:
        if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
            for images in info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img'):
                print(images.get_attribute('alt'))
        else:
            print(info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt'))
    except:
        #do something else
        print ('Nothing found in index position.')
Is it also possible to go back to your question and provide the full code? When I try this, I'm getting 11 table elements, so I want to test it with the specific table you're trying to scrape.
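A different way to look at the same problem, offered only as a sketch: since find_elements_by_tag_name returns a list, iterating over that list covers the one-image and many-image cases alike, so no index check is needed. The helper `alt_texts` and the stand-in class `FakeImage` are hypothetical. The stand-in exists only so the example runs without a browser:

```python
def alt_texts(images):
    """Return the alt text of every image element; works for 0, 1 or many."""
    return [image.get_attribute('alt') for image in images]


class FakeImage:
    """Hypothetical stand-in for a Selenium element, for illustration only."""

    def __init__(self, alt):
        self.alt = alt

    def get_attribute(self, name):
        return self.alt if name == 'alt' else None


# With Selenium, the list would come from something like:
#   images = info.find_elements_by_tag_name('td')[1].find_elements_by_tag_name('img')
print(alt_texts([FakeImage('1 line subway'), FakeImage('A line subway')]))
```

Rows with no second td cell would still raise an IndexError, so a length check on the td list (or the try/except above) remains useful for those.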

How to extract python output out of the cmd?

I am using cmd in Windows 7 and I have encountered the following problem:
I write the command python in cmd to enter my code in python, then follows:
import requests
r=requests.get("https://nameofthepege.com")
r.text
After that the whole console gets full of html code. The last 200 to 300 lines of the output are visible but the rest are not. How can I see more lines?
Moreover, is there any way to extract the html code produced by the r.text command in a new file from within the python environment or the cmd?
Regarding your first question.
After that the whole console gets full of html code. The last 200 to
300 lines of the output are visible but the rest are not. How can I
see more lines?
Response: The CMD default buffer is limited to 300 lines. You should increase the CMD prompt buffer size.
The below tutorial explains how to do that:
https://www.tenforums.com/tutorials/94089-change-command-prompt-screen-buffer-size-windows.html
Regarding your second question.
Moreover, is there any way to extract the html code produced by the
r.text command in a new file from within the python environment or the
cmd?
Response: You can write the content from r.text into a file by creating a file with Python open() function. More information about Reading and Writing Files in the below link:
https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
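A minimal sketch of the second suggestion, assuming the HTML is already available as a string such as r.text. The function name `save_text` and the file name are hypothetical:

```python
def save_text(text, path):
    """Write text to path as UTF-8, replacing any existing file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# In the interactive session from the question this could be used as:
#   save_text(r.text, "page.html")
```

The saved file can then be opened in an editor or browser, which also sidesteps the console buffer limit.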

read login data from text file into dictionary error

Using the answer on Stack Overflow shown on this link: https://stackoverflow.com/a/4804039, I have attempted to read in the file contents into a dictionary. There is an error that I cannot seem to fix.
Code
def login():
    print("====Login====")
    userinfo={}
    with open("userinfo.txt","r") as f:
        for line in f:
            (key,val)=line.split()
            userinfo[key]=val
        print(userinfo)
File Contents
{'user1': 'pass'}
{'user2': 'foo'}
{'user3': 'boo'}
Error:
(key,val)=line.split()
ValueError: not enough values to unpack (expected 2, got 0)
I have a question to which I would very much appreciate a two fold answer
What is the best and most efficient way to read in file contents, as shown, into a dictionary, noting that it has already been stored in dictionary format.
Is there a way to WRITE to a dictionary to make this "reading" easier? My code for writing to the userinfo.txt file in the first place is shown below
Write code
import csv
with open("userinfo.txt","a",newline="")as fo:
    writer=csv.writer(fo)
    writer.writerow([{username:password}])
Could any answers please attempt the following
Provide a solution to the error using the original code
Suggest the best method to do the same thing (simplest for teaching purposes) Note, that I do not wish to use pickle, json or anything other than very basic file handling (so only reading from a text file or csv reader/writer tools). For instance, would it be best to read the file contents into a list and then convert the list into a dictionary? Or is there any other way?
Is there a method of writing a dictionary to a text file using csv reader or other basic txt file handling, so that the reading of the file contents into a dictionary could be done more effectively on the other end.
Update:
Blank line removed, and the code works but produces the erroneous output:
{"{'Vjr':": "'open123'}", "{'mvj':": "'mvv123'}"}
I think I need to understand the split and strip commands and how to use them in this context to produce the desired result (reading the contents into a dictionary userinfo)
Well let's start with the basics first. The error message:
ValueError: not enough values to unpack (expected 2, got 0)
means a line was empty, so do you have a blank line in the file?
Yes, there are other options for saving your dictionary and loading it back, but you should understand this first, and it may work just fine for you. :-) split() acts on the string you read from the file and by default splits on whitespace, which is why you are seeing that output. You could instead format your text file as 'username:pass' and then use split(':').
File Contents
user1:pass
user2:foo
user3:boo
Code
def login():
    print("====Login====")
    userinfo={}
    with open("userinfo.txt","r") as f:
        for line in f:
            (key,val)=line.split(':')
            userinfo[key]=val.strip()
        print(userinfo)
if __name__ == '__main__':
    login()
This simple format may be best if you want to be able to edit the text file by hand, and I like to keep it as simple as possible. ;-)
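To cover both the writing and the reading side with basic file handling only, here is a small sketch built on the same 'username:pass' format. The function names are hypothetical. It also skips blank lines, which were the source of the original ValueError, and splits on the first colon only so that a password may itself contain a colon:

```python
def append_user(path, username, password):
    """Append one 'username:password' line to the file."""
    with open(path, "a") as f:
        f.write(username + ":" + password + "\n")


def read_userinfo(path):
    """Read 'username:password' lines into a dictionary."""
    userinfo = {}
    with open(path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines instead of failing to unpack
            key, val = line.split(":", 1)  # split on the first colon only
            userinfo[key] = val
    return userinfo
```

Storing plain passwords in a text file is fine for a teaching exercise but not for real accounts.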

Overwriting specific lines in Python

I have a simple program that manipulates some stored data on some text files. However I have to store the name and the password on different files for python to read.
I was wondering if I could get these two words (The name and the password) on two separate lines on one file and get python to overwrite just one of the lines based on what I choose to overwrite (either the password or the name).
I can get python to read specific lines with:
linenumber=linecache.getline("example.txt",4)
Ideally I'd like something like this:
linenumber=linecache.writeline("example.txt","Hello",4)
So this would just write "Hello" in "example.txt" only on line 4.
But unfortunately it doesn't seem to be as simple as that, I can get the words to be stored on separate files but overall doing this on a larger scale, I'm going to have a lot of text files all named differently and with different words on them.
If anyone would be able to help, it would be much appreciated!
Thanks, James.
You can try the built-in open() function:
def overwrite(filename,newline,linenumber):
    try:
        with open(filename,'r') as reading:
            lines = reading.readlines()
        # note: linenumber is 0-based here, unlike linecache.getline
        lines[linenumber]=newline+'\n'
        with open(filename,'w') as writing:
            for i in lines:
                writing.write(i)
        return 0
    except:
        return 1 #when reading/writing gone wrong, eg. no such a file
Be careful! It rewrites every line in a loop, so if an exception occurs while writing, example.txt may already be blank. You may want to keep all the lines in a list so you can write them back to the file in the exception handler, or keep a backup of your old files.
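One way to reduce that risk, sketched here under the assumption that the file fits comfortably in memory, is to write the new contents to a temporary file in the same directory and then swap it into place with os.replace. The function name `overwrite_line` is hypothetical, and it uses 1-based line numbers to match linecache.getline:

```python
import os
import tempfile


def overwrite_line(filename, new_text, line_number):
    """Replace the 1-based line_number of filename with new_text."""
    with open(filename, "r") as src:
        lines = src.readlines()
    if not 1 <= line_number <= len(lines):
        raise IndexError("line_number is outside the file")
    lines[line_number - 1] = new_text + "\n"

    # Write to a temporary file next to the original, then swap it in,
    # so the original is left untouched if writing fails part-way.
    directory = os.path.dirname(os.path.abspath(filename))
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.writelines(lines)
        os.replace(tmp_path, filename)
    except Exception:
        os.unlink(tmp_path)
        raise
```

For example, `overwrite_line("example.txt", "Hello", 4)` would replace only the fourth line and leave the others as they were.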
