I'm trying to scrape the MTA website and need a little help scraping the "Train Lines Row." (Website for reference: https://advisory.mtanyct.info/EEoutage/EEOutageReport.aspx?StationID=All
The train line information is stored as image files (1 line subway, A line subway, etc) describing each line that's accessible through a particular station. I've had success scraping info out of rows in which only one train passes through, but I'm having difficulty figuring out how to iterate through the columns which have multiple trains passing through it...using a conditional statement to test for whether it has one line or multiple lines.
tableElements = table.find_elements_by_tag_name('tr')
that's the table i'm iterating through
tableElements[2].find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt')
this successfully gives me the values if only one value exists in the particular column
tableElements[8].find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')
this successfully gives me a list of values I can successfully iterate through to extract my needed values.
Now I try and combine these lines of code together in a forloop to extract all the information without stopping.
for info in tableElements[1:]:
if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
for images in info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img'):
print(images.get_attribute('alt'))
else:
print(info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt'))
I'm getting the error message: "list index out of range." I dont know why, as every iteration done in isolation seems to work. My hunch is I haven't correctly used the boolean operation properly here. My idea was that if find_elements_by_tag_name had an index of [1] that would mean multiple image text for me to iterate through. Hence, why I want to use this boolean operation.
Hi All, thanks so much for your help. I've uploaded my full code to Github and attached a link for your reference: https://github.com/tsp2123/MTA-Scraping/blob/master/MTA.ElevatorData.ipynb
The end goal is going to be to put this information into a dataframe using some formulation of and having a for loop that will extract the image information that I want.
dataframe = []
for elements in tableElements:
row = {}
columnName1 = find_element_by_class_name('td')
..
Your logic isn't off here.
"My hunch is I haven't correctly used the boolean operation properly here. My idea was that if find_elements_by_tag_name had an index of [1] that would mean multiple image text for me to iterate through."
The problem is it can't check if the statement is True if there's nothing in index position [1]. Hence the error at this point.
if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
What you want to do is use try: So something like:
for info in tableElements[1:]:
try:
if info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img')[1] == True:
for images in info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_elements_by_tag_name('img'):
print(images.get_attribute('alt'))
else:
print(info.find_elements_by_tag_name('td')[1].find_element_by_tag_name('h4').find_element_by_tag_name('img').get_attribute('alt'))
except:
#do something else
print ('Nothing found in index position.')
Is it also possible to back to your question and provide the full code? When I try this, I'm getting 11 table elements, so want to test it with the specific table you're trying to scrape.
Related
I am willing to parse https://2gis.kz , and I encountered the problem that I am getting error while using .text or any methods used to extract text from a class
I am typing the search query such as "fitness"
My window variable is
all_cards = driver.find_elements(By.CLASS_NAME,"_1hf7139")
for card_ in all_cards:
card_.click()
window = driver.find_element(By.CLASS_NAME, "_18lzknl")
This is a quite simplified version of how I open a mini-window with all of the essential information inside it. Below I am attaching the piece of code where I am trying to extract text from a phone number holder.
texts = window.find_elements(By.CLASS_NAME,'_b0ke8')
print(texts) # this prints out something from where I am concluding that this thing is accessible
try:
print(texts.text)
except:
print(".text")
try:
print(texts.text())
except:
print(".text()")
try:
print(texts.get_attribute("innerHTML"))
except:
print('getAttribute("innerHTML")')
try:
print(texts.get_attribute("textContent"))
except:
print('getAttribute("textContent")')
try:
print(texts.get_attribute("outerHTML"))
except:
print('getAttribute("outerHTML")')
Hi, guys, I solved an issue. The .text was not working for some reason. I guess developers somehow managed to protect information from using this method. I used a
get_attribute("innerHTML") # afaik this allows us to get a html code of a particular class
and now it works like a charm.
texts = window.find_elements(By.TAG_NAME, "bdo")
with io.open("t.txt", "a", encoding="utf-8") as f:
for text in texts:
nums = re.sub("[^0-9]", "",
text.get_attribute("innerHTML"))
f.write(nums+'\n')
f.close()
So the problem was that:
I was trying to print a list of items just by using print(texts)
Even when I tried to print each element of texts variable in a for loop, I was getting an error due to the fact that it was decoded in utf-8.
I hope someone will find it useful and will not spend a plethora of time trying to fix such a simple bug.
find_elements method returns a list of web elements. So this
texts = window.find_elements(By.CLASS_NAME,'_b0ke8')
gives you texts a list of web elements.
You can not apply .text method directly on list.
In order to get each element text you will have to iterate over elements in the list and extract that element text, like this:
text_elements = window.find_elements(By.CLASS_NAME,'_b0ke8')
for element in text_elements:
print(element.text)
Also, I'm not sure about locators you are using.
_1hf7139, _18lzknl and _b0ke8 class names are seem to be dynamic class names i.e they may change each browsing session.
This problem seems very simple, but I'm having trouble finding an already existing solution on StackOverflow.
When I run a sqlalchemy command like the following
valid_columns = db.session.query(CrmLabels).filter_by(user_id=client_id).first()
I get back a CrmLabels object that is not iterable. If I print this object, I get a list
[Convert Source, Convert Medium, Landing Page]
But this is not iterable. I would like to get exactly what I've shown above, except as a list of strings
['Convert Source', 'Convert Medium', 'Landing Page']
How can I run a query that will return this result?
Below change should do it:
valid_columns = (
db.session.query(CrmLabels).filter_by(user_id=client_id)
.statement.execute() # just add this
.first()
)
However, you need to be certain about the order of columns, and you can use valid_columns.keys() to make sure the values are in the expected order.
Alternatively, you can create a dictionary using dict(valid_columns.items()).
I have a dictionary file called “labels” that contains text objects.
Screen capture of file
When I display the contents of this file, I get the following:
{'175.123.98.240': Text(-0.15349206308126684, -0.6696533109609498, '175.123.98.240'),
'54.66.152.105': Text(-1.0, -0.5455880938500245, '54.66.152.105'),
'62.97.116.82': Text(0.948676253595717, 0.6530664635187481, '62.97.116.82'),
'24.73.75.234': Text(0.849485905682265, -0.778703553136851, '24.73.75.234'),
'1.192.128.23': Text(0.2883091762715677, -0.03432011446968225, '1.192.128.23'),
'183.82.9.19': Text(-0.8855214994079628, 0.7201660238351776, '183.82.9.19'),
'14.63.160.219': Text(-0.047457773060320695, 0.655032585063581, '14.63.160.219')}
I want to change the IP address in the text object portion such that the file looks like this:
{'175.123.98.240': Text(-0.15349206308126684, -0.6696533109609498, 'xxx.123.98.240'),
'54.66.152.105': Text(-1.0, -0.5455880938500245, 'xxx.66.152.105'),
'62.97.116.82': Text(0.948676253595717, 0.6530664635187481, 'xxx.97.116.82'),
'24.73.75.234': Text(0.849485905682265, -0.778703553136851, 'xxx.73.75.234'),
'1.192.128.23': Text(0.2883091762715677, -0.03432011446968225, 'xxx.192.128.23'),
'183.82.9.19': Text(-0.8855214994079628, 0.7201660238351776, 'xxx.82.9.19'),
'14.63.160.219': Text(-0.047457773060320695, 0.655032585063581, 'xxx.63.160.219')}
This file is used for printing labels on a networkx graph.
I have a couple of questions.
Can the contents of a text object be modified?
If so, can it be changed without iterating through the file since the number of changes could range from 3 to 6,000, depending on what I am graphing?
How would I do it?
I did consider changing the IP address before I created my node and edge files but that resulted in separate IP address being clustered incorrectly. For example: 173.6.48.24 and 1.6.48.24 would both be converted to xxx.6.48.24.
Changing the IP address at the time of printing the labels seems like the only sensible method.
I am hoping someone could point me in the right direction. I have never dealt with text objects and I am out of my depth on this one.
Thanks
Additional information
The original data set is a list of IP addresses that have attack several honeypots I am running. I have taken the data and catalogued the data based on certain attack criteria.
The data that I showed was just one of the small attack networks. The label file was generated using the code:
labels = nx.draw_networkx_labels(compX, pos=pos_df)
Where compX is the file containing the data to be graphed and pos_df is the layout of the graph. In this case, I used nx.spring_layout().
I can display the contents of the label file using:
for k,v in labels.items():
print(v)
However, “v” contains the text object, which I do not seem to be able to work with. The content of “v” is a follows:
Text(-0.15349206308126684, -0.6696533109609498, '175.123.98.240')
Text(-1.0, -0.5455880938500245, '54.66.152.105')
Text(0.948676253595717, 0.6530664635187481, '62.97.116.82')
Text(0.849485905682265, -0.778703553136851, '24.73.75.234')
Text(0.2883091762715677, -0.03432011446968225, '1.192.128.23')
Text(-0.8855214994079628, 0.7201660238351776, '183.82.9.19')
Text(-0.047457773060320695, 0.655032585063581, '14.63.160.219')
This is where I am stuck. I do not seem to be able to come up with any code that does not return some kind of “'Text' object has no attribute xxxx”.
As for replacing the first octet, I have the following code that works on a dataframe and I have just been experimenting to see if I can adapt it but so far, no luck:
df[column_ID] = df[column_ID].apply(lambda x: "xxx."+".".join(x.split('.')[1:4])) # Replace First octet
As I said, I would prefer not to iterate through the file. This cluster has seven entries; others can contain up to 6,000 nodes – granted the graph looks like a hairball with this many nodes, but most are between 3 and 25 nodes. I have a total of 60 clusters and as I collect more information, this number will rise.
I found a solution to replacing text inside a text object:
1) Convert text object to string
2) Find the position to be changed and make the change
3) Use set_text() to make the change to the text object
Example code
# Anonymize Source IP address
for k,v in labels.items():
a = str(v)
a = a[a.find(", '"):]
a = 'xxx' + a[a.find("."):][:-2]
v.set_text(a)
I'm using Selenium for extracting comments of Youtube.
Everything went well. But when I print comment.text, the output is the last sentence.
I don't know who to save it for further analyze (cleaning and tokenization)
path = "/mnt/c/Users/xxx/chromedriver.exe"
This is the path that I saved and downloaded my chrome
chrome = webdriver.Chrome(path)
url = "https://www.youtube.com/watch?v=WPni755-Krg"
chrome.get(url)
chrome.maximize_window()
scrolldown
sleep = 5
chrome.execute_script('window.scrollTo(0, 500);'
time.sleep(sleep)
chrome.execute_script('window.scrollTo(0, 1080);')
time.sleep(sleep)
text_comment = chrome.find_element_by_xpath('//*[#id="contents"]')
comments = text_comment.find_elements_by_xpath('//*[#id="content-text"]')
comment_ids = []
Try this approach for getting the text of all comments. (the forloop part edited- there was no indention in the previous code.)
for comment in comments:
comment_ids.append(comment.get_attribute('id'))
print(comment.text)
when I print, i can see all the texts here. but how can i open it for further study. Should i always use for loop? I want to tokenize the texts but the output is only last sentence. Is there a way to save this .text file with the whole texts inside it and open it again? I googled it a lot but it wasn't successful.
So it sounds like you're just trying to store these comments to reference later. Your current solution is to append them to a string and use a token to create substrings? I'm not familiar with pythons data structures, but this sounds like a great job for an array or a list depending on how you plan to reference this data.
I looked around at other questions but couldn't find out that addresses the issue I'm having. I am cleaning a data set in an ipython notebook. When I run the cleaning tasks individually they work as expected, but I am having trouble with the replace() and drop() functions when they are included in a UDF. Specifically, these lines aren't doing anything within the UDF, however, a dataframe is returned that completes the other tasks as expected (i.e. reads in the file, sets the index, and filters select dates out).
Any help is much appreciated!
Note that in this problem the df.drop() and df.replace() commands both work as expected when executed outside of the UDF. The function is below for your reference. The issue is with the last two lines "station.replace()" and "station.drop()".
def read_file(file_path):
'''Function to read in daily x data'''
if os.path.exists(os.getcwd()+'/'+file_path) == True:
station = pd.read_csv(file_path)
else:
!unzip alldata.zip
station = pd.read_csv(file_path)
station.set_index('date',inplace=True) #put date in the index
station = station_data[station_data.index > '1984-09-29'] #removes days where there is no y-data
station.replace('---','0',inplace=True)
station.drop(columns=['Unnamed: 0'],axis=1,inplace=True) #drop non-station columns
There was a mistake here:
station = station_data[station_data.index > '1984-09-29']
I was using an old table index. I corrected it to:
station = station[station.index > '1984-09-29']
Note, I had to restart the notebook and re-run it from the top for it to work. I believe it was an issue with conflicting table names in the UDF vs. what was already stored in memory.