How to save the text output from Selenium Chrome (Python) - python-3.x

I'm using Selenium to extract YouTube comments.
Everything went well, but when I print comment.text, the output is only the last sentence.
I don't know how to save it for further analysis (cleaning and tokenization).
import time
from selenium import webdriver

# Path where I downloaded and saved chromedriver
path = "/mnt/c/Users/xxx/chromedriver.exe"
chrome = webdriver.Chrome(path)
url = "https://www.youtube.com/watch?v=WPni755-Krg"
chrome.get(url)
chrome.maximize_window()
# Scroll down
sleep = 5
chrome.execute_script('window.scrollTo(0, 500);')
time.sleep(sleep)
chrome.execute_script('window.scrollTo(0, 1080);')
time.sleep(sleep)
text_comment = chrome.find_element_by_xpath('//*[@id="contents"]')
comments = text_comment.find_elements_by_xpath('//*[@id="content-text"]')
comment_ids = []
Try this approach for getting the text of all comments. (The for-loop part was edited: the previous code had no indentation.)
for comment in comments:
    comment_ids.append(comment.get_attribute('id'))
    print(comment.text)
When I print, I can see all the texts here. But how can I keep them for further study? Should I always use a for loop? I want to tokenize the texts, but the output is only the last sentence. Is there a way to save a text file with all the texts inside it and open it again? I googled a lot without success.

So it sounds like you're just trying to store these comments to reference later. Is your current solution to append them to a string and use a token to create substrings? I'm not familiar with Python's data structures, but this sounds like a great job for an array or a list, depending on how you plan to reference this data.
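A minimal sketch of the list idea, assuming each element exposes its text through a .text attribute (as a Selenium WebElement does). FakeComment, save_comments, load_comments and the file name are hypothetical names used only for illustration; with real Selenium results you would pass the comments list instead.

```python
import os
import tempfile


class FakeComment:
    """Stand-in for a Selenium WebElement; only the .text attribute is modeled."""

    def __init__(self, text):
        self.text = text


def save_comments(elements, path):
    """Collect the text of every element and write one comment per line."""
    texts = [element.text for element in elements]
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # Collapse internal line breaks so each comment stays on one line.
            f.write(" ".join(text.split()) + "\n")
    return texts


def load_comments(path):
    """Read the saved comments back into a list of strings."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


comments = [FakeComment("First comment"), FakeComment("Second\ncomment")]
path = os.path.join(tempfile.gettempdir(), "comments_example.txt")
save_comments(comments, path)
loaded = load_comments(path)
print(loaded)
```

After a for loop finishes, the loop variable still refers to the last item, which would explain why comment.text shows only the last comment when it is used outside the loop. Keeping the texts in a list (or in a file, as sketched here) makes all of them available for cleaning and tokenization later.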

Related

Unable to access text from a class using selenium on python

I want to parse https://2gis.kz, and I am getting an error when using .text or any other method to extract text from a class.
I type a search query such as "fitness".
My window variable is:
all_cards = driver.find_elements(By.CLASS_NAME,"_1hf7139")
for card_ in all_cards:
    card_.click()
    window = driver.find_element(By.CLASS_NAME, "_18lzknl")
This is a quite simplified version of how I open a mini-window with all of the essential information inside it. Below I am attaching the piece of code where I am trying to extract text from a phone number holder.
texts = window.find_elements(By.CLASS_NAME,'_b0ke8')
print(texts) # this prints out something from where I am concluding that this thing is accessible
try:
    print(texts.text)
except:
    print(".text")
try:
    print(texts.text())
except:
    print(".text()")
try:
    print(texts.get_attribute("innerHTML"))
except:
    print('getAttribute("innerHTML")')
try:
    print(texts.get_attribute("textContent"))
except:
    print('getAttribute("textContent")')
try:
    print(texts.get_attribute("outerHTML"))
except:
    print('getAttribute("outerHTML")')
Hi guys, I solved the issue. The .text was not working for some reason. I guess the developers somehow managed to protect the information from this method. I used
get_attribute("innerHTML") # afaik this returns the HTML inside a particular element
and now it works like a charm.
texts = window.find_elements(By.TAG_NAME, "bdo")
with io.open("t.txt", "a", encoding="utf-8") as f:
    for text in texts:
        # Keep only the digits of the phone number
        nums = re.sub(r"[^0-9]", "", text.get_attribute("innerHTML"))
        f.write(nums + '\n')
# The with block closes the file automatically
So the problems were:
I was trying to print a list of items just by using print(texts).
Even when I tried to print each element of the texts variable in a for loop, I was getting an error related to UTF-8 encoding.
I hope someone finds this useful and does not spend a lot of time fixing such a simple bug.
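The digit-extraction step can be tried on its own without a browser. This is only a sketch: digits_only is a hypothetical helper name, and the sample strings are made up, so the real innerHTML on the site may look different.

```python
import re


def digits_only(html_fragment):
    """Drop every character that is not an ASCII digit."""
    return re.sub(r"[^0-9]", "", html_fragment)


# Made-up innerHTML values; the real markup on the site may differ.
samples = ["+7 (701) 123-45-67", "<span>8 727 000 11 22</span>"]
numbers = [digits_only(sample) for sample in samples]
print(numbers)
```

Note that this keeps every digit in the fragment, so digits inside tag attributes (if the markup has any) would end up in the result as well.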
The find_elements method returns a list of web elements. So this
texts = window.find_elements(By.CLASS_NAME,'_b0ke8')
gives you texts, a list of web elements.
You cannot apply .text directly to a list.
To get each element's text, iterate over the elements in the list and extract the text of each one, like this:
text_elements = window.find_elements(By.CLASS_NAME,'_b0ke8')
for element in text_elements:
    print(element.text)
Also, I'm not sure about the locators you are using.
The class names _1hf7139, _18lzknl and _b0ke8 seem to be dynamic, i.e. they may change between browsing sessions.
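The difference between the list and its elements can be reproduced without a browser. In this sketch FakeElement is a hypothetical stand-in that models only the .text attribute of a web element, and the sample strings are made up.

```python
class FakeElement:
    """Stand-in for a Selenium WebElement; only .text is modeled."""

    def __init__(self, text):
        self.text = text


# A plain Python list, which is the kind of object find_elements returns.
elements = [FakeElement("+7 701 000 00 00"), FakeElement("fitness club")]

try:
    elements.text  # a list has no .text attribute
    list_has_text = True
except AttributeError:
    list_has_text = False

# Reading .text element by element works.
texts = [element.text for element in elements]
print(list_has_text)
print(texts)
```

A list comprehension like the one above is a compact alternative to the for loop when the texts are needed as a list rather than printed.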

How to get the right video URL of an Instagram post using Python

I am trying to build a program with a function that takes the URL of a post as input and outputs the links of the images and videos the post contains. It works really well for images. However, when it comes to video links, it returns a wrong URL. I have no idea how to handle this situation.
https://scontent-lax3-2.cdninstagram.com/v/t50.2886-16/86731551_2762014420555254_3542675879337307555_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InZ0c192b2RfdXJsZ2VuLjcyMC5jYXJvdXNlbF9pdGVtIiwicWVfZ3JvdXBzIjoiW1wiaWdfd2ViX2RlbGl2ZXJ5X3Z0c19vdGZcIl0ifQ&_nc_ht=scontent-lax3-2.cdninstagram.com&_nc_cat=106&_nc_ohc=WDuXskvIuLEAX9rj7MU&vs=17877888256532912_3147883953&_nc_vs=HBksFQAYJEdCOXJLd1gyMVdhWUNkQUpBS090UGo3eEhDb3hia1lMQUFBRhUAAsgBABUAGCRHTFBXTUFVTXNPaG5XcW9EQU5KUEE5bEZVdVZxYmtZTEFBQUYVAgLIAQAoABgAGwGIB3VzZV9vaWwBMBUAABgAFuD4nJGH9sE%2FFQIoAkMzLBdAEszMzMzMzRgSZGFzaF9iYXNlbGluZV8xX3YxEQB17gcA&_nc_rid=97e769e058&oe=5EDF10A5&oh=3713c35f89fca1aa9554a281aa3421ed
https://scontent-gmp1-1.cdninstagram.com/v/t50.2886-16/0_0_0_\x00.mp4?_nc_ht=scontent-gmp1-1.cdninstagram.com&_nc_cat=100&_nc_ohc=Wnu_-GvKHJoAX9F_ui1&oe=5EDE8214&oh=7920ac3339d5bf313e898c3cbec85fa2
Here are two URLs. The first one is copied from the source of the web page, while the second one is copied from the data scraped by pyquery. They come from the same Instagram post and the same path, but they are totally different. The first one works well, but the second one doesn't. How can I solve this? How can I get the right video URL?
I have searched the net for a long time, to no avail. Please help or give me some ideas on how to achieve this.
Here is the code related to the question:
import json
from pyquery import PyQuery as pq

def getUrls(url):
    URL = str(url)
    html = get_html(URL)  # get_html is defined elsewhere (not shown)
    doc = pq(html)
    urls = []
    items = doc('script[type="text/javascript"]').items()
    for item in items:
        if item.text().strip().startswith('window._sharedData'):
            # Skip the first 21 characters and the last one, then parse the rest as JSON
            js_data = json.loads(item.text()[21:-1])
            shortcode_media = js_data["entry_data"]["PostPage"][0]["graphql"]["shortcode_media"]
            edges = shortcode_media['edge_sidecar_to_children']['edges']
            for edge in edges:
                is_video = edge['node']['is_video']
                if is_video:
                    video_url = edge['node']['video_url']
                    # str.replace returns a new string, so keep the result
                    video_url = video_url.replace(r'\u0026', "&")
                    urls.append(video_url)
                else:
                    display_url = edge['node']['display_resources'][-1]['src']
                    display_url = display_url.replace(r'\u0026', "&")
                    urls.append(display_url)
    return urls
Thanks in advance.
There's nothing wrong with your code. This is a known intermittent issue with Instagram, and other people have encountered it too: https://github.com/arc298/instagram-scraper/issues/545
There doesn't appear to be a known fix yet.
Also, while unrelated to your question, it's worth mentioning that you don't need to inspect the display_resources object to get the URL of the image:
display_url = edge['node']['display_resources'][-1]['src']
There is already a display_url property available (I'm guessing you saw this, based on the variable name?). So you can simply do:
display_url = edge['node']['display_url']
I've seen this sometimes when using this Python module instead of HTML-scraping. At least with that module, edge["node"]["videos"]["standard_resolution"]["url"] usually (but not always) gives a non-corrupted value.

Selenium unusual output in python

I am getting comments from a website. I tried this before and it worked well, but now it gives me unusual output.
Part of my code:
comments = driver.find_elements_by_class_name("comment-text")
time.sleep(1)
print(comments[1])
The output:
<selenium.webdriver.remote.webelement.WebElement (session="bb6ae0409dd8ec8c191f9bd84f79bea7", element="5f5d4a2f-7a93-41fe-9ca5-aa4e7c525792")>
You want
print(comments[1].text)
You were printing the element itself which is just some GUID (I think). I'm assuming you want the text contained in the element which means you need .text.

How to scrape a pdf file to make a DF

I need to build a database with several pieces of data. Most of that data is contained in PDF files. Those PDF files all have the same layout; only the data changes (for example, one of the files I have to work with: https://documentos.serviciocivil.cl/actas/dnsc/documentService/downloadWs?uuid=aecfeb7c-d494-4631-ade4-584d67ea120e).
I have been trying to extract the data with PyPDF, tabula, pdfminer (I even tried textract, but it didn't work through Anaconda) and other tools, but I didn't get what I wanted.
Then I tried converting those PDF files to txt files and mining those, but didn't get anything. I also tried regex but didn't understand how to use it, although the code doesn't show errors when running:
import re

recording = False
# Raw string so the backslashes in the Windows path are kept literally
your_file = r"D:\Magister\Tercer semestre\Tesis I\Txt\ResultadoConcurso1.txt"
start_pattern = 'apellidos:'
stop_pattern = r'1\.2'  # escape the dot so it matches a literal "."
output_section = []
for line in open(your_file).readlines():
    if recording is False:
        if re.search(start_pattern, line) is not None:
            recording = True
            output_section.append(line.strip())
    elif recording is True:
        if re.search(stop_pattern, line) is not None:
            recording = False
            # Leave the loop; sys.exit() here would end the script before the print below
            break
        output_section.append(line.strip())
print("".join(output_section))
As you can see in the link above, the PDF files have different sections. I need to get the information inside those sections. For example, one of the fields in my database is going to be "Nombre y apellido" (first and last name). It is contained between "apellidos:" and "1.2".
What should I do? Can I work directly from the PDF format, or should I work with txt files? And then, what should I use to get the information? (Python 3.XX; Anaconda)
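One possible way to pull out the text between two markers once the PDF has been converted to plain text, offered as a sketch only. extract_between is a hypothetical helper, and the sample text is made up; the real layout of the converted file may differ, and this does not address extracting directly from the PDF.

```python
import re


def extract_between(text, start_marker, stop_marker):
    """Return the text between the first start_marker and the next stop_marker, or None."""
    # re.escape makes characters such as "." in "1.2" match literally.
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(stop_marker)
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        return None
    # Collapse whitespace, including line breaks inside the section.
    return " ".join(match.group(1).split())


# Made-up sample text; the real converted file may be laid out differently.
sample = "1.1 Nombre y apellidos: Juan\nPerez Soto\n1.2 Cargo: Analista\n"
name = extract_between(sample, "apellidos:", "1.2")
print(name)
```

Working on the whole text at once with re.DOTALL avoids tracking a recording flag line by line, and returning None makes it easy to spot files where a marker is missing.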
Thanks

Why is the same function in python-chess returning different results?

I'm new to working with python-chess and I was perusing the official documentation. I noticed this very weird thing I just can't make sense of. This is from the documentation:
import chess.pgn
pgn = open("data/pgn/kasparov-deep-blue-1997.pgn")
first_game = chess.pgn.read_game(pgn)
second_game = chess.pgn.read_game(pgn)
So as you can see, the exact same function chess.pgn.read_game() returns two different games. I tried with my own pgn file and, sure enough, first_game == second_game resulted in False. I also tried third_game = chess.pgn.read_game(pgn) and, sure enough, that gave me the (presumably) third game from the pgn file. How is this possible? If I'm using the same function, shouldn't it return the same result every time for the same file? Why should the variable name matter (I'm assuming it does), unless programming languages changed overnight or there's a random function built in somewhere?
The only way that this can be possible is if some data is changing. This could be data that chess.pgn.read_game reads from elsewhere, or could be something to do with the object you're passing in.
In Python, file-like objects store where they are in the file. If they didn't, then this code:
with open("/home/wizzwizz4/Documents/TOPSECRET/diary.txt") as f:
    line = f.readline()
    while line:
        print(line, end="")
        line = f.readline()
would just print the first line over and over again. When data's read from a file, Python won't give you that data again unless you specifically ask for it.
There are multiple games in this file, stored one after another. You're passing in the same file each time, but you're not resetting the read cursor to the beginning of the file (f.seek(0)) or closing and reopening the file, so it's going to read the next data available, i.e., the next game.
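The cursor behaviour can be seen with an in-memory file. This sketch uses io.StringIO with made-up contents instead of a real PGN file, so it only illustrates the file position, not python-chess itself.

```python
import io

# In-memory stand-in for a file with two records, one per line.
f = io.StringIO("game 1\ngame 2\n")

first = f.readline()
second = f.readline()  # continues from where the previous read stopped
f.seek(0)              # move the cursor back to the start
again = f.readline()   # the first record is read again

print(repr(first), repr(second), repr(again))
```

The same call returns different data each time because the file object remembers its position; only after seek(0) does the first record come back.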
