Adding str to every item in list python3 - python-3.x

Introduction
Since I'm starting to get familiar with Scrapy, I'm trying to crawl some links from random web pages.
Problem
The links I'm saving through my items.py item are written without "https://", but I need them as hyperlinks.
So I want to add "https://" in front of the actual links so that they are formatted as hyperlinks.
My Code
def parse_target_page(self, response):
    card = response.xpath('//div[@class="text-center artikelbox"]')
    for a in card:
        items = LinkcollectItem()
        link = ('a/@href')
        items['Title'] = a.xpath('.//h5[@class="title"]/a/text()').get()
        items['Link'] = a.xpath('.//h5[@class="title"]/a/@href').get()
        yield items
I tried inserting my string at index 0, but it didn't work.
My output should write all links as hyperlinks to the CSV file.

If you only need to add https:// to each link, you can do the following:
link = a.xpath('.//h5[@class="title"]/a/@href').get()
items['Link'] = "https://" + link if link else link
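The same idea can be wrapped in a small helper so that links which already carry a scheme are not prefixed twice. This is only a sketch: add_scheme is a hypothetical name, and the handling of a leading "//" is an assumption about how the scraped links might look.

```python
def add_scheme(link, scheme="https"):
    """Prefix link with scheme://, leaving empty or already-absolute links alone."""
    if not link:
        # .get() returns None when the XPath matched nothing; pass that through.
        return link
    if link.startswith(("http://", "https://")):
        return link
    # Assumption: protocol-relative links ("//host/path") may also occur.
    return f"{scheme}://" + link.lstrip("/")


links = ["example.com/a", "https://example.com/b", None, "//example.com/c"]
print([add_scheme(link) for link in links])
```

In the spider this would be used as items['Link'] = add_scheme(link).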

Related

how to copy to clipboard text between h2 tags in selenium python

What I'm trying to do here is get an email verification code. I log in to the email account, select and copy the 6-digit code from the email, and paste it into the other tab. Everything works except that I cannot double-click to select the 6-digit code and copy it to the clipboard. The code sits inside an h2 tag with nothing else around it, like this: 639094, where 639094 is the code I need to copy. How can I find the element and copy the code? A screenshot of the email and the Chrome element inspector is included below in case it helps.
this is the code that I use to copy the code:
codeID = driver.find_element(By.XPATH,
                             '//table[@class="main"]//tr//td//p//h2').text
ActionChains = ActionChains(driver)
ActionChains.double_click(codeID).perform()
time.sleep(2)
codeID.send_keys(Keys.CONTROL + 'c')
text = pyperclip.paste()
print(text)
screen shot
The element is found, but it seems it cannot be copied. The error is "Element is not reachable by keyboard". If I let everything run automatically up to the point where the element is selected by the double click and then copy it with my actual keyboard, the code is copied; when Selenium tries to copy it, I get the error above. The code I use to double-click the element is:
codeID = driver.find_element(By.XPATH, '//*[@id="message-htmlpart1"]/div/table/tbody/tr/td[2]/div/table/tbody/tr/td/table/tbody/tr/td/h2')
ActionChains = ActionChains(driver)
ActionChains.double_click(codeID).perform()
time.sleep(2)
and the code that does the copy is:
codeID.send_keys(Keys.CONTROL + 'c')
text = pyperclip.paste()
print(text)
This is the part where the error occurs:
codeID.send_keys(Keys.CONTROL + 'c')
text = pyperclip.paste()
print(text)
For some reason it says "Element is not reachable by keyboard", even though the digits of the code are selected.
If I use print(text), they are also printed in the console.
driver.find_element(By.XPATH, '//table[@class="main"]//tr//td//h2').text will give you the text of the code directly.
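Since .text already returns the code as a string, the double click and Ctrl+C may not be needed at all. The sketch below reads the element text and pulls out the six digits. The names extract_code and read_verification_code are hypothetical, and the XPath is the one from the answer; "xpath" is the string value behind By.XPATH, used here so that the snippet has no import of its own from Selenium.

```python
import re


def extract_code(text):
    """Return the first standalone 6-digit number in text, or None."""
    match = re.search(r"\b\d{6}\b", text)
    return match.group(0) if match else None


def read_verification_code(driver, xpath='//table[@class="main"]//tr//td//h2'):
    """Locate the h2 element and return the code it contains."""
    # "xpath" is the value of selenium's By.XPATH constant.
    element = driver.find_element("xpath", xpath)
    return extract_code(element.text)
```

If the value really must end up on the clipboard, pyperclip.copy(code) can place it there without any keyboard interaction with the page.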
Let's work through this problem together.
For the first part:
Take the XPath you have and paste it into XPath Helper (a Google Chrome extension).
=> If it finds the element, then the problem is in your code.
=> If it doesn't, then the element is probably inside a frame or a table.
In the frame case, the solution is to switch the driver to that frame and locate the element again from inside it.
Example:
iframe = driver.find_element(By.XPATH, '//iframe')
driver.switch_to.frame(iframe)
Now try to locate the element again, starting from the iframe.
For the second part:
You say it's a table, so you need to give the tr[i] and td[j] positions of the row and cell where the number is located in order to get it.
Example (replace i and j with the actual row and column numbers):
d = driver.find_element(By.XPATH, "//tr[i]/td[j]").text
I hope that helps.
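The row and column positions can be filled into the XPath with a small helper. This is a sketch only: cell_xpath and the default "//table" prefix are hypothetical and would need adjusting to the real page.

```python
def cell_xpath(row, column, table_xpath="//table"):
    """Build an XPath for the cell at the given 1-based row and column."""
    if row < 1 or column < 1:
        # XPath positions start at 1, not 0.
        raise ValueError("row and column must be 1 or greater")
    return f"{table_xpath}//tr[{row}]/td[{column}]"


print(cell_xpath(2, 3))
```

The result can then be passed to driver.find_element(By.XPATH, ...) as in the example above.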

Click Java Button in URL with Excel VBA

I'm trying to download a table from a company website. I can download the first page, but I cannot jump to the second page.
HTML CODE for Page Number
1
The page numbers are inside the table and increase one by one. When page one is active, its link has no visible href and is shown as
<span>1</span>
I use the code below to click the page link, but I have not succeeded.
Set doc = ie.document
i = 0
For Each link In doc.Links
    'doing downloading stuff here
    i = i + 1
    link.innerText = "javascript:__doPostBack('ctl00$View$gv','Page$" & i
    link.Click
Next
When I check the page, there is also a JavaScript function.
JavaScript CODE
//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
After the first page is downloaded, the macro clicks irrelevant page links, and not even the same ones each time.
Extra Question
Also, is there any way to get the href values instead of the innerText in the code below?
User Name
Thanks
Open any page via a URL parameter:
Check whether you can open any page directly via a URL parameter for the page number, like this:
https://yourUrl.com?page=2
Then walking through all pages is very easy. The only thing you must determine first is the number of pages, or some HTML that appears only when you try to open a page that does not exist.
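The loop described above can be sketched as follows, in Python rather than VBA. Everything here is illustrative: page_url, walk_pages, the fetch callback and the missing_marker text are hypothetical, and the parameter name page is taken from the example URL, which the real site may not support.

```python
from urllib.parse import urlencode


def page_url(base_url, page):
    """Build the URL for one page number."""
    return f"{base_url}?{urlencode({'page': page})}"


def walk_pages(fetch, base_url, missing_marker, max_pages=100):
    """Fetch pages 1, 2, 3, ... until one contains the 'page not available' marker."""
    pages = []
    for page in range(1, max_pages + 1):
        html = fetch(page_url(base_url, page))
        if missing_marker in html:
            # The marker only appears on pages that do not exist, so stop here.
            break
        pages.append(html)
    return pages
```

fetch is any function that takes a URL and returns the page HTML as a string, so the same loop works regardless of how the pages are downloaded.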
How to get href
You can't click innerText. It is only a string. You ask for a way to get the href, and that is the right idea. If you want to get the href of the first a-tag, you can use this:
'Part of your code to open the page
'...
Dim nodeFirstLink as Object
Set nodeFirstLink = doc.getElementsByTagName("a")(0)
Debug.Print nodeFirstLink.href
'More of your code
'...
Here is an example of how to change the href.
I don't know, however, whether this also works with JS links:
Sub ChangeHref()
    Dim htmlDoc As Object
    Dim nodeFirstLink As Object
    'Set a short HTML Document for this example
    Set htmlDoc = CreateObject("HtmlFile")
    htmlDoc.body.innerHTML = "<a href='https://amazon.com'>Amazon</a>"
    Set nodeFirstLink = htmlDoc.getElementsByTagName("a")(0) 'Get the first Link
    Debug.Print nodeFirstLink.outerhtml 'The HTML of the first link in the html document
    Debug.Print nodeFirstLink.href 'Only the href of the first link in the html document
    nodeFirstLink.href = "https://ebay.com" 'Changing the href in the first link
    Debug.Print nodeFirstLink.outerhtml 'The innertext is still Amazon
    Debug.Print nodeFirstLink.href 'The href is the new one
End Sub

How to use same django filter class(filters.py) in two different views

I have a filter class defined below.
filters.py
class CTAFilter(django_filters.FilterSet):
    id = django_filters.NumberFilter(label="DSID")
    class Meta:
        model = CTA
        fields = ['id', 'EmailID','id','Shift_timing']
Now I want to use this CTAFilter both in the normal template (table data) view and in the download view.
I have observed that it works fine in the normal render view, but when I use it in my download view it does not work and I get all of the model's data in the .xls file.
Please see the question I posted earlier:
how to use Django filtered class data to 2 seperate view
I have not been able to resolve this problem. I have tried to check whether I can define the filter globally so that it works for all views (like a REST API).
Is there any way to make my download view a child class of the normal render view, so that I can reuse the code below from the parent view (where it works fine)?
cta_list = CTA.objects.all()
cta_filter = CTAFilter(request.GET, queryset=cta_list)
allcta = cta_filter.qs
a> Normal view, where the filter works fine.
def retrievecta_view(request):
    if request.method == 'GET':
        allcta = CTA.objects.all()
        allcta1 = allcta
        allctagen = allcta1.filter(Shift_timing__exact='General')
        allctamor = allcta1.filter(Shift_timing__exact='Morning')
        allctaeve = allcta1.filter(Shift_timing__exact='Evening')
        allctatotal = allcta1.filter(Shift_timing__exact='Total')
        # For filtering using 'django_filters',
        cta_list = CTA.objects.all()
        cta_filter = CTAFilter(request.GET, queryset=cta_list)
        allcta = cta_filter.qs
        paginator = Paginator(allcta, 50)
        page_number = request.GET.get('page')
        try:
            allcts = paginator.page(page_number)
        except PageNotAnInteger:
            allcts = paginator.page(1)
        except EmptyPage:
            allcts = paginator.page(paginator.num_pages)
        return render(request, 'abcd/cta.html', {'allcta': allcta, 'cta_filter': cta_filter, 'allcta1': allcta1,
                                                 'allctagen': allctagen, 'allctamor': allctamor,
                                                 'allctaeve': allctaeve,
                                                 'allctatotal': allctatotal})
b> Download view where I am trying to use the same filter but it is giving me all records.
def exportcts_data(request):
    response = HttpResponse(content_type='application/ms-excel')
    response['Content-Disposition'] = 'attachment; filename="CTA_ShiftTiming.xls"'
    wb = xlwt.Workbook(encoding='utf-8')
    ws = wb.add_sheet('CTS_ShiftChange Data')  # creates a sheet named 'CTS_ShiftChange Data'
    # Sheet header, first row
    row_num = 0
    font_style = xlwt.XFStyle()
    font_style.font.bold = True
    columns = ['id','idk','Shift_timing','EmailID','Vendor_Company','Project_name','SerialNumber','Reason','last_updated_time']
    for col_num in range(len(columns)):
        ws.write(row_num, col_num, columns[col_num], font_style)  # at 0 row 0 column
    # Sheet body, remaining rows
    font_style = xlwt.XFStyle()
    cts_list = CTA.objects.all()
    cts_filter = CTAFilter(request.GET, queryset=cts_list)
    allcts = cts_filter.qs
    rows = allcts.values_list('id', 'idk', 'Shift_timing', 'EmailID', 'Vendor_Company', 'Project_name',
                              'SerialNumber', 'Reason', 'last_updated_time')
    for row in rows:
        row_num += 1
        for col_num in range(len(row)):
            ws.write(row_num, col_num, row[col_num], font_style)
    wb.save(response)
    return response
I'm not quite following why you want to have separate view for downloads which ultimately should be rendering the same data as the normal view if they are using the same filter. Maybe it is just my misunderstanding so I'm not sure if this will help you but let's see.
First off let me explain a little background. This is a task management application and in there I have an html page where the person logged in can view all of their completed tasks. (Nice and simple.) However the user may have tasks from many different projects so I have created a dropdown list that allows them to filter by a single project. They may also want to only see a specific period of tasks so I have allowed them to set a date range by providing a start and end date. (Nothing startling or earth shattering here.) Once the parameters are set, the user clicks a search button and the filtered results are displayed. The page also has an Export button which downloads the results of the filtered list to a .xls spreadsheet.
So how do I do this? Well, first of all, I am using Django-Tables2 for rendering my tables. I simply predefine the table in tables.py and pass it the data I want from my views, and it takes care of everything. Therefore my view code is minimal and very simple, and looks like this.
from django.shortcuts import redirect, render
from django_tables2 import RequestConfig
from django_tables2.export.export import TableExport
from .tables import CompletedTable
# CompletedListFilter (filters.py) and the Task model are defined elsewhere in the app; their imports are not shown here.

def completedlist(request, page='0', name=''):
    #Check to see if we have clicked a button inside the form
    if request.method == 'POST':
        return redirect ('tasks:tasklist')
    else:
        # Pre-filtering of user and Master = True etc is done in the MasterListFilter in filters.py
        # Then we compile the list for Filtering by.
        f = CompletedListFilter(request.GET, queryset=Task.objects.all(),request=request)
        # Then we apply the complete list to the table, configure it and then render it.
        completedtable = CompletedTable(f.qs)
        rows = len(completedtable.rows)
        if int(page) > 0:
            RequestConfig(request, paginate={'page': page, 'per_page': 10}).configure(completedtable)
        else:
            RequestConfig(request, paginate={'page': 1, 'per_page': 10}).configure(completedtable)
        export_format = request.GET.get('_export', None)
        if TableExport.is_valid_format(export_format):
            exporter = TableExport(export_format, completedtable)
            return exporter.response('Completed Tasks.{}'.format(export_format))
        return render (request,'tasks/completedlist.html',{'completedtable': completedtable, 'filter': f, 'rows': rows})
As you can see, every time the user hits either the search or export buttons, I am recompiling the queryset in variable f with the following line:
f = CompletedListFilter(request.GET, queryset=Task.objects.all(),request=request)
I have predefined the .xls format in the html page with this code:
<button class="btn btn-primary btn-xs" name="_export" value="xls" type="submit">Export</button>
So then I can test to see if the user clicked the Export button or not by getting the value of _export from the request like this:
export_format = request.GET.get('_export', None)
If the user did not click the Export button, export_format will default to None. If they did, it will be xls, as defined in the HTML. Then I simply either export the data in line with the filters set by the user, or render the page with the same filtered list of data, like this:
if TableExport.is_valid_format(export_format):
    exporter = TableExport(export_format, completedtable)
    return exporter.response('Completed Tasks.{}'.format(export_format))
return render (request,'tasks/completedlist.html',{'completedtable': completedtable, 'filter': f, 'rows': rows})
So there you have it. As you say your filter is working for the normal view I have not detailed my filter as that would seem to be unnecessary.
Maybe this solution is too simplistic for your requirements and yes, before I get shot down by other developers, there are several limitations, such as 'What if the user wants to use something other than .xls?' or 'What if they want to export more than one Project at a time?' Like everything, there is always room for improvement but when I'm bashing my head with an issue, I often find it helps to strip things back to basics and see what comes from that.
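The approach above relies on the export request carrying the same GET parameters as the search request, because both the table and the export are built from request.GET. Where the export is triggered by a plain link instead of a button inside the filter form, the link has to carry those parameters itself. A minimal sketch of building such a link is shown below; export_url and the /cta/ path are hypothetical, while _export and xls come from the answer. Whether a missing query string is what causes the unfiltered download in the question is an assumption, not something the thread confirms.

```python
from urllib.parse import urlencode


def export_url(path, get_params, export_format="xls"):
    """Return path plus the current filter parameters and an _export flag."""
    # .items() works for a plain dict and for Django's request.GET
    # (a QueryDict yields the last value of each key).
    params = {key: value for key, value in get_params.items()}
    # Pagination should not limit the export, so the page number is dropped.
    params.pop("page", None)
    params["_export"] = export_format
    return f"{path}?{urlencode(params)}"


print(export_url("/cta/", {"Shift_timing": "Morning", "page": "2"}))
```

In a view this would be called as export_url(request.path, request.GET) and the result passed to the template for the download link.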

Openpyxl returns wrong hyperlink address after delete_rows()

Problem: I have a program that scrapes Twitter and returns the results in an excel file. Part of each entry is a column containing a hyperlink to the Tweet and image included in the Tweet if applicable. Entries and hyperlinks work fine except when I run the following code to remove duplicate posts:
#Remove duplicate posts.
values = []
i = 2
while i <= sheet.max_row:
    if sheet.cell(row=i,column=3).value in values:
        sheet.delete_rows(i,1)
    else:
        values.append(sheet.cell(row=i,column=3).value)
        i+=1
After running the duplicate removal snippet the hyperlinks point to what I assume is the offset of deleted entries. Here is the code for creating a Twitter entry:
sheet.cell(row=row, column=8).hyperlink = "https://twitter.com/"+str(tweet.user.screen_name)+"/status/"+str(tweet.id)
sheet.cell(row=row, column=8).style = "Hyperlink"
Expected Results: Should be able to remove duplicate entries and keep the hyperlink pointed to the correct address.
For whatever reason, the hyperlinks point to the correct addresses when I change the code to this:
sheet.cell(row=row, column=8).value = "https://twitter.com/"+str(tweet.user.screen_name)+"/status/"+str(tweet.id)
sheet.cell(row=row, column=8).style = "Hyperlink"
However, the link then requires a rapid double click to work as a hyperlink in the Excel sheet, versus a single click when it is inserted using .hyperlink.
So fixed, but not fixed.
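One further option, not confirmed in this thread, is to store the link as an Excel HYPERLINK formula in the cell value. openpyxl writes a string that starts with "=" as a formula, and because the formula is the cell's value it should move with the cell when rows are deleted. The sketch below only builds the formula string; hyperlink_formula and tweet_url are hypothetical helper names, and the URL layout is the one from the question.

```python
def tweet_url(screen_name, tweet_id):
    """Build the tweet address the same way the question does."""
    return "https://twitter.com/" + str(screen_name) + "/status/" + str(tweet_id)


def hyperlink_formula(url, label=None):
    """Return an Excel HYPERLINK formula for url, with an optional display label."""
    def quote(text):
        # Double quotes inside an Excel string literal are escaped by doubling them.
        return '"' + str(text).replace('"', '""') + '"'

    if label is None:
        return f"=HYPERLINK({quote(url)})"
    return f"=HYPERLINK({quote(url)},{quote(label)})"


print(hyperlink_formula(tweet_url("some_user", 12345)))
```

The result would be assigned with sheet.cell(row=row, column=8).value = hyperlink_formula(...), keeping the "Hyperlink" style line as before.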

Google search next pages using selenium

I'm trying to automate clicking through to the next page of Google search results after I have visited the links on the 1st and 2nd results pages.
I've so far been able to do the following:
Spin up the chrome browser
Go to the Google webpage
Type in the search words
Click on the search icon
Go into the links on the 1st and 2nd google page
See my code below:
from time import sleep
from selenium import webdriver
from parsel import Selector
from selenium.webdriver.common.keys import Keys
#path to the chromedriver
driver = webdriver.Chrome('/Users\my_path/chromedriver')
driver.get('https://www.gooogle.com')
#locate search form by name
search_query = driver.find_element_by_name('q')
#Input search words
search_query.send_keys('X-Men')
#Simulate return key
search_query.send_keys(Keys.RETURN)
Xmen_urls = driver.find_elements_by_class_name('iUh30')
for page in range(0,3):
    Xmen_urls = [url.text for url in Xmen_urls]
    #loop to iterate through all links in the google search query
    for Xmen_url in Xmen_urls:
        driver.get(Xmen_url)
        sel = Selector(text = driver.page_source)
    #Go back to google search
    driver.get('https://www.gooogle.com')
    #locate search form by name
    search_query = driver.find_element_by_name('q')
    #Input search words
    search_query.send_keys('X-Men')
    #Simulate return key
    search_query.send_keys(Keys.RETURN)
    #find next page icon in Google search
    Next_Google_page = driver.find_element_by_link_text("Next").click()
    page += 1
When I'm done collecting the links on the 2nd search page, how do I tell the algorithm to continue from the 2nd search page and not from the 1st (this would let me go beyond 2 pages)?
I know it's a 'for loop' and syntax re-arranging I'm missing somewhere but my brain is frozen at this point.
I saw this page: How to click the next link in google search results? but it only helps if I'm not navigating away from the google search page
What am I doing wrong?
There are two ways I can see:
Open each X-Men url in a separate window using window_handles, collect page_source, close the window and switch back to the original window.
driver.execute_script("window.open(arguments[0], 'new_window')", Xmen_url)
driver.switch_to.window(driver.window_handles[1])
sel = Selector(text = driver.page_source)
driver.close()
driver.switch_to.window(driver.window_handles[0])
The code above may not work exactly, but something to that effect.
The other way is to simulate a number of clicks on NEXT at the beginning of your FOR loop using a loop:
a = 0
while a <= page:
    driver.find_element_by_xpath("//*[contains(local-name(), 'span') and contains(text(), 'Next')]").click()
    a = a+1
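A variation on the same idea, offered only as a sketch, is to stay on the search results while collecting: gather the URLs from each results page, click Next, and visit the collected URLs afterwards, so the position in the results is never lost. collect_result_urls is a hypothetical name; the class name iUh30 and the link text Next come from the question and may not match Google's current markup. "class name" and "link text" are the string values behind By.CLASS_NAME and By.LINK_TEXT.

```python
def collect_result_urls(driver, pages=3, result_class="iUh30", next_text="Next"):
    """Gather result URLs from consecutive result pages without leaving them."""
    urls = []
    for page_index in range(pages):
        # Read the visible URL text of every result on the current page.
        for element in driver.find_elements("class name", result_class):
            if element.text:
                urls.append(element.text)
        # Move on to the next results page, except after the last one.
        if page_index < pages - 1:
            driver.find_element("link text", next_text).click()
    return urls
```

After the call, each collected URL can be opened with driver.get(url) and parsed with Selector(text=driver.page_source) as in the question.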

Resources