How to show separate values from appended list - python-3.x

I'm trying to display values from an appended list of items scraped with bs4. Currently, my code only returns the whole set of data, but I'd like to display separate values from it. All I get now is:
NameError: value is not defined.
How can I do that?
data = []
for e in soup.select('div:has(> div > a h3)'):
    data.append({
        'title': e.h3.text,
        'url': e.a.get('href'),
        'desc': e.next_sibling.text,
        'email': re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', e.parent.text).group(0) if
                 re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', e.parent.text) else None
    })

data

title = print(title)  # name error
desc = print(desc)    # name error
email = print(email)  # name error

The main issue is that you try to reference the keys on their own, without taking into account that data is a list of dicts.
So you have to pick your dict from data by index if you want to print a specific one:
print(data[0]['title'])
print(data[0]['desc'])
print(data[0]['email'])
Alternatively, just iterate over data and print/operate on the values of each dict:
for d in data:
    print(d['title'])
    print(d['desc'])
    print(d['email'])
or
for d in data:
    title = d['title']
    desc = d['desc']
    email = d['email']
    print(f'print title only: {title}')
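If you want all values of one field collected at once, a list comprehension over data also works (a minimal sketch; the variable names are just illustrative):
titles = [d['title'] for d in data]   # every title in one list
emails = [d['email'] for d in data]   # may contain None where no email was matched
print(titles)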

You can do it like this:
for e in soup.select('div:has(> div > a h3)'):
    title = e.h3.text
    url = e.a.get('href')
    desc = e.next_sibling.text
    email = re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', e.parent.text).group(0) if re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', e.parent.text) else None
    print(title)
    print(desc)
    print(email)
    print(url)

Related

I wrote code using pandas with Python. I want to convert the output into a new dataframe, separated into two columns

I went from one data frame to another and performed calculations on the column next to the name for each unique person. Now I have an output of the names with the calculations next to them, and I want to break it into two columns, put it in a data frame, and print it. I'm thinking I should put the output of the entire for loop into a dictionary and then a data frame, but I'm not too sure how to do that. I am a beginner at this and would really appreciate people's help. See the code for the for loop piece below:
names = df['Participant Name, Number'].unique()
for name in names:
    unique_name_df = df[df['Participant Name, Number'] == name]
    badge_types = unique_name_df['Dosimeter Location'].unique()
    if 'Collar' in badge_types:
        collar = unique_name_df[unique_name_df['Dosimeter Location'] == 'Collar']['Total DDE'].astype(float).sum()
    if 'Chest' in badge_types:
        chest = unique_name_df[unique_name_df['Dosimeter Location'] == 'Chest']['Total DDE'].astype(float).sum()
    if len(badge_types) == 1:
        if 'Collar' in badge_types:
            value = collar
        elif 'Chest' in badge_types:
            value = chest
    print(name, value)
If you expect len(badge_types) == 1 in all cases, try:
pd.DataFrame( df.groupby(['Participant Name, Number'])['Total DDE'].sum() )
Otherwise, to get the sum per Dosimeter Location, add it to the groupby:
pd.DataFrame( df.groupby(['Participant Name, Number', 'Dosimeter Location'])['Total DDE'].sum() )
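If the participant name should end up as a regular column next to the summed values (the two columns the question asks for), one possible variant is the sketch below, assuming 'Total DDE' still needs the same float conversion as in the original loop:
summed = (
    df.assign(**{'Total DDE': df['Total DDE'].astype(float)})
      .groupby('Participant Name, Number')['Total DDE']
      .sum()
      .reset_index()   # turns the group key back into an ordinary column
)
print(summed)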

access list values with index python

So, I encountered something really interesting. I split a string with .splitlines().
When I print that list, it works just fine and outputs values in this form: ['Freitag 24.11', '08:00'].
But if I try to access any value of that list by an index, e.g. list[0], it should give me the first value of that list, in this case 'Freitag 24.11'.
splittedParameters = tag.text.splitlines()
print(splittedParameters[0])
So as explained, if I don't use the index [0] it works just fine and outputs the whole list. But in that form with the index it says: "IndexError: list index out of range".
The whole code:
from requests_html import HTMLSession
startDate = None
endDate = None
summary = None
date = ''
splittedDate = ''
url = 'ANY_URL'
session = HTMLSession()
r = session.get(url)
r.html.render()
aTags = r.html.find("a.ui-link")
for tag in aTags:
    splittedParameters = tag.text.splitlines()
    print(splittedParameters[0])
splittedParameters is returning nothing, as in it's given no value. There's no text in aTags:
print(aTags)
>>> [<Element 'a' class=('logo', 'ui-link') href='index.php'>]
After reading your comment I tried:
for tag in aTags:
    print(tag.text)
    splittedParameters = tag.text.splitlines()
I got nothing printed
I found the error.
The problem was that in the first run of the for loop the list "splittedParameters" is empty.
if len(splittedParameters) > 0:
    print(splittedParameters[0])
This will work.
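Putting that guard into the original loop, a minimal sketch (the logo link found above has no text, so its list of lines is empty):
for tag in aTags:
    splittedParameters = tag.text.splitlines()
    if splittedParameters:   # skip elements with no text
        print(splittedParameters[0])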

Update Sqlite w/ Python: InterfaceError: Error binding parameter 0 and None Type is not subscriptable

I've scraped some websites and stored the html info in a sqlite database. Now, I want to extract and store the email addresses. I'm able to successfully extract and print the id and emails. But, I keep getting TypeError: "'NoneType' object is not subscriptable" and "sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type" when I try to update the database with these new email addresses.
I've verified that the data types I'm using in the update statement are the same as in my database (id is class int and email is str). I've googled a bunch of different examples and mucked around with the syntax a lot.
I also tried removing the Where Clause in the update statement but got the same errors.
import sqlite3
import re
conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()
x = cur.execute('SELECT id, html FROM Pages WHERE html is NOT NULL and email is NULL ORDER BY RANDOM()').fetchone()
#print(x)#for testing purposes
for row in x:
    row = cur.fetchone()
    id = row[0]
    html = row[1]
    email = re.findall(b'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+', html)
    #print(email)#testing purposes
    if not email:
        email = 'no email found'
    print(id, email)
    cur.execute('''UPDATE pages SET email = ? WHERE id = ? ''', (email, id))
conn.commit
I want the update statement to update the database with the extracted email addresses for the appropriate row.
There are a few things going on here.
First off, you don't want to do this:
for row in x:
    row = cur.fetchone()
If you want to iterate over the results returned by the query, you should consider something like this:
for row in cur.fetchall():
    id = row[0]
    html = row[1]
    # ...
To understand the rest of the errors you are seeing, let's take a look at them step by step.
TypeError: "'NoneType' object is not subscriptable":
This is likely generated here:
row = cur.fetchone()
id = row[0]
Cursor.fetchone returns None if the executed query doesn't match any rows or if there are no rows left in the result set. The next line, then, is trying to do None[0] which would raise the error in question.
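A minimal sketch of guarding against that, reusing the names from the question:
row = cur.fetchone()
if row is None:
    # no row matched the query (or the result set is exhausted); don't index into it
    print('nothing to process')
else:
    id, html = row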
sqlite3.InterfaceError: Error binding parameter 0 - probably unsupported type:
re.findall returns a list of non-overlapping matches, not an individual match. There's no support for binding a Python list to a sqlite3 text column type. To fix this, you'll need to get the first element from the matched list (if it exists) and then pass that as your email parameter in the UPDATE.
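A sketch of that fix, assuming the html column comes back as bytes (as the b'...' pattern in the question suggests); if it is a str, use a str pattern instead, as the answer below recommends:
matches = re.findall(rb'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+', html)
email = matches[0].decode() if matches else 'no email found'   # a single str, not a list
cur.execute('UPDATE Pages SET email = ? WHERE id = ?', (email, id))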
.findall() returns a list.
You want to iterate over that list:
for email in re.findall(..., str(html)):
    print(id, email)
    cur.execute(...)
Not sure what's going on with that b'[a-z...' expression.
I recommend you use a raw string instead: r'[a-z...'.
It handles regex backslashes nicely.
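Written out with the question's full pattern as a raw string (a sketch, assuming html has been converted to str as above):
pattern = r'[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+'
for email in re.findall(pattern, str(html)):
    print(id, email)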

Extracting text embedded within <td> under some class

Extract text within <td> under a specific class and store it in respective lists.
I am trying to extract data from "https://www.airlinequality.com/airline-reviews/vietjetair/page/1/". I am able to extract the summary, review and user info, but unable to get the tabular data. The tabular data needs to be stored in respective lists. Different user reviews have different numbers of ratings. Given in the code below are a couple of things I tried; all of them give empty lists.
I extracted the review using XPath:
review = driver.find_elements_by_xpath('//div[@class="tc_mobile"]//div[@class="text_content "]')
Following are some XPaths which give an empty list. Here I am trying to extract the data/text corresponding to "Type Of Traveller".
tot = driver.find_elements_by_xpath('//div[@class="tc_mobile active"]//div[@class="review-stats"]//table[@class="review-ratings"]//tbody//tr//td[@class="review-rating-header type_of_traveller "]//td[@class="review-value "]')
tot1 = driver.find_elements_by_xpath('//div[@class="tc_mobile"]//div[@class="review-stats"]//table//tbody//tr//td[@class="review-rating-header type_of_traveller "]//td[@class="review-value "]')
tot2 = driver.find_elements_by_xpath('//div//div/table//tbody//tr//td[@class="review-rating-header type_of_traveller "]//td[@class="review-value "]')
This code should do what you want. At a basic level, all it does is follow the DOM structure and iterate over each element at that layer.
It extracts the values into a dictionary for each review and then appends that to a results list:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.airlinequality.com/airline-reviews/vietjetair/page/1/")

review_tables = driver.find_elements_by_xpath('//div[@class="tc_mobile"]//table[@class="review-ratings"]//tbody')  # Gets all the review tables
results = list()  # A list of all rating results

for review_table in review_tables:
    review_rows = review_table.find_elements_by_xpath('./tr')  # Gets each row from the table
    rating = dict()  # Holds the rating result
    for row in review_rows:
        review_elements = row.find_elements_by_xpath('./td')  # Gets each element from the row
        if review_elements[1].text == '12345':  # Logic to extract star rating as int
            rating[review_elements[0].text] = len(review_elements[1].find_elements_by_xpath('./span[@class="star fill"]'))
        else:
            rating[review_elements[0].text] = review_elements[1].text
    results.append(rating)  # Add rating to results list
Sample entry of review data in results list:
{
    "Date Flown": "January 2019",
    "Value For Money": 2,
    "Cabin Staff Service": 3,
    "Route": "Ho Chi Minh City to Bangkok",
    "Type Of Traveller": "Business",
    "Recommended": "no",
    "Seat Comfort": 3,
    "Cabin Flown": "Economy Class",
    "Ground Service": 1
}
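If, as the question asks, each field should end up in its own list, one possible post-processing step over results (a sketch; fields missing from a review come back as None):
traveller_types = [r.get("Type Of Traveller") for r in results]
seat_comfort = [r.get("Seat Comfort") for r in results]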

Can't update value in dictionary - dict. changed size

I'm looping through a CSV, and would like to change "Gotham" to "Home". I've tried a couple ways, after searching around online, but can't seem to get it to work.
import csv

csv_file = "test.csv"

def process_csv(file):
    headers = []
    data = []
    csv_data = csv.reader(open(file))
    for i, row in enumerate(csv_data):
        if i == 0:
            headers = row
            continue
        field = []
        for i in range(len(headers)):
            field.append((headers[i], row[i]))
        data.append(field)
    return data

def create_merge_fast(city, country, contact):
    lcl = locals()
    ## None of these do what I'd think - if city is "Gotham" change it to "Home"
    for key, value in lcl.items():
        if value == "Gotham":
            lcl[value] = "Home"
        print(key, value)
    for value in lcl.values():
        if value == "Gotham":
            lcl[value] = "Home"
        print(value)

def set_fields_create_doc(data):
    city = data[4][1]
    country = data[6][1]
    contact = data[9][1]
    create_merge_fast(city, country, contact)

data = process_csv(csv_file)
for i in data:
    set_fields_create_doc(i)
I always seem to get
RuntimeError: dictionary changed size during iteration
right after
Gotham
is printed...
You cannot change your dict while iterating over it - the moment you change its size, the iterator in the for..in loop becomes invalid, so it raises the error from the title.
You can simply fix that by stopping the iteration once the match is found and the change has been made to the dict, i.e.:
for key, value in lcl.items():
    if value == "Gotham":
        lcl[key] = "Home"
        break  # exit here
    print(key, value)
However, if it's possible to have multiple items that match this condition, simply breaking away won't work, but you can instead freeze the key list before you start iterating through it:
for key in list(lcl.keys()):
    if lcl[key] == "Gotham":
        lcl[key] = "Home"
Change
lcl[value] = "Home"
to
lcl[key] = "Home"
The former will actually create a new entry in your dictionary with the key as "Gotham" and value as "Home", rather than modifying the existing value from "Gotham" to "Home".
So, you want:
for key in lcl:
    if lcl[key] == "Gotham":
        lcl[key] = "Home"
    print(key, lcl[key])
Also, you don't need the second loop.
