Extracting text embedded within <td> under some class - python-3.x

Extract text within <td> under a specific class and store it in respective lists
I am trying to extract data from "https://www.airlinequality.com/airline-reviews/vietjetair/page/1/". I am able to extract the summary, review and user info, but I am unable to get the tabular data. The tabular data needs to be stored in respective lists. Different user reviews have different numbers of ratings. Below are a couple of things I tried; all of them return empty lists.
Extracted the review using XPath:
review = driver.find_elements_by_xpath('//div[@class="tc_mobile"]//div[@class="text_content "]')
The following XPaths return empty lists. Here I am trying to extract the text corresponding to "Type Of Traveller":
tot = driver.find_elements_by_xpath('//div[@class="tc_mobile active"]//div[@class="review-stats"]//table[@class="review-ratings"]//tbody//tr//td[@class="review-rating-header type_of_traveller "]//td[@class="review-value "]')
tot1 = driver.find_elements_by_xpath('//div[@class="tc_mobile"]//div[@class="review-stats"]//table//tbody//tr//td[@class="review-rating-header type_of_traveller "]//td[@class="review-value "]')
tot2 = driver.find_elements_by_xpath('//div//div/table//tbody//tr//td[@class="review-rating-header type_of_traveller "]//td[@class = "review-value "]')

This code should do what you want. All the code is doing at a basic level is following the DOM structure and then iterating over each element at that layer.
It extracts the values into a dictionary for each review and then appends that to a results list:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.airlinequality.com/airline-reviews/vietjetair/page/1/")
review_tables = driver.find_elements_by_xpath('//div[@class="tc_mobile"]//table[@class="review-ratings"]//tbody') # Gets all the review tables
results = list() # A list of all rating results
for review_table in review_tables:
    review_rows = review_table.find_elements_by_xpath('./tr') # Gets each row from the table
    rating = dict() # Holds the rating result
    for row in review_rows:
        review_elements = row.find_elements_by_xpath('./td') # Gets each element from the row
        if review_elements[1].text == '12345': # Logic to extract star rating as int
            rating[review_elements[0].text] = len(review_elements[1].find_elements_by_xpath('./span[@class="star fill"]'))
        else:
            rating[review_elements[0].text] = review_elements[1].text
    results.append(rating) # Add rating to results list
Sample entry of review data in results list:
{
    "Date Flown": "January 2019",
    "Value For Money": 2,
    "Cabin Staff Service": 3,
    "Route": "Ho Chi Minh City to Bangkok",
    "Type Of Traveller": "Business",
    "Recommended": "no",
    "Seat Comfort": 3,
    "Cabin Flown": "Economy Class",
    "Ground Service": 1
}
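One likely reason the XPaths in the question return nothing is that they look for the value cell inside the header cell (td//td), whereas the answer treats both cells as siblings within the same row. The sketch below reproduces that row-by-row traversal offline with the standard library, so it can be tried without a browser. The HTML fragment, the class names on the stars cell, and the parse_ratings helper are assumptions made for illustration; the real page markup may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on the structure the answer walks through;
# it is NOT copied from the real site.
SAMPLE = """
<div class="tc_mobile">
  <table class="review-ratings">
    <tbody>
      <tr>
        <td class="review-rating-header type_of_traveller">Type Of Traveller</td>
        <td class="review-value">Business</td>
      </tr>
      <tr>
        <td class="review-rating-header seat_comfort">Seat Comfort</td>
        <td class="review-rating-stars stars">
          <span class="star fill">1</span>
          <span class="star fill">2</span>
          <span class="star fill">3</span>
          <span class="star">4</span>
          <span class="star">5</span>
        </td>
      </tr>
    </tbody>
  </table>
</div>
"""


def parse_ratings(fragment):
    """Return {header text: value text or number of filled stars} for one table."""
    root = ET.fromstring(fragment)
    rating = {}
    for row in root.findall(".//table[@class='review-ratings']/tbody/tr"):
        # The header cell and the value cell are siblings inside the same <tr>
        header, value = row.findall("./td")
        if value.findall("./span"):
            # Star rating: count only the filled stars
            rating[header.text.strip()] = len(value.findall("./span[@class='star fill']"))
        else:
            rating[header.text.strip()] = value.text.strip()
    return rating


ratings = parse_ratings(SAMPLE)
print(ratings)
```

With Selenium the same idea applies: select each row first, then read its two td children, rather than nesting one td selector inside another.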

Related

How to create a column for every value in a list?

I am using pandas dataframe to separate a column labeled genres by each genre name, as you can see below. The code to create a list with all the genre names runs fine, but I am also trying to generate a column for each individual genre and am having trouble getting it to print the way I want. Right now, it's only printing one column called x.
genre_list = []
for genre in goodreads_cl['genre']:
    if genre not in genre_list:
        genre_list.append(genre.split('|'))
print(genre_list)
x = genre
for x in genre_list:
    goodreads_cl['x'] = ''
goodreads_cl.head()
You can pass the columns keyword argument to the DataFrame constructor:
pd.DataFrame(columns=['a', 'b', 'c'])
In your example,
goodreads_cl = pd.DataFrame(columns=genre_list)
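If the goal is one column per individual genre while keeping the existing rows, a different technique from the constructor call above is pandas' Series.str.get_dummies, which splits on a separator and builds 0/1 indicator columns. This is a minimal sketch; the sample frame and its titles are made up, and only the genre column name and the '|' separator come from the question.

```python
import pandas as pd

# Hypothetical stand-in for goodreads_cl; only the 'genre' column matters here
goodreads_cl = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "genre": ["Fantasy|Young Adult", "Fantasy", "Romance|Young Adult"],
})

# Split each string on '|' and create one indicator column per distinct genre
genre_columns = goodreads_cl["genre"].str.get_dummies(sep="|")

# Attach the indicator columns to the original frame
result = pd.concat([goodreads_cl, genre_columns], axis=1)
print(result)
```

Each new column is named after a genre and holds 1 where the row lists that genre and 0 otherwise.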

How to show separate values from appended list

I'm trying to display values from an appended list of items that are scraped with bs4. Currently, my code only returns the whole set of data, but I'd like to display the individual values. Right now, all I get is:
NameError: value is not defined.
How can I do this?
data = []
for e in soup.select('div:has(> div > a h3)'):
    data.append({
        'title': e.h3.text,
        'url': e.a.get('href'),
        'desc': e.next_sibling.text,
        'email': re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', e.parent.text).group(0) if
                 re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', e.parent.text) else None
    })
data
title = print(title) # name error
desc = print(desc) # name error
email = print(email) # name error
The main issue is that you reference the keys directly, without taking into account that data is a list of dicts.
So you have to pick a dict from data by index if you want to print a specific one:
print(data[0]['title'])
print(data[0]['desc'])
print(data[0]['email'])
Alternatively, just iterate over data and print or operate on the values of each dict:
for d in data:
    print(d['title'])
    print(d['desc'])
    print(d['email'])
or
for d in data:
    title = d['title']
    desc = d['desc']
    email = d['email']
    print(f'print title only: {title}')
You can also do it this way, assigning the values directly inside the loop (note there are no trailing commas, which would turn each value into a tuple):
for e in soup.select('div:has(> div > a h3)'):
    title = e.h3.text
    url = e.a.get('href')
    desc = e.next_sibling.text
    email = re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', e.parent.text).group(0) if re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', e.parent.text) else None
    print(title)
    print(desc)
    print(email)
    print(url)
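The email lookup in these snippets runs the same regular expression twice per item. A small helper can search once and return None when nothing matches. This is only a sketch: the find_email name and the sample records are hypothetical, and plain strings stand in for the scraped text so the example runs without bs4.

```python
import re

# Same pattern as in the question, compiled once
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def find_email(text):
    """Return the first email-like substring in text, or None if there is none."""
    match = EMAIL_RE.search(text)
    return match.group(0) if match else None


# Hypothetical records standing in for the scraped elements
data = [
    {"title": "First", "desc": "Contact: first@example.com for details"},
    {"title": "Second", "desc": "No address here"},
]

for d in data:
    d["email"] = find_email(d["desc"])
    print(d["title"], d["email"])
```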

I wrote code using pandas with Python. I want to convert the output into a new dataframe separated into two columns

I went from one data frame to another and performed calculations on the column next to the name for each unique person. Now I have an output of the names with the calculated value next to each, and I want to break it into two columns, put it in a data frame and print it. I'm thinking I should put the results of the for loop into a dictionary and then into a data frame, but I'm not too sure how to do that. I am a beginner at this and would really appreciate people's help. See the for loop below:
names = df['Participant Name, Number'].unique()
for name in names:
    unique_name_df = df[df['Participant Name, Number'] == name]
    badge_types = unique_name_df['Dosimeter Location'].unique()
    if 'Collar' in badge_types:
        collar = unique_name_df[unique_name_df['Dosimeter Location'] == 'Collar']['Total DDE'].astype(float).sum()
    if 'Chest' in badge_types:
        chest = unique_name_df[unique_name_df['Dosimeter Location'] == 'Chest']['Total DDE'].astype(float).sum()
    if len(badge_types) == 1:
        if 'Collar' in badge_types:
            value = collar
        elif 'Chest' in badge_types:
            value = chest
    print(name, value)
If you expect len(badge_types) == 1 in all cases, try the following (the column name contains a space, so it has to be selected with brackets rather than as an attribute):
pd.DataFrame(df.groupby(['Participant Name, Number'])['Total DDE'].sum())
Otherwise, to get the sum per Dosimeter Location, add it to the groupby:
pd.DataFrame(df.groupby(['Participant Name, Number', 'Dosimeter Location'])['Total DDE'].sum())
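To end up with exactly two columns (name and total), the groupby can be run with as_index=False so the name stays a regular column. The sketch below uses made-up sample rows; only the column names come from the question. It also assumes 'Total DDE' is stored as text, which is why the question's astype(float) cast is kept.

```python
import pandas as pd

# Hypothetical sample rows using the column names from the question
df = pd.DataFrame({
    "Participant Name, Number": ["Ann, 1", "Ann, 1", "Bob, 2", "Bob, 2"],
    "Dosimeter Location": ["Collar", "Collar", "Chest", "Chest"],
    "Total DDE": ["1.5", "2.0", "0.5", "1.0"],
})

# Cast to float first; summing strings would concatenate them instead
df["Total DDE"] = df["Total DDE"].astype(float)

# as_index=False keeps the participant name as an ordinary column
totals = df.groupby("Participant Name, Number", as_index=False)["Total DDE"].sum()
print(totals)
```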

Spotfire: How to increment variables to build scoring mechanism?

I'm trying to figure out how I could use variables in Spotfire (online version) to build a scoring mechanism and populate a calculated column with the final result.
I have a couple of values stored in columns that I would use to evaluate and attribute a score like this:
if column1<10 then segment1 = segment1 + 1
if column1>10 then segment2 = segment2+1
...ETC...
In the end each "segment" should have a score and I would like to simply display the name of the segment that has the highest score.
Ex:
Segment1 has a final value of 10
Segment2 has a final value of 22
Segment3 has a final value of 122
I would display Segment3 as value for the calculated column
Using only "IF" would lead me to a complicated IF structure so I'm more looking for something that looks more like a script.
Is there a way to achieve this with Spotfire?
Thanks
Laurent
To cycle through the data rows and calculate a running score, you can use an IronPython script. The script below is reading the numeric data from Col1 and Col2 of a data table named "Data Table". It calculates a score value for each row and writes it to a tab delimited text string. When done, it adds it to the Spotfire table using the Add Columns function. Note, the existing data needs to have a unique identifier. If not, the RowId() function can be used to create a calculated column for a unique row id.
from Spotfire.Dxp.Data import *
from System.IO import StringReader, StreamReader, StreamWriter, MemoryStream, SeekOrigin
from Spotfire.Dxp.Data.Import import *
from System import Array
def add_column(table, text, col_name):
    # read the text data into memory
    mem_stream = MemoryStream()
    writer = StreamWriter(mem_stream)
    writer.Write(text)
    writer.Flush()
    mem_stream.Seek(0, SeekOrigin.Begin)
    # define the structure of the text data
    settings = TextDataReaderSettings()
    settings.Separator = "\t"
    settings.SetDataType(0, DataType.Integer)
    settings.SetColumnName(0, 'ID')
    settings.SetDataType(1, DataType.Real)
    settings.SetColumnName(1, col_name)
    # create a data source from the in memory text data
    data = TextFileDataSource(mem_stream, settings)
    # define the relationship between the existing table (left) and the new data (right)
    leftColumnSignature = DataColumnSignature("Store ID", DataType.Integer)
    rightColumnSignature = DataColumnSignature("ID", DataType.Integer)
    columnMap = {leftColumnSignature:rightColumnSignature}
    ignoredColumns = []
    columnSettings = AddColumnsSettings(columnMap, JoinType.LeftOuterJoin, ignoredColumns)
    # now add the column(s)
    table.AddColumns(data, columnSettings)
#get the data table
table=Document.Data.Tables["Data Table"]
#place data cursor on a specific column
cursorCol1 = DataValueCursor.CreateFormatted(table.Columns["Col1"])
cursorCol2 = DataValueCursor.CreateFormatted(table.Columns["Col2"])
cursorColId = DataValueCursor.CreateFormatted(table.Columns["ID"])
cursorsList = Array[DataValueCursor]([cursorCol1, cursorCol2, cursorColId])
text = ""
rowsToInclude = IndexSet(table.RowCount,True)
#iterate through table column rows to retrieve the values
for row in table.GetRows(rowsToInclude, cursorsList):
    score = 0
    # get the current values from the cursors
    col1Val = cursorCol1.CurrentDataValue.ValidValue
    col2Val = cursorCol2.CurrentDataValue.ValidValue
    id = cursorColId.CurrentDataValue.ValidValue
    # now apply rules for scoring
    if col1Val <= 3:
        score -= 3
    elif col1Val > 3 and col2Val > 50:
        score += 10
    else:
        score += 5
    text += "%d\t%f\r\n" % (id, score)
add_column(table, text, 'Score_Result')
For an approach with no scripting, but also no accumulation, you can use calculated columns.
To get the scores, you can use a calculated column with case statements. For Segment 1, you might have:
case
    when [Col1] > 100 then 10
    when [Col1] < 100 and [Col2] > 600 then 20
end
Then, once you have the scores, you can create a calculated column, say [MaxSegment]. The expression for this will be Max([Segment1],[Segment2],[Segment3]...). Then display the value of [MaxSegment].
The max function in this case is acting as a row expression and is calculating the max value across the row of the columns given.
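The question asks for the name of the highest-scoring segment rather than the maximum score itself. The logic can be sketched in plain Python, independent of any Spotfire API: accumulate one counter per segment, then pick the key with the largest value. The sample rows are hypothetical; the two thresholds are the ones from the question's pseudocode.

```python
# Plain Python illustration of the scoring logic; this is not Spotfire API code.
# Hypothetical input rows; 'column1' is the column named in the question.
rows = [{"column1": 4}, {"column1": 25}, {"column1": 12}, {"column1": 10}]

scores = {"Segment1": 0, "Segment2": 0}
for row in rows:
    if row["column1"] < 10:
        scores["Segment1"] += 1
    elif row["column1"] > 10:
        scores["Segment2"] += 1
    # a value of exactly 10 matches neither rule in the question

# Name of the segment with the highest accumulated score
best_segment = max(scores, key=scores.get)
print(scores, best_segment)
```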

Creating a dictionary of dictionaries from csv file

Hi so I am trying to write a function, classify(csv_file) that creates a default dictionary of dictionaries from a csv file. The first "column" (first item in each row) is the key for each entry in the dictionary and then second "column" (second item in each row) will contain the values.
However, I want to alter the values by calling on two functions (in this order):
trigram_c(string): that creates a default dictionary of trigram counts within the string (which are the values)
normal(tri_counts): that takes the output of trigram_c and normalises the counts (i.e converts the counts for each trigram into a number).
Thus, my final output will be a dictionary of dictionaries:
{value: {trigram1 : normalised_count, trigram2: normalised_count}, value2: {trigram1: normalised_count...}...} and so on
My current code looks like this:
def classify(csv_file):
    l_rows = list(csv.reader(open(csv_file)))
    classified = dict((l_rows[0], l_rows[1]) for rows in l_rows)
For example, if the csv file was:
Snippet1, "It was a dark stormy day"
Snippet2, "Hello world!"
Snippet3, "How are you?"
The final output would resemble:
{Snippet1: {'It ': 0.5352, 't w': 0.43232}, Snippet2: {'Hel' : 0.438724,...}...} and so on.
(Of course there would be more than just two trigram counts, and the numbers are just random for the purpose of the example).
Any help would be much appreciated!
First of all, please check your classify function, because I couldn't run it. Here is a corrected version:
import csv
def classify(csv_file):
    l_rows = list(csv.reader(open(csv_file)))
    classified = dict((row[0], row[1]) for row in l_rows)
    return classified
It returns a dictionary whose keys come from the first column and whose values are the strings from the second column.
So you should iterate over every dictionary entry and pass its value to the trigram_c function. I didn't understand how you calculate the trigram counts, but if, for example, you just count the number of times each trigram appears in the string, you could use the function below. If you want to count differently, you only need to update the code in the for loop.
def trigram_c(string):
    trigram_dict = {}
    start = 0
    end = 3
    for i in range(len(string)-2):
        # you could implement your logic in this loop
        trigram = string[start:end]
        if trigram in trigram_dict.keys():
            trigram_dict[trigram] += 1
        else:
            trigram_dict[trigram] = 1
        start += 1
        end += 1
    return trigram_dict
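Putting the pieces together could look like the sketch below. The question does not define normal, so here it is assumed to divide each count by the total number of trigrams; swap in the real definition if it differs. The classify_rows helper is a hypothetical addition so the logic can be tried on in-memory text, and skipinitialspace=True is used because the sample rows have a space after the comma.

```python
import csv
import io
from collections import Counter


def trigram_c(string):
    """Count every run of three consecutive characters in string."""
    return Counter(string[i:i + 3] for i in range(len(string) - 2))


def normal(tri_counts):
    """Assumed normalisation: each count divided by the total count."""
    total = sum(tri_counts.values())
    if total == 0:
        return {}
    return {tri: count / total for tri, count in tri_counts.items()}


def classify_rows(rows):
    """Map the first field of each row to its normalised trigram counts."""
    return {row[0]: normal(trigram_c(row[1])) for row in rows if len(row) >= 2}


def classify(csv_file):
    """Read csv_file and build the dictionary of dictionaries."""
    with open(csv_file, newline="") as handle:
        return classify_rows(csv.reader(handle, skipinitialspace=True))


# Try it on in-memory text shaped like the example in the question
sample = io.StringIO('Snippet1, "It was a dark stormy day"\nSnippet2, "Hello world!"\n')
classified = classify_rows(csv.reader(sample, skipinitialspace=True))
print(classified["Snippet2"])
```

Counter is a dict subclass, so it can replace the manual if/else counting without changing how the result is used.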
