Save text in JSON format from Python Selenium - python-3.x

I am trying to scrape data from a webpage and save the scraped text in JSON format.
I have reached the step where I can gather the text I want, but I can't save it in the expected format. CSV or TXT format would also be sufficient.
Please help me save the scraped text as JSON. Here is the code I have so far:
for k in range(0, len(op3)):
    selectweek.select_by_index(k)
    table = driver.find_element_by_xpath("//table[@class='list-table']")
    for row in table.find_elements_by_xpath('//*[@id="dvFixtureInner"]/table/tbody/tr[2]/td[6]/a'):
        row.click()
        mainpage = driver.window_handles[0]
        print(mainpage)
        popup = driver.window_handles[1]
        driver.switch_to.window(popup)
        time.sleep(3)
        # Meta details of the match
        team1 = driver.find_element_by_xpath('//*[@id="match-details"]/div/div[1]/div/div[2]/div[1]/div[1]/a')  # Data to save
        team2 = driver.find_element_by_xpath('//*[@id="match-details"]/div/div[1]/div/div[2]/div[3]/div[1]/a')  # Data to save
        ht = driver.find_element_by_xpath('//*[@id="dvHTScoreText"]')  # Data to save
        ft = driver.find_element_by_xpath('//*[@id="dvScoreText"]')  # Data to save

Create a dictionary and convert it to JSON using the json module. Note that the variables above hold Selenium WebElement objects, so take their .text before serializing:
import json
dictionary = {"team1": team1.text, "team2": team2.text, "ht": ht.text, "ft": ft.text}
json_dump = json.dumps(dictionary)
with open("YourFilePath", "w") as f:
    f.write(json_dump)
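If you are scraping several matches in the loop, a minimal sketch (assuming the loop above; matches.json is a hypothetical output path) is to collect one dictionary per match and dump the whole list once at the end:
import json

records = []  # one dict per scraped match

# ... inside the loop, after the four elements are located:
records.append({
    "team1": team1.text,
    "team2": team2.text,
    "ht": ht.text,
    "ft": ft.text,
})

# after the loop finishes, write everything in one go
with open("matches.json", "w") as f:  # hypothetical output path
    json.dump(records, f, indent=2)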

You can create a dictionary and add key-value pairs to it. I don't know the structure of your JSON, but this can give an idea:
json_data = dict()
ht = 1
ft = 2
json_data["team1"] = {"ht": ht, "ft": ft}
print(json_data)
>>> {'team1': {'ht': 1, 'ft': 2}}

Related

How to extract the specific numbers or text using regex in python?

I have written code to extract numbers and company names from an extracted PDF file.
Sample PDF content:
#88876 - Sample1, GTRHEUSKYTH, -99WED,-0098B
#99945 - SAMPLE2, DJWHVDFWHEF, -8876D,-3445G
The above example is what my PDF file contains. I want to extract the app number that comes after the # (i.e. the five digits, 88876) and the app name that comes after the - (i.e. Sample1), and write them to an Excel file as separate columns, App_number and App_name.
Please refer to the code below, which I have tried:
import PyPDF2, re
import csv

for k in range(1, 100):
    pdfObj = open(r"C:\\Users\merge.pdf", 'rb')
    object = PyPDF2.PdfFileReader("C:\\Users\merge.pdf")
    pdfReader = PyPDF2.PdfFileReader(pdfObj)
    NumPages = object.getNumPages()
    pdfReader.numPages
    for i in range(0, NumPages):
        pdfPageObj = pdfReader.getPage(i)
        text = pdfPageObj.extractText()
        x = re.findall('(?<=#).[0-9]+', text)
        y = re.findall("(?<=\- )(.*?)(?=,)", text)
        print(x)
        print(y)

with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(x)
Please share some suggestions.
Try this:
text = '#88876 - Sample1, GTRHEUSKYTH'
App_number = re.search('(?<=#).[0-9]+', text).group()
App_name = re.search("(?<=\- )(.*?)(?=,)", text).group()
In the first regex you get the first run of digits after the #; in the second you get everything between "- " and the next ",".
Hope it helps.
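To also write the two columns to a CSV file, here is a minimal sketch (the multi-line text stands in for the extracted page text, and apps.csv is a hypothetical output path) that pairs the two findall results and writes one row per match:
import csv
import re

text = '''#88876 - Sample1, GTRHEUSKYTH, -99WED,-0098B
#99945 - SAMPLE2, DJWHVDFWHEF, -8876D,-3445G'''  # stand-in for the extracted page text

numbers = re.findall(r'(?<=#)[0-9]+', text)     # app numbers after '#'
names = re.findall(r'(?<=- )(.*?)(?=,)', text)  # app names between '- ' and ','

with open("apps.csv", "w", newline="") as f:    # hypothetical output path
    writer = csv.writer(f)
    writer.writerow(["App_number", "App_name"])
    writer.writerows(zip(numbers, names))       # one (number, name) row per match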

Python to search for a specific table in word document

I am new to Python and have done a small hands-on with the python-docx module.
I have a requirement in which I have to read a Word document that contains multiple tables and text.
From this document I have to select a specific table to read, and that selection depends on the text written in the line just above the table; then I have to process the data of that table.
I am able to read table data by referring to the table by its index, but in this case the table index is unknown and the table can be at any position in the document. The only thing that identifies the table is the text written in the line just above it.
Can you please help me achieve this?
I have a solution using BeautifulSoup rather than python-docx. What I have done here is traverse the OOXML of the Word (.docx) document:
from bs4 import BeautifulSoup
import zipfile

wordoc = input('Enter your file name here or name with path: ')
text1 = 'Enter your text written above the table'
text1 = ''.join(text1.split())

# a .docx file is a zip archive; the document body lives in word/document.xml
document = zipfile.ZipFile(wordoc)
xml_content = document.read('word/document.xml')
document.close()
soup = BeautifulSoup(xml_content, 'xml')

for document in soup.children:
    for body in document.children:
        for tag in body.children:
            if tag.name == 'p' and (''.join(tag.text.split())) == text1:
                table = tag.find_next_sibling('w:tbl')
                table_contents = []
                for wtc in table.findChildren('w:tc'):
                    cell_text = ''
                    for wr in wtc.findChildren('w:r'):
                        # We want to exclude struck-out text
                        if not wr.findChildren('w:strike'):
                            cell_text += wr.text
                    table_contents.append(cell_text)
                print(table_contents)
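If you would rather stay inside python-docx, here is a minimal sketch (report.docx and target_text are hypothetical) that walks the document body in order and picks the table whose immediately preceding paragraph matches the marker text:
from docx import Document
from docx.table import Table
from docx.text.paragraph import Paragraph

doc = Document('report.docx')  # hypothetical file path
target_text = 'Enter your text written above the table'  # hypothetical marker text

previous_paragraph_text = ''
# the body element holds paragraphs (w:p) and tables (w:tbl) in document order
for child in doc.element.body.iterchildren():
    if child.tag.endswith('}p'):
        previous_paragraph_text = Paragraph(child, doc).text.strip()
    elif child.tag.endswith('}tbl') and previous_paragraph_text == target_text:
        table = Table(child, doc)
        for row in table.rows:
            print([cell.text for cell in row.cells])
        break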

Writing data to csv file from table scrape

I am having trouble figuring out how to write this file to CSV. I am parsing data from a table and can print it just fine, but when I try to write to a CSV file I get the error "TypeError: write() argument must be str, not list". I'm not sure how to turn my data points into a string.
Code:
from bs4 import BeautifulSoup
import urllib.request
import csv
html = urllib.request.urlopen("https://markets.wsj.com/").read().decode('utf8')
soup = BeautifulSoup(html, 'html.parser') # parse your html
filename = "products.csv"
f = open(filename, "w")
t = soup.find('table', {'summary': 'Major Stock Indexes'}) # finds tag table with attribute summary equals to 'Major Stock Indexes'
tr = t.find_all('tr') # get all table rows from selected table
row_lis = [i.find_all('td') if i.find_all('td') else i.find_all('th') for i in tr if i.text.strip()] # construct list of data
f.write([','.join(x.text.strip() for x in i) for i in row_lis])
Any suggestions?
f.write() takes only a string as an argument, but you're passing it a list of lists of strings.
The csv writer's writerows() will write lists to a csv file.
Change your file handle f to be:
f = csv.writer(open(filename, 'w', newline=''))
and use it by replacing the last line with:
f.writerows([[x.text.strip() for x in i] for i in row_lis])
This will produce a csv file with the table contents, one row per table row.
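As a design note, a with block closes the file deterministically instead of leaving it to the garbage collector; a minimal sketch of the same fix, assuming the row_lis built in the question:
import csv

# row_lis is the list of lists of bs4 tags from the question
with open("products.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([[x.text.strip() for x in i] for i in row_lis])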

Can't store the scraped results in third and fourth column in a csv file

I've written a script which scrapes the Address and Phone number of certain shops based on their Name and Lid. It takes the Name and Lid stored in column A and column B of a csv file. After fetching a result for each search, I expected the parser to put those results in column C and column D respectively. At this point I got stuck: I don't know how to write to the third and fourth columns so that the data ends up there. I'm trying with this now:
import csv
import requests
from lxml import html

Names, Lids = [], []
with open("mytu.csv", "r") as f:
    reader = csv.DictReader(f)
    for line in reader:
        Names.append(line["Name"])
        Lids.append(line["Lid"])

with open("mytu.csv", "r") as f:
    reader = csv.DictReader(f)
    for entry in reader:
        Page = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(entry["Name"].replace(" ","-"), entry["Lid"])
        response = requests.get(Page)
        tree = html.fromstring(response.text)
        titles = tree.xpath('//article[contains(@class,"business-card")]')
        for title in titles:
            Address = title.xpath('.//p[@class="address"]/span/text()')[0]
            Contact = title.xpath('.//p[@class="phone"]/text()')[0]
            print(Address, Contact)
My csv file currently contains only the Name and Lid columns; the desired output has the scraped Address and Phone added as the third and fourth columns.
You can do it like this. Create a fresh output csv file whose header is based on the input csv, with the addition of the two columns. When you read a csv row it's available as a dictionary, in this case called entry. You can add the new values to this dictionary from the stuff you've gleaned on the 'net. Then write each newly created row out to file.
import csv
import requests
from lxml import html

with open("mytu.csv", "r") as f, open('new_mytu.csv', 'w', newline='') as g:
    reader = csv.DictReader(f)
    newfieldnames = reader.fieldnames + ['Address', 'Phone']
    writer = csv.DictWriter(g, fieldnames=newfieldnames)
    writer.writeheader()
    for entry in reader:
        Page = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}".format(entry["Name"].replace(" ","-"), entry["Lid"])
        response = requests.get(Page)
        tree = html.fromstring(response.text)
        titles = tree.xpath('//article[contains(@class,"business-card")]')
        #~ for title in titles:
        title = titles[0]
        Address = title.xpath('.//p[@class="address"]/span/text()')[0]
        Contact = title.xpath('.//p[@class="phone"]/text()')[0]
        print(Address, Contact)
        new_row = entry
        new_row['Address'] = Address
        new_row['Phone'] = Contact
        writer.writerow(new_row)
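One caveat with taking titles[0]: if a page comes back without a business-card article, the indexing raises an IndexError. A minimal guard, assuming the variables in the loop above:
if titles:
    title = titles[0]
    Address = title.xpath('.//p[@class="address"]/span/text()')[0]
    Contact = title.xpath('.//p[@class="phone"]/text()')[0]
else:
    Address, Contact = '', ''  # leave the new columns blank for missing pages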

Trouble Writing to a new excel file

I'm very new to Python and got an assignment asking me to:
Design your own code in the "do something here" part to save the title, id, share count, and comment count of each news media outlet in separate columns of an Excel (.xls) file.
Design your own code to read the share count and comment count from the Excel file created in step 3, and calculate the average share count and comment count of those news media websites.
Here is my current code:
from urllib import request
import json
from pprint import pprint
import xlwt
'''
import xlrd
from xlutils import copy
'''

website_list = [
    'http://www.huffingtonpost.com/',
    'http://www.cnn.com/',
    'https://www.nytimes.com/',
    'http://www.foxnews.com/',
    'http://www.nbcnews.com/'
]  # place your list of website urls, e.g., http://jmu.edu

for website in website_list:
    url_str = 'https://graph.facebook.com/' + website  # create the url for facebook graph api
    response = request.urlopen(url_str)  # read the response into computer
    html_str = response.read().decode("utf-8")  # convert the response into string
    json_data = json.loads(html_str)  # convert the string into json
    pprint(json_data)

book = xlwt.Workbook()
sheet_test = book.add_sheet('keys')
sheet_test.write(0,0,'Title')
sheet_test.write(0,1,'ID')
sheet_test.write(0,2,'Share Count')
sheet_test.write(0,3,'Comment Count')
for i in range(0,5):
    for website in website_list[i]:
        sheet_test.write(i,0,json_data['og_object']['title'])
        sheet_test.write(i,1,json_data['id'])
        sheet_test.write(i,2,json_data['share']['share_count'])
        sheet_test.write(i,3,json_data['share']['comment_count'])
book.save('C:\\Users\\stinesr\\Downloads\\Excel\\keys.xls')

'''
reading_book = xlrd.open_workbook('C:\\Users\\stinesr\\Downloads\\Excel\\key.xls')
sheet_read = reading_book.sheet_by_name('keys')
num_record = sheet_read.nrows

writing_book = copy(reading_book)
sheet_write = writing_book.get_sheet(0)
print(sheet_write.name)

for i in range(num_record):
    row = sheet_read.row_values(i)
    if i == 0:
        sheet_write.write(0,4,'Share Count Average')
        sheet_write.write(0,5,'Comment Count Average')
    else:
        sheet_write.write(i,4,row[2])
        sheet_write.write(i,5,row[3])

writing_book.save('C:\\Users\\stinesr\\Downloads\\Excel\\keys.xls')
'''
Any and all help is appreciated, thank you.
The Traceback error says that in the nested for-loops on lines 40-45 you are attempting to overwrite row 0 from the previous lines. You need to start from row 1, since row 0 already contains the header.
But before that: json_data keeps only the last response, so you'll want to create a list of responses and append each response to that list.
You then need only one for-loop at line 40.
In summary:
website_list = [
    'http://www.huffingtonpost.com/',
    'http://www.cnn.com/',
    'https://www.nytimes.com/',
    'http://www.foxnews.com/',
    'http://www.nbcnews.com/'
]  # place your list of website urls, e.g., http://jmu.edu

json_list = []
for website in website_list:
    url_str = 'https://graph.facebook.com/' + website  # create the url for facebook graph api
    response = request.urlopen(url_str)  # read the response into computer
    html_str = response.read().decode("utf-8")  # convert the response into string
    json_data = json.loads(html_str)  # convert the string into json
    json_list.append(json_data)
pprint(json_list)

book = xlwt.Workbook()
sheet_test = book.add_sheet('keys')
sheet_test.write(0,0,'Title')
sheet_test.write(0,1,'ID')
sheet_test.write(0,2,'Share Count')
sheet_test.write(0,3,'Comment Count')
for i in range(len(json_list)):
    sheet_test.write(i+1, 0, json_list[i]['og_object']['title'])
    sheet_test.write(i+1, 1, json_list[i]['id'])
    sheet_test.write(i+1, 2, json_list[i]['share']['share_count'])
    sheet_test.write(i+1, 3, json_list[i]['share']['comment_count'])
book.save('C:\\Users\\stinesr\\Downloads\\Excel\\keys.xls')
This should give you an Excel document with a header row followed by one row per website.
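For step 4 of the assignment (the averages), a minimal sketch with xlrd, assuming the keys.xls written above:
import xlrd

# read back the workbook written above
reading_book = xlrd.open_workbook('C:\\Users\\stinesr\\Downloads\\Excel\\keys.xls')
sheet_read = reading_book.sheet_by_name('keys')

# data rows start at 1; row 0 is the header
shares = [sheet_read.cell_value(i, 2) for i in range(1, sheet_read.nrows)]
comments = [sheet_read.cell_value(i, 3) for i in range(1, sheet_read.nrows)]

print('Average share count:', sum(shares) / len(shares))
print('Average comment count:', sum(comments) / len(comments))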
