Colspan not working properly using Python Pandas - python-3.x

I have some data that i need to convert into an Excel sheet which needs to look like this at the end of the day:
I've tried the following code:
import pandas as pd
result = pd.read_html(
"""<table>
<tr>
<th colspan="2">Status N</th>
</tr>
<tr>
<td style="font-weight: bold;">Merchant</td>
<td>Count</td>
</tr>
<tr>
<td>John Doe</td>
<td>10</td>
</tr>
</table>"""
)
writer = pd.ExcelWriter('out/test_pd.xlsx', engine='xlsxwriter')
print(result[0])
result[0].to_excel(writer, sheet_name='Sheet1', index=False)
writer.save()
This issue here is that the colspan is not working properly. The output is like this instead:
Can someone help me on how i can use colspan on Python Pandas?
It would be better if i don't have to use read_html() and do it directly on python code but if it's not possible, i can use read_html()

Since Pandas can't recognize the values and columns title you should introduce them, if you convert HTML text to the standard format, then pandas can handle it correctly. use thead and tbody to split header and values like this.
result = pd.read_html("""
<table>
<thead>
<tr>
<th colspan="2">Status N</th>
</tr>
<tr>
<td style="font-weight: bold;">Merchant</td>
<td>Count</td>
</tr>
</thead>
<tbody>
<tr>
<td>John Doe</td>
<td>10</td>
</tr>
</tbody>
</table>
"""
)
To write Dataframe to an excel file you can use the pandas to_excel method.
result[0].to_excel("out.xlsx")

Related

How do I use Python csv to write multiple beautifulsoup table rows for only two specific columns?

I am wanting to use beautifulsoup to scrape HTML to pull out only two columns from every row in one table. However, each "tr" row has 10 "td" cells, and I only want the [1] and [8] "td" cell from each row. What is the most pythonic way to do this?
From my input below I've got one table, one body, three rows, and 10 cells per row.
Input
<table id ="tblMain">
<tbody>
<tr>
<td "text"</td>
<td "data1"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "data2"</td>
<td "text"</td>
<tr>
<td "text"</td>
<td "data1"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "data2"</td>
<td "text"</td>
<tr>
<td "text"</td>
<td "data1"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "text"</td>
<td "data2"</td>
<td "text"</td>
Things I Have Tried
I understand how to use the index of the cells in order to loop through and get "td" at [1] and [8]. However, I'm getting all confused when trying to get that data on one line written back to the csv.
table = soup.find('table', {'id':'tblMain'} )
table_body = table.find('tbody')
rows = table_body.findAll('tr')
data1_columns = []
data2_columns = []
for row in rows[1:]:
data1 = row.findAll('td')[1]
data1_columns.append(data1.text)
data2 = row.findAll('td')[8]
data2_columns.append(data2.text)
This is my current code which finds the table, rows, and all "td" cells and prints them correctly to a .csv. However, instead of writing all ten "td" cells per row back to the csv line, I just want to grab "td"[1] and "td"[8].
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id':'tblMain'} )
table_body = table.find('tbody')
rows = table_body.findAll('tr')
filename = '%s.csv' % reportname
with open(filename, "wt+", newline="") as f:
writer = csv.writer(f)
for row in rows:
csv_row = []
for cell in row.findAll("td"):
csv_row.append(cell.get_text())
writer.writerow(csv_row)
Expected Results
I want to be able to write "td"[1] and "td"[8] back to my csv_row in order to write each list back to a the csv writer.writerow.
Writing row back to csv_row which then writes to my csv file:
['data1', 'data2']
['data1', 'data2']
['data1', 'data2']
You almost did it
for row in rows:
row = row.findAll("td")
csv_row = [row[1].get_text(), row[8].get_text()]
writer.writerow(csv_row)
Full code
html ='''<table id ="tblMain">
<tbody>
<tr>
<td>text</td>
<td>data1</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>data2</td>
<td>text</td>
<tr>
<td>text</td>
<td>data1</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>data2</td>
<td>text</td>
<tr>
<td>text</td>
<td>data1</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>text</td>
<td>data2</td>
<td>text</td>
'''
from bs4 import BeautifulSoup
import csv
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id':'tblMain'} )
table_body = table.find('tbody')
rows = table_body.findAll('tr')
reportname = 'output'
filename = '%s.csv' % reportname
with open(filename, "wt+", newline="") as f:
writer = csv.writer(f)
for row in rows:
row = row.findAll("td")
csv_row = [row[1].get_text(), row[8].get_text()]
writer.writerow(csv_row)
You should be able to use nth-of-type pseudo class css selector
from bs4 import BeautifulSoup as bs
import pandas as pd
html = 'actualHTML'
soup = bs(html, 'lxml')
results = []
for row in soup.select('#tblMain tr'):
out_row = [item.text.strip() for item in row.select('td:nth-of-type(2), td:nth-of-type(9)')]
results.append(out_row)
df = pd.DataFrame(results)
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig',index = False )
Whenever I need to pull a table and it has the <table> tag, I let Pandas do the work for me, then just maniuplate the dataframe it returns if needed. That's what I would do here:
html = '''<table id ="tblMain">
<tbody>
<tr>
<td> text</td>
<td> data1</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> data2</td>
<td> text</td>
<tr>
<td> text</td>
<td> data1</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> data2</td>
<td> text</td>
<tr>
<td> text</td>
<td> data1</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> text</td>
<td> data2</td>
<td> text</td>'''
import pandas as pd
# .read_html() returns a list of dataframes
tables = pd.read_html(html)[0]
# we want the dataframe from that list in position [0]
df = tables[0]
# Use .iloc to say I want all the rows, and columns 1, 8
df = df.iloc[:,[1,8]]
# Write the dataframe to file
df.to_csv('path.filename.csv', index=False)

How to get all td[3] tags from the tr tags with selenium Xpath in python

I have a webpage HTML like this:
<table class="table_type1" id="sailing">
<tbody>
<tr>
<td class="multi_row"></td>
<td class="multi_row"></td>
<td class="multi_row">1</td>
<td class="multi_row"></td>
</tr>
<tr>
<td class="multi_row"></td>
<td class="multi_row"></td>
<td class="multi_row">1</td>
<td class="multi_row"></td>
</tr>
</tbody>
</table>
and tr tags are dynamic so i don't know how many of them exist, i need all td[3] of any tr tags in a list for some slicing stuff.it is much better iterate with built in tools if find_element(s)_by_xpath("") has iterating tools.
Try
cells = driver.find_elements_by_xpath("//table[#id='sailing']//tr/td[3]")
to get third cell of each row
Edit
For iterating just use a for loop:
print ([i.text for i in cells])
Try following code :
tdElements = driver.find_elements_by_xpath("//table[#id="sailing "]/tbody//td")
Edit : for 3rd element
tdElements = driver.find_elements_by_xpath("//table[#id="sailing "]/tbody/tr/td[3]")
To print the text e.g. 1 from each of the third <td> you can either use the get_attribute() method or text property and you can use either of the following solutions:
Using CssSelector and get_attribute():
print(driver.find_elements_by_css_selector("table.table_type1#sailing tr td:nth-child(3)").get_attribute("innerHTML"))
Using CssSelector and text property:
print(driver.find_elements_by_css_selector("table.table_type1#sailing tr td:nth-child(3)").text)
Using XPath and get_attribute():
print(driver.find_elements_by_xpath('//table[#class='table_type1' and #id="sailing"]//tr//following::td[3]').get_attribute("innerHTML"))
Using XPath and text property:
print(driver.find_elements_by_xpath('//table[#class='table_type1' and #id="sailing"]//tr//following::td[3]').text)
To get the 3 rd td of each row, you can try either with xpath
driver.find_elements_by_xpath('//table[#id="sailing"]/tbody//td[3]')
or you can try with css selector like
driver.find_elements_by_css_selector('table#sailing td:nth-child(3)')
As it is returning list you can iterate with for each,
elements=driver.find_elements_by_xpath('//table[#id="sailing"]/tbody//td[3]')
for element in elements:
print(element.text)

select specific rows with class names

I am parsing an HTML which has bunch of rows that I want to select. Here are example of those rows
<tr class="constantstring-randomvalue1-row" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue1-row'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
<tr class="constantstring-randomvalue1-row" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue1-row'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
<tr class="constantstring-randomvalue2-row-2" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue2-row-2'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
<tr class="constantstring-randomvalue2-row-2" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue2-row-2'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
What i was trying to do is use BeautifulSoup4 and find_all using a regex find_all(re.compile(regext))
However, the problem is that i am unable to come up with a good regext which will select all rows that i am interested in.
all the rows that i want start with constantstring-. I don't care what it is followed by. What would be the proper way, should i use re.compile and if so, what will be the correct regex?
If you want to accomplish this with RE the following will do, I added an extra row to demo it not picking up the final row.
http://rextester.com/OSSFB8621
from bs4 import BeautifulSoup
import re
html ="""
<tr class="constantstring-randomvalue1-row" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue1-row'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
<tr class="constantstring-randomvalue1-row" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue1-row'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
<tr class="constantstring-randomvalue2-row-2" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue2-row-2'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
<tr class="constantstring-randomvalue2-row-2" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue2-row-2'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
<tr class="axcconstantstring-randomvalue2-row-2" onmouseover="this.className='constantstring-light-row-cp-h'" onmouseout="this.className='constantstring-randomvalue2-row-2'" onclick="if(ignoreOnClick==false)window.location='find.ashx?cv3dsw'" valign="top">
"""
bs = BeautifulSoup(html,'lxml')
for tr in bs.find_all("tr", {"class" : re.compile('^(constantstring)')}):
print(tr)
Instead of regex you can use in-built string methods for the same task. Like,
rows = soup.find_all('tr)'
selected_rows = [i for i in rows if str(i).startswith('tr class="constantstring-randomvalue')]
If you miss str() the if condition will fail.
Hope this helps! Cheers!

Create Pyspark Dataframe from Key Value Pair log file with variable contents

I have a log file in key=value pair format and would like to read the contents into an rdd, process the rdd into a data frame, and perform aggregations/analysis with spark SQL. I can read the raw data to rdd but I haven't been able to find an example of how to process key value pairs into a tabular format.
To complicate matters, the log can and does have missing key value pairs, so the format is variable. I would hope to be able to get around this by having NULL values in rows where that 'column'/key=value is missing once processed to data frame.
Below is an example of the log :
"Date"="2017-07-11T15:55:07-07:00","recordType"="ap_data","apName"="ap1","numClients"="5","version"="2.1"
"Date"="2017-07-11T15:55:07-07:00","recordType"="ap_data","apName"="ap2","numClients"="4","version"="2.1"
"Date"="2017-07-11T15:55:07-07:00","recordType"="ap_data","apName"="ap3","version"="2.1"
Notice the third event is missing the "numClients" key-value pair.
All I've managed to do so far is read the raw content to RDD:
#Initializing PySpark
from pyspark import SparkContext, SparkConf
from pyspark.context import SparkContext
from pyspark.sql.types import Row
sc = SparkContext.getOrCreate()
# Read raw contents to a new RDD and print first 2 results
raw_data = sc.textFile("log_sample.log")
raw_data.take(2)
Kindly please provide some help with reading key-value pair formatted data and processing to tabular format. Else, if this is not the right approach, I'm open to suggestion(s). Thank you!
Below is the data frame structure I hope to produce:
EDIT: Apologies, for clarity I'm not trying to produce any HTML, just wanted to show an example of tabular result, not sure why the html is showing and not just rendering the table.
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg .tg-yw4l{vertical-align:top}
</style>
<table class="tg">
<tr>
<th class="tg-yw4l">Date</th>
<th class="tg-yw4l">recordType</th>
<th class="tg-yw4l">apName</th>
<th class="tg-yw4l">numClients</th>
<th class="tg-yw4l">version</th>
</tr>
<tr>
<td class="tg-yw4l">2017-07-11T15:55:07-07:00</td>
<td class="tg-yw4l">ap_data</td>
<td class="tg-yw4l">ap1</td>
<td class="tg-yw4l">5</td>
<td class="tg-yw4l">2.1</td>
</tr>
<tr>
<td class="tg-yw4l">2017-07-11T15:55:07-07:00</td>
<td class="tg-yw4l">ap_data</td>
<td class="tg-yw4l">ap2</td>
<td class="tg-yw4l">4</td>
<td class="tg-yw4l">2.1</td>
</tr>
<tr>
<td class="tg-yw4l">2017-07-11T15:55:07-07:00</td>
<td class="tg-yw4l">ap_data</td>
<td class="tg-yw4l">ap3</td>
<td class="tg-yw4l"></td>
<td class="tg-yw4l">2.1</td>
</tr>
</table>

Formatting HTML Tables for Excel

I have a page in asp that makes a xls table, however when open the table all the rows are stuffed into the default column width which I would like to set.
My table looks something like this:
<table>
<thead>
'A for loop makes a series of th
</thead>
'another loop pulls db values
<tr><td>value1</td><td>value2</td> 'etc </tr>
</table>
I have tried the following to set the space
width="3.29in"
&nsp; spam (barbaric but sometimes effective)
width="400px"
style="width:300px"
none of the above seem to work.
Additionally here is my header asp incase its relevant
Response.Clear()
Response.Buffer = False
Response.ContentType = "application/vnd.ms-excel"
Response.AddHeader "Content-Disposition", "attachment; filename=blah.xls"
Also on a side note for some reason when I have a dollar value printed as such
<td>$<%=dbvalue%></td>
for some reason this yields '$dollar value and I am not sure how to nuke the single quote.
Do you need the thead tag? Something like this should work:
<html>
<body>
<h1>Report Title</h1>
<table >
<tr>
<th style="width : 300px">header1</th>
<th style="width : 100px">header2</th>
<th style="width : 200px">header..</th>
<th style="width : 300px">....</th>
</tr>
<tr class="row1">
<td >value1</td>
<td >value2</td>
<td >value..</td>
<td >....</td>
</tr>
....
Optionally you can put the table row with th tags inside a <thead> tag

Resources