Apply BeautifulSoup to XML data from Pandas dataframe - python-3.x

I have a Pandas DataFrame with a column containing XML that needs to be parsed. When I apply BeautifulSoup to that column, the results are not as expected: I get None when searching for a tag that exists.
Here is the dataframe; the column CDATA holds the XML.
Here is a snippet of the code that prints None for the existing tag logicalModel:
sampledata = df.loc[(df['PACKAGE_ID'] == packageName) & (df['OBJECT_NAME'] == viewName), ['CDATA']].apply(str)
payload = sampledata.values[0]
bs4_xml = BeautifulSoup(payload, features="xml")
print(bs4_xml.logicalModel)
.apply(str) is folding the Series index into payload, not just the XML string.
Then I tried the following to get rid of the index:
sampledata = df.loc[(df['PACKAGE_ID'] == packageName) & (df['OBJECT_NAME'] == viewName), ['CDATA']].values
BeautifulSoup(sampledata, features="xml")
but it gives the error:
TypeError: cannot use a string pattern on a bytes-like object
When I read the same XML from a flat file and apply BeautifulSoup, I can see the tag I am looking for:
with open("SampleXML") as fp:
y = BeautifulSoup(fp, features="xml")
print(y.logicalModel)
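
One way around the index problem (a minimal sketch, assuming each PACKAGE_ID/OBJECT_NAME pair matches exactly one row) is to pull the single cell out as a plain Python string instead of applying str to the whole selection:

# select the CDATA cell for the matching row and take it as a scalar;
# .iloc[0] returns just the cell value, without the Series index that
# .apply(str) was folding into the payload
payload = df.loc[(df['PACKAGE_ID'] == packageName) &
                 (df['OBJECT_NAME'] == viewName), 'CDATA'].iloc[0]

bs4_xml = BeautifulSoup(str(payload), features="xml")
print(bs4_xml.logicalModel)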

Related

Passing Key,Value into a Function

I want to check a YouTube video's views and keep track of them over time. I wrote a script that works great:
import requests
import re
import pandas as pd
from datetime import datetime
import time

def check_views(link):
    todays_date = datetime.now().strftime('%d-%m')
    now_time = datetime.now().strftime('%H:%M')
    # get the site
    r = requests.get(link)
    text = r.text
    tag = re.compile(r'\d+ views')
    views = re.findall(tag, text)[0]
    # get the digit number of views. It's returned in a list so I need to get that item out
    cleaned_views = re.findall(r'\d+', views)[0]
    print(cleaned_views)
    # append to the df
    df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
    # df = df.append([todays_date, now_time, int(cleaned_views)], axis=0)
    df.to_csv('views.csv')
    return df

df = pd.DataFrame(columns=['Date', 'Time', 'Views'])

while True:
    df = check_views('https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s')
    time.sleep(1800)
But now I want to use this function for multiple links. I want a different CSV file for each link. So I made a dictionary:
link_dict = {'link1': 'https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s',
             'link2': 'https://www.youtube.com/watch?v=ZPrAKuOBWzw'}
# this makes it easy for each csv file to be named for the corresponding link
The loop then becomes:
for key, value in link_dict.items():
    df = check_views(value)
That seems to work, passing the value of the dict (the link) into the function. Inside the function, I just made sure to load the correct csv file at the beginning:
# Existing csv files
df = pd.read_csv(k + '.csv')
But then I'm getting an error when I go to append a new row to the df ("cannot set a row with mismatched columns"). I don't get that, since it works just fine in the code written above. This is the part giving me the error:
df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
What am I missing here? This dictionary method seems super messy (I only have 2 links I want to check, but rather than just duplicating the function I wanted to experiment). Any tips? Thanks!
Figured it out! The problem was that I was saving the df as a csv and then trying to read that csv back later. When I saved the csv, I didn't use index=False with df.to_csv(), so there was an extra column! When I was only testing with the dictionary, I was reusing the in-memory df, so even though I was saving it to a csv, the script kept using the df to do the actual adding of rows.
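For reference, a minimal sketch of that fix (assuming the csv name comes from the dict key, passed into the function here as key, with the other variables as in the script above):

# write without the index so the file keeps exactly the three columns
df.to_csv(key + '.csv', index=False)

# reading it back later then gives the same Date/Time/Views layout,
# so appending a three-item row lines up again
df = pd.read_csv(key + '.csv')
df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]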

Converting multiple .pdf files with multiple pages into 1 single .csv file

I am trying to convert .pdf data to a spreadsheet. Based on some research, several people recommended converting it to csv first in order to avoid errors.
So I wrote the code below, which gives me:
"TypeError: cannot concatenate object of type ''; only Series and DataFrame objs are valid"
The error appears at the pd.concat command.
import tabula
import pandas as pd
import glob

path = r'C:\Users\REC.AC'
all_files = glob.glob(path + "/*.pdf")
print(all_files)

df = pd.concat(tabula.read_pdf(f1) for f1 in all_files)
df.to_csv("output.csv", index=False)
Since this might be a common issue, I am posting the solution I found.
"""
df = []
for f1 in all_files:
df = pd.concat(tabula.read_pdf(f1))
"""
I believe that breaking the iteration into two steps generates the dataframe that was needed, and that is why it works.
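Note that, as written, the loop rebinds df on every pass, so only the last PDF ends up in it. A sketch that keeps every file before writing the single csv (assuming a tabula-py version where read_pdf with multiple_tables=True returns a list of DataFrames, one per table) could look like:

import glob

import pandas as pd
import tabula

all_files = glob.glob(r'C:\Users\REC.AC' + "/*.pdf")

frames = []
for f1 in all_files:
    # one list of DataFrames per PDF; extend flattens them into a single list
    frames.extend(tabula.read_pdf(f1, pages='all', multiple_tables=True))

df = pd.concat(frames, ignore_index=True)
df.to_csv("output.csv", index=False)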

How to write tagged string into text file in python?

I am using POS tagging, chunk parsing and deep parsing in one step with TextBlob. I want to write the output to a txt file. The error I get is
"TypeError: a bytes-like object is required, not 'TaggedString'"
Following is the code I am using. The out variable contains information like this:
Technique:Plain/NNP/B-NP/O and/CC/I-NP/O enhanced-MPR/JJ/I-NP/O CT/NN/I-NP/O chest/NN/I-NP/O is/VBZ/B-VP/O
from textblob import TextBlob

with open('report1to8_1.txt', 'r') as myfile:
    report = myfile.read().replace('\n', '')

out = TextBlob(report).parse()
tagS = 'taggedop.txt'
f = open('taggedop.txt', 'wb')
f.write(out)
I also want to split these tags into a dataframe in this way.
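A minimal sketch of one possible fix (assuming out is the TaggedString returned by TextBlob(report).parse() above): open the file in text mode and write the plain-string form of out. The same string can also be split into a dataframe, assuming every token carries exactly four '/'-separated fields; the column names below are hypothetical:

import pandas as pd

# text mode ('w') plus str(out) avoids the bytes-like TypeError from 'wb'
with open('taggedop.txt', 'w', encoding='utf-8') as f:
    f.write(str(out))

# split each space-separated token on '/' into word/POS/chunk/PNP columns
rows = [tok.split('/') for tok in str(out).split()]
tagged_df = pd.DataFrame(rows, columns=['word', 'pos', 'chunk', 'pnp'])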

Writing data to csv file from table scrape

I am having trouble figuring out how to write this data to a csv file. I am parsing data from a table and can print it just fine, but when I try to write to a csv file I get the error "TypeError: write() argument must be str, not list". I'm not sure how to turn my data points into strings.
Code:
from bs4 import BeautifulSoup
import urllib.request
import csv
html = urllib.request.urlopen("https://markets.wsj.com/").read().decode('utf8')
soup = BeautifulSoup(html, 'html.parser') # parse your html
filename = "products.csv"
f = open(filename, "w")
t = soup.find('table', {'summary': 'Major Stock Indexes'}) # finds tag table with attribute summary equals to 'Major Stock Indexes'
tr = t.find_all('tr') # get all table rows from selected table
row_lis = [i.find_all('td') if i.find_all('td') else i.find_all('th') for i in tr if i.text.strip()] # construct list of data
f.write([','.join(x.text.strip() for x in i) for i in row_lis])
Any suggestions?
f.write() takes only a string as an argument, but you're passing it a list of strings.
csv.writer's writerows() will write lists of rows to a csv file.
Change your file handle f to be:
f = csv.writer(open(filename, 'w', newline=''))
and use it by replacing the last line with:
f.writerows([[x.text.strip() for x in i] for i in row_lis])
This will produce a csv with the contents of the Major Stock Indexes table.
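If you prefer the file to be closed deterministically, the same thing works with a context manager (a sketch using the same variables as above):

import csv

# newline='' stops the csv module from inserting blank lines on Windows
with open(filename, 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerows([[x.text.strip() for x in i] for i in row_lis])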

How to change datatype after parsing?

Here is my full working code.
When imported into Excel the data is imported as text.
How do I change the data type?
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

driver = webdriver.PhantomJS()
driver.get("http://www.cbr.ru/hd_base/dv/?P1=4")
driver.find_element_by_id('UniDbQuery_FromDate').clear()
driver.find_element_by_id('UniDbQuery_FromDate').send_keys('13.01.2013')
driver.find_element_by_id('UniDbQuery_ToDate').clear()
driver.find_element_by_id('UniDbQuery_ToDate').send_keys('12.12.2017')
driver.find_element_by_id("UniDbQuery_searchbutton").click()
z = driver.page_source
driver.quit()

soup = BeautifulSoup(z)
x = []
for tag in soup.tbody.findAll('td'):
    x.append(tag.text)
y = x[1::2]
d = pd.Series(y)
You can use the .astype method for series:
d = d.astype(int)
or whatever datatype you're trying to convert to in place of int. Note that this will give you an error if there's anything in your series that it can't convert - you may need to drop null values or strip whitespace first if that happens.
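If the scraped strings contain spaces or comma decimal separators (as values from cbr.ru often do), a rough cleanup sketch before converting might be:

# strip (non-breaking) spaces, swap the comma decimal separator, then
# convert; errors='coerce' turns anything unconvertible into NaN
d = (d.str.replace('\xa0', '', regex=False)
       .str.replace(' ', '', regex=False)
       .str.replace(',', '.', regex=False))
d = pd.to_numeric(d, errors='coerce')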
