How to change datatype after parsing? - python-3.x

Here is my full working code. When the data is imported into Excel it comes in as text. How do I change the data type?
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

driver = webdriver.PhantomJS()
driver.get("http://www.cbr.ru/hd_base/dv/?P1=4")
driver.find_element_by_id('UniDbQuery_FromDate').clear()
driver.find_element_by_id('UniDbQuery_FromDate').send_keys('13.01.2013')
driver.find_element_by_id('UniDbQuery_ToDate').clear()
driver.find_element_by_id('UniDbQuery_ToDate').send_keys('12.12.2017')
driver.find_element_by_id("UniDbQuery_searchbutton").click()
z = driver.page_source
driver.quit()

soup = BeautifulSoup(z, 'html.parser')
x = []
for tag in soup.tbody.findAll('td'):
    x.append(tag.text)
y = x[1::2]
d = pd.Series(y)

You can use the .astype method on the Series:
d = d.astype(int)
or whatever datatype you're trying to convert to in place of int. Note that this will raise an error if there's anything in your series that it can't convert - you may need to drop null values or strip whitespace first if that happens.
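If the conversion fails because of stray whitespace or locale-specific number formatting, a cleanup pass along these lines can help. This is only a sketch: the sample values below are hypothetical, so adapt the replacements to whatever the site actually returns.
import pandas as pd

d = pd.Series(['1 234,5', ' 67,8 ', 'n/a'])       # hypothetical scraped values
cleaned = (d.str.strip()
             .str.replace(' ', '', regex=False)   # drop thousands separators, if present
             .str.replace(',', '.', regex=False)) # comma decimal mark -> dot, if applicable
numbers = pd.to_numeric(cleaned, errors='coerce') # anything unparseable becomes NaN
numbers = numbers.dropna()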

Related

Apply BeautifulSoup to XML data from Pandas dataframe

I have a Pandas DataFrame that holds XML data which needs to be parsed. When I apply BeautifulSoup to the column that has the XML, the results are not as expected: I see None when searching for a tag that exists.
Here is the dataframe; column CDATA has the XML.
Here is a snippet of the code that prints None for the existing tag logicalModel:
sampledata = df.loc[(df['PACKAGE_ID'] == packageName) & (df['OBJECT_NAME'] == viewName), ['CDATA']].apply(str)
payload = sampledata.values[0]
bs4_xml = BeautifulSoup(payload, features="xml")
print(bs4_xml.logicalModel)
The .apply(str) call puts the string representation of the whole column (index, name and dtype included) into payload, not just the XML itself.
I then tried the following to get rid of the index:
sampledata = df.loc[(df['PACKAGE_ID'] == packageName) & (df['OBJECT_NAME'] == viewName), ['CDATA']].values
BeautifulSoup(sampledata, features="xml")
but it gives the error:
TypeError: cannot use a string pattern on a bytes-like object
When I read the same XML from a flat file and apply BeautifulSoup, I can see the tag I am looking for:
with open("SampleXML") as fp:
    y = BeautifulSoup(fp, features="xml")
print(y.logicalModel)
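A likely fix, sketched here under the assumption that the CDATA column holds exactly one XML string per matching row, is to pull the cell out as a plain Python scalar instead of stringifying the whole selection:
from bs4 import BeautifulSoup

# select the single CDATA cell as a scalar, not as a DataFrame or ndarray
payload = df.loc[(df['PACKAGE_ID'] == packageName) &
                 (df['OBJECT_NAME'] == viewName), 'CDATA'].iloc[0]
if isinstance(payload, bytes):           # decode first if the cell holds raw bytes
    payload = payload.decode('utf-8')
bs4_xml = BeautifulSoup(payload, features="xml")
print(bs4_xml.logicalModel)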

Passing Key,Value into a Function

I want to check a YouTube video's views and keep track of them over time. I wrote a script that works great:
import requests
import re
import pandas as pd
from datetime import datetime
import time

def check_views(link):
    todays_date = datetime.now().strftime('%d-%m')
    now_time = datetime.now().strftime('%H:%M')
    # get the site
    r = requests.get(link)
    text = r.text
    tag = re.compile(r'\d+ views')
    views = re.findall(tag, text)[0]
    # get the digit number of views; it's returned in a list, so take that item out
    cleaned_views = re.findall(r'\d+', views)[0]
    print(cleaned_views)
    # append to the df
    df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
    # df = df.append([todays_date, now_time, int(cleaned_views)], axis=0)
    df.to_csv('views.csv')
    return df

df = pd.DataFrame(columns=['Date', 'Time', 'Views'])
while True:
    df = check_views('https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s')
    time.sleep(1800)
But now I want to use this function for multiple links. I want a different CSV file for each link. So I made a dictionary:
link_dict = {'link1': 'https://www.youtube.com/watch?v=gPHgRp70H8o&t=3s',
             'link2': 'https://www.youtube.com/watch?v=ZPrAKuOBWzw'}
# this makes it easy for each csv file to be named after the corresponding link
The loop then becomes:
for key, value in link_dict.items():
    df = check_views(value)
That seems to work, passing the dict value (the link) into the function. Inside the function, I just made sure to load the correct csv file at the beginning:
# existing csv files
df = pd.read_csv(k + '.csv')
But then I get an error when I go to append a new row to the df ("cannot set a row with mismatched columns"). I don't understand that, since the same line works just fine in the code above. This is the part giving me the error:
df.loc[len(df)] = [todays_date, now_time, int(cleaned_views)]
What am I missing here? This dictionary method feels super messy (I only have 2 links I want to check, but rather than just duplicating the function I wanted to experiment). Any tips? Thanks!
Figured it out! The problem was that I was saving the df as a CSV and then reading that CSV back later. When I saved it, I didn't use index=False with df.to_csv(), so there was an extra column! When I was just testing with the dictionary, the script kept reusing the in-memory df to do the actual adding of rows, even though it was also saving to a CSV.
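A minimal sketch of the round trip (column names from the question; the row values are made up):
import pandas as pd

df = pd.DataFrame(columns=['Date', 'Time', 'Views'])
df.loc[len(df)] = ['12-12', '10:30', 100]   # hypothetical first measurement
df.to_csv('link1.csv', index=False)         # index=False keeps the extra column off disk

df = pd.read_csv('link1.csv')               # comes back with exactly the 3 columns
df.loc[len(df)] = ['12-12', '11:00', 115]   # appending a row now matches the columns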

Import URLs from Excel into Python

I have this code, but I need to import all the rows in a loop.
I have an Excel file with URL links in one column, and I need to read those links to perform NLP on them. How can I use a loop to read those links? Here's my attempt so far:
import requests
link = 'https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt'
f = requests.get(link)
print(f.text)
If I have understood your question correctly, you want to read an Excel file containing some URLs and perform an operation on every single URL. In this case I would read the file with pandas. Assuming your file is named file.xlsx and has a column named url that contains all the URLs, you could do the following:
import pandas as pd

df = pd.read_excel('file.xlsx')
for url in df['url']:
    print(url)  # or do some useful stuff with it
import pandas as pd
import requests as req

# assuming column A has the url links
mydata = pd.read_excel('/your_file.xlsx', usecols="A", header=None)
for index, row in mydata.iterrows():
    print(req.get(row[0]))  # row is a Series, so row[0] is the URL in column A

Error in reading an ASCII-encoded CSV file?

I have a csv file named Qid-NamedEntityMapping.csv having data like this:
Q1000070 b'Myron V. George'
Q1000296 b'Fred (footballer, born 1979)'
Q1000799 b'Herbert Greenfield'
Q1000841 b'Stephen A. Northway'
Q1001203 b'Buddy Greco'
Q100122 b'Kurt Kreuger'
Q1001240 b'Buddy Lester'
Q1001867 b'Fyodor Stravinsky'
The second column is 'ascii' encoded, and even when I read the file using the following code, it is still not read properly:
import chardet
import pandas as pd

def find_encoding(fname):
    r_file = open(fname, 'rb').read()
    result = chardet.detect(r_file)
    charenc = result['encoding']
    return charenc

my_encoding = find_encoding('datasets/KGfacts/Qid-NamedEntityMapping.csv')
df = pd.read_csv('datasets/KGfacts/Qid-NamedEntityMapping.csv',
                 error_bad_lines=False, encoding=my_encoding)
But the output still does not come out right.
I also tried encoding='UTF-8', but the output is the same.
What can be done to read it properly?
Looks like you have an improperly saved TSV file. Once you circumvent the TAB problem (as suggested in my comment), you can convert the column with names to a more suitable representation.
Let's assume that the second column of the dataframe is called "names". The b'XXX' thing is probably a bytes [mis]representation of a string. Convert it to a bytes object with ast.literal_eval and then decode to a string:
import ast
df["names"].apply(ast.literal_eval).apply(bytes.decode)
#0 Myron...
#1 Fred...
Last but not least, your problem has almost nothing to do with encodings or charsets.
Your issue looks like the CSV is actually tab-separated, so you need to pass sep='\t' to the read_csv function. Otherwise everything is read as a single column, except for the row containing "born 1979", since that is the only cell with a comma in it.
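Putting both answers together, a sketch of the full read (assuming the file really is tab-separated with no header row):
import ast
import pandas as pd

df = pd.read_csv('datasets/KGfacts/Qid-NamedEntityMapping.csv',
                 sep='\t', header=None, names=['qid', 'names'])
# the names column holds textual b'...' literals; turn them into real
# bytes objects, then decode those to ordinary strings
df['names'] = df['names'].apply(ast.literal_eval).apply(bytes.decode)
print(df.head())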

len() function not giving correct character number

I am trying to figure out the number of characters in a string, but for some strange reason len() is only giving me back 1.
Here is an example of my output:
WearWorks is a haptics design company that develops products and
experiences that communicate information through touch. Our first product,
Wayband, is a wearable tactile navigation device for the blind and visually
impaired.
True
1
Here is my code:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://www.wear.works/"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')

# reference: https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python
# getting rid of the script/style tags in the html
for script in soup(["script", "style"]):
    script.extract()  # rip it out
    # print(script)

# get text - grabbing the first chunk of text
text = soup.get_text()[0]
print(isinstance(text, str))
print(len(text))
print(text)
The problem is text = soup.get_text()[0]: indexing a string with [0] returns just its first character, which is itself a string of length 1. Convert it to text = soup.get_text() and have a look; len(text) will then report the full length.
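A quick illustration of why the [0] index gives a length of 1:
s = "WearWorks is a haptics design company"
first = s[0]        # indexing a str yields a 1-character str
print(first)        # W
print(len(first))   # 1
print(len(s))       # 37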
