How to assign bs4 strings to pandas dataframe in for loop? - python-3.x

I have HTML strings that I can successfully parse with beautifulsoup4 to extract the elements I need.
The HTML strings are in a list, and I want to extract only certain elements from each string and assign them to dataframe columns.
Current code:
import pandas as pd
from bs4 import BeautifulSoup

lst = [<html>, <html>]  # placeholder: the list of HTML strings

df = pd.DataFrame()
for i in lst:
    soup = BeautifulSoup(i)
    for link in soup.find_all('a'):
        df['links'] = str(link.get('href'))
        #print(link.get('href'))
    #get all text messages
    soup.find_all('p')
    df['messages'] = str(soup.find_all('p'))
    #get author name
    soup.find_all(class_="author--name")
    df['author'] = str(soup.find_all(class_="author--name"))
    #get username
    soup.find_all(class_="author--username")
    df['username'] = str(soup.find_all(class_="author--username"))
All the soup lines of code produce the data I need, so why are the string values not being assigned to the dataframe columns?
Starting from an empty dataframe, the code creates the new columns, but they hold no values.
What am I doing wrong?

The solution was to wrap the assignments in brackets like so:
for i in lst:
    df = pd.DataFrame()
    soup = BeautifulSoup(i)
    #print(soup)
    for link in soup.find_all('a'):
        df['links'] = [str(link.get('href'))]
        #print(link.get('href'))
    #get all text messages
    soup.find_all('p')
    df['messages'] = [str(soup.find_all('p'))]
    #get author name
    soup.find_all(class_="author--name")
    df['author'] = [str(soup.find_all(class_="author--name"))]
    #get username
    soup.find_all(class_="author--username")
    df['username'] = [str(soup.find_all(class_="author--username"))]
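Why the brackets matter: assigning a scalar to a column of an empty DataFrame broadcasts it over the (empty) index, so the column is created but holds no rows, while a one-element list creates one row. A minimal sketch of the difference, using a made-up URL as the value:

import pandas as pd

df = pd.DataFrame()
df['links'] = 'https://example.com'    # scalar: column exists, but 0 rows
print(df.shape)                        # (0, 1)

df = pd.DataFrame()
df['links'] = ['https://example.com']  # one-element list: one row is created
print(df.shape)                        # (1, 1)

Note that the loop above also rebuilds df on every iteration (and the inner link loop overwrites 'links' each pass), so only the last values survive; collecting one dict per HTML string and calling pd.DataFrame(rows) once at the end would keep every row.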

Related

Python code to loop through a list of postcodes and get the GP practices for those postcodes by scraping the Yellow Pages (Australia)

The code below gives me the following error:
ValueError: Length mismatch: Expected axis has 0 elements, new values have 1 elements
on the df.columns = ["GP Practice Name"] line.
I tried
import pandas as pd
import requests
from bs4 import BeautifulSoup

postal_codes = ["2000", "2010", "2020", "2030", "2040"]
places_by_postal_code = {}

def get_places(postal_code):
    url = f"https://www.yellowpages.com.au/search/listings?clue={postal_code}&locationClue=&latitude=&longitude=&selectedViewMode=list&refinements=category:General%20Practitioner&selectedSortType=distance"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    places = soup.find_all("div", {"class": "listing-content"})
    return [place.find("h2").text for place in places]

for postal_code in postal_codes:
    places = get_places(postal_code)
    places_by_postal_code[postal_code] = places

df = pd.DataFrame.from_dict(places_by_postal_code, orient='index')
df.columns = ["GP Practice Name"]
df = pd.DataFrame(places_by_postal_code.values(), index=places_by_postal_code.keys(), columns=["GP Practice Name"])
print(df)
and was expecting a list of GPs for the postcodes specified in the postal_codes variable.
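The error itself comes from the shape of the frame: pd.DataFrame.from_dict(..., orient='index') makes one column per list element, and if every scraped list comes back empty (the site often returns nothing to scripted requests) the frame has zero columns, so assigning a single column name fails with the length mismatch. A sketch of one way around it, assuming get_places() does return names: build a long-format frame with one row per (postcode, practice) pair, which also copes with postcodes that return nothing.

rows = [
    {"Postcode": code, "GP Practice Name": name}
    for code, names in places_by_postal_code.items()
    for name in names
]
df = pd.DataFrame(rows, columns=["Postcode", "GP Practice Name"])
print(df)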

Pandas - Add items to dataframe

I am trying to add row items to the dataframe, but I am not able to update it.
What I tried until now is commented out, as it doesn't do what I need.
I simply want to download the JSON file and store it in a dataframe with those given columns. It seems I am not able to extract the child components from the JSON file and store them in a brand new dataframe.
Please find below my code:
import requests, json, urllib
import pandas as pd

url = "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
data = pd.read_json(url)

headers = []
df = pd.DataFrame()

for key, item in data['vulnerabilities'].items():
    for k in item.keys():
        headers.append(k)

col = list(set(headers))
new_df = pd.DataFrame(columns=col)

for item in data['vulnerabilities'].items():
    print(item[1])
    # new_df['product'] = item[1]['product']
    # new_df['vendorProject'] = item[1]['vendorProject']
    # new_df['dueDate'] = item[1]['dueDate']
    # new_df['shortDescription'] = item[1]['shortDescription']
    # new_df['dateAdded'] = item[1]['dateAdded']
    # new_df['vulnerabilityName'] = item[1]['vulnerabilityName']
    # new_df['cveID'] = item[1]['cveID']
    # new_df.append(item[1], ignore_index = True)

new_df
At the end my df is still blank.
The nested JSON data can be directly converted to a flattened dataframe using pd.json_normalize(). The headers are extracted from the JSON itself.
new_df = pd.DataFrame(pd.json_normalize(data['vulnerabilities']))
UPDATE: Unnested the vulnerabilities column specifically.
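For reference, a sketch of the same idea without going through pd.read_json first, assuming the feed keeps its current top-level layout (a "vulnerabilities" list inside the document): fetch the raw JSON with requests and point json_normalize at that list via record_path.

import requests
import pandas as pd

url = "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
raw = requests.get(url).json()
kev_df = pd.json_normalize(raw, record_path="vulnerabilities")
print(kev_df.head())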
It worked with this:
import requests, json, urllib
import pandas as pd

url = "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
data = pd.read_json(url)

headers = []
df = pd.DataFrame()

for key, item in data['vulnerabilities'].items():
    for k in item.keys():
        headers.append(k)

col = list(set(headers))
new_df = pd.DataFrame(columns=col)

for item in data['vulnerabilities'].items():
    new_df.loc[len(new_df.index)] = item[1]  # <== this is the line that adds each row
new_df.head()
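A shorter equivalent, offered as a sketch on the assumption that each element of data['vulnerabilities'] is already a plain dict (which is what the loop above relies on): hand the list of dicts straight to the DataFrame constructor and skip the manual column bookkeeping and the row-by-row .loc inserts, which get slow on large feeds.

new_df = pd.DataFrame(data['vulnerabilities'].tolist())
new_df.head()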

extracting columns contents only so that all columns for each row are in the same row using Python's BeautifulSoup

I have the following Python snippet in Jupyter Notebooks that works.
The challenge I have is to extract just the rows of columnar data.
Here's the snippet:
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

page = requests.get("http://lib.stat.cmu.edu/datasets/boston")
page
soup = bs(page.content)
soup
allrows = soup.find_all("p")
print(allrows)
I'm a little unclear on what you are after, but I think it's each individual row of data from the URL provided.
I couldn't find a way to use Beautiful Soup alone to parse out the data you are after, but I did find a way to separate the rows using .split():
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

page = requests.get("http://lib.stat.cmu.edu/datasets/boston")
soup = bs(page.content)
allrows = soup.find_all("p")

text = soup.text  # turn soup into text
text_split = text.split('\n\n')  # split the page into 3 sections
data = text_split[2]  # rows of data

# create df column titles using variable titles on page
col_titles = text_split[1].split('\n')
df = pd.DataFrame(columns=range(14))
df.columns = col_titles[1:]

# 'try/except' to catch end of index,
# loop through text data building complete rows
try:
    complete_row = []
    n1 = 0  # used to track index
    n2 = 1
    rows = data.split('\n')
    for el in range(len(rows)):
        full_row = rows[n1] + rows[n2]
        complete_row.append(full_row)
        n1 = n1 + 2
        n2 = n2 + 2
except IndexError:
    print('end of loop')

# loop through rows of data, clean whitespace and append to df
for row in complete_row:
    elem = row.split(' ')
    df.loc[len(df)] = [el for el in elem if el]

# finished dataframe
df
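As an aside, the same file can also be loaded without Beautiful Soup at all. This is a sketch that assumes the page layout stays as it is today (a 22-line header, after which every observation is wrapped across two physical lines):

import numpy as np
import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"
# each observation spans two physical lines: 11 values, then 3 values
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
values = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :3]])
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
boston_df = pd.DataFrame(values, columns=columns)
print(boston_df.head())

The column names here are taken from the variable list on that page; double-check them against text_split[1] if the source ever changes.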

Loop pages and save detailed contents as dataframe in Python

Say I need to crawl the detailed contents from this link:
The objective is to extract the elements from the link and append all the entries as a dataframe.
from bs4 import BeautifulSoup
import requests
import os
from urllib.parse import urlparse

url = 'http://www.jscq.com.cn/dsf/zc/cjgg/202101/t20210126_30144.html'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
text = soup.find_all(text=True)

output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script'
]

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)
print(output)
Out:
南京市玄武区锁金村10-30号房屋公开招租成交公告-成交公告-江苏产权市场
body{font-size:100%!important;}
.main_body{position:relative;width:1000px;margin:0 auto;background-color:#fff;}
.main_content_p img{max-width:90%;display:block;margin:0 auto;}
.m_con_r_h{padding-left: 20px;width: 958px;height: 54px;line-height: 55px;font-size: 12px;color: #979797;}
.m_con_r_h a{color: #979797;}
.main_content_p{min-height:200px;width:90%;margin:0 auto;line-height: 30px;text-indent:0;}
.main_content_p table{margin:0 auto!important;width:900px!important;}
.main_content_h1{border:none;width:93%;margin:0 auto;}
.tit_h{font-size:22px;font-family:'微软雅黑';color:#000;line-height:30px;margin-bottom:10px;padding-bottom:20px;text-align:center;}
.doc_time{font-size:12px;color:#555050;height:28px;line-height:28px;text-align:center;background:#F2F7FD;border-top:1px solid #dadada;}
.doc_time span{padding:0 5px;}
.up_dw{width:100%;border-top:1px solid #ccc;padding-top:10px;padding-bottom:10px;margin-top:30px;clear:both;}
.pager{width:50%;float:left;padding-left:0;text-align:center;}
.bshare-custom{position:absolute;top:20px;right:40px;}
.pager{width:90%;padding-left: 50px;float:inherit;text-align: inherit;}
页头部分开始
页头部分结束
START body
南京市玄武区锁金村10-30号房屋公开招租成交公告
组织机构:江苏省产权交易所
发布时间:2021-01-26
项目编号
17FCZZ20200125
转让/出租标的名称
南京市玄武区锁金村10-30号房屋公开招租
转让方/出租方名称
南京邮电大学资产经营有限责任公司
转让标的评估价/年租金评估价(元)
64800.00
转让底价/年租金底价(元)
97200.00
受让方/承租方名称
马尕西木
成交价/成交年租金(元)
97200.00
成交日期
2021年01月15日
附件:
END body
页头部分开始
页头部分结束
But how could I loop over all the pages, extract their contents, and append them to the following dataframe? Thanks.
Update for appending the dfs as one dataframe:
updated_df = pd.DataFrame()

with requests.Session() as connection_session:  # reuse your connection!
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        key = follow_url.rsplit("/")[-1].replace(".html", "")
        # print(f"Fetching data for {key}...")
        dfs = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4",
        )
        # https://stackoverflow.com/questions/39710903/pd-read-html-imports-a-list-rather-than-a-dataframe
        for df in dfs:
            df = dfs[0].T.iloc[1:, :].copy()
            updated_df = updated_df.append(df)

print(updated_df)
cols = ['项目编号', '转让/出租标的名称', '转让方/出租方名称', '转让标的评估价/年租金评估价(元)',
        '转让底价/年租金底价(元)', '受让方/承租方名称', '成交价/成交年租金(元)', '成交日期']
updated_df.columns = cols
updated_df.to_excel('./data.xlsx', index=False)
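A side note: DataFrame.append was removed in pandas 2.0, so on newer versions the same accumulation can be written with pd.concat. A sketch of the equivalent loop, reusing get_main_urls() and get_follow_urls() from the answer below:

frames = []
with requests.Session() as connection_session:
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        tables = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4",
        )
        # the first table on each detail page holds the announcement fields
        frames.append(tables[0].T.iloc[1:, :].copy())
updated_df = pd.concat(frames, ignore_index=True)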
Here's how I would do this:
1. build all main urls
2. visit every main page
3. get the follow urls
4. visit each follow url
5. grab the table from the follow url
6. parse the table with pandas
7. add the table to a dictionary of pandas dataframes
8. process the tables (not included -> implement your logic)
Repeat steps 2 - 7 to continue scraping the data.
The code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.jscq.com.cn/dsf/zc/cjgg"

def get_main_urls() -> list:
    start_url = f"{BASE_URL}/index.html"
    return [start_url] + [f"{BASE_URL}/index_{i}.html" for i in range(1, 6)]

def get_follow_urls(urls: list, session: requests.Session()) -> iter:
    for url in urls[:1]:  # remove [:1] to scrape all the pages
        body = session.get(url).content
        s = BeautifulSoup(body, "lxml").find_all("td", {"width": "60%"})
        yield from [f"{BASE_URL}{a.find('a')['href'][1:]}" for a in s]

dataframe_collection = {}

with requests.Session() as connection_session:  # reuse your connection!
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        key = follow_url.rsplit("/")[-1].replace(".html", "")
        print(f"Fetching data for {key}...")
        df = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4",
        )
        dataframe_collection[key] = df

# process the dataframe_collection here

# print the dictionary of dataframes (optional and can be removed)
for key in dataframe_collection.keys():
    print("\n" + "=" * 40)
    print(key)
    print("-" * 40)
    print(dataframe_collection[key])
Output:
Fetching data for t20210311_30347...
Fetching data for t20210311_30346...
Fetching data for t20210305_30338...
Fetching data for t20210305_30337...
Fetching data for t20210303_30323...
Fetching data for t20210225_30306...
Fetching data for t20210225_30305...
Fetching data for t20210225_30304...
Fetching data for t20210225_30303...
Fetching data for t20210209_30231...
and then ...

appending Dict to nested list per request made

I am currently scraping an XML API response. For each request I want to gather a particular piece of information and create a dictionary each time I find it. Each request can return several IDs: one response might have 2 IDs while the next has 3. For example, say the first response has 2 IDs. At the moment I am storing this data in a list, and when the second request is made, its additional IDs get stored under this same list as well.
import requests
import pandas as pd
from pandas import DataFrame
from bs4 import BeautifulSoup
import datetime as datetime
import json
import time

trackingDomain = ''
domain = ''
aIDs = []
cIDs = []
url = "https://" + domain + ""
print(url)

df = pd.read_csv('campids.csv')
for index, row in df.iterrows():
    payload = {'api_key': '',
               'campaign_id': '0',
               'site_offer_id': row['IDs'],
               'source_affiliate_id': '0',
               'channel_id': '0',
               'account_status_id': '0',
               'media_type_id': '0',
               'start_at_row': '0',
               'row_limit': '0',
               'sort_field': 'campaign_id',
               'sort_descending': 'TRUE'
               }
    print('Campaign Payload', payload)
    r = requests.get(url, params=payload)
    print(r.status_code)
    soup = BeautifulSoup(r.text, 'lxml')
    success = soup.find('success').string
    for affIDs in soup.select('campaign'):
        affID = affIDs.find('source_affiliate_id').string
        aIDs.append(affID)
    dataDict = dict()
    dataDict['offers'] = []
    affDict = {'affliate_id': aIDs}
    dataDict['offers'].append(dict(affDict))
The result ends up being as follows:
dictData = {'offers': [{'affliate_id': ['9','2','45','47','14','8','30','30','2','2','9','2']}]}
What I am looking to do is this:
dictData = {'offers': [{'affiliate_id': ['9','2','45','47','14','8','30','30','2','2']}, {'affiliate_id': ['9','2']}]}
On the first request, I obtain the following:
IDs: ['9','2','45','47','14','8','30','30','2','2']
On the second request these IDs are returned:
['9','2']
I am new to Python, so please bear with me as far as etiquette goes and if I am missing something. I'll be happy to provide any additional information.
It has to do with the order of your initializing and appending, which is causing you not to get the outcome you want. You are overwriting your dataDict after each iteration, and inserting the appended list, which is not overwritten, thus leaving you with a final list that has appended ALL aIDs. What you want to do is initialize dataDict outside of your for loop, and then you can append the dictionary built in the nested loop into that list:
Note: It's tough to work out/test without having the actual data, but I believe this should do it if I worked out the logic correctly in my head:
import requests
import pandas as pd
from pandas import DataFrame
from bs4 import BeautifulSoup
import datetime as datetime
import json
import time

trackingDomain = ''
domain = ''
cIDs = []
url = "https://" + domain + ""

# Initialize your dictionary
dataDict = dict()
# Initialize your list in your dictionary under key `offers`
dataDict['offers'] = []

print(url)
df = pd.read_csv('campids.csv')
for index, row in df.iterrows():
    payload = {'api_key': '',
               'campaign_id': '0',
               'site_offer_id': row['IDs'],
               'source_affiliate_id': '0',
               'channel_id': '0',
               'account_status_id': '0',
               'media_type_id': '0',
               'start_at_row': '0',
               'row_limit': '0',
               'sort_field': 'campaign_id',
               'sort_descending': 'TRUE'
               }
    print('Campaign Payload', payload)
    r = requests.get(url, params=payload)
    print(r.status_code)
    soup = BeautifulSoup(r.text, 'lxml')
    success = soup.find('success').string
    # Initialize your list for this iteration/row in your df.iterrows
    aIDs = []
    for affIDs in soup.select('campaign'):
        affID = affIDs.find('source_affiliate_id').string
        # Append those affIDs to the aIDs list
        aIDs.append(affID)
    # Create your dictionary of key:value with key 'affiliate_id' and value the aIDs list
    affDict = {'affliate_id': aIDs}
    # NOW append that into your list in your dictionary under key `offers`
    dataDict['offers'].append(dict(affDict))
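A tiny standalone illustration of the pattern, with hypothetical hard-coded responses standing in for the real API calls: because a fresh aIDs list is created per request, each request's IDs end up in their own dict inside 'offers'.

# hypothetical data: pretend each inner list is the set of IDs from one API response
responses = [['9', '2', '45', '47', '14'], ['9', '2']]

dataDict = {'offers': []}
for ids_in_response in responses:
    aIDs = []                      # fresh list for this request
    for aff_id in ids_in_response:
        aIDs.append(aff_id)
    dataDict['offers'].append({'affiliate_id': aIDs})

print(dataDict)
# {'offers': [{'affiliate_id': ['9', '2', '45', '47', '14']}, {'affiliate_id': ['9', '2']}]}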
