I am currently scraping through an XML API response. I am looking to gather a piece of information for each request and create a dictionary each time I find this piece of data. Each request can have several IDs. So one response can have 2 IDs while the next response might have 3 IDs. For example, let's say the first response has 2 IDs. I am storing this data in a list at the moment when the second request is done the additional 3 IDs are being stored under this same list as well.
import requests
import pandas as pd
from pandas import DataFrame
from bs4 import BeautifulSoup
import datetime as datetime
import json
import time
trackingDomain = ''
domain = ''
aIDs = []
cIDs = []
url = "https://" + domain + ""
print(url)
df = pd.read_csv('campids.csv')
for index, row in df.iterrows():
payload = {'api_key':'',
'campaign_id':'0',
'site_offer_id':row['IDs'],
'source_affiliate_id':'0',
'channel_id':'0',
'account_status_id':'0',
'media_type_id':'0',
'start_at_row':'0',
'row_limit':'0',
'sort_field':'campaign_id',
'sort_descending':'TRUE'
}
print('Campaign Payload', payload)
r = requests.get(url, params=payload)
print(r.status_code)
soup = BeautifulSoup(r.text, 'lxml')
success = soup.find('success').string
for affIDs in soup.select('campaign'):
affID = affIDs.find('source_affiliate_id').string
aIDs.append(affID)
dataDict = dict()
dataDict['offers'] = []
affDict = {'affliate_id':aIDs}
dataDict['offers'].append(dict(affDict))
The result ends up being as follows:
dictData = {'offers': [{'affliate_id': ['9','2','45','47','14','8','30','30','2','2','9','2']}]}
What I am looking to do is this:
dictData = {'offers':[{'affiliate_id'['9','2','45','47','14','8','30','30','2','2']},{'affiliate_id':['9','2']}]}
On the first request, I obtain the following:
IDs['9','2','45','47','14','8','30','30','2','2']
On the second request these IDs are returned:
['9','2']
I am new to Python so please bear with me as far etiquette goes and I am missing something. I'll be happy to provide any additional information.
It has to do with the order of your initializing and appending that is causing you to not get the outcome you are wanting. You are overwriting your dataDict after each iteration, and inserting the appended list which is not overwritten, thus leaving you with a final list that has appended ALL aIDs. What you want to to do is initialise that dataDict out side of your for loop, and then you can append the dictionary in the nested loop into that list:
Note: It's tough to work out/test without having the actual data, but I believe this should do it if I worked out the logic correctly in my head:
import requests
import pandas as pd
from pandas import DataFrame
from bs4 import BeautifulSoup
import datetime as datetime
import json
import time
trackingDomain = ''
domain = ''
cIDs = []
url = "https://" + domain + ""
# Initialize your dictionary
dataDict = dict()
# Initialize your list in your dictionary under key `offers`
dataDict['offers'] = []
print(url)
df = pd.read_csv('campids.csv')
for index, row in df.iterrows():
payload = {'api_key':'',
'campaign_id':'0',
'site_offer_id':row['IDs'],
'source_affiliate_id':'0',
'channel_id':'0',
'account_status_id':'0',
'media_type_id':'0',
'start_at_row':'0',
'row_limit':'0',
'sort_field':'campaign_id',
'sort_descending':'TRUE'
}
print('Campaign Payload', payload)
r = requests.get(url, params=payload)
print(r.status_code)
soup = BeautifulSoup(r.text, 'lxml')
success = soup.find('success').string
# Initialize your list for this iteration/row in your df.iterrows
aIDs = []
for affIDs in soup.select('campaign'):
affID = affIDs.find('source_affiliate_id').string
# Append those affIDs to the aIDs list
aIDs.append(affID)
# Create your dictionary of key:value with key 'affiliate_id' and value the aIDs list
affDict = {'affliate_id':aIDs}
# NOW append that into your list in your dictionary under key `offers`
dataDict['offers'].append(dict(affDict))
Related
I am trying to add row items to the dataframe, and I am not able to update the dataframe.
What i tried until now is commented out as it doesn't do what I need.
I simply want to download the json file and store it to a dataframe with those given columns. Seems I am not able to extract the child components fron JSON file and store them to a brand new dataframe.
Please find bellow my code:
import requests, json, urllib
import pandas as pd
url = "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
data = pd.read_json(url)
headers = []
df = pd.DataFrame()
for key, item in data['vulnerabilities'].items():
for k in item.keys():
headers.append(k)
col = list(set(headers))
new_df = pd.DataFrame(columns=col)
for item in data['vulnerabilities'].items():
print(item[1])
# new_df['product'] = item[1]['product']
# new_df['vendorProject'] = item[1]['vendorProject']
# new_df['dueDate'] = item[1]['dueDate']
# new_df['shortDescription'] = item[1]['shortDescription']
# new_df['dateAdded'] = item[1]['dateAdded']
# new_df['vulnerabilityName'] = item[1]['vulnerabilityName']
# new_df['cveID'] = item[1]['cveID']
# new_df.append(item[1], ignore_index = True)
new_df
At the end my df is still blank.
The nested JSON data can be directly converted to a flattened dataframe using pd.json_normalize(). The headers are extracted from the JSON itself.
new_df = pd.DataFrame(pd.json_normalize(data['vulnerabilities']))
UPDATE: Unnested the vulnerabilities column specifically.
Output:
It worked with this:
import requests, json, urllib
import pandas as pd
url = "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
data = pd.read_json(url)
headers = []
df = pd.DataFrame()
for key, item in data['vulnerabilities'].items():
for k in item.keys():
headers.append(k)
col = list(set(headers))
new_df = pd.DataFrame(columns=col)
for item in data['vulnerabilities'].items():
new_df.loc[len(new_df.index)] = item[1] <===THIS
new_df.head()
Based on the answered code from this link, I'm able to create a new column: df['url'] = 'https://www.cspea.com.cn/list/c01/' + df['projectCode'].
Next step I would like to pass the url column's values to the following code and append all the scrapied contents as dataframe.
import urllib3
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.cspea.com.cn/list/c01/gr2021bj1000186" # url column's values should be passed here one by one
soup = BeautifulSoup(requests.get(url, verify=False).content, "html.parser")
index, data = [], []
for th in soup.select(".project-detail-left th"):
h = th.get_text(strip=True)
t = th.find_next("td").get_text(strip=True)
index.append(h)
data.append(t)
df = pd.DataFrame(data, index=index, columns=["value"])
print(df)
How could I do that in Python? Thanks.
Updated:
import requests
from bs4 import BeautifulSoup
import pandas as pd
df = pd.read_excel('items_scraped.xlsx')
data = []
urls = df.url.tolist()
for url_link in urls:
url = url_link
# url = "https://www.cspea.com.cn/list/c01/gr2021bj1000186"
soup = BeautifulSoup(requests.get(url, verify=False).content, "html.parser")
index, data = [], []
for th in soup.select(".project-detail-left th"):
h = th.get_text(strip=True)
t = th.find_next("td").get_text(strip=True)
index.append(h)
data.append(t)
df = pd.DataFrame(data, index=index, columns=["value"])
df = df.T
df.reset_index(drop=True, inplace=True)
print(df)
df.to_excel('result.xlsx', index = False)
But it only saved one rows into excel file.
You need to combine the dfs generated in the loop. You could add them to a list and then call pd.concat on that list.
import requests
from bs4 import BeautifulSoup
import pandas as pd
df = pd.read_excel('items_scraped.xlsx')
# data = []
urls = df.url.tolist()
dfs = []
for url_link in urls:
url = url_link
# url = "https://www.cspea.com.cn/list/c01/gr2021bj1000186"
soup = BeautifulSoup(requests.get(url, verify=False).content, "html.parser")
index, data = [], []
for th in soup.select(".project-detail-left th"):
h = th.get_text(strip=True)
t = th.find_next("td").get_text(strip=True)
index.append(h)
data.append(t)
df = pd.DataFrame(data, index=index, columns=["value"])
df = df.T
df.reset_index(drop=True, inplace=True)
print(df)
dfs.append(df)
df = pd.concat(dfs)
df.to_excel('result.xlsx', index = False)
Use
urls = df.url.tolist()
To create a list of URLs and then iterate through them using f string to insert each one into your base url
Say I need to crawler the detailed contents from this link:
The objective is to extract contents the elements from the link, and append all the entries as dataframe.
from bs4 import BeautifulSoup
import requests
import os
from urllib.parse import urlparse
url = 'http://www.jscq.com.cn/dsf/zc/cjgg/202101/t20210126_30144.html'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
text = soup.find_all(text=True)
output = ''
blacklist = [
'[document]',
'noscript',
'header',
'html',
'meta',
'head',
'input',
'script'
]
for t in text:
if t.parent.name not in blacklist:
output += '{} '.format(t)
print(output)
Out:
南京市玄武区锁金村10-30号房屋公开招租成交公告-成交公告-江苏产权市场
body{font-size:100%!important;}
.main_body{position:relative;width:1000px;margin:0 auto;background-color:#fff;}
.main_content_p img{max-width:90%;display:block;margin:0 auto;}
.m_con_r_h{padding-left: 20px;width: 958px;height: 54px;line-height: 55px;font-size: 12px;color: #979797;}
.m_con_r_h a{color: #979797;}
.main_content_p{min-height:200px;width:90%;margin:0 auto;line-height: 30px;text-indent:0;}
.main_content_p table{margin:0 auto!important;width:900px!important;}
.main_content_h1{border:none;width:93%;margin:0 auto;}
.tit_h{font-size:22px;font-family:'微软雅黑';color:#000;line-height:30px;margin-bottom:10px;padding-bottom:20px;text-align:center;}
.doc_time{font-size:12px;color:#555050;height:28px;line-height:28px;text-align:center;background:#F2F7FD;border-top:1px solid #dadada;}
.doc_time span{padding:0 5px;}
.up_dw{width:100%;border-top:1px solid #ccc;padding-top:10px;padding-bottom:10px;margin-top:30px;clear:both;}
.pager{width:50%;float:left;padding-left:0;text-align:center;}
.bshare-custom{position:absolute;top:20px;right:40px;}
.pager{width:90%;padding-left: 50px;float:inherit;text-align: inherit;}
页头部分开始
页头部分结束
START body
南京市玄武区锁金村10-30号房屋公开招租成交公告
组织机构:江苏省产权交易所
发布时间:2021-01-26
项目编号
17FCZZ20200125
转让/出租标的名称
南京市玄武区锁金村10-30号房屋公开招租
转让方/出租方名称
南京邮电大学资产经营有限责任公司
转让标的评估价/年租金评估价(元)
64800.00
转让底价/年租金底价(元)
97200.00
受让方/承租方名称
马尕西木
成交价/成交年租金(元)
97200.00
成交日期
2021年01月15日
附件:
END body
页头部分开始
页头部分结束
But how could I loop all the pages and extract contents, and append them to the following dataframe? Thanks.
Updates for appending dfs as a dataframe:
updated_df = pd.DataFrame()
with requests.Session() as connection_session: # reuse your connection!
for follow_url in get_follow_urls(get_main_urls(), connection_session):
key = follow_url.rsplit("/")[-1].replace(".html", "")
# print(f"Fetching data for {key}...")
dfs = pd.read_html(
connection_session.get(follow_url).content.decode("utf-8"),
flavor="bs4",
)
# https://stackoverflow.com/questions/39710903/pd-read-html-imports-a-list-rather-than-a-dataframe
for df in dfs:
df = dfs[0].T.iloc[1:, :].copy()
updated_df = updated_df.append(df)
print(updated_df)
cols = ['项目编号', '转让/出租标的名称', '转让方/出租方名称', '转让标的评估价/年租金评估价(元)',
'转让底价/年租金底价(元)', '受让方/承租方名称', '成交价/成交年租金(元)', '成交日期']
updated_df.columns = cols
updated_df.to_excel('./data.xlsx', index = False)
Here's how I would do this:
build all main urls
visit every main page
get the follow urls
visit each follow url
grab the table from the follow url
parse the table with pandas
add the table to a dictionary of pandas dataframes
process the tables (not included -> implement your logic)
repeat the 2 - 7 steps to continue scraping the data.
The code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
BASE_URL = "http://www.jscq.com.cn/dsf/zc/cjgg"
def get_main_urls() -> list:
start_url = f"{BASE_URL}/index.html"
return [start_url] + [f"{BASE_URL}/index_{i}.html" for i in range(1, 6)]
def get_follow_urls(urls: list, session: requests.Session()) -> iter:
for url in urls[:1]: # remove [:1] to scrape all the pages
body = session.get(url).content
s = BeautifulSoup(body, "lxml").find_all("td", {"width": "60%"})
yield from [f"{BASE_URL}{a.find('a')['href'][1:]}" for a in s]
dataframe_collection = {}
with requests.Session() as connection_session: # reuse your connection!
for follow_url in get_follow_urls(get_main_urls(), connection_session):
key = follow_url.rsplit("/")[-1].replace(".html", "")
print(f"Fetching data for {key}...")
df = pd.read_html(
connection_session.get(follow_url).content.decode("utf-8"),
flavor="bs4",
)
dataframe_collection[key] = df
# process the dataframe_collection here
# print the dictionary of dataframes (optional and can be removed)
for key in dataframe_collection.keys():
print("\n" + "=" * 40)
print(key)
print("-" * 40)
print(dataframe_collection[key])
Output:
Fetching data for t20210311_30347...
Fetching data for t20210311_30346...
Fetching data for t20210305_30338...
Fetching data for t20210305_30337...
Fetching data for t20210303_30323...
Fetching data for t20210225_30306...
Fetching data for t20210225_30305...
Fetching data for t20210225_30304...
Fetching data for t20210225_30303...
Fetching data for t20210209_30231...
and then ...
I have an HTML string that I am successfully able to use beautifulsoup4 on to extract the elements I need.
the HTML strings are in a list and I am wanting to extract only certain elements out of the strings and assign them to dataframe columns.
Current code:
import pandas as pd
from bs4 import BeautifulSoup
lst = [ <html>,<html>]
df = pd.DataFrame()
for i in lst:
soup = BeautifulSoup(i)
for link in soup.find_all('a'):
df['links'] = str(link.get('href'))
#print(link.get('href'))
#get all text messages
soup.find_all('p')
df['messages'] = str(soup.find_all('p'))
#get author name
soup.find_all(class_="author--name")
df['author'] = str(soup.find_all(class_="author--name"))
#get username
soup.find_all(class_= "author--username")
df['username'] = str(soup.find_all(class_= "author--username"))
All the soup lines of code are producing the data I need, but why is the dataframe not assigning the string values to the dataframe columns?
I can see that from an empty dataframe, the code creates the new columns but there are no values.
What am I doing wrong?
The solution was to wrap the assignments in brackets like so:
for i in lst:
df = pd.DataFrame()
soup = BeautifulSoup(i)
#print(soup)
for link in soup.find_all('a'):
df['links'] = [str(link.get('href'))]
#print(link.get('href'))
#get all text messages
soup.find_all('p')
df['messages'] = [str(soup.find_all('p'))]
#get author name
soup.find_all(class_="author--name")
df['author'] = [str(soup.find_all(class_="author--name"))]
#get username
soup.find_all(class_= "author--username")
df['username'] = [str(soup.find_all(class_= "author--username"))] text messages
soup.find_all('p')
df['messages'] = str(soup.find_all('p'))
#get author name
soup.find_all(class_="author--name")
df['author'] = str(soup.find_all(class_="author--name"))
#get username
soup.find_all(class_= "author--username")
df['username'] = str(soup.find_all(class_= "author--username"))
I have to requests.get() two urls from yahoo utilizing YQL that returns a json object. I'm getting back a json objects that I store into a list(). Then I'm looping to parse the data and creating a dic to then create a pandas data frame. Happened that only one list is getting appended to the data frame. Seems like in the last iteration, the 2nd list overwrites the first list. At this point, I can't figure out how to iterate on the list to append() both elements of the list. Here is my code...
import requests
import pandas as pd
urls = ['https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.historicaldata%20where%20symbol%20in%20(%22DIA%22%2C%22SPY%22%2C%22IWN%22)%20and%20startDate%20%3D%20%222015-01-01%22%20and%20endDate%20%3D%20%222015-10-31%22&format=json&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys&callback=',
'https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.historicaldata%20where%20symbol%20in%20(%22DIA%22%2C%22SPY%22%2C%22IWN%22)%20and%20startDate%20%3D%20%222015-11-01%22%20and%20endDate%20%3D%20%222016-08-31%22&format=json&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys&callback=']
for url in urls:
data = requests.get(url)
data_json = data.json()
quote_list = []
for quote in data_json['query']['results']['quote']:
quote_dic = {'symbol': quote['Symbol'],
'date': quote['Date'],
'volume': quote['Volume'],
'low': quote['Low'],
'high': quote['High'],
'open': quote['Open'],
'close': quote['Close'],
'adj_close': quote['Adj_Close']}
quote_list.append(quote_dic)
quote_df = pd.DataFrame(quote_list)
quote_df.to_csv('stocks.csv')
I need to be able to append the entire list() into the data frame. What would be the fix for this code?
Just create a list of dataframes, and concat them at the end of the loop:
df_list = []
for url in urls:
data = requests.get(url)
data_json = data.json()
df = pd.DataFrame(data_json['query']['results']['quote'])
df_list.append(df)
quote_df = pd.concat(df_list)
quote_df.to_csv('stocks.csv')
How about this solution?
import urllib
import re
import json
symbolslist = open("C:/Users/your_path_here/Desktop/stock_symbols.txt").read()
symbolslist = symbolslist.split("\n")
for symbol in symbolslist:
myfile = open("C:/Users/your_path_here/Desktop/" +symbol +".txt", "w+")
myfile.close()
htmltext = urllib.urlopen("http://www.bloomberg.com/markets/chart/data/1D/"+ symbol+ ":US")
data = json.load(htmltext)
datapoints = data["data_values"]
myfile = open("C:/Users/rshuell001/Desktop/symbols/" +symbol +".txt", "a")
for point in datapoints:
myfile.write(str(symbol+","+str(point[0])+","+str(point[1])+"\n"))
myfile.close()
In this file "C:/Users/your_path_here/Desktop/symbols/amex.txt"
you have the following tickers
ibm
sbux
msft