Result is not stacking as desired via tuple or dict - python-3.x

I am grabbing some financial data off a website: a list of names and a list of numbers. If I print them independently, I get each result fine, but I can't manage to put them together.
import requests
import bs4

def fundamentals(name, number):
    url = 'http://quotes.money.163.com/f10/dbfx_002230.html?date=2020-03-31#01c08'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'
    }
    response = requests.get(url, headers=headers).content
    soup = bs4.BeautifulSoup(response, 'html5lib')
    DuPond_name = soup.find_all(name='td', attrs={'class': 'dbbg01'})
    DuPond_number = soup.find_all(name='td', attrs={'class': 'dbbg02'})
    for i, names in enumerate(DuPond_name, 1):
        name = names.getText()
        print(name)
    for i, numbers in enumerate(DuPond_number, 1):
        number = numbers.getText()
        print(number)
    return {name: number}

if __name__ == '__main__':
    print(fundamentals(name=[], number=[]))
DOM
净资产收益率
总资产收益率
权益乘数
销售净利率
总资产周转率
净利润
营业收入
营业收入
平均资产总额
营业收入
全部成本
投资收益
所得税
其他
营业成本
销售费用
管理费用
财务费用
-1.16%
-0.63%
1/(1-40.26%)
-9.33%
0.07
-131,445,229.01
1,408,820,489.46
1,408,820,489.46
9,751,224,017.79
1,408,820,489.46
1,704,193,442.22
5,971,254
17,965,689
--
776,103,494
274,376,792.25
186,977,519.02
5,173,865.88
{'财务费用': '5,173,865.88'}
Process finished with exit code 0
The final dict only gives me the very last name/number pair. How can I fix it? Or, even better, how can I put the results into DataFrame form? Thank you all for the help!

Can you try this? Since DuPond_name and DuPond_number are ResultSets of tags, extract the text first:
import pandas as pd

df = pd.DataFrame([[td.getText() for td in DuPond_name],
                   [td.getText() for td in DuPond_number]]).T
For reference, this is how I tested it:
import pandas as pd
ls1 = ['净资产收益率','总资产收益率','权益乘数','销售净利率','总资产周转率','净利润','营业收入','营业收入','平均资产总额','营业收入','全部成本','投资收益','所得税','其他','营业成本','销售费用','管理费用','财务费用']
ls2 = ['-1.16%','-0.63%','1/(1-40.26%)','-9.33%','0.07','-131,445,229.01','1,408,820,489.46','1,408,820,489.46','9,751,224,017.79','1,408,820,489.46','1,704,193,442.22','5,971,254','17,965,689','--','776,103,494','274,376,792.25','186,977,519.02','5,173,865.88']
df = pd.DataFrame([ls1, ls2]).T
print(df)
This is the output I got:
0 1
0 净资产收益率 -1.16%
1 总资产收益率 -0.63%
2 权益乘数 1/(1-40.26%)
3 销售净利率 -9.33%
4 总资产周转率 0.07
5 净利润 -131,445,229.01
6 营业收入 1,408,820,489.46
7 营业收入 1,408,820,489.46
8 平均资产总额 9,751,224,017.79
9 营业收入 1,408,820,489.46
10 全部成本 1,704,193,442.22
11 投资收益 5,971,254
12 所得税 17,965,689
13 其他 --
14 营业成本 776,103,494
15 销售费用 274,376,792.25
16 管理费用 186,977,519.02
17 财务费用 5,173,865.88
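For completeness, here is a minimal sketch (untested against the live page) of how fundamentals() could be restructured to return every pair at once instead of overwriting name and number on each loop iteration. One caveat: a dict collapses duplicate keys, and 营业收入 appears three times in the scraped labels, so the DataFrame is the safer container here:

import requests
import bs4
import pandas as pd

def fundamentals():
    url = 'http://quotes.money.163.com/f10/dbfx_002230.html?date=2020-03-31#01c08'
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers).content
    soup = bs4.BeautifulSoup(response, 'html5lib')
    names = [td.getText() for td in soup.find_all('td', attrs={'class': 'dbbg01'})]
    numbers = [td.getText() for td in soup.find_all('td', attrs={'class': 'dbbg02'})]
    # Pair the two columns up positionally; duplicate labels survive in the DataFrame
    return pd.DataFrame({'name': names, 'number': numbers})

if __name__ == '__main__':
    print(fundamentals())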

Related

How to access a returned variable from function 1 to use in function 2?

The objective is to acquire stock price data for each stock ticker, then assign each relevant variable its current price, i.e. the variable 'sandp' links to the ticker symbol 'GSPC' and holds that index's closing price. This bit works. However, I wish to return each variable's value so I can use it within another function; how do I access that variable and its value?
Here is my code:
def live_indices():
    """Acquire stock values from Yahoo Finance using the stock ticker as key, then assign
    the relevant variable to the respective value, i.e. the variable 'sandp' equates to
    the value gathered for the 'GSPC' ticker.
    """
    import requests
    import bs4

    ticker_symbol_1 = ['GSPC', 'DJI', 'IXIC', 'FTSE', 'NSEI', 'FCHI', 'N225', 'GDAXI']
    ticker_symbol_2 = ['IMOEX.ME', '000001.SS']  # Assigned to a separate list, as the URL for web scraping is different
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'}
    all_indices_values = []
    for i in range(len(ticker_symbol_1)):
        url = 'https://uk.finance.yahoo.com/quote/%5E' + ticker_symbol_1[i] + '?p=%5E' + ticker_symbol_1[i]
        tmp_res = requests.get(url, headers=headers)
        tmp_res.raise_for_status()
        soup = bs4.BeautifulSoup(tmp_res.text, 'html.parser')
        indices_price_value = soup.select(
            '#quote-header-info > div.My\(6px\).Pos\(r\).smartphone_Mt\(6px\).W\(100\%\) > div.D\(ib\).Va\(m\).Maw\('
            '65\%\).Ov\(h\) > div > fin-streamer.Fw\(b\).Fz\(36px\).Mb\(-4px\).D\(ib\)')[0].text
        all_indices_values.append(indices_price_value)
    for i in range(len(ticker_symbol_2)):
        url = 'https://uk.finance.yahoo.com/quote/' + ticker_symbol_2[i] + '?p=' + ticker_symbol_2[i]
        tmp_res = requests.get(url, headers=headers)
        tmp_res.raise_for_status()
        soup = bs4.BeautifulSoup(tmp_res.text, 'html.parser')
        indices_price_value = soup.select(
            '#quote-header-info > div.My\(6px\).Pos\(r\).smartphone_Mt\(6px\).W\(100\%\) > div.D\(ib\).Va\(m\).Maw\('
            '65\%\).Ov\(h\) > div > fin-streamer.Fw\(b\).Fz\(36px\).Mb\(-4px\).D\(ib\)')[0].text
        all_indices_values.append(indices_price_value)
    sandp, dow, nasdaq, ftse100, nifty50, cac40, nikkei, dax, moex, shanghai = [
        all_indices_values[i] for i in range(10)]  # 10 stock tickers in total
    return sandp, dow, nasdaq, ftse100, nifty50, cac40, nikkei, dax, moex, shanghai
I want the next function to simply be given a variable returned from the first function and print out its stock value. I have tried the below, to no avail:
def display_value(stock_name):
    print(stock_name)

display_value(live_indices(sandp))
The obvious error here is that 'sandp' is not defined.
Additionally, the bs4 code runs fairly slowly; would it be best to use threads, or is there another way to speed things up?
This looks a bit over-complicated in my opinion, but to focus on your question: you are not returning a variable, you are returning a tuple of values.
def live_indices():
    all_indices_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
    sandp, dow, nasdaq, ftse100, nifty50, cac40, nikkei, dax, moex, shanghai = [
        all_indices_values[i] for i in range(10)]  # 10 stock tickers in total
    return sandp, dow, nasdaq, ftse100, nifty50, cac40, nikkei, dax, moex, shanghai

live_indices()  # -> (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
What you more likely want is the assignment below, and it needs neither the list comprehension nor range(); simply slice your list:
def live_indices():
    all_indices_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
    return all_indices_values

def display_value(x):
    print(x)

sandp, dow, nasdaq, ftse100, nifty50, cac40, nikkei, dax, moex, shanghai = live_indices()[:10]
display_value(sandp)  # -> 1
You may want to work with more structured data, so return a dict:
Example
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'}

def live_indices():
    all_indices_values = {}
    symbols = ['GSPC', 'DJI', 'IXIC', 'FTSE', 'NSEI', 'FCHI', 'N225', 'GDAXI', 'IMOEX.ME', '000001.SS']
    for ticker in symbols:
        url = f'https://uk.finance.yahoo.com/lookup/all?s={ticker}'
        tmp_res = requests.get(url, headers=headers)
        tmp_res.raise_for_status()
        soup = BeautifulSoup(tmp_res.text, 'html.parser')
        indices_price_value = soup.select('#Main tbody>tr td')[2].text
        all_indices_values[ticker] = indices_price_value
    return all_indices_values

def display_value(live_indices):
    for ticker in live_indices.items():
        print(ticker)

display_value(live_indices())
Output
('GSPC', '3,873.33')
('DJI', '30,822.42')
('IXIC', '11,448.40')
('FTSE', '7,236.68')
('NSEI', '17,530.85')
('FCHI', '6,077.30')
('N225', '27,567.65')
('GDAXI', '12,741.26')
('IMOEX.ME', '2,222.51')
('000001.SS', '3,126.40')
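As for the speed question from the original post: the requests are I/O-bound, so a thread pool is a reasonable fit. A minimal sketch using concurrent.futures, with the per-ticker fetch factored into a fetch_one() helper (a name invented here) that does the same lookup-page scrape as above:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'}

def fetch_one(ticker):
    url = f'https://uk.finance.yahoo.com/lookup/all?s={ticker}'
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, 'html.parser')
    return ticker, soup.select('#Main tbody>tr td')[2].text

symbols = ['GSPC', 'DJI', 'IXIC', 'FTSE', 'NSEI', 'FCHI', 'N225', 'GDAXI', 'IMOEX.ME', '000001.SS']

# executor.map runs fetch_one concurrently; results come back in input order
with ThreadPoolExecutor(max_workers=5) as executor:
    all_indices_values = dict(executor.map(fetch_one, symbols))
print(all_indices_values)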

Trying to Capture the Table from Multiple Pages With For Loops

Good day, everyone.
I'm trying to get the table on each page from the links appended to 'player_page.'
I want the stats per game for each player in that season, and the table I want is listed on the players' individual page. Each link appended is correct, but I'm having trouble capturing the correct info when running my loops.
Any idea what I'm doing wrong here?
Any help is appreciated.
from bs4 import BeautifulSoup
import requests
import pandas as pd
from numpy import sin

url = 'https://www.pro-football-reference.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
}
year = 2018

r = requests.get(url + '/years/' + str(year) + '/fantasy.htm')
soup = BeautifulSoup(r.content, 'lxml')
player_list = soup.find_all('td', attrs={'class': 'left', 'data-stat': 'player'})

player_page = []
for player in player_list:
    for link in player.find_all('a', href=True):
        # names = str(link['href'])strip('')
        link = str(link['href'].strip('.htm'))
        player_page.append(url + link + '/gamelog' + '/' + str(year))

for page in player_page:
    dfs = pd.read_html(page)
    yearly_stats = []
    for df in dfs:
        yearly_stats.append(df)
    final_stats = pd.concat(yearly_stats)
    final_stats.to_excel('Fantasy2018.xlsx')
This works. The table columns change according to the player's position, I believe. Not everyone has tackle information, for example.
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://www.pro-football-reference.com'
year = 2018

r = requests.get(url + '/years/' + str(year) + '/fantasy.htm')
soup = BeautifulSoup(r.content, 'lxml')
player_list = soup.find_all('td', attrs={'class': 'left', 'data-stat': 'player'})

dfs = []
for player in player_list:
    for link in player.find_all('a', href=True):
        name = link.getText()
        link = str(link['href'].strip('.htm'))
        try:
            df = pd.read_html(url + link + '/gamelog' + '/' + str(year))[0]
            for i, columns_old in enumerate(df.columns.levels):
                columns_new = np.where(columns_old.str.contains('Unnamed'), '', columns_old)
                df.rename(columns=dict(zip(columns_old, columns_new)), level=i, inplace=True)
            df.columns = df.columns.map('|'.join).str.strip('|')
            df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
            df = df.dropna(subset=['Date'])
            df.insert(0, 'Name', name)
            df.insert(1, 'Moment', 'Regular Season')
            dfs.append(df)
        except:
            pass
        try:
            df1 = pd.read_html(url + link + '/gamelog' + '/' + str(year))[1]
            for i, columns_old in enumerate(df1.columns.levels):
                columns_new = np.where(columns_old.str.contains('Unnamed'), '', columns_old)
                df1.rename(columns=dict(zip(columns_old, columns_new)), level=i, inplace=True)
            df1.columns = df1.columns.map('|'.join).str.strip('|')
            df1['Date'] = pd.to_datetime(df1['Date'], errors='coerce')
            df1 = df1.dropna(subset=['Date'])
            df1.insert(0, 'Name', name)
            df1.insert(1, 'Moment', 'Playoffs')
            dfs.append(df1)
        except:
            pass

dfall = pd.concat(dfs)
dfall.to_excel('Fantasy2018.xlsx')
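One caveat worth flagging in both snippets above: str.strip('.htm') removes any run of the characters '.', 'h', 't' and 'm' from both ends of the string, not the literal suffix, so an href ending in one of those letters would lose extra characters. On Python 3.9+, removesuffix() is the safe replacement; a small illustration:

href = '/players/S/SmitJo00.htm'

# strip() removes *characters*, so it only works here by luck (the path ends in digits)
print(href.strip('.htm'))         # /players/S/SmitJo00

# removesuffix() (Python 3.9+) removes exactly the given ending, or nothing at all
print(href.removesuffix('.htm'))  # /players/S/SmitJo00

# Equivalent fallback for older Python versions
if href.endswith('.htm'):
    href = href[:-len('.htm')]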

Loop pages and save contents in Excel file from website in Python

I'm trying to loop pages from this link and extract the interesting part.
The parts I need on each search result are the company tag, the announcement title and link, the date, and the summary text (marked with a red circle in the original screenshot; see the XPath list below).
Here's what I've tried:
import requests
from bs4 import BeautifulSoup

url = 'http://so.eastmoney.com/Ann/s?keyword=购买物业&pageindex={}'
for page in range(10):
    r = requests.get(url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    print(soup)
XPath for each element (might be helpful for those who don't read Chinese):
/html/body/div[3]/div/div[2]/div[2]/div[3]/h3/span --> 【润华物业】
/html/body/div[3]/div/div[2]/div[2]/div[3]/h3/a --> 润华物业:关于公司购买理财产品的公告
/html/body/div[3]/div/div[2]/div[2]/div[3]/p/label --> 2017-04-24
/html/body/div[3]/div/div[2]/div[2]/div[3]/p/span --> 公告编号:2017-019 证券代码:836007 证券简称:润华物业 主办券商:国联证券
/html/body/div[3]/div/div[2]/div[2]/div[3]/a --> http://data.eastmoney.com/notices/detail/836007/AN201704250530124271,JWU2JWI2JWE2JWU1JThkJThlJWU3JTg5JWE5JWU0JWI4JTlh.html
I need to save the output to an Excel file. How could I do that in Python? Many thanks.
BeautifulSoup won't see this stuff, as it's rendered dynamically by JS, but there's an API endpoint you can query to get what you're after.
Here's how:
import requests
import pandas as pd

def clean_up(text: str) -> str:
    return text.replace('</em>', '').replace(':<em>', '').replace('<em>', '')

def get_data(page_number: int) -> dict:
    url = f"http://searchapi.eastmoney.com/business/Web/GetSearchList?type=401&pageindex={page_number}&pagesize=10&keyword=购买物业&name=normal"
    headers = {
        "Referer": f"http://so.eastmoney.com/Ann/s?keyword=%E8%B4%AD%E4%B9%B0%E7%89%A9%E4%B8%9A&pageindex={page_number}",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    }
    return requests.get(url, headers=headers).json()

def parse_response(response: dict):
    # A generator: yields one parsed row per announcement
    for item in response["Data"]:
        title = clean_up(item['NoticeTitle'])
        date = item['NoticeDate']
        url = item['Url']
        notice_content = clean_up(" ".join(item['NoticeContent'].split()))
        company_name = item['SecurityFullName']
        print(f"{company_name} - {title} - {date}")
        yield [title, url, date, company_name, notice_content]

def save_results(parsed_response: list):
    df = pd.DataFrame(
        parsed_response,
        columns=['title', 'url', 'date', 'company_name', 'content'],
    )
    df.to_excel("test_output.xlsx", index=False)

if __name__ == "__main__":
    output = []
    for page in range(1, 11):
        for parsed_row in parse_response(get_data(page)):
            output.append(parsed_row)
    save_results(output)
This outputs:
栖霞物业购买资产的公告 - 2019-09-03 16:00:00 - 871792
索克物业购买资产的公告 - 2020-08-17 00:00:00 - 832816
中都物业购买股权的公告 - 2019-12-09 16:00:00 - 872955
开元物业:开元物业购买银行理财产品的公告 - 2015-05-21 16:00:00 - 831971
开元物业:开元物业购买银行理财产品的公告 - 2015-04-12 16:00:00 - 831971
盛全物业:拟购买房产的公告 - 2017-10-30 16:00:00 - 834070
润华物业购买资产暨关联交易公告 - 2016-08-23 16:00:00 - 836007
润华物业购买资产暨关联交易公告 - 2017-08-14 16:00:00 - 836007
萃华珠宝:关于拟购买物业并签署购买意向协议的公告 - 2017-07-10 16:00:00 - 002731
赛意信息:关于购买办公物业的公告 - 2020-12-02 00:00:00 - 300687
And saves the results to test_output.xlsx, which can be opened directly in Excel.
PS. I don't read Chinese, so you'd have to look into the response contents yourself to pick out more fields.
Update based on @baduker's solution; however, looping over the pages is still not working:
import requests
import pandas as pd

for page in range(10):
    url = "http://searchapi.eastmoney.com/business/Web/GetSearchList?type=401&pageindex={}&pagesize=10&keyword=购买物业&name=normal"
    headers = {
        "Referer": "http://so.eastmoney.com/Ann/s?keyword=%E8%B4%AD%E4%B9%B0%E7%89%A9%E4%B8%9A&pageindex={}",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    }
    response = requests.get(url, headers=headers).json()
    output_data = []
    for item in response["Data"]:
        title = item['NoticeTitle'].replace('</em>', '').replace(':<em>', '').replace('<em>', '')
        url = item['Url']
        date = item['NoticeDate'].split(' ')[0]
        company_name = item['SecurityFullName']
        content = item['NoticeContent'].replace('</em>', '').replace(':<em>', '').replace('<em>', '')
        output_data.append([title, url, date, company_name, content])

names = ['title', 'url', 'date', 'company_name', 'content']
df = pd.DataFrame(output_data, columns=names)
df.to_excel('test.xlsx', index=False)
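For what it's worth, two things appear to break this loop version: the {} placeholders in url and the Referer are never filled in, so every request fetches the same page, and output_data is re-created inside the loop, discarding earlier pages. A minimal sketch of the fix, keeping the range(1, 11) page indexing from the answer above:

import requests
import pandas as pd

output_data = []  # created once, outside the loop, so pages accumulate
for page in range(1, 11):
    url = ("http://searchapi.eastmoney.com/business/Web/GetSearchList"
           f"?type=401&pageindex={page}&pagesize=10&keyword=购买物业&name=normal")
    headers = {
        "Referer": f"http://so.eastmoney.com/Ann/s?keyword=%E8%B4%AD%E4%B9%B0%E7%89%A9%E4%B8%9A&pageindex={page}",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    }
    response = requests.get(url, headers=headers).json()
    for item in response["Data"]:
        title = item['NoticeTitle'].replace('</em>', '').replace(':<em>', '').replace('<em>', '')
        content = item['NoticeContent'].replace('</em>', '').replace(':<em>', '').replace('<em>', '')
        output_data.append([title, item['Url'], item['NoticeDate'].split(' ')[0],
                            item['SecurityFullName'], content])

df = pd.DataFrame(output_data, columns=['title', 'url', 'date', 'company_name', 'content'])
df.to_excel('test.xlsx', index=False)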

How do I fix the AttributeError: 'NoneType' object has no attribute 'text'...when looping

I am a beginner, and answers on this forum have been invaluable. I am using Python 3 and Beautiful Soup to scrape (non-table) data from multiple web pages on the same website by looping the page number. It works, but I keep getting AttributeError: 'NoneType' object has no attribute 'text' after the first iteration.
Here is the code I have tried thus far:
import requests
from bs4 import BeautifulSoup
import csv
import lxml

# Lists to store the scraped data in
addresses = []
geographies = []
rents = []
units = []
availabilities = []

# Scraping all pages
pages_url = requests.get('https://www.rent.com/new-york/tuckahoe-apartments')
pages_soup = BeautifulSoup(pages_url.text, 'html.parser')
list_nums = pages_soup.find('div', class_='_1y05u').text
print(list_nums)

pages = [str(i) for i in range(1, 8)]
for page in pages:
    response = requests.get('https://www.rent.com/new-york/tuckahoe-apartments?page=' + page).text
    html_soup = BeautifulSoup(response, 'lxml')
    # Extract data from individual listing containers
    listing_containers = html_soup.find_all('div', class_='_3PdAH')
    print(type(listing_containers))
    print(len(listing_containers))
    for container in listing_containers:
        address = container.a.text
        addresses.append(address)
        geography = container.find('div', class_='_1dhrl').text
        geographies.append(geography)
        rent = container.find('div', class_='_3e12V').text
        rents.append(rent)
        unit = container.find('div', class_='_2tApa').text
        units.append(unit)
        availability = container.find('div', class_='_2P6xE').text
        availabilities.append(availability)

import pandas as pd
test_df = pd.DataFrame({'Street': addresses,
                        'City-State-Zip': geographies,
                        'Rent': rents,
                        'BR/BA': units,
                        'Units Available': availabilities
                        })
print(test_df)
Here is the output:
240 Properties
<class 'bs4.element.ResultSet'>
30
Street City-State-Zip Rent BR/BA Units Available
0 Quarry Place at Tuckahoe 64 Midland PlaceTuckahoe, NY 10707 $2,490+ 1–2 Beds • 1–2 Baths 2 Units Available
Traceback (most recent call last):
File "renttucktabletest.py", line 60, in <module>
availability = container.find('div', class_='_2P6xE').text
AttributeError: 'NoneType' object has no attribute 'text'
The result I am looking for is all 240 listings in the pandas dataframe exactly like the first iteration shown in the output above. Can anyone help to fix this error? Would be much appreciated. Thank you!
As pointed out, the issue is that some of the containers are missing certain div elements, e.g. no 'unit' or 'availability' information.
One way to deal with this is an if/else per field: append the text only if the element exists, else append a NaN value. Something like:
import requests
import numpy as np
from bs4 import BeautifulSoup
import csv
import lxml

# Lists to store the scraped data in
addresses = []
geographies = []
rents = []
units = []
availabilities = []

# Scraping all pages
pages_url = requests.get('https://www.rent.com/new-york/tuckahoe-apartments')
pages_soup = BeautifulSoup(pages_url.text, 'html.parser')
list_nums = pages_soup.find('div', class_='_1y05u').text
print(list_nums)

pages = [str(i) for i in range(1, 8)]
for page in pages:
    response = requests.get('https://www.rent.com/new-york/tuckahoe-apartments?page=' + page).text
    html_soup = BeautifulSoup(response, 'lxml')
    # Extract data from individual listing containers
    listing_containers = html_soup.find_all('div', class_='_3PdAH')
    print(type(listing_containers))
    print(len(listing_containers))
    for container in listing_containers:
        address = container.a
        if address:
            addresses.append(address.text)
        else:
            addresses.append(np.nan)
        geography = container.find('div', class_='_1dhrl')
        if geography:
            geographies.append(geography.text)
        else:
            geographies.append(np.nan)
        rent = container.find('div', class_='_3e12V')
        if rent:
            rents.append(rent.text)
        else:
            rents.append(np.nan)
        unit = container.find('div', class_='_2tApa')
        if unit:
            units.append(unit.text)
        else:
            units.append(np.nan)
        availability = container.find('div', class_='_2P6xE')
        if availability:
            availabilities.append(availability.text)
        else:
            availabilities.append(np.nan)

import pandas as pd
test_df = pd.DataFrame({'Street': addresses,
                        'City-State-Zip': geographies,
                        'Rent': rents,
                        'BR/BA': units,
                        'Units Available': availabilities
                        })
print(test_df)
Street City-State-Zip Rent \
0 Quarry Place at Tuckahoe 64 Midland PlaceTuckahoe, NY 10707 $2,490+
1 address not disclosed Tuckahoe, NY 10707 $2,510
2 address not disclosed Tuckahoe, NY 10707 $4,145
3 60 Washington St 1 60 Washington StTuckahoe, NY 10707 $3,500
4 269 Columbus Ave 5 269 Columbus AveTuckahoe, NY 10707 $2,700
BR/BA Units Available
0 1–2 Beds • 1–2 Baths 2 Units Available
1 1 Bed • 1 Bath NaN
2 2 Beds • 2 Bath NaN
3 3 Beds • 2 Bath NaN
4 2 Beds • 1 Bath NaN
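The five if/else blocks follow an identical pattern, so they can be collapsed into one small helper; this is just a stylistic sketch of the same NaN-filling idea (text_or_nan is a name invented here), replacing the inner loop body above:

import numpy as np

def text_or_nan(tag):
    # Return the element's text if it was found, NaN otherwise
    return tag.text if tag is not None else np.nan

for container in listing_containers:
    addresses.append(text_or_nan(container.a))
    geographies.append(text_or_nan(container.find('div', class_='_1dhrl')))
    rents.append(text_or_nan(container.find('div', class_='_3e12V')))
    units.append(text_or_nan(container.find('div', class_='_2tApa')))
    availabilities.append(text_or_nan(container.find('div', class_='_2P6xE')))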
If you pull the info from a script tag and treat it as JSON, that problem goes away: the JSON simply holds None or 0 where a missing class name would have raised an error.
import requests
import json
from bs4 import BeautifulSoup as bs
import re
import pandas as pd

def add_records(url, s):
    res = s.get(url)  # use the shared session passed in
    soup = bs(res.content, 'lxml')
    r = re.compile(r'window.__APPLICATION_CONTEXT__ = (.*)')
    data = soup.find('script', text=r).text
    script = r.findall(data)[0]
    items = json.loads(script)['store']['listings']['listings']
    for item in items:
        street = item['address']
        geography = ', '.join([item['city'], item['state'], item['zipCode']])
        rent = item['aggregates']['prices']['low']
        BR_BA = 'beds: ' + str(item['aggregates']['beds']['low']) + ' , ' + 'baths: ' + str(item['aggregates']['baths']['low'])
        units = item['aggregates']['totalAvailable']
        listingId = item['listingId']
        url = base_url + item['listingSeoPath']
        # all_info = item
        record = {'Street': street,
                  'Geography': geography,
                  'Rent': rent,
                  'BR/BA': BR_BA,
                  'Units Available': units,
                  'ListingId': listingId,
                  'Url': url}
        results.append(record)

url = 'https://www.rent.com/new-york/tuckahoe-apartments?page={}'
base_url = 'https://www.rent.com/'
results = []

with requests.Session() as s:
    for page in range(1, 9):
        add_records(url.format(page), s)

df = pd.DataFrame(results, columns=['Street', 'Geography', 'Rent', 'BR/BA', 'Units Available', 'ListingId', 'Url'])
print(df)
Here is another approach to achieve the same.
import pandas
import requests
from bs4 import BeautifulSoup

urls = ['https://www.rent.com/new-york/tuckahoe-apartments?page={}'.format(page) for page in range(1, 9)]

def get_content(links):
    for url in links:
        res = requests.get(url)
        soup = BeautifulSoup(res.text, 'lxml')
        for items in soup.select("._3PdAH"):
            d = {}
            d['address'] = items.select_one("[data-tid='property-title']").text
            try:
                d['geographies'] = items.select_one("[data-tid='listing-info-address']").text
            except AttributeError:
                d['geographies'] = ""
            try:
                d['rent'] = items.select_one("[data-tid='price']").text
            except AttributeError:
                d['rent'] = ""
            try:
                d['units'] = items.select_one("[data-tid='beds-baths']").text
            except AttributeError:
                d['units'] = ""
            try:
                d['availabilities'] = items.select_one("[data-tid='property-unitAvailText']").text
            except AttributeError:
                d['availabilities'] = ""
            dataframe.append(d)
    return dataframe

if __name__ == '__main__':
    dataframe = []
    item = get_content(urls)
    df = pandas.DataFrame(item)
    df.to_csv("output.csv", index=False)

Extracting data from web page to CSV file, only last row saved

I'm faced with the following challenge: I want to get all the financial data about companies, and I wrote code that does it. Let's say the result for one company looks like this:
Unnamed: 0 I Q 2017 II Q 2017 \
0 Przychody netto ze sprzedaży (tys. zł) 137 134
1 Zysk (strata) z działal. oper. (tys. zł) -423 -358
2 Zysk (strata) brutto (tys. zł) -501 -280
3 Zysk (strata) netto (tys. zł)* -399 -263
4 Amortyzacja (tys. zł) 134 110
5 EBITDA (tys. zł) -289 -248
6 Aktywa (tys. zł) 27 845 26 530
7 Kapitał własny (tys. zł)* 22 852 22 589
8 Liczba akcji (tys. szt.) 13 921,975 13 921,975
9 Zysk na akcję (zł) -0029 -0019
10 Wartość księgowa na akcję (zł) 1641 1623
11 Raport zbadany przez audytora N N
...but there are 464 such tables in total.
Unfortunately, when I save the results to one CSV file, only the last one is written, not all 464. Could you help me save them all? Below is my code.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.bankier.pl/gielda/notowania/akcje'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')

# Find the first table on the page
t = soup.find_all('table')[0]

# Read the table into a Pandas DataFrame
df = pd.read_html(str(t))[0]

# Get the company names
names_of_company = df["Walor AD"].values

# Build all links from the company names
links = []
for i in range(len(names_of_company)):
    new_string = 'https://www.bankier.pl/gielda/notowania/akcje/' + names_of_company[i] + '/wyniki-finansowe'
    links.append(new_string)

for i in links:
    url2 = f'https://www.bankier.pl/gielda/notowania/akcje/{names_of_company[0]}/wyniki-finansowe'
    page2 = requests.get(url2)
    soup = BeautifulSoup(page2.content, 'lxml')
    # Find the first table on the page
    t2 = soup.find_all('table')[0]
    df2 = pd.read_html(str(t2))[0]
    df2.to_csv('output.csv', index=False, header=None)
You've almost got it. You're just overwriting your CSV each time. Replace
df2.to_csv('output.csv', index=False, header=None)
with
with open('output.csv', 'a') as f:
    df2.to_csv(f, header=False)
in order to append to the CSV instead of overwriting it.
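With a reasonably recent pandas you can also skip the explicit file handle and let to_csv append directly; this is a stylistic alternative, not a different fix:

# mode='a' opens the file in append mode; header=False avoids
# repeating the column names for every appended chunk
df2.to_csv('output.csv', mode='a', header=False, index=False)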
Also, your example doesn't work, because this:
for i in links:
    url2 = f'https://www.bankier.pl/gielda/notowania/akcje/{names_of_company[0]}/wyniki-finansowe'
should be:
for i in links:
    url2 = i
When the website has no data, skip it and move on to the next one:
try:
    t2 = soup.find_all('table')[0]
    df2 = pd.read_html(str(t2))[0]
    with open('output.csv', 'a') as f:
        df2.to_csv(f, header=False)
except:
    pass
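A sketch of an alternative under the same assumptions (the links and names_of_company built above): collect the per-company frames in a list and write once at the end, which sidesteps the append-mode bookkeeping and lets you tag each row with its company:

all_frames = []
for link, company in zip(links, names_of_company):
    page2 = requests.get(link)
    soup = BeautifulSoup(page2.content, 'lxml')
    try:
        df2 = pd.read_html(str(soup.find_all('table')[0]))[0]
    except (IndexError, ValueError):
        continue  # page has no table; skip this company
    df2.insert(0, 'Company', company)  # remember which company the rows belong to
    all_frames.append(df2)

# One concat and one write instead of hundreds of appends
pd.concat(all_frames, ignore_index=True).to_csv('output.csv', index=False)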
