I have a regex pattern that matches dates like 15 July - 3 September 2022:
[\d]{1,2} [ADFJMNOS]\w* [\-] [\d]{1,2} [ADFJMNOS]\w* [\d]{4}
My questions are:
What should the regex pattern be if the date is not on a single line? For example:
15 July - 3 September
2022
22
July
Desired Output
15 July - 3 September 2022
22 July
Also, what if the date appears as part of another word? For example:
delayed15 July – 3 September 2022
Here the date is attached to the word "delayed"; the word could be anything.
Desired Output
15 July – 3 September 2022
The code I am trying:
from selenium import webdriver
from bs4 import BeautifulSoup
import re

url_list = ['https://alabamasymphony.org/event/bachmozart']
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3"
}  # note: Selenium does not take request headers, so this dict is currently unused
driver = webdriver.Chrome('/home/ubuntu/selenium_drivers/chromedriver')
format_list = [r"[\d]{1,2} [ADFJMNOS]\w* [\-] [\d]{1,2} [ADFJMNOS]\w* [\d]{4}"]

for URL in url_list:
    date = []
    driver.get(URL)
    driver.implicitly_wait(2)
    data = driver.page_source
    cleantext = BeautifulSoup(data, "lxml").text
    cleanr = re.compile('<.*?>')
    x = re.sub(cleanr, ' ', cleantext)
    print(URL)
    for pattern in format_list:
        all_dates = re.findall(pattern, x)
        if all_dates == []:
            continue
        else:
            date.append(all_dates)
    for s in date:
        print(s)
You can try this regex:
(\d{1,2})\s*([A-Za-z]+)[ \t]*(?:(-\s*\d{1,2}\s*[A-Za-z]+)\s+(\d{4}))?
The idea is to break the date into four groups and print them as needed.
Source code:
import re

regex = r"(\d{1,2})\s*([A-Za-z]+)[ \t]*(?:(-\s*\d{1,2}\s*[A-Za-z]+)\s+(\d{4}))?"

test_str = ("15 July - 3 September 2022 as\n\n\n"
            "15 July - 3 September \n"
            "2022\n"
            "22\n"
            "July\n\n\n\n"
            " Also if it is seen as a part of another word\n\n"
            "example\n\n"
            "delayed15 July - 3 September 2022\n\n"
            "Here it is attached with the word \"delayed\". The word can be anything.\n\n\n")

matches = re.finditer(regex, test_str, re.MULTILINE)

for match in matches:
    if match.group(3) is not None:
        print(match.group(1) + " " + match.group(2) + match.group(3) + " " + match.group(4))
    else:
        print(match.group(1) + " " + match.group(2))
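One detail worth noting: the second example in the question uses an en dash (–), not an ASCII hyphen, and the plain "-" inside the third group will not match it. A minimal tweak (a sketch that just widens the separator into a character class) covers both:

import re

# Accept either "-" or "–" between the two day/month parts
regex = r"(\d{1,2})\s*([A-Za-z]+)[ \t]*(?:([-–]\s*\d{1,2}\s*[A-Za-z]+)\s+(\d{4}))?"

print(re.findall(regex, "delayed15 July – 3 September 2022"))
# [('15', 'July', '– 3 September', '2022')]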
Related
I'm trying to loop through the pages of this link and extract the interesting parts of each announcement (the XPaths below show exactly which elements I mean).
Here's what I've tried:
url = 'http://so.eastmoney.com/Ann/s?keyword=购买物业&pageindex={}'

for page in range(10):
    r = requests.get(url.format(page))
    soup = BeautifulSoup(r.content, "html.parser")
    print(soup)
XPath for each element (might be helpful for those who don't read Chinese):
/html/body/div[3]/div/div[2]/div[2]/div[3]/h3/span --> 【润华物业】
/html/body/div[3]/div/div[2]/div[2]/div[3]/h3/a --> 润华物业:关于公司购买理财产品的公告
/html/body/div[3]/div/div[2]/div[2]/div[3]/p/label --> 2017-04-24
/html/body/div[3]/div/div[2]/div[2]/div[3]/p/span --> 公告编号:2017-019 证券代码:836007 证券简称:润华物业 主办券商:国联证券
/html/body/div[3]/div/div[2]/div[2]/div[3]/a --> http://data.eastmoney.com/notices/detail/836007/AN201704250530124271,JWU2JWI2JWE2JWU1JThkJThlJWU3JTg5JWE5JWU0JWI4JTlh.html
I need to save the output to an Excel file. How could I do that in Python? Many thanks.
BeautifulSoup won't see this stuff, as it's rendered dynamically by JS, but there's an API endpoint you can query to get what you're after.
Here's how:
import requests
import pandas as pd


def clean_up(text: str) -> str:
    # Strip the <em> highlight markers the search API injects around the keyword
    return text.replace('</em>', '').replace(':<em>', '').replace('<em>', '')


def get_data(page_number: int) -> dict:
    url = f"http://searchapi.eastmoney.com/business/Web/GetSearchList?type=401&pageindex={page_number}&pagesize=10&keyword=购买物业&name=normal"
    headers = {
        "Referer": f"http://so.eastmoney.com/Ann/s?keyword=%E8%B4%AD%E4%B9%B0%E7%89%A9%E4%B8%9A&pageindex={page_number}",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    }
    return requests.get(url, headers=headers).json()


def parse_response(response: dict):
    # Generator: yields one parsed row per announcement
    for item in response["Data"]:
        title = clean_up(item['NoticeTitle'])
        date = item['NoticeDate']
        url = item['Url']
        notice_content = clean_up(" ".join(item['NoticeContent'].split()))
        company_name = item['SecurityFullName']
        print(f"{company_name} - {title} - {date}")
        yield [title, url, date, company_name, notice_content]


def save_results(parsed_response: list):
    df = pd.DataFrame(
        parsed_response,
        columns=['title', 'url', 'date', 'company_name', 'content'],
    )
    df.to_excel("test_output.xlsx", index=False)


if __name__ == "__main__":
    output = []
    for page in range(1, 11):
        for parsed_row in parse_response(get_data(page)):
            output.append(parsed_row)
    save_results(output)
This outputs:
栖霞物业购买资产的公告 - 2019-09-03 16:00:00 - 871792
索克物业购买资产的公告 - 2020-08-17 00:00:00 - 832816
中都物业购买股权的公告 - 2019-12-09 16:00:00 - 872955
开元物业:开元物业购买银行理财产品的公告 - 2015-05-21 16:00:00 - 831971
开元物业:开元物业购买银行理财产品的公告 - 2015-04-12 16:00:00 - 831971
盛全物业:拟购买房产的公告 - 2017-10-30 16:00:00 - 834070
润华物业购买资产暨关联交易公告 - 2016-08-23 16:00:00 - 836007
润华物业购买资产暨关联交易公告 - 2017-08-14 16:00:00 - 836007
萃华珠宝:关于拟购买物业并签署购买意向协议的公告 - 2017-07-10 16:00:00 - 002731
赛意信息:关于购买办公物业的公告 - 2020-12-02 00:00:00 - 300687
And saves the results to an .xlsx file that can be opened directly in Excel.
PS. I don't read Chinese, so you'd have to look into the response contents and pick out more fields yourself.
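If you only want the calendar date in the spreadsheet, a small post-processing step takes care of it (a sketch, assuming the timestamps parse as in the output above):

import pandas as pd

df = pd.read_excel("test_output.xlsx")
# Keep only the calendar date from timestamps like "2019-09-03 16:00:00"
df["date"] = pd.to_datetime(df["date"]).dt.date
df.to_excel("test_output.xlsx", index=False)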
Updated attempt based on @baduker's solution, but looping over the pages isn't working (see the sketch after the code).
import requests
import pandas as pd

for page in range(10):
    url = "http://searchapi.eastmoney.com/business/Web/GetSearchList?type=401&pageindex={}&pagesize=10&keyword=购买物业&name=normal"
    headers = {
        "Referer": "http://so.eastmoney.com/Ann/s?keyword=%E8%B4%AD%E4%B9%B0%E7%89%A9%E4%B8%9A&pageindex={}",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    }
    response = requests.get(url, headers=headers).json()
    output_data = []
    for item in response["Data"]:
        # print(item)
        # print('*' * 40)
        title = item['NoticeTitle'].replace('</em>', '').replace(':<em>', '').replace('<em>', '')
        url = item['Url']
        date = item['NoticeDate'].split(' ')[0]
        company_name = item['SecurityFullName']
        content = item['NoticeContent'].replace('</em>', '').replace(':<em>', '').replace('<em>', '')
        # url_code = item['Url'].split('/')[5]
        output_data.append([title, url, date, company_name, content])
    names = ['title', 'url', 'date', 'company_name', 'content']
    df = pd.DataFrame(output_data, columns=names)
    df.to_excel('test.xlsx', index=False)
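Two bugs explain why the pages don't loop: url and the Referer header keep their literal {} placeholder because .format(page) is never called, so every request fetches the same page, and output_data plus the final to_excel sit inside the page loop, so each page overwrites the previous file. A minimal sketch of the likely fix, reusing the endpoint and headers from the answer above:

import requests
import pandas as pd


def strip_em(text: str) -> str:
    # Remove the <em> highlight markers the API injects around the keyword
    return text.replace('</em>', '').replace(':<em>', '').replace('<em>', '')


output_data = []  # accumulate rows across all pages
for page in range(1, 11):  # the answer above starts pageindex at 1
    url = f"http://searchapi.eastmoney.com/business/Web/GetSearchList?type=401&pageindex={page}&pagesize=10&keyword=购买物业&name=normal"
    headers = {
        "Referer": f"http://so.eastmoney.com/Ann/s?keyword=%E8%B4%AD%E4%B9%B0%E7%89%A9%E4%B8%9A&pageindex={page}",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:83.0) Gecko/20100101 Firefox/83.0",
    }
    response = requests.get(url, headers=headers).json()
    for item in response["Data"]:
        output_data.append([
            strip_em(item['NoticeTitle']),
            item['Url'],
            item['NoticeDate'].split(' ')[0],
            item['SecurityFullName'],
            strip_em(item['NoticeContent']),
        ])

# Write once, after all pages have been collected
df = pd.DataFrame(output_data, columns=['title', 'url', 'date', 'company_name', 'content'])
df.to_excel('test.xlsx', index=False)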
I am trying to obtain the elements that appear below the photos on the following site, and on other, equivalent pages:
https://www.nosetime.com/xiangshui/947895-oulong-xuecheng-atelier-cologne-orange.html
https://www.nosetime.com/xiangshui/705357-pomelo-paradis.html
https://www.nosetime.com/xiangshui/592260-cl-mentine-california.html
https://www.nosetime.com/xiangshui/612353-oulong-atelier-cologne-trefle.html
https://www.nosetime.com/xiangshui/911317-oulong-nimingmeigui-atelier-cologne.html
But I can't get them from the page source; they appear to be loaded dynamically by JavaScript. In fact, the data seems to arrive in an XHR response. So how can I get an XHR document that is downloaded when I visit a page?
I tried:
url = "https://www.nosetime.com/xiangshui/350870-oulong-atelier-cologne-oolang-infini.html"
r = requests.post(url, headers=headers)
data = r.json()
print(data)
But it returns:
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-8-e72156ddb336> in <module>()
2
3 r = requests.post(url, headers=headers)
----> 4 data = r.json()
5
6 print(data)
3 frames
/usr/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
355 obj, end = self.scan_once(s, idx)
356 except StopIteration as err:
--> 357 raise JSONDecodeError("Expecting value", s, err.value) from None
358 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Just add the right headers and there you have the data.
import requests

headers = {
    "referer": "https://www.nosetime.com/xiangshui/350870-oulong-atelier-cologne-oolang-infini.html",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
}

response = requests.get("https://www.nosetime.com/app/item.php?id=350870", headers=headers).json()
print(response["id"], response["isscore"], response["brandid"])
For some reason I can't paste the entire JSON output here, as SO flags it as spam, but this request should get you the full JSON response.
This prints:
350870 8.6 10091761
EDIT:
If you have more products, you can simply loop over the product URLs and extract what you need from the JSON. For example:
import requests

product_urls = [
    "https://www.nosetime.com/xiangshui/947895-oulong-xuecheng-atelier-cologne-orange.html",
    "https://www.nosetime.com/xiangshui/705357-pomelo-paradis.html",
    "https://www.nosetime.com/xiangshui/592260-cl-mentine-california.html",
    "https://www.nosetime.com/xiangshui/612353-oulong-atelier-cologne-trefle.html",
    "https://www.nosetime.com/xiangshui/911317-oulong-nimingmeigui-atelier-cologne.html",
]

for product_url in product_urls:
    headers = {
        "referer": product_url,
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
    }
    product_id = product_url.split("/")[-1].split("-")[0]
    response = requests.get(
        f"https://www.nosetime.com/app/item.php?id={product_id}",
        headers=headers,
    ).json()
    print(f"Product name: {response['enname']} | Rating: {response['isscore']}")
Output:
Product name: Atelier Cologne Orange Sanguine, 2010 | Rating: 8.9
Product name: Atelier Cologne Pomelo Paradis, 2015 | Rating: 8.8
Product name: Atelier Cologne Clémentine California, 2016 | Rating: 8.6
Product name: Atelier Cologne Trefle Pur, 2010 | Rating: 8.6
Product name: Atelier Cologne Rose Anonyme, 2012 | Rating: 7.7
I am working on web scraping. I take names line by line from a text file, search each one on Google, and scrape the address from the results. I want to write each result next to its respective name. This is my text file a.txt:
0.5BN FINHEALTH PRIVATE LIMITED
01 SYNERGY CO.
1 BY 0 SOLUTIONS
and this is my code:
import requests
from bs4 import BeautifulSoup

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"

out_fl = open('a.txt', 'r')
for line in out_fl:
    query = line
    query = query.replace(' ', '+')
    print(line)
    URL = f"https://google.com/search?q={query}"
    print(URL)
    headers = {"user-agent": USER_AGENT}
    resp = requests.get(URL, headers=headers)
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")
        results = []
        newline = '\n'
        for g in soup.find_all('span', class_="i4J0ge"):
            x = f'{line}:{g.text}{newline}'
            results.append(x)
        print(results)
        with open("results.txt", "a") as result:
            result.write(str(results))
I am getting results like this, but they're not formatted properly; please help me out.
My expected result is like:
0.5BN FINHEALTH PRIVATE LIMITED : Address: 2nd Floor, BHIVE Forum, GNS Towers #18, Dairy
Circle Road, Adugodi, Koramangala, Bengaluru, Karnataka 560029Hours: Closed ⋅ Opens 9:30AM
MonSaturdayClosedSundayClosedMonday9:30am–7:30pmTuesday9:30am–7:30pmWednesday9:30am–
7:30pmThursday9:30am–7:30pmFriday9:30am–7:30pmSuggest an editUnable to add this file.
Please check that it is a valid photo
01 SYNERGY CO. : 01 SYNERGY CO.\n:Located in: Punjab Agricultural UniversityAddress: 3rd
Floor Kartar Bhawan, Ferozpur Rd, Ludhiana, Punjab 141001Hours: Closes soon ⋅ 5PM ⋅ Opens
9:30AM MonSaturday10am–5pmSundayClosedMonday9:30am–7:30pmTuesday9:30am–
7:30pmWednesday9:30am–7:30pmThursday9:30am–7:30pmFriday9:30am–7:30pmSuggest an editUnable
to add this file. Please check that it is a valid photo.Phone: 098159 18807
Or into Excel. Thank you.
You can assign your results to a pandas DataFrame and then write it to Excel or CSV:
import pandas as pd

df = pd.DataFrame(results, columns=["result"])  # assign column names as required
df.to_excel('filename.xlsx', sheet_name='sheet name', index=False)
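Since each entry in results is a single string of the form "NAME\n:Address: ...", splitting once on the first colon gives proper name/address columns. This is a sketch, assuming results has been accumulated across all names (in the question's code, results = [] is re-created for each line, so you would first move it above the for loop):

import pandas as pd

# Split each "name:address" string on the first colon only
rows = [r.replace('\n', ' ').strip().split(':', 1) for r in results]
df = pd.DataFrame(rows, columns=['name', 'address'])
df.to_excel('results.xlsx', index=False)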
I am grabbing some financial data off a website: a list of names and a list of numbers. If I print them independently, each comes out fine, but I can't manage to put them together.
import requests
import bs4
import numpy as np


def fundamentals(name, number):
    url = 'http://quotes.money.163.com/f10/dbfx_002230.html?date=2020-03-31#01c08'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'
    }
    response = requests.get(url, headers=headers).content
    soup = bs4.BeautifulSoup(response, 'html5lib')
    DuPond_name = soup.find_all(name='td', attrs={'class': 'dbbg01'})
    DuPond_number = soup.find_all(name='td', attrs={'class': 'dbbg02'})
    for i, names in enumerate(DuPond_name, 1):
        name = names.getText()
        print(name)
    for i, numbers in enumerate(DuPond_number, 1):
        number = numbers.getText()
        print(number)
    return {name: number}


if __name__ == '__main__':
    print(fundamentals(name=[], number=[]))
净资产收益率
总资产收益率
权益乘数
销售净利率
总资产周转率
净利润
营业收入
营业收入
平均资产总额
营业收入
全部成本
投资收益
所得税
其他
营业成本
销售费用
管理费用
财务费用
-1.16%
-0.63%
1/(1-40.26%)
-9.33%
0.07
-131,445,229.01
1,408,820,489.46
1,408,820,489.46
9,751,224,017.79
1,408,820,489.46
1,704,193,442.22
5,971,254
17,965,689
--
776,103,494
274,376,792.25
186,977,519.02
5,173,865.88
{'财务费用': '5,173,865.88'}
Process finished with exit code 0
The final dict is only giving me the very last pair. How can I fix it? Or, even sweeter, how can I put them into DataFrame form? Thank you all for the help!
Can you try this:
import pandas as pd

# Pull the text out of each <td> tag so the frame holds strings, not Tag objects
df = pd.DataFrame([[n.getText() for n in DuPond_name],
                   [v.getText() for v in DuPond_number]]).T
For reference, this is how I tested it:
import pandas as pd
ls1 = ['净资产收益率','总资产收益率','权益乘数','销售净利率','总资产周转率','净利润','营业收入','营业收入','平均资产总额','营业收入','全部成本','投资收益','所得税','其他','营业成本','销售费用','管理费用','财务费用']
ls2 = ['-1.16%','-0.63%','1/(1-40.26%)','-9.33%','0.07','-131,445,229.01','1,408,820,489.46','1,408,820,489.46','9,751,224,017.79','1,408,820,489.46','1,704,193,442.22','5,971,254','17,965,689','--','776,103,494','274,376,792.25','186,977,519.02','5,173,865.88']
df = pd.DataFrame([ls1, ls2]).T
print(df)
This is the output I got:
0 1
0 净资产收益率 -1.16%
1 总资产收益率 -0.63%
2 权益乘数 1/(1-40.26%)
3 销售净利率 -9.33%
4 总资产周转率 0.07
5 净利润 -131,445,229.01
6 营业收入 1,408,820,489.46
7 营业收入 1,408,820,489.46
8 平均资产总额 9,751,224,017.79
9 营业收入 1,408,820,489.46
10 全部成本 1,704,193,442.22
11 投资收益 5,971,254
12 所得税 17,965,689
13 其他 --
14 营业成本 776,103,494
15 销售费用 274,376,792.25
16 管理费用 186,977,519.02
17 财务费用 5,173,865.88
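To fix the original fundamentals function itself: the loop variables name and number are overwritten on every pass, so the final {name: number} keeps only the last pair. Zipping the two tag lists pairs every label with its value (a sketch against the variables in the question's code):

# Pair each label with its value instead of keeping only the last pair
pairs = [(n.getText(), v.getText()) for n, v in zip(DuPond_name, DuPond_number)]
# Note: dict(pairs) would silently drop duplicated labels such as the three
# 营业收入 rows, so keep the list of pairs (or the DataFrame above) instead.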
I want to get the links for the names from all the pages (the site paginates with a "Load More" button), and I need help with the pagination.
I've got the logic to print the links for the names, but I need help with pagination:
for pos in positions:
    url = "https://247sports.com/Season/2021-Football/CompositeRecruitRankings/?InstitutionGroup=HighSchool"
    two = requests.get("https://247sports.com/Season/2021-Football/CompositeRecruitRankings/?InstitutionGroup=HighSchool" + pos, headers=HEADERS)
    bsObj = BeautifulSoup(two.content, 'lxml')
    # extract the link leading to the page containing everything available here
    main_content = urljoin(url, bsObj.select(".data-js")[1]['href'])
    response = requests.get(main_content)
    obj = BeautifulSoup(response.content, 'lxml')
    names = obj.findAll("div", {"class": "recruit"})
    for player_name in names:
        player_name.find('a', {'class': ' rankings-page__name-link'})
        for all_players in player_name.find_all('a', href=True):
            player_urls = site + all_players.get('href')
            # print(player_urls)
I expect output like: https://247sports.com/Player/Jack-Sawyer-46049925/
(links of all player names)
You can just iterate through the request parameters. Since you could keep iterating forever, I have it check for when players start to repeat (essentially, when the next iteration doesn't add new players). It seems to stop after 21 pages, which gives 960 players.
import requests
from bs4 import BeautifulSoup

url = 'https://247sports.com/Season/2021-Football/CompositeRecruitRankings/'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'}

player_links = []
prior_count = 0
for page in range(1, 101):
    #print ('Page: %s' %page)
    payload = {
        'ViewPath': '~/Views/SkyNet/PlayerSportRanking/_SimpleSetForSeason.ascx',
        'InstitutionGroup': 'HighSchool',
        'Page': '%s' % page}
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    recruits = soup.find_all('div', {'class': 'recruit'})
    for recruit in recruits:
        print('https://247sports.com' + recruit.find('a')['href'])
        player_links.append('https://247sports.com' + recruit.find('a')['href'])
    current_count = len(list(set(player_links)))
    if prior_count == current_count:
        print('No more players')
        break
    else:
        prior_count = current_count
Output:
print (player_links)
['https://247sports.com/Player/Korey-Foreman-46056100', 'https://247sports.com/Player/Jack-Sawyer-46049925', 'https://247sports.com/Player/Tommy-Brockermeyer-46040211', 'https://247sports.com/Player/James-Williams-46049981', 'https://247sports.com/Player/Payton-Page-46055295', 'https://247sports.com/Player/Camar-Wheaton-46050152', 'https://247sports.com/Player/Brock-Vandagriff-46050870', 'https://247sports.com/Player/JT-Tuimoloau-46048440', 'https://247sports.com/Player/Emeka-Egbuka-46048438', 'https://247sports.com/Player/Tony-Grimes-46048912', 'https://247sports.com/Player/Sam-Huard-46048437', 'https://247sports.com/Player/Amarius-Mims-46079928', 'https://247sports.com/Player/Savion-Byrd-46078964', 'https://247sports.com/Player/Jake-Garcia-46053996', 'https://247sports.com/Player/Agiye-Hall-46055274', 'https://247sports.com/Player/Caleb-Williams-46040610', 'https://247sports.com/Player/JJ-McCarthy-46042742', 'https://247sports.com/Player/Dylan-Brooks-46079585', 'https://247sports.com/Player/Nolan-Rucci-46058902', 'https://247sports.com/Player/GaQuincy-McKinstry-46052990', 'https://247sports.com/Player/Will-Shipley-46056925', 'https://247sports.com/Player/Maason-Smith-46057128', 'https://247sports.com/Player/Isaiah-Johnson-46050757', 'https://247sports.com/Player/Landon-Jackson-46049327', 'https://247sports.com/Player/Tunmise-Adeleye-46050288', 'https://247sports.com/Player/Terrence-Lewis-46058521', 'https://247sports.com/Player/Lee-Hunter-46058922', 'https://247sports.com/Player/Raesjon-Davis-46056065', 'https://247sports.com/Player/Kyle-McCord-46047962', 'https://247sports.com/Player/Beaux-Collins-46049126', 'https://247sports.com/Player/Landon-Tengwall-46048781', 'https://247sports.com/Player/Smael-Mondon-46058273', 'https://247sports.com/Player/Derrick-Davis-Jr-46049676', 'https://247sports.com/Player/Troy-Franklin-46048840', 'https://247sports.com/Player/Tywone-Malone-46081337', 'https://247sports.com/Player/Micah-Morris-46051663', 'https://247sports.com/Player/Donte-Thornton-46056489', 'https://247sports.com/Player/Bryce-Langston-46050326', 'https://247sports.com/Player/Damon-Payne-46041148', 'https://247sports.com/Player/Rocco-Spindler-46049869', 'https://247sports.com/Player/David-Daniel-46076804', 'https://247sports.com/Player/Branden-Jennings-46049721', 'https://247sports.com/Player/JaTavion-Sanders-46058800', 'https://247sports.com/Player/Chris-Hilton-46055801', 'https://247sports.com/Player/Jason-Marshall-46051367', ... ]
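Note that len(list(set(player_links))) is only used for the repeat check; the list itself keeps duplicates, and set() does not preserve order. If you want the final list deduplicated in first-seen order, a one-liner does it (a small sketch):

# Deduplicate while preserving first-seen insertion order
unique_links = list(dict.fromkeys(player_links))
print(len(unique_links))  # 960, per the page count noted above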