How do I put my scraped data into a data frame - text

Please, I need help: I'm having trouble putting my scraped data into a data frame with 3 columns, i.e. date, source and keywords extracted from each scraped website, for further text analysis. My code is borrowed from https://stackoverflow.com/users/12229253/foreverlearning and is given below:
from newspaper import Article
import nltk
nltk.download('punkt')
urls = ['https://dailypost.ng/2022/02/02/securing-nigeria-duty-of-all-tiers-of-government-oyo-senator-balogun/', 'https://guardian.ng/news/declare-bandits-as-terrorists-senate-tells-buhari/', 'https://www.thisdaylive.com/index.php/2021/10/24/when-will-fg-declare-bandits-as-terrorists/', 'https://punchng.com/rep-wants-buhari-to-name-lawmaker-sponsoring-terrorism/', 'https://punchng.com/national-assembly-plans-to-meet-us-congress-over-875m-weapons-deal-stoppage/']
results = {}
for url in urls:
    article = Article(url)
    article.download()
    article.parse()
    article.nlp()
    results[url] = article

for url in urls:
    print(url)
    article = results[url]
    print(article.authors)
    print(article.publish_date)
    print(article.keywords)

I played around with it and here is how you can make it into a data frame. Assuming that you wanted to use pandas in the first place:
import nltk
import pandas as pd
from newspaper import Article
nltk.download('punkt')
urls = ['https://dailypost.ng/2022/02/02/securing-nigeria-duty-of-all-tiers-of-government-oyo-senator-balogun/', 'https://guardian.ng/news/declare-bandits-as-terrorists-senate-tells-buhari/', 'https://www.thisdaylive.com/index.php/2021/10/24/when-will-fg-declare-bandits-as-terrorists/', 'https://punchng.com/rep-wants-buhari-to-name-lawmaker-sponsoring-terrorism/', 'https://punchng.com/national-assembly-plans-to-meet-us-congress-over-875m-weapons-deal-stoppage/']
# create a data frame with the needed columns
saved_data = pd.DataFrame(columns=['Date', 'Source', 'KeyWords'])
# put into a data frame that has 3 columns i.e. date, source and keywords
def add_data_to_df(urls, saved_data):
    for url in urls:  # process each url separately
        article = Article(url)
        article.download()
        article.parse()
        article.nlp()
        # create a row with the data you need using attributes
        record = {'Date': article.publish_date, 'Source': url, 'KeyWords': article.keywords}
        # append info about each url as a new row
        saved_data = saved_data.append(record, ignore_index=True)
    return saved_data
Now, when you run this function
add_data_to_df(urls, saved_data), you should see a data frame with contents similar to the ones I got below during testing:
Date Source KeyWords
0 2022-02-02 00:00:00 https://dailypost.ng/2022/02/02/securing-niger... [nigeria, securing, oyo, state, senator, prote...
1 2021-09-30 04:25:24+00:00 https://guardian.ng/news/declare-bandits-as-te... [shutdown, terrorists, nigeria, guardian, decl...
2 2021-10-24 00:00:00 https://www.thisdaylive.com/index.php/2021/10/... [terrorists, nigerian, declare, state, militar...
3 2021-10-05 14:41:48+00:00 https://punchng.com/rep-wants-buhari-to-name-l... [president, buhari, house, national, lawmaker,...
4 2021-07-31 00:30:47+00:00 https://punchng.com/national-assembly-plans-to... [plans, congress, deal, nigeria, nigerian, rig...
(Sorry for the format, I am showing the output as plain text since I am not allowed to attach screenshots, but you will have a nice pandas format)
Edit: adding a function to save the data frame to a csv file. Note that this is one of the shortest ways of doing this and it will save the file to the current working directory, i.e., where you are executing your code:
# this function saves given data to csv
def save_to_csv(saved_data):
    saved_data.to_csv('output.csv', index=False, sep=',')

# process the articles and create a csv
save_to_csv(add_data_to_df(urls, saved_data))
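A note if you are on a recent pandas: DataFrame.append was removed in pandas 2.0, so the add_data_to_df function above will raise an AttributeError there. A variant of the same logic that collects the rows in a plain list and builds the frame once at the end would look like this (a sketch, not part of the original answer; build_article_df is my own name):
import pandas as pd
from newspaper import Article

def build_article_df(urls):
    records = []  # one dict per article
    for url in urls:
        article = Article(url)
        article.download()
        article.parse()
        article.nlp()
        # same three columns as above: date, source and keywords
        records.append({'Date': article.publish_date, 'Source': url, 'KeyWords': article.keywords})
    # build the data frame in one step instead of appending row by row
    return pd.DataFrame(records, columns=['Date', 'Source', 'KeyWords'])

saved_data = build_article_df(urls)
saved_data.to_csv('output.csv', index=False)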

Related

How to use HDFStore.select to screen data

Question:
How do I select rows (pseudo-code) where columns['Name'] == 'Name_A' (Name_A is just an example) & columns['time'] is in (2021-11-21 00:00:00, 2021-11-22 00:00:00)?
I have stored about 4 billion rows of data in an hdf5 file.
Now, I want to select some data.
My code looks like this:
import pandas as pd
ss = pd.HDFStore("xh_data_L9.hdf5") #<class 'pandas.io.pytables.HDFStore'>
print(type(ss))
print(ss.keys())
s_1 = ss.select('alldata',start=0,stop=500) # data example
print(s_1)
ss.close()
I found that the HDFStore.select usage looks like this:
HDFStore.select(key, where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, auto_close=False)
# these attempts do not run successfully
s_3 = ss.select('alldata', where="Time>2021-11-21 00:00:00 & Time<2021-11-22 00:00:00)")
s_3 = ss.select('alldata', ['Name'] == 'Name_A')
I have googled some methods, but I don't know how to use "where".
I found that the reason was whether data_columns was set up when the file was created.
# an hdf5 file created this way does not have data_columns
ss.append('store', df_temp, index=True)
# an hdf5 file created this way does have data_columns
store.append("store", df_temp, format="table", data_columns=True)
# check whether the store includes data_columns
import pandas as pd
ss = pd.HDFStore("store.hdf5")
print(ss.info())
If the result includes "dc->[Time,Name,Value]", the data columns exist and the query works:
ss.select("store", where="Name='Name_A'")
# single quotation marks are required around string values
The following is the official website explanation for data_columns:
data_columns :
list of columns, or True, default None
List of columns to create as indexed data columns for on-disk
queries, or True to use all columns. By default only the axes of the object are indexed.
See here <https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#query-via-data-columns>.
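To tie the pieces together, here is a minimal, self-contained sketch of the pattern described above: write a small frame with data_columns=True, then filter by Name and a time window through where. The example data and file name are made up; only the data_columns/where mechanics are the point (PyTables, i.e. the tables package, must be installed):
import pandas as pd

# made-up example frame, just to illustrate the query pattern
df_temp = pd.DataFrame({
    'Time': pd.to_datetime(['2021-11-20 06:00', '2021-11-21 06:00',
                            '2021-11-21 18:00', '2021-11-22 06:00']),
    'Name': ['Name_A', 'Name_A', 'Name_B', 'Name_A'],
    'Value': [1.0, 2.0, 3.0, 4.0],
})

# write with data_columns=True so Time, Name and Value become queryable on disk
with pd.HDFStore('store.hdf5') as store:
    store.append('store', df_temp, format='table', data_columns=True)

# select by name and time window; string values need their own quotes inside where
with pd.HDFStore('store.hdf5') as store:
    result = store.select(
        'store',
        where="Name=='Name_A' & Time>=pd.Timestamp('2021-11-21') & Time<pd.Timestamp('2021-11-22')",
    )
print(result)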

Pandas read_xml and SEPA (CAMT 053) XML

Recently I wanted to try the newly implemented read_xml function in pandas. I thought about testing the feature with a SEPA camt-format XML. I'm stuck with the function's parameters, as I'm unfamiliar with the lxml logic. I tried pointing to the transaction values as rows (the "Ntry" tag), as I thought this would then loop through those rows and create the dataframe. Setting xpath to the default returns an empty dataframe with the columns "GrpHdr" and "Rpt", but the relevant data is one level below "Rpt". Setting xpath='//*' creates a huge dataframe with every tag as a column and values randomly sorted.
If anyone is familiar with using pandas read_xml and nested XML, I'd appreciate any hints.
The xml file looks like this (fake values):
<Document>
  <BkToCstmrAcctRpt>
    <GrpHdr>
      <MsgId>Account</MsgId>
      <CreDtTm>2021-08-05T14:20:23.077+02:00</CreDtTm>
      <MsgRcpt>
        <Nm> Name</Nm>
      </MsgRcpt>
    </GrpHdr>
    <Rpt>
      <Id>Account ID</Id>
      <CreDtTm>2021-08-05T14:20:23.077+02:00</CreDtTm>
      <Acct>
        <Id>
          <IBAN>DEXXXXX</IBAN>
        </Id>
      </Acct>
      <Bal>
        <Tp>
          <CdOrPrtry>
          </CdOrPrtry>
        </Tp>
        <Amt Ccy="EUR">161651651651</Amt>
        <CdtDbtInd>CRDT</CdtDbtInd>
        <Dt>
          <DtTm>2021-08-05T14:20:23.077+02:00</DtTm>
        </Dt>
      </Bal>
      <Ntry>
        <Amt Ccy="EUR">11465165</Amt>
        <CdtDbtInd>CRDT</CdtDbtInd>
        <Sts>BOOK</Sts>
        <BookgDt>
          <Dt>2021-08-02</Dt>
        </BookgDt>
        <ValDt>
          <Dt>2021-08-02</Dt>
        </ValDt>
        <BkTxCd>
          <Domn>
            <Cd>PMNT</Cd>
            <Fmly>
              <Cd>RCDT</Cd>
              <SubFmlyCd>ESCT</SubFmlyCd>
            </Fmly>
          </Domn>
          <Prtry>
            <Cd>NTRF+65454</Cd>
            <Issr>DFE</Issr>
          </Prtry>
        </BkTxCd>
        <NtryDtls>
          <TxDtls>
            <Amt Ccy="EUR">4945141.0</Amt>
            <CdtDbtInd>CRDT</CdtDbtInd>
            <BkTxCd>
              <Domn>
                <Cd>PMNT</Cd>
                <Fmly>
                  <Cd>RCDT</Cd>
                  <SubFmlyCd>ESCT</SubFmlyCd>
                </Fmly>
              </Domn>
              <Prtry>
                <Cd>NTRF+55155</Cd>
                <Issr>DFEsds</Issr>
              </Prtry>
            </BkTxCd>
            <RltdPties>
              <Dbtr>
                <Nm>Name</Nm>
              </Dbtr>
              <Cdtr>
                <Nm>Name</Nm>
              </Cdtr>
            </RltdPties>
            <RmtInf>
              <Ustrd>Referenz NOTPROVIDED</Ustrd>
              <Ustrd> Buchug</Ustrd>
            </RmtInf>
          </TxDtls>
        </NtryDtls>
      </Ntry>
    </Rpt>
  </BkToCstmrAcctRpt>
</Document>
The bank statement is not a shallow XML, and thus not very suitable for pandas.read_xml (as indicated in the documentation).
Instead, I suggest using the sepa library.
from sepa import parser
import re
import pandas as pd

# Utility function to remove additional namespaces from the XML
def strip_namespace(xml):
    return re.sub(' xmlns="[^"]+"', '', xml, count=1)

# Read file
with open('example.xml', 'r') as f:
    input_data = f.read()

# Parse the bank statement XML to dictionary
camt_dict = parser.parse_string(parser.bank_to_customer_statement, bytes(strip_namespace(input_data), 'utf8'))
statements = pd.DataFrame.from_dict(camt_dict['statements'])

all_entries = []
for i, _ in statements.iterrows():
    if 'entries' in camt_dict['statements'][i]:
        df = pd.DataFrame()
        dd = pd.DataFrame.from_records(camt_dict['statements'][i]['entries'])
        df['Date'] = dd['value_date'].str['date']
        df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%Y-%m-%d')
        iban = camt_dict['statements'][i]['account']['id']['iban']
        df['IBAN'] = iban
        df['Currency'] = dd['amount'].str['currency']
        all_entries.append(df)
df_entries = pd.concat(all_entries)
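For comparison, this is roughly what pandas.read_xml does with the nested file (saved as example.xml, which has no default namespace): pointing xpath at the entry nodes only recovers their shallow children, while nested blocks such as BkTxCd or NtryDtls come back empty, which is why the dedicated parser above is the easier route. A real camt file that declares a default namespace would additionally need the namespaces argument.
import pandas as pd

# only the direct children of each <Ntry> are parsed; deeper levels are lost
entries = pd.read_xml('example.xml', xpath='.//Ntry')
print(entries[['Amt', 'CdtDbtInd', 'Sts']])

# for a namespaced file the call would look like this (prefix/URI are placeholders):
# pd.read_xml('statement.xml', xpath='.//ns:Ntry', namespaces={'ns': 'urn:iso:...'})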

Import Balance Sheet in an automatic organized manner from SEC to Dataframe

I am looking at getting the Balance Sheet data automatically and properly organized for any company using Beautiful Soup.
I am not planning on getting each variable individually, but rather the whole Balance Sheet. Originally, I was trying various pieces of code to extract the URL for a particular company of my choice.
For Example, suppose I want to get the Balance Sheet data from the following URL:
URL1:'https://www.sec.gov/Archives/edgar/data/1418121/000118518520000213/aple20191231_10k.htm'
or from
URL2:'https://www.sec.gov/Archives/edgar/data/1326801/000132680120000046/form8-k03312020earnings.htm'
I am trying to write a function (suppose it is called get_balancesheet(URL)) such that, regardless of the URL, you get a DataFrame that contains the balance sheet in an organized manner.
# Import libraries
import requests
import re
from bs4 import BeautifulSoup
I wrote the following function that needs a lot of improvement
def Get_Data_Balance_Sheet(url):
    page = requests.get(url)
    # Create a BeautifulSoup object
    soup = BeautifulSoup(page.content)
    futures1 = soup.find_all(text=re.compile('CONSOLIDATED BALANCE SHEETS'))
    Table = []
    for future in futures1:
        for row in future.find_next("table").find_all("tr"):
            t1 = [cell.get_text(strip=True) for cell in row.find_all("td")]
            Table.append(t1)
    # Remove list from list of lists if list is empty
    Table = [x for x in Table if x != []]
    return Table
Then I execute the following
url='https://www.sec.gov/Archives/edgar/data/1326801/000132680120000013/fb-12312019x10k.htm'
Tab=Get_Data_Balance_Sheet(url)
Tab
Note that this is not what I am planning to have: it is not simply about putting it into a dataframe, but about changing it so that, regardless of which URL we use, we can get the Balance Sheet.
Well, this being EDGAR it's not going to be simple, but it's doable.
First things first: with the CIK you can extract specific filings of specific types made by the CIK filer during a specific period. So let's say you are interested in Forms 10-K and 10-Q, original or amended (as in "FORM 10-K/A", for example), filed by this CIK filer from 2019 through 2020.
start = 2019
end = 2020
cik = 220000320193
short_cik = str(cik)[-6:] #we will need it later to form urls
First we need to get a list of filings meeting these criteria and load it into beautifulsoup:
import requests
from bs4 import BeautifulSoup as bs
url = f"https://www.sec.gov/cgi-bin/srch-edgar?text=cik%3D%{cik}%22+AND+form-type%3D(10-q*+OR+10-k*)&first={start}&last={end}"
req = requests.get(url)
soup = bs(req.text,'lxml')
There are 8 filings meeting the criteria: two Form 10-K filings and six Form 10-Q filings. Each of these filings has an accession number. The accession number is hiding in the URL of each of these filings, and we need to extract it to get to the actual target: the Excel file containing the financial statements that is attached to each specific filing.
acc_nums = []
for link in soup.select('td>a[href]'):
    target = link['href'].split(short_cik, 1)
    if len(target) > 1:
        acc_num = target[1].split('/')[1]
        if not acc_num in acc_nums:  # we need this filter because each filing has two forms: text and html, with the same accession number
            acc_nums.append(acc_num)
At this point, acc_nums contains the accession number for each of these 8 filings. We can now download the target Excel file. Obviously, you can loop through acc_nums and download all 8, but let's say you are only looking for (randomly) the Excel file attached to the third filing:
fs_url = f"https://www.sec.gov/Archives/edgar/data/{short_cik}/{acc_nums[2]}/Financial_Report.xlsx"
fs = requests.get(fs_url)
with open('random_edgar.xlsx', 'wb') as output:
    output.write(fs.content)
And there you'll have more than you'll ever want to know about Apple's financials at that point in time...
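Once Financial_Report.xlsx is on disk, getting the balance sheet into a data frame is a short step with pandas. The sheet names vary from filer to filer, so this is just a sketch of how one might look for them (requires openpyxl; the "BALANCE" filter is my own illustrative choice):
import pandas as pd

# load every sheet of the downloaded report into a dict of data frames
sheets = pd.read_excel('random_edgar.xlsx', sheet_name=None)

# sheet names differ between filers, so inspect them first
print(list(sheets.keys()))

# hypothetical filter: keep sheets whose name mentions "BALANCE"
balance = {name: df for name, df in sheets.items() if 'BALANCE' in name.upper()}
for name, df in balance.items():
    print(name)
    print(df.head())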

Scraping table data with BeautifulSoup or Pandas

I'm somewhat new to using Python and I've been given a task that requires data scraping from a table. I don't know very much HTML either. I've never done this before and have spent a couple of days looking at various ways to scrape tables. Unfortunately, all of the examples use what appears to be a simpler webpage layout than what I'm dealing with. I've tried quite a few methods, but none of them let me select the table data that I need.
How would one scrape the table at the bottom of the following webpage under the "Daily Water Level" tab?
url = https://apps.wrd.state.or.us/apps/gw/gw_info/gw_hydrograph/Hydrograph.aspx?gw_logid=HARN0052657
I've tried using the methods in the following links and others not show here:
Beautiful Soup Scraping table
Scrape table with BeautifulSoup
Web scraping with BeautifulSoup
Some of the script I've tried:
from bs4 import BeautifulSoup
import requests
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("table") # {"class": "xxxx"})
I've also tried using pandas, but I can't figure out how to select the table I need instead of the first table on the webpage that has the basic well information:
import pandas as pd
df_list = pd.read_html(url)
df_list
Unfortunately the data I need doesn't even show up when I run this script and the table I'm trying to select doesn't have a class that I can use to select only that table and not the table of basic well information. I've inspected the webpage, but can't seem to find a way to get to the correct table.
As far as the final result would look, I would need to export it as a csv or as a pandas data frame so that I can then graph it with modeled groundwater data for comparison purposes. Any suggestions would be greatly appreciated!
Try the approach below using Python requests: it is simple, straightforward, reliable, fast, and needs less code. I fetched the API URL from the website itself after inspecting the network section in the Google Chrome browser.
What exactly the script below does:
First it takes the API URL and does a GET request with the dynamic parameters (in CAPS); you can change the values of the well number and the start and end dates to get the desired result.
After getting the data, the script parses the JSON data using json.loads.
It iterates over the list of daily water level data and creates a list of all the data points so that they can be used to create a CSV file, e.g. GW Log Id, GW Site Id, Land Surface Elevation, Record Date, etc.
Finally it writes all the headers and data to the CSV file. (Important: please make sure to put the file path in the file_path variable.)
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
import csv
def scrap_daily_water_level():
    file_path = ''  # Input File path here
    file_name = 'daily_water_level_data.csv'  # File name

    # CSV headers
    csv_headers = ['Line #', 'GW Log Id', 'GW Site Id', 'Land Surface Elevation', 'Record Date', 'Restrict to OWRD only', 'Reviewed Status', 'Reviewed Status Description', 'Water level ft above mean sea level', 'Water level ft below land surface']
    list_of_water_readings = []

    # Dynamic Params
    WELL_NO = 'HARN0052657'
    START_DATE = '1/1/1905'
    END_DATE = '12/30/2050'

    # API URL
    URL = 'https://apps.wrd.state.or.us/apps/gw/gw_data_rws/api/' + WELL_NO + '/gw_recorder_water_level_daily_mean_public/?start_date=' + START_DATE + '&end_date=' + END_DATE + '&reviewed_status=&restrict_to_owrd_only=n'

    response = requests.get(URL, verify=False)  # GET API call
    json_result = json.loads(response.text)  # JSON loads to parse JSON data
    print('Daily water level data count ', json_result['feature_count'])  # Prints no. of data counts
    extracted_data = json_result['feature_list']  # Extracted data in JSON form

    for idx, item in enumerate(extracted_data):  # Iterate over the list of extracted data
        list_of_water_readings.append({  # append and create list of data with headers for further usage
            'Line #': idx + 1,
            'GW Log Id': item['gw_logid'],
            'GW Site Id': item['gw_site_id'],
            'Land Surface Elevation': item['land_surface_elevation'],
            'Record Date': item['record_date'],
            'Restrict to OWRD only': item['restrict_to_owrd_only'],
            'Reviewed Status': item['reviewed_status'],
            'Reviewed Status Description': item['reviewed_status_description'],
            'Water level ft above mean sea level': item['waterlevel_ft_above_mean_sea_level'],
            'Water level ft below land surface': item['waterlevel_ft_below_land_surface']
        })

    # Create CSV and write data in to it.
    with open(file_path + file_name, 'a+') as daily_water_level_data_CSV:  # Open file in a+ mode
        csvwriter = csv.DictWriter(daily_water_level_data_CSV, delimiter=',', lineterminator='\n', fieldnames=csv_headers)
        print('Writing CSV header now...')
        csvwriter.writeheader()  # Write headers in CSV file
        for item in list_of_water_readings:  # iterate over the appended data and save them in to the CSV file.
            print('Writing data rows now..')
            print(item)
            csvwriter.writerow(item)

scrap_daily_water_level()
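Since the question also asks for a pandas data frame for later graphing, the same JSON endpoint can be loaded into pandas directly. A minimal sketch reusing the URL and field names from the script above (they come from that script, not from any official API documentation):
import pandas as pd
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

WELL_NO = 'HARN0052657'
URL = ('https://apps.wrd.state.or.us/apps/gw/gw_data_rws/api/' + WELL_NO +
       '/gw_recorder_water_level_daily_mean_public/'
       '?start_date=1/1/1905&end_date=12/30/2050&reviewed_status=&restrict_to_owrd_only=n')

# verify=False mirrors the script above; drop it if the certificate checks out for you
payload = requests.get(URL, verify=False).json()

# feature_list holds one record per day; json_normalize flattens it into columns
df = pd.json_normalize(payload['feature_list'])
df['record_date'] = pd.to_datetime(df['record_date'])

print(df[['record_date', 'waterlevel_ft_below_land_surface']].head())
df.to_csv('daily_water_level_data.csv', index=False)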

python3 - how to scrape the data from span

I'm trying to use Python 3 and BeautifulSoup.
import requests
import json
from bs4 import BeautifulSoup
url = "https://www.binance.com/pl"
#get the data
data = requests.get(url);
soup = BeautifulSoup(data.text,'lxml')
print(soup)
If I open the HTML code in the browser, I can see the BTC price:
[screenshot: HTML code in the browser]
But in my data (printed to the console) I can't see the BTC price:
[screenshot: console output without the BTC price]
Could you give me some advice on how to scrape this data?
Use .findAll() to find all the rows, and then you can use it to find all the cells in a given row. You have to look at how the page is structured: it's not a standard table, but a bunch of divs made to look like a table, so you have to look at the role of each div to get to the data you want.
I'm assuming that you're going to want to look at specific rows, so my example uses the Para column to find those rows. Since the star is in its own little cell, the Para column is the second cell, i.e. index 1. With that, it's just a question of which cells you want to export.
You could take out the filter if you want to get everything. You could also modify it to check whether the value of a cell is above a certain price point.
# Import necessary libraries
import requests
from bs4 import BeautifulSoup
# Ignore the insecure warning
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
# Set options and which rows you want to look at
url = "https://www.binance.com/pl"
desired_rows = ['ADA/BTC', 'ADX/BTC']
# Get the page and convert it into beautiful soup
response = requests.get(url, verify=False)
soup = BeautifulSoup(response.text, 'html.parser')
# Find all table rows
rows = soup.findAll('div', {'role':'row'})
# Process all the rows in the table
for row in rows:
    try:
        # Get the cells for the given row
        cells = row.findAll('div', {'role':'gridcell'})
        # Convert them to just the values of the cell, ignoring attributes
        cell_values = [c.text for c in cells]
        # see if the row is one you want
        if cell_values[1] in desired_rows:
            # Output the data however you'd like
            print(cell_values[1], cell_values[-1])
    except IndexError:  # there was a row without cells
        pass
This resulted in the following output:
ADA/BTC 1,646.39204255
ADX/BTC 35.29384873
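If the goal is a data frame rather than printed values, the same row/cell logic can feed a list of records. A self-contained sketch mirroring the answer above (the page layout may well have changed since this was written, so treat the selectors as illustrative; 'Pair' and 'LastValue' are my own column labels):
import pandas as pd
import requests
from bs4 import BeautifulSoup
from requests.packages.urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

url = "https://www.binance.com/pl"
desired_rows = ['ADA/BTC', 'ADX/BTC']

soup = BeautifulSoup(requests.get(url, verify=False).text, 'html.parser')

records = []
for row in soup.findAll('div', {'role': 'row'}):
    cells = [c.text for c in row.findAll('div', {'role': 'gridcell'})]
    if len(cells) > 1 and cells[1] in desired_rows:
        # column labels are made up here, the page does not name them
        records.append({'Pair': cells[1], 'LastValue': cells[-1]})

df = pd.DataFrame(records)
print(df)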
