How can I extract an income statement from all company concepts? - python-3.x

All company concepts in XBRL format can be extracted with the SEC's RESTful API.
For example, I want to get Tesla's concepts in XBRL format for 2020, so I get Tesla's CIK and the URL for the API:
cik = '1318605'
url = 'https://data.sec.gov/api/xbrl/companyfacts/CIK{:>010s}.json'.format(cik)
The 2020 annual figures are the records whose fy and fp elements satisfy:
'fy' == 2020 and 'fp' == 'FY'
Here is the whole Python code to call the SEC's API:
import requests
import json

cik = '1318605'
url = 'https://data.sec.gov/api/xbrl/companyfacts/CIK{:>010s}.json'.format(cik)
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
}
res = requests.get(url=url, headers=headers)
result = json.loads(res.text)
concepts_list = list(result['facts']['us-gaap'].keys())
data_concepts = result['facts']['us-gaap']
fdata = {}
for item in concepts_list:
    data = data_concepts[item]['units']
    data_units = list(data_concepts[item]['units'].keys())
    for data_units_attr in data_units:
        for record in data[data_units_attr]:
            if record['fy'] == 2020 and record['fp'] == 'FY':
                fdata[item] = record['val']
fdata now contains all the company concepts and their values for Tesla in 2020; here is part of it:
fdata
{'AccountsAndNotesReceivableNet': 334000000,
'AccountsPayableCurrent': 6051000000,
'AccountsReceivableNetCurrent': 1886000000,
How can I get all the concepts that belong to the income statement? I want to extract them to build an income statement.
Maybe I should also add some values from dei in almost the same way as above:
EntityCommonStockSharesOutstanding:959853504
EntityPublicFloat:160570000000
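A minimal sketch of that idea, reusing the loop above on the 'dei' taxonomy (assuming the dei records carry the same fy/fp fields as the us-gaap ones):
# sketch: pull the dei facts the same way as the us-gaap facts above
# (assumption: dei records have the same fy/fp record layout)
dei_concepts = result['facts']['dei']
for item in dei_concepts:
    for unit_records in dei_concepts[item]['units'].values():
        for record in unit_records:
            if record.get('fy') == 2020 and record.get('fp') == 'FY':
                fdata[item] = record['val']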
It is simple to read a financial statement such as the income statement off the iXBRL rendering:
https://www.sec.gov/ix?doc=/Archives/edgar/data/0001318605/000156459021004599/tsla-10k_20201231.htm
I can get that. Please help me replicate Tesla's 2020 annual income statement from the SEC's RESTful API, or extract the income statement from the whole instance file:
https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/0001564590-21-004599.txt
If you tell me how to get all the concepts that belong to the income statement, I can finish the job; the balance sheet and cash flow statement can be extracted the same way. In my case fdata contains all the concepts of the income statement, balance sheet, and cash flow statement mixed together. Which concept in fdata belongs to which financial statement? How can I map every concept to the income statement, balance sheet, or cash flow statement?
# expressed in pseudocode
income_statement = fdata[all_concepts_belong_to_income_statement]
balance_statement = fdata[all_concepts_belong_to_balance_statement]
cashflow_statement = fdata[all_concepts_belong_to_cashflow_statement]

If I understand you correctly, the following should help you get at least part of the way there. I'm not going to try to replicate the financial statements; instead, I'll choose just one and show the concepts used in creating it - you'll have to take it from there.
I'll use the "income statement" (formally called in the instance document "Consolidated Statements of Comprehensive Income (Loss) - USD ($) - $ in Millions") as the example. Again, the data here is not in JSON, but HTML.
# import the necessary libraries
import lxml.html as lh
import requests

# get the filing
url = 'https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/0001564590-21-004599.txt'
req = requests.get(url, headers={"User-Agent": "b2g"})

# parse the filing content
doc = lh.fromstring(req.content)

# locate the relevant table and collect the name attribute of each tagged fact
concepts = doc.xpath("//p[@id='Consolidated_Statmnts_of_Cmprehnsve_Loss']/following-sibling::div[1]//table//@name")
for c in set(concepts):
    print(c)
Output:
There are 5 concepts, repeated 3 times each:
us-gaap:ComprehensiveIncomeNetOfTaxAttributableToNoncontrollingInterest
us-gaap:OtherComprehensiveIncomeLossForeignCurrencyTransactionAndTranslationAdjustmentNetOfTax
us-gaap:ProfitLoss
us-gaap:ComprehensiveIncomeNetOfTax
us-gaap:ComprehensiveIncomeNetOfTaxIncludingPortionAttributableToNoncontrollingInterest
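From here, one hedged way to tie this back to your fdata dict is to strip the taxonomy prefix and intersect (a sketch, assuming all the concepts on the statement are us-gaap):
# strip the 'us-gaap:' prefix and keep only the fdata entries
# that appear on this particular statement
income_concepts = {c.split(':', 1)[1] for c in set(concepts)}
income_statement = {k: v for k, v in fdata.items() if k in income_concepts}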

I found an indirect way to extract the income statement via the SEC's published data. The SEC already publishes all the data extracted from the raw XBRL files (also called published XBRL instances), but the tool that parses an XBRL file into the four data files num.txt, sub.txt, pre.txt, and tag.txt is not published. What I want to do is create that tool myself.
Step 1:
Download the 2020 data sets from https://www.sec.gov/dera/data/financial-statement-data-sets.html and unzip them; we get pre.txt, num.txt, sub.txt, tag.txt.
Step 2:
Create a database and the tables pre, num, sub, tag according to the fields in pre.txt, num.txt, sub.txt, tag.txt.
Step 3:
Import pre.txt, num.txt, sub.txt, tag.txt into the tables pre, num, sub, tag.
Step 4:
Query in my PostgreSQL:
The adsh (accession number) for Tesla's 2020 annual financial statement is `0001564590-21-004599`:
\set accno '0001564590-21-004599'
select tag,value from num where adsh=:'accno' and ddate = '2020-12-31' and qtrs=4 and tag in
(select tag from pre where adsh=:'accno' and stmt='IS');
tag | value
---------------------------------------------------------------------------------------------+------------------
NetIncomeLoss | 721000000.0000
OperatingLeasesIncomeStatementLeaseRevenue | 1052000000.0000
GrossProfit | 6630000000.0000
InterestExpense | 748000000.0000
CostOfRevenue | 24906000000.0000
WeightedAverageNumberOfSharesOutstandingBasic | 933000000.0000
IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest | 1154000000.0000
EarningsPerShareDiluted | 0.6400
ProfitLoss | 862000000.0000
OperatingExpenses | 4636000000.0000
InvestmentIncomeInterest | 30000000.0000
OperatingIncomeLoss | 1994000000.0000
SellingGeneralAndAdministrativeExpense | 3145000000.0000
NetIncomeLossAttributableToNoncontrollingInterest | 141000000.0000
StockholdersEquityNoteStockSplitConversionRatio1 | 5.0000
NetIncomeLossAvailableToCommonStockholdersBasic | 690000000.0000
Revenues | 31536000000.0000
IncomeTaxExpenseBenefit | 292000000.0000
OtherNonoperatingIncomeExpense | -122000000.0000
WeightedAverageNumberOfDilutedSharesOutstanding | 1083000000.0000
EarningsPerShareBasic | 0.7400
ResearchAndDevelopmentExpense | 1491000000.0000
BuyOutOfNoncontrollingInterest | 31000000.0000
CostOfAutomotiveLeasing | 563000000.0000
CostOfRevenuesAutomotive | 20259000000.0000
CostOfServicesAndOther | 2671000000.0000
SalesRevenueAutomotive | 27236000000.0000
SalesRevenueServicesAndOtherNet | 2306000000.0000
(28 rows)
stmt='IS' selects the tags presented in the income statement, and ddate='2020-12-31' selects the annual report for fiscal year 2020. Please read the SEC's data field definitions for qtrs: it is the number of quarters a value covers, which is why qtrs=4 (full-year figures) is set in the select command.
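The same query can also be sketched in pandas without a database (a sketch, assuming the tab-delimited num.txt and pre.txt are in the working directory; in the raw files ddate is stored as yyyymmdd):
import pandas as pd

# sketch of the SQL above without a database; file paths and the
# yyyymmdd ddate format are assumptions about the raw data set files
num = pd.read_csv('num.txt', sep='\t', dtype=str)
pre = pd.read_csv('pre.txt', sep='\t', dtype=str)

accno = '0001564590-21-004599'
is_tags = pre.loc[(pre.adsh == accno) & (pre.stmt == 'IS'), 'tag']
income = num[(num.adsh == accno) & (num.ddate == '20201231') &
             (num.qtrs == '4') & num.tag.isin(is_tags)][['tag', 'value']]
print(income)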
Compare with https://www.nasdaq.com/market-activity/stocks/tsla/financials (part shown below; look at the 12/31/2020 column):
Period Ending: 12/31/2021 12/31/2020 12/31/2019 12/31/2018
Total Revenue $53,823,000 $31,536,000 $24,578,000 $21,461,000
Cost of Revenue $40,217,000 $24,906,000 $20,509,000 $17,419,000
Gross Profit $13,606,000 $6,630,000 $4,069,000 $4,042,000
Research and Development $2,593,000 $1,491,000 $1,343,000 $1,460,000
All the numbers are equal. The XBRL tags are Revenues, CostOfRevenue, GrossProfit, and ResearchAndDevelopmentExpense; the respective terms in financial accounting are Total Revenue, Cost of Revenue, Gross Profit, and Research and Development. The numbers I get are correct.
Some day I may find the direct way to parse the raw XBRL file to get the income statement; I am not familiar with XBRL yet, and I hope the Stack Overflow community can help me reach that goal.

Related

Avoiding cartesian when adding unique classifier to a list in python 3

I have six .csv files I am importing, and all contain emails:
Donors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Donors Q1 2021 R12.csv",
                     usecols=["Email Address"])
Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Activists Q1 2021 R12.csv",
                        usecols=["Email"])
Low_Level_Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Low Level Activists Q1 2021 R12.csv",
                                  usecols=["Email"])
Ambassadors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Ambassadors Q1 2021.csv",
                          usecols=["Email Address"])
Volunteers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Volunteers Q1 2021 R12.csv",
                         usecols=["Email Address"])
Followers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Followers Q1 2021 R12.csv",
                        usecols=["Email"])
While I am only importing emails (annoyingly with two different naming conventions because of the systems they originate from), I am adding the import name as a classifier - i.e. Donors, Volunteers, etc.
Donors['Value'] = "Donors"
Activists['Value'] = "Activists"
Low_Level_Activists['Value'] = "Low_Level_Activists"
Ambassadors['Value'] = "Ambassadors"
Volunteers['Value'] = "Volunteers"
Followers['Value'] = "Followers"
I then concatenate all the files and handle the naming issue. I am sure there is a more elegant way to do this but here's what I have:
S1 = pd.concat([Donors, Activists, Low_Level_Activists, Ambassadors, Volunteers, Followers], ignore_index=True)
S1['Handle'] = S1['Email Address'].where(S1['Email Address'].notnull(), S1['Email'])
S1 = S1.drop(['Email', 'Email Address'], axis=1)
print(S1['Handle'].count())  # checks full count
The total on that last line is 166,749
Here is my problem. I need to filter the emails down to unique values - easy enough using .nunique() - but I also need to carry the classifier along. So if a single email is both a Donor and an Activist, I pull both rows when I try to merge the unique values with the classifier.
I have been at this for many hours (and to the end of the Internet!) and can't seem to find a workable solution. I've tried dictionaries, for loops, merges, etc. ad infinitum. The unique email count is 165,923 (figured out via Python and/or Excel).
Essentially I want to pull the earliest classifier in my list on a match. So if an email is a Donor and an Activist, call them a Donor; or if an email is a Volunteer and a Follower, call them a Volunteer on the one email record.
Any help would be greatly appreciated.
I'll give it a try with some made-up data:
import pandas as pd
fa = pd.DataFrame([['paul@mail.com', 'Donors'], ['max@mail.com', 'Donors']], columns=['Handle', 'Value'])
fb = pd.DataFrame([['paul@mail.com', 'Activists'], ['annie@mail.com', 'Activists']], columns=['Handle', 'Value'])
S1 = pd.concat([fa, fb])
print(S1)
gives
           Handle      Value
0   paul@mail.com     Donors
1    max@mail.com     Donors
0   paul@mail.com  Activists
1  annie@mail.com  Activists
You can group by Handle and then pick any Value you like, e.g. the first:
for handle, group in S1.groupby('Handle'):
    print(handle, group.reset_index().loc[0, 'Value'])
gives
annie@mail.com Activists
max@mail.com Donors
paul@mail.com Donors
or collect all roles of a person:
for handle, group in S1.groupby('Handle'):
    print(handle, group.Value.unique())
gives
annie@mail.com ['Activists']
max@mail.com ['Donors']
paul@mail.com ['Donors' 'Activists']
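If you specifically want the "earliest classifier wins" rule, here is a sketch using an ordered categorical plus drop_duplicates; the priority list mirrors the import order from the question:
import pandas as pd

# sort by an ordered categorical so the highest-priority classifier
# comes first, then keep the first row per Handle
priority = ['Donors', 'Activists', 'Low_Level_Activists',
            'Ambassadors', 'Volunteers', 'Followers']
S1['Value'] = pd.Categorical(S1['Value'], categories=priority, ordered=True)
unique = (S1.sort_values('Value')
            .drop_duplicates('Handle', keep='first')
            .reset_index(drop=True))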

Let the user create a custom format using a string

I'd like a user to be able to create a custom format in a QtWidgets.QPlainTextEdit(), and have it format the string and spit out the result in another QtWidgets.QPlainTextEdit().
For example:
movie = {
"Title":"The Shawshank Redemption",
"Year":"1994",
"Rated":"R",
"Released":"14 Oct 1994",
"Runtime":"142 min",
"Genre":"Drama",
"Director":"Frank Darabont",
"Writer":"Stephen King (short story \"Rita Hayworth and Shawshank Redemption\"),Frank Darabont (screenplay)",
"Actors":"Tim Robbins, Morgan Freeman, Bob Gunton, William Sadler",
"Plot":"Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.",
"Language":"English",
"Country":"USA",
"Awards":"Nominated for 7 Oscars. Another 21 wins & 36 nominations.",
"Poster":"https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU#._V1_SX300.jpg",
"Ratings": [
{
"Source":"Internet Movie Database",
"Value":"9.3/10"
},
{
"Source":"Rotten Tomatoes",
"Value":"91%"
},
{
"Source":"Metacritic",
"Value":"80/100"
}
],
"Metascore":"80",
"imdbRating":"9.3",
"imdbVotes":"2,367,380",
"imdbID":"tt0111161",
"Type":"movie",
"DVD":"15 Aug 2008",
"BoxOffice":"$28,699,976",
"Production":"Columbia Pictures, Castle Rock Entertainment",
"Website":"N/A"
}
custom_format = '[ {Title} | ⌚ {Runtime} | ⭐ {Genre} | 📅 {Released} | {Rated} ]'.format(Title=movie['Title'], Runtime=movie['Runtime'], Genre=movie['Genre'],Released=movie['Released'],Rated=movie['Rated'])
print(custom_format)
The code above would easily print [ The Shawshank Redemption | ⌚ 142 min | ⭐ Drama | 📅 14 Oct 1994 | R ].
However, if I change this code from:
custom_format = '[ {Title} | ⌚ {Runtime} | ⭐ {Genre} | 📅 {Released} | {Rated} ]'.format(Title=movie['Title'], Runtime=movie['Runtime'], Genre=movie['Genre'],Released=movie['Released'],Rated=movie['Rated'])
To:
custom_format = "'[ {Title} | ⌚ {Runtime} | ⭐ {Genre} | 📅 {Released} | {Rated} ]'.format(Title=movie['Title'], Runtime=movie['Runtime'], Genre=movie['Genre'],Released=movie['Released'],Rated=movie['Rated'])"
Notice that the whole thing is wrapped in "", so it is a string. Doing this will not print out the format that I want.
The reason I wrapped it in "" is that when I add my original custom_format into a QtWidgets.QPlainTextEdit(), it is converted into a string and won't format later on.
So my original idea was: the user creates a custom format for themselves in a QtWidgets.QPlainTextEdit(). Then I copy that format, open a new window where the movie json variable is contained, and paste the format into another QtWidgets.QPlainTextEdit(), where it would hopefully show it formatted correctly.
Any help on this would be appreciated.
ADDITIONAL INFORMATION:
User creates their format inside QtWidgets.QPlainTextEdit().
Then the user clicks Test Format, which should display [ The Shawshank Redemption | ⌚ 142 min | ⭐ Drama | 📅 14 Oct 1994 | R ], but instead it displays the raw, unformatted string.
Trying to use the full format command would require an eval(), which is normally considered not only bad practice, but also a serious security issue, especially when the input argument is completely set by the user.
Since the fields are known, I see little point in providing the whole format line, and it is better to parse the format string looking for keywords, then use keyword lookup to create the output.
import re
from PyQt5 import QtWidgets  # assuming PyQt5; PySide2 works the same way

class Formatter(QtWidgets.QWidget):
    def __init__(self):
        super().__init__()
        layout = QtWidgets.QVBoxLayout(self)
        self.formatBase = QtWidgets.QPlainTextEdit(
            '[ {Title} | ⌚ {Runtime} | ⭐ {Genre} | 📅 {Released} | {Rated} ]')
        self.formatOutput = QtWidgets.QPlainTextEdit()
        layout.addWidget(self.formatBase)
        layout.addWidget(self.formatOutput)
        self.formatBase.textChanged.connect(self.processFormat)
        self.processFormat()

    def processFormat(self):
        format_str = self.formatBase.toPlainText()
        # escape double braces
        clean = re.sub('{{', '', re.sub('}}', '', format_str))
        # capture keyword arguments
        tokens = re.split(r'\{(.*?)\}', clean)
        keywords = tokens[1::2]
        try:
            # build the dictionary with the given arguments; unrecognized
            # keywords are printed back in the {key} form, in order to let
            # the user know that the key wasn't valid ('movie' is the dict
            # from the question)
            values = {k: movie.get(k, '{{{}}}'.format(k)) for k in keywords}
            self.formatOutput.setPlainText(format_str.format(**values))
        except (ValueError, KeyError):
            # exception for unmatched braces
            pass
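For completeness, a minimal harness to try the widget (assuming PyQt5 and the movie dict from the question are in scope):
import sys

if __name__ == '__main__':
    app = QtWidgets.QApplication(sys.argv)
    w = Formatter()
    w.show()
    sys.exit(app.exec_())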

How to remove '/5' from CSV file

I am cleaning a restaurant data set using Pandas' read_csv.
I have columns like this:
name, online_order, book_table, rate, votes
xxxx, Yes, Yes, 4.5/5, 705
I expect them to be like this:
name, online_order, book_table, rate, votes
xxxx, Yes, Yes, 4.5, 705
You basically need to split each item in dataframe["rate"] on "/" and take out the part you need, then .apply this on your dataframe using lambda x: getRate(x):
def getRate(x):
    return str(x).split("/")[0]
To use it with the column named rate, we can write:
dataframe["rate"] = dataframe["rate"].apply(lambda x: getRate(x))
You can use the python .split() function to remove specific text, given that the text is consistently going to be "/5", and there are no instances of "/5" that you want to keep in that string. You can use it like this:
num = "4.5/5"
num.split("/5")[0]
output: '4.5'
If this isn't exactly what you need, there are more regex Python functions here
You can use DataFrame.apply() to make your replacement operation on the rate column:
def clean(x):
    if "/" not in x:
        return x
    else:
        return x[0:x.index('/')]

df.rate = df.rate.apply(lambda x: clean(x))
print(df)
Output
+----+-------+---------------+-------------+-------+-------+
| | name | online_order | book_table | rate | votes |
+----+-------+---------------+-------------+-------+-------+
| 0 | xxxx | Yes | Yes | 4.5 | 705 |
+----+-------+---------------+-------------+-------+-------+
EDIT
Edited to handle situations in which there could be multiple "/", or the denominator could be something other than /5 (i.e. /4 or /1/3 ...).
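As a side note, the same cleanup can also be sketched without apply at all, using pandas' vectorized string methods:
# vectorized sketch: split on '/' and keep the left-hand part
df['rate'] = df['rate'].astype(str).str.split('/').str[0]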

Trying to scrape and segregate into Headings and Contents. The problem is that both have the same class and tags. How to segregate?

I am trying to scrape http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html, segregating it into two parts, Heading and Content. The problem is that both have the same class and tags. Other than using regex and hard-coding, how can I distinguish them and extract them into two columns in Excel?
In the picture (https://ibb.co/8X5xY9C), and on the website linked above, bold text (except the single alphabet letters like 'A' and the later 'back to top' links) represents a Heading, and the non-bold explanation just below it represents the Content (the Content even contains 'li' and 'ul' blocks later on the site, which should fall under the respective Heading).
# code to start with
from bs4 import BeautifulSoup
import requests

url = "http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html"
html = requests.get(url)
soup = BeautifulSoup(html.text, "html.parser")
Heading = soup.findAll('strong')
content = soup.findAll('div', {"class": "comp-rich-text"})
The output Excel should look something like this:
https://i.stack.imgur.com/NsMmm.png
I've thought about it a little more and came up with a better solution. Rather than "crowd" my initial solution, I chose to add a second one here.
Following my logic of splitting the html at the headlines (essentially breaking it up where we find <strong> tags), I chose to convert the soup to a string using .prettify(), split on those specific strings/tags, and read each piece back into BeautifulSoup to pull the text. From what I can see, it hasn't missed anything, but you'll have to search through the dataframe to double-check:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
sections = soup.find_all('div', {'class': 'accordion-section-content'})

results = {}
for section in sections:
    splits = section.prettify().split('<strong>')
    for each in splits:
        try:
            headline, content = each.split('</strong>')[0].strip(), each.split('</strong>')[1]
            headline = BeautifulSoup(headline, 'html.parser').text.strip()
            content = BeautifulSoup(content, 'html.parser').text.strip()
            content_split = content.split('\n')
            content = ' '.join([text.strip() for text in content_split if text != ''])
            results[headline] = content
        except:
            continue

df = pd.DataFrame(results.items(), columns=['Headings', 'Content'])
df.to_csv('C:/test.csv', index=False)
Output:
print (df)
Headings Content
0 Age requirements Applicants must be at least 18 years old at th...
1 Affordability Our affordability calculator is the same one u...
2 Agricultural restriction The only acceptable agricultural tie is where ...
3 Annual percentage rate of charge (APRC) The APRC is all fees associated with the mortg...
4 Adverse credit We consult credit reference agencies to look a...
5 Applicants (number of) The maximum number of applicants is two.
6 Armed Forces personnel Unsecured personal loans are only acceptable f...
7 Back to back Back to back is typically where the vendor has...
8 Customer funded purchase: when the customer has funded the purchase usin...
9 Bridging: residential mortgage applications where the cu...
10 Inherited: a recently inherited property where the benefi...
11 Porting: where a fixed/discounted rate was ported to a ...
12 Repossessed property: where the vendor is the mortgage lender in pos...
13 Part exchange: where the vendor is a large national house bui...
14 Bank statements We accept internet bank statements in paper fo...
15 Bonus For guaranteed bonuses we will consider an ave...
16 British National working overseas Applicants must be resident in the UK. Applica...
17 Builder's Incentives The maximum amount of acceptable incentive is ...
18 Buy-to-let (purpose) A buy-to-let mortgage can be used for: Purcha...
19 Capital Raising - Acceptable purposes permanent home improvem...
20 Buy-to-let (affordability) Buy to Let affordability must be assessed usin...
21 Buy-to-let (eligibility criteria) The property must be in England, Scotland, Wal...
22 Definition of a portfolio landlord We define a portfolio landlord as a customer w...
23 Carer's Allowance Carer's Allowance is paid to people aged 16 or...
24 Cashback Where a mortgage product includes a cashback f...
25 Casual employment Contract/agency workers with income paid throu...
26 Certification of documents When submitting copies of documents, please en...
27 Child Benefit We can accept up to 100% of working tax credit...
28 Childcare costs We use the actual amount the customer has decl...
29 When should childcare costs not be included? There are a number of situations where childca...
.. ... ...
108 Shared equity We lend on the Government-backed shared equity...
109 Shared ownership We do not lend against Shared Ownership proper...
110 Solicitors' fees We have a panel of solicitors for our fees ass...
111 Source of deposit We reserve the right to ask for proof of depos...
112 Sole trader/partnerships We will take an average of the last two years'...
113 Standard variable rate A standard variable rate (SVR) is a type of v...
114 Student loans Repayment of student loans is dependent on rec...
115 Tenure Acceptable property tenure: Feuhold, Freehold,...
116 Term Minimum term is 3 years Residential - Maximum...
117 Unacceptable income types The following forms of income are classed as u...
118 Bereavement allowance: paid to widows, widowers or surviving civil pa...
119 Employee benefit trusts (EBT): this is a tax mitigation scheme used in conjun...
120 Expenses: not acceptable as they're paid to reimburse pe...
121 Housing Benefit: payment of full or partial contribution to cla...
122 Income Support: payment for people on low incomes, working les...
123 Job Seeker's Allowance: paid to people who are unemployed or working 1...
124 Stipend: a form of salary paid for internship/apprentic...
125 Third Party Income: earned by a spouse, partner, parent who are no...
126 Universal Credit: only certain elements of the Universal Credit ...
127 Universal Credit The Standard Allowance element, which is the n...
128 Valuations: day one instruction We are now instructing valuations on day one f...
129 Valuation instruction A valuation will be automatically instructed w...
130 Valuation fees A valuation will always be obtained using a pa...
131 Please note: W hen upgrading the free valuation for a home...
132 Adding fees to the loan Product fees are the only fees which can be ad...
133 Product fee This fee is paid when the mortgage is arranged...
134 Working abroad Previously, we required applicants to be empl...
135 Acceptable - We may consider applications from people who: ...
136 Not acceptable - We will not consider applications from people...
137 Working and Family Tax Credits We can accept up to 100% of Working Tax Credit...
[138 rows x 2 columns]
EDIT: SEE OTHER SOLUTION PROVIDED
It's tricky. I essentially tried to grab the headings, then use each heading to grab all the text that follows it and precedes the next heading. The code below is a little messy and requires some cleaning up, but hopefully it gets you to a point you can work with, or at least moving in the right direction:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

url = 'http://www.intermediary.natwest.com/intermediary-solutions/lending-criteria.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
sections = soup.find_all('div', {'class': 'accordion-section-content'})

results = {}
for section in sections:
    headlines = section.find_all('strong')
    headlines = [each.text for each in headlines]
    for i, headline in enumerate(headlines):
        if headline != headlines[-1]:
            next_headline = headlines[i+1]
        else:
            next_headline = ''
        try:
            find_content = section(text=headline)[0].parent.parent.find_next_siblings()
            if ':' in headline and 'Gifted deposit' not in headline and 'Help to Buy' not in headline:
                content = section(text=headline)[0].parent.nextSibling
                results[headline] = content.strip()
                break
        except:
            find_content = section(text=re.compile(headline))[0].parent.parent.find_next_siblings()
        if find_content == []:
            try:
                find_content = section(text=headline)[0].parent.parent.parent.find_next_siblings()
            except:
                find_content = section(text=re.compile(headline))[0].parent.parent.parent.find_next_siblings()
        content = []
        for sibling in find_content:
            if next_headline not in sibling.text or headline == headlines[-1]:
                content.append(sibling.text)
            else:
                content = '\n'.join(content)
                results[headline.strip()] = content.strip()
                break
        if headline == headlines[-1]:
            content = '\n'.join(content)
            results[headline] = content.strip()

df = pd.DataFrame(results.items())

Behave: Writing a Scenario Outline with dynamic examples

Gherkin / Behave Examples
Gherkin syntax features test automation using examples:
Feature: Scenario Outline (tutorial04)

  Scenario Outline: Use Blender with <thing>
    Given I put "<thing>" in a blender
    When I switch the blender on
    Then it should transform into "<other thing>"

    Examples: Amphibians
      | thing         | other thing |
      | Red Tree Frog | mush        |
      | apples        | apple juice |

    Examples: Consumer Electronics
      | thing        | other thing |
      | iPhone       | toxic waste |
      | Galaxy Nexus | toxic waste |
The test suite would run four times, once for each example row.
My problem
How can I test using confidential data in the Examples section? For example, I would like to test an internal API with user ids or SSN numbers, without keeping the data hard coded in the feature file.
Is there a way to load the Examples dynamically from an external source?
Update: Opened a github issue on the behave project.
I've come up with another solution (behave-1.2.6):
I managed to dynamically create examples for a Scenario Outline by using before_feature.
Given a feature file (x.feature):
Feature: Verify squared numbers

  Scenario Outline: Verify square for <number>
    Then the <number> squared is <result>

    Examples: Static
      | number | result |
      | 1      | 1      |
      | 2      | 4      |
      | 3      | 9      |
      | 4      | 16     |

  # Use the tag to mark this outline
  @dynamic
  Scenario Outline: Verify square for <number>
    Then the <number> squared is <result>

    Examples: Dynamic
      | number | result |
      | .      | .      |
And the steps file (steps/x.py):
from behave import step

@step('the {number:d} squared is {result:d}')
def step_impl(context, number, result):
    assert number * number == result
The trick is to use before_feature in environment.py as it has already parsed the examples tables to the scenario outlines, but hasn't generated the scenarios from the outline yet.
import behave
import copy

def before_feature(context, feature):
    features = (s for s in feature.scenarios
                if type(s) == behave.model.ScenarioOutline and 'dynamic' in s.tags)
    for s in features:
        for e in s.examples:
            orig = copy.deepcopy(e.table.rows[0])
            e.table.rows = []
            for num in range(1, 5):
                n = copy.deepcopy(orig)
                # This relies on knowing that the table has two columns.
                n.cells = ['{}'.format(num), '{}'.format(num * num)]
                e.table.rows.append(n)
This will only operate on Scenario Outlines that are tagged with @dynamic.
The result is:
behave -k --no-capture
Feature: Verify squared numbers # features/x.feature:1

  Scenario Outline: Verify square for 1 -- @1.1 Static # features/x.feature:8
    Then the 1 squared is 1 # features/steps/x.py:3

  Scenario Outline: Verify square for 2 -- @1.2 Static # features/x.feature:9
    Then the 2 squared is 4 # features/steps/x.py:3

  Scenario Outline: Verify square for 3 -- @1.3 Static # features/x.feature:10
    Then the 3 squared is 9 # features/steps/x.py:3

  Scenario Outline: Verify square for 4 -- @1.4 Static # features/x.feature:11
    Then the 4 squared is 16 # features/steps/x.py:3

  @dynamic
  Scenario Outline: Verify square for 1 -- @1.1 Dynamic # features/x.feature:19
    Then the 1 squared is 1 # features/steps/x.py:3

  @dynamic
  Scenario Outline: Verify square for 2 -- @1.2 Dynamic # features/x.feature:19
    Then the 2 squared is 4 # features/steps/x.py:3

  @dynamic
  Scenario Outline: Verify square for 3 -- @1.3 Dynamic # features/x.feature:19
    Then the 3 squared is 9 # features/steps/x.py:3

  @dynamic
  Scenario Outline: Verify square for 4 -- @1.4 Dynamic # features/x.feature:19
    Then the 4 squared is 16 # features/steps/x.py:3

1 feature passed, 0 failed, 0 skipped
8 scenarios passed, 0 failed, 0 skipped
8 steps passed, 0 failed, 0 skipped, 0 undefined
Took 0m0.005s
This relies on having an Examples table with the same shape as the final table; in my example, two columns. I also don't fuss with creating new behave.model.Row objects, I just copy the one from the table and update it. For extra ugliness, if you're using a file, you can put the file name in the Examples table, as sketched below.
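For example, a sketch of the same before_feature that fills the rows from a hypothetical examples.csv (each line holding a number,result pair to match the two-column table shape):
import copy
import csv

import behave

def before_feature(context, feature):
    outlines = (s for s in feature.scenarios
                if type(s) == behave.model.ScenarioOutline and 'dynamic' in s.tags)
    for s in outlines:
        for e in s.examples:
            template = copy.deepcopy(e.table.rows[0])
            e.table.rows = []
            # 'examples.csv' is a hypothetical file; each line holds a
            # number,result pair matching the two-column table shape
            with open('examples.csv', newline='') as f:
                for number, result in csv.reader(f):
                    row = copy.deepcopy(template)
                    row.cells = [number, result]
                    e.table.rows.append(row)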
Got here looking for something else, but since I've been in a similar situation with Cucumber before, maybe someone will also end up at this question looking for a possible solution. My approach to this problem is to use BDD variables that I can later handle at runtime in my step definitions. In my Python code I can check the value of the Gherkin variable and map it to whatever is needed.
For this example:
  Scenario Outline: Use Blender with <thing>
    Given I put "<thing>" in a blender
    When I switch the blender on
    Then it should transform into "<other thing>"

    Examples: Amphibians
      | thing         | other thing            |
      | Red Tree Frog | mush                   |
      | iPhone        | data.iPhone.secret_key | # can use .yaml syntax here as well
Would translate to such step_def code:
@then('it should transform into "{other_thing}"')
def step_then_should_transform_into(context, other_thing):
    if other_thing == BddVariablesEnum.SECRET_KEY:
        basic_actions.load_secrets(context, key)
So all you have to do is have a well-defined DSL layer.
Regarding the issue of using SSN numbers in testing, I'd just use fake SSNs and not worry that I'm leaking people's private information.
Ok, but what about the larger issue? You want to use a scenario outline with examples that you cannot put in your feature file. Whenever I've run into this problem what I did was to give a description of the data I need and let the step implementation either create the actual data set used for testing or fetch the data set from an existing test database.
  Scenario Outline: Accessing the admin interface
    Given a user who <status> an admin has logged in
    Then the user <ability> see the admin interface

    Examples: Users
      | status | ability |
      | is     | can     |
      | is not | cannot  |
There's no need to show any details about the user in the feature file. The step implementation is responsible for either creating or fetching the appropriate type of user depending on the value of status.
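A sketch of what those step implementations could look like (test_db and its helpers are hypothetical; the point is that the confidential data lives outside the feature file):
from behave import given, then

@given('a user who {status} an admin has logged in')
def step_given_user_logged_in(context, status):
    # test_db.fetch_user is a hypothetical helper that creates or fetches
    # a suitable user; no real ids or SSNs appear in the feature file
    context.user = test_db.fetch_user(is_admin=(status == 'is'))
    context.session = context.user.log_in()

@then('the user {ability} see the admin interface')
def step_then_check_admin(context, ability):
    assert context.session.can_access_admin() == (ability == 'can')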
