This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I am working on text data stored as a string. I'd like to know how to extract parts of the string shown below:
data = '<?xml version_history="1.0" encoding="utf-8"?><feed xml:base="https://dummydomain.facebook.com/" xmlns="http://www.w3.org/2005/Atom" xmlns:d="http://schemas.microsoft.com/ado/2008/09/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2008/09/dataservices/metadata" xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"><id>aad232-c2cc-42ca-ac1e-e1d1b4dd55de</id><title<d:VersionType>3.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_h84_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>3. Contract Signed<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2020-07-30T12:15:04Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Amy, Jackson</d:LookupValue><d:Email>Amy.Jackson#doe.com</d:Email><id>af62fe09-fds2-42ca-a12c1e-e1d1b4dd55de</id><title<d:VersionType>2.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_x0342fs_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>2. Active Discussion<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2020-02-15T18:15:60Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Sam, Joseph</d:LookupValue><d:Email>Sam. Joseph #doe.com</d:Email></d:LookupValue><d:Email>Amy.Jackson#doe.com</d:Email><id>af62fe09-fds2-42ca-a12c1e-e1d1b4dd55de</id><title<d:VersionType>1.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_x0342fs_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>1. 
Exploratory<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2019-07-15T10:20:04Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Sam, Joseph</d:LookupValue><d:Email>Sam. Joseph #doe.com</d:Email>'
I want to extract all of <d:VersionType>, <d:Name>, <d:Stage>, and <d:Modified m:type="Edm.DateTime">.
Expected outputs:
d:VersionType d:Name d:Stage d:Modified m:type="Edm.DateTime"
3.0 XYZ Company 3. Contract Signed 2020-07-30T12:15:04Z
2.0 XYZ Company 2. Active Discussion 2020-02-15T18:15:60Z
1.0 XYZ Company 1. Exploratory 2019-07-15T10:20:04Z
Thanks in advance for your help!
Try using Beautiful Soup, as it lets you parse XML, HTML, and other documents. Such files already have a defined structure, so you don't have to build a regex from scratch, which makes the job much easier.
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'xml')
version_type = [item.text for item in soup.findAll('d:VersionType')] # gives ['3.0', '2.0', '1.0']
Replace d:VersionType with the other elements you want (d:Name, d:Stage, ...) to extract their contents as well.
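If Beautiful Soup is not available, a quick regex fallback is possible for this specific, not-quite-well-formed snippet. This is a sketch only: the shortened data string below is a stand-in for the string in the question, and the pattern relies on the text of interest sitting directly after each opening tag.

```python
import re

# Shortened stand-in for the `data` string in the question.
data = ('<d:VersionType>3.0</d:VersionLabel><d:Name>XYZ Company</d:Title>'
        '<d:Stage>3. Contract Signed<d:Modified m:type="Edm.DateTime">'
        '2020-07-30T12:15:04Z</d:Modified>')

def texts(tag, text):
    """Capture the text run right after each <tag ...> opening tag."""
    return re.findall(r'<' + re.escape(tag) + r'[^>]*>([^<]*)', text)

rows = list(zip(texts('d:VersionType', data), texts('d:Name', data),
                texts('d:Stage', data), texts('d:Modified', data)))
print(rows)
# [('3.0', 'XYZ Company', '3. Contract Signed', '2020-07-30T12:15:04Z')]
```

This grabs only character data up to the next '<', so it tolerates the missing closing tags in the snippet, but it would break on any more complex markup.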
All company concepts in XBRL format can be extracted with the SEC's RESTful API.
For example, to get Tesla's concepts in XBRL format for 2020, first get Tesla's CIK and build the API URL:
cik='1318605'
url = 'https://data.sec.gov/api/xbrl/companyfacts/CIK{:>010s}.json'.format(cik)
To select the 2020 annual figures, filter on the fy and fp fields:
record['fy'] == 2020 and record['fp'] == 'FY'
Here is the complete Python code to call the SEC's API:
import requests
import json

cik = '1318605'
url = 'https://data.sec.gov/api/xbrl/companyfacts/CIK{:>010s}.json'.format(cik)
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
}
res = requests.get(url=url, headers=headers)
result = json.loads(res.text)

concepts_list = list(result['facts']['us-gaap'].keys())
data_concepts = result['facts']['us-gaap']
fdata = {}
for item in concepts_list:
    data = data_concepts[item]['units']
    data_units = list(data_concepts[item]['units'].keys())
    for data_units_attr in data_units:
        for record in data[data_units_attr]:
            if record['fy'] == 2020 and record['fp'] == 'FY':
                fdata[item] = record['val']
fdata now contains every company concept and its 2020 value for Tesla. Part of it:
fdata
{'AccountsAndNotesReceivableNet': 334000000,
 'AccountsPayableCurrent': 6051000000,
 'AccountsReceivableNetCurrent': 1886000000,
 ...}
How can I get all the concepts belonging to the income statement? I want to extract them to build an income statement.
Maybe I should also add some values from dei, in almost the same way as above:
EntityCommonStockSharesOutstanding:959853504
EntityPublicFloat:160570000000
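Those dei values can indeed be pulled in almost the same way. A minimal sketch follows; the nested result dict below is a hypothetical stand-in for the parsed companyfacts JSON, and the assumption that the real response has this same shape under result['facts']['dei'] should be verified:

```python
# Hypothetical stand-in for the parsed companyfacts JSON from above.
result = {'facts': {'dei': {
    'EntityCommonStockSharesOutstanding': {'units': {'shares': [
        {'fy': 2019, 'fp': 'FY', 'val': 900000000},
        {'fy': 2020, 'fp': 'FY', 'val': 959853504},
    ]}},
}}}

# Same loop shape as the us-gaap extraction, pointed at the dei taxonomy.
dei_data = {}
for concept, payload in result['facts'].get('dei', {}).items():
    for records in payload['units'].values():
        for record in records:
            if record.get('fy') == 2020 and record.get('fp') == 'FY':
                dei_data[concept] = record['val']

print(dei_data)  # {'EntityCommonStockSharesOutstanding': 959853504}
```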
It is simple to parse a financial statement, such as the income statement, from the iXBRL file:
https://www.sec.gov/ix?doc=/Archives/edgar/data/0001318605/000156459021004599/tsla-10k_20201231.htm
Please help me replicate Tesla's 2020 annual income statement from the SEC's RESTful API, or extract the income statement from the whole instance file:
https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/0001564590-21-004599.txt
If you tell me how to get all the concepts belonging to the income statement, I can finish the job; the balance sheet and cash flow statement can be extracted in the same way. In my case fdata contains all the concepts belonging to the income, balance, and cash flow statements. Which concept in fdata belongs to which financial statement? How do I map every concept to the income, balance, or cash flow statement?
#expression in pseudocode
income_statement = fdata[all_concepts_belong_to_income_statement]
balance_statement = fdata[all_concepts_belong_to_balance_statement]
cashflow_statement = fdata[all_concepts_belong_to_cashflow_statement]
If I understand you correctly, the following should help you get at least part of the way there. I'm not going to try to replicate the financial statements; instead, I'll pick just one and show the concepts used in creating it; you'll have to take it from there.
I'll use the "income statement" (formally called, in the instance document, "Consolidated Statements of Comprehensive Income (Loss) - USD ($) - $ in Millions") as the example. Note that the data is not in JSON, but HTML.
#import the necessary libraries
import lxml.html as lh
import requests
#get the filing
url = 'https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/0001564590-21-004599.txt'
req = requests.get(url, headers={"User-Agent": "b2g"})
#parse the filing content
doc = lh.fromstring(req.content)
#locate the relevant table and get the data
concepts = doc.xpath("//p[@id='Consolidated_Statmnts_of_Cmprehnsve_Loss']/following-sibling::div[1]//table//@name")
for c in set(concepts):
    print(c)
Output:
There are 5 concepts, repeated 3 times each:
us-gaap:ComprehensiveIncomeNetOfTaxAttributableToNoncontrollingInterest
us-gaap:OtherComprehensiveIncomeLossForeignCurrencyTransactionAndTranslationAdjustmentNetOfTax
us-gaap:ProfitLoss
us-gaap:ComprehensiveIncomeNetOfTax
us-gaap:ComprehensiveIncomeNetOfTaxIncludingPortionAttributableToNoncontrollingInterest
I found an indirect way to extract the income statement via the SEC's API. The SEC already publishes all the data extracted from the raw XBRL files (the XBRL instances are published as well), but the tool that parses an XBRL file into the four files num.txt, sub.txt, pre.txt, and tag.txt is not published. What I want to do is create that tool myself.
Step 1:
Download the 2020 dataset from https://www.sec.gov/dera/data/financial-statement-data-sets.html and unzip it to get pre.txt, num.txt, sub.txt, and tag.txt.
Step 2:
Create a database and the tables pre, num, sub, and tag according to the fields in pre.txt, num.txt, sub.txt, and tag.txt.
Step 3:
Import pre.txt, num.txt, sub.txt, and tag.txt into the tables pre, num, sub, and tag.
Step 4:
Query in PostgreSQL. The adsh number for Tesla's 2020 financial statements is `0001564590-21-004599`:
\set accno '0001564590-21-004599'
select tag,value from num where adsh=:'accno' and ddate = '2020-12-31' and qtrs=4 and tag in
(select tag from pre where adsh=:'accno' and stmt='IS');
tag | value
---------------------------------------------------------------------------------------------+------------------
NetIncomeLoss | 721000000.0000
OperatingLeasesIncomeStatementLeaseRevenue | 1052000000.0000
GrossProfit | 6630000000.0000
InterestExpense | 748000000.0000
CostOfRevenue | 24906000000.0000
WeightedAverageNumberOfSharesOutstandingBasic | 933000000.0000
IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest | 1154000000.0000
EarningsPerShareDiluted | 0.6400
ProfitLoss | 862000000.0000
OperatingExpenses | 4636000000.0000
InvestmentIncomeInterest | 30000000.0000
OperatingIncomeLoss | 1994000000.0000
SellingGeneralAndAdministrativeExpense | 3145000000.0000
NetIncomeLossAttributableToNoncontrollingInterest | 141000000.0000
StockholdersEquityNoteStockSplitConversionRatio1 | 5.0000
NetIncomeLossAvailableToCommonStockholdersBasic | 690000000.0000
Revenues | 31536000000.0000
IncomeTaxExpenseBenefit | 292000000.0000
OtherNonoperatingIncomeExpense | -122000000.0000
WeightedAverageNumberOfDilutedSharesOutstanding | 1083000000.0000
EarningsPerShareBasic | 0.7400
ResearchAndDevelopmentExpense | 1491000000.0000
BuyOutOfNoncontrollingInterest | 31000000.0000
CostOfAutomotiveLeasing | 563000000.0000
CostOfRevenuesAutomotive | 20259000000.0000
CostOfServicesAndOther | 2671000000.0000
SalesRevenueAutomotive | 27236000000.0000
SalesRevenueServicesAndOtherNet | 2306000000.0000
(28 rows)
stmt='IS' selects the tags presented on the income statement; ddate='2020-12-31' selects the 2020 annual report. Please read the SEC's field definition for qtrs to see why qtrs=4 is set in the select command.
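For reference, the same stmt='IS' join can be sketched without a database, using only Python's csv module. The tiny inline rows below are hypothetical stand-ins (the real pre.txt and num.txt have many more columns, and num stores ddate as YYYYMMDD):

```python
import csv
import io

def read_tsv(text):
    """Parse tab-separated text with a header row into dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter='\t'))

def income_statement(num_rows, pre_rows, adsh, ddate='20201231', qtrs='4'):
    """Map num records onto the income statement via pre's stmt column."""
    is_tags = {r['tag'] for r in pre_rows
               if r['adsh'] == adsh and r['stmt'] == 'IS'}
    return {r['tag']: r['value'] for r in num_rows
            if r['adsh'] == adsh and r['ddate'] == ddate
            and r['qtrs'] == qtrs and r['tag'] in is_tags}

# Tiny stand-in rows; real files come from the SEC dataset zip.
pre_rows = read_tsv("adsh\tstmt\ttag\n"
                    "0001564590-21-004599\tIS\tGrossProfit\n")
num_rows = read_tsv("adsh\ttag\tddate\tqtrs\tvalue\n"
                    "0001564590-21-004599\tGrossProfit\t20201231\t4\t6630000000\n")

print(income_statement(num_rows, pre_rows, '0001564590-21-004599'))
# {'GrossProfit': '6630000000'}
```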
Compare with https://www.nasdaq.com/market-activity/stocks/tsla/financials (part of it shown below; look at the 12/31/2020 column):
Period Ending: 12/31/2021 12/31/2020 12/31/2019 12/31/2018
Total Revenue $53,823,000 $31,536,000 $24,578,000 $21,461,000
Cost of Revenue $40,217,000 $24,906,000 $20,509,000 $17,419,000
Gross Profit $13,606,000 $6,630,000 $4,069,000 $4,042,000
Research and Development $2,593,000 $1,491,000 $1,343,000 $1,460,000
All the numbers are equal. The XBRL tags are Revenues, CostOfRevenue, GrossProfit, and ResearchAndDevelopmentExpense; the respective financial accounting terms are Total Revenue, Cost of Revenue, Gross Profit, and Research and Development. The numbers I get are correct.
Some day I may find the direct way to parse the raw XBRL file to get the income statement. I am not familiar with XBRL yet, and I hope the Stack Overflow community can help me reach that goal.
I'm still new to Node.js and currently developing a small app for my kitchen. The app can scan recipes and uses OCR to extract the text. For the OCR extraction I'm using the ocr-space web API. Afterwards I need to parse the raw text into a JSON structure and send it to my database. I've also tested this recipe with AWS Textract, which gave me an even poorer result.
Currently I'm struggling with the parsing part, using RegEx in Node.js.
Here is the JSON structure into which I parse the recipe data:
receipt = {
    title: 'title of recipe',
    items: [
        'item1',
        'item2',
        'item3'
    ],
    preparation: 'preparation text'
}
As most recipes have an items part followed by a preparation part, my general approach so far is:
Search for keywords like 'items' and 'preparation' in the raw text
Parse the text between these keywords
Do further string processing: fixing missing whitespace, trimming, etc.
This approach doesn't work if these keywords are missing. Take, for example, the following recipe, which I'm struggling to parse into my JSON structure. The recipe is in German and contains none of the corresponding keywords ('items'/'Zutaten', 'preparation'/'Zubereitung').
The following information from the raw text is needed:
title: line 1
items: line 2 - 8
preparation: line 9 until end
Do you have any hints or tips on how to get closer to a solution? Or any other ideas for handling such situations?
Quinoa-Brot
30 g Chiasamen
350 g Quinoa
70 ml Olivenöl
1/2 TL Speisenatron
1 Prise Salz
Saft von 1/2 Zitrone
1 Handvoll Sonnenblumenkerne
30 g Schwarzkümmelsamen
1 Chiasamen mit 100 ml Wasser
verrühren und 30 Minuten quel-
len lassen. Den Ofen auf 200 oc
vorheizen, eine kleine Kastenform
mit Backpapier auslegen.
2 Quinoa mit der dreifachen
Menge Wasser in einen Topf ge-
ben, einmal aufkochen und dann
3 Minuten köcheln lassen - die
Quinoa wird so nur teilweise ge-
gegart. In ein Sieb abgießen, kalt
abschrecken und anschließend
gut abtropfen lassen.
Between each line there is a \n (newline) character.
The parsed recipe should look like this:
receipt = {
    title: 'Quinoa-Brot',
    items: [
        '30 g Chiasamen',
        '350 g Quinoa',
        '70 ml Olivenöl',
        '1/2 TL Speisenatron',
        '1 Prise Salz',
        'Saft von 1/2 Zitrone',
        '1 Handvoll Sonnenblumenkerne',
        '30 g Schwarzkümmelsamen'
    ],
    preparation: '1 Chiasamen mit 100 ml Wasser verrühren und 30 Minuten quellen lassen. Den Ofen auf 200 oc vorheizen, eine kleine Kastenform mit Backpapier auslegen. 2 Quinoa mit der dreifachen Menge Wasser in einen Topf geben, einmal aufkochen und dann 3 Minuten köcheln lassen - die Quinoa wird so nur teilweise gegart. In ein Sieb abgießen, kalt abschrecken und anschließend gut abtropfen lassen.'
}
Pattern-matching solutions like RegExp don't sound suitable for this sort of categorization problem. You might want to consider clustering (k-means, etc.): training a model to differentiate between ingredients and instructions. This can be done by labeling a number of recipes (the more the better), or by unsupervised ML, clustering line by line.
If you need to stick with RegExp for some reason, you could keep track of repeated words. A weak heuristic: ingredient names (Chiasamen, Quinoa, ...) will be referenced again in the instructions, so you can match in multiline mode to find where the same word is repeated later on:
(?<=\b| )([^ ]+)(?= |$).+(\1)
If you run this in a loop, plus some logic, you can find ingredient-instruction pairs and work through the document with that silhouette information.
You might be able to take advantage of ingredient lines containing numeric data, or unit words like "piece(s)", "sticks", or "leaves", which you could store in a dictionary. That would enrich the word-boundary input matches.
I would reconsider using RegExp here at all...
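For what it's worth, the unit-dictionary heuristic can be sketched quickly (Python here for brevity; the regex carries over to Node.js unchanged). The unit list is a hypothetical starting point, and a line like 'Saft von 1/2 Zitrone' would still need extra rules:

```python
import re

# Shortened stand-in for the OCR lines from the question.
lines = [
    "Quinoa-Brot",
    "30 g Chiasamen",
    "350 g Quinoa",
    "1 Prise Salz",
    "1 Chiasamen mit 100 ml Wasser",
    "verrühren und 30 Minuten quellen lassen.",
]

# A line starting with a quantity plus a known unit is assumed to be an
# ingredient; the unit alternation is a small hypothetical dictionary.
UNIT = re.compile(r'^\d+(?:/\d+)?\s*(?:g|kg|ml|l|TL|EL|Prise|Handvoll)\b')

title = lines[0]
# Preparation starts at the first line after the title that is not
# quantity+unit ("1 Chiasamen mit ..." has no unit after the number).
first_prep = next(i for i, ln in enumerate(lines[1:], start=1)
                  if not UNIT.match(ln))
recipe = {
    'title': title,
    'items': lines[1:first_prep],
    'preparation': ' '.join(lines[first_prep:]),
}
print(recipe['items'])  # ['30 g Chiasamen', '350 g Quinoa', '1 Prise Salz']
```

This only works because the numbered preparation steps happen to lack a unit word right after the leading digit, which is exactly the kind of fragility the clustering suggestion above avoids.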
I have a dataset which looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Reviews>
<Review rid="1004293">
<sentences>
<sentence id="1004293:0">
<text>Judging from previous posts this used to be a good place, but not any longer.</text>
<Opinions>
<Opinion target="place" category="RESTAURANT#GENERAL" polarity="negative" from="51" to="56"/>
</Opinions>
</sentence>
<sentence id="1004293:1">
<text>The food here is rather good, but only if you like to wait for it.</text>
<Opinions>
<Opinion target="food" category="FOOD#QUALITY" polarity="positive" from="4" to="8"/>
<Opinion target="NULL" category="SERVICE#GENERAL" polarity="negative" from="0" to="0"/>
</Opinions>
</sentence>
...
How can I parse the data from this .xml file into a .tsv file in the following format:
["negative", "Judging from previous posts this used to be a good place, but not any longer.", "RESTAURANT#GENERAL"]
["positive", "The food here is rather good, but only if you like to wait for it.","FOOD#QUALITY"]
["negative", "The food here is rather good, but only if you like to wait for it.","SERVICE#GENERAL"]
Thanks!
You can use Python's xml.etree.ElementTree package to get your desired output. The code below prints your lists; to create a TSV, replace the print with writes to a .tsv file.
The sample.xml file must be present in the same directory where this code is present.
from xml.etree import ElementTree

file = 'sample.xml'
tree = ElementTree.parse(file)
root = tree.getroot()
for sentence in root.iter('sentence'):
    # Loop over all sentences in the XML
    for opinion in sentence.iter('Opinion'):
        # Loop over all Opinions of a particular sentence
        print([opinion.attrib['polarity'], sentence.find('text').text, opinion.attrib['category']])
Output:
['negative', 'Judging from previous posts this used to be a good place, but not any longer.', 'RESTAURANT#GENERAL']
['positive', 'The food here is rather good, but only if you like to wait for it.', 'FOOD#QUALITY']
['negative', 'The food here is rather good, but only if you like to wait for it.', 'SERVICE#GENERAL']
sample.xml contains:
<Reviews>
<Review rid="1004293">
<sentences>
<sentence id="1004293:0">
<text>Judging from previous posts this used to be a good place, but not any longer.</text>
<Opinions>
<Opinion target="place" category="RESTAURANT#GENERAL" polarity="negative" from="51" to="56"/>
</Opinions>
</sentence>
<sentence id="1004293:1">
<text>The food here is rather good, but only if you like to wait for it.</text>
<Opinions>
<Opinion target="food" category="FOOD#QUALITY" polarity="positive" from="4" to="8"/>
<Opinion target="NULL" category="SERVICE#GENERAL" polarity="negative" from="0" to="0"/>
</Opinions>
</sentence>
</sentences>
</Review>
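Replacing the print with a csv.writer gives the .tsv file directly. A sketch (the XML is inlined and shortened here; output.tsv is an assumed filename):

```python
import csv
from xml.etree import ElementTree

# Shortened inline version of sample.xml for a self-contained example.
xml = """<Reviews><Review rid="1004293"><sentences>
<sentence id="1004293:0">
<text>Judging from previous posts this used to be a good place, but not any longer.</text>
<Opinions><Opinion target="place" category="RESTAURANT#GENERAL" polarity="negative" from="51" to="56"/></Opinions>
</sentence>
</sentences></Review></Reviews>"""

root = ElementTree.fromstring(xml)
with open('output.tsv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter='\t')
    for sentence in root.iter('sentence'):
        for opinion in sentence.iter('Opinion'):
            writer.writerow([opinion.attrib['polarity'],
                             sentence.find('text').text,
                             opinion.attrib['category']])
```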
I'm working on a coreference-resolution system based on neural networks for my Bachelor's thesis, and I have a problem when reading the corpus.
The corpus is already preprocessed, and I only need to read it to do my stuff. I use Beautiful Soup 4 to read the XML files of each document that contain the data I need.
The files look like this:
<?xml version='1.0' encoding='ISO-8859-1'?>
<!DOCTYPE markables SYSTEM "markables.dtd">
<markables xmlns="www.eml.org/NameSpaces/markable">
<markable id="markable_102" span="word_390" grammatical_role="vc" coref_set="empty" visual="none" rel_type="none" np_form="indefnp" type="" entity="NO" nb="UNK" def="INDEF" sentenceid="19" lemmata="premia" pos="nn" head_pos="word_390" wikipedia="" mmax_level="markable"/>
<markable id="markable_15" span="word_48..word_49" grammatical_role="vc" coref_set="empty" visual="none" rel_type="none" np_form="defnp" type="" entity="NO" nb="SG" def="DEF" sentenceid="3" lemmata="Grozni hegoalde" pos="nnp nn" head_pos="word_48" wikipedia="Grozny" mmax_level="markable"/>
<markable id="markable_101" span="word_389" grammatical_role="sbj" coref_set="set_21" coref_type="named entities" visual="none" rel_type="coreferential" sub_type="exact repetition" np_form="ne_o" type="enamex" entity="LOC" nb="SG" def="DEF" sentenceid="19" lemmata="Mosku" pos="nnp" head_pos="word_389" wikipedia="" mmax_level="markable"/>
...
I need to extract all the spans here, so I try to do it with this code (Python 3):
...
from bs4 import BeautifulSoup
...
file1 = markables+filename+"_markable_level.xml"
xml1 = open(file1) #markable
soup1 = BeautifulSoup(xml1, "html5lib") #markable
...
...
for markable in soup1.findAll('markable'):
    try:
        span = markable.contents[1]['span']
        print(span)
        spanA = span.split("..")[0]
        spanB = span.split("..")[-1]
        ...
(I've omitted most of the code, as it is about 500 lines.)
python3 aurreprozesaketaSTM.py
train
--- 28.329787254333496 seconds ---
&&&&&&&&&&&&&&&&&&&&&&&&& egun.06-1-p0002500.2000-06-01.europa
word_48..word_49
word_389
word_385..word_386
word_48..word_52
...
If you compare the XML file with the output, you can see that word_390 is missing.
I get almost all the data I need, then preprocess everything, build the system with neural networks, and finally get scores and so on.
But as I lose the first word of each document, my system's accuracy is a bit lower than it should be.
Can anyone help me with this? Any idea where the problem is?
You are parsing XML with html5lib, which is not an XML parser. Per the Beautiful Soup documentation, lxml's XML parser is "the only currently supported XML parser":
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
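Alternatively, a standard-library sketch with xml.etree (not the lxml route the documentation recommends for Beautiful Soup, but it shows that a real XML parser keeps word_390; the inlined snippet is shortened from the question):

```python
from xml.etree import ElementTree

# Shortened markables file; the real one has many more attributes.
xml = """<markables xmlns="www.eml.org/NameSpaces/markable">
<markable id="markable_102" span="word_390"/>
<markable id="markable_15" span="word_48..word_49"/>
</markables>"""

root = ElementTree.fromstring(xml)
# The file declares a default namespace, so findall needs a prefix map.
ns = {'m': 'www.eml.org/NameSpaces/markable'}
spans = [m.get('span') for m in root.findall('m:markable', ns)]
print(spans)  # ['word_390', 'word_48..word_49'] (word_390 is kept)
```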
I have looked at the other question about parsing XML with namespaces in Python via ElementTree and reviewed the xml.etree.ElementTree documentation. The issue I'm having is admittedly similar, so feel free to tag this as a duplicate, but I can't figure it out.
The line of code I'm having issues with is:
instance_alink = root.find('{http://www.w3.org/2005/Atom}link')
My code is as follows:
import xml.etree.cElementTree as ET
tree = ET.parse('../../external_data/rss.xml')
root = tree.getroot()
instance_title = root.find('channel/title').text
instance_link = root.find('channel/link').text
instance_alink = root.find('{http://www.w3.org/2005/Atom}link')
instance_description = root.find('channel/description').text
instance_language = root.find('channel/language').text
instance_pubDate = root.find('channel/pubDate').text
instance_lastBuildDate = root.find('channel/lastBuildDate').text
The XML file:
<?xml version="1.0" encoding="windows-1252"?>
<rss version="2.0">
<channel>
<title>Filings containing financial statements tagged using the US GAAP or IFRS taxonomies.</title>
<link>http://www.example.com</link>
<atom:link href="http://www.example.com" rel="self" type="application/rss+xml" xmlns:atom="http://www.w3.org/2005/Atom"/>
<description>This is a list of up to 200 of the latest filings containing financial statements tagged using the US GAAP or IFRS taxonomies, updated every 10 minutes.</description>
<language>en-us</language>
<pubDate>Mon, 20 Nov 2017 20:20:45 EST</pubDate>
<lastBuildDate>Mon, 20 Nov 2017 20:20:45 EST</lastBuildDate>
....
The attributes I'm trying to retrieve are on line 6, i.e. 'href', 'type', etc.:
<atom:link href="http://www.example.com" rel="self" type="application/rss+xml" xmlns:atom="http://www.w3.org/2005/Atom"/>
Obviously, I've tried
instance_alink = root.find('{http://www.w3.org/2005/Atom}link').attrib
but that doesn't work, because find() returns None here. My thought is that it's looking for children, but there are none. I can grab the attributes from the other lines in the XML, but not these, for some reason. I've also played with ElementTree and lxml (but lxml won't install properly on Windows for whatever reason).
Any help is greatly appreciated, because the documentation seems sparse.
I was able to solve it with
alink = root.find('channel/{http://www.w3.org/2005/Atom}link').attrib
The issue was that I was looking for the tag {http://www.w3.org/2005/Atom}link at the same level as <channel>, where, of course, it doesn't exist.
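A minimal, self-contained sketch of the fix (the feed below is shortened to the relevant elements):

```python
from xml.etree import ElementTree as ET

# Shortened stand-in for rss.xml.
rss = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <atom:link href="http://www.example.com" rel="self"
               type="application/rss+xml"
               xmlns:atom="http://www.w3.org/2005/Atom"/>
  </channel>
</rss>"""

root = ET.fromstring(rss)
# The atom:link element is a child of <channel>, so the find path must
# descend through it; the {namespace} replaces the atom: prefix.
alink = root.find('channel/{http://www.w3.org/2005/Atom}link').attrib
print(alink['href'])  # http://www.example.com
print(alink['type'])  # application/rss+xml
```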