Avoiding a cartesian product when adding a unique classifier to a list in Python 3

I have six .csv files I am importing and all contain emails:
Donors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Donors Q1 2021 R12.csv",
    usecols=["Email Address"])
Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Activists Q1 2021 R12.csv",
    usecols=["Email"])
Low_Level_Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Low Level Activists Q1 2021 R12.csv",
    usecols=["Email"])
Ambassadors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Ambassadors Q1 2021.csv",
    usecols=["Email Address"])
Volunteers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Volunteers Q1 2021 R12.csv",
    usecols=["Email Address"])
Followers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Followers Q1 2021 R12.csv",
    usecols=["Email"])
While I am only importing emails (annoyingly under two different column names because of the systems they originate from), I am adding the import name as a classifier - i.e. Donors, Volunteers, etc.
Donors['Value'] = "Donors"
Activists['Value'] = "Activists"
Low_Level_Activists['Value'] = "Low_Level_Activists"
Ambassadors['Value'] = "Ambassadors"
Volunteers['Value'] = "Volunteers"
Followers['Value'] = "Followers"
I then concatenate all the files and handle the naming issue. I am sure there is a more elegant way to do this but here's what I have:
S1 = pd.concat([Donors, Activists, Low_Level_Activists, Ambassadors, Volunteers, Followers], ignore_index=True)
S1['Handle'] = S1['Email Address'].where(S1['Email Address'].notnull(), S1['Email'])
S1= S1.drop(['Email','Email Address'], axis = 1)
print(S1['Handle'].count()) #checks full count
The total on that last line is 166,749
Here is my problem. I need to filter the emails down to unique values - easy enough using .nunique() - but I also need to carry the classifier along. So if a single email is both a Donor and an Activist, I pull both rows when I try to merge the unique values with the classifier.
I have been at this for many hours (and to the ends of the Internet!) and can't seem to find a workable solution. I've tried dictionaries, for loops, merges, etc. ad infinitum. The unique email count is 165,923 (figured out via Python and/or Excel).
Essentially I want to keep the earliest classifier in my list on a match. So if an email is both a Donor and an Activist -> call them a Donor. Or if an email is both a Volunteer and a Follower -> call them a Volunteer, on a single email record.
Any help would be greatly appreciated.

I'll give it a try with some made-up data:
import pandas as pd
fa = pd.DataFrame([['paul@mail.com', 'Donors'], ['max@mail.com', 'Donors']], columns=['Handle', 'Value'])
fb = pd.DataFrame([['paul@mail.com', 'Activists'], ['annie@mail.com', 'Activists']], columns=['Handle', 'Value'])
S1 = pd.concat([fa, fb])
print(S1)
gives
Handle Value
0 paul@mail.com Donors
1 max@mail.com Donors
0 paul@mail.com Activists
1 annie@mail.com Activists
You can group by Handle and then pick any Value you like, e.g. the first:
for handle, group in S1.groupby('Handle'):
    print(handle, group.reset_index().loc[0, 'Value'])
gives
annie@mail.com Activists
max@mail.com Donors
paul@mail.com Donors
or collect all roles of a person:
for handle, group in S1.groupby('Handle'):
    print(handle, group.Value.unique())
gives
annie@mail.com ['Activists']
max@mail.com ['Donors']
paul@mail.com ['Donors' 'Activists']
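Applied back to the original problem, one way to keep the earliest classifier per email is to make Value an ordered categorical, sort by it, and drop duplicate handles. A minimal sketch using the made-up S1 above (the priority list is an assumption mirroring the order described in the question):
# priority order is assumed from the question's description
priority = ['Donors', 'Activists', 'Low_Level_Activists',
            'Ambassadors', 'Volunteers', 'Followers']
S1['Value'] = pd.Categorical(S1['Value'], categories=priority, ordered=True)
# sort so the highest-priority classifier comes first, then keep one row per email
unique = (S1.sort_values('Value')
            .drop_duplicates(subset='Handle', keep='first')
            .reset_index(drop=True))
print(unique)
# paul@mail.com ends up as Donors, annie@mail.com as Activists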

Related

How can I extract an income statement from all company concepts?

All company concepts in XBRL format can be extracted with the SEC's RESTful API.
For example, I want to get Tesla's concepts in XBRL format for 2020, so I take Tesla's CIK and build the URL for the API:
cik='1318605'
url = 'https://data.sec.gov/api/xbrl/companyfacts/CIK{:>010s}.json'.format(cik)
To select the 2020 annual figures, the elements fy and fp must satisfy:
'fy' == 2020 and 'fp' == 'FY'
I wrote the following Python code to call the SEC's API:
import requests
import json
cik='1318605'
url = 'https://data.sec.gov/api/xbrl/companyfacts/CIK{:>010s}.json'.format(cik)
headers = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"
}
res = requests.get(url=url,headers=headers)
result = json.loads(res.text)
concepts_list = list(result['facts']['us-gaap'].keys())
data_concepts = result['facts']['us-gaap']
fdata = {}
for item in concepts_list:
    data = data_concepts[item]['units']
    data_units = list(data_concepts[item]['units'].keys())
    for data_units_attr in data_units:
        for record in data[data_units_attr]:
            if record['fy'] == 2020 and record['fp'] == 'FY':
                fdata[item] = record['val']
fdata contains all the company concepts and their 2020 values for Tesla; here is part of it:
fdata
{'AccountsAndNotesReceivableNet': 334000000,
'AccountsPayableCurrent': 6051000000,
'AccountsReceivableNetCurrent': 1886000000,
How can I get all the concepts that belong to the income statement? I want to extract them to build an income statement.
Maybe I should also add some values from dei in almost the same way as above:
EntityCommonStockSharesOutstanding:959853504
EntityPublicFloat:160570000000
It is simple to read a financial statement such as the income statement from the iXBRL file:
https://www.sec.gov/ix?doc=/Archives/edgar/data/0001318605/000156459021004599/tsla-10k_20201231.htm
I can open that, but please help me replicate Tesla's 2020 annual income statement from the SEC's RESTful API, or extract the income statement from the whole instance file:
https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/0001564590-21-004599.txt
If you tell me how to get all the concepts that belong to the income statement, I can finish the job; the balance sheet and cash flow statement can be extracted the same way. In my case fdata contains the concepts of the income statement, balance sheet and cash flow statement all together - which concept in fdata belongs to which financial statement? How do I map every concept to the income, balance, or cash flow statement?
#expression in pseudocode
income_statement = fdata[all_concepts_belong_to_income_statement]
balance_statement = fdata[all_concepts_belong_to_balance_statement]
cashflow_statement = fdata[all_concepts_belong_to_cashflow_statement]
If I understand you correctly, the following should help you get at least part of the way there. I'm not going to try to replicate the financial statements; instead, I'll choose only one and show the concepts used in creating that statement - you'll have to take it from there.
I'll use the "income statement" (formally called in the instance document: "Consolidated Statements of Comprehensive Income (Loss) - USD ($) - $ in Millions") as the example. Again, the data is not in json, but html.
#import the necessary libraries
import lxml.html as lh
import requests
#get the filing
url = 'https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/0001564590-21-004599.txt'
req = requests.get(url, headers={"User-Agent": "b2g"})
#parse the filing content
doc = lh.fromstring(req.content)
#locate the relevant table and get the data
concepts = doc.xpath("//p[@id='Consolidated_Statmnts_of_Cmprehnsve_Loss']/following-sibling::div[1]//table//@name")
for c in set(concepts):
    print(c)
Output:
There are 5 concepts, repeated 3 times each:
us-gaap:ComprehensiveIncomeNetOfTaxAttributableToNoncontrollingInterest
us-gaap:OtherComprehensiveIncomeLossForeignCurrencyTransactionAndTranslationAdjustmentNetOfTax
us-gaap:ProfitLoss
us-gaap:ComprehensiveIncomeNetOfTax
us-gaap:ComprehensiveIncomeNetOfTaxIncludingPortionAttributableToNoncontrollingInterest
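As a follow-up, those concept names can be used to slice the fdata dictionary built earlier - a rough sketch reusing the concepts and fdata variables from the code above (the "us-gaap:" prefix is stripped to match fdata's keys):
income_statement = {}
for tag in set(concepts):
    # concepts look like 'us-gaap:ProfitLoss'; fdata is keyed by the bare name
    name = tag.split(':', 1)[1]
    if name in fdata:
        income_statement[name] = fdata[name]
print(income_statement)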
I found an indirect way to extract the income statement via the SEC's data. The SEC already publishes all the data extracted from the raw XBRL files (the XBRL instances are published as well), but the tool that parses an XBRL filing into the four files num.txt, sub.txt, pre.txt and tag.txt is not published; what I want to do is create that tool myself.
Step1:
Download the 2020 dataset from https://www.sec.gov/dera/data/financial-statement-data-sets.html and unzip it; we get pre.txt, num.txt, sub.txt and tag.txt.
Step2:
Create a database and the tables pre, num, sub and tag according to the fields in pre.txt, num.txt, sub.txt and tag.txt.
Step3:
Import pre.txt, num.txt, sub.txt and tag.txt into the tables pre, num, sub and tag.
Step4:
Query in my PostgreSQL database. The adsh number for Tesla's 2020 financial statements is `0001564590-21-004599`:
\set accno '0001564590-21-004599'
select tag,value from num where adsh=:'accno' and ddate = '2020-12-31' and qtrs=4 and tag in
(select tag from pre where adsh=:'accno' and stmt='IS');
tag | value
---------------------------------------------------------------------------------------------+------------------
NetIncomeLoss | 721000000.0000
OperatingLeasesIncomeStatementLeaseRevenue | 1052000000.0000
GrossProfit | 6630000000.0000
InterestExpense | 748000000.0000
CostOfRevenue | 24906000000.0000
WeightedAverageNumberOfSharesOutstandingBasic | 933000000.0000
IncomeLossFromContinuingOperationsBeforeIncomeTaxesExtraordinaryItemsNoncontrollingInterest | 1154000000.0000
EarningsPerShareDiluted | 0.6400
ProfitLoss | 862000000.0000
OperatingExpenses | 4636000000.0000
InvestmentIncomeInterest | 30000000.0000
OperatingIncomeLoss | 1994000000.0000
SellingGeneralAndAdministrativeExpense | 3145000000.0000
NetIncomeLossAttributableToNoncontrollingInterest | 141000000.0000
StockholdersEquityNoteStockSplitConversionRatio1 | 5.0000
NetIncomeLossAvailableToCommonStockholdersBasic | 690000000.0000
Revenues | 31536000000.0000
IncomeTaxExpenseBenefit | 292000000.0000
OtherNonoperatingIncomeExpense | -122000000.0000
WeightedAverageNumberOfDilutedSharesOutstanding | 1083000000.0000
EarningsPerShareBasic | 0.7400
ResearchAndDevelopmentExpense | 1491000000.0000
BuyOutOfNoncontrollingInterest | 31000000.0000
CostOfAutomotiveLeasing | 563000000.0000
CostOfRevenuesAutomotive | 20259000000.0000
CostOfServicesAndOther | 2671000000.0000
SalesRevenueAutomotive | 27236000000.0000
SalesRevenueServicesAndOtherNet | 2306000000.0000
(28 rows)
stmt='IS' selects the tags presented on the income statement; ddate='2020-12-31' selects the 2020 annual figures. Please read the SEC's data field definition for qtrs to see why qtrs=4 is set in the select command.
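The same filter can also be done in Python without a database - a rough pandas sketch over the unzipped files (note the raw files are tab-separated and ddate is stored as YYYYMMDD rather than a SQL date, so the literals differ from the query above):
import pandas as pd

num = pd.read_csv('num.txt', sep='\t', dtype=str)
pre = pd.read_csv('pre.txt', sep='\t', dtype=str)

accno = '0001564590-21-004599'
# tags presented on the income statement (stmt='IS') for this filing
is_tags = pre.loc[(pre['adsh'] == accno) & (pre['stmt'] == 'IS'), 'tag']
# annual (qtrs=4) values dated 2020-12-31 for those tags
income = num[(num['adsh'] == accno) &
             (num['ddate'] == '20201231') &
             (num['qtrs'] == '4') &
             (num['tag'].isin(is_tags))][['tag', 'value']]
print(income)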
Compare with https://www.nasdaq.com/market-activity/stocks/tsla/financials (part shown below; look at the 12/31/2020 column):
Period Ending: 12/31/2021 12/31/2020 12/31/2019 12/31/2018
Total Revenue $53,823,000 $31,536,000 $24,578,000 $21,461,000
Cost of Revenue $40,217,000 $24,906,000 $20,509,000 $17,419,000
Gross Profit $13,606,000 $6,630,000 $4,069,000 $4,042,000
Research and Development $2,593,000 $1,491,000 $1,343,000 $1,460,000
All the numbers are equal. The terms in XBRL are Revenues, CostOfRevenue, GrossProfit and ResearchAndDevelopmentExpense; the respective terms in financial accounting are Total Revenue, Cost of Revenue, Gross Profit and Research and Development. So the numbers I get are right.
Some day later I may find the direct way to parse the raw XBRL file to get the income statement; I am not familiar with XBRL yet and hope the Stack Overflow community can help me reach that goal.

Failing to use sumproduct on date ranges with multiple conditions [Python]

From the replacement data table (below in the image), I am trying to incorporate the solbox product replacements into the time series data format (above in the image). I need to extract the number of consumers per day from this information.
What I need to find out:
On a specific date, how many solbox products were active
On a specific date, how many solbox products (belonging to consumers) were active
I have used this formula in Excel but cannot implement it properly in Python.
=SUMPRODUCT((Record_Solbox_Replacement!$O$2:$O$1367 = "consumer") * (A475>=Record_Solbox_Replacement!$L$2:$L$1367)*(A475<Record_Solbox_Replacement!$M$2:$M$1367))
I tried this in Python:
timebase_df['date'] = pd.date_range(start = replace_table_df['solbox_started'].min(), end = replace_table_df['solbox_started'].max(), freq = frequency)
timebase_df['date_unix'] = timebase_df['date'].astype(np.int64) // 10**9
timebase_df['no_of_solboxes'] = ((timebase_df['date_unix'] >= replace_table_df['started'].to_numpy()) & (timebase_df['date_unix'] < replace_table_df['ended'].to_numpy()) & (replace_table_df['customer_type'] == 'customer'))
ERROR:
~\Anaconda3\Anaconda4\lib\site-packages\pandas\core\ops\array_ops.py in comparison_op(left, right, op)
232 # The ambiguous case is object-dtype. See GH#27803
233 if len(lvalues) != len(rvalues):
--> 234 raise ValueError("Lengths must match to compare")
235
236 if should_extension_dispatch(lvalues, rvalues):
ValueError: Lengths must match to compare
Can someone help me please? I can explain in the comments if I have missed something.
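The error comes from comparing Series of different lengths element-wise; the SUMPRODUCT counts, for each date, how many replacement rows satisfy all three conditions, which needs a broadcast comparison rather than an aligned one. A rough sketch of that idea, assuming the column names used in the question (the Excel formula checks for "consumer", so that is what the sketch uses):
import numpy as np

dates = timebase_df['date_unix'].to_numpy()[:, None]        # shape (n_dates, 1)
started = replace_table_df['started'].to_numpy()[None, :]   # shape (1, n_rows)
ended = replace_table_df['ended'].to_numpy()[None, :]
is_consumer = (replace_table_df['customer_type'] == 'consumer').to_numpy()[None, :]

# count, per date, the rows with started <= date < ended that are consumers
timebase_df['no_of_solboxes'] = ((dates >= started) & (dates < ended) & is_consumer).sum(axis=1)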

pandas groupby trying to optimise several steps

I've been trying to optimise a bokeh server to calculate live stats by selected country on Covid19.
I found myself repeating a groupby call to calculate new columns and was wondering, having selected the groupby, if I could then apply it, in a similar way to .agg(), to multiple columns?
For example:
dfall = pd.DataFrame(db("SELECT * FROM C19daily"))
dfall.set_index(['geoId', 'date'], drop=False, inplace=True)
dfall = dfall.sort_index(ascending=True)
dfall.head()
id date geoId cases deaths auid
geoId date
AD 2020-03-03 70119 2020-03-03 AD 1 0 AD03/03/2020
2020-03-14 70118 2020-03-14 AD 1 0 AD14/03/2020
2020-03-16 70117 2020-03-16 AD 3 0 AD16/03/2020
2020-03-17 70116 2020-03-17 AD 9 0 AD17/03/2020
2020-03-18 70115 2020-03-18 AD 0 0 AD18/03/2020
I need to create new columns based on 'cases' and 'deaths' and applying various functions like cumsum(). Currently I do this the long way
dfall['ccases'] = dfall.groupby(level=0)['cases'].cumsum()
dfall['dpc_cases'] = dfall.groupby(level=0)['cases'].pct_change(fill_method='pad', periods=7)
.....
dfall['cdeaths'] = dfall.groupby(level=0)['deaths'].cumsum()
dfall['dpc_deaths'] = dfall.groupby(level=0)['deaths'].pct_change(fill_method='pad', periods=7)
I tried to optimise the groupby call like this:-
with dfall.groupby(level=0) as gr:
    gr = gr['cases'].cumsum()...
But the error suggests the class doesn't support this:
AttributeError: __enter__
I thought I could use .agg({}) and supply a dictionary:
g = dfall.groupby(level=0).agg({'cc' : 'cumsum', 'cd' : 'cumsum'})
but that produces another error
pandas.core.base.SpecificationError: nested renamer is not supported
I have plenty of other bits to optimise; I thought this Python part would be the easiest and would save a few ms!
Could anyone nudge me in the right direction?
To avoid repeating dfall.groupby(level=0) you can just save it in a variable:
gb = dfall.groupby(level=0)
gb_cases = gb['cases']
dfall['ccases'] = gb_cases.cumsum()
dfall['dpc_cases'] = gb_cases.pct_change(fill_method='pad', periods=7)
...
And to run multiple aggregations using a single expression, I think you can use named aggregation. But I have no clue whether it will be more performant or not. Either way, it's better to profile the code and improve the actual bottlenecks.
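For illustration, a short sketch of both ideas using the column names from the question - transform-style operations such as cumsum can be applied to several columns of one groupby at once, while named aggregation applies to reducing functions:
gb = dfall.groupby(level=0)

# one groupby, cumsum over both columns at once; add_prefix renames
# 'cases'/'deaths' to 'ccases'/'cdeaths'
cum = gb[['cases', 'deaths']].cumsum().add_prefix('c')
dfall[['ccases', 'cdeaths']] = cum

# named aggregation works for reducing functions (sum, max, ...), e.g.:
summary = gb.agg(total_cases=('cases', 'sum'),
                 total_deaths=('deaths', 'sum'))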

Parsing heterogenous data from a text file in Python

I am trying to parse raw data results from a text file into an organised tuple but am having trouble getting it right.
My raw data from the text file looks something like this:
Episode Cumulative Results
EpisodeXD0281119
Date collected21/10/2019
Time collected10:00
Real time PCR for M. tuberculosis (Xpert MTB/Rif Ultra):
PCR result Mycobacterium tuberculosis complex NOT detected
Bacterial Culture:
Bottle: Type FAN Aerobic Plus
Result No growth after 5 days
EpisodeST32423457
Date collected23/02/2019
Time collected09:00
Gram Stain:
Neutrophils Occasional
Gram positive bacilli Moderate (2+)
Gram negative bacilli Numerous (3+)
Gram negative cocci Moderate (2+)
EpisodeST23423457
Date collected23/02/2019
Time collected09:00
Bacterial Culture:
A heavy growth of
1) Klebsiella pneumoniae subsp pneumoniae (KLEPP)
ensure that this organism does not spread in the ward/unit.
A heavy growth of
2) Enterococcus species (ENCSP)
Antibiotic/Culture KLEPP ENCSP
Trimethoprim-sulfam R
Ampicillin / Amoxic R S
Amoxicillin-clavula R
Ciprofloxacin R
Cefuroxime (Parente R
Cefuroxime (Oral) R
Cefotaxime / Ceftri R
Ceftazidime R
Cefepime R
Gentamicin S
Piperacillin/tazoba R
Ertapenem R
Imipenem S
Meropenem R
S - Sensitive ; I - Intermediate ; R - Resistant ; SDD - Sensitive Dose Dependant
Comment for organism KLEPP:
** Please note: this is a carbapenem-RESISTANT organism. Although some
carbapenems may appear susceptible in vitro, these agents should NOT be used as
MONOTHERAPY in the treatment of this patient. **
Please isolate this patient and practice strict contact precautions. Please
inform Infection Prevention and Control as contact screening might be
indicated.
For further advice on the treatment of this isolate, please contact.
The currently available laboratory methods for performing colistin
susceptibility results are unreliable and may not predict clinical outcome.
Based on published data and clinical experience, colistin is a suitable
therapeutic alternative for carbapenem resistant Acinetobacter spp, as well as
carbapenem resistant Enterobacteriaceae. If colistin is clinically indicated,
please carefully assess clinical response.
EpisodeST234234057
Date collected23/02/2019
Time collected09:00
Authorised by xxxx on 27/02/2019 at 10:35
MIC by E-test:
Organism Klebsiella pneumoniae (KLEPN)
Antibiotic Meropenem
MIC corrected 4 ug/mL
MIC interpretation Resistant
Antibiotic Imipenem
MIC corrected 1 ug/mL
MIC interpretation Sensitive
Antibiotic Ertapenem
MIC corrected 2 ug/mL
MIC interpretation Resistant
EpisodeST23423493
Date collected18/02/2019
Time collected03:15
Potassium 4.4 mmol/L 3.5 - 5.1
EpisodeST45445293
Date collected18/02/2019
Time collected03:15
Creatinine 32 L umol/L 49 - 90
eGFR (MDRD formula) >60 mL/min/1.73 m2
Creatinine 28 L umol/L 49 - 90
eGFR (MDRD formula) >60 mL/min/1.73 m2
Essentially the pattern is that ALL information starts with a unique EPISODE NUMBER, followed by a DATE and TIME and then the result of whatever test was done. This is the pattern throughout.
What I am trying to parse into my tuple is the date, time, name of the test and the result - whatever it might be. I have the following code:
from collections import namedtuple

with open(filename) as f:
    data = f.read()
data = data.splitlines()
DS = namedtuple('DS', 'date time name value')
parsed = list()
idx_date = [i for i, r in enumerate(data) if r.strip().startswith('Date')]
for start, stop in zip(idx_date[:-1], idx_date[1:]):
    chunk = data[start:stop]
    date = time = name = value = None
    for row in chunk:
        if not row: continue
        row = row.strip()
        if row.startswith('Episode'): continue
        if row.startswith('Date'):
            _, date = row.split()
            date = date.replace('collected', '')
        elif row.startswith('Time'):
            _, time = row.split()
            time = time.replace('collected', '')
        else:
            name, value, *_ = row.split()
            print(name)
    parsed.append(DS(date, time, name, value))
print(parsed)
My problem is that I am unable to find a way to handle the heterogeneity of the test RESULT so that I can use it later, for example in the namedtuple DS('date time name value'):
DATE = 21/10/2019
TIME = 10:00
NAME = Real time PCR for M tuberculosis or Potassium
RESULT = Negative or 4.7
Any advice appreciated. I have hit a brick wall.
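Not a full answer, but one heuristic that may go further than name, value, *_ = row.split() is to treat everything before the first numeric token as the test name and the rest as the value, falling back to splitting off the last word for purely textual results. A rough sketch (the regex and the fallback are assumptions; textual results like the PCR line will still need test-specific handling):
import re

def split_result(row):
    # text before the first numeric token -> name, rest -> value
    m = re.match(r'(.+?)\s+([-+]?\d[\d.]*.*)$', row)
    if m:
        return m.group(1), m.group(2)
    # purely textual results: keep the last word as the value
    name, _, value = row.rpartition(' ')
    return name, value

print(split_result('Potassium 4.4 mmol/L 3.5 - 5.1'))
# ('Potassium', '4.4 mmol/L 3.5 - 5.1')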

Vanilla Interest Rate Swap Valuation in Python using QuantLib

I am valuing a Vanilla Interest Rate Swap as at 31 January 2017 (Valuation Date), but the effective date of the Vanilla Interest Rate Swap is 31 December 2016 (Start Date). Firstly, I would like to know how I can adjust for my valuation date and start date in the code below:
import QuantLib as ql
startDate = ql.Date(31,12,2016)
valuationDate = ql.Date(31,1,2017)
maturityDate = ql.Date(30,9,2019)
calendar = ql.SouthAfrica()
bussiness_convention = ql.Unadjusted
swapT = ql.VanillaSwap.Payer
nominal = 144000000
fixedLegTenor = ql.Period(3, ql.Months)
fixedSchedule = ql.Schedule(startDate, maturityDate,
                            fixedLegTenor, calendar,
                            ql.ModifiedFollowing, ql.ModifiedFollowing,
                            ql.DateGeneration.Forward, False)
fixedRate = 0.077
fixedLegDayCount = ql.Actual365Fixed()
spread = 0
floatLegTenor = ql.Period(3, ql.Months)
floatSchedule = ql.Schedule(startDate, maturityDate,
                            floatLegTenor, calendar,
                            ql.ModifiedFollowing, ql.ModifiedFollowing,
                            ql.DateGeneration.Forward, False)
floatLegDayCount = ql.Actual365Fixed()
discountDates = [ql.Date(31,1,2017),ql.Date(7,2,2017),ql.Date(28,2,2017),
ql.Date(31,3,2017),ql.Date(28,4,2017),ql.Date(31,7,2017),
ql.Date(31,10,2017),ql.Date(31,1,2018),ql.Date(31,1,2019),
ql.Date(31,1,2020),ql.Date(29,1,2021),ql.Date(31,1,2022),
ql.Date(31,1,2023),ql.Date(31,1,2024),ql.Date(31,1,2025),
ql.Date(30,1,2026),ql.Date(29,1,2027),ql.Date(31,1,2029),
ql.Date(30,1,2032),ql.Date(30,1,2037),ql.Date(31,1,2042),
ql.Date(31,1,2047)]
discountRates = [1,0.9986796,0.99457,0.9884423,0.9827433,0.9620352,0.9420467,0.9218714,
0.863127,0.7993626,0.7384982,0.6796581,0.6244735,0.5722537,0.5236629,
0.4779477,0.4362076,0.3619845,0.2795902,0.1886847,0.1352048,0.1062697]
jibarTermStructure = ql.RelinkableYieldTermStructureHandle()
jibarIndex = ql.Jibar(ql.Period(3,ql.Months), jibarTermStructure)
jibarIndex.addFixings(dtes, fixings)
discountCurve = ql.DiscountCurve(discountDates,discountRates,ql.Actual365Fixed())
swapEngine = ql.DiscountingSwapEngine(discountCurve)
interestRateSwap = ql.VanillaSwap(swapT, nominal, fixedSchedule,
                                  fixedRate, fixedLegDayCount, jibarIndex, spread, floatSchedule,
                                  floatLegDayCount, swapEngine, floatSchedule.convention)
Secondly, I would like to know how best to incorporate jibarIndex in interestRateSwap = ql.VanillaSwap(), or else how I can use my discount factors and discount dates to calculate the value of the interest rate swap.
ql.Settings.instance().setEvaluationDate(today)
sets the evaluation date or valuation date.
I don't see a problem with your jibarIndex index in the VanillaSwap code. You've given the object to QuantLib; the library will use it as a forward curve to price your floating leg and thus the swap.
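For completeness, a minimal sketch of that wiring using the variable names from the question (this assumes the discount-factor curve is also used as the JIBAR forecasting curve, which is an assumption, not something stated in the question):
# set the valuation date
ql.Settings.instance().setEvaluationDate(valuationDate)

# link the forecasting curve, build a discounting engine from the
# discount-factor curve, then price the swap
jibarTermStructure.linkTo(discountCurve)
swapEngine = ql.DiscountingSwapEngine(ql.YieldTermStructureHandle(discountCurve))
interestRateSwap.setPricingEngine(swapEngine)
print(interestRateSwap.NPV())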
