How can I understand the semantic meaning of different values? - python-3.x

I want to get Apple's financial data. Download https://www.sec.gov/files/dera/data/financial-statement-and-notes-data-sets/2022_01_notes.zip from https://www.sec.gov/dera/data/financial-statement-and-notes-data-set.html, extract it, and put it in /tmp/2022_01_notes. You can find the definitions of the sub and num tables and their fields in https://www.sec.gov/files/aqfsn_1.pdf.
I computed the zip file's MD5 message digest:
md5sum 2022_01_notes.zip
b1cdf638200991e1bbe260489093bf67 2022_01_notes.zip
You can download it from the official webpage or my Dropbox:
https://www.dropbox.com/s/5ntwasipze8vr29/2022_01_notes.zip?dl=0
No matter where you download it from, please check the md5sum value; the SEC may have uploaded a wrong file, and they may update the zip file in the future.
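The same check can be scripted with hashlib so the verification runs as part of the download step. A minimal sketch (the path is the one assumed above):

```python
import hashlib

# Expected digest of 2022_01_notes.zip, from the md5sum output above
EXPECTED = 'b1cdf638200991e1bbe260489093bf67'

def md5_of(path, chunk=1 << 20):
    """Compute the MD5 hex digest of a file, reading in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            h.update(block)
    return h.hexdigest()

# Uncomment after downloading:
# assert md5_of('/tmp/2022_01_notes.zip') == EXPECTED
```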
import pandas as pd
df_sub = pd.read_csv('/tmp/2022_01_notes/sub.tsv',sep='\t')
df_sub[df_sub['cik'] == 320193]  # Apple's CIK is 320193
adsh cik name sic countryba stprba cityba ... instance nciks aciks pubfloatusd floatdate floataxis floatmems
4329 0000320193-22-000006 320193 APPLE INC 3571.0 US CA CUPERTINO ... aapl-20220127_htm.xml 1 NaN NaN NaN NaN NaN
4731 0000320193-22-000007 320193 APPLE INC 3571.0 US CA CUPERTINO ... aapl-20211225_htm.xml 1 NaN NaN NaN NaN NaN
0000320193-22-000007 is the accession number for its fiscal 2022 first-quarter data (quarter ended December 25, 2021).
df_num = pd.read_csv('/tmp/2022_01_notes/num.tsv',sep='\t')
# get all of Apple's financial data as XBRL concepts
df_apple = df_num[df_num['adsh'] == '0000320193-22-000007']
# extract a single concept: RevenueFromContractWithCustomerExcludingAssessedTax
# -- revenue as mapped into a financial accounting concept by the XBRL taxonomy
df_apple_revenue = df_apple[df_apple['tag'] == 'RevenueFromContractWithCustomerExcludingAssessedTax']
df_apple_revenue_2021 = df_apple_revenue[df_apple_revenue['ddate'] == 20201231]
df_apple_revenue_2021
The dataframe is too long to display on my terminal console, so I write it to a CSV file
df_apple_revenue_2021.to_csv('/tmp/apple_revenue_2021.csv')
open it in Excel, and paste the content here.
For the first two lines, what do 8285000000 and 15761000000 mean? Please give a rational description of 8285000000 and 15761000000.
0000320193-22-000007 RevenueFromContractWithCustomerExcludingAssessedTax us-gaap/2021 20201231 1 USD 0xf159835fd3644f228d15724ad9d1837c 0 8285000000 0 1 0.013698995 5 -6
0000320193-22-000007 RevenueFromContractWithCustomerExcludingAssessedTax us-gaap/2021 20201231 1 USD 0x58c22680ab8dbbfb662ff4e14055c1bd 1 15761000000 0 1 0.013698995 5 -6

To explain these figures, you have to tie them back to the filing from which they were extracted. In this case, the filing with accession number 0000320193-22-000007 is the Form 10-Q for the fiscal quarter ended December 25, 2021. If you check that filing, you'll find, for example, seven of the values in your dataframe in the table Net sales by reportable segment, specifically the column Three Months Ended December 26, 2020.
So, for example, 8285000000 refers to the Japan segment for that period, while 15761000000 appears in the Net sales by category table for the Services category for the same reporting period. That table contains six more of the values in the dataframe.
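You can also resolve this programmatically: each num row carries a dimension hash (the 0x… values in your rows), and per the field definitions in aqfsn_1.pdf that hash (dimh) can be joined against the dim table (dimhash, segments) to see which axis/member combination a value belongs to. A minimal sketch with synthetic stand-in rows; the segment member strings here are illustrative assumptions, not values read from the actual dim.tsv:

```python
import pandas as pd

# Synthetic stand-ins for num.tsv and dim.tsv rows; column names (dimh,
# dimhash, segments) follow the field definitions in aqfsn_1.pdf.
num = pd.DataFrame({
    'adsh': ['0000320193-22-000007'] * 2,
    'tag': ['RevenueFromContractWithCustomerExcludingAssessedTax'] * 2,
    'value': [8285000000, 15761000000],
    'dimh': ['0xf159835fd3644f228d15724ad9d1837c',
             '0x58c22680ab8dbbfb662ff4e14055c1bd'],
})
dim = pd.DataFrame({
    'dimhash': ['0xf159835fd3644f228d15724ad9d1837c',
                '0x58c22680ab8dbbfb662ff4e14055c1bd'],
    # Illustrative member strings (assumed, not read from the file)
    'segments': ['StatementBusinessSegmentsAxis=JapanSegmentMember;',
                 'ProductOrServiceAxis=ServiceMember;'],
})

# Join each numeric fact to its dimensional context
resolved = num.merge(dim, left_on='dimh', right_on='dimhash', how='left')
print(resolved[['value', 'segments']])
```

With the real files, the same merge on `/tmp/2022_01_notes/num.tsv` and `/tmp/2022_01_notes/dim.tsv` shows which segment or category each revenue figure belongs to.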

Related

Pandas: Subtracting Two Mismatched Dataframes

Forgive me if this is a repeat question, but I can't find the answer and I'm not even sure what the right terminology is.
I have two dataframes that don't have completely matching rows or columns. Something like:
Balances = pd.DataFrame({'Name':['Alan','Barry','Carl', 'Debbie', 'Elaine'],
'Age Of Debt':[1,4,3,7,2],
'Balance':[500,5000,300,100,3000],
'Payment Due Date':[1,1,30,14,1]})
Payments = pd.DataFrame({'Name':['Debbie','Alan','Carl'],
'Balance':[50,100,30]})
I want to subtract the Payments dataframe from the Balances dataframe based on Name, so essentially a new dataframe that looks like this:
pd.DataFrame({'Name':['Alan','Barry','Carl', 'Debbie', 'Elaine'],
'Age Of Debt':[1,4,3,7,2],
'Balance':[400,5000,270,50,3000],
'Payment Due Date':[1,1,30,14,1]})
I can imagine iterating over the rows of Balances, but when both dataframes are very large I don't think that's very efficient.
You can use .merge:
tmp = pd.merge(Balances, Payments, on="Name", how="outer").fillna(0)
Balances["Balance"] = tmp["Balance_x"] - tmp["Balance_y"]
print(Balances)
Prints:
Name Age Of Debt Balance Payment Due Date
0 Alan 1 400.0 1
1 Barry 4 5000.0 1
2 Carl 3 270.0 30
3 Debbie 7 50.0 14
4 Elaine 2 3000.0 1
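An alternative sketch that avoids the merge entirely: align the payment amounts to Balances by Name with map and subtract. This assumes Name is unique within each frame:

```python
import pandas as pd

Balances = pd.DataFrame({'Name': ['Alan', 'Barry', 'Carl', 'Debbie', 'Elaine'],
                         'Age Of Debt': [1, 4, 3, 7, 2],
                         'Balance': [500, 5000, 300, 100, 3000],
                         'Payment Due Date': [1, 1, 30, 14, 1]})
Payments = pd.DataFrame({'Name': ['Debbie', 'Alan', 'Carl'],
                         'Balance': [50, 100, 30]})

# Look up each name's payment; names with no payment map to NaN,
# which fillna(0) turns into "subtract nothing".
paid = Balances['Name'].map(Payments.set_index('Name')['Balance']).fillna(0)
Balances['Balance'] = Balances['Balance'] - paid
print(Balances)
```

This keeps Balances' row order and shape untouched regardless of which names appear in Payments.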

Extracting specific values from a pandas columns and storing it in new columns

I have a pandas column which is storing data in a form of a list in the following format:
text
[['Mark','PERSON'], ['Data Scientist','TITLE'], ['Berlin','LOC'], ['Python','SKILLS'], ['Tableau,','SKILLS'], ['SQL','SKILLS'], ['AWS','SKILLS']]
[['John','PERSON'], ['Data Engineer','TITLE'], ['London','LOC'], ['Python','SKILLS'], ['DB2,','SKILLS'], ['SQL','SKILLS']]
[['Pearson','PERSON'], ['Intern','TITLE'], ['Barcelona','LOC'], ['Python','SKILLS'], ['Excel,','SKILLS'], ['SQL','SKILLS']]
[['Broody','PERSON'], ['Manager','TITLE'], ['Barcelona','LOC'], ['Team Management','SKILLS'], ['Excel,','SKILLS'], ['Good Communications','SKILLS']]
[['Rita','PERSON'], ['Software Developer','TITLE'], ['London','LOC'], ['Dot Net','SKILLS'], ['SQl Server,','SKILLS'], ['VS Code,','SKILLS']]
What I want to see as an output is :
PERSON TITLE LOC SKILLS
Mark Data Scientist Berlin Python, Tableau, SQL, AWS
John Data Engineer London Python, DB2,SQL
..... and so on for the rest of the input rows as well
So essentially splitting each pair on the ",", storing the part after the "," (the label) as the column header and the part before the "," as the value.
How can I achieve this?
If you have a data frame like this, call it "df":
index text
0 1 [[Mark, PERSON], [Data Scientist, TITLE], [Ber...
1 2 [[John, PERSON], [Data Engineer, TITLE], [Lond...
2 3 [[Pearson, PERSON], [Intern, TITLE], [Barcelon...
3 4 [[Broody, PERSON], [Manager, TITLE], [Barcelon...
4 5 [[Rita, PERSON], [Software Developer, TITLE], ...
You can try something like this:
person = []
skills = []
title = []
loc = []
temp = []
for i in range(len(df['text'])):
    for j in range(len(df['text'][i])):
        if df['text'][i][j][1] == 'PERSON':
            person.append(df['text'][i][j][0])
        elif df['text'][i][j][1] == 'TITLE':
            title.append(df['text'][i][j][0])
        elif df['text'][i][j][1] == 'LOC':
            loc.append(df['text'][i][j][0])
        elif df['text'][i][j][1] == 'SKILLS':
            temp.append(df['text'][i][j][0].replace(",", ""))
    skills.append(",".join(temp))
    temp = []
df_out = pd.DataFrame({'PERSON': person, 'TITLE': title, 'LOC': loc, 'SKILLS': skills})
Output
PERSON TITLE LOC SKILLS
0 Mark Data Scientist Berlin Python,Tableau,SQL,AWS
1 John Data Engineer London Python,DB2,SQL
2 Pearson Intern Barcelona Python,Excel,SQL
3 Broody Manager Barcelona Team Management,Excel,Good Communications
4 Rita Software Developer London Dot Net,SQl Server,VS Code
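The same grouping can be done in one pass with a dict per row, which avoids hard-coding a branch for each label. A sketch on synthetic rows shaped like the question's data (only the first two records shown):

```python
import pandas as pd

# Synthetic records matching the question's structure: lists of [value, label] pairs
records = [
    [['Mark', 'PERSON'], ['Data Scientist', 'TITLE'], ['Berlin', 'LOC'],
     ['Python', 'SKILLS'], ['Tableau,', 'SKILLS'], ['SQL', 'SKILLS'], ['AWS', 'SKILLS']],
    [['John', 'PERSON'], ['Data Engineer', 'TITLE'], ['London', 'LOC'],
     ['Python', 'SKILLS'], ['DB2,', 'SKILLS'], ['SQL', 'SKILLS']],
]

rows = []
for rec in records:
    grouped = {}
    for value, label in rec:
        # Strip stray trailing commas and collect values under their label
        grouped.setdefault(label, []).append(value.rstrip(','))
    # Join repeated labels (SKILLS) into one comma-separated string
    rows.append({label: ','.join(values) for label, values in grouped.items()})

df_out = pd.DataFrame(rows)
print(df_out)
```

Any new label that appears in the data simply becomes another column, with no code change.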

How to merge two rows having same values into single row in python?

I have a table called 'data' in which the values are like the following:
ID NAME DOB LOCATION
1 bob 08/10/1985 NEW JERSEY
1 bob 15/09/1987 NEW YORK
2 John 08/10/1985 NORTH CAROLINA
2 John 26/11/1990 OKLAHOMA
For example, I want output like this:
ID NAME No.of.Days
1 bob difference of two given dates in days
2 John difference of two given dates in days
Please help me write Python code to get the expected output.
If there are only two dates for a given ID, then the below works:
df.groupby(['ID','NAME'])['DOB'].apply(lambda x: abs(pd.to_datetime(list(x)[0]) - pd.to_datetime(list(x)[1]))).reset_index(name='No.Of.Days')
Output
ID NAME No.Of.Days
0 1 bob 766 days
1 2 John 1934 days
You can also use np.diff (with numpy imported and DOB already parsed as datetimes):
df.groupby(['ID','NAME'])['DOB'].apply(lambda x: np.diff(list(x))[0]).reset_index(name='No.Of.Days')
First, you need to convert the date column into datetime format. Let's suppose you are reading from a .csv; then read your file as follows:
df = pd.read_csv('yourfile.csv', parse_dates = ['DOB'])
Otherwise, convert your existing dataframe column into datetime format as follows:
df['DOB'] = pd.to_datetime(df['DOB'])
Now you can perform the usual date arithmetic:
df.groupby(['ID','NAME'])['DOB'].apply(lambda x: abs(pd.to_datetime(list(x)[0]) - pd.to_datetime(list(x)[1]))).reset_index(name='No.Of.Days')
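If an ID may have more than two rows, a max-minus-min aggregation generalizes the same idea. A sketch with the sample data, parsed with dayfirst=True since the dates are DD/MM/YYYY (note the day counts differ from the output above, which parsed '08/10/1985' month-first):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'NAME': ['bob', 'bob', 'John', 'John'],
                   'DOB': ['08/10/1985', '15/09/1987', '08/10/1985', '26/11/1990']})
# Dates are DD/MM/YYYY, so parse day-first consistently
df['DOB'] = pd.to_datetime(df['DOB'], dayfirst=True)

# Span between the earliest and latest date per (ID, NAME) group, in days
out = (df.groupby(['ID', 'NAME'])['DOB']
         .agg(lambda s: (s.max() - s.min()).days)
         .reset_index(name='No.Of.Days'))
print(out)
```

This works unchanged for two, three, or more dates per ID, always reporting the full span.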

string to pandas dataframe

Following the parsing of a large PDF document, I end up with a string in Python in the following format:
Company Name;(Code) at End of Month;Reason for Alteration No. of Shares;Bond Symbol, etc.; Value, etc.; after Alteration;Remarks
Shares;Shares
TANSEISHA CO.,LTD.;(9743)48,424,071;0
MEITEC CORPORATION;(9744)31,300,000;0
TKC Corporation;(9746)26,731,033;0
ASATSU-DK INC.;(9747);42,155,400;Exercise of Subscription Warrants;0;May 2013 Resolution based 1;0Shares
May 2013 Resolution based 2;0Shares
Would it be possible to transform this into a pandas dataframe with the columns delimited by the ";"? Looking at the above section of the string, my df should look like:
Company Name (Code) at End of Month Reason for Alteration ....
Value,etc after Alteration Remarks Shares .....
As an additional problem, my rows don't always have the same number of strings delimited by ";", meaning I need a way to handle the varying column counts (I don't mind setting up a dataframe with, say, 15 columns and afterwards deleting those I do not need).
Thanks
This is a nice opportunity to use StringIO to make your result look like an open file handle so that you can just use pd.read_csv:
In [1]: import pandas as pd
In [2]: from io import StringIO  # in Python 3, StringIO lives in the io module
In [3]: s = """Company Name;(Code) at End of Month;Reason for Alteration No. of Shares;Bond Symbol, etc.; Value, etc.; after Alteration;Remarks
...: Shares;Shares
...: TANSEISHA CO.,LTD.;(9743)48,424,071;0
...: MEITEC CORPORATION;(9744)31,300,000;0
...: TKC Corporation;(9746)26,731,033;0
...: ASATSU-DK INC.;(9747);42,155,400;Exercise of Subscription Warrants;0;May 2013 Resolution based 1;0Shares
...: May 2013 Resolution based 2;0Shares"""
In [4]: pd.read_csv(StringIO(s), sep=";")
Out [4]: Company Name (Code) at End of Month Reason for Alteration No. of Shares Bond Symbol, etc. Value, etc. after Alteration Remarks
0 Shares Shares NaN NaN NaN NaN NaN
1 TANSEISHA CO.,LTD. (9743)48,424,071 0 NaN NaN NaN NaN
2 MEITEC CORPORATION (9744)31,300,000 0 NaN NaN NaN NaN
3 TKC Corporation (9746)26,731,033 0 NaN NaN NaN NaN
4 ASATSU-DK INC. (9747) 42,155,400 Exercise of Subscription Warrants 0.0 May 2013 Resolution based 1 0Shares
5 May 2013 Resolution based 2 0Shares NaN NaN NaN NaN NaN
Note that it does look like there are some obvious data cleanup problems to tackle from here, but that should at least give you a start.
I would split your read-in string into a list of lists. Possibly use a regex to find the beginning of each record (or at least use something that you know shows up consistently; it looks like the (Code) at End of Month field might work) and slice your way through. Something like this:
import re
import pandas as pd
# Start your list of lists off with your expected headers
mystringlist = [["Company Name",
                 "(Code) at End of Month",
                 "Reason for Alteration",
                 "Value,etc",
                 "after Alteration",
                 "Remarks Shares"]]
# This will be used to store the start index of each record
indexlist = []
# A recursive function to find the start location of each record.
# It expects a string of "1"s and "0"s.
def find_start(thestring, startloc=0):
    foundindex = thestring.find("1", startloc)
    if foundindex == -1:
        return
    indexlist.append(foundindex)
    find_start(thestring, foundindex + 1)
# Split on your delimiter (s is the raw string from the question)
mystring = s.split(";")
# Use a list comprehension to make a string of "1"s and "0"s based on the
# location of a fixed, regular-expressible record field
stringloc = "".join(["1" if re.match(r"\(\d+\)\d+,\d+,\d+", x) else "0" for x in mystring])
find_start(stringloc)
# Build the list of lists from the found indexes. We subtract 1 from each
# index because we want to start at the element that immediately precedes
# the matched one (the company name comes right before the code field).
for pos, x in enumerate(indexlist):
    if pos + 1 < len(indexlist):
        mystringlist.append(mystring[x - 1:indexlist[pos + 1] - 1])
    else:
        mystringlist.append(mystring[x - 1:])
# Turn the list of lists into a data frame
mydf = pd.DataFrame(mystringlist)

How to fill missing data in excel time series

I need a hand with this problem: in an Excel workbook I recorded 10 monthly time series for 10 securities that should cover the past 15 years. Unfortunately, not all of the series cover the full 15 years. For example, one series only goes back to 2003, so in that column the first 5 years hold "Not Available" instead of a value. Once I have imported the data into Matlab, NaN naturally appears in the column of the shorter series where there are no values.
>> Prices = xlsread('PrezziTitoli.xls');
>> whos
Name Size Bytes Class Attributes
Prices 182x10 6360 double
My goal is to estimate the variance-covariance matrix, but because of the missing data the calculation is not possible for me. Before computing the variance-covariance matrix, I thought of interpolating to fill the values that come up as NaN in Matlab, for example with fillts, but I am having difficulty using it.
Is there some code that could be useful to me? Can you help me?
Thanks!
Do you have the statistics toolbox installed? In that case, the solution is simple:
>> x = randn(10,4); % x is a 10x4 matrix of random numbers
>> x(randi(40,10,1)) = NaN; % set some random entries to NaN
>> disp(x)
-1.1480 NaN -2.1384 2.9080
0.1049 -0.8880 NaN 0.8252
0.7223 0.1001 1.3546 1.3790
2.5855 -0.5445 NaN -1.0582
-0.6669 NaN NaN NaN
NaN -0.6003 0.1240 -0.2725
-0.0825 0.4900 1.4367 1.0984
-1.9330 0.7394 -1.9609 -0.2779
-0.4390 1.7119 -0.1977 0.7015
-1.7947 -0.1941 -1.2078 -2.0518
>> nancov(x) % Compute covariances after removing all NaN rows
1.2977 0.0520 1.6248 1.3540
0.0520 0.5359 -0.0967 0.3966
1.6248 -0.0967 2.2940 1.6071
1.3540 0.3966 1.6071 1.9358
>> nancov(x, 'pairwise') % Compute covariances pairwise, ignoring NaNs
1.9195 -0.5221 1.4491 -0.0424
-0.5221 0.7325 -0.1240 0.2917
1.4491 -0.1240 2.1454 0.2279
-0.0424 0.2917 0.2279 2.1305
If you don't have the statistics toolbox, we need to think harder - let me know!
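If the data ends up in Python instead of Matlab, pandas gives the pairwise behaviour out of the box: DataFrame.cov computes each covariance over the rows where both columns are non-NaN, much like nancov(x, 'pairwise'). A small sketch:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4))        # 10x4 matrix of random numbers
x.ravel()[rng.integers(0, 40, 10)] = np.nan  # knock out some random entries

df = pd.DataFrame(x)
# Each (i, j) entry uses only the rows where columns i and j are both
# non-NaN (pairwise deletion), so no whole-row dropping is needed.
print(df.cov())
```

For the listwise variant (drop every row containing any NaN, like plain nancov(x)), use `df.dropna().cov()` instead.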
