How to loop through each row of Pandas DataFrame - python-3.x

I'm using an API to find information on smart contracts. Each Contract has a unique address and the information can be pulled by plugging it into an API link. Example:
'https://contract-api.dev/contracts/xxxxxxxxxxxxxxxxxxxxxxxxx'
I have a CSV with 1000 contract addresses. Example:
   Address
0  0xc02aaa39b223fe8d0a0e5c4f27ead9083c756cc2
1  0xa5409ec958c83c3f309868babaca7c86dcb077c1
2  0xdac17f958d2ee523a2206206994597c13d831ec7
3  0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48
4  0xa2327a938febf5fec13bacfb16ae10ecbc4cbdcf
...
And this code allows me to get exactly what I want for one row,
ADDRESS = df['Address'][0]
total = requests.get(f"https://contract-api.dev/contracts/{ADDRESS}", headers={'CF-Access-Client-Id': 'xxxxxx.access', 'CF-Access-Client-Secret': 'xxxxx'}).json()
total['deployment']['created_by_user']
Where the output is:
'0x4f26ffbe5f04ed43630fdc30a87638d53d0b0876'
I just need to find a way to loop through every row and insert the contract Address into the API link, retrieve the "created_by_user" address, then move to the next row.
What do you think the best way to do this is?

def retrieve_created_by_user(address_):
    return requests.get(f"https://contract-api.dev/contracts/{address_}", headers={'CF-Access-Client-Id': 'xxxxxx.access', 'CF-Access-Client-Secret': 'xxxxx'}).json()['deployment']['created_by_user']

for address in df['Address']:
    created_by_user = retrieve_created_by_user(address)
    ...
Or, since it will probably take a while, you can use a progress bar like tqdm:
from tqdm import tqdm

for address in tqdm(df['Address']):
    created_by_user = retrieve_created_by_user(address)
    ...
Inside the loop, do whatever you plan to do with that row's data.
If for some reason you want to be able to index the values (e.g. to restart from a manual index later) you can use list(df['Address']).
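If the goal is to keep each result alongside its address, one way is to collect the values into a list and assign it as a new column. This is a minimal sketch assuming the retrieve_created_by_user helper above and that every request succeeds (no error handling); the output file name is just an example:
from tqdm import tqdm

creators = []
for address in tqdm(df['Address']):
    creators.append(retrieve_created_by_user(address))

# One creator per address, in the same order as the rows
df['created_by_user'] = creators
df.to_csv('contracts_with_creators.csv', index=False)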

Related

Import Balance Sheet in an automatic organized manner from SEC to Dataframe

I am looking at getting the balance sheet data automatically and properly organized for any company, using Beautiful Soup.
I am not planning on extracting each variable individually, but rather the whole balance sheet. Originally, I was trying various pieces of code to extract the URL for a particular company of my choice.
For Example, suppose I want to get the Balance Sheet data from the following URL:
URL1:'https://www.sec.gov/Archives/edgar/data/1418121/000118518520000213/aple20191231_10k.htm'
or from
URL2:'https://www.sec.gov/Archives/edgar/data/1326801/000132680120000046/form8-k03312020earnings.htm'
I am trying to write a function (call it get_balancesheet(URL)) such that, regardless of the URL, you get a DataFrame that contains the balance sheet in an organized manner.
# Import libraries
import requests
import re
from bs4 import BeautifulSoup
I wrote the following function, which needs a lot of improvement:
def Get_Data_Balance_Sheet(url):
    page = requests.get(url)
    # Create a BeautifulSoup object
    soup = BeautifulSoup(page.content)
    futures1 = soup.find_all(text=re.compile('CONSOLIDATED BALANCE SHEETS'))
    Table = []
    for future in futures1:
        for row in future.find_next("table").find_all("tr"):
            t1 = [cell.get_text(strip=True) for cell in row.find_all("td")]
            Table.append(t1)
    # Remove list from list of lists if list is empty
    Table = [x for x in Table if x != []]
    return Table
Then I execute the following
url='https://www.sec.gov/Archives/edgar/data/1326801/000132680120000013/fb-12312019x10k.htm'
Tab=Get_Data_Balance_Sheet(url)
Tab
Note that this is not quite what I am aiming for. It is not simply a matter of putting the result into a DataFrame; the function needs to work so that, regardless of which URL is passed in, we get the balance sheet.
Well, this being EDGAR it's not going to be simple, but it's doable.
First things first - with the CIK you can extract specific filings of specific types made by the CIK filer during a specific period. So let's say you are interested in Forms 10-K and 10-Q, original or amended (as in "FORM 10-K/A", for example), filed by this CIK filer from 2019 through 2020.
start = 2019
end = 2020
cik = 220000320193
short_cik = str(cik)[-6:] #we will need it later to form urls
First we need to get a list of filings meeting these criteria and load it into beautifulsoup:
import requests
from bs4 import BeautifulSoup as bs
url = f"https://www.sec.gov/cgi-bin/srch-edgar?text=cik%3D%{cik}%22+AND+form-type%3D(10-q*+OR+10-k*)&first={start}&last={end}"
req = requests.get(url)
soup = bs(req.text,'lxml')
There are 8 filings meeting the criteria: two Form 10-K and 6 Form 10-Q. Each of these filings has an accession number. The accession number is hiding in the url of each of these filings and we need to extract it to get to the actual target - the Excel file which contains the financial statements which are attached to each specific filing.
acc_nums = []
for link in soup.select('td>a[href]'):
    target = link['href'].split(short_cik, 1)
    if len(target) > 1:
        acc_num = target[1].split('/')[1]
        if not acc_num in acc_nums: # we need this filter because each filing has two forms: text and html, with the same accession number
            acc_nums.append(acc_num)
At this point, acc_nums contains the accession number for each of these 8 filings. We can now download the target Excel file. Obviously, you can loop through acc_nums and download all 8, but let's say you are only looking for (randomly) the Excel file attached to the third filing:
fs_url = f"https://www.sec.gov/Archives/edgar/data/{short_cik}/{acc_nums[2]}/Financial_Report.xlsx"
fs = requests.get(fs_url)
with open('random_edgar.xlsx', 'wb') as output:
output.write(fs.content)
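And if you do want all of them rather than just one, a small loop over acc_nums does it; this is only a sketch, saving each workbook under its accession number:
for acc_num in acc_nums:
    fs_url = f"https://www.sec.gov/Archives/edgar/data/{short_cik}/{acc_num}/Financial_Report.xlsx"
    fs = requests.get(fs_url)
    # Name each file after its accession number so the downloads don't overwrite each other
    with open(f'{acc_num}.xlsx', 'wb') as output:
        output.write(fs.content)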
And there you'll have more than you'll ever want to know about Apple's financials at that point in time...
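To get from that workbook back toward the DataFrame the question asks for, one option is to load every sheet with pandas. A minimal sketch, assuming an Excel engine such as openpyxl is installed; sheet names vary from filing to filing, so which one holds the balance sheet has to be checked by hand:
import pandas as pd

# Read every sheet of the downloaded workbook into a dict of DataFrames
sheets = pd.read_excel('random_edgar.xlsx', sheet_name=None)

# Sheet names differ between filings; list them and pick the balance sheet
print(list(sheets.keys()))
balance_sheet = sheets[next(iter(sheets))]  # e.g. the first sheet; adjust as needed
print(balance_sheet.head())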

How to process the data returned from a function (Python 3.7)

Background:
My question should be relatively easy, however I am not able to figure it out.
I have written a function regarding queueing theory and it will be used for ambulance service planning. For example, how many calls for service can I expect in a given time frame.
The function takes two parameters; a starting value of the number of ambulances in my system starting at 0 and ending at 100 ambulances. This will show the probability of zero calls for service, one call for service, three calls for service….up to 100 calls for service. Second parameter is an arrival rate number which is the past historical arrival rate in my system.
The function runs and prints out the result to my screen. I have checked the math and it appears to be correct.
This is Python 3.7 with the Anaconda distribution.
My question is this:
I would like to process this data even further but I don’t know how to capture it and do more math. For example, I would like to take this list and accumulate the probability values. With an arrival rate of five, there is a cumulative probability of 61.56% of five or fewer calls for service, etc.
A second example of how I would like to process this data is to format it as percentages and write out a text file
A third example would be to process the cumulative probabilities and exclude any values higher than the 99% cumulative value (because these vanish into extremely small numbers).
A fourth example would be to create a bar chart showing the probability of n calls for service.
These are some of the things I want to do with the queueing theory calculations. And there are a lot more. I am planning on writing a larger application. But I am stuck at this point. The function writes an output into my Python 3.7 console. How do I “capture” that output as an object or something and perform other processing on the data?
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import math
import csv

def probability_x(start_value = 0, arrival_rate = 0):
    probability_arrivals = []
    while start_value <= 100:
        probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
        print(probability_arrivals)
        start_value = start_value + 1
    return probability_arrivals

#probability_x(arrival_rate = 5, x = 5)
#The code written above prints to the console, but my goal is to take the returned values and make other calculations.
#How do I 'capture' this data for further processing is where I need help (for example, bar plots, cumulative frequency, etc )

#failure. TypeError: writerows() argument must be iterable.
with open('ExpectedProbability.csv', 'w') as writeFile:
    writer = csv.writer(writeFile)
    for value in probability_x(arrival_rate = 5):
        writer.writerows(value)
writeFile.close()

#Failure. Why does it return 2. Yes there are two columns but I was expecting 101 as the length because that is the end of my loop.
print(len(probability_x(arrival_rate = 5)))
The problem is, when you write
probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
You're overwriting the previous contents of probability_arrivals. Everything that it held previously is lost.
Instead of using = to reassign probability_arrivals, you want to append another entry to the list:
probability_arrivals.append([start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)])
I'll also note, your while loop can be improved. You're basically just looping over start_value until it reaches a certain value. A for loop would be more appropriate here:
for s in range(start_value, 101):  # The end value is exclusive, so it's 101, not 100
    probability_arrivals.append([s, math.pow(arrival_rate, s) * math.pow(math.e, -arrival_rate) / math.factorial(s)])
    print(probability_arrivals[-1])
Now you don't need to manually worry about incrementing the counter.
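Putting the two fixes together, here is a sketch of the full function plus some of the follow-on processing the question mentions (accumulating probabilities and writing percentages to a file). The file name, column headers, and percentage formatting are just examples:
import csv
import math

def probability_x(start_value=0, arrival_rate=0):
    probability_arrivals = []
    for s in range(start_value, 101):
        p = math.pow(arrival_rate, s) * math.exp(-arrival_rate) / math.factorial(s)
        probability_arrivals.append([s, p])
    return probability_arrivals

rows = probability_x(arrival_rate=5)   # a list of [n_calls, probability] pairs
print(len(rows))                       # 101, one entry per value of n

# Accumulate the probabilities as a running total
cumulative = []
running = 0.0
for n, p in rows:
    running += p
    cumulative.append([n, p, running])

# Write the probabilities out as percentages
with open('ExpectedProbability.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['calls', 'probability', 'cumulative'])
    for n, p, c in cumulative:
        writer.writerow([n, f'{p:.4%}', f'{c:.4%}'])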

Using List Comprehension to look up single values in a Dataframe

I'm trying to essentially filter values in a dataframe which looks like this.
The user will pass in a list of Symbols and I'd like to return the corresponding MarketCaps in a list, for example. I know you can use .isin() to return a series and then convert the values to a list, but I'm wondering if there isn't a better way to do this.
Shouldn't my list comprehension attempt be able to accomplish this?
I've tried this:
tickers = pd.read_csv('NASDAQcompanylist.csv')
stocks_list = ['AAPL','GOOG']
print(tickers.head())
x=stocks_list
y= tickers[tickers['Symbol']==stocks_list]['MarketCap']
I've also tried:
y=[tickers['Symbol'][stock]['MarketCap'] for stock in stocks_list]
Expected output for y would be:
(market caps of AAPL and GOOG)
[85436200000000,7001920000000]
Dataframe head is here.
Symbol ... Industry
0 ABMD ... Medical/Dental Instruments
1 ATVI ... Computer Software: Prepackaged Software
2 ADBE ... Computer Software: Prepackaged Software
3 AMD ... Semiconductors
4 AGNC ... Real Estate Investment Trusts
You can set Symbol as your index (which makes sense, since it's a unique identifier), then use .loc to look up each row.
try:
tickers = pd.read_csv('NASDAQcompanylist.csv').set_index('Symbol')
stocks_list = ['AAPL','GOOG']
y=[tickers.loc[stock,'MarketCap'] for stock in stocks_list]
EDIT: after a comment, here is also a solution using .isin() (note this works on the original frame, where Symbol is still a column rather than the index):
y = tickers[tickers.Symbol.isin(stocks_list)]["MarketCap"].tolist()
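One difference worth noting between the two: .loc returns values in the order of stocks_list, while the .isin() filter returns them in the order the rows appear in the CSV. A minimal sketch with a toy stand-in for NASDAQcompanylist.csv (market caps taken from the question's expected output):
import pandas as pd

# Toy stand-in for the CSV, with GOOG listed before AAPL
tickers = pd.DataFrame({
    'Symbol': ['GOOG', 'AAPL'],
    'MarketCap': [7001920000000, 85436200000000],
})
stocks_list = ['AAPL', 'GOOG']

tickers_by_symbol = tickers.set_index('Symbol')
by_loc = [tickers_by_symbol.loc[stock, 'MarketCap'] for stock in stocks_list]
by_isin = tickers[tickers['Symbol'].isin(stocks_list)]['MarketCap'].tolist()

print(by_loc)   # [85436200000000, 7001920000000] -- follows stocks_list order
print(by_isin)  # [7001920000000, 85436200000000] -- follows row order in the frame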

Pandas - iterating to fill values of a dataframe

I'm trying to build a data-frame of time series data. I have to retrieve the data from an API and every (i,j) entry in the data-frame (where "i" is the row and "j" is the column) has to be iterated through and filled individually.
Here's an idea of the type of thing I'm trying to do (note the API I'm using doesn't have historical data for what I'm trying to analyze):
import pandas as pd
import numpy as np
import time
def retrievedata(string):
    # take string
    # do some stuff with the api
    # return a float
    ...

label_list = ['label1', 'label1', 'label1']  # etc...
discrete_points = 720
df = pd.DataFrame(index=np.arange(0, discrete_points), columns=(i for i in label_list))
So at this point I've pre-allocated a data frame. What comes next is the issue.
Now, I want to iterate over it and assign values to every (i, j) entry in the data frame based on a function I wrote to pull data. Note that the function I wrote has to be specific to a certain column (as it takes the column label as input). On top of that, each row will have different values because it is time-series data.
EDIT: Yuck, I found a gross way to make it work:
for row in range(discrete_points):
    for label in label_list:
        df.at[row, label] = retrievedata(label)
This is obviously a non-pythonic, non-numpy, non-pandas way of doing things. So I'd like to find a nicer, more efficient, and less computationally intensive way of doing this.
I'm assuming it's going to have to be some combination of df.iterrows(), df.itertuples(), df.loc[] and df.at[].
I'm stumped though.
Any ideas?
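For what it's worth, one common pattern is to build each column with a comprehension and construct the frame in one go, rather than writing cell by cell with .at. This is only a sketch, since retrievedata() above is a placeholder for the real API call, and it relies on label_list and discrete_points as defined in the question:
import numpy as np
import pandas as pd

# Build the whole frame at once: one list of values per label/column.
# Note this calls the API column-by-column rather than row-by-row, which
# matters if each value is time-stamped at the moment of the call.
data = {label: [retrievedata(label) for _ in range(discrete_points)]
        for label in label_list}
df = pd.DataFrame(data, index=np.arange(0, discrete_points))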

How to reshape/recast data in Excel from wide- to long-format

I want to reshape my data in Excel, which is currently in "wide" format, into "long" format. Each variable (column name) corresponds to a tenure, race, and cost burden. I want to more easily put these data into a pivot table, but I'm not sure how to do this. Any ideas out there?
FYI, the data are HUD CHAS (Department of Housing and Urban Development, Comprehensive Housing Affordability Strategy), which has over 20 tables that would need to be reshaped.
There is a simple R script that will help with this. The function accepts the path to your csv file and the number of header variables you have. In the example image/data I provided, there are 7 header variables. That is, the actual data (T9_est1) starts on the 8th column.
# Use the command below if you do not have the tidyverse package installed.
# install.packages("tidyverse")
library(tidyverse)
read_data_long <- function(path_to_csv, header_vars) {
  data_table <- read_csv(path_to_csv)
  fields_to_melt <- names(data_table[, as.numeric(header_vars + 1):ncol(data_table)])
  melted <- gather(data_table, fields_to_melt, key = 'variable', value = 'values')
  return(melted)
}
# Change the file path to where your data is and where you want it written to.
# Also change "7" to the number of header variables your data has.
melted_data <- read_data_long("path_to_input_file.csv", 7)
write_csv(melted_data, "new_path_to_melted_file.csv")
(Updated 7/25/18 with a more elegant solution; Again 9/28/18 with small change.)
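If you would rather stay in Python/pandas, pd.melt does the same wide-to-long reshape. A minimal sketch under the same assumption that the first 7 columns are header variables and everything after them is data; the file paths are placeholders:
import pandas as pd

def read_data_long(path_to_csv, header_vars):
    data_table = pd.read_csv(path_to_csv)
    id_cols = list(data_table.columns[:header_vars])      # the header variables
    value_cols = list(data_table.columns[header_vars:])   # everything else gets melted
    return data_table.melt(id_vars=id_cols, value_vars=value_cols,
                           var_name='variable', value_name='values')

melted_data = read_data_long('path_to_input_file.csv', 7)
melted_data.to_csv('new_path_to_melted_file.csv', index=False)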
