Aggregating the total count with JSON data - python-3.x

I have an API endpoint with the confirmed / recovered / tested counts for each state:
https://data.covid19india.org/v4/min/data.min.json
I would like to aggregate the total count of confirmed / recovered / tested for each state. What is the easiest way to achieve this?

To get the final results into pandas, we can proceed by adding this to the code:
import pandas as pd

columns = ('Confirmed', 'Deceased', 'Recovered', 'Tested')
Panda = pd.DataFrame(data=StateWiseData).T  # T for transpose
print(Panda)
The output will be:
confirmed deceased recovered tested
AN 7557 129 7420 0
AP 2003342 13735 1975448 9788047
AR 52214 259 50695 398545
AS 584434 5576 570847 326318
BR 725588 9649 715798 17107895
CH 65066 812 64213 652657
CT 1004144 13553 989728 338344
DL 1437334 25079 1411881 25142853
DN 10662 4 10620 72410
GA 173221 3186 169160 0
GJ 825302 10079 815041 10900176
HP 211746 3553 206094 481328
HR 770347 9667 760004 3948145
JH 347730 5132 342421 233773
JK 324295 4403 318838 139552
KA 2939767 37155 2882331 9791334
KL 3814305 19494 3631066 3875002
LA 20491 207 20223 110068
LD 10309 51 10194 234256
MH 6424651 135962 6231999 8421643
ML 74070 1281 69859 0
MN 111212 1755 105751 13542
MP 792101 10516 781499 3384824
MZ 52472 200 46675 0
NL 29589 610 27151 116359
OR 1001698 7479 986334 2774807
PB 600266 16352 583426 2938477
PY 122934 1808 120330 567923
RJ 954023 8954 944917 5852578
SK 29340 367 27185 0
TG 654989 3858 644747 0
TN 2600885 34709 2547005 4413963
TR 82092 784 80150 607962
TT 0 0 0 0
UP 1709119 22792 1685954 23724581
UT 342749 7377 329006 2127358
WB 1543496 18371 1515789 0
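If a single nationwide total is also wanted, the per-state frame can be summed column-wise (a sketch against the Panda frame built above; TT is all zeros in this output, so it does not skew the sums):
print(Panda.sum())  # column-wise totals across all state rows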

Yes, my interpretation was incorrect earlier. We have to take each district's totals and add them up.
import json

# load the downloaded JSON file
file = open('data.min.json')
dictionary = json.load(file)

stateCodes = ['AN', 'AP', 'AR', 'AS', 'BR', 'CH', 'CT', 'DL', 'DN', 'GA', 'GJ', 'HP', 'HR', 'JH', 'JK', 'KA', 'KL', 'LA', 'LD', 'MH', 'ML', 'MN', 'MP', 'MZ', 'NL', 'OR', 'PB', 'PY', 'RJ', 'SK', 'TG', 'TN', 'TR', 'TT', 'UP', 'UT', 'WB']

StateWiseData = {}
for state in stateCodes:
    StateInfo = dictionary[state]
    Confirmed = 0
    Recovered = 0
    Tested = 0
    Deceased = 0
    StateData = {}
    if "districts" in StateInfo:
        for District in StateInfo['districts']:
            DistrictInfo = StateInfo['districts'][District]['total']
            # the type checks skip malformed (non-numeric) entries
            if 'confirmed' in DistrictInfo:
                if type(Confirmed) == type(DistrictInfo['confirmed']):
                    Confirmed += DistrictInfo['confirmed']
            if 'recovered' in DistrictInfo:
                if type(Recovered) == type(DistrictInfo['recovered']):
                    Recovered += DistrictInfo['recovered']
            if 'tested' in DistrictInfo:
                if type(Tested) == type(DistrictInfo['tested']):
                    Tested += DistrictInfo['tested']
            if 'deceased' in DistrictInfo:
                if type(Deceased) == type(DistrictInfo['deceased']):
                    Deceased += DistrictInfo['deceased']
    StateData['confirmed'] = Confirmed
    StateData['deceased'] = Deceased
    StateData['recovered'] = Recovered
    StateData['tested'] = Tested
    StateWiseData[state] = StateData

print(StateWiseData)
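If you'd rather pull the JSON straight from the endpoint than from a saved file, a minimal sketch with requests (assuming the endpoint above is still live):
import requests

response = requests.get('https://data.covid19india.org/v4/min/data.min.json')
response.raise_for_status()   # fail loudly on an HTTP error
dictionary = response.json()  # same structure as json.load(file) above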

Related

Pandas appends duplicate rows even though if statement shouldn't be triggered

I have many csv files that only have one row of data. I need to take data from two of the cells and put them into a master csv file ('new_gal.csv'). Initially, the master file will contain only the headings and no data.
# The file I am pulling from:
file_name = "N4261_pacs160.csv"
# I have the code written to separate gal_name, cat_name, and cat_num (N4261, pacs, 160)
An example of the csv is given below. I am trying to pull "flux" and "rms" from this file.
name, band, ra, dec, raerr, decerr, flux, snr, snrnoise, stn, rms, strn, fratio, fwhmxfit, fwhmyfit, flag_elong, edgeflag, flag_blend, warmat, obsid, ssomapflag, dist, angle
HPPSC160A_J121923.1+054931, red, 184.846389, 5.8254, 0.000151, 0.00015, 227.036, 10.797, 21.028, 16.507, 13.754, 37.448, 1.074, 15.2, 11, 0.7237, f, 0, f, 1342199758, f, 1.445729, 296.577621
I read this csv and pull the data I need
with open(file_name, 'r') as table:
    reader = csv.reader(table, delimiter=',')
    read = iter(reader)
    next(read)  # skip the header row
    for row in read:
        fluxP = row[6]
        errP = row[10]
# Open the master csv with pandas
df = pd.read_csv('new_gal.csv')
The master csv file has format:
Galaxy Cluster Mult. Detect. LumDist z W1 W1 err W2 W2 err W3 W3 err W4 W4 err 70 70 err 100 100 err 160 160 err 250 250 err 350 350 err 500 500 err
The main problem is that I want to search the "Galaxy" column in 'new_gal.csv' for the galaxy name. If it is not there, I need to add a new row with the galaxy name and the flux and error measurement. When I run this multiple times, I get duplicate rows even though the append command is nested in the if statement. I only want it to append a new row if the galaxy name is not already there; otherwise, it should only update the flux and error values for that galaxy.
if cat_name == 'pacs':
    if gal_name not in df["Galaxy"]:
        df = df.append({"Galaxy": gal_name}, ignore_index=True)
        if cat_num == "70":
            df.loc[df.Galaxy == gal_name, ["70"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["70 err"]] = errP
        elif cat_num == "100":
            df.loc[df.Galaxy == gal_name, ["100"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["100 err"]] = errP
        elif cat_num == "160":
            df.loc[df.Galaxy == gal_name, ["160"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["160 err"]] = errP
    else:
        if cat_num == "70":
            df.loc[df.Galaxy == gal_name, ["70"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["70 err"]] = errP
        elif cat_num == "100":
            df.loc[df.Galaxy == gal_name, ["100"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["100 err"]] = errP
        elif cat_num == "160":
            df.loc[df.Galaxy == gal_name, ["160"]] = fluxP
            df.loc[df.Galaxy == gal_name, ["160 err"]] = errP
After running the code 5 times with the same file, I have 5 identical lines in the table.
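As an aside, the duplicates are consistent with how membership tests work on a pandas Series: in checks the index labels, not the values, so gal_name not in df["Galaxy"] can be True even when the name is already in the column. A minimal demonstration (illustrative values):
import pandas as pd

df = pd.DataFrame({"Galaxy": ["N4261", "N1000"]})
print("N4261" in df["Galaxy"])         # False: tests the index (0, 1), not the values
print("N4261" in df["Galaxy"].values)  # True: tests the actual column values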
I think I've got something that'll work after tinkering with it this morning...
A couple of points... You shouldn't build a DataFrame incrementally in pandas; get the data setup done externally, then do one build. In what I have below, I'm building a big dictionary from the small csv files and then using merge to put that together with the master file.
If your .csv files aren't formatted properly, you can either try to replace the split character below or switch over to the csv reader, which is a bit more powerful.
You should put all of the smaller .csv files in a folder called 'orig_data' to make this work.
Main program:
# galaxy compiler
import os, re
import pandas as pd

# folder location for the small .csvs, NOT the master
data_folder = 'orig_data'  # this folder should be in same directory as program

result = {}
splitter = r'(.+)_([a-zA-Z]+)([0-9]+)\.'  # regex to break up file name into 3 groups

for file in os.listdir(data_folder):
    file_data = {}
    # split up the filename and process
    galaxy, cat_name, cat_num = re.match(splitter, file).groups()
    # print(galaxy, cat_name, cat_num)
    with open(os.path.join(data_folder, file), 'r') as src:
        src.readline()  # read the header and disregard it
        data = src.readline().replace(' ', '').strip().split(',')  # you can change the split char
        flux = float(data[2])
        rms = float(data[3])
        err_tag = cat_num + ' err'
        file_data = {'cat_name': cat_name,
                     cat_num: flux,
                     err_tag: rms}
    result[galaxy] = file_data

df2 = pd.DataFrame.from_dict(result, orient='index')
df2.index.rename('galaxy', inplace=True)
# check the resulting build!
# print(df2)

# build master dataframe
master_df = pd.read_csv('master_data.csv')
# print(master_df.head())

# merge the 2 dataframes on galaxy name. See the docs on merge for other
# options and whether you want an "outer" join or other type of join...
master_df = master_df.merge(df2, how='outer', on='galaxy')

# convert boolean flags properly
conv = {'t': True, 'f': False}
master_df['flag_nova'] = master_df['flag_nova'].map(conv).astype('bool')

print(master_df)
print()
print(master_df.info())
print()
print(master_df.describe())
Example data files in the orig_data folder:
filename: A99_dbc100.csv
band,weight,flux,rms
junk, 200.44,2e5,2e-8
filename: B250_pacs100.csv
band,weight,flux,rms
nada,2.44,19e-5, 74
...etc.
Example master csv:
galaxy,color,stars,flag_nova
A99,red,15,f
B250,blue,4e20,t
N1000,green,3e19,f
X99,white,12,t
Result:
galaxy color stars ... 200 err 100 100 err
0 A99 red 1.500000e+01 ... NaN 200000.00000 2.000000e-08
1 B250 blue 4.000000e+20 ... NaN 0.00019 7.400000e+01
2 N1000 green 3.000000e+19 ... 88.0 NaN NaN
3 X99 white 1.200000e+01 ... NaN NaN NaN
[4 rows x 9 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 9 columns):
galaxy 4 non-null object
color 4 non-null object
stars 4 non-null float64
flag_nova 4 non-null bool
cat_name 3 non-null object
200 1 non-null float64
200 err 1 non-null float64
100 2 non-null float64
100 err 2 non-null float64
dtypes: bool(1), float64(5), object(3)
memory usage: 292.0+ bytes
None
stars 200 200 err 100 100 err
count 4.000000e+00 1.0 1.0 2.000000 2.000000e+00
mean 1.075000e+20 1900000.0 88.0 100000.000095 3.700000e+01
std 1.955121e+20 NaN NaN 141421.356103 5.232590e+01
min 1.200000e+01 1900000.0 88.0 0.000190 2.000000e-08
25% 1.425000e+01 1900000.0 88.0 50000.000143 1.850000e+01
50% 1.500000e+19 1900000.0 88.0 100000.000095 3.700000e+01
75% 1.225000e+20 1900000.0 88.0 150000.000048 5.550000e+01
max 4.000000e+20 1900000.0 88.0 200000.000000 7.400000e+01
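One pandas detail worth noting: merge(on='galaxy') works above even though 'galaxy' is only df2's index, because merge accepts index level names as join keys in modern pandas. If you prefer an explicit column-on-column join, an equivalent sketch:
master_df = master_df.merge(df2.reset_index(), how='outer', on='galaxy')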

How do I remove square brackets from my dataframe?

I'm trying to create a comparison between my predicted and actual values.
Here is my try:
from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(df[['Op1', 'Op2', 'S2', 'S3', 'S4', 'S7', 'S8', 'S9', 'S11', 'S12', 'S13', 'S14', 'S15', 'S17', 'S20', 'S21']], df.unit)

predicted = []
actual = []
for i in range(1, len(df.unit.unique())):
    xp = df[(df.unit == i) & (df.cycles == len(df[df.unit == i].cycles))]
    xa = xp.cycles.values
    xp = xp.values[0, 2:].reshape(1, -1)  # -1 lets numpy infer the second dimension
    predicted.append(reg.predict(xp))
    actual.append(xa)
and to display the dataframe:
data = {'Actual cycles': actual, 'Predicted cycles': predicted }
df_2 = pd.DataFrame(data)
df_2.head()
I will get an output:
Actual cycles Predicted cycles
0 [192] [56.7530579842869]
1 [287] [50.76877712361329]
2 [179] [42.72575900074571]
3 [189] [42.876506912637524]
4 [269] [47.40087182743173]
Ignoring the values that are way off, how do I remove the square brackets in the dataframe? And is there a neater way to write my code? Thank you!
print(df_2)
Actualcycles Predictedcycles
0 [192] [56.7530579842869]
1 [287] [50.76877712361329]
2 [179] [42.72575900074571]
3 [189] [42.876506912637524]
4 [269] [47.40087182743173]
df = df_2.apply(lambda x: x.str.strip('[]'))
print(df)
Actualcycles Predictedcycles
0 192 56.7530579842869
1 287 50.76877712361329
2 179 42.72575900074571
3 189 42.876506912637524
4 269 47.40087182743173
Below is a minimal example of your cycles column with brackets:
import pandas as pd

df = pd.DataFrame({
    'cycles': [[192], [287], [179], [189], [269]]
})
This code gets you the column without brackets:
df['cycles'] = df['cycles'].str[0]
The output looks like this:
print(df)
cycles
0 192
1 287
2 179
3 189
4 269
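As for a neater way: the brackets can also be avoided at the source by appending scalars instead of length-1 arrays in the prediction loop (a sketch against the loop above; reg.predict returns an array, so take its first element):
predicted.append(reg.predict(xp)[0])  # scalar instead of a length-1 array
actual.append(xa[0])                  # likewise, assuming xa is non-empty
With scalars in the lists, df_2 holds plain numbers and no stripping is needed.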

Add columns to pandas data frame with for-loop

The code block below produces this table:
Trial Week Branch Num_Dep Tot_dep_amt
1 1 1 4 4200
1 1 2 7 9000
1 1 3 6 4800
1 1 4 6 5800
1 1 5 5 3800
1 1 6 4 3200
1 1 7 3 1600
. . . . .
. . . . .
1 1 8 5 6000
9 19 40 3 2800
Code:
trials = 10
dep_amount = []
branch = 41
total = []
week = 1
week_num = []
branch_num = []
dep_num = []
trial_num = []
weeks = 20
df = pd.DataFrame()
for a in range(1, trials):
    print("Starting trial", a)
    for b in range(1, weeks):
        for c in range(1, branch):
            depnum = int(np.round(np.random.normal(5, 2, 1) / 1) * 1)
            acc_dep = 0
            for d in range(1, depnum):
                dep_amt = int(np.round(np.random.normal(1200, 400, 1) / 200) * 200)
                acc_dep = acc_dep + dep_amt
            temp = pd.DataFrame.from_records([{'Trial': a, 'Week': b, 'branch': c, 'Num_Dep': depnum, 'Tot_dep_amt': acc_dep}])
            df = pd.concat([df, temp])
df = df[['Trial', 'Week', 'branch', 'Num_Dep', 'Tot_dep_amt']]
df = df.reset_index()
df = df.drop('index', axis=1)
I would like to break the branches apart in the for-loop and instead have the resulting df use headers like:
Trial Week Branch_1_Num_Dep Branch_1_Tot_dep_amount Branch_2_Num_Dep ... etc.
I know this could be done by generating the DF and then performing an encoding, but for this task I would like it to be generated inside the for loop if possible.
In order to achieve this with minimal changes to your code, you can do something like the following:
df = pd.DataFrame()
for a in range(1, trials):
    print("Starting trial", a)
    for b in range(1, weeks):
        records = {'Trial': a, 'Week': b}
        for c in range(1, branch):
            depnum = int(np.round(np.random.normal(5, 2, 1) / 1) * 1)
            acc_dep = 0
            for d in range(1, depnum):
                dep_amt = int(np.round(np.random.normal(1200, 400, 1) / 200) * 200)
                acc_dep = acc_dep + dep_amt
            records['Branch_{}_Num_Dep'.format(c)] = depnum
            records['Branch_{}_Tot_dep_amount'.format(c)] = acc_dep
        temp = pd.DataFrame.from_records([records])
        df = pd.concat([df, temp])
df = df.reset_index()
df = df.drop('index', axis=1)
Overall, it seems that what you are doing can be done in more elegant and faster ways. I would recommend taking a look at vectorization as a concept (e.g. here).
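For instance, the random draws can be generated in bulk with numpy and the wide frame assembled in one pass. A sketch, assuming the loop bounds above (9 trials, 19 weeks, 40 branches) and the same rounding rules; all names here are illustrative:
import numpy as np
import pandas as pd

trials, weeks, branches = 9, 19, 40  # mirrors range(1, 10), range(1, 20), range(1, 41)
rng = np.random.default_rng()

# one deposit-count draw per (trial, week, branch)
depnum = np.rint(rng.normal(5, 2, size=(trials, weeks, branches))).astype(int)

# draw enough per-deposit amounts for the largest count, then mask the excess;
# the original inner loop runs depnum - 1 times, hence the "- 1" below
max_dep = max(depnum.max(), 1)
amounts = np.rint(rng.normal(1200, 400, size=(trials, weeks, branches, max_dep)) / 200) * 200
mask = np.arange(max_dep) < (depnum - 1)[..., None]
tot = (amounts * mask).sum(axis=-1).astype(int)

# assemble the wide table: one row per (trial, week), two columns per branch
t_idx, w_idx = np.meshgrid(np.arange(1, trials + 1), np.arange(1, weeks + 1), indexing='ij')
data = {'Trial': t_idx.ravel(), 'Week': w_idx.ravel()}
for c in range(branches):
    data['Branch_{}_Num_Dep'.format(c + 1)] = depnum[:, :, c].ravel()
    data['Branch_{}_Tot_dep_amount'.format(c + 1)] = tot[:, :, c].ravel()
df = pd.DataFrame(data)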

Cannot plot dataframe as barh because TypeError: Empty 'DataFrame': no numeric data to plot

I have been all over this site and Google trying to solve this problem.
It appears as though I'm missing a fundamental concept in making a plottable dataframe.
I've tried to ensure that I have a column of strings for "Teams" and a column of ints for "Points".
Still I get: TypeError: Empty 'DataFrame': no numeric data to plot
import csv
import pandas
import numpy
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter

set_of_teams = set()

def load_epl_games(file_name):
    with open(file_name, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        raw_data = {"HomeTeam": [], "AwayTeam": [], "FTHG": [], "FTAG": [], "FTR": []}
        for row in reader:
            set_of_teams.add(row["HomeTeam"])
            set_of_teams.add(row["AwayTeam"])
            raw_data["HomeTeam"].append(row["HomeTeam"])
            raw_data["AwayTeam"].append(row["AwayTeam"])
            raw_data["FTHG"].append(row["FTHG"])
            raw_data["FTAG"].append(row["FTAG"])
            raw_data["FTR"].append(row["FTR"])
    data_frame = pandas.DataFrame(data=raw_data)
    return data_frame

def calc_points(team, table):
    points = 0
    for row_number in range(table["HomeTeam"].count()):
        home_team = table.loc[row_number, "HomeTeam"]
        away_team = table.loc[row_number, "AwayTeam"]
        if team in [home_team, away_team]:
            home_team_points = 0
            away_team_points = 0
            winner = table.loc[row_number, "FTR"]
            if winner == 'H':
                home_team_points = 3
            elif winner == 'A':
                away_team_points = 3
            else:
                home_team_points = 1
                away_team_points = 1
            if team == home_team:
                points += home_team_points
            else:
                points += away_team_points
    return points

def get_goals_scored_conceded(team, table):
    scored = 0
    conceded = 0
    for row_number in range(table["HomeTeam"].count()):
        home_team = table.loc[row_number, "HomeTeam"]
        away_team = table.loc[row_number, "AwayTeam"]
        if team in [home_team, away_team]:
            if team == home_team:
                scored += int(table.loc[row_number, "FTHG"])
                conceded += int(table.loc[row_number, "FTAG"])
            else:
                scored += int(table.loc[row_number, "FTAG"])
                conceded += int(table.loc[row_number, "FTHG"])
    return (scored, conceded)

def compute_table(df):
    raw_data = {"Team": [], "Points": [], "GoalDifference": [], "Goals": []}
    for team in set_of_teams:
        goal_data = get_goals_scored_conceded(team, df)
        raw_data["Team"].append(team)
        raw_data["Points"].append(calc_points(team, df))
        raw_data["GoalDifference"].append(goal_data[0] - goal_data[1])
        raw_data["Goals"].append(goal_data[0])
    data_frame = pandas.DataFrame(data=raw_data)
    data_frame = data_frame.sort_values(["Points", "GoalDifference", "Goals"], ascending=[False, False, False]).reset_index(drop=True)
    data_frame.index = numpy.arange(1, len(data_frame) + 1)
    data_frame.index.names = ["Finish"]
    return data_frame

def get_finish(team, table):
    return table[table.Team == team].index.item()

def get_points(team, table):
    return table[table.Team == team].Points.item()

def display_hbar(tables):
    raw_data = {"Team": [], "Points": []}
    for row_number in range(tables["Team"].count()):
        raw_data["Team"].append(tables.loc[row_number + 1, "Team"])
        raw_data["Points"].append(int(tables.loc[row_number + 1, "Points"]))
    df = pandas.DataFrame(data=raw_data)
    # df = pandas.DataFrame(tables, columns=["Team", "Points"])
    print(df)
    print(df.dtypes)
    df["Points"].apply(int)
    print(df.dtypes)
    df.plot(kind='barh', x='Points', y='Team')

games = load_epl_games('epl2016.csv')
final_table = compute_table(games)
# print(final_table)
# print(get_finish("Tottenham", final_table))
# print(get_points("West Ham", final_table))
display_hbar(final_table)
The output:
Team Points
0 Chelsea 93
1 Tottenham 86
2 Man City 78
3 Liverpool 76
4 Arsenal 75
5 Man United 69
6 Everton 61
7 Southampton 46
8 Bournemouth 46
9 West Brom 45
10 West Ham 45
11 Leicester 44
12 Stoke 44
13 Crystal Palace 41
14 Swansea 41
15 Burnley 40
16 Watford 40
17 Hull 34
18 Middlesbrough 28
19 Sunderland 24
Team object
Points int64
dtype: object
Team object
Points int64
dtype: object
Traceback (most recent call last):
File "C:/Users/Michael/Documents/Programming/Python/Premier League.py", line 99, in <module>
display_hbar(final_table)
File "C:/Users/Michael/Documents/Programming/Python/Premier League.py", line 92, in display_hbar
df.plot(kind='barh',x='Points',y='Team')
File "C:\Program Files (x86)\Python36-32\lib\site- packages\pandas\plotting\_core.py", line 2941, in __call__
sort_columns=sort_columns, **kwds)
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 1977, in plot_frame
**kwds)
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 1804, in _plot
plot_obj.generate()
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 258, in generate
self._compute_plot_data()
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 373, in _compute_plot_data
'plot'.format(numeric_data.__class__.__name__))
TypeError: Empty 'DataFrame': no numeric data to plot
What am I doing wrong in my display_hbar function that is preventing me from plotting my data?
Here is the csv file
df.plot(x = "Team", y="Points", kind="barh");
You should swap x and y in df.plot(...). Because y must be numeric according to the pandas documentation.
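Applied to the display_hbar function above, the final line would become (a sketch; plt is already imported in the question's code):
df.plot(kind='barh', x='Team', y='Points')  # Team labels the bars, Points supplies the numbers
plt.show()  # required when running as a plain script rather than interactively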

Extracting data from web page to CSV file, only last row saved

I'm faced with the following challenge: I want to get all the financial data about companies, and I wrote code that does it. Let's say the result looks like the one below:
Unnamed: 0 I Q 2017 II Q 2017 \
0 Przychody netto ze sprzedaży (tys. zł) 137 134
1 Zysk (strata) z działal. oper. (tys. zł) -423 -358
2 Zysk (strata) brutto (tys. zł) -501 -280
3 Zysk (strata) netto (tys. zł)* -399 -263
4 Amortyzacja (tys. zł) 134 110
5 EBITDA (tys. zł) -289 -248
6 Aktywa (tys. zł) 27 845 26 530
7 Kapitał własny (tys. zł)* 22 852 22 589
8 Liczba akcji (tys. szt.) 13 921,975 13 921,975
9 Zysk na akcję (zł) -0029 -0019
10 Wartość księgowa na akcję (zł) 1641 1623
11 Raport zbadany przez audytora N N
but there are 464 more results like this one.
Unfortunately, when I want to save all 464 results in one CSV file, only the last result gets saved, not all 464... Could you help me save all of them? Below is my code.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.bankier.pl/gielda/notowania/akcje'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')

# Find the first table on the page
t = soup.find_all('table')[0]

# Read the table into a Pandas DataFrame
df = pd.read_html(str(t))[0]

# get the company names
names_of_company = df["Walor AD"].values

links_to_financial_date = []
# build all links with the names of companies
links = []
for i in range(len(names_of_company)):
    new_string = 'https://www.bankier.pl/gielda/notowania/akcje/' + names_of_company[i] + '/wyniki-finansowe'
    links.append(new_string)

############################################################################
for i in links:
    url2 = f'https://www.bankier.pl/gielda/notowania/akcje/{names_of_company[0]}/wyniki-finansowe'
    page2 = requests.get(url2)
    soup = BeautifulSoup(page2.content, 'lxml')
    # Find the first table on the page
    t2 = soup.find_all('table')[0]
    df2 = pd.read_html(str(t2))[0]
    df2.to_csv('output.csv', index=False, header=None)
You've almost got it. You're just overwriting your CSV each time. Replace
df2.to_csv('output.csv', index=False, header=None)
with
with open('output.csv', 'a') as f:
    df2.to_csv(f, header=False)
in order to append to the CSV instead of overwriting it.
Also, your example doesn't work because this:
for i in links:
    url2 = f'https://www.bankier.pl/gielda/notowania/akcje/{names_of_company[0]}/wyniki-finansowe'
should be:
for i in links:
    url2 = i
When the website has no data, skip it and move on to the next one:
try:
    t2 = soup.find_all('table')[0]
    df2 = pd.read_html(str(t2))[0]
    with open('output.csv', 'a') as f:
        df2.to_csv(f, header=False)
except:
    pass  # bare except: any page that fails to parse is silently skipped
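An alternative sketch that avoids repeated appends altogether: collect the frames in a list and write the CSV once at the end (assuming the same page structure as above):
frames = []
for link in links:
    page2 = requests.get(link)
    soup = BeautifulSoup(page2.content, 'lxml')
    tables = soup.find_all('table')
    if not tables:
        continue  # no data on this page; skip it
    frames.append(pd.read_html(str(tables[0]))[0])

if frames:
    pd.concat(frames, ignore_index=True).to_csv('output.csv', index=False)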
