Having trouble appending dataframes - python-3.x

I would like to access and edit individual dataframes after creating them by a for loop.
#Let's get those files!!!
bdc_files = {'nwhl': 'https://raw.githubusercontent.com/bigdatacup/Big-Data-Cup-2021/main/hackathon_nwhl.csv',
'olympics': 'https://raw.githubusercontent.com/bigdatacup/Big-Data-Cup-2021/main/hackathon_womens.csv',
'erie': 'https://raw.githubusercontent.com/bigdatacup/Big-Data-Cup-2021/main/hackathon_scouting.csv'}
df_list = []
for (a,b) in bdc_files.items():
#Grab csv file
c = pd.read_csv(b)
c.name = a
#a = a.append(c)
#Manipuling the Data as we please
c['Game_ID'] = c['game_date'] + c['Home Team'] + c['Away Team']
c['Detail 3'] = c['Detail 3'].replace('t', 'with traffic')
c['Detail 3'] = c['Detail 3'].replace('f', 'without traffic')
c['Detail 4'] = c['Detail 4'].replace('t', 'one-timer')
c['Detail 4'] = c['Detail 4'].replace('f', 'not one-timer')
c['Details'] = c['Detail 1'].astype(str).add(' ').add(c['Detail 2'].astype(str)).add(' ').add(c['Detail 3'].astype(str)).add(' ').add(c['Detail 4'].astype(str))
c['is_goal'] = 0
c['is_shot'] = 0
c.loc[c['Event'] == 'Shot', 'is_shot'] = 1
c.loc[c['Event'] == 'Goal', 'is_goal'] = 1
c['Goal Differential'] = c['Home Team Goals'] - c['Away Team Goals']
c['Clock'] = pd.to_datetime(c['Clock'], format = '%M:%S')
c['Seconds Remaining'] = ((c['Clock'].dt.minute)*60) + (c['Clock'].dt.second)
df_list.append(a)
#Printing Datasheet info
Title = "The sample of games from the {}".format(c.name)
print(c.name)
print(Title + " dataset is:", len(list(c['Game_ID'].value_counts())))
print(c['Event'].value_counts())
print(c.columns.values)
print(c.loc[c['Event'] == 'Shot', 'Details'].value_counts())
print(c.head())
print(c.info())
print(df_list)
print(nwhl)
However, if I want to print the nwhl database, I get the following output...
Empty DataFrame
Columns: []
Index: []
And if I were to use an append, I would get this error
AttributeError: 'str' object has no attribute 'append'
Long story short, based off of the code I have, how can I be able to print and perform other tasks with the dataframes outside of the for loop? Any assistance is truly appreciated.

Use a dictionary of dataframes, df_dict:
Add
df_dict = {}
...
for (a,b) in bdc_files.items():
#Grab csv file
c = pd.read_csv(b)
c.name = a
# Add this line to build dictionary
df_dict[a] = c
And, at the end print
df_dict['nwhl']

Related

Plotting multiple lines with a Nested Dictionary, and unknown variables to Line Graph

I was able to find somewhat of an answer to my question, but it was not as nested as my dictionary and so I am really unsure how to proceed as I am still very new to python. I currently have a nested dictionary like
{'140.10': {'46': {'1': '-49.50918', '2': '-50.223637', '3': '49.824406'}, '28': {'1': '-49.50918', '2': '-50.223637', '3': '49.824406'}}}:
I am wanting to plot it so that '140.10' becomes the title of the graph and '46' and '28' become the individual lines and key '1' for example is on the y axis and the x axis is the final number (in this case '-49.50918). Essentially a graph like this:
I generated this graph with a csv file that is written at another part of the code just with excel:
[![enter image description here][2]][2]
The problem I am running into is that these keys are autogenerated from a larger csv file and I will not know their exact value until the code has been run. As each of the keys are autogenerated in an earlier part of the script. As I will be running it over various files called the Graph name, and each file will have a different values for:
{key1:{key2_1: {key3_1: value1, key3_2: value2, key3_3: value3}, key_2_2 ...}}}
I have tried to do something like this:
for filename in os.listdir(Directory):
if filename.endswith('.csv'):
q = filename.split('.csv')[0]
s = q.split('_')[0]
if s in time_an_dict:
atom = list(time_an_dict[s])
ion = time_an_dict[s]
for f in time_an_dict[s]:
x_val = []
y_val = []
fz = ion[f]
for i in time_an_dict[s][f]:
pos = (fz[i])
frame = i
y_val.append(frame)
x_val.append(pos)
'''ions = atom
frame = frames
position = pos
plt.plot(frame, position, label = frames)
plt.xlabel("Frame")
plt.ylabel("Position")
plt.show()
#plt.savefig('{}_Pos.png'.format(s))'''
But it has not run as intended.
I have also tried:
for filename in os.listdir(Directory):
if filename.endswith('_Atom.csv'):
q = filename.split('.csv')[0]
s = q.split('_')[0]
if s in window_dict:
name = s + '_Atom.csv'
time_an_dict[s] = analyze_time(name,window_dict[s])
new = '{}_A_pos.csv'.format(s)
ions = list(time_an_dict.values())[0].keys()
for i in ions:
x_axis_values = []
y_axis_values = []
frame = list(time_an_dict[s][i])
x_axis_values.append(frame)
empty = []
print(x_axis_values)
for x in frame:
values = time_an_dict[s][i][x]
empty.append(values)
y_axis_values.append(empty)
plt.plot(x_axis_values, y_axis_values, label = x )
plt.show()
But keep getting the error:
Traceback (most recent call last): File "Atoms_pos.py", line 175, in
plt.plot(x_axis_values, y_axis_values, label = x ) File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/pyplot.py",
line 2840, in plot
return gca().plot( File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/axes/_axes.py",
line 1743, in plot
lines = [*self._get_lines(*args, data=data, **kwargs)] File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/axes/_base.py",
line 273, in call
yield from self._plot_args(this, kwargs) File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/axes/_base.py",
line 394, in _plot_args
self.axes.xaxis.update_units(x) File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/axis.py",
line 1466, in update_units
default = self.converter.default_units(data, self) File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/category.py",
line 107, in default_units
axis.set_units(UnitData(data)) File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/category.py",
line 176, in init
self.update(data) File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/category.py",
line 209, in update
for val in OrderedDict.fromkeys(data): TypeError: unhashable type: 'numpy.ndarray'
Here is the remainder of the other parts of the code that generate the files and dictionaries I am using. I was told in another question I asked that this could be helpful.
# importing dependencies
import math
import sys
import pandas as pd
import MDAnalysis as mda
import os
import numpy as np
import csv
import matplotlib.pyplot as plt
################################################################################
###############################################################################
Directory = '/Users/hxb51/Desktop/Q_prof/Displacement_Charge/Blah'
os.chdir(Directory)
################################################################################
''' We are only looking at the positions of the CLAs and SODs and not the DRUDE counterparts. We are assuming the DRUDE
are very close and it is not something that needs to be concerned with'''
def Positions(dcd, topo):
fields = ['Window', 'ION', 'ResID', 'Location', 'Position', 'Frame', 'Final']
with open('{}_Atoms.csv'.format(s), 'a') as d:
writer = csv.writer(d)
writer.writerow(fields)
d.close()
CLAs = u.select_atoms('segid IONS and name CLA')
SODs = u.select_atoms('segid IONS and name SOD')
CLA_res = len(CLAs)
SOD_res = len(SODs)
frame = 0
for ts in u.trajectory[-10:]:
frame +=1
CLA_pos = CLAs.positions[:,2]
SOD_pos = SODs.positions[:,2]
for i in range(CLA_res):
ids = i + 46
if CLA_pos[i] < 0:
with open('{}_Atoms.csv'.format(s), 'a') as q:
new_line = [s,'CLA', ids, 'Bottom', CLA_pos[i], frame,10]
writes = csv.writer(q)
writes.writerow(new_line)
q.close()
else:
with open('{}_Atoms.csv'.format(s), 'a') as q:
new_line = [s,'CLA', ids, 'Top', CLA_pos[i], frame, 10]
writes = csv.writer(q)
writes.writerow(new_line)
q.close()
for i in range(SOD_res):
ids = i
if SOD_pos[i] < 0:
with open('{}_Atoms.csv'.format(s), 'a') as q:
new_line = [s,'SOD', ids, 'Bottom', SOD_pos[i], frame,10]
writes = csv.writer(q)
writes.writerow(new_line)
q.close()
else:
with open('{}_Atoms.csv'.format(s), 'a') as q:
new_line = [s,'SOD', ids, 'Top', SOD_pos[i], frame, 10]
writes = csv.writer(q)
writes.writerow(new_line)
q.close()
csv_Data = pd.read_csv('{}_Atoms.csv'.format(s))
filename = s + '_Atom.csv'
sorted_df = csv_Data.sort_values(["ION", "ResID", "Frame"],
ascending=[True, True, True])
sorted_df.to_csv(filename, index = False)
os.remove('{}_Atoms.csv'.format(s))
''' this function underneath looks at the ResIds, compares them to make sure they are the same and then counts how many
times the ion flip flops around the boundaries'''
def turn_dict(f):
read = open(f)
reader = csv.reader(read, delimiter=",", quotechar = '"')
my_dict = {}
new_list = []
for row in reader:
new_list.append(row)
for i in range(len(new_list[:])):
prev = i - 1
if new_list[i][2] == new_list[prev][2]:
if new_list[i][3] != new_list[prev][3]:
if new_list[i][2] in my_dict:
my_dict[new_list[i][2]] += 1
else:
my_dict[new_list[i][2]] = 1
return my_dict
def plot_flips(f):
dict = turn_dict(f)
ions = list(dict.keys())
occ = list(dict.values())
plt.bar(range(len(dict)), occ, tick_label = ions)
plt.title("{}".format(s))
plt.xlabel("Residue ID")
plt.ylabel("Boundary Crosses")
plt.savefig('{}_Flip.png'.format(s))
def analyze_time(f, dicts):
read = open(f)
reader = csv.reader(read, delimiter=",", quotechar='"')
new_list = []
keys = list(dicts.keys())
time_dict = {}
pos_matrix = {}
for row in reader:
new_list.append(row)
fields = ['ResID', 'Position', 'Frame']
with open('{}_A_pos.csv'.format(s), 'a') as k:
writer = csv.writer(k)
writer.writerow(fields)
k.close()
for i in range(len(new_list[:])):
if new_list[i][2] in keys:
with open('{}_A_pos.csv'.format(s), 'a') as k:
new_line = [new_list[i][2], new_list[i][4], new_list[i][5]]
writes = csv.writer(k)
writes.writerow(new_line)
k.close()
read = open('{}_A_pos.csv'.format(s))
reader = csv.reader(read, delimiter=",", quotechar='"')
time_list = []
for row in reader:
time_list.append(row)
for j in range(len(keys)):
for i in range(len(time_list[1:])):
if time_list[i][0] == keys[j]:
pos_matrix[time_list[i][2]] = time_list[i][1]
time_dict[keys[j]] = pos_matrix
return time_dict
window_dict = {}
for filename in os.listdir(Directory):
s = filename.split('.dcd')[0]
fors = s + '.txt'
topos = '/Users/hxb51/Desktop/Q_prof/Displacement_Charge/topo.psf'
if filename.endswith('.dcd'):
print('We are starting with {} \n '.format(s))
u = mda.Universe(topos, filename)
Positions(filename, topos)
name = s + '_Atom.csv'
plot_flips(name)
window_dict[s] = turn_dict(name)
continue
time_an_dict = {}
for filename in os.listdir(Directory):
if filename.endswith('.csv'):
q = filename.split('.csv')[0]
s = q.split('_')[0]
if s in window_dict:
name = s + '_Atom.csv'
time_an_dict[s] = analyze_time(name,window_dict[s])
for filename in os.listdir(Directory):
if filename.endswith('.csv'):
q = filename.split('.csv')[0]
s = q.split('_')[0]
if s in time_an_dict:
atom = list(time_an_dict[s])
ion = time_an_dict[s]
for f in time_an_dict[s]:
x_val = []
y_val = []
fz = ion[f]
for i in time_an_dict[s][f]:
pos = (fz[i])
frame = i
y_val.append(frame)
x_val.append(pos)
'''ions = atom
frame = frames
position = pos
plt.plot(frame, position, label = frames)
plt.xlabel("Frame")
plt.ylabel("Position")
plt.show()
#plt.savefig('{}_Pos.png'.format(s))'''
Everything here runs well except this last bottom block of code. That deals with trying to make a graph from a nested dictionary. Any help would be appreciated!
Thanks!
I figured out the answer:
for filename in os.listdir(Directory):
if filename.endswith('_Atom.csv'):
q = filename.split('.csv')[0]
s = q.split('_')[0]
if s in window_dict:
name = s + '_Atom.csv'
time_an_dict[s] = analyze_time(name,window_dict[s])
new = '{}_A_pos.csv'.format(s)
ions = list(time_an_dict[s])
plt.yticks(np.arange(-50, 50, 5))
plt.xlabel('Frame')
plt.ylabel('Z axis position(Ang)')
plt.title([s])
for i in ions:
x_value = []
y_value = []
time_frame =len(time_an_dict[s][i]) +1
for frame in range(1,time_frame):
frame = str(frame)
x_value.append(int(frame))
y_value.append(float(time_an_dict[s][i][frame]))
plt.plot(x_value, y_value, label=[i])
plt.xticks(np.arange(1, 11, 1))
plt.legend()
plt.savefig('{}_Positions.png'.format(s))
plt.clf()
os.remove("{}_A_pos.csv".format(s))
From there, with the combo of the other parts of the code, it produces these graphs:
For more than 1 file as long as there is more '.dcd' files.

How to create new columns based off specific conditions?

I have a multi-index dataframe. The index's are represented by an ID and date. The 3 columns I have are cost, revenue, and expenditure.
I want to create 3 new columns based off certain conditions.
1) The first new column I would want to create would be based off the condition, for the 3 most previous dates per ID, if the cost column decreases consistently, label the new row values as 'NEG', if not then label it 'No'.
2) The second column I would want to create would be based off the condition, for the 3 most recent dates, if the revenue column decreases consistently, label the new row values as 'NEG', if not then label it 'No'.
3) The third column I would want to create would be based off the condition, for the 3 most recent dates, if the expenditure column increases consistently, label the new row value as 'POS' or if it stays the same label the new row value as 'STABLE'.
idx = pd.MultiIndex.from_product([['001', '002', '003','004'],
['2017-06-30', '2017-12-31', '2018-06-30','2018-12-31','2019-06-30']],
names=['ID', 'Date'])
col = ['Cost', 'Revenue','Expenditure']
dict2 = {'Cost':[12,6,-2,-10,-16,-10,14,12,6,7,4,2,1,4,-4,5,7,9,8,1],
'Revenue':[14,13,2,1,-6,-10,14,12,6,7,4,2,1,4,-4,5,7,9,18,91],
'Expenditure':[17,196,20,1,-6,-10,14,12,6,7,4,2,1,4,-4,5,7,9,18,18]}
df = pd.DataFrame(dict2,idx,col)
i have tried creating a function then applying it to my DF but i keep getting errors...
the solution i want to end up with would look like this..
idx = pd.MultiIndex.from_product([['001', '002', '003','004'],
['2017-06-30', '2017-12-31', '2018-06-30','2018-12-31','2019-06-30']],
names=['ID', 'Date'])
col = ['Cost', 'Revenue','Expenditure', 'Cost Outlook', 'Revenue Outlook', 'Expenditure Outlook']
dict3= {'Cost': [12,6,-2,-10,-16,
-10,14,12,6,7,
4,2,1,4,-4,
5,7,9,8,1],
'Cost Outlook': ['no','no','NEG','NEG','NEG',
'no','no','no','NEG','NEG',
'no','no','NEG','no','no',
'no','no','no','no','NEG'],
'Revenue':[14,13,2,1,-6,
-10,14,12,6,7,
4,2,1,4,-4,
5,7,9,18,91],
'Revenue Outlook': ['no','no','NEG','NEG','NEG',
'no','no','no','NEG','NEG',
'no','no','NEG','no','no',
'no','no','no','no','no'],
'Expenditure':[17,196,1220,1220, -6,
-10,14,120,126,129,
4,2,1,4,-4,
5,7,9,18,18],
'Expenditure Outlook':['no','no','POS','POS','no',
'no','no','POS','POS','POS',
'no','no','no','no','no',
'no','no','POS','POS','STABLE']
}
df_new = pd.DataFrame(dict3,idx,col)
Here's what I would do:
# update Cost and Revenue Outlooks
# because they have similar conditions
for col in ['Cost', 'Revenue']:
groups = df.groupby('ID')
outlook = f'{col} Outlook'
df[outlook] = groups[col].diff().lt(0)
# moved here
df[outlook] = np.where(groups[outlook].rolling(2).sum().eq(2), 'NEG', 'no')
# update Expenditure Outlook
col = 'Expenditure'
outlook = f'{col} Outlook'
s = df.groupby('ID')[col].diff()
df[outlook] = np.select( (s.eq(0).groupby(level=0).rolling(2).sum().eq(2),
s.gt(0).groupby(level=0).rolling(2).sum().eq(2)),
('STABLE', 'POS'), 'no')
See if this does the job:
is_descending = lambda a: np.all(a[:-1] > a[1:])
is_ascending = lambda a: np.all(a[:-1] <= a[1:])
df1 = df.reset_index()
df1["CostOutlook"] = df1.groupby("ID").Cost.rolling(3).apply(is_descending).fillna(0).apply(lambda x: "NEG" if x > 0 else "no").to_list()
df1["RevenueOutlook"] = df1.groupby("ID").Revenue.rolling(3).apply(is_descending).fillna(0).apply(lambda x: "NEG" if x > 0 else "no").to_list()
df1["ExpenditureOutlook"] = df1.groupby("ID").Expenditure.rolling(3).apply(is_ascending).fillna(0).apply(lambda x: "POS" if x > 0 else "no").to_list()
df1 = df1.set_index(["ID", "Date"])
Note: The requirement for "STABLE" is not handled.
Edit:
This is alternative solution:
is_descending = lambda a: np.all(a[:-1] > a[1:])
def is_ascending(a):
if np.all(a[:-1] <= a[1:]):
if a[-1] == a[-2]:
return 2
return 1
return 0
for col in ['Cost', 'Revenue']:
outlook = df[col].unstack(level="ID").rolling(3).apply(is_descending).fillna(0).replace({0.0:"no", 1.0:"NEG"}).unstack().rename(f"{col} outlook")
df = df.join(outlook)
col = "Expenditure"
outlook = df[col].unstack(level="ID").rolling(3).apply(is_ascending).fillna(0).replace({0.0:"no", 1.0:"POS", 2.0:"STABLE"}).unstack().rename(f"{col} outlook")
df = df.join(outlook)

Unable to retrieve data from frame

I am trying to retrieve specific data from data-frame with particular condition, but it show empty data frame. I am new to data science, trying to learn data science. Here is my code.
file = open('/home/jeet/files1/files/ch03/adult.data', 'r')
def chr_int(a):
if a.isdigit(): return int(a)
else: return 0
data = []
for line in file:
data1 = line.split(',')
if len(data1) == 15:
data.append([chr_int(data1[0]), data1[1],
chr_int(data1[2]), data1[3],
chr_int(data1[4]), data1[5],
data1[6], data1[7], data1[8],
data1[9], chr_int(data1[10]),
chr_int(data1[11]),
chr_int(data1[12]),
data1[13], data1[14]])
import pandas as pd
df = pd.DataFrame(data)
df.columns = ['age', 'type-employer', 'fnlwgt', 'education','education_num', 'marital','occupation', 'relationship','race','sex','capital_gain','capital_loss','hr_per_week','country','income']
ml = df[(df.sex == 'Male')] # here i retrive data who is male
ml1 = df[(df.sex == 'Male') & (df.income == '>50K\n')]
print(ml1.head()) # here i printing that data
fm =df[(df.sex == 'Female')]
fm1 = df [(df.sex == 'Female') & (df.income =='>50K\n')]
output:
Empty DataFrame
Columns: [age, type-employer, fnlwgt, education, education_num, marital, occupation, relationship, race, sex, capital_gain, capital_loss, hr_per_week, country, income]
Index: []
what's wrong with the code. why data frame is empty.
If you check the values carefully, you may see the problem:
print(df.income.unique())
>>> [' <=50K\n' ' >50K\n']
There are spaces in front of each values. So values should be either processed to get rid of these spaces, or the code should be modified like this:
ml1 = df[(df.sex == 'Male') & (df.income == ' >50K\n')]
fm1 = df [(df.sex == 'Female') & (df.income ==' <=50K\n')]

Python return all items from loop?

I cannot figure how to return all the items using this code:
#staticmethod
def create_dataset():
cols = Colleagues.get_all_colleagues()
cols_abs = ((col['Firstname'] + " " + col['Surname'], col['Absences']) for col in cols)
for col in cols_abs:
dataset = list()
sum_days = list()
for d in col[1]:
start_date = d[0]
end_date = d[1]
s = datetime.strptime(start_date, "%Y-%m-%d")
e = datetime.strptime(end_date, "%Y-%m-%d")
startdate = s.strftime("%b-%y")
days = numpy.busday_count(s, e) + 1
sum_days.append(days)
days_per_month = startdate, days
dataset.append(days_per_month)
dict_gen1 = dict(dataset)
comb_days = sum(sum_days)
dict_gen2 = {'Name': col[0], 'Spells': len(col[1]), 'Total(Days)': comb_days}
dict_comb = [{**dict_gen1, **dict_gen2}]
return dict_comb
It only returns the first "col". If I move the return statement outside of the loop it returns only the last item in my set of data. This is the output that is returned from col_abs:
('Jonny Briggs', [['2015-08-01', '2015-08-05'], ['2015-11-02', '2015-11-06'], ['2016-01-06', '2016-01-08'], ['2016-03-07', '2016-03-11']])
('Matt Monroe[['2015-12-08', '2015-12-11'], ['2016-05-23', '2016-05-26']])
('Marcia Jones', [['2016-02-02', '2016-02-04']])
('Pat Collins', [])
('Sofia Marowzich', [['2015-10-21', '2015-10-30'], ['2016-03-09', '2016-03-24']])
('Mickey Quinn', [['2016-06-06', '2016-06-08'], ['2016-01-18', '2016-01-21'], ['2016-07-21', '2016-07-22']])
('Jenifer Andersson', [])
('Jon Fletcher', [])
('James Gray', [['2016-04-01', '2016-04-06'], ['2016-07-04', '2016-07-07']])
('Matt Chambers', [['2016-05-02', '2016-05-04']])
Can anyone help me understand this better as I want to return a "dict_comb" for each entry in col_abs ?
Replace your return statement with a yield statement. This will allow your method to continue to loop while "yielding" or returning values after each iteration.

How to print results from this function

I'm new to Python and programming in general and need a little help with this (partially finished) function. It's calling a text file with a bunch of rows of comma delimited data (age, salary, education and so on). However, I've run into a problem from the outset. I don't know how to return the results.
My aim is to create dictionaries for each category and for each row to be sorted and tallied.
e.g. 100 people over 50, 200 people under 50 and so on.
Am I in the correct ball park?
file = "adultdata.txt"
def make_data(file):
try:
f = open(file, "r")
except IOError as e:
print(e)
return none
large_list = []
avg_age = 0
row_count_under50 = 0
row_count_over50 = 0
#create 2 dictionaries per category
employ_dict_under50 = {}
employ_dict_over50 = {}
for row in f:
edited_row = row.strip()
my_list = edited_row.split(",")
try:
#Age Category
my_list[0] = int(my_list[0])
#Work Category
if my_list[-1] == " <=50K":
if my_list[1] in employ_dict_under50:
employ_dict_under50[my_list[1]] += 1
else:
employ_dict_under50[my_list[1]] = 1
row_count_u50 += 1
else:
if my_list[1] in emp_dict_o50:
employ_dict_over50[my_list[1]] += 1
else:
employ_dict_over50[my_list[1]] = 1
row_count_o50 += 1
# Other categories here
print(my_list)
#print(large_list)
#return
# Ignored categories here - e.g. my_list[insert my list numbers here] = None
I do not have access to your file but I had a go at correcting most of the errors you had in your code.
These are a list of the mistakes I found in your code:
your function make_data is essentially useless and is out of scope. You need to remove it entirely
When using a file object f, you need to use readline to extract data from the file.
It is also best to use a with statement when using IO resources like files
You had numerous variables which were badly named in the inner loop and did not exist
You declared a try in the inner loop without a catch. You can remove the try because you are not trying to catch any Error
You have some very basic errors which are related to general programming, can I assume your new to this? If thats the case then you should probably follow some more beginner tutorials online until you get a grasp of what commands you need to use to perform basic tasks.
Try compare your code to this and see if you can understand what i'm trying to say:
file = "adultdata.txt"
large_list = []
avg_age = 0
row_count_under50 = 0
row_count_over50 = 0
#create 2 dictionaries per category
employ_dict_under50 = {}
employ_dict_over50 = {}
with open(file, "r") as f:
row = f.readline()
edited_row = row.strip()
my_list = edited_row.split(",")
#Age Category
my_list[0] = int(my_list[0])
#Work Category
if my_list[-1] == " <=50K":
if my_list[1] in employ_dict_under50:
employ_dict_under50[my_list[1]] += 1
else:
employ_dict_under50[my_list[1]] = 1
row_count_under50 += 1
else:
if my_list[1] in employ_dict_over50:
employ_dict_over50[my_list[1]] += 1
else:
employ_dict_over50[my_list[1]] = 1
row_count_over50 += 1
# Other categories here
print(my_list)
#print(large_list)
#return
I cannot say for certain if this code will work or not without your file but it should give you a head start.

Resources