I have a dataframe with 4 attributes; it can be seen below.
What I want to do is take the name and age of each person and count the number of friends they have. Then, if two people have the same age but different names, take the average number of friends for that age. Finally, divide the age range into age groups and take the average per group. This is how I tried:
import datetime
from collections import defaultdict

start_time = datetime.datetime.now()
# select the attributes (features) of interest by position
friends = df.iloc[:,3]
ages = df.iloc[:,2]
# defaultdict with age as key and a list of friends as value
dictionary_age_friends = defaultdict(list)
# populating the dictionary with key age and values friend
for i,j in zip(ages,friends):
    dictionary_age_friends[i].append(j)
print("first dict")
print(dictionary_age_friends)
# second dictionary: the friends of the same age are collected and counted
set_dict ={}
for x in dictionary_age_friends:
    list_friends =[]
    for y in dictionary_age_friends[x]:
        list_friends.append(y)
    set_list_len = len(list_friends) # each friend counts as 1
    set_dict[x] = set_list_len
print(set_dict)
# set_dict ={}
# for x in dictionary_age_friends:
# print("inside the loop")
# lis_1 =[]
# for y in dictionary_age_friends[x]:
# lis_1.append(y)
# set_list = lis_1
# set_list = [1 for x in set_list] # assign a friend with a number 1
# set_dict[x] = sum(set_list)
# a dictionary that assigns the ages to age groups
second_dict = defaultdict(list)
for i,j in set_dict.items():
    if i in range(16,20):
        i = 'teens_youthAdult'
        second_dict[i].append(j)
    elif i in range(20,40):
        i ="Adult"
        second_dict[i].append(j)
    elif i in range(40,60):
        i ="MiddleAge"
        second_dict[i].append(j)
    elif i in range(60,72):
        i = "old"
        second_dict[i].append(j)
print(second_dict)
print("final dict stared")
new_dic ={}
for key,value in second_dict.items():
    if key == 'teens_youthAdult':
        new_dic[key] = round((sum(value)/len(value)),2)
    elif key =='Adult':
        new_dic[key] = round((sum(value)/len(value)),2)
    elif key =='MiddleAge' :
        new_dic[key] = round((sum(value)/len(value)),2)
    else:
        new_dic[key] = round((sum(value)/len(value)),2)
new_dic
end_time = datetime.datetime.now()
print(end_time-start_time)
print(new_dic)
Some of the feedback I got:
1. There is no need to build a list if you just want to count the number of friends.
2. Take two people with the same age, 18: one has 4 friends, the other has 3. The current code concludes that the average is 7 friends.
3. The code is neither correct nor optimal.
Any suggestions or help? Thanks in advance.
I don't fully understand the attribute names, and you haven't said which age groups you want to split the data into. In my answer I'll treat the data as if the attributes were:
index, name, age, friend
To find the number of friends per name, I suggest using groupby.
input:
groups = df.groupby([df.iloc[:,0],df.iloc[:,1]]) # grouping by name(0), age(1)
amount_of_friends_df = groups.size() # gathering amount of friends for a person
print(amount_of_friends_df)
output:
name  age
EUNK  25     1
FBFM  26     1
MYYD  30     1
OBBF  28     2
RJCW  25     1
RQTI  21     1
VLIP  16     1
ZCWQ  18     1
ZMQE  27     1
To find the number of friends by age, you can also use groupby.
input:
groups = df.groupby([df.iloc[:,1]]) # groups by age(1)
age_friends = groups.size()
age_friends=age_friends.reset_index()
age_friends.columns=(['age','amount_of_friends'])
print(age_friends)
output:
   age  amount_of_friends
0   16                  1
1   18                  1
2   21                  1
3   25                  2
4   26                  1
5   27                  1
6   28                  2
7   30                  1
To calculate average amount of friends per age group you can use categories and groupby.
input:
mean_by_age_group_df = age_friends.groupby(pd.cut(age_friends.age,[20,40,60,72]))\
.agg({'amount_of_friends':'mean'})
print(mean_by_age_group_df)
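pd.cut returns a categorical Series, which we use to group the data. Afterwards we use the agg function to aggregate each group in the dataframe. Note that with these bin edges, ages of 20 and below fall outside every interval, so they are left out of the result.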
output:
          amount_of_friends
age
(20, 40]           1.333333
(40, 60]                NaN
(60, 72]                NaN
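As a rough sketch of the whole pipeline in one place: the example below assumes one row per friendship record (so a person's friend count is their number of rows), uses made-up sample data and column names, and takes the age-group boundaries from the question's range(...) calls. It averages the per-person counts directly within each age group, which is one possible reading of the question rather than the only one.

```python
import pandas as pd

# Hypothetical sample data: one row per (person, friend) record.
df = pd.DataFrame({
    "name": ["ANNA"] * 4 + ["BOB"] * 3 + ["CARL"] + ["DINA"] * 2,
    "age": [18] * 4 + [18] * 3 + [25] + [45] * 2,
    "friend": list("abcdefghij"),
})

# Step 1: number of friends per person (rows per name/age pair).
friends_per_person = (
    df.groupby(["name", "age"]).size().rename("n_friends").reset_index()
)

# Step 2: label each person with an age group.
# right=False makes the bins [16, 20), [20, 40), ... like range(16, 20) etc.
bins = [16, 20, 40, 60, 72]
labels = ["teens_youthAdult", "Adult", "MiddleAge", "old"]
age_group = pd.cut(friends_per_person["age"], bins=bins, labels=labels, right=False)

# Step 3: average friend count per age group (empty groups are skipped).
avg_by_group = (
    friends_per_person.groupby(age_group, observed=True)["n_friends"]
    .mean()
    .round(2)
)
print(avg_by_group)
```

With this sample, the two 18-year-olds with 4 and 3 friends give an average of 3.5 for their group, not 7.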
I have the following data
ID  DATE        AGE  COUNT
1   NaT         16   1
1   2021-06-06  19   2
1   2020-01-05  20   3
2   NaT         23   3
2   NaT         16   3
2   2019-02-04  36   12
I want to aggregate this so that DATE is the earliest valid date (in time), while AGE is taken from the row in which that earliest date occurs. The output should be
ID  DATE        AGE  COUNT
1   2021-06-06  19   1
2   2019-02-04  36   3
My code, shown below, raises this error: TypeError: Must provide 'func' or named aggregation **kwargs.
df_agg = pd.pivot_table(df, index=['ID'],
values=['DATE', 'AGE'],
aggfunc={'DATE': np.min, 'AGE': None, 'COUNT': np.min})
I don't want to use 'AGE': np.min since for ID=1, AGE=16 will be extracted which is not what I want.
///////////// Edits ///////////////
Edits made to provide a more generic example.
You can try .first_valid_index():
x = df.loc[df.groupby("ID").apply(lambda x: x["DATE"].first_valid_index())]
print(x)
Prints:
   ID        DATE  AGE
1   1  2021-06-06   19
5   2  2019-02-04   36
EDIT: Using .pivot_table(). You can extract "DATE"/"AGE" together as a list; for "COUNT" you can use np.min or "min". The second step is to split the "DATE"/"AGE" list into separate columns:
df_agg = pd.pivot_table(
    df,
    index=["ID"],
    values=["DATE", "AGE", "COUNT"],
    aggfunc={
        "DATE": lambda x: df.loc[x.first_valid_index()][
            ["DATE", "AGE"]
        ].tolist(),
        "COUNT": "min",
    },
)
df_agg[["DATE", "AGE"]] = pd.DataFrame(df_agg["DATE"].apply(pd.Series))
print(df_agg)
Prints:
    COUNT        DATE  AGE
ID
1       1  2021-06-06   19
2       3  2019-02-04   36
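You can sort the values and drop the duplicates (sort_index is optional). By default sort_values places NaT last, so the first row kept for each ID carries its earliest valid date: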
df.sort_values(['DATE']).drop_duplicates('ID').sort_index()
   ID        DATE  AGE
1   1  2021-06-06   19
5   2  2019-02-04   36
With groupby and transform:
df[df['DATE'] == df.groupby("ID")['DATE'].transform('min')]
Assuming you have an index, a simple solution would be:
def min_val(group):
    # idxmin() must be called; it returns the index label of the earliest DATE
    group = group.loc[group.DATE.idxmin()]
    return group
df.groupby(['ID']).apply(min_val)
If you do not have an index you can use:
df.reset_index().groupby(['ID']).apply(min_val).drop(columns=['ID'])
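For comparison, here is a small sketch that picks the earliest date in time per ID with groupby and idxmin, then joins the minimum COUNT. It rebuilds the question's sample table by hand (the column dtypes are an assumption) and assumes every ID has at least one valid date. Note that "earliest in time" selects 2020-01-05 (AGE 20) for ID 1, which differs from the first valid row in file order that first_valid_index returns.

```python
import pandas as pd

# The question's sample data, rebuilt by hand; DATE is parsed to datetime64.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "DATE": pd.to_datetime(
        [None, "2021-06-06", "2020-01-05", None, None, "2019-02-04"]
    ),
    "AGE": [16, 19, 20, 23, 16, 36],
    "COUNT": [1, 2, 3, 3, 3, 12],
})

# Index label of the earliest non-missing DATE within each ID (NaT is skipped).
earliest_idx = df.groupby("ID")["DATE"].idxmin()

# Take DATE and AGE from exactly those rows, then add the per-ID minimum COUNT.
earliest = df.loc[earliest_idx, ["ID", "DATE", "AGE"]].set_index("ID")
min_count = df.groupby("ID")["COUNT"].min()
result = earliest.join(min_count).reset_index()
print(result)
```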
I don't know how to search for this on the internet, so I am asking here.
My code:
# This code is in Tes.py
import MyLib

n = [str]*3
x = [int]*3
MyLib.name(n)
MyLib.number(x)
MyLib.result(n,x)
# This code is in MyLib.py, with 3 functions
def name(data) :
    for i in range (3) :
        data[i] = str(input("Enter Name : "))
def number(data) :
    for i in range (3) :
        data[i] = int(input("Enter Number : "))
def result(data1,data2) :
    for n in data1 :
        for x in data2 :
            print("Your Name",n,"Your Number",x)
examples :
input 1 : Jack
Rino
Gust
input 2 : 1232
1541
2021
output what I want : Your Name Jack Your Number 1232
Your Name Rino Your Number 1541
Your Name Gust Your Number 2021
output that i got : Your Name Jack Your Number 1232
Your Name Jack Your Number 1541
Your Name Jack Your Number 2021
Your Name Rino Your Number 1232
Your Name Rino Your Number 1541
Your Name Rino Your Number 2021
Your Name Gust Your Number 1232
Your Name Gust Your Number 1541
Your Name Gust Your Number 2021
How can I get the output I want? I wanted to search for it on Google, but I don't know what to type.
Is this what you mean?
for i in range(min(len(n), len(x))):
    print("Your Name",n[i],"Your Number",x[i])
total = 3
n = [str]*total
x = [int]*total
for i in range (total):
    n[i] = str(input("Enter Name : "))
for i in range (total):
    x[i] = int(input("Enter Number : "))
for i in range (total):
    print("Your Name",n[i],"Your Number",x[i])
If you give this code the input you mentioned, it will show the desired result. Because you wrote two nested loops,
for i in n :
    for i in x :
your print is triggered 3*3 = 9 times. You need just one loop, since both inputs have the same number of items (3 names and 3 numbers). I would even ask why you need 3 loops at all. Why not just this:
total = 3
n = [str]*total
x = [int]*total
for i in range (total):
    n[i] = str(input("Enter Name : "))
    x[i] = int(input("Enter Number : "))
    print("Your Name",n[i],"Your Number",x[i])
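Another way to walk two lists in step, as a minimal sketch: the built-in zip pairs the first name with the first number, the second with the second, and so on, so no index is needed. The helper name pair_up is made up for this illustration, and the values are hard-coded instead of read with input().

```python
def pair_up(names, numbers):
    """Return one output line per (name, number) pair, matched by position."""
    # zip stops at the shorter list, so mismatched lengths do not raise.
    return [
        f"Your Name {name} Your Number {number}"
        for name, number in zip(names, numbers)
    ]


lines = pair_up(["Jack", "Rino", "Gust"], [1232, 1541, 2021])
for line in lines:
    print(line)
```

This produces three lines instead of nine, because there is a single pass over the pairs rather than a loop inside a loop.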
I need to search the values from the df1['numsearch'] column into the lists in df2['Numbers']. If the number is in those lists, then I want to add values from the df2['Score'] column to df1. See desired output below.
df1 = pd.DataFrame(
{'Day':['M','Tu','W','Th','Fr','Sa','Su'],
'numsearch':['1','20','14','99','19','6','101']
})
df2 = pd.DataFrame(
{'Letters':['a','b','c','d'],
'Numbers':[['1','2','3','4'],['5','6','7','8'],['10','20','30','40'],['11','12','13','14']],
'Score': ['1.1','2.2','3.3','4.4']})
desired output
  Day numsearch       Score
0   M         1         1.1
1  Tu        20         3.3
2   W        14         4.4
3  Th        99  "No score"
4  Fr        19  "No score"
5  Sa         6         2.2
6  Su       101  "No score"
I have written a for loop that works with the test data.
scores = []
for s,ns in enumerate(ppr_data['SN']):
    match = ''
    for k,q in enumerate(jcr_data['All_ISSNs']):
        if ns in q:
            scores.append(jcr_data['Journal Impact Factor'][k])
            match = 1
        else:
            continue
    if match == "":
        scores.append('No score')
    match = ""
df1['Score'] = np.array(scores)
In my small test the above code works, but with larger data files it creates duplicates. So this clearly isn't the best way to do this.
I'm sure there's a more pandas-proper line of code that ends in .fillna("No score").
I tried to use a loc statement, but I get hung up on searching the values of one dataframe in a column that contains lists.
Can anyone shed some light?
df2 = df2.explode('Numbers')  # explode df2 on Numbers: one row per number
d = dict(zip(df2.Numbers, df2.Score))  # dict mapping each number to its score
df1['Score'] = df1.numsearch.map(d).fillna('No Score')  # map the dict onto df1, filling NaN with 'No Score'
Can shorten it as follows:
df2=df2.explode('Numbers')#Explode df2 on Numbers
df1['Score']=df1.numsearch.map(dict(zip(df2.Numbers, df2.Score))).fillna('No Score')
  Day numsearch     Score
0   M         1       1.1
1  Tu        20       3.3
2   W        14       4.4
3  Th        99  No Score
4  Fr        19  No Score
5  Sa         6       2.2
6  Su       101  No Score
You can try left join and fillna:
df1.merge(df2.explode('Numbers'),
left_on='numsearch',
right_on='Numbers', how='left')[['Day', 'numsearch', 'Score']].fillna("No score")
Output:
  Day numsearch     Score
0   M         1       1.1
1  Tu        20       3.3
2   W        14       4.4
3  Th        99  No score
4  Fr        19  No score
5  Sa         6       2.2
6  Su       101  No score
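One possible source of the duplicates mentioned in the question is a number that appears in more than one list: after explode, a left merge then returns one row per match. The sketch below uses made-up data where "1" occurs in two lists, and assumes that keeping the first matching score is acceptable.

```python
import pandas as pd

# Hypothetical data: "1" appears in the lists of both 'a' and 'b'.
df1 = pd.DataFrame({"Day": ["M", "Tu", "W"], "numsearch": ["1", "20", "99"]})
df2 = pd.DataFrame({
    "Letters": ["a", "b", "c"],
    "Numbers": [["1", "2"], ["1", "20"], ["30", "40"]],
    "Score": ["1.1", "2.2", "3.3"],
})

# One row per number, keep only the first score seen for each number,
# then index by number so it can serve as a lookup table.
lookup = (
    df2.explode("Numbers")
    .drop_duplicates("Numbers", keep="first")
    .set_index("Numbers")["Score"]
)

# map() keeps df1 at its original length: one score (or the filler) per row.
result = df1.assign(Score=df1["numsearch"].map(lookup).fillna("No score"))
print(result)
```

Because map looks up one value per row of df1, the result always has as many rows as df1, whichever duplicates exist in df2.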
I need to traverse the values of a dictionary in a single for loop to add and compare values. Here is my dictionary.
I know how to iterate over a dictionary by key, but I am not sure whether we can iterate by the index of the value lists.
dict = {
    0: [1,2,3,4,7],
    1: [2,4,6,9,0],
    2: [4,6,8,2,1]
}
If I iterate over key-value pairs, it will not help me. I need to execute the for loop for each index of the value lists.
Assumption: the list for each dictionary key has the same length.
Total  highest_class
7      2
12     2
17     2
15     1
8      0
Basically, sum all the numbers at the same index across the dictionary's values, and also return which key has the highest number at that index.
import pandas as pd
dict = {
0: [1,2,3,4,7],
1: [2,4,6,9,0],
2: [4,6,8,2,1]
}
data = [(sum(item),item.index(max(item))) for item in list(zip(*dict.values()))]
df = pd.DataFrame(data, columns =['Total', 'highest_class'])
print (df)
output:
   Total  highest_class
0      7              2
1     12              2
2     17              2
3     15              1
4      8              0
EDIT:
data = [(sum(item),item.index(max(item))) for item in list(zip(*dict.values()))]
print ('Total', 'highest_class')
for item in data:
    print ("{:<10} {:<10}".format(item[0], item[1]))
output:
Total highest_class
7 2
12 2
17 2
15 1
8 0
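The one-liner above reports the position of the maximum inside each zipped tuple, which equals the dictionary key only when the keys are 0, 1, 2, ... in order. As a sketch for arbitrary keys, the variant below looks the key up explicitly; the function name totals_and_argmax is made up for this example, and ties go to the first key.

```python
def totals_and_argmax(table):
    """For each list position, return (sum across keys, key holding the max)."""
    keys = list(table)  # dict insertion order matches table.values()
    rows = []
    # zip(*values) turns the per-key lists into per-position tuples.
    for column in zip(*table.values()):
        best_pos = max(range(len(column)), key=column.__getitem__)
        rows.append((sum(column), keys[best_pos]))
    return rows


scores = {
    0: [1, 2, 3, 4, 7],
    1: [2, 4, 6, 9, 0],
    2: [4, 6, 8, 2, 1],
}
result = totals_and_argmax(scores)
for total, highest_class in result:
    print(total, highest_class)
```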
As far as I understand what you want, this might help.
mydict = {
    0: [1,2,3,4,7],
    1: [20,4,6,9,0],
    2: [4,6,8,2,1]
}
maxElement={"highest_class":None,
            "Total":-1
}
for key,value in mydict.items():
    print(sum(value) , maxElement["Total"])
    if sum(value) > maxElement["Total"]:
        maxElement["highest_class"] = key
        maxElement["Total"] = sum(value)
print(maxElement)
I am trying to calculate the CGPA of a number of students. The idea here is that each student takes N courses (in this case, N = 3). Every course has its course load which is an integer and can range from 1 to 6. At the end of the semester, the CGPA is calculated based on the unit load of all the courses taken by each student and the grades obtained.
I am trying to do this with a for statement that loops through the dataset one row at a time, and an if suite that determines the number of units to assign to each student according to the grade scored. The code runs, but it doesn't follow through: if the first student in the dataframe has an A in course1, the code gives him 15 units, and all other students also get 15 units regardless of whether they scored a D or an F.
I really want to know what I am doing wrong and how I can fix it. I would also appreciate it if you can suggest smarter ways of accomplishing this task. Thanks.
I have added breaks in the first course section but I am afraid the code is still not generalizing well.
A = 5; B = 4; C = 3; D = 2; E = 1; F = 0;
course1_cl = 3; course2_cl = 3; course3_cl = 3
def calculate_CGPA(dataframe, a, b, c, d):
    for row in dataframe[d]:
        if dataframe[a].any()=='A':
            dataframe['units'] = A * course1_cl
            break
        elif dataframe[a].any()=='B':
            dataframe['units'] = B * course1_cl
            break
        elif dataframe[a].any()=='C':
            dataframe['units'] = C * course1_cl
            break
        elif dataframe[a].any()=='D':
            dataframe['units'] = D * course1_cl
            break
        elif dataframe[a].any()=='E':
            dataframe['units'] = E * course1_cl
        else:
            dataframe['units']= 0
    print("Done generating units for: "+ format(a))
    for row in dataframe[d]:
        if dataframe[b].any()=='A':
            dataframe['units2']=A * course2_cl
        elif dataframe[b].any()=='B':
            dataframe['units2'] = B*course2_cl
        elif dataframe[b].any()=='C':
            dataframe['units2'] = C*course2_cl
        elif dataframe[b].any()=='D':
            dataframe['units2'] = D*course2_cl
        elif dataframe[b].any()=='E':
            dataframe['units2'] = E*course2_cl
        else:
            dataframe['units2'] = 0
    print("Done generating units for: "+format(b))
    for row in dataframe[d]:
        if dataframe[c].any()=='A':
            dataframe['units3']= A * course3_cl
        elif dataframe[c].any()=='B':
            dataframe['units3'] = B*course3_cl
        elif dataframe[c].any()=='C':
            dataframe['units3'] = C*course3_cl
        elif dataframe[c].any()=='D':
            dataframe['units3'] = D*course3_cl
        elif dataframe[c].any()=='E':
            dataframe['units3'] = E*course3_cl
        else:
            dataframe['units3'] = 0
    print("Done generating units for: "+format(c))
    dataframe['CGPA'] = (dataframe['units'] + dataframe['units2'] + dataframe['units3'])/(course1_cl + course2_cl + course3_cl)
The resulting dataframe should have 4 newly added columns: One units column for each of the three courses and a CGPA column as seen below. The values in the units and CGPA columns should change dynamically based on the grades scored by the individual.
S/N,Name,ExamNo,Course1,Course2,Course3,Units,Units2,Units3,CGPA
1,Mary Beth,A1,A,A,B,15,15,12,4.67
2,Elizabeth Fowler,A2,B,A,A,12,15,15,4.67
3,Bright Thompson,A12,C,C,B,9,9,12,3.33
4,Jack Daniels,A24,C,E,C,9,3,9,2.33
5,Ciroc Brute,A31,A,B,C,15,12,9,4.0
I do not know how complicated your actual data is, but for your sample data you do not need the if statements:
import pandas as pd
from io import StringIO
# sample data
s = """S/N,Name,ExamNo,Course1,Course2,Course3
1,Mary Beth,A1,A,A,B
2,Elizabeth Fowler,A2,B,A,A
3,Bright Thompson,A12,C,C,B
4,Jack Daniels,A24,C,E,C
5,Ciroc Brute,A31,A,B,C"""
df = pd.read_csv(StringIO(s))
# create a dict
d = {'A':5, 'B':4, 'C':3, 'D':2, 'E':1, 'F':0}
# replace the letter grade with number and assign it to units cols
df[['Units', 'Units2', 'Units3']] = df[['Course1','Course2','Course3']].replace(d) * 3
# calc CGPA with sum div 3
df['CGPA'] = df[['Course1','Course2','Course3']].replace(d).sum(1) / 3
   S/N              Name ExamNo Course1 Course2 Course3  Units  Units2  Units3  \
0    1         Mary Beth     A1       A       A       B     15      15      12
1    2  Elizabeth Fowler     A2       B       A       A     12      15      15
2    3   Bright Thompson    A12       C       C       B      9       9      12
3    4      Jack Daniels    A24       C       E       C      9       3       9
4    5       Ciroc Brute    A31       A       B       C     15      12       9

       CGPA
0  4.666667
1  4.666667
2  3.333333
3  2.333333
4  4.000000
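If the course loads are not all 3, the same idea can be generalized. The sketch below is one way to do it, assuming the loads are passed as a dict from course column to load; the function name add_cgpa and that dict are made up for this example. It maps each grade column to grade points, weights by the course load, and divides the total units by the total load.

```python
import pandas as pd

GRADE_POINTS = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1, "F": 0}


def add_cgpa(df, course_loads):
    """Return a copy of df with one units column per course and a CGPA column.

    course_loads maps a course column name to its load (hypothetical helper).
    """
    out = df.copy()
    total_units = 0
    for i, (course, load) in enumerate(course_loads.items(), start=1):
        col = "Units" if i == 1 else f"Units{i}"
        # Each row gets its own value: grade points for that student's grade.
        out[col] = out[course].map(GRADE_POINTS) * load
        total_units = total_units + out[col]
    out["CGPA"] = (total_units / sum(course_loads.values())).round(2)
    return out


sample = pd.DataFrame({
    "Name": ["Mary Beth", "Jack Daniels"],
    "Course1": ["A", "C"],
    "Course2": ["A", "E"],
    "Course3": ["B", "C"],
})
result = add_cgpa(sample, {"Course1": 3, "Course2": 3, "Course3": 3})
print(result)
```

Because map works on the whole column at once, each student's units depend on their own grade, which avoids the "everyone gets 15 units" effect of assigning a single scalar to the column.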