Python Pandas Pivot Table - counting points - python-3.x

I have an issue with a pivot table in Python. Let's say that I have the values below in a list:
team_A_id = [1,5,10]
team_A_result = 0
and the data frame below:
id points
3 36
4 0
5 11
7 6
10 23
How could I (perhaps using a for loop) look up the points for each team A id in the list and sum them? The output should be:
result_team_A = 34
Thanks for any help

You are looking for isin and sum
team_A_id = [1,5,10]
df.loc[df.id.isin(team_A_id),'points'].sum()
Out[136]: 34
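For reference, a minimal end-to-end sketch of that approach, rebuilding the example frame from the question:
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({'id': [3, 4, 5, 7, 10],
                   'points': [36, 0, 11, 6, 23]})
team_A_id = [1, 5, 10]

# Keep only the rows whose id appears in team A's list, then sum the points.
result_team_A = df.loc[df.id.isin(team_A_id), 'points'].sum()
print(result_team_A)  # 34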

This will return the rows for team A, assuming the ids happen to line up with the row positions (iloc selects by position, not by label):
df.iloc[team_A_id]
The result for team A can then be obtained by summing the points column:
df['points'].sum()
TL;DR:
df.iloc[team_A_id]['points'].sum()

Related

Loop through two matrices in the same for loop

I need to loop through two matrices, but I can't use NumPy; I've been asked not to use it.
altura_mar = [[1, 5, 7],
              [3, 2, 31]]
precipitacion_anual = [[10, 5, 70],
                       [5, 7, 8]]
for marIndex, anualIndex in zip(altura_mar, precipitacion_anual):
    # some code
So, my problem here is that zip works with a flat list, not a 2D list; if I try to use it with a matrix it gives this error: "ValueError: too many values to unpack (expected 2)".
Now I'm thinking of two possible solutions:
Make the multidimensional list into a single flat list, like altura_mar = [1,5,7,3,2,31], but how could I do that?
Work with the two matrices as they are, in a nested for loop (I don't know about this, but maybe you guys can help here).
Since it's a 2d list, you can apply the zip twice:
>>> for a, b in zip(altura_mar, precipitacion_anual):
...     for c, d in zip(a, b):
...         print(c, d)
...
1 10
5 5
7 70
3 5
2 7
31 8
This solution does not scale well when the nest depth is variable, though.
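If the nesting depth really can vary, one option, sketched here rather than taken from the answers (zip_nested is a made-up helper name), is a small recursive generator:
def zip_nested(a, b):
    """Yield paired leaves from two nested lists of the same shape."""
    for x, y in zip(a, b):
        if isinstance(x, list):
            yield from zip_nested(x, y)  # recurse into deeper levels
        else:
            yield x, y

altura_mar = [[1, 5, 7], [3, 2, 31]]
precipitacion_anual = [[10, 5, 70], [5, 7, 8]]
for mar, anual in zip_nested(altura_mar, precipitacion_anual):
    print(mar, anual)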
I think the most basic solution is best, just iterating over the indices rather than nested zip statements (see this post)
>>> for i in range(len(altura_mar)):
...     for j in range(len(altura_mar[i])):
...         print(altura_mar[i][j], precipitacion_anual[i][j])
...
1 10
5 5
7 70
3 5
2 7
31 8
Assuming you want to pair 1 with 10, not [1,5,7] with [10, 5, 70], then in addition to the other perfectly fine answers, you could also use generator expressions, like:
altura_mar = [[1,5,7], [3,2,31]]
altura_mar_flat = (cell for row in altura_mar for cell in row)
precipitacion_anual = [[10,5,70], [5,7,8]]
precipitacion_anual_flat = (cell for row in precipitacion_anual for cell in row)
for marIndex, anualIndex in zip(altura_mar_flat, precipitacion_anual_flat):
    print(marIndex, anualIndex)
This also gives you:
1 10
5 5
7 70
3 5
2 7
31 8
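For the flattening idea raised in the question, itertools.chain.from_iterable does the same job as the generator expressions above; a brief sketch:
from itertools import chain

altura_mar = [[1, 5, 7], [3, 2, 31]]
precipitacion_anual = [[10, 5, 70], [5, 7, 8]]

# chain.from_iterable lazily flattens one level of nesting.
for mar, anual in zip(chain.from_iterable(altura_mar),
                      chain.from_iterable(precipitacion_anual)):
    print(mar, anual)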

Average data points in a range while condition is met in a Pandas DataFrame

I have a very large dataset with over 400,000 rows and growing. I understand that you are not supposed to use iterrows to modify a pandas data frame. However, I'm a little lost on what I should do in this case, since I'm not sure I could use .loc() or some rolling filter to modify the data frame in the way I need to. I'm trying to figure out if I can take a data frame and average each range while the condition is met. For example:
Condition  Temp.  Pressure
1          8      20
1          7      23
1          8      22
1          9      21
0          4      33
0          3      35
1          9      21
1          11     20
1          10     22
While the condition == 1, the output dataframe would look like this:
Condition  Avg. Temp.  Avg. Pressure
1          8           21.5
1          10          21
Has anyone attempted something similar that can put me on the right path? I was thinking of using something like this:
df = pd.read_csv(csv_file)
for index, row in df.iterrows():
    if row['condition'] == 1:
        pass  # start index = first value that equals 1
    else:
        # end index & calculate rolling average of range
        length = end - start
        new_df = df.rolling(length).mean()
I know that my code isn't great. I also know I could brute-force it by doing something similar to what I have shown above, but as I said, the data has a lot of rows and continues to grow, so I need to be efficient.
TRY:
result = df.groupby((df.Condition != df.Condition.shift()).cumsum()).apply(
    lambda x: x.rolling(len(x)).mean().dropna()).reset_index(drop=True)
print(result.loc[result.Condition.eq(1)]) # filter by required condition
OUTPUT:
Condition Temp. Pressure
0 1.0 8.0 21.5
2 1.0 10.0 21.0
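A shorter variant of the same shift/cumsum idea, sketched with the sample data above, takes the group mean directly instead of rolling over each run's full length:
import pandas as pd

df = pd.DataFrame({
    'Condition': [1, 1, 1, 1, 0, 0, 1, 1, 1],
    'Temp.':     [8, 7, 8, 9, 4, 3, 9, 11, 10],
    'Pressure':  [20, 23, 22, 21, 33, 35, 21, 20, 22],
})

# Label each run of identical consecutive Condition values,
# then average each run in a single pass.
runs = (df.Condition != df.Condition.shift()).cumsum().rename('run')
result = df.groupby(runs).mean().reset_index(drop=True)
print(result[result.Condition.eq(1)])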

Setting a value to a cell in a pandas dataframe

I have the following pandas dataframe:
K = pd.DataFrame({"A":[1,2,3,4], "B":[5,6,7,8]})
Then I set the cell in the first row and first column to 11:
K.iloc[0]["A"] = 11
And when I check the dataframe again, I see that the value assignment is done and K.iloc[0]["A"] is equal to 11. However when I add a column to this data frame and do the same operation for a cell in the new column, the value assignment is not successful:
K["C"] = 0
K.iloc[0]["C"] = 11
So, when I check the dataframe again, the value of K.iloc[0]["C"] is still zero. I would appreciate it if somebody could tell me what is going on here and how I can resolve this issue.
For simplicity, I would do the operations in a different order and use loc:
K.loc[0, 'C'] = 0
K.loc[0, ['A', 'C']] = 11
When you use K.iloc[0]["C"], you first take the first row, which gives you a copy of a slice of your dataframe, and then you take column C from it. So you change the copy of the slice, not the original dataframe.
The fact that your first call, K.iloc[0]["A"] = 11, worked fine was in some sense just luck.
The good habit is to use loc in "one shot", so that you operate on the original dataframe and not on a copy of a slice:
K.loc[0,"C"] = 11
Be careful: iloc and loc are different functions, even though they seem quite similar here.
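To make the copy-versus-original distinction concrete, here is a minimal sketch; the exact behaviour of the chained form depends on the pandas version and on copy-on-write settings:
import pandas as pd

K = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
K["C"] = 0

# Chained assignment: K.iloc[0] may be a copy of the row, so the write
# can be silently lost (pandas typically emits a SettingWithCopyWarning,
# and under copy-on-write it never reaches K at all).
K.iloc[0]["C"] = 11

# Single-step, label-based assignment always writes into K itself.
K.loc[0, "C"] = 11
print(K.loc[0, "C"])  # 11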
If the index is the default RangeIndex, it is possible to use DataFrame.loc; it sets values by the label 0 (which here is the same as position 0):
K['C'] = 0
K.loc[0, ["A", "C"]] = 11
print (K)
A B C
0 11 5 11
1 2 6 0
2 3 7 0
3 4 8 0
The reason why your solution failed can be found in the docs:
This can work at times, but it is not guaranteed to, and therefore should be avoided:
dfc['A'][0] = 111
A solution with DataFrame.iloc is possible by getting the positions of the columns with Index.get_indexer:
print (K.columns.get_indexer(["A", "C"]))
[0 2]
K['C'] = 0
K.iloc[0, K.columns.get_indexer(["A", "C"])] = 11
print (K)
A B C
0 11 5 11
1 2 6 0
2 3 7 0
3 4 8 0
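Not mentioned in the answers above, but for a single scalar cell the at/iat accessors are the most direct setters; a small sketch:
import pandas as pd

K = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
K["C"] = 0

K.at[0, "C"] = 11                      # label-based scalar assignment
K.iat[0, K.columns.get_loc("C")] = 11  # position-based equivalent
print(K.head(1))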
loc should work:
K.loc[0]['C'] = 11
K.loc[0, 'C'] = 11
Note that the first form is still chained indexing and may not persist, depending on how the frame is stored internally; the second, single-step form reliably assigns the value in K.

Assigning variables to cells in a Pandas table (Python)

I'm working on a script that takes test data from a website, assigns the data to a variable, then creates a pie chart of the responses for later analysis. I'm able to pull the data without a problem and format the information into a table, but I can't figure out how to assign a specific variable to a cell in the table.
For example, say question 1 had 20% of students answer A, 20% answer B, 30% answer C, and 30% answer D. I would like to take this information and assign it to variables: 1A for A, 1B for B, etc.
I think the answer lies in this code. I've tried splitting columns and rows, but it looks like the column header doesn't correlate to the data below it. I'm also attaching the results of 'print(df)' below.
header = table.find_all('tr')[2]
cols = header.find_all('td')
cols = [ele.text.strip() for ele in cols]
cols = cols[0:3] + cols[4:8] + cols[9:]
df = pd.DataFrame(data, columns = cols)
print(df)
A/1 B/2 C/3 D/4 CORRECT MC ANSWER
0 6 84 1 9 B
1 6 1 91 2 C
2 12 1 14 72 D
3 77 3 11 9 A
4 82 7 8 2 A
Do you want to try something like this with 'autopct'?
df1 = df.T.set_axis(['Question '+str(i+1) for i in df.T.columns.values], axis=1, inplace=False).iloc[:4]
ax = df1.plot.pie(subplots=True,autopct='%1.1f%%',layout=(5,1),figsize=(3,15),legend=False)
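If the goal is simply to read individual percentages out of the table into plain variables, label-based indexing is enough. A hedged sketch, assuming the column labels shown in the print(df) output above (the numbers are just the first row of that example):
import pandas as pd

df = pd.DataFrame({'A/1': [6], 'B/2': [84], 'C/3': [1], 'D/4': [9],
                   'CORRECT MC ANSWER': ['B']})

q1_a = df.loc[0, 'A/1']  # share of students answering A on question 1
q1_b = df.loc[0, 'B/2']  # share answering B, and so on

# Or gather every (question, choice) pair into a dict in one pass:
answers = {(q, col): df.loc[q, col]
           for q in df.index
           for col in df.columns[:-1]}  # skip 'CORRECT MC ANSWER'
print(q1_a, q1_b, answers)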

Efficiently concatenate a large number of columns

I am trying to concatenate a large number of columns containing integers into one string.
Basically, starting from:
df = pd.DataFrame({'id':[1,2,3,4],'a':[0,1,2,3], 'b':[4,5,6,7], 'c':[8,9,0,1]})
To obtain:
id join
0 1 481
1 2 592
2 3 603
3 4 714
I found several methods to do this (here and here):
Method 1:
conc['glued'] = ''
i = 1
while i < len(df.columns):
    conc['glued'] = conc['glued'] + df[df.columns[i]].values.astype(str)
    i = i + 1
This method works, but it is a bit slow (45 min on my "test" case of 18,000 rows x 40,000 columns). I am concerned about the loop over the columns, as this program will eventually be applied to tables of 600,000 columns, and I am afraid it will be too slow.
Method 2a
conc['join']=[''.join(row) for row in df[df.columns[1:]].values.astype(str)]
Method 2b
conc['apply'] = df[df.columns[1:]].apply(lambda x: ''.join(x.astype(str)), axis=1)
Both of these methods are about 10 times more efficient than the previous one; they iterate over rows, which is good, and they work perfectly on my "debug" table df. But when I apply them to my "test" table of 18k x 40k, they lead to a MemoryError (I already have 60% of my 32 GB of RAM occupied after reading the corresponding csv file).
I can copy my DataFrame without exceeding memory, but curiously, applying this method makes the code crash.
Do you see how I can fix and improve this code to use efficient row-based iteration? Thank you!
Appendix:
Here is the code I use on my test case:
geno_reader = pd.read_csv(genotype_file,header=0,compression='gzip', usecols=geno_columns_names)
fimpute_geno = pd.DataFrame({'SampID': geno_reader['SampID']})
I should probably use the chunksize option to read this file, but I haven't really understood yet how to use the chunks after reading.
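(For reference, a hedged sketch of what chunked reading could look like here; genotype_file and geno_columns_names are the names already used below, and the chunk size is an arbitrary assumption.)
import pandas as pd

pieces = []
reader = pd.read_csv(genotype_file, header=0, compression='gzip',
                     usecols=geno_columns_names, chunksize=5000)
for chunk in reader:
    # Join each row of this chunk into one string, then keep only the result.
    calls = (chunk[chunk.columns[1:]]
             .astype(int).astype(str)
             .apply(''.join, axis=1))
    pieces.append(pd.DataFrame({'SampID': chunk['SampID'], 'Calls': calls}))

fimpute_geno = pd.concat(pieces, ignore_index=True)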
Method 1:
fimpute_geno['Calls'] = ''
for i in range(1, len(geno_reader.columns)):
    fimpute_geno['Calls'] = fimpute_geno['Calls'] \
        + geno_reader[geno_reader.columns[i]].values.astype(int).astype(str)
This works in 45 min.
There are some quite ugly bits of code, like the .astype(int).astype(str). I don't know why pandas doesn't recognize my integers and treats them as floats.
Method 2:
fimpute_geno['Calls'] = geno_reader[geno_reader.columns[1:]]\
    .apply(lambda x: ''.join(x.astype(int).astype(str)), axis=1)
This leads to a MemoryError.
Here's something to try. It would require that you convert your columns to strings, though. Your sample frame:
b c id
0 4 8 1
1 5 9 2
2 6 0 3
3 7 1 4
then
# you could also write conc[['b', 'c', 'id']] for the next two lines
# (label-based .loc; the older .ix accessor has been removed from pandas)
conc.loc[:, 'b':'id'] = conc.loc[:, 'b':'id'].astype(str)
conc['join'] = np.sum(conc.loc[:, 'b':'id'], axis=1)
Would give
a b c id join
0 0 4 8 1 481
1 1 5 9 2 592
2 2 6 0 3 603
3 3 7 1 4 714
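For completeness, a small sketch of the same idea in current pandas (no .ix): convert the relevant columns to strings once and let the row-wise sum concatenate them. Column order is taken from the expected output at the top of the question:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4], 'a': [0, 1, 2, 3],
                   'b': [4, 5, 6, 7], 'c': [8, 9, 0, 1]})

# Summing object/string columns row-wise concatenates them: '4' + '8' + '1' -> '481'.
cols = ['b', 'c', 'id']
out = df[['id']].copy()
out['join'] = df[cols].astype(str).sum(axis=1)
print(out)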
