Counting number of certain values in each column using pandas and collections - python-3.x

I have a txt file with 9 columns and 6 rows. The values in the first 8 columns are "1", "2" or "3". I named these columns "A" to "H" and the last column "class".
Every value in the last column is the string "HIGH". Here is the txt file (data.txt):
1,1,1,1,2,1,1,3,HIGH
1,1,1,2,2,1,1,3,HIGH
1,1,1,1,1,1,1,3,HIGH
1,1,1,2,1,1,1,3,HIGH
1,1,1,3,2,1,1,3,HIGH
1,1,1,2,1,2,1,3,HIGH
I am trying to count the occurrences of each value in each column and print a list with 3 components: the counts of "1", "2" and "3" in that column, respectively. For example, in the first column (A) all values are "1", so I expect A : [6,0,0]. For the 8th column (H), where all values are "3", I expect H : [0,0,6]. For the fourth column (D) there are two "1", three "2" and one "3", so I expect D : [2,3,1]. I tried to do this using pandas and collections. Here is what I did:
import pandas as pd
from collections import Counter
df = pd.read_csv('data.txt')
df.columns = ['A','B','C','D','E','F','G','H','class']
X = df.ix[:, 0:8].values
y = df.ix[:, 8].values
deg = ['HIGH']
names = ['A','B','C','D','E','F','G','H']
for j in range(0, 8):
    freqs = Counter(X[y == deg[0], j])
    print(names[j], ':', list(freqs.values()))
The output of the above code consists of empty lists. Here is what it returns:
A : []
B : []
C : []
D : []
E : []
F : []
G : []
H : []
How can I modify the above code to get what I want?
Thanks!

Use pandas.Series.value_counts
df.loc[:, :"H"].apply(pd.Series.value_counts).fillna(0).to_dict("l")
Output:
{'A': [6.0, 0.0, 0.0],
'B': [6.0, 0.0, 0.0],
'C': [6.0, 0.0, 0.0],
'D': [2, 3, 1],
'E': [3.0, 3.0, 0.0],
'F': [5.0, 1.0, 0.0],
'G': [6.0, 0.0, 0.0],
'H': [0.0, 0.0, 6.0]}
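The float entries above appear because columns with missing counts become float before fillna; if you prefer integers throughout, you can cast before converting (a small variation on the same idea):
df.loc[:, :"H"].apply(pd.Series.value_counts).fillna(0).astype(int).to_dict("list")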

Define the following function:
def cntInts(col):
    vc = col.value_counts()
    return [vc.get(i, 0) for i in range(1, 4)]
Then apply it and print results:
for k, v in df.loc[:, 'A':'H'].apply(cntInts).items():
    print(f'{k}: {v}')
For your data sample I got:
A: [6, 0, 0]
B: [6, 0, 0]
C: [6, 0, 0]
D: [2, 3, 1]
E: [3, 3, 0]
F: [5, 1, 0]
G: [6, 0, 0]
H: [0, 0, 6]
Or maybe it is enough to call just:
df.loc[:, 'A':'H'].apply(cntInts)
This time the result is a Series, which when printed yields:
A [6, 0, 0]
B [6, 0, 0]
C [6, 0, 0]
D [2, 3, 1]
E [3, 3, 0]
F [5, 1, 0]
G [6, 0, 0]
H [0, 0, 6]
dtype: object
Edit
Following your comments, I suspect that there is something wrong with your data.
To trace the actual reason:
Define a string variable:
txt = '''1,1,1,1,2,1,1,3,HIGH
1,1,1,2,2,1,1,3,HIGH
1,1,1,1,1,1,1,3,HIGH
1,1,1,2,1,1,1,3,HIGH
1,1,1,3,2,1,1,3,HIGH
1,1,1,2,1,2,1,3,HIGH'''
Run:
import io
df = pd.read_csv(io.StringIO(txt), names=['A','B','C','D','E','F','G','H','class'])
Run my code on this data; the result should be just as expected.
Then read your input file (also into df) and run my code again.
There is probably some difference between your data and mine.
In particular, look for any extra spaces in your input file,
and check the column types (after read_csv).
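One quick check is to print the column dtypes: if a flag column shows up as object instead of int64, the file likely contains stray spaces or other non-numeric characters. A minimal sketch of that check and one possible cleanup (assuming whitespace is the culprit):
print(df.dtypes)

# skipinitialspace drops spaces that follow the delimiter;
# str.strip removes leading/trailing whitespace from the class labels
df = pd.read_csv('data.txt', names=['A','B','C','D','E','F','G','H','class'],
                 skipinitialspace=True)
df['class'] = df['class'].str.strip()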

A solution with collections: select all columns except the last, convert each column's Counter to a Series (so apply returns a DataFrame), replace missing values with DataFrame.fillna, convert the values to integers, and finally convert to a dictionary with DataFrame.to_dict:
from collections import Counter
d = (df.iloc[:, :-1].apply(lambda x: pd.Series(Counter(x)))
       .fillna(0)
       .astype(int)
       .to_dict("list"))
print (d)
{'A': [6, 0, 0], 'B': [6, 0, 0],
'C': [6, 0, 0], 'D': [2, 3, 1],
'E': [3, 3, 0], 'F': [5, 1, 0],
'G': [6, 0, 0], 'H': [0, 0, 6]}
A pandas-only solution with pandas.value_counts:
d = (df.iloc[:, :-1].apply(pd.value_counts)
       .fillna(0)
       .astype(int)
       .to_dict("list"))
print (d)
{'A': [6, 0, 0], 'B': [6, 0, 0],
'C': [6, 0, 0], 'D': [2, 3, 1],
'E': [3, 3, 0], 'F': [5, 1, 0],
'G': [6, 0, 0], 'H': [0, 0, 6]}
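If the top-level pd.value_counts is deprecated in your pandas version, calling value_counts on each column gives the same result:
d = (df.iloc[:, :-1].apply(lambda s: s.value_counts())
       .fillna(0)
       .astype(int)
       .to_dict("list"))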

Working in plain Python, since your end result is a dictionary:
from string import ascii_uppercase
from collections import Counter, defaultdict
from itertools import chain, product
import csv
d = defaultdict(list)
fieldnames = ascii_uppercase[:9]
# test.csv is your file above
with open('test.csv') as csvfile:
    reader = csv.DictReader(csvfile, fieldnames=list(fieldnames))
    reader = Counter(chain.from_iterable(row.items() for row in reader))

for col, value in product(fieldnames, ("1", "2", "3")):
    if col != fieldnames[-1]:
        d[col].append(reader.get((col, value), 0))
print(d)
defaultdict(list,
{'A': [6, 0, 0],
'B': [6, 0, 0],
'C': [6, 0, 0],
'D': [2, 3, 1],
'E': [3, 3, 0],
'F': [5, 1, 0],
'G': [6, 0, 0],
'H': [0, 0, 6]})

Related

Transform a dataframe using pivot

I am trying to transform a dataframe using pivot. Since the column contains duplicate entries, I tried to add a count column, following what's suggested here (Question 10 posted in this answer).
import pandas as pd
from pprint import pprint

if __name__ == '__main__':
    d = {
        't': [0, 1, 2, 0, 1, 2, 0, 2, 0, 1],
        'input': [2, 2, 2, 2, 2, 2, 4, 4, 4, 4],
        'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B'],
        'value': [0.1, 0.2, 0.3, 1, 2, 3, 1, 2, 1, 1],
    }
    df = pd.DataFrame(d)
    df = df.drop('t', axis=1)
    df.insert(0, 'count', df.groupby('input').cumcount())
    pd.pivot(df, index='count', columns='type', values='value')
But I still get the same error: ValueError: Index contains duplicate entries, cannot reshape.
Could someone please suggest how to resolve this error?
Since you have more than one value associated with 'A' and 'B', you have to aggregate the values somehow.
So if I've understood your issue correctly, a possible solution is the following:
#pip install pandas
import pandas as pd
d = {
    't': [0, 1, 2, 0, 1, 2, 0, 2, 0, 1],
    'input': [2, 2, 2, 2, 2, 2, 4, 4, 4, 4],
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B'],
    'value': [0.1, 0.2, 0.3, 1, 2, 3, 1, 2, 1, 1],
}
df = pd.DataFrame(d)
df
# I've used aggfunc='sum' as an example; the default is 'mean'
pd.pivot_table(df, index='t', columns='type', values='value', aggfunc='sum')
Returns
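(for the sample data above, the summed pivot works out to)
type    A    B
t
0     1.1  2.0
1     0.2  3.0
2     2.3  3.0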

How to merge the list values of two dictionaries with shared keys?

e.g.
d1 = {'a':[1, 2, 3], 'b': [1, 2, 3]}
d2 = {'a':[4, 5, 6], 'b': [3, 4, 5]}
The output should be like this:
{'a':[1, 2, 3, 4, 5, 6], 'b': [1, 2, 3, 4, 5]}
If the value repeats itself, it should be recorded only once.
Assuming both dictionaries have the same keys, one way to achieve this could be:
d1 = {'a':[1, 2, 3], 'b': [1, 2, 3]}
d2 = {'a':[4, 5, 6], 'b': [3, 4, 5]}
# make a list of both dictionaries
ds = [d1, d2]
# d will be the resultant dictionary
d = {}
for k in d1.keys():
    # gather the lists for key k from both dictionaries, then flatten and deduplicate
    d[k] = [di[k] for di in ds]
    d[k] = list(set(item for sublist in d[k] for item in sublist))
print(d)
Output
{'a': [1, 2, 3, 4, 5, 6], 'b': [1, 2, 3, 4, 5]}
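A more compact variant (a sketch assuming, as above, that both dictionaries share the same keys), using a set union per key and sorting for a deterministic order:
d1 = {'a': [1, 2, 3], 'b': [1, 2, 3]}
d2 = {'a': [4, 5, 6], 'b': [3, 4, 5]}

# union of the two lists per key, deduplicated by set, sorted for a stable order
merged = {k: sorted(set(d1[k]) | set(d2[k])) for k in d1}
print(merged)  # {'a': [1, 2, 3, 4, 5, 6], 'b': [1, 2, 3, 4, 5]}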

3D matrix addition python

I am trying to add two 3D matrices, but the third loop does not seem to start from 0.
The shape of each matrix is (2, 3, 3).
Code:
for i in range(0, r):
    for j in range(0, c):
        for l in range(0, k):
            sum[i][j][k] = A1[i][j][k] + A2[i][j][k]
Output:
IndexError: index 3 is out of bounds for axis 0 with size 3
For element-wise addition of two matrices, you can simply use the + operator between two numpy arrays:
import numpy as np

# create two matrices of random integers
matrix1 = np.random.randint(10, size=(2, 3, 3))
matrix2 = np.random.randint(10, size=(2, 3, 3))

# add the two matrices element-wise
sum_matrix = matrix1 + matrix2
print(matrix1, matrix2, sum_matrix, sep='\n__________\n')
I don't get an IndexError. Maybe you could post your whole code?
This is my code:
arr1 = [[[2, 4, 8], [7, 7, 1], [4, 9, 0]], [[5, 0, 0], [3, 8, 6], [0, 5, 8]]]
arr2 = [[[3, 8, 0], [1, 5, 2], [0, 3, 9]], [[9, 7, 7], [1, 2, 5], [1, 1, 3]]]
sumArr = [[[0, 0, 0], [0, 0, 0], [0, 0, 0]], [[0, 0, 0], [0, 0, 0],[0, 0, 0]]]
for i in range(2):  # can also use range(0, 2)
    for j in range(3):
        for k in range(3):
            sumArr[i][j][k] = arr1[i][j][k] + arr2[i][j][k]
print(sumArr)
By the way, is it necessary to use for loops?
If not, you can use the numpy library.
import numpy as np
Convert your nested lists to numpy arrays, then do the addition.
arr1 = [[[2, 4, 8], [7, 7, 1], [4, 9, 0]], [[5, 0, 0], [3, 8, 6], [0, 5, 8]]]
arr2 = [[[3, 8, 0], [1, 5, 2], [0, 3, 9]], [[9, 7, 7], [1, 2, 5], [1, 1, 3]]]
m1 = np.array(arr1)
m2 = np.array(arr2)
print("M1: \n", m1)
print("M2: \n", m2)
print("Sum: \n", m1 + m2)
You iterate with l in the third loop but use k to index the lists. As a result, your code tries to access index k, which doesn't exist, and you get the error.
Use this:
for i in range(0, r):
    for j in range(0, c):
        for l in range(0, k):
            sum[i][j][l] = A1[i][j][l] + A2[i][j][l]

Function to generate incremental weights based on np.select conditions

Objective: define a function that uses the flags (1, 2, 3) as conditions that trigger different weights (0.2, 0.4, 0). The output is a new df containing only the weights.
The np.select call generates this error:
TypeError: invalid entry 0 in condlist: should be boolean ndarray
Image shows desired output as "incremental weight output"
import pandas as pd
import numpy as np
flags = pd.DataFrame({'Date': ['2020-01-01', '2020-02-01', '2020-03-01'],
                      'flag_1': [1, 2, 3],
                      'flag_2': [1, 1, 1],
                      'flag_3': [2, 1, 2],
                      'flag_4': [3, 1, 3],
                      'flag_5': [1, 2, 2],
                      'flag_6': [2, 1, 2],
                      'flag_7': [1, 1, 1],
                      'flag_8': [1, 1, 1],
                      'flag_9': [3, 3, 2]})
flags = flags.set_index('Date')
def inc_weights(dfin, wt1, wt2, wt3):
    dfin = pd.DataFrame(dfin.iloc[:, ::-1])
    dfout = pd.DataFrame()
    conditions = [1, 2, 3]
    choices = [wt1, wt2, wt3]
    dfout = np.select(conditions, choices, default=np.nan)
    return dfout.iloc[:, ::-1]
inc_weights = inc_weights(flags, .2, .4, 0)
print(inc_weights)
np.select was unnecessary; a simple solution uses df.replace with a mapping dict.
import pandas as pd
import numpy as np
flags = pd.DataFrame({'Date': ['2020-01-01', '2020-02-01', '2020-03-01'],
                      'flag_1': [1, 2, 3],
                      'flag_2': [1, 1, 1],
                      'flag_3': [2, 1, 2],
                      'flag_4': [3, 1, 3],
                      'flag_5': [1, 2, 2],
                      'flag_6': [2, 1, 2],
                      'flag_7': [1, 1, 1],
                      'flag_8': [1, 1, 1],
                      'flag_9': [3, 3, 2]})
flags = flags.set_index('Date')
print(flags)
def inc_weights(dfin, wt1, wt2, wt3):
    dfin = pd.DataFrame(dfin.iloc[:, ::-1])
    dfout = pd.DataFrame()
    mapping = {1: wt1, 2: wt2, 3: wt3}
    dfout = dfin.replace(mapping)
    return dfout.iloc[:, ::-1]
inc_weights = inc_weights(flags, .2, .4, 0)
print(inc_weights)
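If you do want to keep np.select, the conditions need to be boolean arrays rather than the raw flag values. A minimal sketch of that variant (the inc_weights_select name is just illustrative, not from the original answer):
def inc_weights_select(dfin, wt1, wt2, wt3):
    # each condition is a boolean DataFrame; np.select picks the matching weight element-wise
    conditions = [dfin == 1, dfin == 2, dfin == 3]
    choices = [wt1, wt2, wt3]
    out = np.select(conditions, choices, default=np.nan)
    return pd.DataFrame(out, index=dfin.index, columns=dfin.columns)

print(inc_weights_select(flags, .2, .4, 0))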

idiom for getting contiguous copies

In the help for numpy.broadcast_arrays, an idiom is introduced.
However, the idiom gives exactly the same output as the original command.
What is the meaning of "getting contiguous copies instead of non-contiguous views"?
https://docs.scipy.org/doc/numpy/reference/generated/numpy.broadcast_arrays.html
x = np.array([[1,2,3]])
y = np.array([[1],[2],[3]])
np.broadcast_arrays(x, y)
[array([[1, 2, 3],
[1, 2, 3],
[1, 2, 3]]), array([[1, 1, 1],
[2, 2, 2],
[3, 3, 3]])]
Here is a useful idiom for getting contiguous copies instead of non-contiguous views.
[np.array(a) for a in np.broadcast_arrays(x, y)]
[array([[1, 2, 3],
[1, 2, 3],
[1, 2, 3]]), array([[1, 1, 1],
[2, 2, 2],
[3, 3, 3]])]
To understand the difference try writing into the new arrays:
Let's begin with the contiguous copies.
>>> import numpy as np
>>> x = np.array([[1,2,3]])
>>> y = np.array([[1],[2],[3]])
>>>
>>> xc, yc = [np.array(a) for a in np.broadcast_arrays(x, y)]
>>> xc
array([[1, 2, 3],
[1, 2, 3],
[1, 2, 3]])
We can modify an element and nothing unexpected will happen.
>>> xc[0, 0] = 0
>>> xc
array([[0, 2, 3],
[1, 2, 3],
[1, 2, 3]])
>>> x
array([[1, 2, 3]])
Now, let's try the same with the broadcasted arrays:
>>> xb, yb = np.broadcast_arrays(x, y)
>>> xb
array([[1, 2, 3],
[1, 2, 3],
[1, 2, 3]])
Although we only write to the top left element ...
>>> xb[0, 0] = 0
... the entire left column will change ...
>>> xb
array([[0, 2, 3],
[0, 2, 3],
[0, 2, 3]])
... and also the input array.
>>> x
array([[0, 2, 3]])
It means that the broadcast_arrays function doesn't create entirely new objects. It returns views of the original arrays, so its results share memory with those arrays and may or may not be contiguous. But when you wrap each result in np.array(a), you create new copies, which guarantees that the data is stored contiguously in memory.
You can check this as follows:
arr = np.broadcast_arrays(x, y)
In [144]: arr
Out[144]:
[array([[1, 2, 3],
[1, 2, 3],
[1, 2, 3]]), array([[1, 1, 1],
[2, 2, 2],
[3, 3, 3]])]
In [145]: x
Out[145]: array([[1, 2, 3]])
In [146]: arr[0][0] = 0
In [147]: arr
Out[147]:
[array([[0, 0, 0],
[0, 0, 0],
[0, 0, 0]]), array([[1, 1, 1],
[2, 2, 2],
[3, 3, 3]])]
In [148]: x
Out[148]: array([[0, 0, 0]])
As you can see, changing arr's elements changes both arr and the original x array.
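A quick way to confirm this programmatically is to check memory sharing and contiguity directly; a small sketch using np.shares_memory and the array flags attribute:
import numpy as np

x = np.array([[1, 2, 3]])
y = np.array([[1], [2], [3]])

xb, yb = np.broadcast_arrays(x, y)          # views that share memory with x and y
xc, yc = [np.array(a) for a in (xb, yb)]    # independent, contiguous copies

print(np.shares_memory(x, xb))   # True  -> xb is a view of x
print(np.shares_memory(x, xc))   # False -> xc is a separate copy
print(xb.flags['C_CONTIGUOUS'])  # False -> the broadcast view has a 0 stride
print(xc.flags['C_CONTIGUOUS'])  # True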
