Transform a dataframe using pivot - python-3.x

I am trying to transform a dataframe using pivot. Since the column contains duplicate entries, I tried to add a count column, following the approach suggested in another answer (its Question 10).
import pandas as pd
from pprint import pprint
if __name__ == '__main__':
    d = {
        't': [0, 1, 2, 0, 1, 2, 0, 2, 0, 1],
        'input': [2, 2, 2, 2, 2, 2, 4, 4, 4, 4],
        'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B'],
        'value': [0.1, 0.2, 0.3, 1, 2, 3, 1, 2, 1, 1],
    }
    df = pd.DataFrame(d)
    df = df.drop('t', axis=1)
    df.insert(0, 'count', df.groupby('input').cumcount())
    pd.pivot(df, index='count', columns='type', values='value')
But I still get the same error:
ValueError: Index contains duplicate entries, cannot reshape
Could someone please suggest how to resolve this error?

Since you have more than one value associated with each 'A' and 'B' entry, you have to aggregate the values somehow.
If I've understood your issue correctly, one possible solution is the following:
#pip install pandas
import pandas as pd
d = {
    't': [0, 1, 2, 0, 1, 2, 0, 2, 0, 1],
    'input': [2, 2, 2, 2, 2, 2, 4, 4, 4, 4],
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B'],
    'value': [0.1, 0.2, 0.3, 1, 2, 3, 1, 2, 1, 1],
}
df = pd.DataFrame(d)
df
# I've used aggfunc='sum' argument for example, the default value is 'mean'
pd.pivot_table(df, index='t', columns='type', values='value', aggfunc='sum')
Returns:
type    A    B
t
0     1.1  2.0
1     0.2  3.0
2     2.3  3.0
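If the goal is to keep every value instead of aggregating, the counter approach from the question can probably still work. The original attempt numbers rows per 'input' only, so the pair (count 0, type 'A') occurs once for input 2 and once for input 4, which is what triggers the duplicate-entries error. The sketch below is one possible fix, assuming each (input, type) group should become its own run of rows; it numbers rows within each (input, type) pair and pivots on both 'input' and 'count'. Passing a list as `index` to `pivot` assumes pandas 1.1 or newer.

```python
import pandas as pd

d = {
    't': [0, 1, 2, 0, 1, 2, 0, 2, 0, 1],
    'input': [2, 2, 2, 2, 2, 2, 4, 4, 4, 4],
    'type': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B'],
    'value': [0.1, 0.2, 0.3, 1, 2, 3, 1, 2, 1, 1],
}
df = pd.DataFrame(d).drop('t', axis=1)

# Number the rows inside each (input, type) group: 0, 1, 2, ...
df['count'] = df.groupby(['input', 'type']).cumcount()

# (input, count) together with type is now unique, so pivot can reshape.
wide = df.pivot(index=['input', 'count'], columns='type', values='value')
print(wide)
```

Whether this or `pivot_table` is the better choice depends on whether the individual values or their aggregate are needed.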

Related

How to generate all possible combinations from list elements in Python having Pandas DataFrames in list?

Good afternoon!
I have a list of lists in Python. Example:
mylist = [['a', 'b', 'c'],
[1, 2],
[df1, df2]]
df1, df2 are Pandas DataFrames. I want to generate result similar to itertools.product(*mylist).
The problem is that Pandas DataFrames are iterables themselves, so the result which product returns is not what I want. I want:
[('a', 1, df1),
('a', 1, df2),
('a', 2, df1),
('a', 2, df2),
('b', 1, df1),
('b', 1, df2),
('b', 2, df1),
('b', 2, df2),
('c', 1, df1),
('c', 1, df2),
('c', 2, df1),
('c', 2, df2)]
But product, of course, cannot generate the desired output, since it begins to iterate over the columns of df1 and df2. How can I solve this problem in an elegant and Pythonic way?
Any help appreciated
Are you sure? product() iterates over the iterables passed to it, but only one level deep.
>>> from itertools import product
>>> mylist = [[1, 2], ['a', 'b'], [[4, 6], [8, 9]]]
>>> for x in product(*mylist):
...     print(x)
(1, 'a', [4, 6])
(1, 'a', [8, 9])
(1, 'b', [4, 6])
(1, 'b', [8, 9])
(2, 'a', [4, 6])
(2, 'a', [8, 9])
(2, 'b', [4, 6])
(2, 'b', [8, 9])
See? That [4, 6] and [8, 9] are themselves iterables is irrelevant to product().
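The same check can be repeated with actual DataFrames. This is a minimal sketch; the contents of `df1` and `df2` are made up for illustration. As long as the DataFrames sit inside a list, product() should hand them back whole.

```python
from itertools import product

import pandas as pd

# Hypothetical stand-ins for the DataFrames in the question.
df1 = pd.DataFrame({'x': [1, 2]})
df2 = pd.DataFrame({'y': [3, 4]})

mylist = [['a', 'b', 'c'], [1, 2], [df1, df2]]

# product() iterates over the inner list [df1, df2], not over the frames.
combos = list(product(*mylist))

for letter, number, frame in combos:
    print(letter, number, list(frame.columns))
```

Column names would only show up in the result if a DataFrame were passed to product() directly as one of its arguments, instead of being wrapped in a list.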

Is it possible to make a dictionary with lists as values from a list of dictionaries, in a one line comprehension?

If I have a list of dicts A:
A = [{ 'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
can I make the following dict:
B = {'a': [1, 3], 'b': [2, 4]}
using only dict/list comprehension?
Bonus: can I also account for varying keys in A, e.g.:
A = [{ 'a': 1, 'b': 2}, {'a': 3, 'b': 4, 'c': 5}]
B = {'a': [1, 3], 'b': [2, 4], 'c': [None, 5]}
I have managed to do this with for loops and if statements, but was hoping for something that runs faster.
Try:
A = [{"a": 1, "b": 2}, {"a": 3, "b": 4, "c": 5}]
out = {k: [d.get(k) for d in A] for k in set(k for d in A for k in d)}
print(out)
Prints:
{'a': [1, 3], 'b': [2, 4], 'c': [None, 5]}
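One caveat: a set has no guaranteed iteration order, so the keys of the result may come out in a different order from run to run. If the order matters, a possible variant (a sketch, not the only way) collects the keys with dict.fromkeys, which keeps them in first-seen order on Python 3.7+.

```python
A = [{"a": 1, "b": 2}, {"a": 3, "b": 4, "c": 5}]

# dict.fromkeys drops duplicate keys but preserves first-seen order.
keys = dict.fromkeys(k for d in A for k in d)

# d.get(k) yields None for dicts that lack the key.
out = {k: [d.get(k) for d in A] for k in keys}
print(out)
```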

Counting number of certain values in each column using pandas and collections

I have a txt file including 9 columns and 6 rows. The first 8 columns are either of these values: "1" , "2" and "3". I named these columns from "A" to "H". I named the last column: "class".
The last column is a name : "HIGH". Here is the txt file (data.txt):
1,1,1,1,2,1,1,3,HIGH
1,1,1,2,2,1,1,3,HIGH
1,1,1,1,1,1,1,3,HIGH
1,1,1,2,1,1,1,3,HIGH
1,1,1,3,2,1,1,3,HIGH
1,1,1,2,1,2,1,3,HIGH
I am trying to count the occurrences of each value in each column and print a list with 3 components: the numbers of "1", "2" and "3" values in that column, respectively. For example, in the first column (A) all values are "1", so I expect to get A : [6,0,0]. For the 8th column (H), where all values are "3", I expect H : [0,0,6]. The fourth column (D) has two "1", three "2" and one "3", so I expect D : [2,3,1]. I tried to get it done using pandas and collections. Here is what I did:
import pandas as pd
from collections import Counter
df = pd.read_csv('data.txt')
df.columns = ['A','B','C','D','E','F','G','H','class']
X = df.ix[:, 0:8].values
y = df.ix[:, 8].values
deg = ['HIGH']
names = ['A','B','C','D','E','F','G','H']
for j in range(0, 8):
    freqs = Counter(X[y == deg[0], j])
    print(names[j],':',list(freqs.values()))
The output of the above code are empty lists. Here is what it returns:
A : []
B : []
C : []
D : []
E : []
F : []
G : []
H : []
How can I modify the above code to get what I want?
Thanks!
Use pandas.Series.value_counts:
# "list" is spelled out because the short form "l" is not accepted by recent pandas versions
df.loc[:, :"H"].apply(pd.Series.value_counts).fillna(0).to_dict("list")
Output:
{'A': [6.0, 0.0, 0.0],
'B': [6.0, 0.0, 0.0],
'C': [6.0, 0.0, 0.0],
'D': [2, 3, 1],
'E': [3.0, 3.0, 0.0],
'F': [5.0, 1.0, 0.0],
'G': [6.0, 0.0, 0.0],
'H': [0.0, 0.0, 6.0]}
Define the following function:
def cntInts(col):
    vc = col.value_counts()
    return [ vc.get(i, 0) for i in range(1,4) ]
Then apply it and print results:
# items() replaces iteritems(), which was removed in pandas 2.0
for k, v in df.loc[:, 'A':'H'].apply(cntInts).items():
    print(f'{k}: {v}')
For your data sample I got:
A: [6, 0, 0]
B: [6, 0, 0]
C: [6, 0, 0]
D: [2, 3, 1]
E: [3, 3, 0]
F: [5, 1, 0]
G: [6, 0, 0]
H: [0, 0, 6]
Or maybe it is enough to call just:
df.loc[:, 'A':'H'].apply(cntInts)
This time the result is a Series, which when printed yields:
A [6, 0, 0]
B [6, 0, 0]
C [6, 0, 0]
D [2, 3, 1]
E [3, 3, 0]
F [5, 1, 0]
G [6, 0, 0]
H [0, 0, 6]
dtype: object
Edit
Following your comments I suppose that there is something wrong with your data.
To trace the actual reason:
Define a string variable:
txt = '''1,1,1,1,2,1,1,3,HIGH
1,1,1,2,2,1,1,3,HIGH
1,1,1,1,1,1,1,3,HIGH
1,1,1,2,1,1,1,3,HIGH
1,1,1,3,2,1,1,3,HIGH
1,1,1,2,1,2,1,3,HIGH'''
Run:
import io
df = pd.read_csv(io.StringIO(txt), names=['A','B','C','D','E','F','G','H','class'])
Run my code on my data. The result should be just as expected.
Then read your input file (also into df) and run my code again.
Probably there is some difference between your data and mine.
Especially look for any extra spaces in your input file,
check also column types (after read_csv).
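The checks above can be folded into one script. This is a sketch under a few assumptions: the file has no header row (so header=None is passed; the plain read_csv call in the question treats the first data row as a header), stray spaces after commas should be ignored (skipinitialspace=True), and the values to count are the integers 1, 2 and 3. It also avoids df.ix, which was removed in pandas 1.0. The sample text stands in for data.txt.

```python
import io

import pandas as pd

txt = '''1,1,1,1,2,1,1,3,HIGH
1,1,1,2,2,1,1,3,HIGH
1,1,1,1,1,1,1,3,HIGH
1,1,1,2,1,1,1,3,HIGH
1,1,1,3,2,1,1,3,HIGH
1,1,1,2,1,2,1,3,HIGH'''

names = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'class']

# header=None keeps the first line as data instead of using it as column names.
df = pd.read_csv(io.StringIO(txt), header=None, names=names,
                 skipinitialspace=True)

# The first eight columns should be integer typed; object dtype hints at dirty data.
print(df.dtypes)

# reindex forces the order 1, 2, 3 and fills values that never occur with 0.
counts = {
    col: df[col].value_counts().reindex([1, 2, 3], fill_value=0).tolist()
    for col in names[:-1]
}
print(counts)
```

To run it on the real file, replace io.StringIO(txt) with the path to data.txt.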
A solution with collections: select all columns except the last, convert each column's Counter to a Series so that the output is a DataFrame, replace missing values with DataFrame.fillna, convert the values to integers, and finally convert to a dictionary with DataFrame.to_dict:
from collections import Counter
d = (df.iloc[:, :-1].apply(lambda x: pd.Series(Counter(x)))
       .fillna(0)
       .astype(int)
       .to_dict("list"))
print(d)
{'A': [6, 0, 0], 'B': [6, 0, 0],
'C': [6, 0, 0], 'D': [2, 3, 1],
'E': [3, 3, 0], 'F': [5, 1, 0],
'G': [6, 0, 0], 'H': [0, 0, 6]}
A pandas-only solution with Series.value_counts (the top-level pandas.value_counts is deprecated in recent pandas versions):
d = (df.iloc[:, :-1].apply(pd.Series.value_counts)
       .fillna(0)
       .astype(int)
       .to_dict("list"))
print(d)
{'A': [6, 0, 0], 'B': [6, 0, 0],
'C': [6, 0, 0], 'D': [2, 3, 1],
'E': [3, 3, 0], 'F': [5, 1, 0],
'G': [6, 0, 0], 'H': [0, 0, 6]}
Working in pure Python, since your end result is a dictionary:
from string import ascii_uppercase
from collections import Counter, defaultdict
from itertools import chain, product
import csv
d = defaultdict(list)
fieldnames = ascii_uppercase[:9]
# test.csv is your file above
with open('test.csv') as csvfile:
    reader = csv.DictReader(csvfile, fieldnames = list(fieldnames))
    reader = Counter(chain.from_iterable(row.items() for row in reader))
for col, value in product(fieldnames, ("1","2","3")):
    if col != fieldnames[-1]:
        d[col].append(reader.get((col,value), 0))
print(d)
defaultdict(list,
{'A': [6, 0, 0],
'B': [6, 0, 0],
'C': [6, 0, 0],
'D': [2, 3, 1],
'E': [3, 3, 0],
'F': [5, 1, 0],
'G': [6, 0, 0],
'H': [0, 0, 6]})

Python count two arrays of strings and integers

I have two arrays, X and Y. X holds product titles, which repeat, and Y holds integer amounts of how many units of the corresponding X entry were sold.
I used Counter to count the number of occurrences for each element in X, but it does not take into account Y.
from collections import Counter
x = ['a','a','b','c','c','c','c','d','d','d','e','e']
y = [1, 5, 3, 1, 1, 1, 3, 5, 2, 1, 8, 1]
countX = Counter(x)
Use defaultdict:
from collections import defaultdict
x = ['a', 'a', 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e']
y = [1, 5, 3, 1, 1, 1, 3, 5, 2, 1, 8, 1]
output = defaultdict(int)
for prod, count in zip(x, y):
    output[prod] += count
print(output)
# defaultdict(<class 'int'>, {'a': 6, 'b': 3, 'c': 6, 'd': 8, 'e': 9})
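Since the question started from Counter, the same accumulation can also be written with it. This is a minimal sketch of an alternative, not a correction: a Counter returns 0 for missing keys, so += works the same way, and most_common then gives the totals sorted from highest to lowest.

```python
from collections import Counter

x = ['a', 'a', 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e']
y = [1, 5, 3, 1, 1, 1, 3, 5, 2, 1, 8, 1]

totals = Counter()
for prod, count in zip(x, y):
    # Missing keys count as 0, so no initialisation is needed.
    totals[prod] += count

print(totals)
print(totals.most_common(1))  # best-selling product with its total
```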

How to get the row/column labels of a Confusion Matrix from scikit-learn?

How would I confirm the columns/rows of an outputted Confusion Matrix if I didn't initially specify them when creating the matrix, such as in the code below:
from sklearn.metrics import confusion_matrix
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
cm = confusion_matrix(y_true, y_pred)
array([[2, 0, 0],
[0, 0, 1],
[1, 0, 2]])
From the docs I know it says "If none is given, those that appear at least once in y_true or y_pred are used in sorted order", so I would assume the columns/rows would be ("ant", "bird", "cat"), but how do I confirm that?
I tried something like cm.labels but that doesn't work.
In the source code of the confusion_matrix:
if labels is None:
    labels = unique_labels(y_true, y_pred)
What is unique_labels and where is it imported from?
from sklearn.utils.multiclass import unique_labels
unique_labels(y_true, y_pred)
Returns
array(['ant', 'bird', 'cat'],
dtype='<U4')
unique_labels extracts an ordered array of unique labels.
Examples:
>>> from sklearn.utils.multiclass import unique_labels
>>> unique_labels([3, 5, 5, 5, 7, 7])
array([3, 5, 7])
>>> unique_labels([1, 2, 3, 4], [2, 2, 3, 4])
array([1, 2, 3, 4])
>>> unique_labels([1, 2, 10], [5, 11])
array([ 1, 2, 5, 10, 11])
Maybe a more intuitive example:
unique_labels(['z', 'x', 'y'], ['a', 'z', 'c'], ['e', 'd', 'y'])
Returns:
array(['a', 'c', 'd', 'e', 'x', 'y', 'z'],
dtype='<U1')
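To double-check the row/column order without relying on library internals, the matrix from the question can be rebuilt by hand. The sketch below is a plain-Python re-implementation for illustration only, not scikit-learn's actual code. It assumes the labels are the sorted union of y_true and y_pred, as the docs describe, with rows for true labels and columns for predicted labels.

```python
from collections import Counter

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]

# Assumed label order: every label seen in either list, sorted.
labels = sorted(set(y_true) | set(y_pred))

# Count each (true, predicted) pair; missing pairs count as 0.
pair_counts = Counter(zip(y_true, y_pred))

# Rows follow the true labels, columns the predicted labels.
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

for label, row in zip(labels, matrix):
    print(f"{label:>5}: {row}")
```

The rebuilt matrix matches the array shown in the question, which supports the ("ant", "bird", "cat") ordering. Passing labels=... explicitly to confusion_matrix removes the guesswork altogether.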
