I have a DataFrame with 3 columns (['A', 'B', 'C']) and 3 rows.
We use a for loop to fetch a value from the DataFrame, based on a condition on column B, and store it in a variable.
We then append the value held in that variable to a list.
The problem is that when we inspect the list, each entry shows the value together with its type.
I'm not sure why this happens, since the list should contain only the values.
Can anyone please help us find the right solution?
Thanks,
Bhuwan
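DataFrame: columns A, B, C; the rows hold the values a to i:
import pandas as pd
df = pd.DataFrame([['a', 'b', 'c'], ['d', 'b', 'f'], ['g', 'b', 'i']], columns=['A', 'B', 'C'])
list_1 = []
for i in range(0, 9):
    variable_1 = df['A'][df.B == 'b']  # selects a whole Series, not a single value
    list_1.append(variable_1)
print(list_1)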
Ideal output: ['a','d','g']
whereas the output we are getting is
['a type: object','d type: object','g type: object'].
You can get your ideal output like this:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'd', 'g'], 'B': ['b', 'b', 'b'], 'C': ['c', 'f', 'i']})
list_1 = list(df[df['B'] == 'b']['A'].values) # <- this line
print(list_1)
> ['a', 'd', 'g']
You just need to:
1) filter your DataFrame by column "B": df[df['B'] == 'b']
2) and only then take the values of the resulting column "A" and turn them into a list
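As background on why the loop in the question misbehaves: the expression df['A'][df.B == 'b'] evaluates to a whole pandas Series, so every append stores a Series object rather than a single value. The sketch below assumes the sample data from the answer above. It uses .tolist() as an alternative to list(... .values), and the name selected is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'd', 'g'], 'B': ['b', 'b', 'b'], 'C': ['c', 'f', 'i']})

# The expression used inside the question's loop selects a whole Series,
# so appending it to a list stores the Series object, not its values.
selected = df['A'][df.B == 'b']
print(type(selected).__name__)

# Select the rows and the column in one .loc call, then convert to a plain list.
list_1 = df.loc[df['B'] == 'b', 'A'].tolist()
print(list_1)
```

No loop is needed here, because the boolean mask already selects every matching row at once.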
I have a pandas series that looks like this:
import numpy as np
import string
import pandas as pd
np.random.seed(0)
data = np.random.randint(1,6,10)
index = list(string.ascii_lowercase)[:10]
a = pd.Series(data=data,index=index,name='apple')
a
>>>
a 5
b 1
c 4
d 4
e 4
f 2
g 4
h 3
i 5
j 1
Name: apple, dtype: int32
I want to group the series by its values and return a dict mapping each value to the list of its index labels, i.e. this result:
{1: ['b', 'j'], 2: ['f'], 3: ['h'], 4: ['c', 'd', 'e', 'g'], 5: ['a', 'i']}
Here is how I achieve that at the moment:
b = a.reset_index().set_index('apple').squeeze()
grouped = b.groupby(level=0).apply(list).to_dict()
grouped
>>>
{1: ['b', 'j'], 2: ['f'], 3: ['h'], 4: ['c', 'd', 'e', 'g'], 5: ['a', 'i']}
However, it does not feel particularly pythonic to explicitly transform the series first so that I can get to the result. Is there a way to do this directly by applying a single function (ideally) or combination of functions in one line to achieve the same result?
Thanks!
You can use the groupby function and apply a lambda expression to it in order to get the desired result in one line:
grouped = a.groupby(a.values).apply(lambda x: list(x.index)).to_dict()
Alternatively, you could use the following:
grouped = dict(a.groupby(a.values).apply(lambda x: x.index.get_level_values(0)))
grouped = dict(a.groupby(a.values).apply(lambda x: x.index.tolist()))
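For comparison, the same mapping can be built without groupby at all. This is a minimal sketch in plain Python, not the approach from the answers above. It hard-codes the values shown in the question's output instead of re-running the random generator, and it sorts the keys only to match the order of the expected result:

```python
from collections import defaultdict

import pandas as pd

# Values copied from the series printed in the question.
a = pd.Series([5, 1, 4, 4, 4, 2, 4, 3, 5, 1], index=list('abcdefghij'), name='apple')

# Walk the (label, value) pairs once and collect the labels per value.
by_value = defaultdict(list)
for label, value in a.items():
    by_value[value].append(label)

# Convert to a regular dict with keys in ascending order.
grouped = dict(sorted(by_value.items()))
print(grouped)
```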
For example, given a list of str, ['a', 'b', 'a', 'a', 'b'], I want to get the count of each distinct string: {'a': 3, 'b': 2}.
The naive method is as follows:
lst = ['a', 'b', 'a', 'a', 'b']
counts = dict()
for w in lst:
    counts[w] = counts.get(w, 0) + 1
However, this needs two hash table lookups. When we first call the get method, the bucket location is already known, so in principle we could modify the bucket value in place without searching for the bucket a second time. I know that in Java we can use map.merge to get this optimization: https://stackoverflow.com/a/33711386/10969942
How can I do this in Python?
There is no such method in Python. Whether visible or not, at least under the covers the table lookup will be done twice. But, as the answer you linked to said about Java, nobody much cares: hash table lookup is typically fast, and since you just looked up a key, all the info needed to look it up again is likely sitting in L1 cache.
Here are two more idiomatic ways of spelling your task. The double lookup isn't directly visible in either, but it still occurs under the covers:
>>> lst = ['a', 'b', 'a', 'a', 'b']
>>> from collections import defaultdict
>>> counts = defaultdict(int) # default value is int(); i.e., 0
>>> for w in lst:
...     counts[w] += 1
>>> counts
defaultdict(<class 'int'>, {'a': 3, 'b': 2})
and
>>> from collections import Counter
>>> Counter(lst)
Counter({'a': 3, 'b': 2})
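If a plain dict like the one in the question is wanted, the Counter can be converted at the end. A small sketch using only standard-library calls, with illustrative variable names:

```python
from collections import Counter

lst = ['a', 'b', 'a', 'a', 'b']

# Counter is a dict subclass, so dict() gives back an ordinary dictionary.
counts = dict(Counter(lst))
print(counts)

# most_common(n) returns the n most frequent items as (element, count) pairs.
top = Counter(lst).most_common(1)
print(top)
```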
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
print(randn(5, 4))
df = pd.DataFrame( randn(5, 4), ['A', 'B', 'C', 'D', 'F'], ['W', 'X', 'Y', 'Z'] )
tmp_df = df['X']
print(type(tmp_df)) # Here type is Series (as expected)
tmp_df.loc[:] = 12.3
print(tmp_df)
print(df)
This code changes the content of (original) DataFrame df.
np.random.seed(101)
print(randn(5, 4))
df = pd.DataFrame( randn(5, 4), ['A', 'B', 'C', 'D', 'F'], ['W', 'X', 'Y', 'Z'] )
tmp_df = df.loc[['A', 'B'], ['W', 'X']]
print(type(tmp_df)) # Type is DataFrame
tmp_df.loc[:] = 12.3 # whereas, here when I change the content of the tmp_df it doesn't reflect on original array.
print(tmp_df)
print(df)
So, does that mean that if we slice a Series out of a DataFrame, a reference to the original data is passed to the sliced object, whereas if the slice is itself a DataFrame, it doesn't point to the original DataFrame?
Please confirm whether my conclusion above is correct. Help would be appreciated.
To put it in a simple manner: Indexing with lists in loc always returns a copy.
Let's work with a DataFrame df:
df=pd.DataFrame({'A':[i for i in range(100)]})
df.head(3)
Output:
0 0
1 1
2 2
Now we modify the sliced data:
h=df.loc[[0,1,2],['A']]
h.loc[:] = 12.3
h
Output of h:
0 12.3
1 12.3
2 12.3
The change is not reflected in df, just as in your second example:
df.head(3)
Output:
0 0
1 1
2 2
But when you do tmp_df = df['X'], the Series tmp_df refers to the contents of column "X" in df, so df is expected to change when you modify tmp_df.
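Because whether a selection shares data with the original can depend on the kind of indexing and on the pandas version, it is usually safer to state the intent explicitly. The sketch below assumes a zero-filled frame with the same labels as the question. It takes an explicit .copy() when an independent object is wanted, and assigns through df.loc when the original should change. The name unchanged_after_copy_edit is hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((5, 4)),
                  index=['A', 'B', 'C', 'D', 'F'],
                  columns=['W', 'X', 'Y', 'Z'])

# An explicit copy is independent of the original frame.
tmp_df = df.loc[['A', 'B'], ['W', 'X']].copy()
tmp_df.loc[:] = 12.3
unchanged_after_copy_edit = bool((df == 0.0).all().all())
print(unchanged_after_copy_edit)

# To change the original, assign through .loc on df itself.
df.loc[['A', 'B'], ['W', 'X']] = 12.3
print(df)
```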
I have a list that contains a nested list. I want to obtain the frequency of each element in the inner list and concatenate the result with the element in the outer list.
aa =['a', ['b', 'b', 'b', 'b', 'd', 'd']]
I tried using Counter to get the frequency of each element in the inner list:
from collections import Counter
Counter(aa[1])
It gives:
Counter({'b': 4, 'd': 2})
I want to concatenate this with the outer list element and obtain as follows:
'ab4d2'
I can also iterate through the Counter and collect each key and value in a list:
y = []
for k, v in Counter(aa[1]).items():
    y.append(str(k) + str(v))
Output: ['b4', 'd2']
There are many answers on getting the frequency of occurrence, but I did not find any that does this (the problem is joining with the outer 'a' in an efficient way). Could anyone please help me with this? Thanks in advance.
You can use a generator expression with str.join:
aa[0] + ''.join('%s%d' % t for t in Counter(aa[1]).items())
Given aa = ['a', ['b', 'b', 'b', 'b', 'd', 'd']], this returns:
ab4d2
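The same idea can be written with an f-string and wrapped in a small helper for reuse. The function name label_with_counts is hypothetical. The sketch relies on Counter keeping elements in first-seen order, which holds on Python 3.7 and later:

```python
from collections import Counter

aa = ['a', ['b', 'b', 'b', 'b', 'd', 'd']]

def label_with_counts(item):
    """Join the outer element with 'element + count' for each inner element."""
    head, inner = item
    return head + ''.join(f'{k}{v}' for k, v in Counter(inner).items())

result = label_with_counts(aa)
print(result)  # ab4d2
```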
df = pd.DataFrame([[3,3,3]]*4,index=['a','b','c','d'])
We can extract a copy of a section of an index by specifying row numbers, as below:
i1=df.index[1:3].copy()
Unfortunately, we can't extract a copy of a section of an index by specifying keys (as the df.loc method allows). When I try the following:
i2=df.index['a':'c'].copy()
I get the below error:
TypeError: slice indices must be integers or None or have an __index__ method
Is there any alternative to call a subset of an index based on its keys? Thank you
The simplest is loc with index:
i1 = df.loc['b':'c'].index
print (i1)
Index(['b', 'c'], dtype='object')
Or it is possible to use get_loc for positions:
i1 = df.index
i1 = i1[i1.get_loc('b') : i1.get_loc('d') + 1]
print (i1)
Index(['b', 'c', 'd'], dtype='object')
Alternative:
i1 = i1[i1.searchsorted('b') : i1.searchsorted('d') + 1]
print (i1)
Index(['b', 'c', 'd'], dtype='object')
Try using .loc, which supports label-based slicing:
i2 = df.loc['a':'c'].index
print(i2)
Output:
Index(['a', 'b', 'c'], dtype='object')
or
df.loc['a':'c'].index.tolist()
Output:
['a', 'b', 'c']
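Another option works on the index directly, without going through the frame. This is a sketch under the assumption that the index is sorted and unique, as in the question. Index.slice_indexer turns a pair of labels into a positional slice that includes the end label, and the name label_slice is illustrative:

```python
import pandas as pd

df = pd.DataFrame([[3, 3, 3]] * 4, index=['a', 'b', 'c', 'd'])

# Translate the label range 'a'..'c' (end label included) into a positional slice.
label_slice = df.index.slice_indexer('a', 'c')

# Apply the positional slice to the index and take an independent copy.
i2 = df.index[label_slice].copy()
print(i2.tolist())
```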