Find image in pandas DataFrame - python-3.x

I have a dataframe like this:
     Image    Id
0   1a.jpg   w_1
1   c4.jpg   w_1
2  b01.jpg   w_2
3   d5.jpg   w_1
4   df.jpg   w_f
5   c2.jpg   w_2
6   ab.jpg  w_3e
What is the pandas way to return this output?
output:(1a.jpg,c4.jpg,d5.jpg)(b01.jpg,c2.jpg)(df.jpg)(ab.jpg)

Use groupby and convert values to tuples first and then to list:
L = df.groupby('Id', sort=False)['Image'].apply(tuple).tolist()
print (L)
[('1a.jpg', 'c4.jpg', 'd5.jpg'), ('b01.jpg', 'c2.jpg'), ('df.jpg',), ('ab.jpg',)]
Similarly, to get lists instead of tuples:
L1 = df.groupby('Id', sort=False)['Image'].apply(list).tolist()
print (L1)
[['1a.jpg', 'c4.jpg', 'd5.jpg'], ['b01.jpg', 'c2.jpg'], ['df.jpg'], ['ab.jpg']]
And if you need a single string:
s = ''.join('(' + df.groupby('Id', sort=False)['Image'].apply(', '.join) +')')
print (s)
(1a.jpg, c4.jpg, d5.jpg)(b01.jpg, c2.jpg)(df.jpg)(ab.jpg)
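If you also need to know which Id each group belongs to, the same groupby can be turned into a dictionary. This is only a sketch: the DataFrame is rebuilt from the sample in the question, and the names `groups` and `by_id` are illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Image": ["1a.jpg", "c4.jpg", "b01.jpg", "d5.jpg", "df.jpg", "c2.jpg", "ab.jpg"],
    "Id": ["w_1", "w_1", "w_2", "w_1", "w_f", "w_2", "w_3e"],
})

# sort=False keeps the groups in order of first appearance
grouped = df.groupby("Id", sort=False)["Image"].apply(tuple)

groups = grouped.tolist()   # list of tuples, one per Id
by_id = grouped.to_dict()   # mapping Id -> tuple of images

print(groups)
print(by_id["w_2"])
```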


Count element in list if it is present in each row of a column. Add to a new column (pandas)

I have a pandas df like this:
MEMBERSHIP
[2022_K_, EWREW_NK]
[333_NFK_,2022_K_, EWREW_NK, 000]
And I have a list of keys:
list_k = ["_K_","_NK_","_NKF_","_KF_"]
I want to add a column that counts how many elements in each row contain any of those keys. The desired output is:
MEMBERSHIP | COUNT
[2022_K_, EWREW_NK] | 2
[333_NFK_,2022_K_, EWREW_NK, 000] | 3
Can you help me?
IIUC, you can use the pandas .str accessor methods with a regex:
import pandas as pd
df = pd.DataFrame({'MEMBERSHIP':[['2022_K_', 'EWREW_NK'],
['333_NFK_','2022_K_', 'EWREW_NK', '000']]})
list_k = ["_K_","_NK","_NFK_","_KF_"] #I changed this list a little
reg = '|'.join(list_k)
df['count'] = df['MEMBERSHIP'].explode().str.contains(reg).groupby(level=0).sum()
print(df)
Output:
MEMBERSHIP count
0 [2022_K_, EWREW_NK] 2
1 [333_NFK_, 2022_K_, EWREW_NK, 000] 3
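One caveat with joining the keys into a regex: if a key ever contains a regex metacharacter (such as `.` or `+`), it would be interpreted as a pattern. A hedged variant, assuming the keys are meant as literal substrings, escapes each key with `re.escape` first. The name `pattern` is illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({"MEMBERSHIP": [["2022_K_", "EWREW_NK"],
                                  ["333_NFK_", "2022_K_", "EWREW_NK", "000"]]})
list_k = ["_K_", "_NK", "_NFK_", "_KF_"]  # same adjusted list as in the answer above

# Escape every key so it is matched literally, then OR them together
pattern = "|".join(re.escape(k) for k in list_k)

# One row per list element, test each element, then sum the hits per original row
hits = df["MEMBERSHIP"].explode().str.contains(pattern, regex=True)
df["count"] = hits.groupby(level=0).sum()

print(df)
```

Note that this counts each element at most once, even if it contains several keys.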
You can also use a helper function with apply:
def check(x):
    total = 0
    for i in x:
        if not isinstance(i, str):  # skip values that are not strings
            continue
        for j in list_k:
            if j in i:
                # an element that matches several keys is counted once per key
                total += 1
    return total
df['count'] = df['MEMBERSHIP'].apply(check)
I came up with this naive loop:
count_row = 0
df['Count'] = None
for i in df['MEMBERSHIP_SPLIT']:
    count_element = 0
    for sub in i:
        for e in list_k:
            if e in sub:
                count_element += 1
    # df.at avoids the chained assignment df['Count'][count_row] = ...
    df.at[count_row, 'Count'] = count_element
    count_row += 1

How do you read .txt with tab separated items that are on one line to a list?

I have this .txt:
'4 1 15 12'
It's just one long line whose items are separated by tabs. I need to read it into a list of int items.
I can't seem to make pandas, the csv module or open do the trick.
This kind of works:
f = open('input.txt')
for line in f:
    memory = line.split()
    for item in memory:
        item = int(item)
print(memory)
['4', '1', '15', '12']
But it gives me an error when I compare its max value to an int:
max_val = max(memory)
while max_val > 0:
TypeError: '>' not supported between instances of 'str' and 'int'
It appears as though the text in the question was not tab spaced.
I have created a tab spaced file and the following works:
import pandas as pd
test_file = "C:\\Users\\lefcoe\\Desktop\\test.txt"
df = pd.read_csv(test_file, delimiter='\t', header=None)
print(df)
#%% convert to a list of ints
my_list = df.loc[0, :].values.tolist()
my_list_int = [int(x) for x in my_list]
my_list_int
#%% get the max
m = max(my_list_int)
print(m)
Result:
   0  1   2   3
0  4  1  15  12
15
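It's a TypeError because max_val is a str, and a str cannot be compared with an int using >. The loop "item = int(item)" only rebinds the loop variable; it does not change the items stored in memory, so they are still strings:
max_val = max(memory)
print(type(max_val))
>>> <class 'str'>
Convert the items before taking the maximum. Converting only the result, as in int(max(memory)), is not enough: the strings would be compared alphabetically, giving '4' rather than '15'.
memory = [int(item) for item in line.split()]
max_val = max(memory)
while max_val > 0: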
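For a single line of tab-separated numbers, plain Python is enough. The sketch below writes a small sample file to a temporary directory first (an assumption made only so the example is self-contained; in practice you would open your own input.txt), then reads it back into a list of ints.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.txt")

    # Create a tab-separated sample file like the one in the question
    with open(path, "w") as f:
        f.write("4\t1\t15\t12\n")

    # split() with no argument splits on any whitespace, including tabs
    with open(path) as f:
        memory = [int(item) for item in f.read().split()]

max_val = max(memory)            # numeric maximum
lexical_max = max(["4", "1", "15", "12"])  # string maximum, for comparison

print(memory, max_val, lexical_max)
```

The last line illustrates why the conversion has to happen before calling max: on strings, max compares character by character.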

Issue in passing an array to an index in Series object(TypeError: len() of unsized object)

I have a data as ndarray
a = [0,1,0,1,2,2]
b = ['x1','x2','x3','x4','x5','x6']
Then I tried:
univals = set(a)
serObj=pd.Series()
for ele in univals:
    indexfound=np.where(a == ele)
    Xpointsfromindex=np.take(b, indexfound)
    serobj1=pd.Series(Xpointsfromindex[0],index=ele) ##error happening here
    serObj.append(serobj1)
print(serObj)
I expect output to be like
0 ['x1','x3']
1 ['x2','x4']
2 ['x5','x6']
But it gives me an error: "TypeError: len() of unsized object"
What am I doing wrong?
If both lists have the same length, you can build a DataFrame from them and then collect the values into lists with groupby:
a = [0,1,0,1,2,2]
b = ['x1','x2','x3','x4','x5','x6']
df = pd.DataFrame({'a':a, 'b':b})
print(df)
   a   b
0  0  x1
1  1  x2
2  0  x3
3  1  x4
4  2  x5
5  2  x6
serObj = df.groupby('a')['b'].apply(list)
print (serObj)
a
0    [x1, x3]
1    [x2, x4]
2    [x5, x6]
Name: b, dtype: object
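To stay close to what the OP was doing, here is a version of that code that runs:
import pandas as pd
import numpy as np
a = [0,1,0,1,2,2]
b = ['x1','x2','x3','x4','x5','x6']
univals = set(a)
parts = []
for ele in univals:
    # a is a plain list, so build the boolean mask element by element
    indexfound = np.where([i == ele for i in a])
    Xpointsfromindex = np.take(b, indexfound)
    print(Xpointsfromindex)
    # the index must be list-like, with one label per value
    serobj1 = pd.Series(Xpointsfromindex[0], index=[ele] * len(Xpointsfromindex[0]))
    parts.append(serobj1)
# Series.append returned a new Series instead of modifying serObj (and was removed in pandas 2.0),
# so collect the pieces and concatenate them once
serObj = pd.concat(parts)
print(serObj)
Output of the print inside the loop:
[['x1' 'x3']]
[['x2' 'x4']]
[['x5' 'x6']]
Explanation
With a plain list, a == ele compares the whole list with a scalar and evaluates to the single value False, so np.where(a == ele) finds no indices. Building the mask with a list comprehension fixes that.
The next change is the index parameter of pd.Series: it must be list-like with one label per value, not a scalar.
The result is a Series with one row per value, labelled by its key. This should set you on your way to what you want to achieve.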
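If neither a DataFrame nor NumPy indexing is needed, the desired Series of lists can also be sketched with a plain dictionary. This is an alternative, not what either answer above proposes; `grouped` and `ser` are illustrative names.

```python
from collections import defaultdict

import pandas as pd

a = [0, 1, 0, 1, 2, 2]
b = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6']

# Collect the values of b under the matching key from a
grouped = defaultdict(list)
for key, value in zip(a, b):
    grouped[key].append(value)

# Each dictionary key becomes an index label, each list becomes one cell
ser = pd.Series(dict(grouped))
print(ser)
```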

how to replace a cell in a pandas dataframe

After forming the below python pandas dataframe (for example)
import pandas
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pandas.DataFrame(data,columns=['Name','Age'])
If I iterate through it, I get
In [62]: for i in df.itertuples():
    ...:     print( i.Index, i.Name, i.Age )
    ...:
0 Alex 10
1 Bob 12
2 Clarke 13
What I would like to achieve is to replace the value of a particular cell
In [67]: for i in df.itertuples():
    ...:     if i.Name == "Alex":
    ...:         df.at[i.Index, 'Age'] = 100
    ...:
Which seems to work
In [64]: df
Out[64]:
     Name  Age
0    Alex  100
1     Bob   12
2  Clarke   13
The problem appears when I do the same with a larger, different dataset.
First, I create a new column named NETELEMENT with a default value of "".
Then I would like to replace the default value "" with the string that the function lookup_netelement returns:
df['NETELEMENT'] = ""
for i in df.itertuples():
    df.at[i.Index, 'NETELEMENT'] = lookup_netelement(i.PEER_SRC_IP)
    print( i, lookup_netelement(i.PEER_SRC_IP) )
But what I get as a result is:
Pandas(Index=769, SRC_AS='', DST_AS='', COMMS='', SRC_COMMS=nan, AS_PATH='', SRC_AS_PATH=nan, PREF='', SRC_PREF='0', MED='0', SRC_MED='0', PEER_SRC_AS='0', PEER_DST_AS='', PEER_SRC_IP='x.x.x.x', PEER_DST_IP='', IN_IFACE='', OUT_IFACE='', PROTOCOL='udp', TOS='0', BPS=35200.0, SRC_PREFIX='', DST_PREFIX='', NETELEMENT='', IN_IFNAME='', OUT_IFNAME='') routerX
meaning that it should be:
NETELEMENT='routerX' instead of NETELEMENT=''
Could you please advise what I am doing wrong ?
EDIT: for completeness, lookup_netelement is defined as
def lookup_netelement(ipaddr):
    try:
        x = LOOKUP['conn'].hget('ipaddr;{}'.format(ipaddr), 'dev') or b""
    except Exception as e:
        logger.error('looking up `ipaddr` for netelement caused `{}`'.format(repr(e)), exc_info=True)
        x = b""
    x = x.decode("utf-8")
    return x
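You may be looking for where, which does conditional replacement. It keeps the original value wherever the condition is True and takes the replacement elsewhere, hence the negation:
def wow(x):
    return x ** 10
df['new'] = df['Age'].where(~(df['Name'] == 'Alex'), wow(df['Age']))
Output (the sample data here has an extra row, Alex 15):
     Name  Age           new
0    Alex   10   10000000000
1     Bob   12            12
2  Clarke   13            13
3    Alex   15  576650390625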
Based on your edit, you are trying to apply a function to a column, i.e.:
df['new'] = df['PEER_SRC_IP'].apply(lookup_netelement)
Edit: Regarding your comment about passing two columns, use a lambda with axis=1, i.e.:
def wow(x,y):
    return '{} {}'.format(x,y)
df.apply(lambda x : wow(x['Name'],x['Age']), axis=1)
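One point worth checking in the question's loop: each tuple from itertuples is built before the loop body runs, so printing it shows the row as it was before df.at changed it, even though the DataFrame itself is updated. The sketch below illustrates this with a hypothetical dictionary-based stand-in for lookup_netelement (`FAKE_LOOKUP` and the IP addresses are made up for the example; the real function queries an external store).

```python
import pandas as pd

# Hypothetical stand-in for the real lookup
FAKE_LOOKUP = {"10.0.0.1": "routerX", "10.0.0.2": "routerY"}

def lookup_netelement(ipaddr):
    return FAKE_LOOKUP.get(ipaddr, "")

df = pd.DataFrame({"PEER_SRC_IP": ["10.0.0.1", "10.0.0.2", "10.0.0.3"]})
df["NETELEMENT"] = ""

seen_in_tuple = []
for row in df.itertuples():
    df.at[row.Index, "NETELEMENT"] = lookup_netelement(row.PEER_SRC_IP)
    # row was created before the assignment, so it still holds the old value
    seen_in_tuple.append(row.NETELEMENT)

# The same column computed without an explicit loop
via_apply = df["PEER_SRC_IP"].apply(lookup_netelement)

print(seen_in_tuple)
print(df)
```

Inspecting df after the loop, rather than the tuple inside it, shows the updated values.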

How to split a DataFrame in pandas in predefined percentages?

I have a pandas dataframe sorted by a number of columns. Now I'd like to split the dataframe in predefined percentages, so as to extract and name a few segments.
For example, I want to take the first 20% of rows to create the first segment, then the next 30% for the second segment and leave the remaining 50% to the third segment.
How would I achieve that?
Use numpy.split:
a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.random((20,5)), columns=list('ABCDE'))
#print (df)
a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
print (a)
A B C D E
0 0.543405 0.278369 0.424518 0.844776 0.004719
1 0.121569 0.670749 0.825853 0.136707 0.575093
2 0.891322 0.209202 0.185328 0.108377 0.219697
3 0.978624 0.811683 0.171941 0.816225 0.274074
print (b)
A B C D E
4 0.431704 0.940030 0.817649 0.336112 0.175410
5 0.372832 0.005689 0.252426 0.795663 0.015255
6 0.598843 0.603805 0.105148 0.381943 0.036476
7 0.890412 0.980921 0.059942 0.890546 0.576901
8 0.742480 0.630184 0.581842 0.020439 0.210027
9 0.544685 0.769115 0.250695 0.285896 0.852395
print (c)
A B C D E
10 0.975006 0.884853 0.359508 0.598859 0.354796
11 0.340190 0.178081 0.237694 0.044862 0.505431
12 0.376252 0.592805 0.629942 0.142600 0.933841
13 0.946380 0.602297 0.387766 0.363188 0.204345
14 0.276765 0.246536 0.173608 0.966610 0.957013
15 0.597974 0.731301 0.340385 0.092056 0.463498
16 0.508699 0.088460 0.528035 0.992158 0.395036
17 0.335596 0.805451 0.754349 0.313066 0.634037
18 0.540405 0.296794 0.110788 0.312640 0.456979
19 0.658940 0.254258 0.641101 0.200124 0.657625
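The same cut points can also be applied with positional slicing, which avoids passing a DataFrame to a NumPy function (newer pandas releases may emit a deprecation warning for that). This is a sketch using the same sample data; `first_cut` and `second_cut` are illustrative names.

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 5)), columns=list("ABCDE"))

n = len(df)
first_cut = int(0.2 * n)    # end of the first 20%
second_cut = int(0.5 * n)   # end of the next 30% (20% + 30%)

a = df.iloc[:first_cut]
b = df.iloc[first_cut:second_cut]
c = df.iloc[second_cut:]

print(len(a), len(b), len(c))
```

Because the slices are taken in order, each segment keeps the existing sort order of the DataFrame.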
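For a random split instead (sample picks rows at random, so the segments do not follow the sorted order), create a DataFrame with 70% of the rows of the original:
part_1 = df.sample(frac = 0.7)
Then create a second DataFrame with the remaining 30% of the rows:
part_2 = df.drop(part_1.index)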
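A small self-contained sketch of this sample-and-drop approach, with a fixed random_state added so the split is reproducible (the seed value and the toy column `x` are assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# Randomly pick 70% of the rows; random_state makes the choice repeatable
part_1 = df.sample(frac=0.7, random_state=0)

# Everything that was not picked ends up in the second part
part_2 = df.drop(part_1.index)

print(len(part_1), len(part_2))
```

Dropping by index label assumes the index values are unique; with duplicate labels, drop would remove more rows than intended.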
I've written a simple function that does the job; it might help you.
P.S.:
The sum of the fractions must be 1.
It returns len(fracs) new DataFrames, so the fractions list can be as long as you want (e.g. fracs=[0.1, 0.1, 0.3, 0.2, 0.3]).
np.random.seed(100)
df = pd.DataFrame(np.random.random((99,4)))
def split_by_fractions(df:pd.DataFrame, fracs:list, random_state:int=42):
    # np.isclose tolerates floating-point rounding in the sum
    assert np.isclose(sum(fracs), 1.0), 'fractions sum is not 1.0 (fractions_sum={})'.format(sum(fracs))
    remain = df.index.copy().to_frame()
    res = []
    for i in range(len(fracs)):
        # share of the rows still remaining that belongs to this part
        fractions_sum = sum(fracs[i:])
        frac = fracs[i]/fractions_sum
        idxs = remain.sample(frac=frac, random_state=random_state).index
        remain = remain.drop(idxs)
        res.append(idxs)
    return [df.loc[idxs] for idxs in res]
train,test,val = split_by_fractions(df, [0.8,0.1,0.1]) # e.g: [train, test, validation]
print(train.shape, test.shape, val.shape)
Output:
(79, 4) (10, 4) (10, 4)
