I have a CSV file that I read using pandas. I would like to compare some of the columns and then use the outcome of the comparison to make a decision. An example of the data is shown below.
A   B             C           D
6   [5, 3, 4, 1]  -4.2974843  [-5.2324843, -5.2974843, -6.2074043, -6.6974803]
2   [3, 6, 4, 7]  -6.4528433  [-6.2324843, -7.0974845, -7.2034041, -7.6974804]
3   [6, 2, 4, 5]  -3.5322451  [-4.3124440, -4.9073840, -5.2147042, -6.1904800]
1   [4, 3, 6, 2]  -5.9752843  [-5.2324843, -5.2974843, -6.2074043, -6.6974803]
7   [2, 3, 4, 1]  -1.2974652  [-3.1232843, -4.2474643, -5.2074043, -6.1994802]
5   [1, 3, 7, 2]  -9.884843   [-8.0032843, -8.0974843, -9.2074043, -9.6904603]
4   [7, 3, 1, 4]  -2.3984843  [-7.2324843, -8.2094845, -9.2044013, -9.7914001]
Here is the code I am using:
n_A = data['A']
n_B = data['B']
n_C = data['C']
n_D = data['D']
result_compare = []
for w, e in enumerate(n_A):
    for ro, ver in enumerate(n_B):
        for row, m in enumerate(n_C):
            for r, t in enumerate(n_D):
                if ro == w:
                    if r == row:
                        if row == ro:
                            if r == 0:
                                if t[r] > m:
                                    b = ver[r]
                                    result_compare.append(b)
                                else:
                                    b = e
                                    result_compare.append(b)
                            elif r >= 0:
                                q = r - r
                                if t[q] > m:
                                    b = ver[q]
                                    result_compare.append(b)
                                else:
                                    b = e
                                    result_compare.append(b)
I had to select only the columns required for the comparison, which is why I did the following:
n_A = data['A']
n_B = data['B']
n_C = data['C']
n_D = data['D']
The result could be:

result_compare = [6, 3, 3, 4, 7, 1, 4]

The values in D are arranged in descending order, which is why the first element of the list is selected in this case. So when the first element of the list in D is greater than the value in C for that row, we choose the first element of the list in B; otherwise we choose A. I would like an efficient way to do this, since my code takes a long time to produce results, especially on large data.
I would do this in your case:
data['newRow'] = data.apply(lambda row: row['B'][0] if row['D'][0] > row['C'] else row['A'], axis=1)
And if you need it as a list by the end:
list(data['newRow'])
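A quick check against the sample data reproduces the expected result (a minimal sketch; it assumes the list columns already hold Python lists, since after a plain read_csv they arrive as strings and would need ast.literal_eval first):

import pandas as pd

data = pd.DataFrame({
    'A': [6, 2, 3, 1, 7, 5, 4],
    'B': [[5, 3, 4, 1], [3, 6, 4, 7], [6, 2, 4, 5], [4, 3, 6, 2],
          [2, 3, 4, 1], [1, 3, 7, 2], [7, 3, 1, 4]],
    'C': [-4.2974843, -6.4528433, -3.5322451, -5.9752843,
          -1.2974652, -9.884843, -2.3984843],
    'D': [[-5.2324843, -5.2974843, -6.2074043, -6.6974803],
          [-6.2324843, -7.0974845, -7.2034041, -7.6974804],
          [-4.3124440, -4.9073840, -5.2147042, -6.1904800],
          [-5.2324843, -5.2974843, -6.2074043, -6.6974803],
          [-3.1232843, -4.2474643, -5.2074043, -6.1994802],
          [-8.0032843, -8.0974843, -9.2074043, -9.6904603],
          [-7.2324843, -8.2094845, -9.2044013, -9.7914001]]})

# pick B[0] where the first element of D beats C, otherwise A
data['newRow'] = data.apply(
    lambda row: row['B'][0] if row['D'][0] > row['C'] else row['A'], axis=1)
print(list(data['newRow']))  # [6, 3, 3, 4, 7, 1, 4]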
I have a Pandas dataframe like this:
df =

A                 B
[1, 5, 8, 10]     [str1, str_to_be_removed*, str_to_be_removed*, str2]
[4, 7, 10]        [str2, str5, str10]
[5, 2, 7, 9, 15]  [str6, str2, str_to_be_removed*, str_to_be_removed*, str_to_be_removed*]
...               ...
Given str_to_be_removed, I would like to keep only the first instance of any string that contains str_to_be_removed in column B and remove the other ones. In addition, I would also like to remove the corresponding ids from A. The lists contained in A and B have the same length in each row.
How to do this?
Desired Output:
df =

A           B
[1, 5, 10]  [str1, str_to_be_removed*, str2]
[4, 7, 10]  [str2, str5, str10]
[5, 2, 7]   [str6, str2, str_to_be_removed*]
EDIT:
So, this is the sample df:

df = pd.DataFrame({'A':[[1,2],[2,3],[3,4,5],[10,11,12]],
                   'B':[['str_to_be_removed_bla','str_to_be_removed_bla'],
                        ['str_to_be_removed_bla','b'],
                        ['b','c','d'],
                        ['str_to_be_removed_bla_bla','str_to_be_removed_bla','f']]})
Desired Output:
df =

A          B
[1]        [str_to_be_removed_bla]
[2, 3]     [str_to_be_removed_bla, b]
[3, 4, 5]  [b, c, d]
[10, 12]   [str_to_be_removed_bla_bla, f]
Steps:

use zip to combine related elements together
unnest the zipped list with explode
drop duplicates, keep first
groupby index, agg as list
df = pd.DataFrame({'A':[
[1,2],[2,3],[3,4,5],[10, 11, 12]
],
'B':[['str_to_be_removed_bla','str_to_be_removed_bla'],
['str_to_be_removed_bla','b'],
['b', 'c', 'd'],
['str_to_be_removed_bla_bla','str_to_be_removed_bla','f']
]})
# zip and explode step
# df_obj = df.apply(lambda x: list(zip(x.A, x.B)),
# axis = 1
# ).to_frame()
# df_obj = df_obj.explode(0).reset_index()
# df_obj['A'] = df_obj[0].str[0]
# df_obj['B'] = df_obj[0].str[1]
Update: with @Joe Ferndz's solution, the steps are simplified.
# explode the columns
dfTemp = df.apply(lambda x: x.explode())
df_obj = dfTemp.reset_index()
# more conditions can be added to the filter before drop_duplicates
cond = df_obj['B'].astype(str).str.contains("str_to_be_removed")
df1 = df_obj[cond].drop_duplicates(['index'], keep='first')
df2 = df_obj[~cond]
df3 = pd.concat([df1, df2]).sort_index()
result = df3.groupby('index').agg({'A':list, 'B':list})
result
A B
index
0 [1] [str_to_be_removed_bla]
1 [2, 3] [str_to_be_removed_bla, b]
2 [3, 4, 5] [b, c, d]
3 [10, 12] [str_to_be_removed_bla_bla, f]
To extend this solution to multiple patterns in one step (str_to_be_removed1, str_to_be_removed2):
df = pd.DataFrame({'A':[
[1,2],[2,3],[3,4,5, 6],[10, 11, 12]
],
'B':[['str_to_be_removed1_bla','str_to_be_removed1_bla'],
['str_to_be_removed1_bla','b'],
['b', 'b', 'c', 'd'],
['str_to_be_removed2_bla_bla','str_to_be_removed2_bla','f']
]})
print(df)
# A B
# 0 [1, 2] [str_to_be_removed1_bla, str_to_be_removed1_bla]
# 1 [2, 3] [str_to_be_removed1_bla, b]
# 2 [3, 4, 5, 6] [b, b, c, d]
# 3 [10, 11, 12] [str_to_be_removed2_bla_bla, str_to_be_removed2_bla, f]
df_obj = df.apply(lambda x: x.explode()).reset_index()
df_obj['tag'] = df_obj['B'].str.extract(r'(str_to_be_removed1|str_to_be_removed2)')
print(df_obj)
# index A B tag
# 0 0 1 str_to_be_removed1_bla str_to_be_removed1
# 1 0 2 str_to_be_removed1_bla str_to_be_removed1
# 2 1 2 str_to_be_removed1_bla str_to_be_removed1
# 3 1 3 b NaN
# 4 2 3 b NaN
# 5 2 4 b NaN
# 6 2 5 c NaN
# 7 2 6 d NaN
# 8 3 10 str_to_be_removed2_bla_bla str_to_be_removed2
# 9 3 11 str_to_be_removed2_bla str_to_be_removed2
# 10 3 12 f NaN
cond = df_obj['tag'].notnull()
df1 = df_obj[cond].drop_duplicates(['index', 'tag'], keep='first')
df2 = df_obj[~cond]
df3 = pd.concat([df1, df2]).sort_index()
print(df3.groupby('index').agg({'A':list, 'B':list}))
# A B
# index
# 0 [1] [str_to_be_removed1_bla]
# 1 [2, 3] [str_to_be_removed1_bla, b]
# 2 [3, 4, 5, 6] [b, b, c, d]
# 3 [10, 12] [str_to_be_removed2_bla_bla, f]
Here's another approach to do this. Here you can give the search string and it will remove the duplicates from the list.
import pandas as pd
pd.set_option('display.max_colwidth', 500)
df = pd.DataFrame({'A':[[1,2],[2,3],[3,4,5],[10,11,12]],
'B':[['str_to_be_removed_bla','str_to_be_removed_bla'],
['str_to_be_removed_bla','b'],
['b','c','d'],
['str_to_be_removed_bla_bla','str_to_be_removed_bla','f']]})
print (df)
search_str = 'str_to_be_removed'
#explode each column into its own rows
dfTemp = df.apply(lambda x: x.explode())
#flag columns that contains search string
dfTemp['Found'] = dfTemp.B.str.contains(search_str)
#cumcount by Index and by search strings to get duplicates
dfTemp['Bx'] = dfTemp.groupby([dfTemp.index,'Found']).cumcount()
#Exclude all records where search string was found and it is not the first occurrence
dfTemp = dfTemp[~(dfTemp['Found'] & (dfTemp['Bx'] > 0))]
#Extract back all the records into the dataframe and store as list
df_final = dfTemp.groupby(dfTemp.index).agg({'A':list, 'B':list})
del dfTemp
print (df_final)
The output of this is:
Original dataframe:
A B
0 [1, 2] [str_to_be_removed_bla, str_to_be_removed_bla]
1 [2, 3] [str_to_be_removed_bla, b]
2 [3, 4, 5] [b, c, d]
3 [10, 11, 12] [str_to_be_removed_bla_bla, str_to_be_removed_bla, f]
search string: 'str_to_be_removed'
Final dataframe:
A B
0 [1] [str_to_be_removed_bla]
1 [2, 3] [str_to_be_removed_bla, b]
2 [3, 4, 5] [b, c, d]
3 [10, 12] [str_to_be_removed_bla_bla, f]
Let us try dict:
out = pd.DataFrame([[list(dict(zip(y[::-1], x[::-1])).values())[::-1],
                     list(dict(zip(y[::-1], x[::-1])).keys())[::-1]]
                    for x, y in zip(df.A, df.B)])
out
0 1
0 [1] [a]
1 [2, 2] [a, b]
Sample dataframe
df = pd.DataFrame({'A':[[1,2],[2,2]],'B':[['a','a'],['a','b']]})
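To unpack the one-liner (a small sketch on the first sample row): building a dict from the reversed lists makes the last assignment win, which corresponds to the first occurrence in the original order, and reversing again restores the order. Note this de-duplicates on exact string equality rather than on a pattern:

import pandas as pd

df = pd.DataFrame({'A':[[1,2],[2,2]], 'B':[['a','a'],['a','b']]})

x, y = df.A[0], df.B[0]              # [1, 2], ['a', 'a']
pairs = dict(zip(y[::-1], x[::-1]))  # {'a': 1}, first occurrence wins
keep_B = list(pairs.keys())[::-1]    # ['a']
keep_A = list(pairs.values())[::-1]  # [1]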
I don't think a native pandas solution is possible for this case. And even if there is one, you are not likely to see a performance gain with it. A traditional for loop might work better for your case:
def remove_dupes(a, b, dupe_pattern):
    seen_dupe = False
    _a, _b = [], []
    for x, y in zip(a, b):
        if dupe_pattern in y:
            if not seen_dupe:
                seen_dupe = True
                _a.append(x)
                _b.append(y)
        else:
            _a.append(x)
            _b.append(y)
    return _a, _b

_A, _B = [], []
for a, b in zip(df.A, df.B):
    _a, _b = remove_dupes(a, b, 'str_to_be_removed')
    _A.append(_a)
    _B.append(_b)
df['A'], df['B'] = _A, _B
print(df)
# A B
#0 [1] [str_to_be_removed_bla]
#1 [2, 3] [str_to_be_removed_bla, b]
#2 [3, 4, 5] [b, c, d]
#3 [10, 12] [str_to_be_removed_bla_bla, f]
I have:
dataA=[[1,2,3],[1,2,5]]
dataB=[1,2]
I want to multiply index [0] of dataA with index [0] of dataB, and index [1] of dataA with index [1] of dataB. How do I do it?

I tried the following, but the result didn't match my expectation:
dataA = [[1,2,3],[1,2,5]]
dataB = [1,2]
tmp = []
for a in dataA:
    tampung = []
    for b in a:
        cou = 0
        hasil = b * dataB[cou]
        tampung.append(hasil)
        cou += 1
    tmp.append(tampung)
print(tmp)
output: [[1, 2, 3], [1, 2, 5]]
expected output: [[1, 2, 3], [2, 4, 10]]
Please help
List comprehensions are something wonderful in Python.
result = [[x*y for y in l] for x, l in zip(dataB, dataA)]
This does the same as:
result = []
for x, l in zip(dataB, dataA):
    temp = []
    for y in l:
        temp.append(x * y)
    result.append(temp)
result
## [[1, 2, 3], [2, 4, 10]]
If you are working with numbers, consider using numpy, as it will make your operations much easier.
import numpy as np

dataA = [[1,2,3],[1,2,5]]
dataB = [1,2]
# map lists to arrays
dataA = np.asarray(dataA)
dataB = np.asarray(dataB)
# dataA = array([[1, 2, 3], [1, 2, 5]])
# 2 x 3 array
# dataB = array([1, 2])
# length-2 array
dataC_1 = dataA[0] * dataB[0] #multiply first row of dataA w/ first row of dataB
dataC_2 = dataA[1] * dataB[1] #multiply second row of dataA w/ second row of dataB
# dataC_1 = array([1, 2, 3])
# dataC_2 = array([2, 4, 10])
These arrays can always be cast back into lists by passing them to list() or calling .tolist().
As other contributors have said, please look into the numpy library!
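As a further sketch, broadcasting handles all rows in one step, assuming every inner list of dataA has the same length:

import numpy as np

dataA = np.asarray([[1, 2, 3], [1, 2, 5]])
dataB = np.asarray([1, 2])

# dataB[:, None] has shape (2, 1), so broadcasting scales each row
# of dataA by the matching element of dataB
result = dataA * dataB[:, None]
print(result.tolist())  # [[1, 2, 3], [2, 4, 10]]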
I am completely new to the topic of programming but interested.
I am coding in python 3.x and have a question to my latest topic:
We have a list containing a few tens of thousands of randomly generated integers between 1 and 7.
import random

list_of_states = []
n = int(input('Enter number of elements:'))
for i in range(n):
    list_of_states.append(random.randint(1, 7))
print(list_of_states)
Afterwards, I would like to count the runs of contiguous equal numbers in this list and put them into a numpy.array.
example: [1, 2, 3, 4, 4, 4, 7, 3, 1, 1, 1]
1 1
2 1
3 1
4 3
7 1
3 1
1 3
I would like to know whether someone has a hint/an idea of how I could do this.
This is a smaller part of a Markov chain, for which I need the frequency of each number.
Thanks for sharing
Nadim
Below is a crude way of doing this. I am creating a list of lists and then converting it to a numpy array. Please use this only as guidance and improvise on it.
import numpy as np

num_list = [1,1,1,1,2,2,2,3,4,5,6,6,6,6,7,7,7,7,1,1,1,1,3,3,3]
temp_dict = {}
two_dim_list = []
for x in num_list:
    if x in temp_dict:
        temp_dict[x] += 1
    else:
        if temp_dict:
            for k, v in temp_dict.items():
                two_dim_list.append([k, v])
            temp_dict = {}
        temp_dict[x] = 1
for k, v in temp_dict.items():
    two_dim_list.append([k, v])
print("List of List = %s" % (two_dim_list))
two_dim_arr = np.array(two_dim_list)
print("2D Array = %s" % (two_dim_arr))
Output:
List of List = [[1, 4], [2, 3], [3, 1], [4, 1], [5, 1], [6, 4], [7, 4], [1, 4], [3, 3]]
2D Array = [[1 4]
[2 3]
[3 1]
[4 1]
[5 1]
[6 4]
[7 4]
[1 4]
[3 3]]
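For comparison, the standard library's itertools.groupby expresses this run-length counting directly; a minimal sketch on the same list:

import numpy as np
from itertools import groupby

num_list = [1,1,1,1,2,2,2,3,4,5,6,6,6,6,7,7,7,7,1,1,1,1,3,3,3]

# each (k, g) pair from groupby is one maximal run of equal values
two_dim_arr = np.array([[k, sum(1 for _ in g)] for k, g in groupby(num_list)])
print(two_dim_arr)  # same 2D array as above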