How to properly set an increment in a while loop? - python-3.x

I have a dataframe (DF).
I need to loop over each row and check whether some conditions are met in that row.
If they are, I flag that row (say I add another column labeled "flag" and set it to 1). In the same pass I also check whether other rows meet similar conditions and, if they do, flag them as well. On the next pass I look at the same DF but exclude the flagged rows, so its size goes from N x M to (N-n) x M, where n is the number of rows flagged.
The loop should go on until len(DF) <= 1 (meaning until all rows are flagged as 1). A for loop does not work because DF shrinks as the loop goes on, so I can only use a while loop with an increment. However, how can I set the increment? It would need to be dynamic.
I am really not sure how to tackle this problem.
Here is a failed attempt.
a = len(DF.loc[DF['flag'] != 1])  # number of unflagged rows; equals N initially
i = 0
# at every iteration we recompute the size of the unflagged part of DF in variable a
while a >= 1:
    print(i)
    # select the i-th unflagged row
    row = DF.loc[DF['flag'] != 1].iloc[[i]]
    # flag the row if the conditions are met
    DF['flag'].values[i] = np.where(conditions_met, 1, '')  # pseudocode: 'conditions_met' stands for the real check
    # there is another piece of code that looks for rows with similar
    # conditions but I won't add it here
    # the following line recomputes the number of unflagged rows
    a = len(DF.loc[DF['flag'] != 1])
    i += 1
The problem is that the increment does not work. Say "i" reaches 100 while the number of unflagged rows has shrunk to 70; then the code fails with an out-of-range index. The increment needs to be set differently, but I am not sure how.
Any comments or suggestions are more than welcome.

Please check whether this change works:
a = len(DF.loc[DF['flag'] != 1])  # number of unflagged rows initially
# at every iteration we recompute the size of the unflagged part of DF in variable a
while a >= 1:
    i = 0
    # select the first unflagged row
    row = DF.loc[DF['flag'] != 1].iloc[[i]]
    # flag the row if the conditions are met
    DF['flag'].values[i] = np.where(conditions_met, 1, '')  # pseudocode: 'conditions_met' stands for the real check
    # there is another piece of code that looks for rows with similar
    # conditions but it is not shown here
    # the following block recomputes the number of unflagged rows
    try:
        a = len(DF.loc[DF['flag'] != 1])
    except Exception:
        break

Can you please try this? Hope it works.
def recur(DF):
    row = DF.loc[DF['flag'] != 1].iloc[[0]]
    # flag the row if the conditions are met
    DF['flag'].values[0] = np.where(conditions_met, 1, '')  # pseudocode: 'conditions_met' stands for the real check
    # there is another piece of code that looks for rows with similar
    # conditions but it is not shown here
    # the following variable a is the number of still-unflagged rows
    a = len(DF.loc[DF['flag'] != 1])
    if a >= 1:
        recur(DF.loc[DF['flag'] != 1])
    return None
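A pattern that avoids the increment entirely is to always work on the first still-unflagged row by label, so nothing ever goes out of range as the unflagged part of DF shrinks. Below is a minimal, self-contained sketch; the 'value' column, the "same parity" rule and the stopping point of at most one unflagged row are made-up placeholders for the real conditions:
import pandas as pd

DF = pd.DataFrame({'value': [5, 2, 7, 4, 9]})
DF['flag'] = 0

while (DF['flag'] != 1).sum() > 1:        # stop when at most one row is left unflagged
    unflagged = DF.loc[DF['flag'] != 1]
    label = unflagged.index[0]            # label of the first unflagged row, not a position
    row = DF.loc[label]
    # placeholder condition: flag the row itself ...
    DF.loc[label, 'flag'] = 1
    # ... and, in the same pass, flag "similar" rows (here: same parity, purely illustrative)
    similar = DF.index[(DF['flag'] != 1) & (DF['value'] % 2 == row['value'] % 2)]
    DF.loc[similar, 'flag'] = 1

print(DF)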

Related

Converting a csv file containing pixel values to its equivalent images

This is my first time working with such a dataset.
I have a .csv file containing pixel values (48x48 = 2304 columns) of images, with their labels in the first column and the pixels in the subsequent ones, as below:
A glimpse of the dataset
I want to convert these pixels into their images, and store them into different directories corresponding to their respective labels. Now I have tried the solution posted here but it doesn't seem to work for me.
Here's what I've tried to do:
import csv
import numpy as np
from PIL import Image

labels = ['Fear', 'Happy', 'Sad']
with open('dataset.csv') as csv_file:
    csv_reader = csv.reader(csv_file)
    fear = 0
    happy = 0
    sad = 0
    # skip headers
    next(csv_reader)
    for row in csv_reader:
        pixels = row[1:]  # without label
        pixels = np.array(pixels, dtype='uint8')
        pixels = pixels.reshape((48, 48))
        image = Image.fromarray(pixels)
        if csv_file['emotion'][row] == 'Fear':
            image.save('C:\\Users\\name\\data\\fear\\im' + str(fear) + '.jpg')
            fear += 1
        elif csv_file['emotion'][row] == 'Happy':
            image.save('C:\\Users\\name\\data\\happy\\im' + str(happy) + '.jpg')
            happy += 1
        elif csv_file['emotion'][row] == 'Sad':
            image.save('C:\\Users\\name\\data\\sad\\im' + str(sad) + '.jpg')
            sad += 1
However, upon running the above block of code, the following is the error message I get:
Traceback (most recent call last):
File "<ipython-input-11-aa928099f061>", line 18, in <module>
if csv_file['emotion'][row] == 'Fear':
TypeError: '_io.TextIOWrapper' object is not subscriptable
I looked at several posts that address this error (like this one), but they deal with problems that are quite different from mine, and the others I could not follow.
This may well be a very trivial question, but as I mentioned earlier, this is my first time working with such a dataset. Kindly tell me what I am doing wrong and how I can fix my code.
Try -
if str(row[0]) == 'Fear':
And in a similar way for the other conditions:
elif str(row[0]) == 'Happy':
elif str(row[0]) == 'Sad':
(a good practice is to just save the first value of the array as a variable)
The first problem that arose was that the first row was just the column names. In order to take care of this, I used the skiprows parameter like so:
raw = pd.read_csv('dataset.csv', skiprows = 1)
Secondly, I moved the labels column from the first position to the end, purely for my own convenience.
Thirdly, after these preparations, iterating over the dataframe did not give me whole rows; it only returned the value of the first row and first column, which caused an issue when reshaping. So I used df.itertuples() instead, like so:
for row in data.itertuples(index = False, name = 'Pandas'):
Lastly, thanks to #HadarM's suggestions, I was able to get it to work.
Here is the modified version of the problematic block above:
for row in data.itertuples(index=False, name='Pandas'):
    pixels = row[:-1]  # without label
    pixels = np.array(pixels, dtype='uint8')
    pixels = pixels.reshape((48, 48))
    image = Image.fromarray(pixels)
    if str(row[-1]) == 'Fear':
        image.save('C:\\Users\\name\\data\\fear\\im' + str(fear) + '.jpg')
        fear += 1
    elif str(row[-1]) == 'Happy':
        image.save('C:\\Users\\name\\data\\happy\\im' + str(happy) + '.jpg')
        happy += 1
    elif str(row[-1]) == 'Sad':
        image.save('C:\\Users\\name\\data\\sad\\im' + str(sad) + '.jpg')
        sad += 1
print('done')
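One practical note, not part of the original answer: Image.save() fails if the target folder does not exist, so it can help to create the label directories up front. A small sketch, assuming the same directory layout as in the snippet above:
import os

base = 'C:\\Users\\name\\data'  # same base path as in the snippet above
for label in ('fear', 'happy', 'sad'):
    os.makedirs(os.path.join(base, label), exist_ok=True)  # no error if the folder already exists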

Replace apply logic with something else

I have a small df (173, 21).
I wrote a function that works, but it uses apply(), and I would like to do it another way if possible, purely because of apply()'s reputation for being slow.
On this particular data set it doesn't matter at all, as the data is so small, but I am trying to avoid apply() where I can.
The function takes in a row and checks each of five columns (see code below); if the value in any of those cells is 'YES', it increments a counter. Possible cell values are 'YES', 'NO' or NaN.
Here is the working code:
def clean_deaths(row):
    num_deaths = 0
    columns = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
    for c in columns:
        death = row[c]
        if death == 'YES':
            num_deaths += 1
    return num_deaths

true_avengers['Deaths'] = true_avengers.apply(clean_deaths, axis=1)
total = true_avengers['Deaths'].sum()
print(total, '\n')  # 88
You are right: you should avoid apply(..., axis=1).
Try this:
true_avengers['Deaths'] = (true_avengers[['Death1', 'Death2', 'Death3', 'Death4', 'Death5']] == 'YES').sum(axis=1)
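For what it's worth, here is a quick self-contained check, on a toy frame with made-up values, that the vectorized version reproduces the counting logic of the original apply() function:
import pandas as pd

# toy frame standing in for true_avengers
true_avengers = pd.DataFrame({
    'Death1': ['YES', 'NO', 'YES'],
    'Death2': ['YES', None, 'NO'],
    'Death3': ['NO', 'NO', 'YES'],
    'Death4': [None, 'YES', 'NO'],
    'Death5': ['NO', 'NO', 'YES'],
})

cols = ['Death1', 'Death2', 'Death3', 'Death4', 'Death5']
true_avengers['Deaths'] = (true_avengers[cols] == 'YES').sum(axis=1)
print(true_avengers['Deaths'].tolist())  # [2, 1, 3]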

Nested for loop is time inefficient, looking for a smart alternative

A nested for loop is very time-inefficient. I have some ideas to make this more efficient, but I am wondering whether better alternatives can be shared.
I am trying to build a dataframe in Python by pulling values from several other dataframes. For a small number of variables/columns I can use simple assignments. In the example below I want to compare a cell in the first dataframe with a cell in the second and make an assignment if they are equal. If they are not equal, I need to iterate through the second dataframe until every cell has been checked before making any assignment.
"""iterated through each row of first dataframe and then the second. This is to control for values in compared column
are matched correctly. """
for i in range(len(df10)):
for j in range(len(df6)): # this is not an efficient way to perform this action.
if df10.iloc[i,0] == df6.iloc[j,1]:
df10.iloc[i,23] = df6.iloc[j,6]
df10.iloc[i,24] = df6.iloc[j,1]
df10.sample(n=5)
Here is how you can do it; please see the comments for a description, and leave a comment if something is not clear.
np.random.seed(10)
df10 = pd.DataFrame(np.random.choice(5, (5, 5)))
df6 = pd.DataFrame(np.random.choice(5, (4, 6)))
display(df10)
display(df6)

## compare each pair of values from the 0th column of df10 and the 1st column of df6
## using numpy broadcasting, which returns a boolean matrix with True at
## element (i, j) where the values are equal
cond = df10.iloc[:, 0].values[:, np.newaxis] == df6.iloc[:, 1].values
## get the matching indices of the flattened matrix
indx = np.arange(cond.size)[cond.ravel()]
## convert the flattened indices back to row/column indices (i, j),
## where i corresponds to a row index in df10 and j corresponds to
## a row index in df6
i, j = indx // len(df6), indx % len(df6)
## set the values using fancy indexing
df10.iloc[i, 3] = df6.iloc[j, 4].values
df10
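Another option, not from the answer above but a sketch on the same toy data, is to build a lookup Series from df6 and use map(); note that this keeps only the first match when a key appears more than once in df6's column 1 (the nested loop in the question keeps the last one):
import numpy as np
import pandas as pd

np.random.seed(10)
df10 = pd.DataFrame(np.random.choice(5, (5, 5)))
df6 = pd.DataFrame(np.random.choice(5, (4, 6)))

# lookup: value in df6's column 1 -> value in df6's column 4 (first occurrence wins)
lookup = df6.drop_duplicates(subset=1).set_index(1)[4]

# align df10's column 0 against the lookup; rows with no match are left untouched
matched = df10[0].map(lookup)
df10.loc[matched.notna(), 3] = matched[matched.notna()]
print(df10)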

How do you replace characters in a column of a dataframe using pandas?

In my dataframe, one column has numeric values but also some '?' where the data is not present.
The task is to replace the '?' with the mean of the integers in the column.
The column looks something like this:
30.82
26.67
17.56
?
34.99
?
.
.
.
So far I have tried using a for loop to calculate the mean while skipping the indices where s[i] == '?'.
But once I try to replace the characters with the mean value it gives me an error.
def fillreal(column):
    s = pd.Series(column)
    count = 0
    summ = 0
    for i in range(s.size):
        if s[i] == '?':
            continue
        else:
            summ += pd.to_numeric(s[i])
            count = count + 1
    av = round(summ / count, 2)
    column.replace('?', str(av))
    return column
function call is:
dataR = fillreal(df['col2'])
How should I correct the code so that it works, and which functions could I use to optimise it?
TIA
df.replace('?', np.mean(pd.to_numeric(df['30.82'], errors='coerce')))
30.82 here is the name of the column.
Make sure you pass inplace=True if you want the dataframe itself modified, as shown below. Alternatively, you can assign the first statement to a new variable (e.g. new_df) and you will get a new dataframe with the '?' replaced (the original remains as it is).
df.replace('?', np.mean(pd.to_numeric(df['30.82'], errors='coerce')),inplace=True)
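A slightly different sketch of the same idea, assuming the column is named 'col2' as in the question's function call: convert the column to numeric so every '?' becomes NaN, then fill those gaps with the column mean.
import pandas as pd

# toy column standing in for df['col2']
df = pd.DataFrame({'col2': ['30.82', '26.67', '17.56', '?', '34.99', '?']})

col = pd.to_numeric(df['col2'], errors='coerce')  # '?' -> NaN
df['col2'] = col.fillna(round(col.mean(), 2))     # fill the gaps with the mean
print(df['col2'])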

Stuck in an if statement in a while loop. How do I exit?

I wrote the following code to iteratively build a matrix depending on the conditions met, but I am stuck in the second if statement. For some reason it never moves to the else branch and prints the contents of my print statements over and over, infinitely.
B = np.empty([u, v])
for i in range(u):
    for j in range(v):
        B[i][j] = 0
vi = df['GroupId'][0]
G_df = pd.DataFrame(index=np.arange(0, v), columns=['j', 'vi'])
G_df['j'][0] = 0
G_df['vi'][0] = 0
j = 0
i = 0
old_tweet_id = 1
new_tweet_id = 1
value = df['TFIDF_score'][0]
while i < u:
    old_tweet_id = new_tweet_id
    while old_tweet_id == new_tweet_id:
        if j < v:
            new_js = [G_df.loc[G_df['vi'] == vi, 'j'].item()]
            if new_js != 0:
                print('new_js', new_js)
                new_j = int(''.join(str(x) for x in new_js))
                print('new_j', new_j)
                B[i][new_j] = value
                print('matrix', B)
            else:
                G_df.loc[-1] = [j, vi]
                B[i][j] = value
                vi = vi + 1
                j = j + 1
        if j >= v:
            old_tweet_id = u + 10
        else:
            cc = df['tweet_id'][j:j + 1]
            dd = df['TFIDF_score'][j:j + 1]
            value = dd[j]
            new_tweet_id = cc[j]
    i = i + 1
I tried using break and also tried emptying the new_js and new_j variables just before the else line, but that didn't work either.
I'm sure I'm missing something, but I can't put my finger on it.
EDIT:
I am trying to build a matrix from a dataframe with several columns. One of the dataframe columns contains what I will use as the labels of my matrix columns, and some of these labels repeat, so I used df.groupby to group the overlapping entries and assign an index to them, so that all similar entries share the same index value. These index values are stored in another dataframe column called GroupId. While building the matrix, the matrix values themselves come from the df TFIDF scores, and they are placed according to the column and row they belong to. My problem arises when checking whether a column label has already been encountered: if the current occurrence is an overlap, I need to reuse the first occurrence of the column label instead of creating a new column for it. I created a new dataframe (G_df) that appends every column label encountered and is used to check whether the current column label already exists.
I know this is a lot, but I've tried everything I know. I've been stuck on this problem for a long time.
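One thing that stands out, just an observation rather than a full fix: new_js is a list, and a non-empty list always compares unequal to 0, so the "if new_js != 0" branch runs every time, and nothing inside that branch advances j or changes new_tweet_id, which matches the infinite printing you describe. Below is a minimal, self-contained sketch of a loop shape that cannot spin like that, with made-up lists standing in for df['tweet_id'] and df['TFIDF_score']: every path through the inner loop advances j, so its condition is guaranteed to change eventually.
tweet_ids = [1, 1, 2, 2, 2, 3]
scores = [0.4, 0.1, 0.9, 0.2, 0.5, 0.7]

j = 0
while j < len(tweet_ids):
    old_tweet_id = tweet_ids[j]
    # inner loop: consume rows while the tweet id has not changed;
    # j advances on every iteration, so the inner loop always terminates
    while j < len(tweet_ids) and tweet_ids[j] == old_tweet_id:
        # per-row work would go here (e.g. writing scores[j] into a matrix cell)
        print('tweet', old_tweet_id, 'score', scores[j])
        j += 1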
