Compare two values in the same column python - python-3.x

I'm trying to compare two values in the same column in a pandas.DataFrame.
If the two values are different I want to create a new value.
My code looks like this:
def f(x, var1, var2):
    if (x[var1].shift(1) != x[var1]):
        x[var2] = 1
    else:
        x[var2] = 0
    return x

sdf['2008':'2009'].apply(lambda x: f(x, 'ROW1', 'ROW2'), axis=1)
Unfortunately, this doesn't work. I get the following error message:
'numpy.float64' object has no attribute 'shift'", 'occurred at index 2008-01-01 00:00:00'
Thanks for your help.

I think you need to shift the whole column once, outside of apply, so you compare Series rather than scalars (inside apply with axis=1, x[var1] is a single float, which is why shift fails):

import numpy as np

df0 = df.shift()
df['Row2'] = np.where(df0['Row1'] != df['Row1'], 1, 0)
EDIT:
As @jpp suggested in comments:

df['Row2'] = (df0['Row1'] != df['Row1']).astype(int)
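A minimal sketch of how this behaves, with made-up data (note that the very first row is always flagged, because shifting puts NaN there and NaN != 1 is True):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Row1': [1, 1, 2, 2, 3]})
df0 = df.shift()                                  # previous row's values
df['Row2'] = np.where(df0['Row1'] != df['Row1'], 1, 0)
print(df['Row2'].tolist())                        # [1, 0, 1, 0, 1]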

Related

Issue with pd.DataFrame.apply with arguments

I want to create augmented data in a new dataframe for every row of an original dataframe. So, I've defined an augment method, which I want to use in apply, as follows:
def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int):
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")
    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)
        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)
        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row
    # print(target_df.shape)
    # display.display(target_df)
When I call this as follows, everything works:
tmp_df = pd.DataFrame(None, columns=testDF.columns)
augment(testDF.iloc[0], column_name='binMap', target_df=tmp_df, num_samples=4)
augment(testDF.iloc[1], column_name='binMap', target_df=tmp_df, num_samples=4)
However, when I try it using the apply method, the prints and the display work fine, but the resulting dataframe shows errors:
tmp_df = pd.DataFrame(None, columns=testDF.columns)
testDF.apply(augment, args=('binMap', tmp_df, 4, ), axis=1)
This is how the output data looks after the apply call:
,data
<Error>, <Error>
<Error>, <Error>
What am I doing wrong?
Your test is very nice, thank you for the clear exposition. I am happy to be your rubber duck.

In test A, you (successfully) mess with testDF.iloc[0] and [1], using kind of a Fortran-style API for augment(), leaving a side-effect result in tmp_df. Test B is carefully constructed to be "the same" except for the .apply() call. So let's see, what's different? Hard to say. Let's go examine the docs.

Oh, right! We're using the .apply() API, so we'd better follow it. Down at the end it explains:

Returns: Series or DataFrame
Result of applying func along the given axis of the DataFrame.

But you're offering return None instead. Now, I'm not here to pass judgement on whether it's best to have side effects on a target df -- that's up to you. But .apply() will be bent out of shape until you give it something nice to store as its own result. Happy hunting!
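A minimal sketch of the difference, with toy data rather than the poster's augment code:

import pandas as pd

df = pd.DataFrame({'a': [1, 2]})

def side_effect_only(row):
    pass            # implicitly returns None, so .apply() has nothing useful to store

def give_back(row):
    return row      # .apply() collects this into its result

print(df.apply(side_effect_only, axis=1))   # a Series of None values
print(df.apply(give_back, axis=1))          # the original frame back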
Tiny little style nit. You wrote

args=('binMap', tmp_df, 4, )

to offer a 3-tuple. Better to write

args=('binMap', tmp_df, 4)

As written it tends to suggest 1-tuple notation. When is a trailing comma helpful?
- in a 1-tuple it is essential: x = (7,)
- in multiline dict / list expressions it minimizes git diffs, when inevitably another entry ('cherry'?) will later be added:

fruits = [
    'apple',
    'banana',
]
This change worked for me:

def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int) -> pd.Series:
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")
    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)
        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)
        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row
    # print(target_df.shape)
    # display.display(target_df)
    return row
And I updated the call to apply as follows:

testDF = testDF.apply(augment, args=('binMap', tmp_df, 4), result_type='broadcast', axis=1)
Thank you @J_H. If there is a better way to achieve what I'm doing, please feel free to suggest improvements.

How to create a dataframe from extracted hashtags?

I have used the below code to extract hashtags from tweets.
def find_tags(row_string):
    tags = [x for x in row_string if x.startswith('#')]
    return tags
df['split'] = df['text'].str.split(' ')
df['hashtags'] = df['split'].apply(lambda row: find_tags(row))
df['hashtags'] = df['hashtags'].apply(lambda x: str(x).replace('\\n', ',').replace('\\', '').replace("'", ""))
df.drop('split', axis=1, inplace=True)
df
However, when I count them using the below code, the output counts each individual character.
from collections import Counter
d = Counter(df.hashtags.sum())
data = pd.DataFrame([d]).T
data
I think the problem lies with the code that I am using to extract hashtags, but I don't know how to solve this issue.
Change find_tags so the replace happens inside the list comprehension together with split, and for counting the values use Series.explode with Series.value_counts:
def find_tags(row_string):
    return [x.replace('\\n', ',').replace('\\', '').replace("'", "")
            for x in row_string.split() if x.startswith('#')]
df['hashtags'] = df['text'].apply(find_tags)
and then:
data = df.hashtags.explode().value_counts().rename_axis('val').reset_index(name='count')
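A quick sketch of the whole pipeline on two made-up tweets (the function is the one from this answer):

import pandas as pd

df = pd.DataFrame({'text': ['hello #one #two', 'bye #one']})

def find_tags(row_string):
    return [x.replace('\\n', ',').replace('\\', '').replace("'", "")
            for x in row_string.split() if x.startswith('#')]

df['hashtags'] = df['text'].apply(find_tags)
data = df.hashtags.explode().value_counts().rename_axis('val').reset_index(name='count')
print(data)   # '#one' counted twice, '#two' once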

Using a loop to count missing values in the columns of a dataset and creating a dictionary with the results

I am creating a function with a loop because I want to count the missing values in a dataset and add the results to a dictionary. I am using Python in JupyterLab. This is the code:
def find_missing_values(dataframe, columns):
    missing_values = {}
    df_rows = len(dataframe)
    for column in columns:
        tot_column_values = dataframe[column].value_counts().sum()
        missing_values[column] = df_rows - tot_column_values
        return missing_values

missing_values = find_missing_values(amazon_data, columns=amazon_data.columns)
missing_values
This is the result when I run it:
{'uniq_id': 0}
I would like it to do the same with all the columns in the dataset (product_name, manufacturer etc.), not only the first one. These are all the columns:
amazon_data.columns
Index(['uniq_id', 'product_name', 'manufacturer', 'price',
'number_available_in_stock', 'number_of_reviews',
'number_of_answered_questions', 'average_review_rating',
'amazon_category_and_sub_category',
'customers_who_bought_this_item_also_bought', 'description',
'product_information', 'product_description',
'items_customers_buy_after_viewing_this_item',
'customer_questions_and_answers', 'customer_reviews', 'sellers'],
dtype='object')
I do not understand why they are not included in the result/dictionary. Can somebody help me and explain where I am making a mistake, please?
Your for loop returns the result after the first iteration, because the return statement sits inside the loop. You need to move the return out of the loop, like this:
def find_missing_values(dataframe, columns):
    missing_values = {}
    df_rows = len(dataframe)
    for column in columns:
        tot_column_values = dataframe[column].value_counts().sum()
        missing_values[column] = df_rows - tot_column_values
    return missing_values
Alternatively, since you use pandas, you can get the same dictionary with dataframe.isna().sum().to_dict().
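For example, on a small made-up frame (the column names here are just for illustration):

import numpy as np
import pandas as pd

amazon_data = pd.DataFrame({'uniq_id': [1, 2, 3],
                            'price': [9.99, np.nan, 4.50],
                            'manufacturer': ['LEGO', None, None]})

print(amazon_data.isna().sum().to_dict())
# {'uniq_id': 0, 'price': 1, 'manufacturer': 2}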

Look up a number inside a list within a pandas cell, and return corresponding string value from a second DF

(I've edited the first column name in the labels_df for clarity)
I have two DataFrames, train_df and labels_df. train_df has integers that map to attribute names in the labels_df. I would like to look up each number within a given train_df cell and return in the adjacent cell, the corresponding attribute name from the labels_df.
So for example, the first observation in train_df has attribute_ids of 147, 616 and 813, which map (in the labels_df) to culture::french, tag::dogs, tag::men. And I would like to place those strings inside one cell on the same row as the corresponding integers.
I've tried variations of the function below but fear I am wayyy off:
def my_mapping(df1, df2):
    tags = df1['attribute_ids']
    for i in tags.iteritems():
        df1['new_col'] = df2.iloc[i]
    return df1
The data are originally from two csv files: train.csv and labels.csv.
I tried this from @Danny:

sample_train_df['attribute_ids'].apply(
    lambda x: [sample_labels_df[sample_labels_df['attribute_name'] == i]['attribute_id_num']
               for i in x])

*Please note - I am running the above code on samples of each DF due to run times on the original DFs.
This returned: (output screenshot omitted)
I hope this is what you are looking for. I am sure there's a much more efficient way using lookup.
df['new_col'] = df['attribute_ids'].apply(lambda x: [labels_df[labels_df['attribute_id'] == i]['attribute_name'] for i in x])
This is super ugly and one day, hopefully sooner rather than later, I'll be able to accomplish this task in an elegant fashion; until then, this is what got me the result I need.

split train_df['attribute_ids'] into their own cell/column:

helper_df = train_df['attribute_ids'].str.split(expand=True)

combine train_df with the helper_df so I have the id column (they are photo id's):

train_df2 = pd.concat([train_df, helper_df], axis=1)

drop the original attribute_ids column:

train_df2.drop(columns='attribute_ids', inplace=True)

rename the new columns (note the assignment, since rename is not in-place by default):

train_df2 = train_df2.rename(columns={0: 'attr1', 1: 'attr2', 2: 'attr3', 3: 'attr4', 4: 'attr5', 5: 'attr6',
                                      6: 'attr7', 7: 'attr8', 8: 'attr9', 9: 'attr10', 10: 'attr11'})

convert the labels_df into a dictionary:

def create_file_mapping(df):
    mapping = dict()
    for i in range(len(df)):
        name, tags = df['attribute_id_num'][i], df['attribute_name'][i]
        mapping[str(name)] = tags
    return mapping

my_map = create_file_mapping(labels_df)   # build the dictionary from labels_df

map and replace the tag numbers with their corresponding tag names:

train_df3 = train_df2.applymap(lambda s: my_map.get(s) if s in my_map else s)

create a new column of the observation's tags as a list of concatenated values:

helper1['new_col'] = helper1[helper1.columns[0:10]].apply(lambda x: ','.join(x.astype(str)), axis=1)
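For what it's worth, a more compact route is to build the id-to-name dictionary once and do a plain dict lookup inside a single apply. This is only a sketch: the miniature frames below are hypothetical stand-ins for the two CSVs, and it assumes attribute_ids holds space-separated id strings and that labels_df carries the attribute_id_num / attribute_name columns used above.

import pandas as pd

# hypothetical miniature versions of train.csv and labels.csv
labels_df = pd.DataFrame({'attribute_id_num': [147, 616, 813],
                          'attribute_name': ['culture::french', 'tag::dogs', 'tag::men']})
train_df = pd.DataFrame({'id': ['img1', 'img2'],
                         'attribute_ids': ['147 616 813', '616']})

# one dict lookup per id instead of one DataFrame scan per id
mapping = dict(zip(labels_df['attribute_id_num'].astype(str), labels_df['attribute_name']))
train_df['new_col'] = train_df['attribute_ids'].apply(
    lambda s: ','.join(mapping[i] for i in s.split()))
print(train_df['new_col'].tolist())
# ['culture::french,tag::dogs,tag::men', 'tag::dogs']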

How do you replace characters in a column of a dataframe using pandas?

From a dataframe, one column has int64 values and also some '?' where the data is not present.
The task is to replace the '?' with the mean of the integers in the column.
The column looks something like this:
30.82
26.67
17.56
?
34.99
?
.
.
.
Till now I tried using a for loop to calculate the mean while skipping the indices where s[i] == '?'. But once I try to replace the characters with the mean value, it gives me an error.
def fillreal(column):
    s = pd.Series(column)
    count = 0
    summ = 0
    for i in range(s.size):
        if s[i] == '?':
            continue
        else:
            summ += pd.to_numeric(s[i])
            count = count + 1
    av = round(summ/count, 2)
    column.replace('?', str(av))
    return column
The function call is:

dataR = fillreal(df['col2'])

How should I correct the code so that it works fine, and which functions can be used to optimise the code?
TIA
df.replace('?', np.mean(pd.to_numeric(df['30.82'], errors='coerce')))
30.82 here is the name of the column.
Make sure you pass inplace=True if you want the dataframe itself modified, as shown below. Alternatively, you can assign the statement above to a new variable (e.g. new_df) and you will get a new df with the '?' replaced (the original remains as it is).

df.replace('?', np.mean(pd.to_numeric(df['30.82'], errors='coerce')), inplace=True)
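A loop-free sketch of the same idea: coerce the column to numeric so the '?' entries become NaN, then fill them with the mean. The column name col2 is taken from the question's function call, and the sample values are from the posted column.

import pandas as pd

df = pd.DataFrame({'col2': ['30.82', '26.67', '17.56', '?', '34.99', '?']})

s = pd.to_numeric(df['col2'], errors='coerce')   # '?' becomes NaN
df['col2'] = s.fillna(round(s.mean(), 2))        # fill gaps with the column mean
print(df)                                        # the '?' rows now hold 27.51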
