I'm trying to learn how to create interactions with a DataFrame. I have already read the ipywidgets docs, trying to get as much as I can from them, plus some how-to blogs, but when it comes to the real world somehow I can't reproduce the same results. I know this is common, though.
Anyway, so far I've tried doing this:
week = [value for value in canali_weekly.index.get_level_values(level=1).unique()]
values = [value for value in canali_weekly.columns]

@widgets.interact(
    x=widgets.IntSlider(min=0, max=canali_weekly["Self"].max(), step=10, description='Min. Self sold:'),
    s=widgets.IntRangeSlider(
        value=[max(week) - 1, max(week)],
        min=min(week),
        max=max(week),
        step=1,
        description='Compare weeks:'
    ),
    m=widgets.SelectMultiple(
        options=values,
        # rows=10,
        description='Metrics:',
        disabled=False
    )
)
def filtra(x, s, m):
    return canali_weekly.loc[(canali_weekly["Self"] > x, s), m]
This is pretty good, because it achieves the main goal, which is allowing interaction. However, this code doesn't let me change the layout (for example, I would prefer to display the widgets horizontally).
So I started over with this new code:
week = [value for value in canali_weekly.index.get_level_values(level=1).unique()]

s = widgets.IntRangeSlider(
    value=[max(week) - 1, max(week)],
    min=min(week),
    max=max(week),
    step=1,
    description='Compare weeks:'
)

def filtra(s):
    display(canali_weekly.loc[(slice(None), s), ["Spending"]])

s.observe(filtra, 'value')
display(s)
The problem is that in this case the dataframe is not shown.
ADDENDUM:
Here is a reproducible example of the dataframe:
import numpy as np
import pandas as pd

channel = ["A", "B", "C", "D"]
week = [1, 2, 3, 4, 5, 6, 7, 8, 9]
columns = ["col1", "col2", "col3"]
index = pd.MultiIndex.from_product([channel, week], names=('channel', 'week'))
df = pd.DataFrame(np.random.randn(36, 3), columns=columns, index=index)
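With this frame, the slice(None) selections used in the snippets above and below look like this (a couple of illustrative lines, not from the original post):

# all channels, week 3 only
df.loc[(slice(None), 3), ["col1"]]

# all channels, weeks 2 through 4: the shape an IntRangeSlider value produces
df.loc[(slice(None), slice(2, 4)), ["col1"]]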
UPDATE:
I tried this new code, very similar to the second one I pasted above:
def crea_opzioni(array):
    unique = array.unique().tolist()
    unique.sort()
    unique.insert(0, "ALL")
    return unique

dropdown_week = widgets.Dropdown(options=crea_opzioni(canali_weekly.index.get_level_values(level=1)))

def filtra_dati(change):
    if change.new == "ALL":
        display(canali_weekly)
    else:
        display(canali_weekly.loc[(slice(None), change.new), ["Spending"]])

dropdown_week.observe(filtra_dati, names='value')
display(dropdown_week)
Even though the dataframe is not rendered in the notebook, I can see it in the log tab. (Output produced inside a widget callback isn't tied to any cell, so it ends up in the log; capturing it with an Output widget, as in the solution below, fixes this.)
Here is my solution:
output = widgets.Output()

def crea_opzioni(array):
    unique = array.unique().tolist()
    unique.sort()
    unique.insert(0, "ALL")
    return unique

week = [value for value in canali_weekly.index.get_level_values(level=1).unique()]
intRange_week = widgets.IntRangeSlider(value=[max(week) - 1, max(week)], min=min(week), max=max(week), step=1, description='Compare weeks:')
# dropdown_week = widgets.Dropdown(options=crea_opzioni(canali_weekly.index.get_level_values(level=1)))
# dropdown_canali = widgets.Dropdown(options=crea_opzioni(canali_weekly.index.get_level_values(level=0)))

def filtra_dati(selezione):
    output.clear_output()
    df = canali_weekly.loc[(slice(None), slice(selezione[0], selezione[1])), ["Spending"]]
    with output:
        display(df)

def dropdown_week_eventhandler(change):
    filtra_dati(change.new)

intRange_week.observe(dropdown_week_eventhandler, names='value')
display(intRange_week, output)  # display the output area too, so the dataframe has somewhere to render
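To get the horizontal layout the question started with, one option is a sketch like the following, assuming the same canali_weekly frame and the widgets defined above (dropdown_canali here is hypothetical, taken from the commented-out line):

import ipywidgets as widgets
from IPython.display import display

# hypothetical second control, from the commented-out line above
dropdown_canali = widgets.Dropdown(options=crea_opzioni(canali_weekly.index.get_level_values(level=0)))

controls = widgets.HBox([intRange_week, dropdown_canali])  # widgets side by side
display(widgets.VBox([controls, output]))                  # output area below the controls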
I want to create augmented data in a new dataframe for every row of an original dataframe.
So, I've defined an augment method which I want to use in apply as follows:
import numpy as np
import pandas as pd
import tensorflow as tf

def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int):
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)  # helper defined elsewhere
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")

    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)
        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)  # augmentation pipeline defined elsewhere
            new_row[column_name] = np.squeeze(aug_image, axis=2)
        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row
    # print(target_df.shape)
    # display.display(target_df)
When I call this as follows, everything works:
tmp_df = pd.DataFrame(None, columns=testDF.columns)
augment(testDF.iloc[0], column_name='binMap', target_df=tmp_df, num_samples=4)
augment(testDF.iloc[1], column_name='binMap', target_df=tmp_df, num_samples=4)
However, when I try it using the apply method, the prints and the display work fine, but the resulting dataframe shows errors:
tmp_df = pd.DataFrame(None, columns=testDF.columns)
testDF.apply(augment, args=('binMap', tmp_df, 4, ), axis=1)
This is how the output data looks after the apply call:
,data
<Error>, <Error>
<Error>, <Error>
What am I doing wrong?
Your test is very nice, thank you for the clear exposition. I am happy to be your rubber duck.

In test A, you (successfully) mess with testDF.iloc[0] and [1], using kind of a Fortran-style API for augment(), leaving a side-effect result in tmp_df. Test B is carefully constructed to be "the same" except for the .apply() call. So let's see, what's different? Hard to say. Let's go examine the docs.

Oh, right! We're using the .apply() API, so we'd better follow it. Down at the end it explains:

    Returns: Series or DataFrame
    Result of applying func along the given axis of the DataFrame.

But you're offering return None instead. Now, I'm not here to pass judgement on whether it's best to have side effects on a target df -- that's up to you. But .apply() will be bent out of shape until you give it something nice to store as its own result. Happy hunting!
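A minimal illustration of that point, on a hypothetical frame: .apply() keeps whatever the function returns as its result.

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# each row's return value becomes one entry of the result;
# a function that returns None would produce a column of Nones
result = df.apply(lambda row: row['a'] * 10, axis=1)
print(result.tolist())  # [10, 20, 30]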
Tiny little style nit.
You wrote
args=('binMap', tmp_df, 4, )
to offer a 3-tuple. Better to write
args=('binMap', tmp_df, 4)
As written, it tends to suggest 1-tuple notation.

When is a trailing comma helpful?
- in a 1-tuple it is essential: x = (7,)
- in multiline dict / list expressions it minimizes git diffs, when inevitably another entry ('cherry'?) will later be added:
fruits = [
    'apple',
    'banana',
]
This change worked for me:
def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int) -> pd.Series:
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")

    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)
        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)
        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row
    # print(target_df.shape)
    # display.display(target_df)
    return row
And updated the call to apply as follows:
testDF = testDF.apply(augment, args=('binMap', tmp_df, 4), result_type='broadcast', axis=1)
Thank you @J_H.
If there is a better way to achieve what I'm doing, please feel free to suggest improvements.
I have used the code below to extract hashtags from tweets.
def find_tags(row_string):
    tags = [x for x in row_string if x.startswith('#')]
    return tags

df['split'] = df['text'].str.split(' ')
df['hashtags'] = df['split'].apply(lambda row: find_tags(row))
df['hashtags'] = df['hashtags'].apply(lambda x: str(x).replace('\\n', ',').replace('\\', '').replace("'", ""))
df.drop('split', axis=1, inplace=True)
df
However, when I count them using the code below, the output counts each individual character instead of whole hashtags.
from collections import Counter
d = Counter(df.hashtags.sum())
data = pd.DataFrame([d]).T
data
I think the problem lies with the code that I am using to extract hashtags. But I don't know how to solve this issue.
Change find_tags so the replace happens inside the list comprehension together with split, and to count the values use Series.explode with Series.value_counts:
def find_tags(row_string):
    return [x.replace('\\n', ',').replace('\\', '').replace("'", "")
            for x in row_string.split() if x.startswith('#')]

df['hashtags'] = df['text'].apply(find_tags)
and then:
data = df.hashtags.explode().value_counts().rename_axis('val').reset_index(name='count')
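A quick check on a toy frame (hypothetical data, just to show the shape of the result):

import pandas as pd

df = pd.DataFrame({'text': ['hello #a #b world', 'hi #a']})

def find_tags(row_string):
    return [x.replace('\\n', ',').replace('\\', '').replace("'", "")
            for x in row_string.split() if x.startswith('#')]

df['hashtags'] = df['text'].apply(find_tags)
data = df.hashtags.explode().value_counts().rename_axis('val').reset_index(name='count')
print(data)
#   val  count
# 0  #a      2
# 1  #b      1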
Sample DataFrame:
id date price
93 6021501535 2014-07-25 430000
93 6021501535 2014-12-23 700000
313 4139480200 2014-06-18 1384000
313 4139480200 2014-12-09 1400000
first_list = []
second_list = []
I need to add the first price that corresponds to a specific ID to the first list and the second price for that same ID to the second list.
Example:
first_list = [430000, 1384000]
second_list = [700000, 1400000]
After that, I'm going to plot the values from both lists on a line plot to compare the difference in price between the first and second lists.
I've tried doing this with groupby and loc and I kept running into errors. I then tried iterating over each row using a simple for loop but ran into more problems...
I would appreciate some help.
Based on your question, I think it isn't necessary to save them into lists, because you could also store them somewhere else (e.g. another DataFrame) and plot from there. The functions below should help with filling whatever structure you want to store your data in.
def date(your_id):
    first_date = df.loc[df['id'] == your_id].iloc[0, 1]
    second_date = df.loc[df['id'] == your_id].iloc[1, 1]
    return first_date, second_date

def price(your_id):
    first_date, second_date = date(your_id)
    # use your_id here too (the original hardcoded 6021501535, which would break for other ids)
    price_first_date = df.loc[(df['id'] == your_id) & (df['date'] == first_date)].iloc[0, 2]
    price_second_date = df.loc[(df['id'] == your_id) & (df['date'] == second_date)].iloc[0, 2]
    return price_first_date, price_second_date
price_first_date, price_second_date = price(6021501535)
If now, for example, you want to store your data in a new df, you could do something like:
import numpy as np

selected_ids = [6021501535, 4139480200]
new_df = pd.DataFrame(index=np.arange(1, len(selected_ids) + 1), columns=['price_first_date', 'price_second_date'])
for i in range(len(selected_ids)):
    your_id = selected_ids[i]
    new_df.iloc[i, 0], new_df.iloc[i, 1] = price(your_id)
new_df then contains all 'first date prices' in the first column and all 'second date prices' in the second column. Plotting should work out.
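For what it's worth, if the two lists themselves are the goal, a shorter sketch using groupby (assuming, as in the sample, exactly two rows per id ordered by date) could be:

grouped = df.sort_values('date').groupby('id')['price']
first_list = grouped.nth(0).tolist()   # first price per id
second_list = grouped.nth(1).tolist()  # second price per id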
Suppose I have 3 dataframe variables: res_df_union is the main dataframe, and df_parc_res and df_parc_vacant are sub-dataframes created from res_df_union. They all share two columns called uniqueid and vacant_has_res. My goal is to compare the uniqueid column values in df_parc_res and df_parc_vacant, and if they match, to assign vacant_has_res in res_df_union the value 1.
*Note: I am using GeoPandas (a gpd GeoDataFrame) instead of plain pandas because I am working with spatial data, but the concept is the same.
res_df_union = gpd.read_file(union, layer=union_feat)

df_parc_res = res_df_union[res_df_union.Parc_Res == 1]
unq_id_res = df_parc_res.uniqueid.unique()

df_parc_vacant = res_df_union[res_df_union.Parc_Vacant == 1]
unq_id_vac = df_parc_vacant.uniqueid.unique()

vacant_res_ids = []
for id_a in unq_id_vac:
    for id_b in unq_id_res:
        if id_a == id_b:
            vacant_res_ids.append(id_a)
The code up to this point works. I have a list of uniqueids that match. Now I just want to look for those uniqueids in res_df_union and then assign res_df_union['vacant_has_res'] = 1. When I run the following, it either causes my IDE to crash or never finishes running (after several hours). What am I doing wrong, and is there a more efficient way to do this?
def u_row(row, id_val):
    if row['uniqueid'] == id_val:
        return 1

for item in res_df_union['uniqueid']:
    if item in vacant_res_ids:
        res_df_union['Has_Res_Association'] = res_df_union.apply(lambda row: u_row(row, item), axis=1)
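For what it's worth, a vectorized sketch with the same column names that avoids both the nested loops and the repeated apply:

# ids present in both subsets
matching = set(unq_id_vac) & set(unq_id_res)
# flag the matching rows directly
res_df_union.loc[res_df_union['uniqueid'].isin(matching), 'vacant_has_res'] = 1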
I'm trying to compare two values in the same column in a pandas.DataFrame.
If the two values are different, I want to create a new value.
My code looks like this:
def f(x, var1, var2):
    if (x[var1].shift(1) != x[var1]):
        x[var2] = 1
    else:
        x[var2] = 0
    return x

sdf['2008':'2009'].apply(lambda x: f(x, 'ROW1', 'ROW2'), axis=1)
Unfortunately, this doesn't work. I get the following error message:
'numpy.float64' object has no attribute 'shift'", 'occurred at index 2008-01-01 00:00:00'
Thanks for your help.
I think you need to shift the whole column first; inside apply with axis=1, x[var1] is a single scalar value, which is why it has no .shift attribute:
df0 = df.shift()  # the whole frame shifted down by one row
df['Row2'] = np.where(df0['Row1'] != df['Row1'], 1, 0)
EDIT:
As @jpp suggested in comments:
df['Row2'] = (df0['Row1']!=df['Row1']).astype(int)
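A tiny check on hypothetical data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Row1': [1, 1, 2, 2, 3]})
df['Row2'] = (df.shift()['Row1'] != df['Row1']).astype(int)
df['Row2'].tolist()  # [1, 0, 1, 0, 1]: 1 wherever the value changed from the previous row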