How to find if a column value is undervalued or overvalued based on another column? - python-3.x

So I'm trying to solve this pandas exercise. I have a dataset from a real-estate firm on Kaggle, and the DataFrame df looks like this:
id location type price
0 44525 Golden Mile House 4400000
1 44859 Nagüeles House 2400000
2 45465 Nagüeles House 1900000
3 50685 Nagüeles Plot 4250000
4 130728 Golden Mile House 32000000
5 130856 Nagüeles Plot 2900000
6 130857 Golden Mile House 3900000
7 130897 Golden Mile House 3148000
8 3484102 Marinha Plot 478000
9 3484124 Marinha Plot 2200000
10 3485461 Marinha House 1980000
So now, I have to find which property is undervalued or overvalued, and which one has a genuine price, based on the location and type columns. The desired result should look like this:
id location type price Over_val Under_val Norm_val
0 44525 Golden Mile House 4400000 0 0 1
1 44859 Nagüeles House 2400000 0 0 1
2 45465 Nagüeles House 1900000 0 0 1
3 50685 Nagüeles Plot 4250000 0 1 0
4 130728 Golden Mile House 32000000 1 0 0
5 130856 Nagüeles Plot 2900000 0 1 0
6 130857 Golden Mile House 3900000 0 0 1
7 130897 Golden Mile House 3148000 0 0 1
8 3484102 Marinha Plot 478000 0 0 1
9 3484124 Marinha Plot 2200000 0 0 1
10 3485461 Marinha House 1980000 0 1 0
I have been stuck on it for a while. What logic should I try to solve this problem?

Here's my solution, with the explanation included as inline comments. There are probably ways to do this in fewer steps; I'd be interested to learn those too.
import pandas as pd
# Replace this with whatever you have to load your data. This is set up for a sample data file I used
df = pd.read_csv('my_sample_data.csv', encoding='latin-1')
# Mean by location - type
mdf = df.set_index('id').groupby(['location','type'])['price'].mean().rename('mean').to_frame().reset_index()
# StdDev by location - type
sdf = df.set_index('id').groupby(['location','type'])['price'].std().rename('sd').to_frame().reset_index()
# Merge back into the original dataframe
df = df.set_index(['location','type']).join(mdf.set_index(['location','type'])).reset_index()
df = df.set_index(['location','type']).join(sdf.set_index(['location','type'])).reset_index()
# Add the indicator columns
df['Over_val'] = 0
df['Under_val'] = 0
df['Normal_val'] = 0
# Update the indicators
df.loc[df['price'] > df['mean'] + 2 * df['sd'], 'Over_val'] = 1
df.loc[df['price'] < df['mean'] - 2 * df['sd'], 'Under_val'] = 1
df['Normal_val'] = df['Over_val'] + df['Under_val']
df['Normal_val'] = df['Normal_val'].apply(lambda x: 1 if x == 0 else 0)
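As a side note, here's a more compact sketch of the same idea (an alternative, not a correction of the steps above), assuming df is loaded with the location, type and price columns from the question. groupby().transform() computes the group mean and standard deviation aligned with df's index, so the intermediate frames and joins aren't needed:
grp = df.groupby(['location', 'type'])['price']
mean = grp.transform('mean')   # per-row group mean
sd = grp.transform('std')      # per-row group standard deviation
df['Over_val'] = (df['price'] > mean + 2 * sd).astype(int)
df['Under_val'] = (df['price'] < mean - 2 * sd).astype(int)
df['Norm_val'] = ((df['Over_val'] + df['Under_val']) == 0).astype(int)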

Here is another possible method. At two standard deviations there are no qualifying properties; there is one property at one standard deviation.
import pandas as pd
df = pd.DataFrame(data={}, columns=["id", "location", "type", "price"])
# data is already entered, left out for this example
df["id"] = prop_id
df["location"] = location
df["type"] = prop_type
df["price"] = price
# a function that returns the mean and standard deviation
def mean_std_dev(row):
    mask1 = df["location"] == row["location"]
    mask2 = df["type"] == row["type"]
    df_filt = df[mask1 & mask2]
    mean_price = df_filt["price"].mean()
    std_dev_price = df_filt["price"].std()
    return [mean_price, std_dev_price]
# create two columns and populate with the mean and std dev from function mean_std_dev
df[["mean", "standard deviation"]] = df.apply(
lambda row: pd.Series(mean_std_dev(row)), axis=1
)
# create final columns
df["Over_val"] = df.apply(
lambda x: 1 if x["price"] > x["mean"] + x["standard deviation"] else 0, axis=1
)
df["Under_val"] = df.apply(
lambda x: 1 if x["price"] < x["mean"] - x["standard deviation"] else 0, axis=1
)
df["Norm_val"] = df.apply(
lambda x: 1 if x["Over_val"] + x["Under_val"] == 0 else 0, axis=1
)
# delete the mean and standard deviation columns (assign the result back, otherwise drop has no effect)
df = df.drop(["mean", "standard deviation"], axis=1)

Related

Getting Dummy Back to Categorical

I have a df called X like this:
Index Class Family
1 Mid 12
2 Low 6
3 High 5
4 Low 2
I converted this to dummy variables using the code below:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
ohe = OneHotEncoder()
X_object = X.select_dtypes('object')
ohe.fit(X_object)
codes = ohe.transform(X_object).toarray()
feature_names = ohe.get_feature_names(['V1', 'V2'])
X = pd.concat([df.select_dtypes(exclude='object'),
               pd.DataFrame(codes, columns=feature_names).astype(int)], axis=1)
Resultant df is like:
V1_Mid V1_Low V1_High V2_12 V2_6 V2_5 V2_2
1 0 0 1 0 0 0
..and so on
Question: How do I convert my resultant df back to the original df?
I have seen this but it gives me NameError: name 'Series' is not defined.
First we can regroup each original column from your resultant df into the original column names as the first level of a column multi-index:
>>> df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('_', 1).map(tuple))
>>> df = df.rename(columns={'V1': 'Class', 'V2': 'Family'}, level=0)
>>> df
Class Family
Mid Low High 12 6 5 2
0 1 0 0 1 0 0 0
Now we see that the second level of columns holds the values. Thus, within each top-level group we want to get the column name that has a 1, knowing all the other entries are 0. This can be done with idxmax():
>>> orig_df = pd.concat({col: df[col].idxmax(axis='columns') for col in df.columns.levels[0]}, axis='columns')
>>> orig_df
Class Family
0 Mid 12
An even simpler way is to just stick to pandas.
df = pd.DataFrame({"Index":[1,2,3,4],"Class":["Mid","Low","High","Low"],"Family":[12,6,5,2]})
# Combine features in new column
df["combined"] = list(zip(df["Class"], df["Family"]))
print(df)
Out:
Index Class Family combined
0 1 Mid 12 (Mid, 12)
1 2 Low 6 (Low, 6)
2 3 High 5 (High, 5)
3 4 Low 2 (Low, 2)
You can get the one hot encoding using pandas directly.
one_hot = pd.get_dummies(df["combined"])
print(one_hot)
Out:
(High, 5) (Low, 2) (Low, 6) (Mid, 12)
0 0 0 0 1
1 0 0 1 0
2 1 0 0 0
3 0 1 0 0
Then, to get back, you can just check the name of the column and select the row in the original dataframe with the same tuple.
print(df[df["combined"]==one_hot.columns[0]])
Out:
Index Class Family combined
2 3 High 5 (High, 5)
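As a small, hedged addendum (not from either answer above): if you are on pandas 1.5 or newer, pd.from_dummies can invert a get_dummies result directly, provided the dummy columns keep a consistent prefix and separator. A minimal sketch with made-up data matching the question:
import pandas as pd
df = pd.DataFrame({"Class": ["Mid", "Low", "High", "Low"], "Family": [12, 6, 5, 2]})
# one-hot encode; get_dummies names the columns Class_Mid, Family_12, ...
one_hot = pd.get_dummies(df.astype(str))
# invert the encoding; note the recovered values come back as strings
restored = pd.from_dummies(one_hot, sep="_")
print(restored)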

Faster way to count number of timestamps before another timestamp

I have two dataframes, "train" and "log". "log" has a datetime column "time1", while "train" has a datetime column "time2". For every row in "train" I want to count how many "time1" values (for the same user_id) come before that row's "time2".
I already tried the apply method with dataframe.
def log_count(row):
    return sum((log['user_id'] == row['user_id']) & (log['time1'] < row['time2']))

train.apply(log_count, axis=1)
It is taking very long with this approach.
Since you want to do this once for each (paired) user_id group, you could do the following:
Create a column called is_log which is 1 in log and 0 in train:
log['is_log'] = 1
train['is_log'] = 0
The is_log column will be used to keep track of whether or not a row comes from log or train.
Concatenate the log and train DataFrames:
combined = pd.concat(
    [log.rename(columns=dict(time1="time")), train.rename(columns=dict(time2="time"))],
    axis=0,
    ignore_index=True,
    sort=False,
)
Sort the combined DataFrame by user_id and time:
combined = combined.sort_values(by=["user_id", "time"])
So now combined looks something like this:
time user_id is_log
6 2000-01-17 0 0
0 2000-03-13 0 1
1 2000-06-08 0 1
7 2000-06-25 0 0
4 2000-07-09 0 1
8 2000-07-18 0 0
10 2000-03-13 1 0
5 2000-04-16 1 0
3 2000-08-04 1 1
9 2000-08-17 1 0
2 2000-10-20 1 1
Now the count that you are looking for can be expressed as a cumulative sum of the is_log column, grouped by user_id:
combined["count"] = combined.groupby("user_id")["is_log"].cumsum()
train = combined.loc[combined["is_log"] == 0]
This is the main idea: Counting the number of 1s in the is_log column is equivalent to counting the number of times in log which come before each time in train.
For example,
import numpy as np
import pandas as pd
np.random.seed(2019)
def random_dates(N):
    return np.datetime64("2000-01-01") + np.random.randint(
        365, size=N
    ) * np.timedelta64(1, "D")
N = 5
log = pd.DataFrame({"time1": random_dates(N), "user_id": np.random.randint(2, size=N)})
train = pd.DataFrame(
    {
        "time2": np.r_[random_dates(N), log.loc[0, "time1"]],
        "user_id": np.random.randint(2, size=N + 1),
    }
)
log["is_log"] = 1
train["is_log"] = 0
combined = pd.concat(
    [log.rename(columns=dict(time1="time")), train.rename(columns=dict(time2="time"))],
    axis=0,
    ignore_index=True,
    sort=False,
)
combined = combined.sort_values(by=["user_id", "time"])
combined["count"] = combined.groupby("user_id")["is_log"].cumsum()
train = combined.loc[combined["is_log"] == 0]
print(log)
# time1 user_id is_log
# 0 2000-03-13 0 1
# 1 2000-06-08 0 1
# 2 2000-10-20 1 1
# 3 2000-08-04 1 1
# 4 2000-07-09 0 1
print(train)
yields
time user_id is_log count
6 2000-01-17 0 0 0
7 2000-06-25 0 0 2
8 2000-07-18 0 0 3
10 2000-03-13 1 0 0
5 2000-04-16 1 0 0
9 2000-08-17 1 0 1
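For completeness, here is a hedged alternative sketch (not part of the answer above): if you keep the original log and train frames from the question, you can sort each user's log times once and count with np.searchsorted. side="left" counts entries strictly before each time2, matching log['time1'] < row['time2']:
import numpy as np
# sort each user's log times once
times_by_user = {uid: np.sort(g["time1"].values) for uid, g in log.groupby("user_id")}
empty = np.array([], dtype="datetime64[ns]")
train["count"] = [
    np.searchsorted(times_by_user.get(uid, empty), t, side="left")
    for uid, t in zip(train["user_id"].values, train["time2"].values)
]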

Removing outliers using percentiles in a pandas dataframe groupby

I have dataframe df
Transportation_Mode time_delta trip_id segmentid Vincenty_distance velocity acceleration jerk
walk 1 1 1 1.551676553 1.551676553 0.550163852 -1.017629555
walk 1 1 1 1.70920675 1.70920675 0.16257622 -0.39166534
walk 1 1 1 1.871782971 1.871782971 -0.22908912 -0.734438511
walk 12 1 1 23.16466284 1.93038857 0.324972586 -0.331839143
walk 1 1 1 5.830059603 5.830059603 -3.657097132 2.614438854
bus 1 16 5 8.418372046 8.418372046 -7.259019484 7.40735053
bus 23 16 5 26.66510892 1.159352562 0.148331046 -0.036318522
bus 1 16 5 4.570966614 4.570966614 -0.68699497 -0.889126918
I want to remove outlier values within each Transportation_Mode group, based on the [0.05, 0.95] percentile values.
My problem is similar to discussion Remove outliers in Pandas dataframe with groupby
The code I wrote is:
res = df.groupby("Transportation_Mode")["Vincenty_distance"].quantile([0.05, 0.95]).unstack(level=1)
df.loc[ (res.loc[ df.Transportation_Mode, 0.05] < df.Vincenty_distance.values) & (df.Vincenty_distance.values < res.loc[df.Transportation_Mode, 0.95]) ]
but I get the error ValueError: cannot reindex from a duplicate axis. I don't know where I'm going wrong here.
Complete input data is available at the link https://drive.google.com/file/d/1JjvS7igTmrtLA4E5Rs5D6tsdAXqzpYqX/view?usp=sharing
Actually, if we look closely,
(res.loc[ df.Transportation_Mode, 0.05] < df.Vincenty_distance.values) & (df.Vincenty_distance.values < res.loc[df.Transportation_Mode, 0.95])
returns a boolean Series that can be used to select rows in the original df. We just need to pass the raw values of that Series by adding .values when giving it to df.loc[]. The following should work:
df.loc[ ((res.loc[ df.Transportation_Mode, 0.05] < df.Vincenty_distance.values) & (df.Vincenty_distance.values < res.loc[df.Transportation_Mode, 0.95])).values]
Use map to build Series the same length as the original DataFrame, which makes direct filtering possible:
m1 = (df.Transportation_Mode.map(res[0.05]) < df.Vincenty_distance)
m2 = (df.Vincenty_distance.values < df.Transportation_Mode.map(res[0.95]))
df = df[m1 & m2]
print (df)
Transportation_Mode time_delta trip_id segmentid Vincenty_distance \
1 walk 1 1 1 1.709207
2 walk 1 1 1 1.871783
4 walk 1 1 1 5.830060
5 bus 1 16 5 8.418372
velocity acceleration jerk
1 1.709207 0.162576 -0.391665
2 1.871783 -0.229089 -0.734439
4 5.830060 -3.657097 2.614439
5 8.418372 -7.259019 7.407351
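Another hedged sketch, as a possible simplification of the same idea: groupby().transform() returns the per-group quantile bounds already aligned with df's index, so neither the .values trick nor map is needed (same df and column names as in the question assumed):
lo = df.groupby("Transportation_Mode")["Vincenty_distance"].transform(lambda s: s.quantile(0.05))
hi = df.groupby("Transportation_Mode")["Vincenty_distance"].transform(lambda s: s.quantile(0.95))
df = df[(df["Vincenty_distance"] > lo) & (df["Vincenty_distance"] < hi)]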

pandas pd.read_html heading shifted to the right

I'm trying to convert a wiki page table to a dataframe. The headings are shifted to the right: 'Launches' should be where 'Successes' now is.
I have used the skiprows option, but it did not work.
df = pd.read_html(r'https://en.wikipedia.org/wiki/2018_in_spaceflight',skiprows=[1,2])[7]
df2 = df[df.columns[1:5]]
1 2 3 4
0 Launches Successes Failures Partial failures
1 India 1 1 0
2 Japan 3 3 0
3 New Zealand 1 1 0
4 Russia 3 3 0
5 United States 8 8 0
6 24 23 0 1
The problem is that there are merged cells in the first column of the original table. If you want to parse it exactly, you would need to write a custom parser. Provisionally, you can try:
df = pd.read_html(r'https://en.wikipedia.org/wiki/2018_in_spaceflight', header=0)[7]
df.columns = [""] + list(df.columns[:-1])
df.iloc[-1] = [""] + list(df.iloc[-1][:-1])

delete specific rows from csv using pandas

I have a csv file in the format shown below:
I have written the following code that reads the file and randomly deletes the rows that have steering value as 0. I want to keep just 10% of the rows that have steering value as 0.
df = pd.read_csv(filename, header=None, names = ["center", "left", "right", "steering", "throttle", 'break', 'speed'])
df = df.drop(df.query('steering==0').sample(frac=0.90).index)
However, I get the following error:
df = df.drop(df.query('steering==0').sample(frac=0.90).index)
locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
File "mtrand.pyx", line 1104, in mtrand.RandomState.choice
(numpy/random/mtrand/mtrand.c:17062)
ValueError: a must be greater than 0
Can you guys help me?
Sample DataFrame built with @andrew_reece's code:
In [9]: df
Out[9]:
center left right steering throttle brake
0 center_54.jpg left_75.jpg right_39.jpg 1 0 0
1 center_20.jpg left_81.jpg right_49.jpg 3 1 1
2 center_34.jpg left_96.jpg right_11.jpg 0 4 2
3 center_98.jpg left_87.jpg right_34.jpg 0 0 0
4 center_67.jpg left_12.jpg right_28.jpg 1 1 0
5 center_11.jpg left_25.jpg right_94.jpg 2 1 0
6 center_66.jpg left_27.jpg right_52.jpg 1 3 3
7 center_18.jpg left_50.jpg right_17.jpg 0 0 4
8 center_60.jpg left_25.jpg right_28.jpg 2 4 1
9 center_98.jpg left_97.jpg right_55.jpg 3 3 0
.. ... ... ... ... ... ...
90 center_31.jpg left_90.jpg right_43.jpg 0 1 0
91 center_29.jpg left_7.jpg right_30.jpg 3 0 0
92 center_37.jpg left_10.jpg right_15.jpg 1 0 0
93 center_18.jpg left_1.jpg right_83.jpg 3 1 1
94 center_96.jpg left_20.jpg right_56.jpg 3 0 0
95 center_37.jpg left_40.jpg right_38.jpg 0 3 1
96 center_73.jpg left_86.jpg right_71.jpg 0 1 0
97 center_85.jpg left_31.jpg right_0.jpg 3 0 4
98 center_34.jpg left_52.jpg right_40.jpg 0 0 2
99 center_91.jpg left_46.jpg right_17.jpg 0 0 0
[100 rows x 6 columns]
In [10]: df.steering.value_counts()
Out[10]:
0 43 # NOTE: 43 zeros
1 18
2 15
4 12
3 12
Name: steering, dtype: int64
In [11]: df.shape
Out[11]: (100, 6)
your solution (unchanged):
In [12]: df = df.drop(df.query('steering==0').sample(frac=0.90).index)
In [13]: df.steering.value_counts()
Out[13]:
1 18
2 15
4 12
3 12
0 4 # NOTE: 4 zeros (~10% from 43)
Name: steering, dtype: int64
In [14]: df.shape
Out[14]: (61, 6)
NOTE: make sure that the steering column has a numeric dtype! If it's a string (object) column, then you would need to change your code as follows:
df = df.drop(df.query('steering=="0"').sample(frac=0.90).index)
# NOTE: ^ ^
after that you can save the modified (reduced) DataFrame to CSV:
df.to_csv('/path/to/filename.csv', index=False)
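One small, hedged addition (not part of the original answer): passing random_state to sample() makes the 10% subsample reproducible across runs, which is handy when regenerating the reduced CSV; the seed value 42 here is arbitrary:
df = df.drop(df.query('steering==0').sample(frac=0.90, random_state=42).index)
df.to_csv('/path/to/filename.csv', index=False)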
Here's a one-line approach, using concat() and sample():
import numpy as np
import pandas as pd
# first, some sample data
# generate filename fields
positions = ['center','left','right']
N = 100
fnames = ['{}_{}.jpg'.format(loc, np.random.randint(100)) for loc in np.repeat(positions, N)]
df = pd.DataFrame(np.array(fnames).reshape(3,100).T, columns=positions)
# generate numeric fields
values = [0,1,2,3,4]
probas = [.5,.2,.1,.1,.1]
df['steering'] = np.random.choice(values, p=probas, size=N)
df['throttle'] = np.random.choice(values, p=probas, size=N)
df['brake'] = np.random.choice(values, p=probas, size=N)
print(df.shape)
(100, 6)
The first few rows of sample output:
df.head()
center left right steering throttle brake
0 center_72.jpg left_26.jpg right_59.jpg 3 3 0
1 center_75.jpg left_68.jpg right_26.jpg 0 0 2
2 center_29.jpg left_8.jpg right_88.jpg 0 1 0
3 center_22.jpg left_26.jpg right_23.jpg 1 0 0
4 center_88.jpg left_0.jpg right_56.jpg 4 1 0
5 center_93.jpg left_18.jpg right_15.jpg 0 0 0
Now drop all but 10% of rows with steering==0:
newdf = pd.concat([df.loc[df.steering!=0],
                   df.loc[df.steering==0].sample(frac=0.1)])
With the probability weightings I used in this example, you'll see somewhere between 50-60 remaining entries in newdf, with about 5 steering==0 cases remaining.
Using a mask on steering combined with a random number should work:
df = df[(df.steering != 0) | (np.random.rand(len(df)) < 0.1)]
This does generate some extra random values, but it's nice and compact.
Edit: That said, I tried your example code and it worked as well. My guess is the error is coming from the fact that your df.query() statement is returning an empty dataframe, which probably means that the steering column does not contain any zeros, or alternatively that the column is read as strings rather than numeric. Try converting the column to a numeric type before running the above snippet.
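To make that last suggestion concrete, a hedged sketch (assuming the steering values were read in as strings): coerce the column to a numeric dtype first, then apply the mask above:
import numpy as np
import pandas as pd
# coerce to numeric; unparseable values become NaN
df['steering'] = pd.to_numeric(df['steering'], errors='coerce')
df = df[(df.steering != 0) | (np.random.rand(len(df)) < 0.1)]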
