How to encode data based on a range of numbers - python-3.x

import pandas as pd
from sklearn import preprocessing
my_data = {
    "Marks": [50, 62, 42, 90, 12],
    "Exam": ['FirstSem', 'SecondSem', 'ThirdSem', 'FourthSem', 'FifthSem']
}
blk = pd.DataFrame( my_data )
print( blk )
Required solution:
   Marks       Exam
0      1   FirstSem
1      1  SecondSem
2      0   ThirdSem
3      1  FourthSem
4      0   FifthSem
Is there a way to encode the values so that marks greater than 45 become 1 and marks less than 45 become 0?

blk["Marks"] = np.where(blk["Marks"]>45,1,0)
blk
Marks Exam
0 1 FirstSem
1 1 SecondSem
2 0 ThirdSem
3 1 FourthSem
4 0 FifthSem
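If you ever need more than two buckets (the title mentions a range of numbers), pd.cut bins numeric values by explicit edges. A minimal sketch on the blk frame above; the 45/75 edges and the 0/1/2 labels are chosen purely for illustration and are not part of the original question:
import pandas as pd

# <45 -> 0, 45-75 -> 1, >75 -> 2 (illustrative edges and labels)
blk["MarksBand"] = pd.cut(
    blk["Marks"],
    bins=[-float("inf"), 45, 75, float("inf")],
    labels=[0, 1, 2],
).astype(int)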

Related

K means clustering and finding points closest to the centroid

I am trying to apply k-means to cluster actors based on the information in the following columns:
Actors             Movies  TvGuest  Awards  Shorts  Special  LiveShows
Robert De Niro        111        2       6       0        0          0
Jack Nicholson         70        2       4       0        5          0
Marlon Brando          64        2       5       0        0         28
Denzel Washington      25        2       3      24        0          0
Katharine Hepburn      90        1       2       0        0          0
Humphrey Bogart       105        2       1       0        0         52
Meryl Streep           27        2       2       5        0          0
Daniel Day-Lewis       90        2       1       0       71         22
Sidney Poitier         63        2       3       0        0          0
Clark Gable            34        2       4       0        3          0
Ingrid Bergman         22        2       2       3        0          4
Tom Hanks              82       11       6      21       11         22
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# began by scaling my data (`data` is the DataFrame of actor columns shown above)
X = StandardScaler().fit_transform(data)

# used an elbow plot to find the optimal k value
sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    k_means = KMeans(n_clusters=k)
    model = k_means.fit(X)
    sum_of_squared_distances.append(k_means.inertia_)
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.show()

# found yhat for the chosen k value
kmeans = KMeans(n_clusters=3)
model = kmeans.fit(X)
yhat = kmeans.predict(X)
I am unable to figure out how to create scatter plots of the actors by cluster.
EDIT:
Is there a way to find which actors are closest to centroids if the centroids were also plotted using
centers = kmeans.cluster_centers_ (The kmeans here refers to Eric's solution below)
plt.scatter(centers[:,0],centers[:,1],color='purple',marker='*',label='centroid')
K means clustering in Pandas - Scatter plot
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
df = pd.DataFrame(columns=['Actors', 'Movies', 'TvGuest', "Awards", "Shorts"])
df.loc[0] = ["Robert De Niro", 111, 2, 6, 0]
df.loc[1] = ["Jack Nicholson", 70, 2, 4, 0]
df.loc[2] = ["Marlon Brando", 64, 4, 5, 0]
df.loc[3] = ["Denzel Washington", 25, 2, 3, 24]
df.loc[4] = ["Katharine Hepburn", 90, 1, 2, 0]
df.loc[5] = ["Humphrey Bogart", 105, 2, 1, 0]
df.loc[6] = ["Meryl Streep", 27, 3, 2, 5]
df.loc[7] = ["Daniel Day-Lewis", 90, 2, 1, 0]
df.loc[8] = ["Sidney Poitier", 63, 2, 3, 0]
df.loc[9] = ["Clark Gable", 34, 2, 4, 0]
df.loc[10] = ["Ingrid Bergman", 22, 5, 2, 3]
kmeans = KMeans(n_clusters=4)
y = kmeans.fit_predict(df[['Movies', 'TvGuest', 'Awards']])
df['Cluster'] = y
plt.scatter(df.Movies, df.TvGuest, c=df.Cluster, alpha = 0.6)
plt.title('K-means Clustering 2 dimensions and 4 clusters')
plt.show()
This produces a scatter plot of Movies against TvGuest, with the points colored by cluster.
Notice that the data points on the two-dimensional scatter plot are Movies and TvGuest; however, the KMeans fit was given three variables: Movies, TvGuest, and Awards. Imagine an additional dimension going into the screen, which is also used to calculate membership in a cluster.
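If you want to see all three fitted dimensions at once, a rough sketch (assuming the df and kmeans objects from the snippet above) is matplotlib's 3D scatter:
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection on older matplotlib

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df.Movies, df.TvGuest, df.Awards, c=df.Cluster, alpha=0.6)
ax.set_xlabel('Movies')
ax.set_ylabel('TvGuest')
ax.set_zlabel('Awards')
plt.show()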
Source links:
https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
https://datascience.stackexchange.com/questions/48693/perform-k-means-clustering-over-multiple-columns
https://towardsdatascience.com/visualizing-clusters-with-pythons-matplolib-35ae03d87489
You can calculate the Euclidean distance between each point and a centroid and take the minimum distance, which indicates the point closest to that centroid:
dist = numpy.linalg.norm(centroid-point)
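Building on that, here is a rough sketch of finding the actor closest to each centroid. It assumes X is the scaled feature matrix from the question, data is the original actors DataFrame with an Actors column, and kmeans is the fitted model; these names come from the question's code, not from any library API:
import numpy as np

centers = kmeans.cluster_centers_
labels = kmeans.labels_
for i, center in enumerate(centers):
    members = np.where(labels == i)[0]                   # row positions assigned to cluster i
    dists = np.linalg.norm(X[members] - center, axis=1)  # distance of each member to the centroid
    closest = members[np.argmin(dists)]
    print("cluster", i, "closest actor:", data['Actors'].iloc[closest])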

Python 3 Pandas fast lookup in dictionary for column

I have a Pandas DataFrame where I need to add new columns of data from lookup dictionaries. I am looking for the fastest way to do this. I have a way that works using DataFrame.map() with a lambda, but I wanted to know if this is the best practice and the best performance I can achieve. I am used to doing this kind of work in R with the excellent data.table library. I am working in a Jupyter notebook, which is what lets me use %time on the final line.
Here is what I have:
import numpy as np
import pandas as pd
np.random.seed(123)
num_samples = 100_000_000
ids = np.arange(0, num_samples)
states = ['Oregon', 'Michigan']
cities = ['Portland', 'Detroit']
state_data = {
    0: {'Name': 'Oregon', 'mean': 100, 'std_dev': 5},
    1: {'Name': 'Michigan', 'mean': 90, 'std_dev': 8}
}
city_data = {
    0: {'Name': 'Portland', 'mean': 8, 'std_dev': 3},
    1: {'Name': 'Detroit', 'mean': 4, 'std_dev': 3}
}
state_df = pd.DataFrame.from_dict(state_data,orient='index')
print(state_df)
city_df = pd.DataFrame.from_dict(city_data,orient='index')
print(city_df)
sample_df = pd.DataFrame({'id':ids})
sample_df['state_id'] = np.random.randint(0, 2, num_samples)
sample_df['city_id'] = np.random.randint(0, 2, num_samples)
%time sample_df['state_mean'] = sample_df['state_id'].map(state_data).map(lambda x : x['mean'])
The last line is what I am most focused on.
I have also tried the following but saw no significant performance difference:
%time sample_df['state_mean'] = sample_df['state_id'].map(lambda x : state_data[x]['mean'])
What I ultimately want is to get sample_df to have columns for each of the states and cities. So I would have the following columns in the table:
id | state | state_mean | state_std_dev | city | city_mean | city_std_dev
Use DataFrame.join if you want to add all columns:
sample_df = sample_df.join(state_df,on = 'state_id')
# id state_id city_id Name mean std_dev
#0 0 0 0 Oregon 100 5
#1 1 1 1 Michigan 90 8
#2 2 0 0 Oregon 100 5
#3 3 0 0 Oregon 100 5
#4 4 0 0 Oregon 100 5
#... ... ... ... ... ... ...
#9995 9995 1 0 Michigan 90 8
#9996 9996 1 1 Michigan 90 8
#9997 9997 0 1 Oregon 100 5
#9998 9998 1 1 Michigan 90 8
#9999 9999 1 0 Michigan 90 8
For one column:
sample_df['state_mean'] = sample_df['state_id'].map(state_df['mean'])
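If you want all of the columns from the question's desired output (state, state_mean, state_std_dev, city, city_mean, city_std_dev), a sketch along the same lines is to rename before joining so the state and city columns don't collide. The renamed column names below are taken from the desired output in the question, not from any existing column:
state_cols = state_df.rename(columns={'Name': 'state', 'mean': 'state_mean', 'std_dev': 'state_std_dev'})
city_cols = city_df.rename(columns={'Name': 'city', 'mean': 'city_mean', 'std_dev': 'city_std_dev'})

sample_df = (sample_df.join(state_cols, on='state_id')
                      .join(city_cols, on='city_id')
                      .drop(columns=['state_id', 'city_id']))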

Faster way to count number of timestamps before another timestamp

I have two DataFrames, "train" and "log". "log" has a datetime column "time1", while "train" has a datetime column "time2". For every row in "train" I want to count how many "time1" values (for the same user_id) come before that row's "time2".
I already tried the apply method with dataframe.
def log_count(row):
    return sum((log['user_id'] == row['user_id']) & (log['time1'] < row['time2']))

train.apply(log_count, axis=1)
It is taking very long with this approach.
Since you want to do this once for each (paired) user_id group, you could do the following:
Create a column called is_log which is 1 in log and 0 in train:
log['is_log'] = 1
train['is_log'] = 0
The is_log column will be used to keep track of whether or not a row comes from log or train.
Concatenate the log and train DataFrames:
combined = pd.concat(
    [log.rename(columns=dict(time1="time")), train.rename(columns=dict(time2="time"))],
    axis=0,
    ignore_index=True,
    sort=False,
)
Sort the combined DataFrame by user_id and time:
combined = combined.sort_values(by=["user_id", "time"])
So now combined looks something like this:
          time  user_id  is_log
6   2000-01-17        0       0
0   2000-03-13        0       1
1   2000-06-08        0       1
7   2000-06-25        0       0
4   2000-07-09        0       1
8   2000-07-18        0       0
10  2000-03-13        1       0
5   2000-04-16        1       0
3   2000-08-04        1       1
9   2000-08-17        1       0
2   2000-10-20        1       1
Now the count that you are looking for can be expressed as a cumulative sum of the is_log column, grouped by user_id:
combined["count"] = combined.groupby("user_id")["is_log"].cumsum()
train = combined.loc[combined["is_log"] == 0]
This is the main idea: Counting the number of 1s in the is_log column is equivalent to counting the number of times in log which come before each time in train.
For example,
import numpy as np
import pandas as pd
np.random.seed(2019)
def random_dates(N):
    return np.datetime64("2000-01-01") + np.random.randint(
        365, size=N
    ) * np.timedelta64(1, "D")
N = 5
log = pd.DataFrame({"time1": random_dates(N), "user_id": np.random.randint(2, size=N)})
train = pd.DataFrame(
    {
        "time2": np.r_[random_dates(N), log.loc[0, "time1"]],
        "user_id": np.random.randint(2, size=N + 1),
    }
)
log["is_log"] = 1
train["is_log"] = 0
combined = pd.concat(
    [log.rename(columns=dict(time1="time")), train.rename(columns=dict(time2="time"))],
    axis=0,
    ignore_index=True,
    sort=False,
)
combined = combined.sort_values(by=["user_id", "time"])
combined["count"] = combined.groupby("user_id")["is_log"].cumsum()
train = combined.loc[combined["is_log"] == 0]
print(log)
# time1 user_id is_log
# 0 2000-03-13 0 1
# 1 2000-06-08 0 1
# 2 2000-10-20 1 1
# 3 2000-08-04 1 1
# 4 2000-07-09 0 1
print(train)
yields
          time  user_id  is_log  count
6   2000-01-17        0       0      0
7   2000-06-25        0       0      2
8   2000-07-18        0       0      3
10  2000-03-13        1       0      0
5   2000-04-16        1       0      0
9   2000-08-17        1       0      1
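If the concat-and-cumsum approach is still not fast enough, an alternative sketch (not part of the original answer) is to sort each user's log times once and count with np.searchsorted. It assumes the original log/train frames with time1/time2 columns, before they are renamed or concatenated; the count_before helper is only illustrative:
import numpy as np
import pandas as pd

def count_before(log, train):
    # for every train row, count log['time1'] values of the same user_id that are strictly earlier
    out = pd.Series(0, index=train.index)
    for uid, grp in train.groupby("user_id"):
        times = np.sort(log.loc[log["user_id"] == uid, "time1"].to_numpy())
        out.loc[grp.index] = np.searchsorted(times, grp["time2"].to_numpy(), side="left")
    return out

train["count"] = count_before(log, train)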

Reorder columns in groups by number embedded in column name?

I have a very large dataframe with 1,000 columns. The first few columns occur only once, denoting a customer. The next few columns represent multiple encounters with the customer, suffixed with an underscore and the encounter number. Every additional encounter adds a new column, so there is NOT a fixed number of columns; it will grow with time.
Sample dataframe header structure excerpt:
id dob gender pro_1 pro_10 pro_11 pro_2 ... pro_9 pre_1 pre_10 ...
I'm trying to re-order the columns based on the number after the column name, so all _1 should be together, all _2 should be together, etc, like so:
id dob gender pro_1 pre_1 que_1 fre_1 gen_1 pro_2 pre_2 que_2 fre_2 ...
(Note that the re-order should order the numbers correctly; the current order treats them like strings, which orders 1, 10, 11, etc. rather than 1, 2, 3)
Is this possible to do in pandas, or should I be looking at something else? Any help would be greatly appreciated! Thank you!
EDIT:
Alternatively, is it also possible to re-arrange column names based on the string part AND number part of the column names? So the output would then look similar to the original, except the numbers would be considered so that the order is more intuitive:
id dob gender pro_1 pro_2 pro_3 ... pre_1 pre_2 pre_3 ...
EDIT 2.0:
Just wanted to thank everyone for helping! While only one of the responses worked, I really appreciate the effort and learned a lot about other approaches / ways to think about this.
Here is one way you can try:
# column names copied from your example
example_cols = 'id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10'.split()
# sample DF
df = pd.DataFrame([range(len(example_cols))], columns=example_cols)
df
# id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10
#0 0 1 2 3 4 5 6 7 8 9
# number of columns excluded from sorting
N = 3
# get a list of columns from the dataframe
cols = df.columns.tolist()
# split each name, create a tuple of (column_name, prefix, number), sort by the 2nd and 3rd items of the tuple, then keep the first item
# adjust to "key = lambda x: x[2]" to group the columns by numbers only
cols_new = cols[:N] + [
    a[0]
    for a in sorted(
        [(c, p, int(n)) for c in cols[N:] for p, n in [c.split('_')]],
        key=lambda x: (x[1], x[2]),
    )
]
# get the new dataframe based on the cols_new
df_new = df[cols_new]
# id dob gender pre_1 pre_10 pro_1 pro_2 pro_9 pro_10 pro_11
#0 0 1 2 8 9 3 6 7 4 5
Luckily there is a one-liner in Python that can fix this:
df = df.reindex(sorted(df.columns), axis=1)
For example, let's say you had this DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': [2, 4, 8, 0],
                   'ID': [2, 0, 0, 0],
                   'Prod3': [10, 2, 1, 8],
                   'Prod1': [2, 4, 8, 0],
                   'Prod_1': [2, 4, 8, 0],
                   'Pre7': [2, 0, 0, 0],
                   'Pre2': [10, 2, 1, 8],
                   'Pre_2': [10, 2, 1, 8],
                   'Pre_9': [10, 2, 1, 8]})
print(df)
Output:
Name ID Prod3 Prod1 Prod_1 Pre7 Pre2 Pre_2 Pre_9
0 2 2 10 2 2 2 10 10 10
1 4 0 2 4 4 0 2 2 2
2 8 0 1 8 8 0 1 1 1
3 0 0 8 0 0 0 8 8 8
Then run
df = df.reindex(sorted(df.columns), axis=1)
and the DataFrame will look like this:
ID Name Pre2 Pre7 Pre_2 Pre_9 Prod1 Prod3 Prod_1
0 2 2 10 2 10 10 2 10 2
1 0 4 2 0 2 2 4 2 4
2 0 8 1 0 1 1 8 1 8
3 0 0 8 0 8 8 0 8 0
As you can see, the columns without an underscore come first, followed by an ordering based on the number after the underscore. However, this also sorts the column names alphabetically, so the column names that come first in the alphabet will appear first.
You need to split your column names on '_' and then convert the number to int:
import numpy as np
import pandas as pd

c = ['A_1', 'A_10', 'A_2', 'A_3', 'B_1', 'B_10', 'B_2', 'B_3']
df = pd.DataFrame(np.random.randint(0, 100, (2, 8)), columns=c)
df.reindex(sorted(df.columns, key=lambda x: int(x.split('_')[1])), axis=1)
Output:
A_1 B_1 A_2 B_2 A_3 B_3 A_10 B_10
0 68 11 59 69 37 68 76 17
1 19 37 52 54 23 93 85 3
In the next case, you need human (natural) sorting:
import re
def atoi(text):
    return int(text) if text.isdigit() else text

def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    return [atoi(c) for c in re.split(r'(\d+)', text)]
df.reindex(sorted(df.columns, key = lambda x:natural_keys(x)), axis=1)
Output:
A_1 A_2 A_3 A_10 B_1 B_2 B_3 B_10
0 68 59 37 76 11 69 68 17
1 19 52 23 85 37 54 93 3
Try this.
To re-order the columns based on the number after the column name:
cols_fixed = df.columns[:3].tolist()   # change the index based on your df
cols_variable = df.columns[3:]         # change the index based on your df
cols_variable = sorted(cols_variable, key=lambda x: int(x.split('_')[1]))  # sort by the number after '_'
cols_new = cols_fixed + cols_variable
new_df = pd.DataFrame(df[cols_new])
To re-arrange column names based on the string part AND the number part of the column names:
cols_fixed = df.columns[:3].tolist()   # change the index based on your df
cols_variable = df.columns[3:]         # change the index based on your df
cols_variable = sorted(cols_variable)
cols_new = cols_fixed + cols_variable
new_df = pd.DataFrame(df[cols_new])

delete specific rows from csv using pandas

I have a CSV file with the columns center, left, right, steering, throttle, break and speed.
I have written the following code that reads the file and randomly deletes the rows that have steering value as 0. I want to keep just 10% of the rows that have steering value as 0.
df = pd.read_csv(filename, header=None, names = ["center", "left", "right", "steering", "throttle", 'break', 'speed'])
df = df.drop(df.query('steering==0').sample(frac=0.90).index)
However, I get the following error:
df = df.drop(df.query('steering==0').sample(frac=0.90).index)
  locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
  File "mtrand.pyx", line 1104, in mtrand.RandomState.choice (numpy/random/mtrand/mtrand.c:17062)
ValueError: a must be greater than 0
Can you guys help me?
Sample DataFrame built with @andrew_reece's code:
In [9]: df
Out[9]:
center left right steering throttle brake
0 center_54.jpg left_75.jpg right_39.jpg 1 0 0
1 center_20.jpg left_81.jpg right_49.jpg 3 1 1
2 center_34.jpg left_96.jpg right_11.jpg 0 4 2
3 center_98.jpg left_87.jpg right_34.jpg 0 0 0
4 center_67.jpg left_12.jpg right_28.jpg 1 1 0
5 center_11.jpg left_25.jpg right_94.jpg 2 1 0
6 center_66.jpg left_27.jpg right_52.jpg 1 3 3
7 center_18.jpg left_50.jpg right_17.jpg 0 0 4
8 center_60.jpg left_25.jpg right_28.jpg 2 4 1
9 center_98.jpg left_97.jpg right_55.jpg 3 3 0
.. ... ... ... ... ... ...
90 center_31.jpg left_90.jpg right_43.jpg 0 1 0
91 center_29.jpg left_7.jpg right_30.jpg 3 0 0
92 center_37.jpg left_10.jpg right_15.jpg 1 0 0
93 center_18.jpg left_1.jpg right_83.jpg 3 1 1
94 center_96.jpg left_20.jpg right_56.jpg 3 0 0
95 center_37.jpg left_40.jpg right_38.jpg 0 3 1
96 center_73.jpg left_86.jpg right_71.jpg 0 1 0
97 center_85.jpg left_31.jpg right_0.jpg 3 0 4
98 center_34.jpg left_52.jpg right_40.jpg 0 0 2
99 center_91.jpg left_46.jpg right_17.jpg 0 0 0
[100 rows x 6 columns]
In [10]: df.steering.value_counts()
Out[10]:
0 43 # NOTE: 43 zeros
1 18
2 15
4 12
3 12
Name: steering, dtype: int64
In [11]: df.shape
Out[11]: (100, 6)
your solution (unchanged):
In [12]: df = df.drop(df.query('steering==0').sample(frac=0.90).index)
In [13]: df.steering.value_counts()
Out[13]:
1 18
2 15
4 12
3 12
0 4 # NOTE: 4 zeros (~10% from 43)
Name: steering, dtype: int64
In [14]: df.shape
Out[14]: (61, 6)
NOTE: make sure that the steering column has a numeric dtype! If it's a string (object), then you would need to change your code as follows:
df = df.drop(df.query('steering=="0"').sample(frac=0.90).index)
# NOTE: ^ ^
after that you can save the modified (reduced) DataFrame to CSV:
df.to_csv('/path/to/filename.csv', index=False)
Here's a one-line approach, using concat() and sample():
import numpy as np
import pandas as pd
# first, some sample data
# generate filename fields
positions = ['center','left','right']
N = 100
fnames = ['{}_{}.jpg'.format(loc, np.random.randint(100)) for loc in np.repeat(positions, N)]
df = pd.DataFrame(np.array(fnames).reshape(3,100).T, columns=positions)
# generate numeric fields
values = [0,1,2,3,4]
probas = [.5,.2,.1,.1,.1]
df['steering'] = np.random.choice(values, p=probas, size=N)
df['throttle'] = np.random.choice(values, p=probas, size=N)
df['brake'] = np.random.choice(values, p=probas, size=N)
print(df.shape)
(100, 6)
The first few rows of sample output:
df.head()
center left right steering throttle brake
0 center_72.jpg left_26.jpg right_59.jpg 3 3 0
1 center_75.jpg left_68.jpg right_26.jpg 0 0 2
2 center_29.jpg left_8.jpg right_88.jpg 0 1 0
3 center_22.jpg left_26.jpg right_23.jpg 1 0 0
4 center_88.jpg left_0.jpg right_56.jpg 4 1 0
5 center_93.jpg left_18.jpg right_15.jpg 0 0 0
Now drop all but 10% of rows with steering==0:
newdf = pd.concat([df.loc[df.steering!=0],
df.loc[df.steering==0].sample(frac=0.1)])
With the probability weightings I used in this example, you'll see somewhere between 50-60 remaining entries in newdf, with about 5 steering==0 cases remaining.
Using a mask on steering combined with a random number should work:
df = df[(df.steering != 0) | (np.random.rand(len(df)) < 0.1)]
This does generate some extra random values, but it's nice and compact.
Edit: That said, I tried your example code and it worked as well. My guess is that the error comes from your df.query() statement returning an empty DataFrame, which probably means that the steering column does not contain any zeros, or alternatively that the column was read as strings rather than numbers. Try converting the column to numeric before running the above snippet.
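As a small sketch of that conversion (pd.to_numeric with errors='coerce' turns non-numeric strings into NaN rather than raising):
import numpy as np
import pandas as pd

# force the column to numeric before filtering; non-numeric strings become NaN
df["steering"] = pd.to_numeric(df["steering"], errors="coerce")
df = df[(df.steering != 0) | (np.random.rand(len(df)) < 0.1)]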
