How to get the column name of a dataframe from values in a numpy array - python-3.x

I have a df with 15 columns:
df.columns:
0 class
1 name
2 location
3 income
4 edu_level
--
14 marital_status
After some transformations I got a numpy.ndarray with shape (15, 3) named loads:
0.52 0.33 0.09
0.20 0.53 0.23
0.60 0.28 0.23
0.13 0.45 0.41
0.49 0.9
... (and so on, 15 rows in total)
So: loads has 3 columns, each with 15 values (one per df column).
What I need to do:
I want to get the names of the df columns for which the value in the first column of loads is greater than 0.50.
For this example, the df columns related to values in the first column of loads higher than 0.5 should return:
0 class
2 location
Same for the second column of loads, should return:
1 name
3 income
4 edu_level
and the same logic to the 3rd column of loads.
I managed to get the numpy array loads the way I need it, but I am having a bad time with this last part. I know I can simply pick the columns manually, but that becomes a hard task when df has more than 15 features.
Can anyone help me, please?

Given your threshold you can create a boolean mask in order to filter df.columns:
threshold = .5
for j in range(loads.shape[1]):
    print(df.columns[loads[:, j] > threshold])
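For reference, a minimal self-contained sketch of the same idea (the column names and loads values below are made up just to make it runnable):
import numpy as np
import pandas as pd

# hypothetical column names and loadings, for illustration only
columns = pd.Index(['class', 'name', 'location', 'income', 'edu_level'])
loads = np.array([[0.52, 0.33, 0.09],
                  [0.20, 0.53, 0.23],
                  [0.60, 0.28, 0.23],
                  [0.13, 0.45, 0.41],
                  [0.49, 0.90, 0.10]])

threshold = 0.5
# map each column of loads to the df columns whose loading exceeds the threshold
selected = {j: list(columns[loads[:, j] > threshold]) for j in range(loads.shape[1])}
print(selected)
# {0: ['class', 'location'], 1: ['name', 'edu_level'], 2: []}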

Related

Python/how to insert data from one table into another according to the data condition

I have the first table:
user_cnt cost acquisition_cost
channel
facebook_ads 2726 2140.904643 0.79
instagram_new_adverts 3347 2161.441691 0.65
yandex_direct 4817 2233.111449 0.46
youtube_channel_reklama 2686 1068.119204 0.40
and the second
user_id profit source cost_per_user income
8a28b283ac30 0.91 facebook_ads ? 0.12
d7cf130a0105 0.63 youtube_channel ? 0.17
The second table has more than 200k rows, but I showed only two. I need to put the "acquisition_cost" value from the first table into the "cost_per_user" column of the second table according to the name of the channel/source. For instance, in the first row of the second table cost_per_user should have the value 0.79, because the source is facebook_ads.
I will be grateful if someone can help me to solve this task.
First of all, I tried a function:
def cost(source_type):
    if source_type == 'instagram_new_adverts':
        return 0.65
    elif source_type == 'facebook_ads':
        return 0.79
    elif source_type == 'youtube_channel_reklama':
        return 0.40
    else:
        return 0.46

target_build['cost_per_user'] = target_build['source'].apply(cost)
but I have to find another solution that does not hard-code the constants (return 0.65).
Another attempt was like this:
for row in first_table['channel'].unique():
    second_table.loc[second_table['source'] == row, 'cost_per_user'] = first_table['acquisition_cost']
This code works only for the first four lines; for the others it puts a zero value.
And the last idea was:
second_table['cost_per_user'] = second_table['cost_per_user'].where(
    second_table['source'].isin(b.index), b['acquisition_cost'])
and again it didn't work.
It looks like you are using pandas. One possible solution is to use the pandas version of an inner join: merge.
Example
Suppose you don't want to modify either your first or second table. You can create a temporary table including just channel and acquisition_cost from the first table, renaming those columns to source and cost_per_user. This can be implemented in various ways; one possible way, presuming channel is the index of the first table, is shown below.
temp_df = first_df.acquisition_cost.reset_index().rename(
    columns={'channel': 'source', 'acquisition_cost': 'cost_per_user'},
)
temp_df looks like this
source cost_per_user
0 facebook_ads 0.79
1 instaram_new_adverts 0.65
2 yandex_direct 0.46
3 youtube_channel_reklama 0.40
Say your second table looks like this:
user_id profit source income
0 c3519e80c071 0.773956 yandex_direct 0.227239
1 cc39ba469a08 0.438878 instaram_new_adverts 0.554585
2 a44a621e0222 0.858598 facebook_ads 0.063817
3 9dbf921b0959 0.697368 youtube_channel_reklama 0.827631
4 d45bf8fcab75 0.094177 youtube_channel_reklama 0.631664
5 57dbe1efd8b1 0.975622 yandex_direct 0.758088
6 1e0e3f1e13f7 0.761140 instaram_new_adverts 0.354526
7 27a7a7470ef4 0.786064 youtube_channel_reklama 0.970698
8 360dfd543fb5 0.128114 yandex_direct 0.893121
9 a31f46c26abb 0.450386 instaram_new_adverts 0.778383
You can run the merge call to attach cost_per_user to each source.
new_df = pd.merge(left=second_df, right=temp_df, on='source', how='inner')
new_df would look like this
user_id profit source income cost_per_user
0 c3519e80c071 0.773956 yandex_direct 0.227239 0.46
1 57dbe1efd8b1 0.975622 yandex_direct 0.758088 0.46
2 360dfd543fb5 0.128114 yandex_direct 0.893121 0.46
3 cc39ba469a08 0.438878 instaram_new_adverts 0.554585 0.65
4 1e0e3f1e13f7 0.761140 instaram_new_adverts 0.354526 0.65
5 a31f46c26abb 0.450386 instaram_new_adverts 0.778383 0.65
6 a44a621e0222 0.858598 facebook_ads 0.063817 0.79
7 9dbf921b0959 0.697368 youtube_channel_reklama 0.827631 0.40
8 d45bf8fcab75 0.094177 youtube_channel_reklama 0.631664 0.40
9 27a7a7470ef4 0.786064 youtube_channel_reklama 0.970698 0.40
Notes
Complications would arise if the source column of the second table does not have a one-to-one match to channel in the first table. You will need to read the documentation on merge to decide how you want to handle that situation (e.g. use an inner join to discard any mismatch, or a left join to keep unmatched sources, which then receive no cost_per_user).
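As a rough sketch of those two options (reusing the first_df, second_df and temp_df names from the example above; map works here because first_df is indexed by channel):
# left join: keep every row of second_df, unmatched sources get NaN in cost_per_user
new_df = pd.merge(left=second_df, right=temp_df, on='source', how='left')

# alternative without a temporary table: map the source column through the
# acquisition_cost Series of the first table
second_df['cost_per_user'] = second_df['source'].map(first_df['acquisition_cost'])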

Pandas Dataframe: Dropping Selected rows with 0.0 float type values

I have a dataset that contains an amount column of float type. Some of the rows contain values of 0.00, and because they skew the dataset I need to drop them. I have temporarily set "Amount" as the index and sorted the values as well.
Afterwards, I attempted to drop the rows after subsetting with iloc, but I keep getting an error message of the form ValueError: Buffer has wrong number of dimensions (expected 1, got 3)
mortgage = mortgage.set_index('Gross Loan Amount').sort_values('Gross Loan Amount')
mortgage.drop([mortgage.loc[0.0]])
I also tried this:
mortgage.drop(mortgage.loc[0.0])
which flagged an error of the form KeyError: "[Column_names] not found in axis".
How else can I accomplish this task?
You could make a boolean frame and then use any:
df = df[~(df == 0).any(axis=1)]
In this code, all rows that have at least one zero in any column are removed.
Let me see if I get your problem. I created this sample dataset:
df = pd.DataFrame({'Values': [200.04,100.00,0.00,150.15,69.98,0.10,2.90,34.6,12.6,0.00,0.00]})
df
Values
0 200.04
1 100.00
2 0.00
3 150.15
4 69.98
5 0.10
6 2.90
7 34.60
8 12.60
9 0.00
10 0.00
Now, in order to get rid of the 0.00 values, you just have to do this:
df = df[df['Values'] != 0.00]
Output:
df
Values
0 200.04
1 100.00
3 150.15
4 69.98
5 0.10
6 2.90
7 34.60
8 12.60
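Applied to the original question's data (assuming the column is called 'Gross Loan Amount', as in the snippets above), a sketch without moving the column into the index could be:
# keep only the rows whose loan amount is non-zero; no need to set it as the index
mortgage = mortgage[mortgage['Gross Loan Amount'] != 0.0]

# then sort by the same column, if the ordering is still wanted
mortgage = mortgage.sort_values('Gross Loan Amount')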

Creating a new column into a dataframe based on conditions

For the dataframe df:
dummy_data1 = {'category': ['White', 'Black', 'Hispanic', 'White'],
               'Pop': ['75', '85', '90', '100'],
               'White_ratio': [0.6, 0.4, 0.7, 0.35],
               'Black_ratio': [0.3, 0.2, 0.1, 0.45],
               'Hispanic_ratio': [0.1, 0.4, 0.2, 0.20]}
df = pd.DataFrame(dummy_data1, columns=['category', 'Pop', 'White_ratio', 'Black_ratio', 'Hispanic_ratio'])
I want to add a new column, 'pop_n', to this dataframe by first checking the category and then multiplying the value in 'Pop' by the corresponding ratio column. For the first row, the category is 'White', so it should multiply 75 by 0.60 and put 45 in the pop_n column.
I thought about writing something like:
df['pop_n'] = (df['Pop']*df['White_ratio']).where(df['category']=='W')
This works, but only for one category.
I would appreciate any help with this.
Thanks.
Using DataFrame.filter and DataFrame.lookup:
First we use filter to get the columns with ratio in the name. Then split and keep the first word before the underscore only.
Finally we use lookup to match the category values to these columns.
# df['Pop'] = df['Pop'].astype(int)  # Pop holds strings in the sample data, so convert before multiplying
df2 = df.filter(like='ratio').rename(columns=lambda x: x.split('_')[0])
df['pop_n'] = df2.lookup(df.index, df['category']) * df['Pop']
category Pop White_ratio Black_ratio Hispanic_ratio pop_n
0 White 75 0.60 0.30 0.1 45.0
1 Black 85 0.40 0.20 0.4 17.0
2 Hispanic 90 0.70 0.10 0.2 18.0
3 White 100 0.35 0.45 0.2 35.0
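Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0; on newer versions an equivalent sketch using plain numpy indexing on the same df2 could be:
import numpy as np

# same idea as lookup: for each row, pick the ratio column named by 'category'
df2 = df.filter(like='ratio').rename(columns=lambda x: x.split('_')[0])
rows = np.arange(len(df2))
cols = df2.columns.get_indexer(df['category'])
df['pop_n'] = df2.to_numpy()[rows, cols] * df['Pop'].astype(int)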
Locate the columns that have underscores in their names:
to_rename = {x: x.split("_")[0] for x in df if "_" in x}
Find the matching factors:
stack = df.rename(columns=to_rename)\
          .set_index('category').stack()
factors = stack[[x[0] == x[1] for x in stack.index]]\
          .reset_index(drop=True)
Multiply the original data by the factors:
df['pop_n'] = df['Pop'].astype(int) * factors
# category Pop White_ratio Black_ratio Hispanic_ratio pop_n
#0 White 75 0.60 0.30 0.1 45
#1 Black 85 0.40 0.20 0.4 17
#2 Hispanic 90 0.70 0.10 0.2 18
#3 White 100 0.35 0.45 0.2 35

Python Life Expectancy

Trying to use pandas to calculate life expectancy with complex equations.
Multiplying or dividing one column by another is not difficult to do.
My data is:
   A     b
1  0.99  1000
2  0.95  = 0.99*1000 = 990
3  0.93  = 0.95*990
Field A is populated and field b has only the initial 1000.
Each following value of b is the previous A times the previous b (b2 = A1*b1).
I tried the shift function, but got a result for b2 only and zeros for the rest. Any help please, thanks. Mazin
IIUC, if you're starting with:
>>> df
A b
0 0.99 1000.0
1 0.95 NaN
2 0.93 NaN
Then you can do:
df.loc[df.b.isnull(),'b'] = (df.A.cumprod()*1000).shift()
>>> df
A b
0 0.99 1000.0
1 0.95 990.0
2 0.93 940.5
Or more generally:
df['b'] = (df.A.cumprod()*df.b.iloc[0]).shift().fillna(df.b.iloc[0])
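As a quick, self-contained usage check of that general form on the sample frame above (just a sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.99, 0.95, 0.93], 'b': [1000.0, np.nan, np.nan]})

# each b is the previous b multiplied by the previous A, seeded with b.iloc[0]
df['b'] = (df.A.cumprod() * df.b.iloc[0]).shift().fillna(df.b.iloc[0])
print(df)
#       A       b
# 0  0.99  1000.0
# 1  0.95   990.0
# 2  0.93   940.5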

Effective reasonable indexing for numeric vector search?

I have a long numeric table where 7 columns are a key and 4 columns are the values to find.
Actually, I have rendered an object at different distances and perspective angles and have calculated Hu moments for its contour. But this is not important to the question, just a sample to make it easier to imagine.
So, when I have 7 values, I need to scan the table, find the closest values in those 7 columns and extract the corresponding 4 values.
The aspects of the task to consider are as follows:
1) the numbers have errors
2) the scale in the function domain is not the same as the scale in the function values; i.e. the "distance" from a point in the 7-dimensional space should depend on how it affects those 4 values
3) the search should be fast
So the question is: is there an algorithm out there to solve this task efficiently, i.e. perform some indexing on those 7 columns, but not the way conventional databases do it, taking the points above into account?
If I understand the problem correctly, you might consider using scipy.cluster.vq (vector quantization):
Suppose your 7 numeric columns look like this (let's call the array code_book):
import scipy.cluster.vq as vq
import scipy.spatial as spatial
import numpy as np
np.random.seed(2013)
np.set_printoptions(precision=2)
code_book = np.random.random((3,7))
print(code_book)
# [[ 0.68 0.96 0.27 0.6 0.63 0.24 0.7 ]
# [ 0.84 0.6 0.59 0.87 0.7 0.08 0.33]
# [ 0.08 0.17 0.67 0.43 0.52 0.79 0.11]]
Suppose the associated 4 columns of values look like this:
values = np.arange(12).reshape(3,4)
print(values)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
And finally, suppose we have some "observations" of 7-column values like this:
observations = np.random.random((5,7))
print(observations)
# [[ 0.49 0.39 0.41 0.49 0.9 0.89 0.1 ]
# [ 0.27 0.96 0.16 0.17 0.72 0.43 0.64]
# [ 0.93 0.54 0.99 0.62 0.63 0.81 0.36]
# [ 0.17 0.45 0.84 0.02 0.95 0.51 0.26]
# [ 0.51 0.8 0.2 0.9 0.41 0.34 0.36]]
To find the 7-valued row in code_book which is closest to each observation, you could use vq.vq:
index, dist = vq.vq(observations, code_book)
print(index)
# [2 0 1 2 0]
The index values refer to rows in code_book. However, if the rows in values are ordered the same way as code_book, we can "lookup" the associated value with values[index]:
print(values[index])
# [[ 8 9 10 11]
# [ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]
# [ 0 1 2 3]]
The above assumes you have all your observations arranged in an array. Thus, to find all the indices you need only one call to vq.vq.
However, if you obtain the observations one at a time and need to find the closest row in code_book before going on to the next observation, then it would be inefficient to call vq.vq each time. Instead, generate a KDTree once, and then find the nearest neighbor(s) in the tree:
tree = spatial.KDTree(code_book)
for observation in observations:
distances, indices = tree.query(observation)
print(indices)
# 2
# 0
# 1
# 2
# 0
Note that the number of points in your code_book (N) must be large compared to the dimension of the data (e.g. N >> 2**7) for the KDTree to be fast compared to simple exhaustive search.
Using vq.vq or KDTree.query may or may not be faster than exhaustive search, depending on the size of your data (code_book and observations). To find out which is faster, be sure to benchmark these versus an exhaustive search using timeit.
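One way to handle point 2 of the question (columns with different scales or importance), which the code above does not address: rescale each of the 7 columns before building the KDTree, which is equivalent to using a weighted Euclidean distance. A rough sketch with hypothetical weights:
import numpy as np
import scipy.spatial as spatial

# hypothetical per-column weights: a larger weight makes that column matter more
weights = np.array([5.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
scale = np.sqrt(weights)  # scaling by sqrt(w) yields w*(x - k)**2 terms in the squared distance

tree = spatial.KDTree(code_book * scale)
distances, indices = tree.query(observations * scale)
print(values[indices])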
I don't know if I understood your question well, but I will try to give an answer.
For each row K in the table, compute the distance of your key from the key in that row:
((X1-K1)^2 + (X2-K2)^2 + (X3-K3)^2 + (X4-K4)^2 + (X5-K5)^2 + (X6-K6)^2 + (X7-K7)^2)^0.5
where {X1, X2, X3, X4, X5, X6, X7} is your key and {K1, K2, K3, K4, K5, K6, K7} is the key at row K.
You could make one factor of the key more or less relevant than the others by multiplying it while computing the distance; for example, you could replace (X1-K1)^2 in the formula above with 5*(X1-K1)^2 to make that factor more influential.
Store the distance in one variable and the row number in a second variable.
Do the same with the following rows, and if the new distance is lower than the stored one, replace the distance and the row number.
When you have checked all the rows in your table, the second variable will hold the row nearest to the key.
Here is some pseudo-code (written as Python):
row = 0
key = [0.0] * 7                      # suppose this is already filled with some values
closest_distance = float('inf')
closest_row = 0
while row < number_of_rows:
    # suppose distance() returns the distance between two 7-value keys, and
    # table has number_of_rows rows and 7 + 4 columns
    new_distance = distance(key, table[row][0:7])
    if new_distance < closest_distance:
        closest_distance = new_distance
        closest_row = row
    row += 1
value_found = table[closest_row][7:11]   # this should be the value you were looking for
I know it isn't fast, but it is the best I could do; I hope it helped.
P.S. I haven't considered measurement errors, I know.
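For completeness, a vectorized form of this exhaustive search (a sketch reusing the code_book, values and observations arrays defined in the previous answer) could look like:
import numpy as np

# squared Euclidean distances from every observation to every code_book row
diffs = observations[:, np.newaxis, :] - code_book[np.newaxis, :, :]
dists = np.einsum('ijk,ijk->ij', diffs, diffs)
nearest = dists.argmin(axis=1)
print(values[nearest])   # the 4 associated values for each observation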
