How to insert data from one table into another according to a data condition (python-3.x)

I have a first table:
user_cnt cost acquisition_cost
channel
facebook_ads 2726 2140.904643 0.79
instagram_new_adverts 3347 2161.441691 0.65
yandex_direct 4817 2233.111449 0.46
youtube_channel_reklama 2686 1068.119204 0.40
and a second:
user_id profit source cost_per_user income
8a28b283ac30 0.91 facebook_ads ? 0.12
d7cf130a0105 0.63 youtube_channel ? 0.17
The second table has more than 200k rows; I showed only two. I need to put the "acquisition_cost" value from the first table into the "cost_per_user" column of the second table, matching on the channel/source name. For instance, cost_per_user in the first row of the second table should be 0.79, because its source is facebook_ads.
I will be grateful if someone can help me to solve this task.
First of all, I tried a function:
def cost(source_type):
    if source_type == 'instagram_new_adverts':
        return 0.65
    elif source_type == 'facebook_ads':
        return 0.79
    elif source_type == 'youtube_channel_reklama':
        return 0.40
    else:
        return 0.46

target_build['cost_per_user'] = target_build['source'].apply(cost)
but I have to find another solution that doesn't hard-code the constants (return 0.65).
Another attempt was like this:
for row in first_table['channel'].unique():
    second_table.loc[second_table['source'] == row, 'cost_per_user'] = first_table['acquisition_cost']
This code works only for the first four rows; for the others it puts zero values.
And the last idea was:
second_table['cost_per_user'] = second_table['cost_per_user'].where(
    second_table['source'].isin(b.index), b['acquisition_cost'])
Again, it didn't work.

It looks like you are using pandas. One possible solution is to use the pandas version of an inner join: merge.
Example
Suppose you don't want to modify either your first or second table. You can create a temporary table holding just channel and acquisition_cost from the first table, with the columns renamed to source and cost_per_user. This can be implemented in various ways; one possible way, assuming channel is the index, is shown below.
temp_df = first_df.acquisition_cost.reset_index().rename(
    columns={'channel': 'source', 'acquisition_cost': 'cost_per_user'},
)
temp_df looks like this
source cost_per_user
0 facebook_ads 0.79
1 instaram_new_adverts 0.65
2 yandex_direct 0.46
3 youtube_channel_reklama 0.40
Say your second table looks like this:
user_id profit source income
0 c3519e80c071 0.773956 yandex_direct 0.227239
1 cc39ba469a08 0.438878 instaram_new_adverts 0.554585
2 a44a621e0222 0.858598 facebook_ads 0.063817
3 9dbf921b0959 0.697368 youtube_channel_reklama 0.827631
4 d45bf8fcab75 0.094177 youtube_channel_reklama 0.631664
5 57dbe1efd8b1 0.975622 yandex_direct 0.758088
6 1e0e3f1e13f7 0.761140 instaram_new_adverts 0.354526
7 27a7a7470ef4 0.786064 youtube_channel_reklama 0.970698
8 360dfd543fb5 0.128114 yandex_direct 0.893121
9 a31f46c26abb 0.450386 instaram_new_adverts 0.778383
You can run the merge call to attach cost_per_user to each source.
new_df = pd.merge(left=second_df, right=temp_df, on='source', how='inner')
new_df would look like this
user_id profit source income cost_per_user
0 c3519e80c071 0.773956 yandex_direct 0.227239 0.46
1 57dbe1efd8b1 0.975622 yandex_direct 0.758088 0.46
2 360dfd543fb5 0.128114 yandex_direct 0.893121 0.46
3 cc39ba469a08 0.438878 instaram_new_adverts 0.554585 0.65
4 1e0e3f1e13f7 0.761140 instaram_new_adverts 0.354526 0.65
5 a31f46c26abb 0.450386 instaram_new_adverts 0.778383 0.65
6 a44a621e0222 0.858598 facebook_ads 0.063817 0.79
7 9dbf921b0959 0.697368 youtube_channel_reklama 0.827631 0.40
8 d45bf8fcab75 0.094177 youtube_channel_reklama 0.631664 0.40
9 27a7a7470ef4 0.786064 youtube_channel_reklama 0.970698 0.40
Notes
A complication arises if the source column of the second table does not have a one-to-one match with channel in the first table. You will need to read the doc on merge to decide how to handle that situation (e.g. use an inner join to discard any mismatch, or a left join to keep unmatched sources that then receive no cost_per_user).
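For reference, the same lookup can also be written without the temporary table using Series.map, which behaves like a left join (unmatched sources simply get NaN). A minimal sketch, assuming channel is the index of first_df as above:
# Map each source to its acquisition_cost via the first table's index.
second_df['cost_per_user'] = second_df['source'].map(first_df['acquisition_cost'])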

Related

How to get the column name of a dataframe from values in a numpy array

I have a df with 15 columns:
df.columns:
0 class
1 name
2 location
3 income
4 edu_level
--
14 marital_status
After some transformations I got a numpy.ndarray with shape (15, 3), named loads:
0.52 0.33 0.09
0.20 0.53 0.23
0.60 0.28 0.23
0.13 0.45 0.41
0.49 0.9
(and so on, 15 rows in total)
So, 3 columns with 15 values.
What I need to do:
I want to get the df column names for the values in the first column of loads that are greater than 0.50.
For this example, the columns of df related to the first column of loads with values higher than 0.5 should return:
0 class
2 location
Same for the second column of loads, should return:
1 name
3 income
4 edu_level
and the same logic to the 3rd column of loads.
I managed to get the numpy array loads the way I need it, but I am having a bad time with this last part. I know I could simply pick the columns manually, but that will be a hard task when df has more than 15 features.
Can anyone help me, please?
Given your threshold you can create a boolean array in order to filter df.columns:
threshold = 0.5
for j in range(loads.shape[1]):
    print(df.columns[loads[:, j] > threshold])
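A minimal runnable version of the same idea, collecting one list of matching column names per column of loads (the df and loads below are hypothetical stand-ins, shortened to five features):
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's objects (5 features instead of 15).
df = pd.DataFrame(columns=['class', 'name', 'location', 'income', 'edu_level'])
loads = np.array([[0.52, 0.33, 0.09],
                  [0.20, 0.53, 0.23],
                  [0.60, 0.28, 0.23],
                  [0.13, 0.45, 0.41],
                  [0.49, 0.90, 0.10]])

threshold = 0.5
# One boolean mask per column of `loads`, used to filter df.columns.
matches = {j: df.columns[loads[:, j] > threshold].tolist()
           for j in range(loads.shape[1])}
print(matches)  # {0: ['class', 'location'], 1: ['name', 'edu_level'], 2: []}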

Pandas Dataframe: Dropping Selected rows with 0.0 float type values

I have a dataset that contains an amount column of float type. Some of the rows contain values of 0.00, and because they skew the dataset, I need to drop them. I have temporarily set "Amount" as the index and sorted the values as well.
Afterwards, I attempted to drop the rows after subsetting with iloc, but keep getting an error message of the form ValueError: Buffer has wrong number of dimensions (expected 1, got 3)
mortgage = mortgage.set_index('Gross Loan Amount').sort_values('Gross Loan Amount')
mortgage.drop([mortgage.loc[0.0]])
I equally tried this:
mortgage.drop(mortgage.loc[0.0])
It flagged an error of the form KeyError: "[Column_names] not found in axis"
Please how else can I accomplish the task?
You could make a boolean frame and then use any:
df = df[~(df == 0).any(axis=1)]
In this code, all rows that have at least one zero are removed.
Let me see if I get your problem. I created this sample dataset:
df = pd.DataFrame({'Values': [200.04,100.00,0.00,150.15,69.98,0.10,2.90,34.6,12.6,0.00,0.00]})
df
Values
0 200.04
1 100.00
2 0.00
3 150.15
4 69.98
5 0.10
6 2.90
7 34.60
8 12.60
9 0.00
10 0.00
Now, in order to get rid of the 0.00 values, you just have to do this:
df = df[df['Values'] != 0.00]
Output:
df
Values
0 200.04
1 100.00
3 150.15
4 69.98
5 0.10
6 2.90
7 34.60
8 12.60
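One caveat: both answers compare floats to zero with exact equality (== / !=), which works for values read straight from a file but can miss near-zero results of prior arithmetic. A tolerance-based variant, sketched with numpy.isclose:
import numpy as np

# Keep only rows whose value is not within floating-point tolerance of zero.
df = df[~np.isclose(df['Values'], 0.0)]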

Create unique list from 2 columns and sum values per row based on that unique list from 2 value columns

Having scoured numerous posts, I am still struggling to find a solution for a report I am trying to transition over to Power BI from MS Excel.
Problem
Create a table in the report section of Power BI which has a unique list of currencies (based on 2 columns) and their corresponding FX exposure, defined per currency leg from 2 columns. Below I have shown the source data and the workings I use in Excel, which I am trying to replicate.
Source data (from database table)
a             b           c           d             e             f                g
Instrument    Currency 1  Currency 2  FX nominal 1  FX nominal 2  FXNom1 - Gross   FXNom2 - Gross
FWD EUR/USD   EUR         USD         -7.965264529   7.90296523   7.97             7.90
FWD USD/JPY   USD         JPY          1.030513307  -1.070305687  1.03             1.07
Instrument 1  USD                      1.75862819                 1.76             0.00
Instrument 2  USD         TRY          0              3.45E-04    0.00             0.00
Instrument 3  JPY                      1.121782037                1.12             0.00
Instrument 4  EUR                      6.2505079                  6.25             0.00
FWD EUR/CNH   EUR         CNH          0.007591392    3.00E-09    0.01             0.00
Instrument 5  RUB                      6.209882675                6.21             0.00
F2 = ABS(FX nominal 1)
G2 = ABS(FX nominal 2)
Report output in Excel
a      b       c       d       e
FX     Long    Short   Net     Gross
0       0.00    0.00    0.00    0.00
RUB     6.21    0.00    6.21    6.21
EUR     6.26   -7.97   -1.71   14.22
JPY     1.12   -1.07    0.05    2.19
USD    10.69    0.00   10.69   10.69
CNH     0.00    0.00    0.00    0.00
TRY     0.00    0.00    0.00    0.00
My Excel formulas to recreate what I am looking for are below.
A2: =IFERROR(LOOKUP(2, 1/(COUNTIF(Report!$A$1:A1,Data!$B$2:$B$553)=0), Data!$B$2:$B$553), LOOKUP(2, 1/(COUNTIF(Report!$A$1:A1, Data!$C$2:$C$553)=0), Data!$C$2:$C$553))
B2: =((SUMIFS(Data!$D$2:$D$553, Data!$B$2:$B$553, Report!$A2, Data!$D$2:$D$553, ">0"))+(SUMIFS(Data!$E$2:$E$553, Data!$C$2:$C$553, Report!$A2, Data!$E$2:$E$553, ">0")))
C2: =((SUMIFS(Data!$D$2:$D$553, Data!$B$2:$B$553, Report!$A3, Data!$D$2:$D$553, "<0"))+(SUMIFS(Data!$E$2:$E$553, Data!$C$2:$C$553, Report!$A3, Data!$E$2:$E$553, "<0")))
D2: =(SUMIF(Data!$B$1:$B$553,Report!$A3,Data!$D$1:$D$553)+SUMIF(Data!$C$1:$C$553,Report!$A3,Data!$E$1:$E$553))
E2: =(SUMIF(Data!$B$1:$B$554,Report!$A3,Data!$F$1:$F$554)+SUMIF(Data!$C$1:$C$554,Report!$A3,Data!$G$1:$G$554))
Now I believe I've managed to find a hack by using the UNIQUE/SELECTCOLUMNS functions, but when you try to graph the output it is very small (as if there is other data it is trying to find behind the scenes). Note I tend to filter on date to get the output I need (this is mapped using relationships across other data tables).
FX =
DISTINCT (
    UNION (
        SELECTCOLUMNS ( DATA, "Date", [DATE], "Currency", [CURRENCY1], "FXNom", [FXNOMINAL1] ),
        SELECTCOLUMNS ( DATA, "Date", [DATE], "Currency", [CURRENCY2], "FXNom", [FXNOMINAL2] )
    )
)
If anyone has any ideas I would be very grateful as I still feel my workaround is more of a lucky hack.
Thanks!
The approach that you're using looks nearly ideal. From a dimensional-model perspective, you want one column for values and one column for currency labels, so selecting those pairs as separate tables and appending them with UNION is the right way to go. Generally, I think it's better to do all the transformation you can in Power Query; using DAX this way can lead to some limitations.
But if we're going with DAX, I do think you want to get rid of DISTINCT. It could collapse identical positions into a single row, and you'd lose data that way.
FX =
UNION (
    SELECTCOLUMNS ( FX_Raw, "Date", "FakeDate", "Currency", [CURRENCY 1], "FXNom", [FX nominal 1] ),
    SELECTCOLUMNS ( FX_Raw, "Date", "FakeDate", "Currency", [CURRENCY 2], "FXNom", [FX nominal 2] )
)
And then a few measures:
Long =
CALCULATE ( SUM ( FX[FXNom] ), FX[FXNom] >= 0 )

Short =
CALCULATE ( SUM ( FX[FXNom] ), FX[FXNom] < 0 )

Gross =
SUMX ( FX, IF ( FX[FXNom] > 0, FX[FXNom], 0 - FX[FXNom] ) )

Net =
SUM ( FX[FXNom] )
This seems to produce the desired result.

Creating a new column into a dataframe based on conditions

For the dataframe df :
dummy_data1 = {'category': ['White', 'Black', 'Hispanic', 'White'],
               'Pop': ['75', '85', '90', '100'],
               'White_ratio': [0.6, 0.4, 0.7, 0.35],
               'Black_ratio': [0.3, 0.2, 0.1, 0.45],
               'Hispanic_ratio': [0.1, 0.4, 0.2, 0.20]}
df = pd.DataFrame(dummy_data1, columns=['category', 'Pop', 'White_ratio', 'Black_ratio', 'Hispanic_ratio'])
I want to add a new column, 'pop_n', to this data frame by first checking the category and then multiplying the value in 'Pop' by the corresponding ratio column. For the first row, the category is 'White', so it should multiply 75 by 0.60 and put 45 in the pop_n column.
I thought about writing something like :
df['pop_n']= (df['Pop']*df['White_ratio']).where(df['category']=='W')
This works, but just for one category.
I will appreciate any help with this.
Thanks.
Using DataFrame.filter and DataFrame.lookup:
First we use filter to get the columns with ratio in the name. Then split and keep the first word before the underscore only.
Finally we use lookup to match the category values to these columns.
# df['Pop'] = df['Pop'].astype(int)
df2 = df.filter(like='ratio').rename(columns=lambda x: x.split('_')[0])
df['pop_n'] = df2.lookup(df.index, df['category']) * df['Pop']
category Pop White_ratio Black_ratio Hispanic_ratio pop_n
0 White 75 0.60 0.30 0.1 45.0
1 Black 85 0.40 0.20 0.4 17.0
2 Hispanic 90 0.70 0.10 0.2 18.0
3 White 100 0.35 0.45 0.2 35.0
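Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On current pandas, the same row-wise pick can be done with plain numpy indexing; a sketch reusing the df2 built above:
import numpy as np

# Position of each row's category among df2's ratio columns.
col_idx = df2.columns.get_indexer(df['category'])
# Pick one ratio per row, then scale by Pop.
df['pop_n'] = df2.to_numpy()[np.arange(len(df)), col_idx] * df['Pop'].astype(int)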
Locate the columns that have underscores in their names:
to_rename = {x: x.split("_")[0] for x in df if "_" in x}
Find the matching factors:
stack = df.rename(columns=to_rename).set_index('category').stack()
factors = stack[[cat == col for cat, col in stack.index]].reset_index(drop=True)
Multiply the original data by the factors:
df['pop_n'] = df['Pop'].astype(int) * factors
# category Pop White_ratio Black_ratio Hispanic_ratio pop_n
#0 White 75 0.60 0.30 0.1 45
#1 Black 85 0.40 0.20 0.4 17
#2 Hispanic 90 0.70 0.10 0.2 18
#3 White 100 0.35 0.45 0.2 35

Which statsmodels ANOVA model for within- and between-subjects design?

I have a classic ANOVA design: two experimental conditions with two levels each; each participant answers in two of the four resulting conditions. A sample of my data looks like this:
participant_ID Condition_1 Condition_2 dependent_var
1 1 1 0.71
1 2 1 0.43
2 1 1 0.77
2 2 1 0.37
3 1 1 0.58
3 2 1 0.69
4 2 1 0.72
4 1 1 0.12
26 2 2 0.91
26 1 2 0.53
27 1 2 0.29
27 2 2 0.39
28 2 2 0.75
28 1 2 0.51
29 1 2 0.42
29 2 2 0.31
Using statsmodels, I wish to identify the effects of both conditions on the dependent variable, allowing for the fact that each participant answers twice and that there may be interactions. My expectation was to use the repeated-measures ANOVA option as follows:
from statsmodels.stats.anova import AnovaRM

aovrm = AnovaRM(data, 'dependent_var', 'participant_ID',
                within=['Condition_1'], between=['Condition_2'],
                aggregate_func='mean').fit()
However, when I do this, I get the following error:
NotImplementedError: Between subject effect not yet supported!
Does anyone know of a workaround for this that doesn't involve learning R? My instinct would be to try a mixed linear model, but I don't know how to account for the fact that each participant answered twice.
Apologies if this turns out to really be a Cross Validated question!
You could try out the pingouin package: https://pingouin-stats.org/index.html
It seems to cover mixed ANOVAs, which are not yet fully implemented in statsmodels.
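For instance, pingouin's mixed_anova covers this split-plot layout directly; a minimal sketch, assuming the long-format data frame from the question:
import pingouin as pg

# Mixed ANOVA: Condition_1 varies within subjects, Condition_2 between them.
aov = pg.mixed_anova(data=data, dv='dependent_var',
                     within='Condition_1', between='Condition_2',
                     subject='participant_ID')
print(aov)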
