Machine learning problem: I made a new column in test data but instead of median values it is filled with NaN - python-3.x

Here I am trying to predict the sale price by taking the median of SALES_PRICE with respect to AREA and MZZONE.
Here are the values:
combo=pd.pivot_table(train,values=['SALES_PRICE'],index=['MZZONE','AREA'],aggfunc='median')
combo
Output:
                  SALES_PRICE
MZZONE AREA
A      Adyar        7144042.5
       Karapakkam   5468500.0
       Velachery    8428745.0
C      Adyar        7877645.0
       Karapakkam   6443000.0
       Velachery    9170660.0
I      Adyar        8785350.0
But when I try to put it in the test data by making a new column, the whole column is filled with NaN.
Here is the code I used to put the median values in the test data:
test['super_mean'] = 0
s2 = 'MZZONE'
s1 = 'AREA'
for i in test[s1].unique():
    for j in test[s2].unique():
        test['super_mean'][(test[s1] == str(i)) & (test[s2] == str(j))] = train['SALES_PRICE'][(train[s1] == str(i)) & (train[s2] == str(i))].median()
Why is this happening?

You have a mistake in the code within the 'j' for loop: there is an 'i' where there should be a 'j'. Because of it, train[s2]==str(i) compares MZZONE against an AREA value, so the train selection is always empty and .median() returns NaN. This is the corrected loop:
test['super_mean'] = 0
s2 = 'MZZONE'
s1 = 'AREA'
for i in test[s1].unique():
    for j in test[s2].unique():
        test['super_mean'][(test[s1] == str(i)) & (test[s2] == str(j))] = train['SALES_PRICE'][(train[s1] == str(i)) & (train[s2] == str(j))].median()
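As a side note, chained indexing like test['super_mean'][mask] = ... can raise a SettingWithCopyWarning in pandas. A minimal sketch of an alternative that avoids both the loops and the warning, assuming the same train/test frames, is to merge the grouped medians back onto test:
import pandas as pd

# Median sale price per (MZZONE, AREA) pair, computed on the training data.
medians = (train.groupby(['MZZONE', 'AREA'])['SALES_PRICE']
                .median()
                .rename('super_mean')
                .reset_index())

# Left-join onto test; unmatched pairs stay NaN, which makes gaps easy to spot.
test = test.merge(medians, on=['MZZONE', 'AREA'], how='left')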

Related

What is the simplest way to complete a function on every row of a large table?

So I want to do a Fisher exact test (one-sided) on every row of a 3000+ row table with a format matching the example below:
gene    sample_alt  sample_ref  population_alt  population_ref
One     4           556         770             37000
Two     5           555         771             36999
Three   6           554         772             36998
I would ideally like to make another column of the table equivalent to
[(4+556)!(4+770)!(770+37000)!(556+37000)!]/[4!(556!)770!(37000!)(4+556+770+37000)!]
for the first row of data, and so on and so forth for each row of the table.
I know how to do a Fisher exact test in R for simple 2x2 tables, but I don't know how to apply the fisher.test() function to each row of a large table. I also can't use an Excel formula, because the factorials get so big that they exceed Excel's digit limit and produce a #NUM error. What's the best way to do this? Thanks in advance!
Beginning with a tab-delimited text file on the desktop (table.txt) with the same format as shown in the question:
if(!require(psych)){install.packages("psych")}

multiFisher = function(file="Desktop/table.txt", saveit=TRUE,
                       outfile="Desktop/table.csv", progress=TRUE,
                       verbose=FALSE, digits=3, ...)
{
  require(psych)

  Data = read.table(file, skip=1, header=FALSE,
                    col.names=c("Gene", "MD", "WTD", "MC", "WTC"), ...)

  if(verbose){print(str(Data))}

  Data$Fisher.p = NA
  Data$phi      = NA
  Data$OR1      = format(0.123, nsmall=3)   # forces the column to character
  Data$OR2      = NA

  if(progress){cat("\n")}

  for(i in 1:length(Data$Gene)){
    # Build the 2x2 contingency table for this gene
    Matrix = matrix(c(Data$WTC[i], Data$MC[i], Data$WTD[i], Data$MD[i]), nrow=2)
    Fisher = fisher.test(Matrix, alternative = 'greater')

    Data$Fisher.p[i] = signif(Fisher$p.value, digits=digits)
    Data$phi[i]      = phi(Matrix, digits=digits)

    OR1 = (Data$WTC[i]*Data$MD[i])/(Data$MC[i]*Data$WTD[i])
    OR2 = 1 / OR1

    Data$OR1[i] = format(signif(OR1, digits=digits), nsmall=3)
    Data$OR2[i] = signif(OR2, digits=digits)

    if(progress){cat(".")}
  }

  if(progress){cat("\n"); cat("\n")}

  if(saveit){write.csv(Data, outfile)}

  return(Data)
}

multiFisher()
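For readers working in Python instead of R, a minimal sketch of the same row-wise test using pandas and scipy.stats.fisher_exact (the file name and column names just follow the example table above; note that the orientation of the 2x2 table determines the direction of the one-sided test, so adjust it to match your hypothesis):
import pandas as pd
from scipy.stats import fisher_exact

# Tab-delimited file with the columns from the example table above.
df = pd.read_csv('table.txt', sep='\t')

def row_fisher(row):
    # 2x2 table: rows = sample vs population, columns = alt vs ref.
    table = [[row['sample_alt'], row['sample_ref']],
             [row['population_alt'], row['population_ref']]]
    odds_ratio, p_value = fisher_exact(table, alternative='greater')
    return p_value

df['fisher_p'] = df.apply(row_fisher, axis=1)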

Normalising units/Replace substrings based on lists using Python

I am trying to normalize weight units in a string.
E.g.:
1. SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre -> SUCO MARACUJA COM GENGIBRE PCS 300 ML
2. OVOS CAIPIRAS ANA MARIA BRAGA 10UN -> OVOS CAIPIRAS ANA MARIA BRAGA 10U
3. SUCO MARACUJA MAMAO PCS 300 Gram -> SUCO MARACUJA MAMAO PCS 300 G
4. SUCO ABACAXI COM MACA PCS 300Milli litre -> SUCO ABACAXI COM MACA PCS 300ML
The keyword table is:
unit = ['Kilo','Kilogram','Gram','Milligram','Millilitre','Milli litre',
        'Dozen','Litre','Un','Und','Unid','Unidad','Unidade','Unidades']
norm_unit = ['KG','KG','G','MG','ML','ML','DZ','L','U','U','U','U','U','U']
I tried to take these lists up as a table, but I am having difficulty comparing two dataframes or tables in Python.
I tried the code below:
import re

unit = ['Kilo','Kilogram','Gram','Milligram','Millilitre','Milli litre',
        'Dozen','Litre','Un','Und','Unid','Unidad','Unidade','Unidades']
norm_unit = ['KG','KG','G','MG','ML','ML','DZ','L','U','U','U','U','U','U']

z = 'SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre'
#for row in mongo_docs:
#    z = row['clean_hntproductname']
for x in unit:
    for y in norm_unit:
        if re.search(r'\s' + x + r'$', z, re.I):
            pass  # match found, but no replacement happens here
            # clean_hntproductname = t.lower().replace(x.lower(), y.lower())
            # myquery3 = { "_id": row['_id'] }
            # newvalues3 = { "$set": {"clean_hntproductname": 'clean_hntproductname'} }
            # ds_hnt_prod_data.update_one(myquery3, newvalues3)
I'm using Python (Jupyter) with MongoDB (Compass), fetching data from Mongo and writing back to it.
From my understanding, you want to update all the rows in a table that contain the words in the unit list, replacing them with the corresponding entries in norm_unit.
(Disclaimer: I'm not familiar with MongoDB or Python.)
What you want is to create a mapping (using a hash) of the words you want to change.
Here's a trivial solution (i.e. not the best solution, but it should point you in the right direction):
unit_conversions = {
    'Kilo': 'KG',
    'Kilogram': 'KG',
    'Gram': 'G'
}
# pseudo-code
for each row that you want to update
    item_description = get the value of the string in the column
    for each key in unit_conversions (e.g. 'Kilo')
        see if the item_description contains the key
        if it does, replace it with unit_conversions[key] (e.g. 'KG')
    update the row
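A runnable version of that idea in Python, built from the asker's two lists (the regex anchors the unit to the end of the string, as in the original attempt; treat this as a sketch rather than the full Mongo update):
import re

unit = ['Kilo','Kilogram','Gram','Milligram','Millilitre','Milli litre',
        'Dozen','Litre','Un','Und','Unid','Unidad','Unidade','Unidades']
norm_unit = ['KG','KG','G','MG','ML','ML','DZ','L','U','U','U','U','U','U']

# Build the mapping once instead of looping over both lists.
unit_conversions = dict(zip(unit, norm_unit))

def normalise(text):
    # Try longer unit names first so 'Kilogram' is not matched as 'Kilo' + 'gram'.
    for name in sorted(unit_conversions, key=len, reverse=True):
        pattern = re.escape(name) + r'$'
        if re.search(pattern, text, flags=re.I):
            return re.sub(pattern, unit_conversions[name], text, flags=re.I)
    return text

print(normalise('SUCO MARACUJA COM GENGIBRE PCS 300 Millilitre'))
# SUCO MARACUJA COM GENGIBRE PCS 300 ML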

Replace values in observations (i.e., multiple columns within multiple rows) based on multiple conditionals

I am trying to replace the values of 3 columns within multiple observations based on two conditionals (e.g., a specific ID after a particular date).
I have seen similar questions:
Pandas Multiple Conditions Function based on Column
Pandas replace, multi column criteria
Pandas: How do I assign values based on multiple conditions for existing columns?
Replacing values in a pandas dataframe based on multiple conditions
However, they did not quite address my problem, and I couldn't adapt them to solve it.
This code will generate a dataframe similar to mine:
df = pd.DataFrame({
    'SUR_ID': {0: 'SUR1', 1: 'SUR1', 2: 'SUR1', 3: 'SUR1', 4: 'SUR2', 5: 'SUR2'},
    'DATE': {0: '05-01-2019', 1: '05-11-2019', 2: '06-15-2019', 3: '06-20-2019', 4: '05-15-2019', 5: '06-20-2019'},
    'ACTIVE_DATE': {0: '05-01-2019', 1: '05-01-2019', 2: '05-01-2019', 3: '05-01-2019', 4: '05-01-2019', 5: '05-01-2019'},
    'UTM_X': {0: '444895', 1: '444895', 2: '444895', 3: '444895', 4: '445050', 5: '445050'},
    'UTM_Y': {0: '4077528', 1: '4077528', 2: '4077528', 3: '4077528', 4: '4077762', 5: '4077762'}
})
Output dataframe:
  SUR_ID        DATE ACTIVE_DATE   UTM_X    UTM_Y
0   SUR1  05-01-2019  05-01-2019  444895  4077528
1   SUR1  05-11-2019  05-01-2019  444895  4077528
2   SUR1  06-15-2019  05-01-2019  444895  4077528
3   SUR1  06-20-2019  05-01-2019  444895  4077528
4   SUR2  05-15-2019  05-01-2019  445050  4077762
5   SUR2  06-20-2019  05-01-2019  445050  4077762
What I am trying to do:
I am trying to replace UTM_X, UTM_Y, and ACTIVE_DATE with
[444917, 4077830, '06-04-2019']
when
SUR_ID is "SUR1" and DATE >= "2019-06-04 12:00:00"
This is a poorly adapted version of the solution to question 1 above, in my attempt to fix the problem; it throws an error:
df.loc[[df['SUR_ID'] == 'SUR1' and df['DATE'] >='2019-06-04 12:00:00'], ['UTM_X', 'UTM_Y', 'Active_Date']] = [444917, 4077830, '06-04-2019']
First, ensure that the DATE column is of type datetime. Then, when combining two conditions, each needs to be wrapped in parentheses individually, so you can do:
df.DATE = pd.to_datetime(df.DATE)
df.loc[ (df['SUR_ID'] == 'SUR1') & (df['DATE'] >= pd.to_datetime('2019-06-04 12:00:00')),
['UTM_X', 'UTM_Y', 'ACTIVE_DATE']] = [444917, 4077830, '06-04-2019']
See the difference between what you wrote for the boolean mask:
[df['SUR_ID'] == 'SUR1' and df['DATE'] >='2019-06-04 12:00:00']
and what is here with parenthesis
(df['SUR_ID'] == 'SUR1') & (df['DATE'] >= pd.to_datetime('2019-06-04 12:00:00'))
Use:
df['UTM_X']=df['UTM_X'].mask(df['SUR_ID'].eq('SUR1') & (pd.to_datetime(df['DATE'])>= pd.to_datetime("2019-06-04 12:00:00")),444917)
df['UTM_Y']=df['UTM_Y'].mask(df['SUR_ID'].eq('SUR1') & (pd.to_datetime(df['DATE'])>= pd.to_datetime("2019-06-04 12:00:00")),4077830)
df['ACTIVE_DATE']=df['ACTIVE_DATE'].mask(df['SUR_ID'].eq('SUR1') & (pd.to_datetime(df['DATE'])>= pd.to_datetime("2019-06-04 12:00:00")),'06-04-2019')
Output:
SUR_ID DATE ACTIVE_DATE UTM_X UTM_Y
0 SUR1 05-01-2019 05-01-2019 444895 4077528
1 SUR1 05-11-2019 05-01-2019 444895 4077528
2 SUR1 06-15-2019 06-04-2019 444917 4077830
3 SUR1 06-20-2019 06-04-2019 444917 4077830
4 SUR2 05-15-2019 05-01-2019 445050 4077762
5 SUR2 06-20-2019 05-01-2019 445050 4077762
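As a further note on the .mask answer (my own sketch, not part of the original answers): converting DATE once and reusing a single boolean mask avoids repeating pd.to_datetime three times:
import pandas as pd

# Build the condition once, then apply it to all three columns at the same time.
dates = pd.to_datetime(df['DATE'])
mask = df['SUR_ID'].eq('SUR1') & (dates >= pd.to_datetime('2019-06-04 12:00:00'))

df.loc[mask, ['UTM_X', 'UTM_Y', 'ACTIVE_DATE']] = [444917, 4077830, '06-04-2019']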

Calculate the average of Spearman correlation

I have 2 columns A and B which contain Spearman correlation values as follows:
0.127272727 -0.260606061
-0.090909091 -0.224242424
0.345454545 0.745454545
0.478787879 0.660606061
-0.345454545 -0.333333333
0.151515152 -0.127272727
0.478787879 0.660606061
-0.321212121 -0.284848485
0.284848485 0.515151515
0.36969697 -0.139393939
-0.284848485 0.272727273
How can I calculate the average of those correlation values in these 2 columns in Excel or Matlab? I found a close answer at this link: https://stats.stackexchange.com/questions/8019/averaging-correlation-values
The main point is that we cannot use a plain mean or average in this case, as explained in the link. They proposed a nice way to do it, but I don't know how to implement it in Excel or Matlab.
Following the second answer in the link you provided, which covers the most general case, you can calculate the average Spearman's rho in Matlab as follows:
M = [0.127272727 -0.260606061;
-0.090909091 -0.224242424;
0.345454545 0.745454545;
0.478787879 0.660606061;
-0.345454545 -0.333333333;
0.151515152 -0.127272727;
0.478787879 0.660606061;
-0.321212121 -0.284848485;
0.284848485 0.515151515;
0.36969697 -0.139393939;
-0.284848485 0.272727273];
z = atanh(M);            % Fisher z-transform of each correlation
meanRho = tanh(mean(z)); % back-transform the column means
As you can see it gives mean values of
meanRho =
0.1165 0.1796
whereas the simple mean is quite close:
mean(M)
ans =
0.1085 0.1350
Edit: more information on Fisher's transformation here.
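For comparison, the same Fisher z-averaging can be sketched in Python with NumPy (assuming M holds the same values as the Matlab matrix above):
import numpy as np

M = np.array([[ 0.127272727, -0.260606061],
              [-0.090909091, -0.224242424],
              [ 0.345454545,  0.745454545],
              [ 0.478787879,  0.660606061],
              [-0.345454545, -0.333333333],
              [ 0.151515152, -0.127272727],
              [ 0.478787879,  0.660606061],
              [-0.321212121, -0.284848485],
              [ 0.284848485,  0.515151515],
              [ 0.369696970, -0.139393939],
              [-0.284848485,  0.272727273]])

# Fisher z-transform, average per column, then transform back.
mean_rho = np.tanh(np.arctanh(M).mean(axis=0))
print(mean_rho)  # approximately [0.1165, 0.1796]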
In MATLAB, define a matrix with these values and use the mean function as follows:
%define a matrix M
M = [0.127272727 -0.260606061;
-0.090909091 -0.224242424;
0.345454545 0.745454545;
0.478787879 0.660606061;
-0.345454545 -0.333333333;
0.151515152 -0.127272727;
0.478787879 0.660606061;
-0.321212121 -0.284848485;
0.284848485 0.515151515;
0.36969697 -0.139393939;
-0.284848485 0.272727273];
%calculates the mean of each column
meanVals = mean(M);
Result
meanVals =
0.1085 0.1350
It is also possible to calculate the overall mean and the mean of each row as follows:
meanVals = mean(M(:)); %overall mean of all values
meanVals = mean(M,2);  %mean of each row

data.frame slicing

I hope this question is not too simple for this board.
I have created a data.frame df:
CAS Name CID
89 13010-47-4 Lomustine 3950
90 130209-82-4 Latanoprost 5311221,5282380,46705340,3890
91 130636-43-0 Nifekalant 268083
92 130929-57-6 Entacapone 5281081
and a vector vec
[1] 5282380 18471829 45923789 44308022 44266812 24883465 24867475 24867460
I would like to extract the rows of df whose CID column contains any of the numbers in vec. I tried to solve this problem with this code:
df$GC[(df$CID %in% vec)] = 1
df[df$GC==1,]
But the problem with this solution is that I only get rows that contain a single number in the CID column. Rows that contain several values in CID, like row 90, do not appear.
Is there an elegant solution for this problem?
Thanks in advance
Given your comment on EDi's answer (which I like), I thought I'd make a suggestion.
Squeezing comma-separated values into a single column of a data frame is awkward and (in my experience) just leads to frustration. I often find it simpler to keep them in a separate data structure, a list:
dat <- read.table(text = "CAS Name CID
13010-47-4 Lomustine 3950
130209-82-4 Latanoprost 5311221,5282380,46705340,3890
130636-43-0 Nifekalant 268083
130929-57-6 Entacapone 5281081",
                  sep = "", header = TRUE, stringsAsFactors = FALSE)

cid <- sapply(dat$CID, strsplit, ",", USE.NAMES = FALSE)
In this form, things are often easier to work with:
ID <- c(5282380, 18471829, 45923789, 44308022, 44266812, 24883465, 24867475, 24867460, 3950)
dat[sapply(cid,function(x) {any(x %in% as.character(ID))}),]
CAS Name CID
1 13010-47-4 Lomustine 3950
2 130209-82-4 Latanoprost 5311221,5282380,46705340,3890
You can always use rownames in dat and the names of the list to keep each item straight, if you're worried about orderings changing.
(Also note that my anonymous function is assuming that ID will be found eventually by R's scoping rules; you can alter the function to pass in ID explicitly if you like.)
One way is to use grep():
> txt <- " CAS Name CID
+ 13010-47-4 Lomustine 3950
+ 130209-82-4 Latanoprost 5311221,5282380,46705340,3890
+ 130636-43-0 Nifekalant 268083
+ 130929-57-6 Entacapone 5281081
+ "
> con <- textConnection(txt)
> df <- read.table(con, header = TRUE)
> close(con)
> ID <- c(5282380, 18471829, 45923789, 44308022, 44266812, 24883465, 24867475, 24867460, 3950)
> grep(paste("\\b", ID, "\\b", sep="", collapse = "|"), df$CID)
[1] 1 2
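The same word-boundary trick carries over to Python/pandas; here is a minimal sketch (the DataFrame construction simply mirrors the example data above):
import pandas as pd

df = pd.DataFrame({
    'CAS': ['13010-47-4', '130209-82-4', '130636-43-0', '130929-57-6'],
    'Name': ['Lomustine', 'Latanoprost', 'Nifekalant', 'Entacapone'],
    'CID': ['3950', '5311221,5282380,46705340,3890', '268083', '5281081'],
})
vec = [5282380, 18471829, 45923789, 44308022, 44266812,
       24883465, 24867475, 24867460, 3950]

# One alternation of \b-delimited IDs, mirroring the R grep() answer.
pattern = '|'.join(r'\b{}\b'.format(v) for v in vec)
matches = df[df['CID'].str.contains(pattern)]
print(matches)  # the Lomustine and Latanoprost rows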
