How to deal with missing values in Azure Machine Learning Studio - azure

Looks like I have 672 missing values, according to the statistics.
There are NULL values in the QuotedPremium column.
I added the Clean Missing Data module, which should substitute missing values with 0, but for some reason I'm still seeing NULL values in QuotedPremium, even though it reports that the number of missing values is 0.
Here you can see it tells me that missing values = 0, but there are still NULLs.
So what really happened after I ran the Clean Missing Data module? Why did it run successfully while there are still NULL values, even though it reports that the number of missing values is 0?

NULL is indeed a value; entries containing NULLs are not missing, hence they are neither cleaned with the 'Clean Missing Data' operator nor reported as missing.

These are not really missing values; each of these cells contains the string "NULL". To substitute these values with 0, you can use the following:
Use the Execute R Script module and add this code to it.
dataset1 <- maml.mapInputPort(1); # class: data.frame
dataset1[dataset1 == "NULL"] = 0; # Wherever cell's value is "NULL", replace it with 0
maml.mapOutputPort("dataset1"); # return the modified data.frame
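The difference between a missing cell and the literal string "NULL" can also be illustrated outside the Studio with pandas. This is only a sketch under assumptions: the sample values are made up, and only the column name QuotedPremium comes from the question.

```python
import pandas as pd

# Hypothetical sample: the text "NULL" is an ordinary string value, not a missing cell.
df = pd.DataFrame({"QuotedPremium": ["100.5", "NULL", "250", "NULL"]})

# A missing-value check finds nothing, because every cell holds a string.
missing_before = int(df["QuotedPremium"].isna().sum())

# Counting the literal string shows where the "NULL" entries really are.
null_strings = int((df["QuotedPremium"] == "NULL").sum())

# Swap the string "NULL" for "0", then convert the whole column to numbers.
df["QuotedPremium"] = pd.to_numeric(df["QuotedPremium"].replace({"NULL": "0"}))

print(missing_before, null_strings)
print(df)
```

The same reasoning explains the module's report: it counts cells that are actually empty, so a column full of "NULL" strings shows zero missing values.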

Related

Python Warning Panda Dataframe "Simple Issue!" - "A value is trying to be set on a copy of a slice from a DataFrame"

First post and total Python novice, so please be patient with my slow understanding!
I have a dataframe containing a list of transactions ordered by transaction date.
I've added a new column called ["DB/CR"] that, depending on the presence of "-" in the ["Amount"] field, is populated with 'Debit', or with 'Credit' in the absence of "-".
Since the transactions are in date order, I've included another new column called ["Top x"]. In it I want to populate an incremental number (starting at 1), counted independently for debits and for credits.
To do this, I have created a simple loop with an associated 'if' / 'elif' statement (I could probably use else, as it's binary) that loops through the dataset from row 0 to the last row in the df and, depending on whether the row is 1) "Debit" or 2) "Credit", increments a separate counter for each: integer 'i' for "Debit" and integer 'ii' for "Credit".
The code works as expected in terms of the 'Top x' output; however, I always receive the warning "A value is trying to be set on a copy of a slice from a DataFrame".
I'm trying to perfect my script so it runs without any warnings, and I've been trying to understand what I'm doing incorrectly, but I'm not getting it for my use case.
I'd appreciate it if someone could kindly shed light on how the code needs to be refactored to avoid receiving this warning.
Code (the df source data is an imported csv):
#top x debits/credits
i = 0
ii = 0
for ind in df.index:
    if df["DB/CR"][ind] == "Debit":
        i = i+1
        df["Top x"][ind] = i
    elif df["DB/CR"][ind] == "Credit":
        ii = ii+1
        df["Top x"][ind] = ii
Interpreter
df["Top x"][ind] = i
G:\Finances Backup\venv\Statementsv.03.py:173: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df["Top x"][ind] = ii
Many thanks :)
You should use df.loc[ind, "Top x"] = i (row label first, then column name) instead of the chained assignment df["Top x"][ind] = i.
Use iterrows() to iterate over the DataFrame. However, updating a DataFrame while iterating over it is not recommended.
The iterrows() documentation says:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
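One way to avoid both the loop and the warning is to let pandas number the rows within each group. The sketch below is an assumption-laden illustration, not the asker's script: the sample amounts are invented, and only the column names "Amount", "DB/CR" and "Top x" come from the question.

```python
import pandas as pd

# Hypothetical sample standing in for the imported CSV, already in date order.
df = pd.DataFrame({"Amount": ["-10.00", "25.00", "-3.50", "-7.25", "40.00"]})

# "Debit" when the amount contains "-", otherwise "Credit".
has_minus = df["Amount"].str.contains("-", regex=False)
df["DB/CR"] = has_minus.map({True: "Debit", False: "Credit"})

# cumcount() numbers rows 0, 1, 2, ... separately inside each group,
# so adding 1 gives an independent running counter for debits and credits.
df["Top x"] = df.groupby("DB/CR").cumcount() + 1

print(df)
```

If the explicit loop is kept instead, writing each value with a single indexing call such as df.loc[ind, "Top x"] = i avoids the chained assignment that triggers SettingWithCopyWarning.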

Cognos Report Studio: CASE and IF Statements

I'm very new to Cognos Report Studio and am trying to filter some values and replace them with others.
I currently have values that come out as blanks and want to replace them with the string "Property Claims".
What I'm trying to use in my main query is
CASE WHEN [Portfolio] is null
then 'Property Claims'
ELSE [Portfolio]
which is giving me an error. I also have a different filter I want to add that replaces the windscreen flag with a string value rather than a number. For example, if the flag is 1, I want to show it as 'Windscreen Claims'.
if [Claim Windscreen Flag] = 1
then ('Windscreen')
Else [Claim Windscreen Flag]
Neither of these works; both give the same error. Can someone give me a hand?
Your first CASE statement is missing the END. The error message should be pretty clear. But there is a simpler way to do that:
coalesce([Portfolio], 'Property Claims')
The second problem is similar: Your IF...THEN...ELSE statement is missing a bunch of parentheses. But after correcting that you may have problems with incompatible data types. You may need to cast the numbers to strings:
case
when [Claim Windscreen Flag] = 1 then ('Windscreen')
else cast([Claim Windscreen Flag], varchar(50))
end
In future, please include the error messages.
It might be syntax:
Use IS NULL (instead of = null).
NULL is not blank. You might also want = ' '.
CASE might need an ELSE, and it needs an END at the bottom.
Comparing a data item with a value of a different data type can cause errors, for example a numeric item like [Sales] = 'Jane Doe'.
For example (assuming the result is a string and data item 2 is also a string),
case
when([data item 1] IS NULL)Then('X')
when([data item 1] = ' ')Then('X')
else([data item 2])
end
Also, if you want to show a data item as a different type, you can use CAST

How to resolve the problem with the SQL LIKE operator

There is a table 'Phones', which includes a column 'phone_no', declared as varchar(20). It can be NULL, too.
Some of the values stored in this column are:
'(310) 369-1000', '(415) 623-1000', '(310) 449-3000', '(323) 956-8398', and '(800) 864-8377'.
I would like to filter out all the records where the phone number ends with '0', so I use the expression phone_no LIKE '%0'. However, the resulting recordset is empty! The same happens when using any number (not just 0) at the end of the pattern. Why? Where is the problem?

Dict key getting overwritten when created in a loop

I'm trying to create individual dictionary entries while looping through some input data. Part of the data is used for the key, while a different part is used as the value associated with that key. I'm running into a problem with this (due to Python's "everything is an object, and you reference that object" way of operating), as every iteration through my loop alters the key set in previous iterations, thus overwriting the previously set value instead of creating a new dict key with its own value.
popcount = {}
for oneline in datafile:
    if oneline[:3] == "POP":
        dat1, dat2, dat3, dat4, dat5, dat6 = oneline.split(":")
        datid = str.join(":", [dat2, dat3])
        if datid in popcount:
            popcount[datid] += int(dat4)
        else:
            popcount = { datid : int(dat4) }
This iterates over seven lines of data (datafile is a list containing that information) and should create four separate keys for datid, each with its own value. However, what ends up happening is that only the last value for datid exists in the dictionary when the code is run. That happens to be the one that has duplicates, and they get summed properly (so at least I know that part of the code works), but the other key entries are just ... gone.
The data is read from a file, is colon (:) separated, and is treated as a string even when it's numeric (thus the int() call in the if datid in popcount branch).
What am I missing or doing wrong here? So far I haven't been able to find anything that helps me out on this one (though you folks have answered a lot of other Python questions I've run into, even if you didn't know it). I know why it's failing, or I think I do: it is because when I update the value of datid, the key gets pointed to the new datid value object even though I don't want it to, correct? I just don't know how to fix or work around this behavior. To be honest, it's the one thing I dislike about working in Python (hopefully once I grok it, I'll like it better; until then...).
Simply change your last line
popcount = { datid : int(dat4) } # This does not do what you want
This creates a new dict and assigns it to popcount, throwing away your previous data.
What you want to do is add an entry to your dict instead:
popcount[datid] = int(dat4)
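Put together, the corrected loop might look like the sketch below. The input lines are hypothetical (the question does not show the real data); only the "POP" prefix, the six colon-separated fields and the variable names follow the question. Using dict.get with a default of 0 is an optional shortcut that folds the if/else into one line.

```python
# Hypothetical input lines; the real file contents are not shown in the question.
datafile = [
    "POP:a:1:10:x:y",
    "POP:a:2:5:x:y",
    "HDR:ignored",
    "POP:b:1:7:x:y",
    "POP:a:1:3:x:y",
]

popcount = {}
for oneline in datafile:
    if oneline[:3] == "POP":
        dat1, dat2, dat3, dat4, dat5, dat6 = oneline.split(":")
        datid = ":".join([dat2, dat3])
        # Add to the existing dict; a missing key starts from 0.
        popcount[datid] = popcount.get(datid, 0) + int(dat4)

print(popcount)
```

Each distinct datid keeps its own entry, and repeated ones are summed, because the dict object itself is never replaced.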

Strange SELECT behavior

I have this strange problem. I have a table with 10 columns of type character varying.
I need a function that searches all records and returns the id of the record that contains all the given strings. Let's say the records are:
1. a,b,c,d,e
2. a,k,l,h
3. f,t,r,e,w,q
If I call this function as func(a,d) it should return 1; if I call func(e,w,q) it should return 3.
The function is
CREATE OR REPLACE FUNCTION func(ma1 character varying,ma2 character varying,ma3 character varying,ma4 character varying)
DECLARE name numeric;
BEGIN
SELECT Id INTO name from Table WHERE
ma1 IN (col1,col2,col3,col4) AND
ma2 IN (col1,col2,col3,col4) AND
ma3 IN (col1,col2,col3,col4) AND
ma4 IN (col1,col2,col3,col4);
RETURN name;
END;
It works 90% of the time; the weird problem is that some rows are not found.
It's not an uppercase or lowercase problem.
What can be wrong? It's version 9.1 on 64-bit Windows 7. I feel it's an encoding or string problem, but I can't see where or what.
//OK, I found the problem: it has to do with all the columns. If all 24 columns are filled in then it doesn't work. But why? Are there limitations because there are 24 columns that I must compare against?//
Can someone help me, please?
Thanks.
The problem is (probably) that some of your columns have nulls.
In SQL, an equality comparison with NULL never evaluates to true; it yields NULL (unknown), which a WHERE clause treats as no match. This extends to the list of values used with the IN (...) condition.
If the value being sought is NULL, or if it matches nothing and the list contains a NULL, the IN test yields NULL instead of true or false.
The workaround is to make sure no values are null, which unfortunately results in a verbose solution:
WHERE ma1 IN (COALESCE(col1, ''), COALESCE(col2, ''), ...)
I suspect Bohemian is correct that the problem is related to nulls in your IN clauses. An alternative approach is to use Postgres's array "is contained by" operator (<@) to perform your test.
where ARRAY[ma1,ma2,ma3,ma4] <@ ARRAY[col1,col2,...,colN]
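How IN treats NULLs can be checked quickly with a few scalar queries. The sketch below uses Python's built-in sqlite3 module as a stand-in, which is an assumption: the question is about PostgreSQL 9.1, and SQLite is used here only because it needs no server. The helper name scalar is made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql):
    """Run a query that returns one value; SQL NULL comes back as None."""
    return conn.execute(sql).fetchone()[0]

# The sought value is present in the list, alongside a NULL.
match_with_null = scalar("SELECT 'a' IN ('a', NULL)")

# The sought value is absent and the list contains a NULL.
no_match_with_null = scalar("SELECT 'z' IN ('a', NULL)")

# The sought value itself is NULL (e.g. an argument that was not supplied).
null_operand = scalar("SELECT NULL IN ('a', 'b')")

# COALESCE turns the NULL list entry into an empty string before comparing.
coalesced = scalar("SELECT 'z' IN ('a', COALESCE(NULL, ''))")

print(match_with_null, no_match_with_null, null_operand, coalesced)
```

Running the same four SELECT statements in psql against the actual database would be the way to confirm the behavior on PostgreSQL itself. A NULL passed as one of the function arguments is worth checking too, since each IN test is joined with AND.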
