Error converting string feature to numeric in Azure ML Studio

The QuotedPremium column is a string feature, so I need to convert it to a numeric value in order to use the algorithm.
For that I am using the Edit Metadata module, where I specify that the data type to convert to is Floating Point.
After I run it I get an error:
Could not convert type System.String to type System.Double, inner exception message: Input string was not in a correct format.
What am I missing here?

As mentioned in the comments, you must change the column where numbers are handled as text to a numeric data type, and it shouldn't contain any null values. Here is how to substitute the NULLs in the data using ML Studio and then convert to a numeric type.
Substitute NULLs in the data
Use the Execute R Script module for that, and add this code to it.
dataset1 <- maml.mapInputPort(1); # class: data.frame
dataset1[dataset1 == "NULL"] = 0; # Wherever cell's value is "NULL", replace it with 0
maml.mapOutputPort("dataset1"); # return the modified data.frame
Convert to numeric data
As you have added in your answer, this can be done using the Edit Metadata module.
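If you prefer Python over R, ML Studio's Execute Python Script module can do the same substitution. A minimal sketch, assuming the missing values arrive as the literal string "NULL" (as in the R snippet above) and using the QuotedPremium column named in the question:
import pandas as pd

# Entry point that the Execute Python Script module calls.
def azureml_main(dataframe1=None, dataframe2=None):
    # Replace every cell holding the literal string "NULL" with 0 ...
    dataframe1 = dataframe1.replace("NULL", 0)
    # ... and, if you want to skip Edit Metadata, coerce the column here too.
    dataframe1["QuotedPremium"] = pd.to_numeric(dataframe1["QuotedPremium"])
    return dataframe1,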

Related

PySpark - data mismatch error when trying to split a column content

I'm trying to use PySpark's split() method on a column that has data formatted like:
[6b87587f-54d4-11eb-95a7-8cdcd41d1310, 603, landing-content, landing-content-provider]
My intent is to extract the 4th element, the one after the last comma.
I'm using a syntax like:
mydf.select("primary_component").withColumn("primary_component_01",f.split(mydf.primary_component, "\,").getItem(0)).limit(10).show(truncate=False)
But I'm consistently getting this error:
"cannot resolve 'split(mydf.primary_component, ',')' due to data
type mismatch: argument 1 requires string type, however,
'mydf.primary_component' is of
struct<uuid:string,id:int,project:string,component:string>
type.;;\n'Project [primary_component#17,
split(split(primary_component#17, ,)[1], \,)...
I've also tried escaping the "," using \, \\ or not escaping it at all and this doesn't make any difference. Also, removing the ".getItem(0)" produces no difference.
What am I doing wrong? Feeling like a dumbass, but I don't know how to fix this...
Thank you for any suggestions
You are getting the error:
"cannot resolve 'split(mydf.`primary_component`, ',')' due to data
type mismatch: argument 1 requires string type, however,
'mydf.`primary_component`' is of
struct<uuid:string,id:int,project:string,component:string>
because your column primary_component is a struct type, while split expects a string column.
Since primary_component is already a struct and you are interested in the value after the last comma, you may try the following using dot notation:
mydf.withColumn("primary_component_01", f.col("primary_component.component"))
In the error message, spark has shared the schema for your struct as
struct<uuid:string,id:int,project:string,component:string>
i.e.
column       data type
uuid         string
id           int
project      string
component    string
For future debugging purposes, you may use mydf.printSchema() to show the schema of the spark dataframe in use.
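A minimal, self-contained sketch of the fix (the struct fields and sample values are lifted from the question; the real mydf schema may of course differ):
from pyspark.sql import Row, SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# A tiny DataFrame mimicking the struct column from the error message.
mydf = spark.createDataFrame([
    Row(primary_component=Row(
        uuid="6b87587f-54d4-11eb-95a7-8cdcd41d1310",
        id=603,
        project="landing-content",
        component="landing-content-provider"))
])

mydf.printSchema()  # shows primary_component as a struct, not a string

# No split() needed: address the struct field directly with dot notation.
mydf.withColumn("primary_component_01", f.col("primary_component.component")) \
    .select("primary_component_01") \
    .show(truncate=False)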

If I store an index number fetched from the DB in a variable and use it in Select From List By Index, I get an error "expected string, int found" - Robot Framework

select from list by index    ${locator_var}    ${inp_msge_type}
-- getting error: expected string, int found
select from list by index    ${locator_var}    7
-- not getting any error
${inp_msge_type} contains 7 from the DB query; the result is stored in this variable to avoid hard-coding.
Is there any way to make this work?
Do not add links to screenshots of code or error messages, and format the code pieces accordingly - use the ` (backtick) symbol to surround them.
The rant now behind us, your issue is that the keyword Select From List By Index expects the type of the index argument to be a string.
When you called it as
Select From List By Index    ${locator_var}    7
that "7" is actually a string (though it looks like a number), because this is what the framework defaults to for any typed text. And so it works.
When you get the value from the DB, it is of the type that the DB stores it with; and probably the table schema says it is int. So now you pass an int to the keyword - and it fails.
The fix is simple - just cast (convert) the variable to a string type:
${inp_msge_type}=    Convert To String    ${inp_msge_type}
and now you can call the keyword as you did before.
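For what it's worth, Robot Framework keywords are implemented in Python, and Convert To String boils down to Python's str(); the mismatch looks like this in plain Python (a sketch, not Robot code):
# The DB driver hands back an int, so the keyword sees the wrong type.
inp_msge_type = 7
print(type(inp_msge_type))   # <class 'int'>  -> "expected string, int found"

# Convert To String is essentially str():
inp_msge_type = str(inp_msge_type)
print(type(inp_msge_type))   # <class 'str'>  -> the keyword accepts it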

How to stop Python Pandas from converting specific column from int to float

Trying to output a dataframe to a txt file (for a feed). A few specific columns are getting automatically converted to float instead of int as intended.
How can I specify that those columns should use int as the dtype?
I tried to output the whole dataframe as string and that did not work.
The columns I would like to specify are named [CID1] and [CID2].
data = pd.read_sql(sql,conn)
data = data.astype(str)
data.to_csv('data_feed_feed.txt', sep ='\t',index=True)
Based on the code you provided, you turn all of your data to strings just before export.
Thus, you either need to turn some columns back to the desired type, such as:
data["CID1"] = data["CID1"].astype(int)
or not convert them in the first place.
It is not clear from what you provided why you'd have issues with ints being converted to floats.
This post provides heaps of info: stackoverflow.com/a/28648923/9249533
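A minimal sketch of the whole flow, assuming the CID columns come back from the database as floats (e.g. because of NULLs) and everything else should stay as strings:
import pandas as pd

# Stand-in for pd.read_sql(sql, conn); CID1/CID2 arrive as floats here.
data = pd.DataFrame({"CID1": [101.0, 102.0],
                     "CID2": [7.0, None],
                     "value": ["a", "b"]})

# Cast only the ID columns; pandas' nullable "Int64" dtype keeps them
# integer even when the column contains missing values.
data = data.astype({"CID1": "Int64", "CID2": "Int64"})

data.to_csv("data_feed_feed.txt", sep="\t", index=True)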

Treat all cells as strings while using the Apache POI XSSF API

I'm using the Apache POI framework for parsing large Excel spreadsheets. I'm using this example code as a guide: XLSX2CSV.java
I'm finding that cells that contain just numbers are implicitly being treated as numeric fields, while I wanted them to be treated always as strings. So rather than getting 1.00E+13 (which I'm currently getting) I'll get the original string value: 10020300000000.
The example code uses a XSSFSheetXMLHandler which is passed an instance of DataFormatter. Is there a way to use that DataFormatter to treat all cells as strings?
Or as an alternative: in the implementation of the interface SheetContentsHandler.cell method there is string value that is the cellReference. Is there a way to convert a cellReference into an index so that I can use the SharedStringsTable.getEntryAt(int idx) method to read directly from the strings table?
To reproduce the issue, just run the sample code on an xlsx file of your choice with a number like the one in my example above.
UPDATE: It turns out that the string value I get seems to match what you would see in Excel. So I guess that's going to be "good enough" generally. I'd expect the data I'm sent to "look right" and therefore it'll get parsed correctly. However, I'm sure there will be mistakes and in those cases it'd be nice if I could get at the raw string value using the streaming API.
To resolve this issue I created my own class based on XSSFSheetXMLHandler.
I copied that class, renamed it, and then in the endElement method I changed this part of the code, which formats the raw string:
case NUMBER:
    String n = value.toString();
    if (this.formatString != null && n.length() > 0)
        thisStr = formatter.formatRawCellContents(Double.parseDouble(n), this.formatIndex, this.formatString);
    else
        thisStr = n;
    break;
I changed it so that it would not format the raw string:
case NUMBER:
    thisStr = value.toString();
    break;
Now every number in my spreadsheet has its raw value returned rather than a formatted version.

Pentaho convert string to integer with decimal point

I am importing text values into a transformation using a Fixed Width input step. Everything is coming in as a string. I want to convert some of the string values to integers with a decimal point at a specified spot. Here are some examples of the before (left hand side) and expected results (right hand side):
00289 --> 0028.9
01109 --> 0110.9
003201 --> 0032.01
I've tried numerous combinations of the Format mask in a Select Values step (meta data tab) but I can't get the values I'm looking for.
Can anyone tell me what combination I can try for the Type/Length/Precision/Format/Encoding/Decimal/Group attributes for these fields to get the desired output?
Have you tried another step to reach your goal? You can try to use e.g. the User Defined Java Expression step, setting it up this way:
Java expression: new java.math.BigDecimal(text.substring(0,4) + "." + text.substring(4,text.length()))
Value type: BigNumber
But this will convert your input to:
00289 --> 28.9
01109 --> 110.9
003201 --> 32.01
Because its output is in BigNumber format. BigNumber or Number can be used for decimal numbers; you cannot use Integer for decimals because it has no decimal part.
If you want a String output leave out the new java.math.BigDecimal() part from the expression above and set Value type to String. It will produce these results:
00289 --> 0028.9
01109 --> 0110.9
003201 --> 0032.01
This is just one suggestion; of course there are other ways to reach your goal.
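As a side note, the difference between the numeric and String variants above is easy to see outside Pentaho; this small Python sketch mirrors the substring logic of the Java expression (purely for illustration, not a Pentaho step):
samples = ["00289", "01109", "003201"]

for text in samples:
    as_string = text[:4] + "." + text[4:]   # "0028.9"  -> leading zeros kept
    as_number = float(as_string)            # 28.9      -> leading zeros lost
    print(text, "->", as_string, "->", as_number)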
