I'm trying to use PySpark's split() method on a column that has data formatted like:
[6b87587f-54d4-11eb-95a7-8cdcd41d1310, 603, landing-content, landing-content-provider]
My intent is to extract the 4th element, i.e. the value after the last comma.
I'm using a syntax like:
mydf.select("primary_component").withColumn("primary_component_01",f.split(mydf.primary_component, "\,").getItem(0)).limit(10).show(truncate=False)
But I'm consistently getting this error:
"cannot resolve 'split(mydf.primary_component, ',')' due to data
type mismatch: argument 1 requires string type, however,
'mydf.primary_component' is of
struct<uuid:string,id:int,project:string,component:string>
type.;;\n'Project [primary_component#17,
split(split(primary_component#17, ,)[1], \,)...
I've also tried escaping the "," using \, or \\, or not escaping it at all, and this makes no difference. Removing the .getItem(0) also makes no difference.
What am I doing wrong? Feeling like a dumbass, but I don't know how to fix this...
Thank you for any suggestions
You are getting the error:
"cannot resolve 'split(mydf.`primary_component`, ',')' due to data
type mismatch: argument 1 requires string type, however,
'mydf.`primary_component`' is of
struct<uuid:string,id:int,project:string,component:string>
because your column primary_component is a struct, while split expects string columns.
Since primary_component is already a struct and you are interested in the value after your last comma, you may try the following using dot notation:
mydf.withColumn("primary_component_01", f.col("primary_component.component"))
In the error message, spark has shared the schema for your struct as
struct<uuid:string,id:int,project:string,component:string>
i.e.

column      data type
uuid        string
id          int
project     string
component   string
For future debugging purposes, you may use mydf.printSchema() to show the schema of the Spark dataframe in use.
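Putting the pieces together, a minimal runnable sketch; the sample row below is invented to mirror the schema from the error message:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Invented sample row mirroring the struct schema from the error message
mydf = spark.createDataFrame(
    [(("6b87587f-54d4-11eb-95a7-8cdcd41d1310", 603, "landing-content",
       "landing-content-provider"),)],
    "primary_component struct<uuid:string,id:int,project:string,component:string>",
)

mydf.printSchema()  # confirms primary_component is a struct, not a string

# No split() needed: address the struct field directly with dot notation
mydf.withColumn(
    "primary_component_01", f.col("primary_component.component")
).show(truncate=False)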
I want to use a string I've created in a for loop to determine the dtypes of a DataFrame to send through sqlalchemy.
dtype_str = "'var1' : sqlalchemy.types.VARCHAR(8), 'var2' : sqlalchemy.types.VARCHAR(4)"
df.to_sql('table',connect,dtype={dtype_str})
When I run that I get an error saying "ValueError: The type of var1 is not a SQLAlchemy type."
If I just copy and paste it in without the quotes, it works. I'm trying to build something that can work on different datasets, so I don't want to have to copy/paste that string each time.
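For what it's worth, to_sql expects dtype to be an actual dict, and the set literal {dtype_str} wraps the whole string as a single element. A hedged sketch (column names and widths below are placeholders) that builds the dict directly instead of a string:

import sqlalchemy

# Placeholder mapping of column names to VARCHAR widths;
# in practice this would be built inside the loop
col_widths = {"var1": 8, "var2": 4}

# Build the dtype mapping as a real dict, not a string
dtype_map = {col: sqlalchemy.types.VARCHAR(w) for col, w in col_widths.items()}

# df.to_sql('table', connect, dtype=dtype_map)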
Select From List By Index    ${locator_var}    ${inp_msge_type}
-- getting error: expected string, int found
Select From List By Index    ${locator_var}    7
-- not getting any error
${inp_msge_type} contains 7 from a DB query; the result is stored in this variable because we need to avoid hard-coding the index.
Is there any way to make this work?
Do not add links to screenshots of code or error messages; format the code pieces accordingly - use the ` (backtick) symbol to surround them.
The rant now behind us, your issue is that the keyword Select From List By Index expects the type of the index argument to be a string.
When you called it with
Select From List By Index    ${locator_var}    7
that "7" is actually a string (though it looks like a number), because a string is what the framework defaults to for any typed text. And so it works.
When you get the value from the DB, it is of the type that the DB stores it with; and probably the table schema says it is int. So now you pass an int to the keyword - and it fails.
The fix is simple - just cast (convert) the variable to a string type:
${inp_msge_type}=    Convert To String    ${inp_msge_type}
Now you can call the keyword as you did before.
I'm getting this message in Redshift: invalid input syntax for type numeric: " ", even after trying to implement the advice found on SO.
I am trying to convert text to number.
In my inner join, I try to make sure that the text being processed is first converted to null when there is an empty string, like so:
nullif(trim(atl.original_pricev::text),'') as original_price
... I noticed from a related post on coalesce that you have to convert the value to text before you can try and nullif it.
Then in the outer join, I test to see that there's a limited set of acceptable characters and if this test is met I try to do the to_number conversion:
,case
when regexp_instr(trim(atl.original_price),'[^0-9.$,]')=0
then to_number(atl.original_price,'FM999999999D00')
else null
end as original_price2
At this point I get the above error and unfortunately I can't see the details in datagrip to get the offending value.
So my questions are:
I notice that there is an empty space in my error message:
invalid input syntax for type numeric: " ". Does this error have the exact same meaning as
invalid input syntax for type numeric: '', which is what I see in similar posts?
Of course: what am I doing wrong?
Thanks!
It's hard to know for sure without some data and the complete code to try and reproduce the example, but as some have mentioned in the comments the most likely cause is the to_number() function you are using.
In the earlier code fragment you are converting original_price to text (string) and then substituting NULL wherever the value is an empty string (''). Calling the to_number() function on an empty string will give you the error described.
Without the full SQL statement it's not clear why you're putting the nullif() function around the original_price in the "inner join", or whether the CASE statement is really in an outer join clause or one of the columns returned by the query. However you could perhaps alter the nullif() to substitute a value that can be converted to a number, e.g. '0.00' instead of ''.
Sorry I couldn't share real data. I spent the weekend testing small sets to try and trap the error. I found that the error was caused by the input string having no numbers, which is permitted by my regex filter:
when regexp_instr(trim(atl.original_price),'[^0-9.$,]')=0
I wrongly expected that a non-numeric string like "$" would evaluate to NULL, and that the to_number function would then return NULL. But from experimenting it seems that it needs at least one digit somewhere in the string. Otherwise it reduces the string argument to an empty string prior to applying the to_number formatting and chokes.
For example select to_number(trim('$1'::text),'FM999999999999D00') will evaluate to 1 but select to_number(trim('$A'::text),'FM999999999999D00') will throw the empty string error.
My fix was to add an additional regex to my initial filter:
and regexp_instr(atl.original_price2,'[0-9]')>0
This ensures that at least one number will be in the string and after that the empty string error went away.
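For illustration only, here is a small Python sketch (not part of the original query) that mirrors the two regexp_instr() conditions, assuming the same character rules:

import re

def survives_filter(raw: str) -> bool:
    # Mirror of the two regexp_instr() conditions in the SQL filter
    s = raw.strip()
    # First condition: no characters outside digits, '.', '$' and ','
    if re.search(r"[^0-9.$,]", s):
        return False
    # Second condition: at least one digit, so to_number() is never
    # left with an empty string after stripping '$' and ','
    return re.search(r"[0-9]", s) is not None

print(survives_filter("$1"))  # True  -> to_number() yields 1
print(survives_filter("$"))   # False -> would have caused the error
print(survives_filter("$A"))  # False -> rejected by the first condition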
Hope my learning experience helps someone else.
The QuotedPremium column is a string feature, so I need to convert it to a numeric value in order to use the algorithm.
For that, I am using the Edit Metadata module, where I specify that the data type to convert to is Floating Point.
After I run it, I get an error:
Could not convert type System.String to type System.Double, inner exception message: Input string was not in a correct format.
What am I missing here?
As mentioned in the comments, you must change the column where numbers are handled as text to a numeric data type, and it shouldn't have any NULL values. Now answering the question of how to substitute NULLs in the data using ML Studio and convert to a numeric type.
Substitute NULLs in data
Use the Execute R Script module for that, and add this code in it:
dataset1 <- maml.mapInputPort(1); # class: data.frame
dataset1[dataset1 == "NULL"] = 0; # Wherever cell's value is "NULL", replace it with 0
maml.mapOutputPort("dataset1"); # return the modified data.frame
Convert to numeric data
As you have added in your answer, this can be done using the Edit Metadata module.
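Alternatively, both steps can be combined in one Execute Python Script module; a minimal sketch, assuming the classic ML Studio Python entry point and the QuotedPremium column from the question:

import pandas as pd

# Entry point invoked by the Execute Python Script module
def azureml_main(dataframe1=None, dataframe2=None):
    # Wherever a cell holds the literal string "NULL", replace it with 0
    dataframe1 = dataframe1.replace("NULL", 0)
    # Cast the column to numeric; unparseable values become NaN
    dataframe1["QuotedPremium"] = pd.to_numeric(
        dataframe1["QuotedPremium"], errors="coerce"
    )
    return dataframe1,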
I have created a UDT made up of fields from three or four columns of data. One of the fields contains a letter inside parens, for example (c) or (d). When importing the CSV file using cqlsh's COPY FROM, I get an error message:
Syntax error in CQL query ..... mismatched input '(' expecting ')' (.... column 3, column 4) VALUES (10.2[(]c...).
I have tried importing the CSV file with fields where the letter does not have brackets and get:
Syntax error in CQL query ..... mismatched input 'c' expecting ')' (.... column 3, column 4) VALUES (10.2[c]...)
I have tried importing the CSV file without a letter in the field and get:
Syntax error in CQL query ..... mismatched input ',' expecting ')' (.... column 4) VALUES (10.2,...)
The UDT is made up of integers and text. It appears that importing a CSV file containing the UDT generates a data violation when a field includes a letter inside brackets (e.g. (c)), when it includes a letter with no brackets, and when it has no value at all.
Have you tried character escaping using dollar quotes ($$) or doubled single quotes ('')? http://docs.datastax.com/en/cql/3.3/cql/cql_reference/escape_char_r.html
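If escaping the CSV values remains fragile, one alternative worth sketching (not from the original answer) is to load the file with the Python driver instead of COPY FROM, so the UDT values are bound as parameters and never pass through the CQL text parser. All keyspace, table, type, and field names below are hypothetical:

import csv
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")  # keyspace name is hypothetical

# Plain class whose attributes match the UDT's field names (hypothetical)
class Component:
    def __init__(self, num, label):
        self.num = num      # int field of the UDT
        self.label = label  # text field that may contain "(c)"

cluster.register_user_type("my_keyspace", "component", Component)

insert = session.prepare("INSERT INTO my_table (price, comp) VALUES (?, ?)")

with open("data.csv", newline="") as f:
    for price, num, label in csv.reader(f):
        # Bound parameters are serialized directly, so parens in the
        # text field cannot trigger a CQL syntax error
        session.execute(insert, (float(price), Component(int(num), label)))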