I am using ServiceStack's FromCsv<MyType>() to deserialize data from a third-party service.
It works fine when the data is exactly as defined, but sometimes the third-party service has issues with a record and, instead of returning a number in a column, it returns the string "unknown".
If the CSV has any row with "unknown" instead of the expected number, deserializing the CSV fails.
Is there any way to make it skip these rows and just deserialize the correctly matching data?
No, but you can do a string.Replace before deserializing it:
var rows = csv.Replace("unknown", "-1").FromCsv<MyType>();
I have a delimited file separated by hashes that looks somewhat like this:
value#value#value#value#value#value##value
value#value#value#value##value#####value#####value
value#value#value#value###value#value####value##value
As you can see, when split on the hashes, the 2nd and 3rd rows have more columns than the first. I want to be able to ingest this into a database using an ADF Data Flow after some transformations. However, whenever I try to do any kind of mapping, I only ever see 7 columns (the number of columns in the first row).
Is there any way to get all of the values, with as many columns as there are in the row with the most items? I do not mind the nulls.
Note: I do not have a header row for this.
Azure Data Factory will not be able to directly import the schema from the row with the maximum number of columns, so it is important to make sure every row in your file has the same number of columns.
You can use an Azure Function to validate your file and update it so that all rows have an equal number of columns.
You could also try keeping a local sample file whose rows have the maximum number of columns and import the schema from that file; otherwise you will have to go for an Azure Function, where you convert the file and then trigger the pipeline.
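If you go the pre-processing route, the padding logic itself is simple. Here is a minimal sketch in plain Python (the pad_rows helper and the file names are illustrative, not an existing API) that you could adapt inside an Azure Function:

def pad_rows(in_path, out_path, delimiter="#"):
    # Read every row and split it on the delimiter.
    with open(in_path, "r", encoding="utf-8") as f:
        rows = [line.rstrip("\n").split(delimiter) for line in f]
    # Width of the widest row in the file.
    max_cols = max(len(row) for row in rows)
    # Pad every row with empty values so all rows have max_cols columns.
    with open(out_path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(delimiter.join(row + [""] * (max_cols - len(row))) + "\n")

pad_rows("input.txt", "padded.txt")

After this step all rows have the same number of hash-separated columns, so the Data Flow mapping sees every column and the missing values simply come through as empty/null.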
I have a table in an Oracle DB that contains a column stored as a RAW type. I'm making a JDBC connection to read that column and, when I print the schema of the resulting dataframe, I notice that I have a column with a binary data type. This was what I was expecting to happen.
The thing is that I need to be able to read that column as a String so I thought that a simple data type conversion would solve it.
df.select("COLUMN").withColumn("COL_AS_STRING", col("COLUMN").cast(StringType)).show
But what I got was a bunch of random characters. Since I'm dealing with a RAW type, it was possible that a string representation of this data simply doesn't exist, so, just to be safe, I did a simple select to get the first rows from the source (using sqoop-eval), and somehow sqoop can display this column as a string.
I then thought that this could be an encoding problem so I tried this:
df.selectExpr("decode(COLUMN,'utf-8')").show
I tried utf-8 and a bunch of other encodings, but again all I got was random characters.
Does anyone know how can I do this data type conversion?
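One thing worth checking: Oracle tools, and sqoop going through the JDBC driver's string conversion, typically display a RAW column as its hexadecimal string representation rather than decoded text. If that is what sqoop is showing you, a hex rendering rather than a character-set decode may be what you need. A minimal sketch in PySpark syntax (COL_AS_HEX is just an illustrative alias; the equivalent hex function also exists in org.apache.spark.sql.functions for Scala):

from pyspark.sql.functions import col, hex
# Render the binary RAW bytes as a hexadecimal string, which is how
# sqoop/SQL*Plus usually show RAW values.
df.withColumn("COL_AS_HEX", hex(col("COLUMN"))).show(truncate=False)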
I have a parquet file with 400+ columns. When I read it, the default datatype attached to a lot of columns is String (maybe due to the schema specified by someone else).
I was not able to find a parameter similar to
inferSchema=True  # present for spark.read.csv, but not for spark.read.parquet
I tried changing
mergeSchema=True #but it doesn't improve the results
To manually cast columns as float, I used
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
This runs without error, but it converts all of the genuinely string column values to null. I can't wrap it in a try/except block as it's not throwing any error.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
Parquet columns are typed, so there is no such thing as schema inference when loading Parquet files.
Is there a way where i can check whether the columns contains only 'integer/ float' values and selectively cast those columns to float?
You can use the same logic as Spark does: define a preferred type hierarchy and attempt to cast until you find the most selective type that parses all values in the column.
How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
Spark data type guesser UDAF
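A minimal sketch of that cast-and-check idea in PySpark (the helper name and df_fixed are illustrative, not an established API):

from pyspark.sql.functions import col

def all_values_castable_to_float(df, column):
    # Treat a column as numeric only if casting to float never turns a
    # non-null value into null (i.e. nothing fails to parse).
    failed = df.filter(col(column).isNotNull() & col(column).cast("float").isNull()).count()
    return failed == 0

# Cast only the columns whose values all parse as floats; leave the rest untouched.
float_cols = [c for c in df_temp.columns if all_values_castable_to_float(df_temp, c)]
df_fixed = df_temp.select(
    *[col(c).cast("float").alias(c) if c in float_cols else col(c) for c in df_temp.columns]
)

Note that this runs one Spark job per column; with 400+ columns you would probably want to fold all the checks into a single aggregation pass instead.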
There's no easy way currently; there's an existing GitHub issue that can be referred to:
https://github.com/databricks/spark-csv/issues/264
Something like https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala exists for Scala; the same could be created for PySpark.
I'm using Cognos Framework Manager and I'm creating a Data Item for a dynamic sort. I'm creating the Data Item using a CASE WHEN; here's my sample code:
CASE #prompt('SortOrder', 'string')#
WHEN 'Date' THEN <Date Column>
WHEN 'ID' THEN <String Column>
END
I'm getting this error: QE-DEF-0405 Incompatible data types in case statement. Although I can cast the date column into a string, wouldn't that make the sort go wrong for the 'Date' option? Should I cast the date column in a different way, cast the whole CASE, or am I barking up the wrong tree? In line with my question, is there a general rule to follow when creating dynamic columns via CASE with multiple column data types?
A column in Framework Manager must have a data type, and only one data type.
So you need to cast your date column to a string that still sorts correctly, e.g. in 'yyyy-mm-dd' format.
You are using two different data formats, so in the prompt function use token instead of string: #prompt('SortOrder', 'token')#
I have an XML schema with some data. I need to convert this schema to a flat file AND add a constant header, which is given separately as a string.
I have 2 possible solutions:
Since the header values are fixed and occur only once, I will create a separate record for the header.
In this case I will have 2 record levels: 1. HeaderTitles and 2. Records. So I use the HeaderTitles record as a filter.
We can create 2 schemas:
(1) Header - This will have one string element type, "Name Age Country". (This is the column header)
(2) Body - This will be the actual data records. This will have 3 elements, the name, age & the country as repeating records.
In the pipeline assembler, there is a property where we can decide whether we want to include the header info or not in the final message. We can just disable this.
Can I do this in some other way?
I would recommend Option 1, where you have the header in the flat file schema and either specify default values in the schema or set them in the map; in my opinion that is the best, easiest and correct approach.
The only time I would use Option 2 is if you had the flat file incoming and needed to disassemble it, and actually needed to debatch the record lines into separate messages, which you would do by defining the Body record as occurs 1.