How to prevent a Dataset from renaming columns to value while mapping? - apache-spark

While mapping a Dataset I keep having the problem that columns are being renamed from _1, _2, etc. to value, value.
What is causing the rename?

That's because map on a Dataset causes the query to be serialized and deserialized in Spark.
To serialize it, Spark must know the Encoder. That's why there is an object ExpressionEncoder with an apply method. Its Scaladoc says:
A factory for constructing encoders that convert objects and primitives to and from the internal row format using catalyst expressions and code generation. By default, the expressions used to retrieve values from an input row when producing an object will be created as follows:
- Classes will have their sub fields extracted by name using [[UnresolvedAttribute]] expressions and [[UnresolvedExtractValue]] expressions.
- Tuples will have their subfields extracted by position using [[BoundReference]] expressions.
- Primitives will have their values extracted from the first ordinal with a schema that defaults to the name `value`.
Please look at the last point. Your query just maps to a primitive, so Catalyst uses the name "value".
If you add .select('value.as("MyPropertyName")).as[CaseClass], the field names will be correct.
Types that will have column name "value":
Option(_)
Array
Collection types like Seq, Map
types like String, Timestamp, Date, BigDecimal
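To make this concrete, here is a minimal sketch (the case classes, the local master and the column name MyPropertyName are illustrative, not taken from the question) showing the column collapsing to value after a map and the select/as fix described above:
import org.apache.spark.sql.SparkSession

case class Person(name: String)
case class Name(MyPropertyName: String)

object ValueColumnExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("value-column").master("local[*]").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("John"), Person("Jane")).toDS()

    // map to a primitive (String): Catalyst falls back to the column name "value"
    val names = people.map(_.name)
    names.printSchema()   // |-- value: string

    // rename the column, then re-apply a typed view to restore the field name
    val renamed = names.select($"value".as("MyPropertyName")).as[Name]
    renamed.printSchema() // |-- MyPropertyName: string

    spark.stop()
  }
}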

Related

Azure Cognitive Search filter fields having mixed datatype

I have created a field (named 'value') in my Azure Cognitive Search Index which may have values of different data types (for example, string, string array, object array). While creating the index, I configured the type of this value field as "Edm.String", due to which the data in my index is stored as:
For string fields: "value": "value1"
For string arrays: "value": "["value1","value2"]"
For object arrays: "value": "[ {"key1":"value1"},{"key2":"value2"}]"
Basically, my complex fields are stored as strings because I have defined them as "Edm.String", so filters do not work properly on them.
For example, if I try to filter data where "key1":"value1" (in point 3), nothing matches because the actual stored value is the string "[ {"key1":"value1"},{"key2":"value2"}]".
Can anyone please guide on how to proceed in this case?
Note: I cannot make the value field of type "Collection(Edm.ComplexType)" because the values are in string format and the indexer fails in this case. Also, I cannot modify the way the database is structured.
There is a function called search.in that we can use to apply a filter on a collection created using Edm.String. Various other functions are mentioned here, which covers $filter and other operations on Edm.String.
We can use the syntax below for filtering:
keyphrases/any(t: search.in(t, 'database'))

Cassandra 2.2.11 add new map column from text column

Let's say I have a table with 2 columns:
primary key: id - type varchar
and non-primary key: data - type text
The data column consists only of JSON values, for example:
{
"name":"John",
"age":30
}
I know that I cannot alter this column to a map type, but maybe I can add a new map column with values from the data column, or maybe you have some other idea?
What can I do about it? I want to get a map column in this table with values from data.
You might want to make use of the CQL COPY command to export all your data to a CSV file.
Then alter your table and create a new column of type map.
Convert the exported data to another file containing UPDATE statements where you only update the newly created column with values converted from JSON to a map. For conversion use a tool or language of your choice (be it bash, python, perl or whatever).
By the way, be aware that with a map you specify the data type of the map's key and the data type of the map's value, so you will most probably be limited to strings if you want to stay generic, i.e. a map<text, text>. Consider whether this is appropriate for your use case.
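As a rough sketch of the conversion step, assuming the JSON objects are flat and using placeholder names (export.csv for the COPY output, my_table for the table, data_map for the new map<text, text> column), a small Scala program could emit the UPDATE statements like this:
import scala.io.Source

object JsonToMapUpdates {
  // tiny parser for flat JSON objects such as {"name":"John","age":30}
  private val pair = """"([^"]+)"\s*:\s*(?:"([^"]*)"|([^,}\s]+))""".r

  // turn a flat JSON object into a CQL map<text, text> literal, e.g. {'name': 'John', 'age': '30'}
  def jsonToCqlMap(json: String): String =
    pair.findAllMatchIn(json)
      .map(m => s"'${m.group(1)}': '${Option(m.group(2)).getOrElse(m.group(3))}'")
      .mkString("{", ", ", "}")

  def main(args: Array[String]): Unit = {
    // each exported line is expected to look like: id,"{""name"":""John"",""age"":30}"
    for (line <- Source.fromFile("export.csv").getLines()) {
      val Array(id, rawJson) = line.split(",", 2)
      val json = rawJson.stripPrefix("\"").stripSuffix("\"").replace("\"\"", "\"")
      println(s"UPDATE my_table SET data_map = ${jsonToCqlMap(json)} WHERE id = '$id';")
    }
  }
}
Values containing commas or quotes would need proper CSV parsing and CQL escaping; this only shows the shape of the generated statements.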

BigQuery record of repeated fields

I have a BigQuery table with (what is conceptually) a field containing a repeated record.
However, this field is stored as a record of repeated fields. This is either caused by the export from AppEngine DataStore (using Mache), or by the representation of the data (using Objectify 3); I don't know.
So what I have is a field (exercises) that looks like this:
exercises RECORD NULLABLE exercises
exercises.id INTEGER REPEATED id
exercises.weight FLOAT REPEATED weight
exercises.duration STRING REPEATED duration
instead of
exercises RECORD REPEATED exercises
exercises.id INTEGER NULLABLE id
exercises.weight FLOAT NULLABLE weight
exercises.duration STRING NULLABLE duration
The latter can be queried easily using FLATTEN (legacy SQL) or UNNEST (standard SQL). However, with the schema I have now, I seem to be stuck.
I guess I would have to transpose the exercises field in some way, from a record of arrays to an array of records.
The sub-fields of exercises always have the same length, so that should not be a problem.
How can I query and select this field?
I have tried UNNEST WITH OFFSET, as suggested here:
SELECT
exerciseId, exofs,
exercises.weight[OFFSET(exofs)] AS exerciseWeight,
exercises.duration[OFFSET(exofs)] AS exerciseDuration
FROM Session, UNNEST(exercises.id) AS exerciseId WITH OFFSET exofs
This works! This feature is only available in standard SQL. The FLATTEN in legacy SQL does not support WITH OFFSET.

Auto infer schema from parquet/ selectively convert string to float

I have a parquet file with 400+ columns; when I read it, the default datatype attached to a lot of the columns is String (maybe due to the schema specified by someone else).
I was not able to find a parameter similar to
inferSchema=True  # present for spark.read.csv, but not for spark.read.parquet
I tried setting
mergeSchema=True  # but it doesn't improve the results
To manually cast columns as float, I used
df_temp.select(*(col(c).cast("float").alias(c) for c in df_temp.columns))
This runs without error, but converts all the actual string column values to null. I can't wrap this in a try/catch block as it's not throwing any error.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
Parquet columns are typed, so there is no such thing as schema inference when loading Parquet files.
Is there a way I can check whether a column contains only integer/float values and selectively cast those columns to float?
You can use the same logic as Spark: define a preferred type hierarchy and attempt to cast until you find the most selective type that parses all values in the column.
How to force inferSchema for CSV to consider integers as dates (with "dateFormat" option)?
Spark data type guesser UDAF
There's no easy way currently. There is an existing GitHub issue that can be referred to:
https://github.com/databricks/spark-csv/issues/264
Something like https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala exists for Scala; the same could be created for PySpark.
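A minimal Scala sketch of that cast-and-check idea (the question uses PySpark, but the logic is the same; the input path and SparkSession setup are placeholders): cast a string column to float only when the cast loses no non-null values.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count}
import org.apache.spark.sql.types.StringType

object SelectiveFloatCast {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("selective-cast").master("local[*]").getOrCreate()
    val df = spark.read.parquet("/path/to/input.parquet")

    // string columns where every non-null value parses as a number
    val castable = df.schema.fields
      .filter(_.dataType == StringType)
      .map(_.name)
      .filter { c =>
        val counts = df.agg(count(col(c)), count(col(c).cast("float"))).head()
        counts.getLong(0) == counts.getLong(1)
      }
      .toSet

    // cast only those columns, leave genuinely textual columns untouched
    val result = df.select(df.columns.map { c =>
      if (castable(c)) col(c).cast("float").alias(c) else col(c)
    }: _*)

    result.printSchema()
    spark.stop()
  }
}
With 400+ columns you would probably want to compute all the counts in a single agg pass rather than one Spark job per column; the per-column version above is just easier to read.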

Dynamic Query Item Used for Sorting

I'm using Cognos Framework Manager and I'm creating a Data Item for a dynamic sort. I'm creating the Data Item using a CASE WHEN; here's my sample code:
CASE #prompt('SortOrder', 'string')#
WHEN 'Date' THEN <Date Column>
WHEN 'ID' THEN <String Column>
END
I'm getting this error: QE-DEF-0405 Incompatible data types in case statement. Although I can cast the date column into a string, wouldn't that make the sort go wrong for the 'Date' option? Should I cast the date column in a different way, cast the whole CASE, or am I barking up the wrong tree? In line with my question, should there be a general rule when creating dynamic columns via CASE with multiple column data types?
A column in Framework Manager must have a datatype, and only one datatype.
So you need to cast your date column to a correctly sortable string,
e.g. the 'yyyy-mm-dd' format.
You are using two different data types, so in the prompt function use token instead of string: #prompt('SortOrder', 'token')#
