AnalysisException: CSV data source does not support array<struct<

I'm at work and need help urgently, please.
I have a Parquet file that I need to convert to CSV. Could you please help me?
Error:
AnalysisException: CSV data source does not support array<struct<company:string,dateRange:string,description:string,location:string,title:string>> data type.
I have never worked with this format, so I can't even print the schema. Sorry.
printSchema output:
root
|-- _id: string (nullable = true)
|-- Locale: string (nullable = true)
|-- workExperience: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- company: string (nullable = true)
| | |-- dateRange: string (nullable = true)
| | |-- description: string (nullable = true)
| | |-- location: string (nullable = true)
| | |-- title: string (nullable = true)

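The Parquet schema can be flattened using explode. Note that explode drops rows whose array is null or empty; explode_outer keeps them.
from pyspark.sql import functions as F

df = spark.read.parquet(...)
# explode() emits one row per array element; "tmp.*" then expands the struct fields into columns
flattened_df = df.withColumn("tmp", F.explode("workExperience")) \
.selectExpr("_id", "Locale", "tmp.*")
flattened_df.write.csv(...)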

You can't save a DataFrame that contains a column of array or struct type to CSV. You need to cast the column to string before writing.
from pyspark.sql.functions import col
df.withColumn('workExperience', col('workExperience').cast('string')).write.csv('path')
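
To see how the two answers differ in the resulting file, here is a Spark-free sketch in plain Python. The sample record and the helper names (explode_rows, stringify_row, to_csv) are made up for illustration, and json.dumps only stands in for the string cast; Spark's own cast output is formatted differently.

```python
import csv
import io
import json

# Hypothetical sample record mirroring the schema in the question.
record = {
    "_id": "1",
    "Locale": "en_US",
    "workExperience": [
        {"company": "A", "dateRange": "2010-2012", "description": "d1",
         "location": "X", "title": "Dev"},
        {"company": "B", "dateRange": "2012-2015", "description": "d2",
         "location": "Y", "title": "Lead"},
    ],
}


def explode_rows(rec):
    """One flat row per array element, like explode followed by "tmp.*"."""
    return [
        {"_id": rec["_id"], "Locale": rec["Locale"], **item}
        for item in rec["workExperience"] or []
    ]


def stringify_row(rec):
    """One row per record, with the whole array serialized into one string."""
    return {
        "_id": rec["_id"],
        "Locale": rec["Locale"],
        "workExperience": json.dumps(rec["workExperience"]),
    }


def to_csv(rows):
    """Render a non-empty list of dicts as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


exploded = explode_rows(record)       # two rows, seven scalar columns
stringified = [stringify_row(record)]  # one row, three columns

print(to_csv(exploded))
print(to_csv(stringified))
```

Exploding repeats _id and Locale on every row but keeps each field in its own column; stringifying keeps one row per record but leaves the nested data to be parsed again by whoever reads the CSV. In Spark, F.to_json("workExperience") is another way to get a parseable string for the second variant.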

Related

Pyspark structured streaming - Union data from 2 nested JSON

I have 2 kafka streaming dataframes. The spark schema looks like this:
root
|-- key: string (nullable = true)
|-- pmudata1: struct (nullable = true)
| |-- pmu_id: byte (nullable = true)
| |-- time: timestamp (nullable = true)
| |-- stream_id: byte (nullable = true)
| |-- stat: string (nullable = true)
and
root
|-- key: string (nullable = true)
|-- pmudata2: struct (nullable = true)
| |-- pmu_id: byte (nullable = true)
| |-- time: timestamp (nullable = true)
| |-- stream_id: byte (nullable = true)
| |-- stat: string (nullable = true)
How can I union all rows from both streams as they arrive within a specific batch window? The column positions are the same in both streams.
Each stream has a different pmu_id value, so I can tell records apart by that value.
unionByName or union produces a stream from only a single dataframe.
I guess I would need to explode the columns, something like this, but this is for Scala.
Is there a way to automatically explode the whole JSON into columns and union them?

You can use the explode function only with array and map types. In your case, the column pmudata2 is a StructType, so simply use the star (*) to select all sub-fields, like this:
df1 = df.selectExpr("key", "pmudata2.*")
#root
#|-- key: string (nullable = true)
#|-- pmu_id: byte (nullable = true)
#|-- time: timestamp (nullable = true)
#|-- stream_id: byte (nullable = true)
#|-- stat: string (nullable = true)

Way to concatenate Array of structs

I have a column that contains array of structs. It looks like this:
|-- Network: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Code: string (nullable = true)
| | |-- Signal: string (nullable = true)
This is just a small sample, there are many more columns inside the struct than this. Is there a way to take the arrays in the column for each row, concatenate them and make them into one string? For example, we could have something like this:
[["example", 2], ["example2", 3]]
Is there a way to make into:
"example2example3"?

Assuming a dataframe df with the following schema:
df.printSchema
df with sample data:
df.show(false)
You first need to explode the Network array so that the struct fields Code and Signal can be selected.
var myDf = df.select(explode($"Network").as("Network"))
Then concatenate the two fields with the concat() function and pass the result to collect_list(), which aggregates all rows into a single row of type array<string>.
myDf = myDf.select(collect_list(concat($"Network.code",$"Network.signal")).as("data"))
Finally, join the array into the required format with concat_ws(), which takes two arguments: the separator to place between strings, and a column of type array<string>, here the output of the previous step. Your use case needs no separator between the concatenated strings, so the separator argument is an empty string.
myDf = myDf.select(concat_ws("",$"data").as("data"))
All the above steps can be done in one line, starting from the original df (show returns Unit, so its result is not assigned):
df.select(explode($"Network").as("Network")).select(concat_ws("",collect_list(concat($"Network.code",$"Network.signal"))).as("data")).show(false)
If you want the output directly into a String variable then use:
val myStr = myDf.first.get(0).toString
print(myStr)
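
For readers following along in Python rather than Scala, the same three steps can be traced on plain lists. This is only a sketch of the logic with made-up sample data (the rows list is hypothetical), not Spark code.

```python
# Hypothetical sample: two DataFrame rows, each holding an array of
# (Code, Signal) structs in its Network column.
rows = [
    [{"Code": "example", "Signal": "2"}, {"Code": "example2", "Signal": "3"}],
    [{"Code": "x", "Signal": "9"}],
]

# Step 1 (explode): one entry per struct, across all rows.
exploded = [element for network in rows for element in network]

# Step 2 (concat + collect_list): join the two fields of every struct
# and gather the pieces into a single list.
pieces = [e["Code"] + e["Signal"] for e in exploded]

# Step 3 (concat_ws with an empty separator): glue the list together.
data = "".join(pieces)
print(data)

# Per-row variant: keep one string per original row instead.
per_row = ["".join(e["Code"] + e["Signal"] for e in network) for network in rows]
print(per_row)
```

Because collect_list is used without a groupBy, the steps above produce one string for the whole DataFrame (data in the sketch). If one string per original row is wanted, as in per_row, the aggregation has to be grouped by a row key or done on the array directly.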

There is a library called spark-hats (GitHub, small article) that you might find very useful in these situations.
With it, you can easily map over the array and output the concatenation next to the elements, or somewhere else if you provide a fully qualified name.
Setup
import org.apache.spark.sql.functions._
import za.co.absa.spark.hats.Extensions._
scala> df.printSchema
root
|-- info: struct (nullable = true)
| |-- drivers: struct (nullable = true)
| | |-- carName: string (nullable = true)
| | |-- carNumbers: string (nullable = true)
| | |-- driver: string (nullable = true)
|-- teamName: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- team1: string (nullable = true)
| | |-- team2: string (nullable = true)
scala> df.show(false)
+---------------------------+------------------------------+
|info |teamName |
+---------------------------+------------------------------+
|[[RB7, 33, Max Verstappen]]|[[Redbull, rb], [Monster, mt]]|
+---------------------------+------------------------------+
Command you are looking for
scala> val dfOut = df.nestedMapColumn(inputColumnName = "teamName", outputColumnName = "nextElementInArray", expression = a => concat(a.getField("team1"), a.getField("team2")) )
dfOut: org.apache.spark.sql.DataFrame = [info: struct<drivers: struct<carName: string, carNumbers: string ... 1 more field>>, teamName: array<struct<team1:string,team2:string,nextElementInArray:string>>]
Output
scala> dfOut.printSchema
root
|-- info: struct (nullable = true)
| |-- drivers: struct (nullable = true)
| | |-- carName: string (nullable = true)
| | |-- carNumbers: string (nullable = true)
| | |-- driver: string (nullable = true)
|-- teamName: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- team1: string (nullable = true)
| | |-- team2: string (nullable = true)
| | |-- nextElementInArray: string (nullable = true)
scala> dfOut.show(false)
+---------------------------+----------------------------------------------------+
|info |teamName |
+---------------------------+----------------------------------------------------+
|[[RB7, 33, Max Verstappen]]|[[Redbull, rb, Redbullrb], [Monster, mt, Monstermt]]|
+---------------------------+----------------------------------------------------+

How to add an optional column inside struct field with pyspark

I created a struct field this way:
df = df.withColumn('my_struct', struct(
col('id').alias('id_test'),
col('value').alias('value_test')
).alias('my_struct'))
The thing is that I now need to add an extra field called "optional" to my_struct. The field must be present when it exists and removed when it doesn't. Sadly, values like null/None are not an option.
So far I have two dataframes: one with the desired value and the id column, and another without the value/column but with all the other information.
df_optional = df_optional.select('id','optional')
df = df.select('id','value','my_struct')
I want to add the optional value into df.my_struct where df_optional.id matches df.id, and keep the remaining rows as they are.
So far I have this:
df_with_option = df.join(df_optional,on=['id'],how='inner') \
.withColumn('my_struct', struct(
col('id').alias('id_test'),
col('value').alias('value_test'),
col('optional').alias('optional')
).alias('my_struct')).drop('optional')
df_without = df.join(df_optional,on=['id'],how='leftanti') # it already has my_struct
But a union requires both dataframes to have the same columns, so my code breaks.
df_result = df_without.unionByName(df_with_option)
I want to union both dataframes because at the end I write a json file partitioned by id:
df_result.repartitionByRange(df_result.count(),df['id']).write.format('json').mode('overwrite').save('my_path')
Those JSON files should include the 'optional' column when it has a value; otherwise it should be left out of the schema.
Any help will be appreciated.
--ADDITIONAL INFO.
Schema input:
df_optional
|-- id: string (nullable = true)
|-- optional: string (nullable = true)
df
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: string (nullable = true)
Schema output:
df_result
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: string (nullable = true)
| |-- optional: string (nullable = true) (*)
(*) Only when it exists.
--UPDATE
I think it's just not possible that way. I probably need to keep both dataframes apart and write them separately, something like this:
df_without.repartitionByRange(df_result.count(),df['id']).write.format('json').mode('overwrite').save('my_path')
df_with_option.repartitionByRange(df_result.count(),df['id']).write.format('json').mode('append').save('my_path')
Then I would have both sets of files in my path, each with its own schema.

Pyspark issue loading xml files with com.databricks:spark-xml

I'm trying to get an academic POC working that relies on pyspark with com.databricks:spark-xml. The goal is to load the Stack Exchange Data Dump XML format (https://archive.org/details/stackexchange) into a pyspark df.
It works like a charm with correctly formatted XML with proper tags, but fails with the Stack Exchange dump, which looks as follows:
<users>
<row Id="-1" Reputation="1" CreationDate="2014-07-30T18:05:25.020" DisplayName="Community" LastAccessDate="2014-07-30T18:05:25.020" Location="on the server farm" AboutMe=" I feel pretty, Oh, so pretty" Views="0" UpVotes="26" DownVotes="701" AccountId="-1" />
</users>
Depending on the root tag and row tag, I'm getting either an empty schema or... something:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.xml').option("rowTag", "users").load('./tmp/test/Users.xml')
df.printSchema()
df.show()
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _AboutMe: string (nullable = true)
| | |-- _AccountId: long (nullable = true)
| | |-- _CreationDate: string (nullable = true)
| | |-- _DisplayName: string (nullable = true)
| | |-- _DownVotes: long (nullable = true)
| | |-- _Id: long (nullable = true)
| | |-- _LastAccessDate: string (nullable = true)
| | |-- _Location: string (nullable = true)
| | |-- _ProfileImageUrl: string (nullable = true)
| | |-- _Reputation: long (nullable = true)
| | |-- _UpVotes: long (nullable = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _Views: long (nullable = true)
| | |-- _WebsiteUrl: string (nullable = true)
+--------------------+
| row|
+--------------------+
|[[Hi, I'm not ......|
+--------------------+
Spark : 1.6.0
Python : 2.7.15
Com.databricks : spark-xml_2.10:0.4.1
I would be extremely grateful for any advice.
Kind Regards,
P.

I tried the same method (spark-xml on Stack Overflow dump files) some time ago and failed, mostly because the DF is seen as an array of structs and the processing performance was really bad. Instead, I recommend using the standard text reader and mapping the Key="Value" pairs in every line with a UDF like this:
import re

# Each attribute is preceded by a space: capture the name and the quoted value.
pattern = re.compile(' ([A-Za-z]+)="([^"]*)"')
parse_line = lambda line: {key:value for key,value in pattern.findall(line)}
You can also use my code to get the proper data types: https://github.com/szczeles/pyspark-notebooks/blob/master/stackoverflow/stackexchange-convert.ipynb (the schema matches dumps for March 2017).
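
A quick way to check the pattern outside Spark is to run it on a shortened version of the row from the question. The sketch below is plain Python; INT_FIELDS and the html.unescape step are my own assumptions for illustration (attribute values in an XML dump are entity-escaped), not part of the linked notebook.

```python
import html
import re

# Each attribute is preceded by a space: capture the name and the quoted value.
pattern = re.compile(r' ([A-Za-z]+)="([^"]*)"')


def parse_line(line):
    """Return the attributes of one <row ... /> line as a dict of strings."""
    return {key: html.unescape(value) for key, value in pattern.findall(line)}


# Shortened version of the sample row from the question.
sample = ('<row Id="-1" Reputation="1" DisplayName="Community" '
          'Location="on the server farm" AboutMe="&lt;p&gt;Hi&lt;/p&gt;" Views="0" />')

parsed = parse_line(sample)
print(parsed)

# Hypothetical subset of numeric columns; everything else stays a string.
INT_FIELDS = {"Id", "Reputation", "Views"}
typed = {k: int(v) if k in INT_FIELDS else v for k, v in parsed.items()}
print(typed)

# Wrapper lines such as <users> carry no attributes and yield an empty dict.
print(parse_line("<users>"))
```

In Spark, this function would be applied to each line of a plain text read, and lines that yield an empty dict (the XML declaration and the opening and closing root tags) can be filtered out before building the DataFrame.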

Spark LuceneRDD for JSON data

Can we use LuceneRDD to index JSON data? I tried to index JSON-format data using LuceneRDD, but it doesn't show the correct result.
Code:
read.filter($"influencer" === "markpantoni").show(truncate = false)
val luceneRDD = LuceneRDD(read)
val influencerName = "markpantoni"
val result= luceneRDD.termQuery("influencer", "markpantoni",1)
result.take(1).foreach(println)
read dataframe schema:
root
|-- influencer: string (nullable = true)
|-- matches: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- influencer: string (nullable = true)
| | |-- totalNumberOfOverlaps: string (nullable = true)
Result:
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|influencer |matches |
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|markpantoni|[[chefsymon,4], [TheSchott,3], [RyanJohansen19,2], [builtincbus,1], [AAAOhio,1], [RMHCofCentralOH,1], [NASA,1], [CityScene,1], [daytonpulse,1], [wexarts,1]]|
+-----------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
[score: 5.685845/docId: 1/doc: Text fields:influencer:[markpantoni]]
[score: 5.685845/docId: 1/doc: Text fields:influencer:[markpantoni]]
